develooper Front page | perl.perl5.porters | Postings from August 2010

Re: [PATCH] Re: [perl #19078] wrong match order inside replacement

Thread Previous
August 20, 2010 08:14
Re: [PATCH] Re: [perl #19078] wrong match order inside replacement
Message ID:
On 15 August 2010 22:24, Father Chrysostomos <> wrote:
> On Aug 10, 2010, at 12:57 AM, demerphq wrote:
>> On 8 August 2010 21:24, Father Chrysostomos <> wrote:
>>> On Aug 4, 2010, at 7:27 AM, demerphq wrote:
>>>> On 4 August 2010 15:47, karl williamson <> wrote:
>>>>> demerphq wrote:
>>>>>> On 1 August 2010 21:19, Father Chrysostomos <> wrote:
>>>>>>> A slightly simpler case:
>>>>>>> $ perl -le '$_="CCCGGG"; s!.!@a{print("[$&]"),/./}!g'
>>>>>>> [C]
>>>>>>> [C]
>>>>>>> [C]
>>>>>>> [C]
>>>>>>> [C]
>>>>>>> [C]
>>>>>>> What’s happening is that the s/// does not reset PL_curpm for each
>>>>>>> iteration, because it doesn’t usually have to.
>>>>>>> The RHS’s scoping takes care of it most of the time. This happens with
>>>>>>> the /e modifier and with @{...}.
>>>>>>> In this example, though, we have a subscript, not a block. This subscript
>>>>>>> is in the same scope as the s/// itself.
>>>>>>> The assumption that the substitution operator will never have to reset
>>>>>>> PL_curpm itself appears to be incorrect. The attached patch fixes it.
>>>>>> Hi, thanks for the patch. Im a little concerned that this might have
>>>>>> deeper consequences than at first meet the eye, could we hold off on
>>>>>> this one until we know more?
>>>>> How will we find out more?
>>>> The problem is that there is a lot of code around that tries to deal
>>>> with various logical inconsistancies related to qr// and PL_curpm, and
>>>> the code is not obvious or intuitive.
>>>> So, for instance I am wondering what happens with this patch and
>>>> things like recursive patterns on the LHS, or things like that.
>>>> PL_curpm and everything related to it really deserves some deep
>>>> analysis and some love.
>>> Could you explain to me what the two PL_curpm assignments in regexec.c are for? One is in S_regtry; the other in restore_pos. When are those two functions called?
>>> (All other assignments to PL_curpm look fine to me and do not conflict with the patch.)
>> regtry() is a setup wrapper around regmatch(). Essentailly matching
>> consists of calling regtry at each successive character point in the
>> string until we get a match, possibly with the start points filtered
>> out by logic in find_byclass. It handles the book keeping tasks such
>> as setting up the capture buffer and things like that.
>> restore_pos() is called prior to exiting p_regexec().
> I’ve had a look, and I see that S_regtry assigns PL_curpm to PL_reg_oldcurpm before clobbering the former.
> restore_pos restores to PL_curpm the value saved in PL_reg_oldcurpm.
> That all happens within the call to pregexec, so the value of PL_curpm should always be the same after pregexec as it was before.
>> To explain a bit....
>> Originally regexes were basically stored entirely in the opcode, there
>> were no recursive patterns and there were no qr objects. The regex
>> struct included allocation for the capture buffer offsets  PL_curpm
>> was the most recent regex op that succeeded. Since the capture buffer
>> offsets were effectively stored in the opcode this all made sense.
> That would explain the scoping bug.
> $u = ",echle etn sJ";
> $t = "\nrka rPrhoatu";
> $_ = $u.$t;
> sub foo { s/(.)//s or return; bar(); print chop $$1 }
> sub bar { s/(.)//s or return; foo(); print chop $$1 }
> foo
>> At some point in the evolution of the regex engine we ended up with
>> both qr// objects (so now regex objects arent so strongly associated
>> with PMOP's) AND recursive patterns of the form (??{ $qr_object }).
>> This is where things get hairy, because this means that a regex object
>> can be PL_curpm in two different scopes but /against different
>> strings/. Yet each regex object has only one capture buffer associated
>> with it. Now, back in the day some of this was solved by some gnarly
>> stuff on the stack, some of it was solved with a hack I did called a
>> swap buffer (this is/was used to prevent clobbering the results of a
>> successful match when attempting a new match), and some of it was
>> solved with the idea of a "lite copy" of regex (see the mother_re
>> field etc) which effectively provides a new capture buffer with the
>> results.
> I know about the light copies, since I hacked around them to fix bug #70764. :-)
>> Anyway, the reason why I asked to hold off on this, and I apologize
>> for taking so long to explain, is that the above is (obviously) a big
>> mess. The basic design is arguably broken at multiple levels, and I
>> fear that even slight changes to how it works can have deep and
>> surprising effects. We need to be very careful here, and we also need
>> to really consider somebody looking into it all quite carefully.
>> Currently we have to consider that match results, regex objects, and
>> PMOP's are all independent objects, yet they started off being one
>> object and we should consider redoing how we manage them so that it is
>> simpler and provably correct.
> Even if this is cleaned up (I’d like to do it, but not any time soon, as I need more time to digest the source code), I think my patch to pp_substcont would still apply, as pp_substcont is effectively ‘outside’ the regexp engine. In fact, if PL_curpm is replaced with some other global, it will still need to be set in pp_substcont, so in a sense my patch marks the right spot for future reference.
> I’ve not been able to come up with a test case that my patch breaks.

Well if you have reviewed the source aware of my general concerns and
concluded its safe then im satisfied for sure.


perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About