develooper Front page | perl.perl5.porters | Postings from August 2010

Re: [PATCH] Re: [perl #19078] wrong match order inside replacement

Thread Previous | Thread Next
From:
demerphq
Date:
August 10, 2010 00:57
Subject:
Re: [PATCH] Re: [perl #19078] wrong match order inside replacement
Message ID:
AANLkTimyKM2sOCPH1kGL+dSdgFeMxUOY1=v8DZUqBObW@mail.gmail.com
On 8 August 2010 21:24, Father Chrysostomos <sprout@cpan.org> wrote:
>
> On Aug 4, 2010, at 7:27 AM, demerphq wrote:
>
>> On 4 August 2010 15:47, karl williamson <public@khwilliamson.com> wrote:
>>> demerphq wrote:
>>>>
>>>> On 1 August 2010 21:19, Father Chrysostomos <sprout@cpan.org> wrote:
>>>>>
>>>>> A slightly simpler case:
>>>>>
>>>>> $ perl -le '$_="CCCGGG"; s!.!@a{print("[$&]"),/./}!g'
>>>>> [C]
>>>>> [C]
>>>>> [C]
>>>>> [C]
>>>>> [C]
>>>>> [C]
>>>>>
>>>>> What’s happening is that the s/// does not reset PL_curpm for each
>>>>> iteration, because it doesn’t usually have to.
>>>>>
>>>>> The RHS’s scoping takes care of it most of the time. This happens with
>>>>> the /e modifier and with @{...}.
>>>>>
>>>>> In this example, though, we have a subscript, not a block. This subscript
>>>>> is in the same scope as the s/// itself.
>>>>>
>>>>> The assumption that the substitution operator will never have to reset
>>>>> PL_curpm itself appears to be incorrect. The attached patch fixes it.
>>>>
>>>> Hi, thanks for the patch. Im a little concerned that this might have
>>>> deeper consequences than at first meet the eye, could we hold off on
>>>> this one until we know more?
>>>
>>> How will we find out more?
>>
>> The problem is that there is a lot of code around that tries to deal
>> with various logical inconsistancies related to qr// and PL_curpm, and
>> the code is not obvious or intuitive.
>>
>> So, for instance I am wondering what happens with this patch and
>> things like recursive patterns on the LHS, or things like that.
>>
>> PL_curpm and everything related to it really deserves some deep
>> analysis and some love.
>
> Could you explain to me what the two PL_curpm assignments in regexec.c are for? One is in S_regtry; the other in restore_pos. When are those two functions called?
>
> (All other assignments to PL_curpm look fine to me and do not conflict with the patch.)

regtry() is a setup wrapper around regmatch(). Essentailly matching
consists of calling regtry at each successive character point in the
string until we get a match, possibly with the start points filtered
out by logic in find_byclass. It handles the book keeping tasks such
as setting up the capture buffer and things like that.

restore_pos() is called prior to exiting p_regexec().

To explain a bit....

Originally regexes were basically stored entirely in the opcode, there
were no recursive patterns and there were no qr objects. The regex
struct included allocation for the capture buffer offsets  PL_curpm
was the most recent regex op that succeeded. Since the capture buffer
offsets were effectively stored in the opcode this all made sense.

At some point in the evolution of the regex engine we ended up with
both qr// objects (so now regex objects arent so strongly associated
with PMOP's) AND recursive patterns of the form (??{ $qr_object }).
This is where things get hairy, because this means that a regex object
can be PL_curpm in two different scopes but /against different
strings/. Yet each regex object has only one capture buffer associated
with it. Now, back in the day some of this was solved by some gnarly
stuff on the stack, some of it was solved with a hack I did called a
swap buffer (this is/was used to prevent clobbering the results of a
successful match when attempting a new match), and some of it was
solved with the idea of a "lite copy" of regex (see the mother_re
field etc) which effectively provides a new capture buffer with the
results.

Anyway, the reason why I asked to hold off on this, and I apologize
for taking so long to explain, is that the above is (obviously) a big
mess. The basic design is arguably broken at multiple levels, and I
fear that even slight changes to how it works can have deep and
surprising effects. We need to be very careful here, and we also need
to really consider somebody looking into it all quite carefully.
Currently we have to consider that match results, regex objects, and
PMOP's are all independent objects, yet they started off being one
object and we should consider redoing how we manage them so that it is
simpler and provably correct.

cheers,
Yves






-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About