Front page | perl.perl5.porters |
Postings from August 2010
Re: [PATCH] Re: [perl #19078] wrong match order inside replacement
Thread Previous
|
Thread Next
From:
Father Chrysostomos
Date:
August 15, 2010 13:25
Subject:
Re: [PATCH] Re: [perl #19078] wrong match order inside replacement
Message ID:
EC29A4BC-B1A5-44EA-9FAF-0498D0D8E808@cpan.org
On Aug 10, 2010, at 12:57 AM, demerphq wrote:
> On 8 August 2010 21:24, Father Chrysostomos <sprout@cpan.org> wrote:
>>
>> On Aug 4, 2010, at 7:27 AM, demerphq wrote:
>>
>>> On 4 August 2010 15:47, karl williamson <public@khwilliamson.com> wrote:
>>>> demerphq wrote:
>>>>>
>>>>> On 1 August 2010 21:19, Father Chrysostomos <sprout@cpan.org> wrote:
>>>>>>
>>>>>> A slightly simpler case:
>>>>>>
>>>>>> $ perl -le '$_="CCCGGG"; s!.!@a{print("[$&]"),/./}!g'
>>>>>> [C]
>>>>>> [C]
>>>>>> [C]
>>>>>> [C]
>>>>>> [C]
>>>>>> [C]
>>>>>>
>>>>>> What’s happening is that the s/// does not reset PL_curpm for each
>>>>>> iteration, because it doesn’t usually have to.
>>>>>>
>>>>>> The RHS’s scoping takes care of it most of the time. This happens with
>>>>>> the /e modifier and with @{...}.
>>>>>>
>>>>>> In this example, though, we have a subscript, not a block. This subscript
>>>>>> is in the same scope as the s/// itself.
>>>>>>
>>>>>> The assumption that the substitution operator will never have to reset
>>>>>> PL_curpm itself appears to be incorrect. The attached patch fixes it.
>>>>>
>>>>> Hi, thanks for the patch. Im a little concerned that this might have
>>>>> deeper consequences than at first meet the eye, could we hold off on
>>>>> this one until we know more?
>>>>
>>>> How will we find out more?
>>>
>>> The problem is that there is a lot of code around that tries to deal
>>> with various logical inconsistancies related to qr// and PL_curpm, and
>>> the code is not obvious or intuitive.
>>>
>>> So, for instance I am wondering what happens with this patch and
>>> things like recursive patterns on the LHS, or things like that.
>>>
>>> PL_curpm and everything related to it really deserves some deep
>>> analysis and some love.
>>
>> Could you explain to me what the two PL_curpm assignments in regexec.c are for? One is in S_regtry; the other in restore_pos. When are those two functions called?
>>
>> (All other assignments to PL_curpm look fine to me and do not conflict with the patch.)
>
> regtry() is a setup wrapper around regmatch(). Essentailly matching
> consists of calling regtry at each successive character point in the
> string until we get a match, possibly with the start points filtered
> out by logic in find_byclass. It handles the book keeping tasks such
> as setting up the capture buffer and things like that.
>
> restore_pos() is called prior to exiting p_regexec().
I’ve had a look, and I see that S_regtry assigns PL_curpm to PL_reg_oldcurpm before clobbering the former.
restore_pos restores to PL_curpm the value saved in PL_reg_oldcurpm.
That all happens within the call to pregexec, so the value of PL_curpm should always be the same after pregexec as it was before.
>
> To explain a bit....
>
> Originally regexes were basically stored entirely in the opcode, there
> were no recursive patterns and there were no qr objects. The regex
> struct included allocation for the capture buffer offsets PL_curpm
> was the most recent regex op that succeeded. Since the capture buffer
> offsets were effectively stored in the opcode this all made sense.
>
That would explain the scoping bug.
$u = ",echle etn sJ";
$t = "\nrka rPrhoatu";
$_ = $u.$t;
sub foo { s/(.)//s or return; bar(); print chop $$1 }
sub bar { s/(.)//s or return; foo(); print chop $$1 }
foo
> At some point in the evolution of the regex engine we ended up with
> both qr// objects (so now regex objects arent so strongly associated
> with PMOP's) AND recursive patterns of the form (??{ $qr_object }).
> This is where things get hairy, because this means that a regex object
> can be PL_curpm in two different scopes but /against different
> strings/. Yet each regex object has only one capture buffer associated
> with it. Now, back in the day some of this was solved by some gnarly
> stuff on the stack, some of it was solved with a hack I did called a
> swap buffer (this is/was used to prevent clobbering the results of a
> successful match when attempting a new match), and some of it was
> solved with the idea of a "lite copy" of regex (see the mother_re
> field etc) which effectively provides a new capture buffer with the
> results.
>
I know about the light copies, since I hacked around them to fix bug #70764. :-)
> Anyway, the reason why I asked to hold off on this, and I apologize
> for taking so long to explain, is that the above is (obviously) a big
> mess. The basic design is arguably broken at multiple levels, and I
> fear that even slight changes to how it works can have deep and
> surprising effects. We need to be very careful here, and we also need
> to really consider somebody looking into it all quite carefully.
> Currently we have to consider that match results, regex objects, and
> PMOP's are all independent objects, yet they started off being one
> object and we should consider redoing how we manage them so that it is
> simpler and provably correct.
>
Even if this is cleaned up (I’d like to do it, but not any time soon, as I need more time to digest the source code), I think my patch to pp_substcont would still apply, as pp_substcont is effectively ‘outside’ the regexp engine. In fact, if PL_curpm is replaced with some other global, it will still need to be set in pp_substcont, so in a sense my patch marks the right spot for future reference.
I’ve not been able to come up with a test case that my patch breaks.
Thread Previous
|
Thread Next