develooper Front page | perl.perl5.porters | Postings from August 2010

Re: [PATCH] Re: [perl #19078] wrong match order inside replacement

Thread Previous | Thread Next
Father Chrysostomos
August 15, 2010 13:25
Re: [PATCH] Re: [perl #19078] wrong match order inside replacement
Message ID:

On Aug 10, 2010, at 12:57 AM, demerphq wrote:

> On 8 August 2010 21:24, Father Chrysostomos <> wrote:
>> On Aug 4, 2010, at 7:27 AM, demerphq wrote:
>>> On 4 August 2010 15:47, karl williamson <> wrote:
>>>> demerphq wrote:
>>>>> On 1 August 2010 21:19, Father Chrysostomos <> wrote:
>>>>>> A slightly simpler case:
>>>>>> $ perl -le '$_="CCCGGG"; s!.!@a{print("[$&]"),/./}!g'
>>>>>> [C]
>>>>>> [C]
>>>>>> [C]
>>>>>> [C]
>>>>>> [C]
>>>>>> [C]
>>>>>> What’s happening is that the s/// does not reset PL_curpm for each
>>>>>> iteration, because it doesn’t usually have to.
>>>>>> The RHS’s scoping takes care of it most of the time. This happens with
>>>>>> the /e modifier and with @{...}.
>>>>>> In this example, though, we have a subscript, not a block. This subscript
>>>>>> is in the same scope as the s/// itself.
>>>>>> The assumption that the substitution operator will never have to reset
>>>>>> PL_curpm itself appears to be incorrect. The attached patch fixes it.
>>>>> Hi, thanks for the patch. Im a little concerned that this might have
>>>>> deeper consequences than at first meet the eye, could we hold off on
>>>>> this one until we know more?
>>>> How will we find out more?
>>> The problem is that there is a lot of code around that tries to deal
>>> with various logical inconsistancies related to qr// and PL_curpm, and
>>> the code is not obvious or intuitive.
>>> So, for instance I am wondering what happens with this patch and
>>> things like recursive patterns on the LHS, or things like that.
>>> PL_curpm and everything related to it really deserves some deep
>>> analysis and some love.
>> Could you explain to me what the two PL_curpm assignments in regexec.c are for? One is in S_regtry; the other in restore_pos. When are those two functions called?
>> (All other assignments to PL_curpm look fine to me and do not conflict with the patch.)
> regtry() is a setup wrapper around regmatch(). Essentailly matching
> consists of calling regtry at each successive character point in the
> string until we get a match, possibly with the start points filtered
> out by logic in find_byclass. It handles the book keeping tasks such
> as setting up the capture buffer and things like that.
> restore_pos() is called prior to exiting p_regexec().

I’ve had a look, and I see that S_regtry assigns PL_curpm to PL_reg_oldcurpm before clobbering the former.

restore_pos restores to PL_curpm the value saved in PL_reg_oldcurpm.

That all happens within the call to pregexec, so the value of PL_curpm should always be the same after pregexec as it was before.

> To explain a bit....
> Originally regexes were basically stored entirely in the opcode, there
> were no recursive patterns and there were no qr objects. The regex
> struct included allocation for the capture buffer offsets  PL_curpm
> was the most recent regex op that succeeded. Since the capture buffer
> offsets were effectively stored in the opcode this all made sense.
That would explain the scoping bug.

$u = ",echle etn sJ";
$t = "\nrka rPrhoatu";
$_ = $u.$t;
sub foo { s/(.)//s or return; bar(); print chop $$1 }
sub bar { s/(.)//s or return; foo(); print chop $$1 }

> At some point in the evolution of the regex engine we ended up with
> both qr// objects (so now regex objects arent so strongly associated
> with PMOP's) AND recursive patterns of the form (??{ $qr_object }).
> This is where things get hairy, because this means that a regex object
> can be PL_curpm in two different scopes but /against different
> strings/. Yet each regex object has only one capture buffer associated
> with it. Now, back in the day some of this was solved by some gnarly
> stuff on the stack, some of it was solved with a hack I did called a
> swap buffer (this is/was used to prevent clobbering the results of a
> successful match when attempting a new match), and some of it was
> solved with the idea of a "lite copy" of regex (see the mother_re
> field etc) which effectively provides a new capture buffer with the
> results.
I know about the light copies, since I hacked around them to fix bug #70764. :-)

> Anyway, the reason why I asked to hold off on this, and I apologize
> for taking so long to explain, is that the above is (obviously) a big
> mess. The basic design is arguably broken at multiple levels, and I
> fear that even slight changes to how it works can have deep and
> surprising effects. We need to be very careful here, and we also need
> to really consider somebody looking into it all quite carefully.
> Currently we have to consider that match results, regex objects, and
> PMOP's are all independent objects, yet they started off being one
> object and we should consider redoing how we manage them so that it is
> simpler and provably correct.
Even if this is cleaned up (I’d like to do it, but not any time soon, as I need more time to digest the source code), I think my patch to pp_substcont would still apply, as pp_substcont is effectively ‘outside’ the regexp engine. In fact, if PL_curpm is replaced with some other global, it will still need to be set in pp_substcont, so in a sense my patch marks the right spot for future reference.

I’ve not been able to come up with a test case that my patch breaks.

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About