Front page | perl.perl5.porters |
Postings from November 2010
Re: [perl #68564] /g failure with zero-width patterns
From: demerphq
Date: November 3, 2010 02:41
Subject: Re: [perl #68564] /g failure with zero-width patterns
Message ID: AANLkTinRS1_wFJXQtjOnXT2vHEiRreQH5bJAapBO_Vts@mail.gmail.com
On 10 October 2010 23:27, Father Chrysostomos via RT
<perlbug-followup@perl.org> wrote:
> On Sun Aug 16 01:15:58 2009, ikegami@adaelis.com wrote:
>> A regression was introduced into 5.10.0 concerning /g and zero-width
>> patterns. The demo speaks for itself:
>>
>> >c:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
>> abc
>> bc
>> c
>>
>> >c:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
>> abc
>
> This was broken by commit 07be1b83a6b2d24b492356181ddf70e1c7917ae3,
> which extended stclass optimisations to (?=).
>
> I tried following the code paths for + and for {1,} (which are meant to
> be identical, but only {1,} was working). I noticed they diverged as a
> result of + having PREGf_SKIP set.
>
> So I fixed it by not setting PREGf_SKIP if the + is inside a (?=).
>
> I really don’t understand this code, and would much appreciate any
> feedback as to whether this fix (or ‘fix’) will break anything else.
I have looked into this more deeply and applied a modified version of
your patch, which IMO was more or less exactly correct.
Here is the commit message I wrote:
commit e7f38d0fe17e7a846c0ed55e71ebb120a336b887
Author: Yves Orton <demerphq@gmail.com>
Date: Wed Nov 3 10:23:00 2010 +0100
fix 68564: /g failure with zero-width patterns
This is based on a patch by Father Chrysostomos <sprout@cpan.org>
The start class optimisation has two modes, "try every valid start
position" (doevery) and "flip flop mode" (!doevery) where it tries
only the first valid start position in a sequence.
Consider /(\d+)X/ and the string "123456Y". We know that if we fail
to match X after matching "123456" then we will also fail to match after
"23456" (assuming no evil tricks are in place, which disable the
optimisation anyway), so we can skip forward until the start class
check /fails/ and only then start looking for a real match. This is
flip-flop mode.
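
The two modes can be sketched as a toy model. This is only an
illustration of the description above, not the code in regexec.c:
candidate_starts(), $in_cls and $armed are made-up names, and the model
assumes every attempted match fails, so it simply lists the offsets at
which a full match would be tried.

```perl
use strict;
use warnings;

# Hypothetical model, NOT the real regexec.c code: list the offsets at
# which a start-class scan would attempt a full match, assuming every
# attempt fails.
#   $str     - the target string
#   $in_cls  - code ref: true if a character is in the start class
#   $doevery - true: try every valid start; false: "flip-flop" mode
sub candidate_starts {
    my ($str, $in_cls, $doevery) = @_;
    my @starts;
    my $armed = 1;    # flip-flop state: may we try the next valid start?
    for my $i (0 .. length($str) - 1) {
        if ($in_cls->(substr($str, $i, 1))) {
            push @starts, $i if $doevery || $armed;
            $armed = 0;    # stay disarmed until the class check fails
        }
        else {
            $armed = 1;    # check failed: the next valid start is tried
        }
    }
    return @starts;
}

my $is_digit = sub { $_[0] =~ /\d/ };
my @every = candidate_starts("123456Y78", $is_digit, 1);
my @flip  = candidate_starts("123456Y78", $is_digit, 0);
print "doevery:   @every\n";    # 0 1 2 3 4 5 7 8
print "flip-flop: @flip\n";     # 0 7
```

In flip-flop mode only offset 0 is tried within "123456"; the scan is
re-armed by the non-digit "Y", so the next attempt is at offset 7.
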
Now consider the case with zero-width lookahead under /g: /(?=(\d+)X)/.
In this case we have an additional failure mode, that is failure when
we match a zero-width string twice at the same pos(). So now, the
"flip-flop" logic breaks as it /is/ possible that we could match at
"23456" when we couldn't match at "123456", because of the rule against
matching a zero-length string twice at the same pos(). For instance:
print $1 for "123"=~/(?=(\d+))/g
should first match "123". Since $& is zero length, pos() is not
incremented. We then match again, successfully, except that the match
is rejected despite technically succeeding because its $& is also zero
length and pos() has not advanced. If the flip-flop mode is enabled
we won't retry until we find a failing character first.
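
The behaviour described here can be observed with a short script. This
is a sketch only; the variable names are arbitrary, and the output noted
afterwards is what a perl containing this fix should give.

```perl
use strict;
use warnings;

# List context: collect $1 from every successful /g match.
my $s = "123";
my @captures = $s =~ /(?=(\d+))/g;
print "list:   @captures\n";

# Scalar context on a fresh string: step through the same matches and
# record pos() alongside $1. Each match is zero-length, so pos() only
# advances because a second empty match at the same pos() is rejected.
my $t = "123";
my @steps;
while ($t =~ /(?=(\d+))/g) {
    push @steps, pos($t) . ":$1";
}
print "scalar: @steps\n";
```

With the fix this should print "list:   123 23 3" and
"scalar: 0:123 1:23 2:3". On an affected perl, the report above
suggests the list would stop after the first capture.
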
The point here is that it makes perfect sense to disable the
"flip-flop" mode optimisation when the start class is inside
a lookahead as it really doesn't apply.
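
A regression check along these lines could look like the following
sketch. It is not the test added by the commit (the test names and
layout are made up); it reuses the string from the original report and
also compares + against {1,}, which the report says should behave
identically.

```perl
use strict;
use warnings;
use Test::More tests => 2;

# Same input as the original bug report: three non-space characters
# followed by a space.
my @plus  = 'abc ' =~ /(?=(\S+))/g;
my @brace = 'abc ' =~ /(?=(\S{1,}))/g;

is_deeply(\@plus, [qw(abc bc c)],
    '+ inside (?=) is tried at every start position');
is_deeply(\@plus, \@brace,
    '+ and {1,} agree inside (?=)');
```
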
IMO your patch was quite right, although I had to dig fairly deep to
understand why.
Thanks for the patch.
BTW, I am a bit curious if there are any other flaws in the flip-flop logic.
I tried reasonably hard to make it fail without the zero-width
lookahead, and was unable to find a failure case, but I still kinda
feel like there might be some interesting edge case.
Cheers,
yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"