develooper Front page | perl.perl5.porters | Postings from November 2010

Re: [perl #68564] /g failure with zero-width patterns

From: demerphq
Date: November 3, 2010 02:41
Subject: Re: [perl #68564] /g failure with zero-width patterns
Message ID: AANLkTinRS1_wFJXQtjOnXT2vHEiRreQH5bJAapBO_Vts@mail.gmail.com

On 10 October 2010 23:27, Father Chrysostomos via RT
<perlbug-followup@perl.org> wrote:
> On Sun Aug 16 01:15:58 2009, ikegami@adaelis.com wrote:
>> A regression was introduced into 5.10.0 concerning /g and zero-width
>> patterns. The demo speaks for itself:
>>
>> >c:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
>> abc
>> bc
>> c
>>
>> >c:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
>> abc
>
> This was broken by commit 07be1b83a6b2d24b492356181ddf70e1c7917ae3,
> which extended stclass optimisations to (?=).
>
> I tried following the code paths for + and for {1,} (which are meant to
> be identical, but only {1,} was working). I noticed they diverged as a
> result of + having PREGf_SKIP set.
>
> So I fixed it by not setting PREGf_SKIP if the + is inside a (?=).
>
> I really don’t understand this code, and would much appreciate any
> feedback as to whether this fix (or ‘fix’) will break anything else.

I have looked into this more deeply and applied a modified version of
your patch, which IMO was more or less exactly correct.

Here is the commit message I wrote:

commit e7f38d0fe17e7a846c0ed55e71ebb120a336b887
Author: Yves Orton <demerphq@gmail.com>
Date:   Wed Nov 3 10:23:00 2010 +0100

    fix 68564: /g failure with zero-width patterns

    This is based on a patch by Father Chrysostomos <sprout@cpan.org>

    The start class optimisation has two modes, "try every valid start
    position" (doevery) and "flip flop mode" (!doevery) where it tries
    only the first valid start position in a sequence.

    Consider /(\d+)X/ and the string "123456Y". We know that if we fail
    to match X after matching "123456" then we will also fail to match after
    "23456" (assuming no evil tricks are in place, which disable the
    optimisation anyway), so we know we can skip forward until the check
    /fails/ and only then start looking for a real match. This is flip-flop
    mode.

    Now consider the case with zero-width lookahead under /g: /(?=(\d+)X)/.
    In this case we have an additional failure mode, that is failure when
    we match a zero-width string twice at the same pos(). So now, the
    "flip-flop" logic breaks as it /is/ possible that we could match at
    "23456" when we couldn't match at "123456" because of the zero-length
    twice at the same pos() rule. For instance:

      print for "123"=~/(?=(\d+))/g

    should first match "123". Since $& is zero length, pos() is not
    incremented. We then match again, successfully, except that the match
    is rejected despite technically succeeding because its $& is also zero
    length and pos() has not advanced. If the flip-flop mode is enabled
    we won't retry until we first find a failing character.

    The point here is that it makes perfect sense to disable the
    "flip-flop" mode optimisation when the start class is inside
    a lookahead as it really doesn't apply.
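
As a rough illustration of the two modes described in the commit message, the sketch below lists the offsets at which a match attempt would begin for a given start class. The helper start_positions and its arguments are hypothetical and only model the idea; this is not how the regex engine is actually written.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of perl's regex engine: return the offsets
# at which a match attempt would start, given a start-class test.
sub start_positions {
    my ($str, $in_class, $doevery) = @_;
    my @starts;
    my $armed = 1;    # flip-flop state: may the next class member be tried?
    for my $i (0 .. length($str) - 1) {
        if ($in_class->(substr($str, $i, 1))) {
            # doevery tries every member; flip-flop only the first of a run
            push @starts, $i if $doevery || $armed;
            $armed = 0;    # stay disarmed until a non-member is seen
        }
        else {
            $armed = 1;
        }
    }
    return @starts;
}

my $is_digit = sub { $_[0] =~ /\d/ };
my @every = start_positions('12a34', $is_digit, 1);
my @flip  = start_positions('12a34', $is_digit, 0);
print "doevery:   @every\n";    # doevery:   0 1 3 4
print "flip-flop: @flip\n";     # flip-flop: 0 3
```

In this model, flip-flop mode never starts an attempt at offsets 1 or 4, which is exactly the set of positions a zero-width lookahead under /g still needs.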

IMO your patch was quite right, although I had to dig fairly deep to
understand why.
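
A small check along these lines can confirm the behaviour on a given perl. It assumes a perl that includes this fix; per the bug report, an affected 5.10.0 returns only the first capture for the + form, while {1,} was unaffected.

```perl
use strict;
use warnings;

# Collect every capture produced by a zero-width lookahead under /g.
# With the fix, each digit position is tried in turn.
my @plus  = '123' =~ /(?=(\d+))/g;
my @brace = '123' =~ /(?=(\d{1,}))/g;

print "plus:  @plus\n";     # plus:  123 23 3
print "brace: @brace\n";    # brace: 123 23 3
```

The match is assigned to an array instead of printing $1 inside a loop, since in list context /g runs every match before any loop body executes.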

Thanks for the patch.

BTW, I am a bit curious if there are any other flaws in the flip-flop logic.
I tried reasonably hard to make it fail without the zero-width lookahead,
and was unable to find a failure case, but I still kinda feel like there
might be some interesting edge case.


Cheers,
yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
