
Re: [perl #133352] Ancient Regex Regression

From: David Nicol
Date: July 17, 2018 19:54
Subject: Re: [perl #133352] Ancient Regex Regression
Message ID: 28033_1531857291_5B4E498A_28033_85_1_CAFwScO-NND-EK0bOpnFDrhLZQT=dBm_GX-aOPtH-vuA3Tyw6Yw@mail.gmail.com

On Tue, Jul 17, 2018 at 12:56 PM Deven T. Corzine <deven@ties.org> wrote:

> On the other hand, my patch does allow this example to match:
>
>      print "matched\n" if "ABCDA" =~ /^ (?: (.)B | CD )* \1 $/x;
>
> Without my patch, this matches instead:
>
>      print "matched\n" if "ABCDC" =~ /^ (?: (.)B | CD )* \1 $/x;
>
> That optimization doesn't cause the bug, it's the attempt to match the
> (.) again against "CD" that causes it -- the (.) matches, but the "D"
> doesn't, and it doesn't restore the original capture.
>
> Deven

Thank you, I misunderstood. So in the original demonstration, the "b" got
into $2 before the branch failed because the b was not followed by "foo",
not due to $2 being internally tracked as an offset, and as that branch had
succeeded, the capture was assignable.


The current documentation (the section on "Capture Groups" in
https://perldoc.perl.org/perlre.html, accessed just now) states "If a group
did not match, the associated backreference won't match either. (This can
happen if the group is optional, or in a different branch of an
alternation.)" so there is clearly a bug somewhere. On the other hand,
CDCDC fails to match the test while CBCDC (unsurprisingly, but for a
surprising reason) does, so there is some kind of "did this match"
knowledge happening; otherwise CDCDC would set \1 to C before failing to
match the B. The implementation could therefore be interpreted as
conformant with the documentation's "if a group did not match", but it
takes a lot of squinting.
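
The four strings discussed so far can be compared on whatever perl is at
hand with a small probe script. This is only a sketch: the helper name
probe() is made up for illustration, and no expected results are listed
because they depend on whether the interpreter carries the fix.

```perl
use strict;
use warnings;

# Pattern from the quoted message: a capture inside a repeated
# alternation, followed by a backreference to that capture.
my $re = qr/^ (?: (.)B | CD )* \1 $/x;

# probe() is a hypothetical helper: it returns 1 if the string
# matches the pattern and 0 otherwise.
sub probe {
    my ($str) = @_;
    return $str =~ $re ? 1 : 0;
}

my %result = map { $_ => probe($_) } qw(ABCDA ABCDC CDCDC CBCDC);

for my $str (sort keys %result) {
    printf "%-5s %s\n", $str, $result{$str} ? "matched" : "did not match";
}
```

Running it under two different perl builds side by side shows which of the
strings change behavior between them.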

The current documentation (that section) contains no guidance concerning
capture groups within repeating constructs. Honestly, before today I
expected regex constructions like

               "abcdef" =~ /(?:(.))+/

to magically create $1 through $6 and load them all. This was erroneous!
That's not how it works! The documentation is silent on the matter.
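
What does happen is easy to check: a capture group inside a repeated
construct keeps only the text from its last iteration, and collecting
every iteration needs a different construct, such as a global match in
list context. A short illustration (the variable names are arbitrary):

```perl
use strict;
use warnings;

# One capture group, repeated: only the last iteration survives in $1.
my $last;
if ("abcdef" =~ /(?:(.))+/) {
    $last = $1;
}

# To collect every character, match globally in list context instead.
my @all = "abcdef" =~ /(.)/g;

print "last iteration: $last\n";    # prints "last iteration: f"
print "all iterations: @all\n";     # prints "all iterations: a b c d e f"
```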

As an opinionated person, I'm in favor of fixing the regression and
including

  # we don't clobber capture groups with data from failed alternate branches (although we used to)
  ( "ABCD" =~ /^(?:(.)B|CD)*$/ and $1 eq ( $] ge '5.027' ? 'A' : 'C' ))

in the test suite, and of documenting in perldoc perlre how captures into
buffers in alternations that matched in earlier iterations, but not in the
most recent one, used to work.
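
As a runnable variant of that check, here is a sketch that assumes
Test::More rather than the core regex test harness. Because the version
at which the behavior changes is only a proposal in this message, the
sketch accepts either value of $1 and reports which one the running perl
produces, instead of hard-coding a version threshold.

```perl
use strict;
use warnings;
use Test::More;

# Match first, then copy $1 right away so later code cannot disturb it.
my $matched = "ABCD" =~ /^(?:(.)B|CD)*$/;
my $got     = $matched ? $1 : undef;

ok($matched, 'pattern matches "ABCD"');

# 'A' means the capture from the successful (.)B iteration survived;
# 'C' means it was overwritten by the failed (.)B attempt against "CD".
ok(defined $got && ($got eq 'A' || $got eq 'C'), '$1 is either "A" or "C"');
diag("perl " . $] . " leaves \$1 as '" . (defined $got ? $got : 'undef') . "'");

done_testing();
```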

...
After looking at
https://rt.perl.org/Public/Ticket/Attachment/1566563/824618/perl-133136-test1.patch
I wonder if it might be possible to defer the assignment into the capture
buffers until after branches have succeeded, rather than resetting them.
This approach might require making a set of provisional capture buffers at
every juncture that could become a descent into an iterating subregex
containing captures, but it wouldn't be vulnerable to operating correctly
only at the first level. But maybe the engine already stacks these things,
so that with the patch

$ perl -e ' "ABCDAFCDAD" =~ /(?:(?:(.)B|CD)+|(?:(.)D|A(.))*)+/ and print "$1 > $2 > $3"'
C > A > F

will do the right thing, whatever that is.
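
The deferred-assignment idea can be pictured with a toy model. This is
purely illustrative: try_branch() and the capture hash are hypothetical
and do not correspond to anything in the actual regex engine. Each branch
writes to a provisional copy of the captures, and the copy is committed
only if the branch succeeds.

```perl
use strict;
use warnings;

# Toy model only: a branch is a code ref that receives a provisional
# copy of the captures, may write to it, and returns true on success.
sub try_branch {
    my ($captures, $branch) = @_;
    my %provisional = %$captures;       # work on a copy
    if ($branch->(\%provisional)) {
        %$captures = %provisional;      # commit only on success
        return 1;
    }
    return 0;                           # failure leaves $captures untouched
}

my %captures = (1 => 'A');              # state after "(.)B" matched "AB"

# Second iteration on "CD": "(.)" grabs "C", then "B" fails against "D".
my $ok = try_branch(\%captures, sub {
    my ($c) = @_;
    $c->{1} = 'C';                      # provisional capture
    return 0;                           # "B" does not match "D"
});

# prints "branch failed, $1 is still A"
print "branch ", ($ok ? "succeeded" : "failed"), ", \$1 is still $captures{1}\n";
```

Since every call makes its own copy, a branch that itself calls
try_branch() gets the same protection at each nesting level, which is the
property the paragraph above is after. The cost is one copy per attempted
branch.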

Thank you



-- 
"At this point, given the limited available data, certainty about only a
very small number of things can be achieved." -- Plato, and others
