develooper Front page | perl.perl5.porters | Postings from July 2018

Re: [perl #133352] Ancient Regex Regression

Thread Previous | Thread Next
From:
Deven T. Corzine
Date:
July 22, 2018 07:52
Subject:
Re: [perl #133352] Ancient Regex Regression
Message ID:
7032_1532245926_5B5437A5_7032_98_1_CAFVdu0RrXg6-738JpFk8O3o6BR_kXoducDXa2p6+xRMcOFG6YQ@mail.gmail.com
On Tue, Jul 17, 2018 at 3:54 PM, David Nicol <davidnicol@gmail.com> wrote:
>
>
> On Tue, Jul 17, 2018 at 12:56 PM Deven T. Corzine <deven@ties.org> wrote:
>>>
>>> On the other hand, my patch does allow this example to match:
>>>
>>>      print "matched\n" if "ABCDA" =~ /^ (?: (.)B | CD )* \1 $/x;
>>>
>>> Without my patch, this matches instead:
>>>
>>>      print "matched\n" if "ABCDC" =~ /^ (?: (.)B | CD )* \1 $/x;
>>>
>>>
>>>
>> That optimization doesn't cause the bug, it's the attempt to match the (.)
>> again against "CD" that causes it -- the (.) matches, but the "D" doesn't,
>> and it doesn't restore the original capture.
>>
>> Deven
>
>
> Thank you, I misunderstood. So in the original demonstration, the "b" got
> into $2 before the branch failed because the b was not followed by "foo",
> not due to $2 being internally tracked as an offset, and as that branch had
> succeeded, the capture was assignable.

Exactly!  Now you've got it.

For reference, my original example was: "afoobar" =~ /((.)foo|bar)*/

On the second iteration, (.) does successfully match the "b" in "bar",
so it saves that capture in $2 before the "foo" attempts to match
against the remaining "ar", which obviously fails.  Since the branch
fails as a whole, the capture of "b" is invalid, yet it gets returned
anyhow.  This is the bug.

> As the current documentation (the section on "Capture Groups" in
> https://perldoc.perl.org/perlre.html, accessed just now) states "If a group
> did not match, the associated backreference won't match either. (This can
> happen if the group is optional, or in a different branch of an
> alternation.) " there is clearly a bug somewhere. On the other hand, as
> CDCDC fails to match the test while CBCDC (unsurprisingly, but for
> surprising reason) does, so there is some kind of "did this match" knowledge
> happening, otherwise CDCDC would set \1 to C before failing to match the B,
> and the implementation could be interpreted as conformant with the
> documentation's "if a group did not match" but it takes a lot of squinting.

In my "afoobar" example, the inner group doesn't match successfully
against "bar", so it shouldn't return "b" as it does.  But that group
DID successfully match "a", so it seems to me that it ought to be
returned from the successful match.  If the match text were "barbar",
then $2 should be undef because the inner group wouldn't match at all.

> The current documentation (that section) contains no guidance concerning
> capture groups within repeating constructs.

I think you're right about this.  I've searched the documentation
pretty carefully and haven't found anything that describes what
happens when capture groups are repeated.  I'm surprised by this, but
I can't find it anywhere.

> Honestly, before today I expected regex constructions like
>
>                "abcdef" =~ /(?:(.))+/
>
> to magically create $1 through $6 and load them all. This was erroneous!
> That's not how it works! The documentation is silent on the matter.

It's not silent about this, it's in the sentence immediately before
the one you quoted!  "Groups are numbered with the leftmost open
parenthesis being number 1, etc."  This only applies to parens used
for capturing groups, not other parens (e.g. non-capturing groups,
lookahead matches, etc.) -- just count the open parens to find the
group number.  (The branch-reset construct (?|...) is an exception to
this rule, reusing the same group numbers for each branch.)

Since your example above only has a single capturing group (.), it's
always $1 and there is no $2.  When your non-capturing group repeats,
it just keeps replacing $1 with each letter, returning "f" at the end.

> As an opinionated person, I'm in favor of fixing the regression and
> including
>
>   # we don't clobber capture groups with data from failed alternate branches
> (although we used to)
>  ( "ABCD" =~ /^(?:(.)B|CD)*$/ and $1 eq ( $] ge '5.027' ? 'A' : 'C' ))
>
> into the test suite and documenting how captures into buffers in
> alternations that passed in earlier iterations but not the most recent one
> used to work, in perldoc perlre.

In the example above, "C" is clearly the wrong result for $1, since
there is no "CB" in "ABCD".  There should be no version check to
whitewash the results -- this is a regression test and those older
versions should FAIL this test because they DO contain this bug!

> ...
> After looking at
> https://rt.perl.org/Public/Ticket/Attachment/1566563/824618/perl-133136-test1.patch

I don't know how I ended up typing the wrong ticket number into that
filename!  Of course, it should have been 133352, not 133136...

> I wonder if it might be possible to defer the assignment into the capture
> buffers until after branches have succeeded, rather than resetting them.

I'm sure it's possible, and I've been wondering if it would be better or worse.

> This approach might require making a set of provisional capture buffers at
> every juncture that could become a descent into an iterating subregex
> containing captures, but wouldn't be vulnerable to only operating correctly
> at the first level.

Yes, it would require handling the possibility of multiple levels of
provisional captures.

> But maybe the engine already stacks these things so with
> the patch
>
> $ perl -e ' "ABCDAFCDAD" =~ /(?:(?:(.)B|CD)+|(?:(.)D|A(.))*)+/ and print "$1
>> $2 > $3"'
> C > A > F
>
> will do the right thing, whatever that is.

The correct answer is A > A > F, which is what it returns with my
patch.  The bug doesn't affect the other groups in this example.

I get where you're going with this.  Here's an example that makes the
point more strongly:

perl -MData::Dumper -e '$Data::Dumper::Varname=""; print
Dumper("ABCDEFGHIJKLMHO" =~
/^(?:(.)(?:(.)C|DE)+FG|H(.)(?:(.)K|LM)+|.O)+$/);'

Perl currently returns the wrong values for ALL of these captures:

$1 = 'H';
$2 = 'D';
$3 = 'O';
$4 = 'L';

With my patch, the correct values are returned:

$1 = 'A';
$2 = 'B';
$3 = 'I';
$4 = 'J';

With a bit more mangling, I came up with this example as well:

perl -MData::Dumper -e '$Data::Dumper::Varname=""; print
Dumper("ABCDEHIJKLMHO" =~
/^(?:(.)(?:(.)C|DE)+FG|H(.)(?:(.)K|LM)+|.O|.*?)+$/);'

Currently, Perl only gets $2 correct for this example, and the other
captures are wrong:

$1 = 'H';
$2 = undef;
$3 = 'O';
$4 = 'L';

Again, with my patch, the correct values are returned:

$1 = undef;
$2 = undef;
$3 = 'I';
$4 = 'J';

Isn't this fun? :)

> Thank you

Thanks for participating in the discussion!

Deven

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About