
[perl #133352] Ancient Regex Regression

Deven T. Corzine via RT
July 13, 2018 03:30
On Tue, 10 Jul 2018 12:19:30 -0700, hv wrote:
> On Sun, 08 Jul 2018 21:04:32 -0700, dcorzine wrote:
> > This was my test case, which works with or without anchors:
> >
> > "afoobar" =~ /((.)foo|bar)*/
> > "afoobar" =~ /^((.)foo|bar)*$/
> >
> > Or, as a standalone command:
> >
> > perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'
> >
> > This prints "b", even though "bfoo" never appears in "afoobar"!
> [...]
> > The correct answer seems to be "a", since that's the last match
> > of the inner group and the overall match is successful.
> It's by no means clear to me that that must be the correct answer; the
> other candidate worth considering is that, since in /(...)*/ we return
> the match for the last iteration of the group, you'd expect further
> captures embedded within there also to deliver the version from the
> last iteration of the group.
> Under that interpretation, the correct answer would be undef, and the
> earlier releases that returned 'a' were merely more subtly wrong than
> the current release.

Yeah, that's why I said the correct answer "seems" to be "a".  There's a decent argument for returning undef, and it's certainly somewhat counter-intuitive to have $1="bar" from the second iteration and $2="a" from the first iteration.  But the inner group successfully matches only during the first iteration, so "a" is indeed its last successful match.

I see two different viewpoints here, and both are quite reasonable.  One viewpoint is that these are nested groups and both should be returning results from the same iteration, and therefore $2 should be undef because the nested match doesn't match anything on that iteration.  Intuitively, this feels right.  The other viewpoint is that each group can match multiple times and we only get to keep one capture per group, so $2 should be "a" since that's the last successful match for that group, despite the confusion of matching $1 and $2 from different iterations.
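To make the two viewpoints concrete, here is a rough model in Python.  It is not how Perl's regex engine works internally; it just applies the group body "(.)foo|bar" one iteration at a time and tracks the captures under each policy.  The names iterate_group and reset_inner are made up for this illustration, and the loop is simply greedy with no backtracking across iterations, which is enough for this subject string.

```python
import re

# One iteration of the group body from /((.)foo|bar)*/, matched on its own.
BODY = re.compile(r"(.)foo|bar")


def iterate_group(subject, reset_inner):
    """Apply BODY repeatedly from offset 0 and return (outer, inner).

    outer stands in for $1 (text of the last iteration) and inner for $2.
    reset_inner=False: the inner group keeps its last successful capture.
    reset_inner=True:  the inner group reflects only the final iteration.
    """
    pos, outer, inner = 0, None, None
    while True:
        m = BODY.match(subject, pos)
        if m is None or m.end() == pos:
            break  # no further iteration is possible
        outer = m.group(0)
        if m.group(1) is not None:
            inner = m.group(1)  # inner group took part in this iteration
        elif reset_inner:
            inner = None        # "same iteration" viewpoint: forget old capture
        pos = m.end()
    return outer, inner


print(iterate_group("afoobar", reset_inner=False))  # "last successful match" viewpoint
print(iterate_group("afoobar", reset_inner=True))   # "same iteration" viewpoint
```

Under the first policy the model gives $2="a", and under the second it gives no value for $2.  Neither policy can produce "b", since the inner group never completes a successful match on "b".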

Personally, the $2="a" viewpoint seems like a stronger argument to me, but I could be in the minority in thinking that.  I like the $2=undef viewpoint too, but we can't have both.  In terms of how regular expressions are defined and documented, I'm having trouble rationalizing $2=undef even though it sounds good.

Is there anything definitive in the documentation that would resolve the question without ambiguity?  I haven't found it yet, if there is.

For what it's worth, other regular expression engines (PCRE, RE2, GNU and others) all return "a", which suggests there may be some consensus that "a" is the correct answer.  Or maybe it's just a gray area with no definite right answer?

> The fact that other examples don't match such an interpretation argues
> against it:
> % perl -wle 'use Data::Dumper; print Dumper([ "foobar" =~
> /((foo)|(bar))*/ ])'
> $VAR1 = [
>           'bar',
>           'foo',
>           'bar'
>         ];
>  %
> .. but I don't recall seeing docs to justify such results. I'll have
> to have another wade through them.
> Hugo

That example is certainly returning $2 and $3 from different loop iterations, and it's less clear here that returning $2=undef would be preferable or more intuitive.  Is that what you were getting at?
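For comparison, here is what each policy would give for the /((foo)|(bar))*/ example, again as a rough Python model rather than a description of Perl's engine.  The function name last_iteration_captures and the sticky flag are invented for this sketch, and the loop applies the group body greedily, one iteration at a time.

```python
import re


def last_iteration_captures(body, subject, sticky):
    """Repeat `body` from offset 0; return [outer, inner1, inner2, ...].

    sticky=True:  each inner group keeps its last successful capture.
    sticky=False: inner groups reflect only the final iteration.
    """
    pattern = re.compile(body)
    pos, outer = 0, None
    inner = [None] * pattern.groups
    while True:
        m = pattern.match(subject, pos)
        if m is None or m.end() == pos:
            break  # no further iteration is possible
        outer = m.group(0)
        for i, value in enumerate(m.groups()):
            # Overwrite on a successful capture; clear only when not sticky.
            if value is not None or not sticky:
                inner[i] = value
        pos = m.end()
    return [outer] + inner


print(last_iteration_captures(r"(foo)|(bar)", "foobar", sticky=True))
print(last_iteration_captures(r"(foo)|(bar)", "foobar", sticky=False))
```

The sticky policy reproduces the ('bar', 'foo', 'bar') result quoted above, while the per-iteration policy would leave $2 undefined.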


via perlbug:  queue: perl5 status: open
