Re: [perl #133352] Ancient Regex Regression

From: Deven T. Corzine
Date: July 11, 2018 14:20
Subject: Re: [perl #133352] Ancient Regex Regression
Message ID: CAFVdu0Rvt_FfGLvBGkg86GSOHw02pQSpyeATLKJWG_Ymkxh96g@mail.gmail.com

[I originally entered this via RT, but I didn't see it come through
p5p, so I'm resending directly.  Hopefully it's not a duplicate!]

On Tue, 10 Jul 2018 12:19:30 -0700, hv wrote:
> On Sun, 08 Jul 2018 21:04:32 -0700, dcorzine wrote:
> > This was my test case, which works with or without anchors:
> >
> > "afoobar" =~ /((.)foo|bar)*/
> > "afoobar" =~ /^((.)foo|bar)*$/
> >
> > Or, as a standalone command:
> >
> > perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'
> >
> > This prints "b", even though "bfoo" never appears in "afoobar"!
> [...]
> > The correct answer seems to be "a", since that's the last match
> > of the inner group and the overall match is successful.
>
> It's by no means clear to me that that must be the correct answer; the
> other candidate worth considering is that, since in /(...)*/ we return
> the match for the last iteration of the group, you'd expect further
> captures embedded within there also to deliver the version from the
> last iteration of the group.
>
> Under that interpretation, the correct answer would be undef, and the
> earlier releases that returned 'a' were merely more subtly wrong than
> the current release.

Yeah, that's why I said the correct answer "seems" to be "a".  There's a
decent argument for returning undef, and it's certainly
counter-intuitive to some degree to have $1="bar" from the second
iteration and $2="a" from the first iteration, but the inner group
does successfully match during the first iteration only, so "a" is
indeed the last successful match.

I see two different viewpoints here, and both are quite reasonable.
One viewpoint is that these are nested groups and both should be
returning results from the same iteration, and therefore $2 should be
undef because the nested match doesn't match anything on that
iteration.  Intuitively, this feels right.  The other viewpoint is
that each group can match multiple times and we only get to keep one
capture per group, so $2 should be "a" since that's the last
successful match for that group, despite the confusion of matching $1
and $2 from different iterations.
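
The two viewpoints can be told apart mechanically.  What follows is a
minimal sketch, not anything from the perl source or test suite: the
helper name classify_capture is hypothetical, and the script only
reports which reading the perl running it happens to follow.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): run the anchored test
# pattern against a string and say which viewpoint $2 agrees with.
sub classify_capture {
    my ($str) = @_;
    return 'no match' unless $str =~ /^((.)foo|bar)*$/;
    my $inner = $2;    # copy $2 before any later regex can clobber it
    return '$2=undef: captures come from the last iteration only'
        unless defined $inner;
    return '$2="a": last successful match of the inner group'
        if $inner eq 'a';
    return qq{\$2="$inner": agrees with neither viewpoint};
}

# $] is the version of the running perl; the result may differ by release.
print "perl ", $], ": ", classify_capture('afoobar'), "\n";
```

A result of "b" for "afoobar" lands in the third branch, which is the
behaviour reported at the top of this ticket.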

Personally, the $2="a" viewpoint seems like a stronger argument to me,
but I could be in the minority in thinking that.  I like the $2=undef
viewpoint too, but we can't have both.  In terms of how regular
expressions are defined and documented, I'm having trouble
rationalizing $2=undef even though it sounds good.

Is there anything definitive in the documentation that would resolve
the question without ambiguity?  I haven't found it yet, if there is.

For what it's worth, other regular expression engines like PCRE, RE2,
GNU and others all return "a", which seems to suggest there may be
some sort of consensus that "a" is the correct answer, but maybe it's
just a gray area with no definite right answer?

> The fact that other examples don't match such an interpretation argues
> against it:
> % perl -wle 'use Data::Dumper; print Dumper([ "foobar" =~ /((foo)|(bar))*/ ])'
> $VAR1 = [
>           'bar',
>           'foo',
>           'bar'
>         ];
>  %
> .. but I don't recall seeing docs to justify such results. I'll have
> to have another wade through them.
>
> Hugo

That example is certainly returning $2 and $3 from different loop
iterations, and it's less clear in this case that returning $2=undef
would somehow be preferable or more intuitive for this example.  Is
that what you were getting at?
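
To make the comparison concrete, here is a small sketch that prints
every capture from a match with undef made visible.  The helper name
show_captures is hypothetical.  Under the "same iteration" reading the
middle value for "foobar" would be undef, whereas the output Hugo
quoted shows 'foo' there.

```perl
use strict;
use warnings;

# Hypothetical helper: match in list context and return each capture
# as a printable string, showing undef explicitly.
sub show_captures {
    my ($str, $re) = @_;
    my @caps = $str =~ $re;    # list context yields ($1, $2, ...)
    return map { defined $_ ? "'$_'" : 'undef' } @caps;
}

# Hugo's example: the second iteration matches via (bar), not (foo).
print join(', ', show_captures('foobar', qr/((foo)|(bar))*/)), "\n";
```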

Deven
