develooper Front page | perl.perl5.porters | Postings from July 2018

Re: [perl #133352] Ancient Regex Regression

Thread Previous | Thread Next
From:
Deven T. Corzine
Date:
July 17, 2018 15:42
Subject:
Re: [perl #133352] Ancient Regex Regression
Message ID:
4967_1531842171_5B4E0E7B_4967_334_3_CAFVdu0Tinyn+Yx7pwwu5O+eneicjKOZzU4AYcOFrNGxc_Ti-pg@mail.gmail.com
On Tue, Jul 17, 2018 at 11:02 AM, Dave Mitchell <davem@iabyn.com> wrote:
> On Tue, Jul 10, 2018 at 02:14:03PM -0700, Deven T. Corzine via RT wrote:
>> Yeah, that's why I said the correct "seems" to be "a".  There's a decent
>> argument for returning undef
>
> My take on it is based on two (possibly conflicting) ideals.
>
> The first is that when starting the (N+1)th iteration of a '*' expression
> or similar, the captures from the Nth iteration are still valid, until
> over-written by the (N+1)th iteration. This allows patterns of this form
> to work:
>
>     print "1=[$1]\n" if "AA-ABB-" =~ /^ (?: \1? (\w) \1 - )* $/x
>
> which outputs 'B'.
>
> On the first iteration:
>     the first  \1 fails to match anything, and is skipped via the '?';
>     the second \1 matches 'A'
>
> On the second iteration:
>     the first  \1 matches 'A' - the value from the last iteration
>     the second \1 matches 'B' - the value from the current iteration

Interesting.  It's not a forward reference, but it's sort of an
inverted backreference.  I don't think I've ever seen that done
before, but it seems to me that this ought to be valid, unusual though
it is.

> The second principle is that following an alternation, the captures
> from branches that failed or weren't tried, are invalid.
>
> Neither of these match:
>
>     print "matched\n" if "B" =~ /^ (?: A(.) | B ) \1 $/x;
>     print "matched\n" if "B" =~ /^ (?: A | B | (.) ) \1 $/x;

My patch doesn't change the behavior of those examples.

On the other hand, my patch does allow this example to match:

     print "matched\n" if "ABCDA" =~ /^ (?: (.)B | CD )* \1 $/x;

Without my patch, this matches instead:

     print "matched\n" if "ABCDC" =~ /^ (?: (.)B | CD )* \1 $/x;

> So the real question is, at the end of an alternation, should any
> 'unused' captures within the alternation be flagged as invalid,
> or should they preserved -  retaining the values they had at the start of
> the alternation (which may be real values if this is a second+ iteration
> of an enclosing '*' etc).

Indeed, and there is certainly room for debate here, as both arguments
are reasonable.

Personally, invalidating a successful capture that was part of a
successful match feels wrong to me.  Consider this example:

     "foobar" =~ /^ (?: (foo) | (bar) )* $/x;

Both with and without my patch, this leaves $1="foo" and $2="bar".  If
the "unused" capture were to be invalidated, this would leave $1
undefined instead.  Would this be desirable?

> I think you could argue it either way. However, since this bug has been
> around since forever, with no-one apparently noticing it before, I think
> can fix it how we like - so we should pick whichever is easiest to
> implement.

That's a good point. Still, it's worth considering that restoring the
original captures is what Perl used to do in Perl 2.0 through Perl 5.0
alpha 9, and it's also what PCRE, RE2 and other regex implementations
do also.  Javascript does seem to invalidate the capture instead, but
that's the only counter-example I've seen so far.

Strangely, I can't find a single word in the documentation about what
happens with captures when quantifiers are applied.  We're used to the
capture being replaced with the new one, yet I can't find anything
saying that it does that!  All of it seems to be unspecified behavior.
Is it somewhere I'm missing?

> Invalidating captures set by a failing branch involves just knowing the
> max index at the start and end of the branch execution, and invalidating
> everything in between; restoring previous values involves saving a whole
> set of capture indices and restoring them on failure (which is what I
> think your patch does). The latter sounds a whole lot more expensive, and
> would potentially slow down all alterations.

Yes, that's what my current patch does, and there IS a performance
issue here, since I'm currently saving and restoring ALL captures,
whether or not they're in the alternation.  This is unnecessary, but I
haven't figured out yet how to set the paren floor correctly to only
save the necessary ones.  I think that would mitigate most of the
performance impact, though not all of it.

Would it be better to remember the captures some other way?  Perl 6
returns ALL the captures; should Perl 5 have that capability too?

Deven

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About