develooper Front page | perl.perl5.porters | Postings from July 2018

Re: [perl #133352] Ancient Regex Regression

Thread Previous | Thread Next
From:
Deven T. Corzine
Date:
July 22, 2018 18:22
Subject:
Re: [perl #133352] Ancient Regex Regression
Message ID:
CAFVdu0TT-QgjQ-L2aYfcLQupq_rHL2Pz8XTszSR-8sxdCVMtqw@mail.gmail.com
On Wed, Jul 18, 2018 at 4:23 AM, Dave Mitchell <davem@iabyn.com> wrote:
> On Tue, Jul 17, 2018 at 11:42:11AM -0400, Deven T. Corzine wrote:
>> On Tue, Jul 17, 2018 at 11:02 AM, Dave Mitchell <davem@iabyn.com> wrote:
>> > So the real question is, at the end of an alternation, should any
>> > 'unused' captures within the alternation be flagged as invalid,
>> > or should they preserved -  retaining the values they had at the start of
>> > the alternation (which may be real values if this is a second+ iteration
>> > of an enclosing '*' etc).
>>
>> Indeed, and there is certainly room for debate here, as both arguments
>> are reasonable.
>>
>> Personally, invalidating a successful capture that was part of a
>> successful match feels wrong to me.  Consider this example:
>>
>>      "foobar" =~ /^ (?: (foo) | (bar) )* $/x;
>>
>> Both with and without my patch, this leaves $1="foo" and $2="bar".  If
>> the "unused" capture were to be invalidated, this would leave $1
>> undefined instead.  Would this be desirable?
>
> It would match the (admittedly ambiguous) bit in perlre that David Nicol
> quoted:
>
>     If a group did not match, the associated backreference won't match
>     either. (This can happen if the group is optional, or in a different
>     branch of an alternation.)

This says "if a group DID NOT match" -- but in the example above, the
"(foo)" group DID match against the "foo" in "foobar", and the "(bar)"
group DID match against the "bar" in "foobar", so how does this quote
apply here?

>> Yes, that's what my current patch does, and there IS a performance
>> issue here, since I'm currently saving and restoring ALL captures,
>> whether or not they're in the alternation.  This is unnecessary, but I
>> haven't figured out yet how to set the paren floor correctly to only
>> save the necessary ones.  I think that would mitigate most of the
>> performance impact, though not all of it.
>
> There is considerable performance overhead in saving even just one set of
> capture indices - the marginal cost of saving more than one is less. So
> saving fewer is good, saving none is a *lot* better.

I can totally see that.  I have several ideas on possible ways to
solve this; do you have any thoughts on the best approach?

> The only ones needing saving or invalidating are the ones with indices
> lying between lastopen+1 .. maxopenparen (I think, based on a quick look).

Maybe I'm misunderstanding, but isn't lastopen going to point at the
group inside the branch after the first time it gets captured?

My current thinking is to cache the initial value of maxopenparen the
first time a BRANCH or TRIE is executed, and use that as the paren
floor for future iterations.

Do you have any suggestions on how to cache that value?  I haven't
thought of a good solution yet.

> It might be worth writing an alternative patch which does just the
> invalidation, rather than saving/restoring, and see what, if any, tests fail.
> Those failures may give more insight into original intent, and whether
> saving is worth it.
>
> I suspect that writing the invalidation code might be quite tricky (in the
> sense that the eventual code will be simple, but working out what that
> code should be exactly may be hard).

Invalidating the captures would have less performance impact, but it
would be a significant change to the regex semantics, since there
could be a lot of code out there relying on the captures to remain
valid.  This bug escaped notice for 24 years, but invalidating the
captures could break working code that would be unaffected by saving
and restoring the captures, as my patch does.

Consider this arbitrary example:

perl -MData::Dumper -e '$Data::Dumper::Varname=""; print
Dumper("135813885" =~
/^(?:(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(0)|.)*$/);'

This has separate captures for each ASCII digit which can be used to
find out whether or not each digit exists in the string:

$1 = '1';
$2 = undef;
$3 = '3';
$4 = undef;
$5 = '5';
$6 = undef;
$7 = undef;
$8 = '8';
$9 = undef;
$10 = undef;

If you invalidate the captures every time the branch iterates, only
the capture for the last digit ($5) would be set here, and all the
rest would be undef.  On the other hand, saving and restoring the
captures doesn't change the behavior of this example at all.

I realize that digits are a little simplistic, but a similar approach
could be used to search for specific typographical errors in a block
of text: /^(?:(\bTHe\b)|(\bteh\b)|.*?)*$/

>> Would it be better to remember the captures some other way?  Perl 6
>> returns ALL the captures; should Perl 5 have that capability too?
>
> What do you mean exactly?

Perl 6 returns nested Match objects for nested captures and arrays for
quantified captures -- it never clobbers one capture with another the
way most regex engines do.

For my "afoobar" =~ /((.)foo|bar)*/ example, during the second
iteration, Perl 5 re-captures "bar" to $1, clobbering the original
value of "afoo" from the first iteration.  Perl 6 returns both
captures instead:

> "afoobar" ~~ /((.)foo|bar)*/
「afoobar」
 0 => 「afoo」
  0 => 「a」
 0 => 「bar」

Here is the actual object structure returned in Perl 6:

> ("afoobar" ~~ /((.)foo|bar)*/).perl
Match.new(made => Any, pos => 7, orig => "afoobar", hash =>
Map.new(()), from => 0, list => ([Match.new(made => Any, pos => 4,
orig => "afoobar", hash => Map.new(()), from => 0, list =>
(Match.new(made => Any, pos => 1, orig => "afoobar", hash =>
Map.new(()), from => 0, list => ()),)), Match.new(made => Any, pos =>
7, orig => "afoobar", hash => Map.new(()), from => 4, list => ())],))

This could be approximated with basic Perl 5 data types like this perhaps:

# Top-level array corresponding to top-level captures.
[
   # Possibly include the overall regex match ($&) here?
   "afoobar",

   # Array for the outer group ($1), because it is quantified.
   [
      # Arrays for each iteration because there are also nested captures...

      # Array for captures from the first iteration of the outer group ($1).
      [
         # Capture of outer group ($1) from first iteration.
         "afoo",

         # Capture of inner group ($2) from first iteration.
         "a",
      ],

      # Array for captures from the second iteration of the outer group ($1).
      [
         # Capture of outer group ($1) from second iteration.
         "bar",

         # Capture of inner group ($2) from second iteration, or omit
this perhaps?
         undef,
      ],

      # No more iterations of the outer group ($1) match successfully.
   ],

   # No more top-level captures in the regex.
]

This example is for illustrative purposes; such a data structure is
subject to bikeshedding, of course.  If there are named captures, they
should probably be in a hash somewhere, for example.  That's not
important at the moment.

My basic question was whether or not Perl 5 ought to have this
capability to retain the "afoo" match instead of always clobbering
quantified captures -- perhaps only when enabled with a flag on the
regex match, for efficiency?

What do you think?

Deven

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About