develooper Front page | perl.perl5.porters | Postings from October 2018

Re: [perl #133352] Ancient Regex Regression

From:
Deven T. Corzine
Date:
October 16, 2018 19:18
Subject:
Re: [perl #133352] Ancient Regex Regression
Message ID:
CAFVdu0SN1OR76Umvv=tAJkQhaOcoHFByXyX-Fcut5J=L8c-Ytw@mail.gmail.com
On Tue, Aug 28, 2018 at 10:47 AM Dave Mitchell <davem@iabyn.com> wrote:

> On Sun, Aug 12, 2018 at 09:39:16PM -0400, Deven T. Corzine wrote:
> > It's been quiet for a couple weeks.  Has anyone had a chance to take a
> > look at the patch yet?
>
> (Yves, I'e CCed you as there's a question for you further down this
> email.)
>
> Sorry for the late reply. I took the opportunity to try to properly
> understand this issue (you may remember me mentioning at the beginning how
> "I hate and fear the capture save and restore code in regmatch()").
>

Sorry, now I'm the one being slow to reply!  Yes, I remember that comment,
which certainly illustrates why so many people told me I was crazy to
attempt a patch myself!


> I think I now have a much better handle on this. I've just pushed a branch
> which contains a proposed fix, along with lots of tests and benchmark
> entries. It avoids saving and restoring capture indices unless the
> alternation is wrapped in a repeat, and only saves indices back to the
> start of the repeat. This is less inefficient than your original
> suggestion, but it still makes alternations with captures within a repeat
> run at about 50% of their former speed.
>

I submitted my initial patch at James Keenan's request, solely as a
starting point for discussion.  I stated in the accompanying notes that I
didn't consider the patch ready to apply yet, for reasons including the
unacceptable performance impact on regexes that aren't even affected by the
bug in the first place.  I was hoping to find a clean way to set a paren
floor (as you've done with your patch), to mitigate the performance impact
that would otherwise affect every regex with captures inside branches.  I'd
hate to have a 50% performance impact on common regexes just to fix a bug
that's so rarely encountered that it went unnoticed for decades!

My initial patch was only focused on correctness, not performance.
Stipulating that the performance was unacceptably slow, could you tell me
if my original patch was clean and valid from a correctness perspective, or
was I doing something wrong/ugly/unvalid in the patch?

I know a way to make it much faster, but it involves, for every
> BRANCH/BRANCHJ/TRIE/TRIEC node, knowing the current capture index - i.e.
> in the same way that OPEN nodes have 'n = ARG(scan);' but I don't know how
> this can stored.
>
> Yves, is there type of node - or a way to extend the current nodes - such
> that a 32-bit capture index can stored as part of each of these nodes
> at compile time? Without breaking everything?
>

I know there are only 8 bits to distinguish them, but perhaps this is a
situation where adding new regnode types would be worthwhile?  Am I right
in thinking that a 32-bit capture index and the WHILEM pointer could be
stored that way?

Alternatively, if "ARG(scan)" or another mechanism could be used to save
even an 8-bit or 16-bit index value, it would probably cover most use
cases, and the max value (255 or 65535) could be used by default if the
index doesn't fit.  I'm sure the vast majority of regexes have 255 or fewer
captures in the first place, and in the rare case where a regex has
thousands of captures and the proper index won't fit, this would only cause
an extra unnecessary performance for those rare cases, and the correctness
of the result should not be affected.

Anyway, I've pushed my current proposed fix as smoke-me/davem/captures,
> and here's the commit message:
>

 I'll definitely check out your proposed fix; hopefully I'll get to it
sooner than later!

Thanks for looking into this!

Deven



nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About