develooper Front page | perl.perl5.porters | Postings from October 2018

Re: [perl #133352] Ancient Regex Regression

Thread Previous | Thread Next
From:
Deven T. Corzine
Date:
October 17, 2018 14:29
Subject:
Re: [perl #133352] Ancient Regex Regression
Message ID:
CAFVdu0QUde7jegWEb4WceK4LXNRSdp4LdkraV4QwxPgirCRcHQ@mail.gmail.com
On Wed, Oct 17, 2018 at 7:02 AM Dave Mitchell <davem@iabyn.com> wrote:

> On Tue, Oct 16, 2018 at 03:18:07PM -0400, Deven T. Corzine wrote:
> > My initial patch was only focused on correctness, not performance.
> > Stipulating that the performance was unacceptably slow, could you tell me
> > if my original patch was clean and valid from a correctness perspective,
> or
> > was I doing something wrong/ugly/unvalid in the patch?
>
> Since nearly two months have passed since I worked on this, the details
> have ebbed from my mind so I can't remember whether your approach was
> merely slow or had issues.
>

I had already spent probably 20 hours or more tracking down this bug while
I was at YAPC.  After leaving the conference, I spent at least 4-6 more
hours very carefully studying and tracing the code, including running test
cases with debugging output many times, adding extra debugging output of my
own to trace key variables through all the state changes during a regex
match of my test case.

After all that research and testing, I felt confident that I finally
understood regcppush(), REGCP_SET(), REGCP_UNWIND(), UNWIND_PARENT() and
regcp_restore() pretty well, so I started developing this patch.  I spent
about an hour writing the patch and 2-3 hours debugging it, because I
initially overlooked something simple that should have been obvious.  I
forget what it was, but it was something dumb that I should have noticed
right away when the regression tests failed -- when I finally figured out
what I did wrong, I fixed the code to be what I actually _thought_ that I
had typed in the first place!  (That's why it took me hours to notice it.)

Ultimately, the changes were fairly straightforward, considering.  I'll
describe the changes I made, in chronological order.  (The hunks in the
patch are in a different order.)

First, I added "CHECKPOINT lastcp;" to the "regmatch_state,u" union in the
"branchlike", "branch" and "trie" structs, just after "CHECKPOINT cp;".

Next, I used regcppush() to save the captures near the beginning of "case
BRANCH":

@@ -8043,7 +8042,10 @@ NULL
      ST.lastparen = rex->lastparen;
      ST.lastcloseparen = rex->lastcloseparen;
      ST.next_branch = next;
-     REGCP_SET(ST.cp);
+
+     /* save state of captures at branch start */
+     ST.cp = regcppush(rex, 0, maxopenparen);
+     REGCP_SET(ST.lastcp);

      /* Now go into the branch */
      if (has_cutgroup) {

From looking at the existing code, this appeared the right way to save all
the captures.  I realized that saving ALL the captures was overkill and
very slow, of course.  I was initially only attempting to get the correct
captures back from my test case without breaking any regression tests, and
this patch did succeed at doing that.  (I was still studying the code to
try to figure out a clean way to set a paren floor like CURLYX does.)  I
didn't see any point in keeping the original REGCP_SET() call here.

I did consider whether "case BRANCHJ" might need any separate changes, but
"case BRANCHJ" falls through to "case BRANCH", so this change seems to
cover it.

Next, I used regcp_restore() to restore the saved captures in "case
BRANCH_next_fail":

@@ -8077,8 +8079,10 @@ NULL
                 do_cutgroup = 0;
                 no_final = 0;
             }
 -           REGCP_UNWIND(ST.cp);
 -            UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
 +
 +           /* restore state of captures from branch start */
 +           regcp_restore(rex, ST.lastcp, &maxopenparen);
 +
             scan = ST.next_branch;
             /* no more branches? */
             if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) {

From studying the existing code, regcp_restore() appeared to be the right
way to restore the saved captures without discarding them.  The original
REGCP_UNWIND() and UNWIND_PAREN() calls were clearly intended to undo the
captures from the failed alternation, but that approach seems to be
unworkable for these edge cases involving captures inside repeated
alternations.  I suspect UNWIND_PAREN() might always have such edge cases,
but I haven't investigated that hunch.

Finally, I mirrored both of these changes in "case TRIE" and "case
TRIE_next_fail", for trie-optimized alternations:

@@ -5920,6 +5920,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char
*startpos, regnode *prog)
  HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]);
                 U32 state = trie->startstate;

+ /* save state of captures at branch start */
+ ST.cp = regcppush(rex, 0, maxopenparen);
+ REGCP_SET(ST.lastcp);
+
                 if (scan->flags == EXACTL || scan->flags == EXACTFLU8) {
                     _CHECK_AND_WARN_PROBLEMATIC_LOCALE;
                     if (utf8_target
@@ -6067,15 +6071,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char
*startpos, regnode *prog)
  case TRIE_next_fail: /* we failed - try next alternative */
         {
             U8 *uc;
-            if ( ST.jump ) {
-                /* undo any captures done in the tail part of a branch,
-                 * e.g.
-                 *    /(?:X(.)(.)|Y(.)).../
-                 * where the trie just matches X then calls out to do the
-                 * rest of the branch */
-                REGCP_UNWIND(ST.cp);
-                UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
-     }
+
+     /* restore state of captures from branch start */
+     regcp_restore(rex, ST.lastcp, &maxopenparen);
+
      if (!--ST.accepted) {
          DEBUG_EXECUTE_r({
                     Perl_re_exec_indentf( aTHX_  "%sTRIE failed...%s\n",

I didn't analyze ST.jump as fully as regcppush() and friends, so I was a
bit unsure about removing this ST.jump code, but the regcp_restore() seemed
to make it redundant.  Since all the regression tests passed, I included
this change in the patch.  Was it correct to remove that code?

Come to think of it, perhaps I should have also removed this other ST.jump
block near the beginning of "case trie_first_try"?

            if ( ST.jump ) {
                ST.lastparen = rex->lastparen;
                ST.lastcloseparen = rex->lastcloseparen;
                REGCP_SET(ST.cp);
            }

Looking at it again, this seems redundant too, but I didn't think to try
removing it when I developed the initial patch.

I'm not sure why the TRIE case was only dealing with the captures when
ST.jump is set.  Is it possible that captures might need to be restored
when ST.jump isn't set?

The node just needs to store the capture index, while the runtime code
> in S_regmatch needs to keep track of the last seen WHILEM node, presumably
> in a local var (that may need saving and restoring).
>

That makes sense.  Mind if I take a stab at that, if I get a chance?


> With the current node types used by the various branch ops, there is
> no spare space to store even an 8-bit index for the TRIE/TRIEC nodes,
> since they make use of the 8-but flags field. (That field appears to be
> unused in BRANCH/BRANCHJ).
>

Yeah, I came to the same conclusions about the flags too.


> Which is why I asked Yves about the best way to extend such nodes.
> The naive way would be to use regnode_1 for BRANCH which
> adds an extra 32-bit field, and 'upgrade' BRANCHJ from regnode_1 to
> regnode_2L; But the TRIE nodes are more complex than that and I don't
> think I could safely add an extra 32-bit field without breaking things.
>

This sounds like a viable option.  Why do you call it naive?

The TRIE nodes are more complex, but is there a particular reason you're
wary of breaking things by adding a field to it?

If the flags can't be used, I'm thinking that new regnode types make sense
here.  That way, the regex compiler could trigger the save/restore
functionality only when needed, provide the correct paren floor and avoid
increasing the size of compiled regexes that don't need to restore captures
like this.  It would take more work to implement, of course, and use up a
few of the unused 8-bit regnode slots, but it seems like it might be the
most effective way to optimize this.  What do you and Yves think about this
idea?

Thanks again for taking the time to work on this bug.  (I hope I'm not
wasting too much of your time on the conversation!)

Deven

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About