develooper Front page | perl.perl5.porters | Postings from October 2018

Re: [perl #133352] Ancient Regex Regression

Thread Previous | Thread Next
From:
demerphq
Date:
October 20, 2018 14:04
Subject:
Re: [perl #133352] Ancient Regex Regression
Message ID:
CANgJU+VASodve7rCq3_XJUTwnXUfF=BGKrwSCve6xJQfDyZb0g@mail.gmail.com
I'm sorry I have missed some of this discussion over the past while...

On Wed, 17 Oct 2018, 16:29 Deven T. Corzine, <deven@ties.org> wrote:

> On Wed, Oct 17, 2018 at 7:02 AM Dave Mitchell <davem@iabyn.com> wrote:
>
>> On Tue, Oct 16, 2018 at 03:18:07PM -0400, Deven T. Corzine wrote:
>> > My initial patch was only focused on correctness, not performance.
>> > Stipulating that the performance was unacceptably slow, could you tell
>> me
>> > if my original patch was clean and valid from a correctness
>> perspective, or
>> > was I doing something wrong/ugly/unvalid in the patch?
>>
>> Since nearly two months have passed since I worked on this, the details
>> have ebbed from my mind so I can't remember whether your approach was
>> merely slow or had issues.
>>
>
> I had already spent probably 20 hours or more tracking down this bug while
> I was at YAPC.  After leaving the conference, I spent at least 4-6 more
> hours very carefully studying and tracing the code, including running test
> cases with debugging output many times, adding extra debugging output of my
> own to trace key variables through all the state changes during a regex
> match of my test case.
>
> After all that research and testing, I felt confident that I finally
> understood regcppush(), REGCP_SET(), REGCP_UNWIND(), UNWIND_PARENT() and
> regcp_restore() pretty well, so I started developing this patch.  I spent
> about an hour writing the patch and 2-3 hours debugging it, because I
> initially overlooked something simple that should have been obvious.  I
> forget what it was, but it was something dumb that I should have noticed
> right away when the regression tests failed -- when I finally figured out
> what I did wrong, I fixed the code to be what I actually _thought_ that I
> had typed in the first place!  (That's why it took me hours to notice it.)
>
> Ultimately, the changes were fairly straightforward, considering.  I'll
> describe the changes I made, in chronological order.  (The hunks in the
> patch are in a different order.)
>
> First, I added "CHECKPOINT lastcp;" to the "regmatch_state,u" union in the
> "branchlike", "branch" and "trie" structs, just after "CHECKPOINT cp;".
>
> Next, I used regcppush() to save the captures near the beginning of "case
> BRANCH":
>
> @@ -8043,7 +8042,10 @@ NULL
>       ST.lastparen = rex->lastparen;
>       ST.lastcloseparen = rex->lastcloseparen;
>       ST.next_branch = next;
> -     REGCP_SET(ST.cp);
> +
> +     /* save state of captures at branch start */
> +     ST.cp = regcppush(rex, 0, maxopenparen);
> +     REGCP_SET(ST.lastcp);
>
>       /* Now go into the branch */
>       if (has_cutgroup) {
>
> From looking at the existing code, this appeared the right way to save all
> the captures.  I realized that saving ALL the captures was overkill and
> very slow, of course.  I was initially only attempting to get the correct
> captures back from my test case without breaking any regression tests, and
> this patch did succeed at doing that.  (I was still studying the code to
> try to figure out a clean way to set a paren floor like CURLYX does.)  I
> didn't see any point in keeping the original REGCP_SET() call here.
>
> I did consider whether "case BRANCHJ" might need any separate changes, but
> "case BRANCHJ" falls through to "case BRANCH", so this change seems to
> cover it.
>
> Next, I used regcp_restore() to restore the saved captures in "case
> BRANCH_next_fail":
>
> @@ -8077,8 +8079,10 @@ NULL
>                  do_cutgroup = 0;
>                  no_final = 0;
>              }
>  -           REGCP_UNWIND(ST.cp);
>  -            UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
>  +
>  +           /* restore state of captures from branch start */
>  +           regcp_restore(rex, ST.lastcp, &maxopenparen);
>  +
>              scan = ST.next_branch;
>              /* no more branches? */
>              if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) {
>
> From studying the existing code, regcp_restore() appeared to be the right
> way to restore the saved captures without discarding them.  The original
> REGCP_UNWIND() and UNWIND_PAREN() calls were clearly intended to undo the
> captures from the failed alternation, but that approach seems to be
> unworkable for these edge cases involving captures inside repeated
> alternations.  I suspect UNWIND_PAREN() might always have such edge cases,
> but I haven't investigated that hunch.
>
> Finally, I mirrored both of these changes in "case TRIE" and "case
> TRIE_next_fail", for trie-optimized alternations:
>
> @@ -5920,6 +5920,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char
> *startpos, regnode *prog)
>   HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]);
>                  U32 state = trie->startstate;
>
> + /* save state of captures at branch start */
> + ST.cp = regcppush(rex, 0, maxopenparen);
> + REGCP_SET(ST.lastcp);
> +
>                  if (scan->flags == EXACTL || scan->flags == EXACTFLU8) {
>                      _CHECK_AND_WARN_PROBLEMATIC_LOCALE;
>                      if (utf8_target
> @@ -6067,15 +6071,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char
> *startpos, regnode *prog)
>   case TRIE_next_fail: /* we failed - try next alternative */
>          {
>              U8 *uc;
> -            if ( ST.jump ) {
> -                /* undo any captures done in the tail part of a branch,
> -                 * e.g.
> -                 *    /(?:X(.)(.)|Y(.)).../
> -                 * where the trie just matches X then calls out to do the
> -                 * rest of the branch */
> -                REGCP_UNWIND(ST.cp);
> -                UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
> -     }
> +
> +     /* restore state of captures from branch start */
> +     regcp_restore(rex, ST.lastcp, &maxopenparen);
> +
>       if (!--ST.accepted) {
>           DEBUG_EXECUTE_r({
>                      Perl_re_exec_indentf( aTHX_  "%sTRIE failed...%s\n",
>
> I didn't analyze ST.jump as fully as regcppush() and friends, so I was a
> bit unsure about removing this ST.jump code, but the regcp_restore() seemed
> to make it redundant.  Since all the regression tests passed, I included
> this change in the patch.  Was it correct to remove that code?
>
> Come to think of it, perhaps I should have also removed this other ST.jump
> block near the beginning of "case trie_first_try"?
>
>             if ( ST.jump ) {
>                 ST.lastparen = rex->lastparen;
>                 ST.lastcloseparen = rex->lastcloseparen;
>                 REGCP_SET(ST.cp);
>             }
>
> Looking at it again, this seems redundant too, but I didn't think to try
> removing it when I developed the initial patch.
>
> I'm not sure why the TRIE case was only dealing with the captures when
> ST.jump is set.  Is it possible that captures might need to be restored
> when ST.jump isn't set?
>

A non jump trie node is a Boolean test, meaning it shouldn't touch any
capture buffers (except maybe if it's successful?) Whereas a jump trie is a
Boolean test followed by arbitrary pattern, so it might change the capture
buffer state.

Consider

/(Foo(baz)|wh(oops))(zop)/

Versus

/(Foobaz|whoops)(zop)/

The first should be a jump trie, the second a non-jump true.






> The node just needs to store the capture index, while the runtime code
>> in S_regmatch needs to keep track of the last seen WHILEM node, presumably
>> in a local var (that may need saving and restoring).
>>
>
> That makes sense.  Mind if I take a stab at that, if I get a chance?
>
>
>> With the current node types used by the various branch ops, there is
>> no spare space to store even an 8-bit index for the TRIE/TRIEC nodes,
>> since they make use of the 8-but flags field. (That field appears to be
>> unused in BRANCH/BRANCHJ).
>>
>
> Yeah, I came to the same conclusions about the flags too.
>
>
>> Which is why I asked Yves about the best way to extend such nodes.
>> The naive way would be to use regnode_1 for BRANCH which
>> adds an extra 32-bit field, and 'upgrade' BRANCHJ from regnode_1 to
>> regnode_2L; But the TRIE nodes are more complex than that and I don't
>> think I could safely add an extra 32-bit field without breaking things.
>>
>
> This sounds like a viable option.  Why do you call it naive?
>
> The TRIE nodes are more complex, but is there a particular reason you're
> wary of breaking things by adding a field to it?
>

I imagine because it is a product of peephole optimisation and thus is
sometimes written directly over the data it replaces. Trie nodes actually
have several forms and imagine changing the size could be troublesome but
it should be possible.



> If the flags can't be used, I'm thinking that new regnode types make sense
> here.  That way, the regex compiler could trigger the save/restore
> functionality only when needed, provide the correct paren floor and avoid
> increasing the size of compiled regexes that don't need to restore captures
> like this.  It would take more work to implement, of course, and use up a
> few of the unused 8-bit regnode slots, but it seems like it might be the
> most effective way to optimize this.  What do you and Yves think about this
> idea?
>

I will try to find time to think about this but make no promises.


Cheers
Yves


> Thanks again for taking the time to work on this bug.  (I hope I'm not
> wasting too much of your time on the conversation!)
>
> Deven
>

Its not a waste of time at all,, it's just hard to find time to complement
your efforts

>
>

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About