develooper Front page | perl.perl5.porters | Postings from August 2018

Re: [perl #133352] Ancient Regex Regression

Thread Previous
Dave Mitchell
August 28, 2018 14:47
Re: [perl #133352] Ancient Regex Regression
Message ID:
On Sun, Aug 12, 2018 at 09:39:16PM -0400, Deven T. Corzine wrote:
> It's been quiet for a couple weeks.  Has anyone had a chance to take a
> look at the patch yet?

(Yves, I'e CCed you as there's a question for you further down this

Sorry for the late reply. I took the opportunity to try to properly
understand this issue (you may remember me mentioning at the beginning how
"I hate and fear the capture save and restore code in regmatch()").

I think I now have a much better handle on this. I've just pushed a branch
which contains a proposed fix, along with lots of tests and benchmark
entries. It avoids saving and restoring capture indices unless the
alternation is wrapped in a repeat, and only saves indices back to the
start of the repeat. This is less inefficient than your original
suggestion, but it still makes alternations with captures within a repeat
run at about 50% of their former speed.

I know a way to make it much faster, but it involves, for every
BRANCH/BRANCHJ/TRIE/TRIEC node, knowing the current capture index - i.e.
in the same way that OPEN nodes have 'n = ARG(scan);' but I don't know how
this can stored.

Yves, is there type of node - or a way to extend the current nodes - such
that a 32-bit capture index can stored as part of each of these nodes
at compile time? Without breaking everything?

Anyway, I've pushed my current proposed fix as smoke-me/davem/captures,
and here's the commit message:

commit d0f9ad5d5ba5300a04bba7be2ea8cded41e95666
Author:     David Mitchell <>
AuthorDate: Tue Aug 28 09:15:04 2018 +0100
Commit:     David Mitchell <>
CommitDate: Tue Aug 28 15:32:46 2018 +0100

    regex: restore failed branch capture within repeat
    RT #133352
    There are two competing behaviours for what to do with any captures
    within a failed alternation branch.
    Normally the regex engine invalidates any captures created during
    execution of the failed branch. For example in something like
        /(A)  (?: (B)(C) | (D)(E) ) /x
    perl remembers the current highest capture index at the start of each
    branch attempt (1 in this example), and if the branch fails it
    invalidates all captures above this level (i.e. 2..3 or 2..5 depending
    on which branch failed) before trying the next branch (if any).
    However for repeats, such as (...)*, on each new iteration any captures
    inside the repeat initially maintain their value from the previous
    iteration, until updated within the current iteration. For example in
        "ABCD" =~ /(?: (.)(.) )*/x
    At the start of the first iteration, \1 and \2 are invalid.
    At the start of the second iteration, \1 and \2 are 'A', 'B'.
    During the course of the second iteration, \1 changes to 'C', then
    \2 changes to 'D'.
    This causes a problem when there's an alternation within a repeat.
    Taking the first example above and adding a repeat:
        /(A)  (?: (B)(C) | (D)(E) )* /x
    Now on the second iteration, on entry to the branch the current highest
    capture index is no longer 1 (i.e. A), but is instead 3 or 5 depending
    on what was captured on the previous iteration. Thus on branch failure,
    the 'disable everything above the previous floor' technique no longer
    works: rather than disabling (1+1)..5 say, it disables (5+1)..5 - i.e.
    failed captures are left as-is. This is the essence of the bug in this
    It is not clear what the correct behaviour should be - arguably failed
    branch captures could be either invalidated or restored to their value
    at the start of the branch. Invalidating is probably cheaper. However,
    there are a couple of tests in t/te/re_tests added many years ago by
    Ilya which assume the latter behaviour (they pass because the pattern of
    branches and backtracking happen to leave $1 with the correct value;
    a slight alteration to the tests and they would have the wrong value).
    But in any case the tests aren't expecting $1 to be undefined.
    This commit fixes the bug by, at the start of any branch, saving any
    captures beyond the start of the innermost enclosing repeat, but only if
    it *is* within a repeat, and if there are any valid captures beyond the
    start of that repeat. These are restored on branch failure. Otherwise
    it just invalidates back to the highest capture index at the start of
    the branch, as before.
    So for example this is unchanged, just invalidating indices 2+ on each
    branch failure:
        /(A)  (?: (B)(C) | (D)(E) ) /x
    While this:
        "ABCD" =~ /(?: ([AC]) ([BD]) )*/x
    does invalidation on the first iteration, and full saving/restore of
    capture indices 2+ on subsequent iterations, while this:
        "-AB-CD" =~ /(?: (-) ([AC]) ([BD]) )*/x
    saves and restores on *every* iteration, since the (-) is above the
    floor of the curlyx, so at the start of each branch, lastparen is always
    above cur_curlyx->lastparen.
    In terms of performance, the benchmarks included with this commit show
    that a plain branch is unaffected; a plain branch with captures is a few
    % slower, and a branch within a repeat is now about half the speed.
    This fix could be made a lot more efficient if the current paren index
    was stored in every BRANCH/BRANCHJ/TRIE/TRIEC node, and if a pointer to
    the current WHILEM node was maintained. Since every iteration of a
    CURLYX/WHILEM saves all paren indices above the curlyx floor anyway, it
    wouldn't be necessary to save them again for each branch; just on branch
    failure, restore all saved indices from that last WHILEM which are above
    the branch floor,

The warp engines start playing up a bit, but seem to sort themselves out
after a while without any intervention from boy genius Wesley Crusher.
    -- Things That Never Happen in "Star Trek" #17

Thread Previous Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About