develooper Front page | perl.perl5.porters | Postings from October 2018

Re: [perl #133352] Ancient Regex Regression

Thread Previous | Thread Next
From:
Dave Mitchell
Date:
October 17, 2018 11:02
Subject:
Re: [perl #133352] Ancient Regex Regression
Message ID:
20181017110201.GZ3102@iabyn.com
On Tue, Oct 16, 2018 at 03:18:07PM -0400, Deven T. Corzine wrote:
> My initial patch was only focused on correctness, not performance.
> Stipulating that the performance was unacceptably slow, could you tell me
> if my original patch was clean and valid from a correctness perspective, or
> was I doing something wrong/ugly/unvalid in the patch?

Since nearly two months have passed since I worked on this, the details
have ebbed from my mind so I can't remember whether your approach was
merely slow or had issues.

> 
> I know a way to make it much faster, but it involves, for every
> > BRANCH/BRANCHJ/TRIE/TRIEC node, knowing the current capture index - i.e.
> > in the same way that OPEN nodes have 'n = ARG(scan);' but I don't know how
> > this can stored.
> >
> > Yves, is there type of node - or a way to extend the current nodes - such
> > that a 32-bit capture index can stored as part of each of these nodes
> > at compile time? Without breaking everything?
> >
> 
> I know there are only 8 bits to distinguish them, but perhaps this is a
> situation where adding new regnode types would be worthwhile?  Am I right
> in thinking that a 32-bit capture index and the WHILEM pointer could be
> stored that way?

The node just needs to store the capture index, while the runtime code
in S_regmatch needs to keep track of the last seen WHILEM node, presumably
in a local var (that may need saving and restoring).

> Alternatively, if "ARG(scan)" or another mechanism could be used to save
> even an 8-bit or 16-bit index value, it would probably cover most use
> cases, and the max value (255 or 65535) could be used by default if the
> index doesn't fit.  I'm sure the vast majority of regexes have 255 or fewer
> captures in the first place, and in the rare case where a regex has
> thousands of captures and the proper index won't fit, this would only cause
> an extra unnecessary performance for those rare cases, and the correctness
> of the result should not be affected.

With the current node types used by the various branch ops, there is
no spare space to store even an 8-bit index for the TRIE/TRIEC nodes,
since they make use of the 8-but flags field. (That field appears to be
unused in BRANCH/BRANCHJ).

Which is why I asked Yves about the best way to extend such nodes.
The naive way would be to use regnode_1 for BRANCH which
adds an extra 32-bit field, and 'upgrade' BRANCHJ from regnode_1 to
regnode_2L; But the TRIE nodes are more complex than that and I don't
think I could safely add an extra 32-bit field without breaking things.


-- 
"Strange women lying in ponds distributing swords is no basis for a system
of government. Supreme executive power derives from a mandate from the
masses, not from some farcical aquatic ceremony."
    -- Dennis, "Monty Python and the Holy Grail"

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About