develooper Front page | perl.perl5.porters | Postings from February 2018

Re: EXACTFish matching in regexec.c find_byclass() is a waste?

Thread Previous | Thread Next
From:
Karl Williamson
Date:
February 1, 2018 23:06
Subject:
Re: EXACTFish matching in regexec.c find_byclass() is a waste?
Message ID:
77a4f6dd-31fc-0f9a-0b54-940a460ccd6c@khwilliamson.com
On 02/01/2018 02:20 AM, demerphq wrote:
> On 31 January 2018 at 15:27, Karl Williamson <public@khwilliamson.com> wrote:
>> The way folded EXACTFish nodes appear to me to work currently is that when
>> we get to a position in the matching that we go through the string in the
>> node, folding the target string and comparing.  If this works we call
>> regtry() which calls regmatch() that, in the process of checking everything
>> matches, does the same thing again.
>>
>> Is there a reason to not just generate a synthetic start class which
>> contains the union of all possible first bytes the node could match? Then we
>> scan for one of those bytes, check that there's enough length left in the
>> target string to match, and then call regtry.  Thus we would call regtry
>> potentially more often, but we avoid reparsing and refolding the string.
>> And that seems like a big win to me.
> 
> I am guessing that the rationale is the startup cost of retry() is
> quite high. By hoisting the folding logic up to find_byclass we cut
> down the calls to regtry, and avoid that setup cost.
> 
> Consider a simple case:
> 
>    ("f" x 1_000_000)=~/foo[0-9]{5}blurb.*blooper/i
> 
> If I understand your proposal right I think we would call retry()
> 1_000_000 times, but with the existing code we wouldn't call it at
> all.
> 
> Maybe there is a way we can precompute a better structure, but I
> assume anything that increases the calls to regtry() in find_byclass
> is a pessimisation - but you would need to benchmark to be sure, the
> code is definitely structured around that assumption.
> 
> FWIW, it is not uncommon, or perhaps even standard to have retry()
> replicate some aspect of the matching that is handled by
> find_byclass(). The AHO-CORASICK code does for instance, if only
> because the thing that the startclass matches might not be the first
> opnode that needs to execute before executing the regnodes that are
> equivalent to the start class.
> 
> Consider something like this:
> 
> /((foo))/i
> 
> In terms of regnodes we need to go through two OPEN's before we
> execute the EXACTF. But the start class with be an EXACTF.
> 
> Avoiding the unnecessary duplicate operation is too tricky, and the
> wins from calling regtry() less often too great to care. (At least in
> theory and code structure.)
> 
Ok, I got it, thanks.

But then I have another question.  EXACTish nodes max out at around 127 
bytes, though there's room for 256.  It would seem that fully packing 
this could lead to fewer retry calls, and less overhead during normal 
matching.  Is there a reason not to fully pack?

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About