develooper Front page | perl.perl5.porters | Postings from February 2018

Re: EXACTFish matching in regexec.c find_byclass() is a waste?

Thread Previous | Thread Next
From:
demerphq
Date:
February 1, 2018 09:20
Subject:
Re: EXACTFish matching in regexec.c find_byclass() is a waste?
Message ID:
CANgJU+WXxLNDTPH-jFx8R3v_e8t_ZdvAs7MT=Mxq1ruXZMOZCw@mail.gmail.com
On 31 January 2018 at 15:27, Karl Williamson <public@khwilliamson.com> wrote:
> The way folded EXACTFish nodes appear to me to work currently is that when
> we get to a position in the matching that we go through the string in the
> node, folding the target string and comparing.  If this works we call
> regtry() which calls regmatch() that, in the process of checking everything
> matches, does the same thing again.
>
> Is there a reason to not just generate a synthetic start class which
> contains the union of all possible first bytes the node could match? Then we
> scan for one of those bytes, check that there's enough length left in the
> target string to match, and then call regtry.  Thus we would call regtry
> potentially more often, but we avoid reparsing and refolding the string.
> And that seems like a big win to me.

I am guessing that the rationale is the startup cost of retry() is
quite high. By hoisting the folding logic up to find_byclass we cut
down the calls to regtry, and avoid that setup cost.

Consider a simple case:

  ("f" x 1_000_000)=~/foo[0-9]{5}blurb.*blooper/i

If I understand your proposal right I think we would call retry()
1_000_000 times, but with the existing code we wouldn't call it at
all.

Maybe there is a way we can precompute a better structure, but I
assume anything that increases the calls to regtry() in find_byclass
is a pessimisation - but you would need to benchmark to be sure, the
code is definitely structured around that assumption.

FWIW, it is not uncommon, or perhaps even standard to have retry()
replicate some aspect of the matching that is handled by
find_byclass(). The AHO-CORASICK code does for instance, if only
because the thing that the startclass matches might not be the first
opnode that needs to execute before executing the regnodes that are
equivalent to the start class.

Consider something like this:

/((foo))/i

In terms of regnodes we need to go through two OPEN's before we
execute the EXACTF. But the start class with be an EXACTF.

Avoiding the unnecessary duplicate operation is too tricky, and the
wins from calling regtry() less often too great to care. (At least in
theory and code structure.)

cheers,
Yves





-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About