develooper Front page | perl.perl5.porters | Postings from February 2018

Re: EXACTFish matching in regexec.c find_byclass() is a waste?

Thread Previous
February 2, 2018 09:03
Re: EXACTFish matching in regexec.c find_byclass() is a waste?
Message ID:
On 2 February 2018 at 00:05, Karl Williamson <> wrote:
> On 02/01/2018 02:20 AM, demerphq wrote:
>> On 31 January 2018 at 15:27, Karl Williamson <>
>> wrote:
>>> The way folded EXACTFish nodes appear to me to work currently is that
>>> when
>>> we get to a position in the matching that we go through the string in the
>>> node, folding the target string and comparing.  If this works we call
>>> regtry() which calls regmatch() that, in the process of checking
>>> everything
>>> matches, does the same thing again.
>>> Is there a reason to not just generate a synthetic start class which
>>> contains the union of all possible first bytes the node could match? Then
>>> we
>>> scan for one of those bytes, check that there's enough length left in the
>>> target string to match, and then call regtry.  Thus we would call regtry
>>> potentially more often, but we avoid reparsing and refolding the string.
>>> And that seems like a big win to me.
>> I am guessing that the rationale is the startup cost of retry() is
>> quite high. By hoisting the folding logic up to find_byclass we cut
>> down the calls to regtry, and avoid that setup cost.
>> Consider a simple case:
>>    ("f" x 1_000_000)=~/foo[0-9]{5}blurb.*blooper/i
>> If I understand your proposal right I think we would call retry()
>> 1_000_000 times, but with the existing code we wouldn't call it at
>> all.
>> Maybe there is a way we can precompute a better structure, but I
>> assume anything that increases the calls to regtry() in find_byclass
>> is a pessimisation - but you would need to benchmark to be sure, the
>> code is definitely structured around that assumption.
>> FWIW, it is not uncommon, or perhaps even standard to have retry()
>> replicate some aspect of the matching that is handled by
>> find_byclass(). The AHO-CORASICK code does for instance, if only
>> because the thing that the startclass matches might not be the first
>> opnode that needs to execute before executing the regnodes that are
>> equivalent to the start class.
>> Consider something like this:
>> /((foo))/i
>> In terms of regnodes we need to go through two OPEN's before we
>> execute the EXACTF. But the start class with be an EXACTF.
>> Avoiding the unnecessary duplicate operation is too tricky, and the
>> wins from calling regtry() less often too great to care. (At least in
>> theory and code structure.)
> Ok, I got it, thanks.
> But then I have another question.  EXACTish nodes max out at around 127
> bytes, though there's room for 256.  It would seem that fully packing this
> could lead to fewer retry calls, and less overhead during normal matching.
> Is there a reason not to fully pack?

I tried it once and it blew up in my face and I lost interest. :-)

I don't see any reason for you not to try it again if you think it is
doable. I very well may have broken something when I attempted it.


perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About