develooper Front page | perl.perl5.porters | Postings from February 2018

Re: EXACTFish matching in regexec.c find_byclass() is a waste?

Thread Previous
From:
demerphq
Date:
February 2, 2018 09:03
Subject:
Re: EXACTFish matching in regexec.c find_byclass() is a waste?
Message ID:
CANgJU+XYUgoRSGyDEgx-nmf_c9CWQw2_T1WSHKw1v5LMArA9Bw@mail.gmail.com
On 2 February 2018 at 00:05, Karl Williamson <public@khwilliamson.com> wrote:
> On 02/01/2018 02:20 AM, demerphq wrote:
>>
>> On 31 January 2018 at 15:27, Karl Williamson <public@khwilliamson.com>
>> wrote:
>>>
>>> The way folded EXACTFish nodes appear to me to work currently is that
>>> when
>>> we get to a position in the matching that we go through the string in the
>>> node, folding the target string and comparing.  If this works we call
>>> regtry() which calls regmatch() that, in the process of checking
>>> everything
>>> matches, does the same thing again.
>>>
>>> Is there a reason to not just generate a synthetic start class which
>>> contains the union of all possible first bytes the node could match? Then
>>> we
>>> scan for one of those bytes, check that there's enough length left in the
>>> target string to match, and then call regtry.  Thus we would call regtry
>>> potentially more often, but we avoid reparsing and refolding the string.
>>> And that seems like a big win to me.
>>
>>
>> I am guessing that the rationale is the startup cost of retry() is
>> quite high. By hoisting the folding logic up to find_byclass we cut
>> down the calls to regtry, and avoid that setup cost.
>>
>> Consider a simple case:
>>
>>    ("f" x 1_000_000)=~/foo[0-9]{5}blurb.*blooper/i
>>
>> If I understand your proposal right I think we would call retry()
>> 1_000_000 times, but with the existing code we wouldn't call it at
>> all.
>>
>> Maybe there is a way we can precompute a better structure, but I
>> assume anything that increases the calls to regtry() in find_byclass
>> is a pessimisation - but you would need to benchmark to be sure, the
>> code is definitely structured around that assumption.
>>
>> FWIW, it is not uncommon, or perhaps even standard to have retry()
>> replicate some aspect of the matching that is handled by
>> find_byclass(). The AHO-CORASICK code does for instance, if only
>> because the thing that the startclass matches might not be the first
>> opnode that needs to execute before executing the regnodes that are
>> equivalent to the start class.
>>
>> Consider something like this:
>>
>> /((foo))/i
>>
>> In terms of regnodes we need to go through two OPEN's before we
>> execute the EXACTF. But the start class with be an EXACTF.
>>
>> Avoiding the unnecessary duplicate operation is too tricky, and the
>> wins from calling regtry() less often too great to care. (At least in
>> theory and code structure.)
>>
> Ok, I got it, thanks.
>
> But then I have another question.  EXACTish nodes max out at around 127
> bytes, though there's room for 256.  It would seem that fully packing this
> could lead to fewer retry calls, and less overhead during normal matching.
> Is there a reason not to fully pack?

I tried it once and it blew up in my face and I lost interest. :-)

I don't see any reason for you not to try it again if you think it is
doable. I very well may have broken something when I attempted it.

Yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About