Front page | perl.perl5.porters |
Postings from February 2018
Re: EXACTFish matching in regexec.c find_byclass() is a waste?
Thread Previous
|
Thread Next
From:
demerphq
Date:
February 2, 2018 09:05
Subject:
Re: EXACTFish matching in regexec.c find_byclass() is a waste?
Message ID:
CANgJU+VCJhFZr5UtXpNVrNM0935ki1u_=1Lg2DBfq0-6Zw4EiQ@mail.gmail.com
On 2 February 2018 at 05:45, Karl Williamson <public@khwilliamson.com> wrote:
> On 02/01/2018 02:20 AM, demerphq wrote:
>>
>> On 31 January 2018 at 15:27, Karl Williamson <public@khwilliamson.com>
>> wrote:
>>>
>>> The way folded EXACTFish nodes appear to me to work currently is that
>>> when
>>> we get to a position in the matching that we go through the string in the
>>> node, folding the target string and comparing. If this works we call
>>> regtry() which calls regmatch() that, in the process of checking
>>> everything
>>> matches, does the same thing again.
>>>
>>> Is there a reason to not just generate a synthetic start class which
>>> contains the union of all possible first bytes the node could match? Then
>>> we
>>> scan for one of those bytes, check that there's enough length left in the
>>> target string to match, and then call regtry. Thus we would call regtry
>>> potentially more often, but we avoid reparsing and refolding the string.
>>> And that seems like a big win to me.
>>
>>
>> I am guessing that the rationale is the startup cost of retry() is
>> quite high. By hoisting the folding logic up to find_byclass we cut
>> down the calls to regtry, and avoid that setup cost.
>>
>> Consider a simple case:
>>
>> ("f" x 1_000_000)=~/foo[0-9]{5}blurb.*blooper/i
>>
>> If I understand your proposal right I think we would call retry()
>> 1_000_000 times, but with the existing code we wouldn't call it at
>> all.
>>
>> Maybe there is a way we can precompute a better structure, but I
>> assume anything that increases the calls to regtry() in find_byclass
>> is a pessimisation - but you would need to benchmark to be sure, the
>> code is definitely structured around that assumption.
>>
>> FWIW, it is not uncommon, or perhaps even standard to have retry()
>> replicate some aspect of the matching that is handled by
>> find_byclass(). The AHO-CORASICK code does for instance, if only
>> because the thing that the startclass matches might not be the first
>> opnode that needs to execute before executing the regnodes that are
>> equivalent to the start class.
>>
>> Consider something like this:
>>
>> /((foo))/i
>>
>> In terms of regnodes we need to go through two OPEN's before we
>> execute the EXACTF. But the start class with be an EXACTF.
>>
>> Avoiding the unnecessary duplicate operation is too tricky, and the
>> wins from calling regtry() less often too great to care. (At least in
>> theory and code structure.)
>>
>
> Based on this response, I pushed the following commit. I'll look at what
> can be done for UTF-8 some time in the future. Others should feel free to
> look sooner.
>
> From: Karl Williamson <khw@cpan.org>
> Date: Thu, 1 Feb 2018 20:49:03 -0700
> Subject: Speed up finding non-UTF8 EXACTFish initial matches
>
> find_byclass() is used to scan through a target string looking for
> something that matches a particular class. This commit speeds that up
> for patterns of the /foo/i type, where neither the target string nor
> the pattern are UTF-8.
>
> More precisely, it speeds up only finding the first byte of 'foo'
> in the string. The actual matching speed remains the same, once that
> initial character that is a potential match is found.
>
> But finding that first character is sped up immensely by this commit.
>
> It does this by using memchr() when the character is caseless. For
> example in the pattern /:abcde/i, the colon is first, and is caseless.
> On my system memchr is extremely fast, so the numbers below for this
> case may not be as good on other systems.
>
> And when the first character is cased, such as in /abcde/i, it uses the
> techniques added in 2813d4adc971fbaa124b5322d4bccaa73e9df8e2 for the
> ANYOFM regnode. In both ASCII and EBCDIC machines, the case folds of
> the cased letters are almost all of the form that these techniques work
> on. There are no tests in our current test suite that don't have this
> form. However, /il (locale) matching may very well violate this, and
> will use the per-byte scheme that has been in effect until this commit.
>
> The numbers below are for finding the first letter after a long string
> that doesn't include that character. Doing this isolates the speed up
> attributable to this commit from ovehead.
>
> The only downsides of this commit are that on some systems memchr() may
> introduce function call overhead that won't pay off if the next
> occurrence of the character is close by; and in the other case, a single
> extra conditional is required to determine if there is at least a word's
> worth of data to look at, plus some masking, shifting, and arithmetic
> instructions associated with that conditional. A static function is
> called, so there may or may not be function call overhead, depending on
> the compiler optimizer.
>
> Key:
> Ir Instruction read
> Dr Data read
> Dw Data write
> COND conditional branches
> IND indirect branches
>
> The numbers represent raw counts per loop iteration.
>
> caseless first letter
> ('b' x 10000) . ':' =~ /:a/i
>
> blead fast Ratio %
> -------- ------- -------
> Ir 72109.0 4819.0 1496.3
> Dr 20608.0 1237.5 1665.3
> Dw 10409.0 409.5 2541.9
> COND 20376.0 702.0 2902.6
> IND 15.0 16.0 93.8
>
> cased first letter
> ('b' x 10000) . 'a' =~ /A/i
>
> blead fast Ratio %
> -------- ------- -------
> Ir 103074.0 25704.6 401.0
> Dr 20896.5 2164.9 965.2
> Dw 10587.5 601.9 1759.0
> COND 30516.0 3036.2 1005.1
> IND 22.0 22.0 100.0
>
Nice. :-) ++karl.
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
Thread Previous
|
Thread Next