develooper Front page | perl.perl5.porters | Postings from February 2018

Re: EXACTFish matching in regexec.c find_byclass() is a waste?

Thread Previous | Thread Next
February 2, 2018 09:05
Re: EXACTFish matching in regexec.c find_byclass() is a waste?
Message ID:
On 2 February 2018 at 05:45, Karl Williamson <> wrote:
> On 02/01/2018 02:20 AM, demerphq wrote:
>> On 31 January 2018 at 15:27, Karl Williamson <>
>> wrote:
>>> The way folded EXACTFish nodes appear to me to work currently is that
>>> when
>>> we get to a position in the matching that we go through the string in the
>>> node, folding the target string and comparing.  If this works we call
>>> regtry() which calls regmatch() that, in the process of checking
>>> everything
>>> matches, does the same thing again.
>>> Is there a reason to not just generate a synthetic start class which
>>> contains the union of all possible first bytes the node could match? Then
>>> we
>>> scan for one of those bytes, check that there's enough length left in the
>>> target string to match, and then call regtry.  Thus we would call regtry
>>> potentially more often, but we avoid reparsing and refolding the string.
>>> And that seems like a big win to me.
>> I am guessing that the rationale is the startup cost of retry() is
>> quite high. By hoisting the folding logic up to find_byclass we cut
>> down the calls to regtry, and avoid that setup cost.
>> Consider a simple case:
>>    ("f" x 1_000_000)=~/foo[0-9]{5}blurb.*blooper/i
>> If I understand your proposal right I think we would call retry()
>> 1_000_000 times, but with the existing code we wouldn't call it at
>> all.
>> Maybe there is a way we can precompute a better structure, but I
>> assume anything that increases the calls to regtry() in find_byclass
>> is a pessimisation - but you would need to benchmark to be sure, the
>> code is definitely structured around that assumption.
>> FWIW, it is not uncommon, or perhaps even standard to have retry()
>> replicate some aspect of the matching that is handled by
>> find_byclass(). The AHO-CORASICK code does for instance, if only
>> because the thing that the startclass matches might not be the first
>> opnode that needs to execute before executing the regnodes that are
>> equivalent to the start class.
>> Consider something like this:
>> /((foo))/i
>> In terms of regnodes we need to go through two OPEN's before we
>> execute the EXACTF. But the start class with be an EXACTF.
>> Avoiding the unnecessary duplicate operation is too tricky, and the
>> wins from calling regtry() less often too great to care. (At least in
>> theory and code structure.)
> Based on this response, I pushed the following commit.  I'll look at what
> can be done for UTF-8 some time in the future.  Others should feel free to
> look sooner.
> From: Karl Williamson <>
> Date: Thu, 1 Feb 2018 20:49:03 -0700
> Subject: Speed up finding non-UTF8 EXACTFish initial matches
> find_byclass() is used to scan through a target string looking for
> something that matches a particular class.  This commit speeds that up
> for patterns of the /foo/i type, where neither the target string nor
> the pattern are UTF-8.
> More precisely, it speeds up only finding the first byte of 'foo'
> in the string.  The actual matching speed remains the same, once that
> initial character that is a potential match is found.
> But finding that first character is sped up immensely by this commit.
> It does this by using memchr() when the character is caseless.  For
> example in the pattern /:abcde/i, the colon is first, and is caseless.
> On my system memchr is extremely fast, so the numbers below for this
> case may not be as good on other systems.
> And when the first character is cased, such as in /abcde/i, it uses the
> techniques added in 2813d4adc971fbaa124b5322d4bccaa73e9df8e2 for the
> ANYOFM regnode.  In both ASCII and EBCDIC machines, the case folds of
> the cased letters are almost all of the form that these techniques work
> on.  There are no tests in our current test suite that don't have this
> form.  However, /il (locale) matching may very well violate this, and
> will use the per-byte scheme that has been in effect until this commit.
> The numbers below are for finding the first letter after a long string
> that doesn't include that character.  Doing this isolates the speed up
> attributable to this commit from ovehead.
> The only downsides of this commit are that on some systems memchr() may
> introduce function call overhead that won't pay off if the next
> occurrence of the character is close by; and in the other case, a single
> extra conditional is required to determine if there is at least a word's
> worth of data to look at, plus some masking, shifting, and arithmetic
> instructions associated with that conditional.  A static function is
> called, so there may or may not be function call overhead, depending on
> the compiler optimizer.
> Key:
>     Ir   Instruction read
>     Dr   Data read
>     Dw   Data write
>     COND conditional branches
>     IND  indirect branches
> The numbers represent raw counts per loop iteration.
> caseless first letter
> ('b' x 10000) . ':' =~ /:a/i
>           blead    fast Ratio %
>        -------- ------- -------
>     Ir  72109.0  4819.0  1496.3
>     Dr  20608.0  1237.5  1665.3
>     Dw  10409.0   409.5  2541.9
>   COND  20376.0   702.0  2902.6
>    IND     15.0    16.0    93.8
> cased first letter
> ('b' x 10000) . 'a' =~ /A/i
>           blead    fast Ratio %
>        -------- ------- -------
>     Ir 103074.0 25704.6   401.0
>     Dr  20896.5  2164.9   965.2
>     Dw  10587.5   601.9  1759.0
>   COND  30516.0  3036.2  1005.1
>    IND     22.0 22.0 100.0

Nice. :-) ++karl.


perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About