develooper Front page | perl.perl5.porters | Postings from February 2018

Re: EXACTFish matching in regexec.c find_byclass() is a waste?

Thread Previous | Thread Next
From:
Karl Williamson
Date:
February 2, 2018 04:45
Subject:
Re: EXACTFish matching in regexec.c find_byclass() is a waste?
Message ID:
c553775a-39df-2d98-78e8-fc0feb7adb93@khwilliamson.com
On 02/01/2018 02:20 AM, demerphq wrote:
> On 31 January 2018 at 15:27, Karl Williamson <public@khwilliamson.com> wrote:
>> The way folded EXACTFish nodes appear to me to work currently is that when
>> we get to a position in the matching that we go through the string in the
>> node, folding the target string and comparing.  If this works we call
>> regtry() which calls regmatch() that, in the process of checking everything
>> matches, does the same thing again.
>>
>> Is there a reason to not just generate a synthetic start class which
>> contains the union of all possible first bytes the node could match? Then we
>> scan for one of those bytes, check that there's enough length left in the
>> target string to match, and then call regtry.  Thus we would call regtry
>> potentially more often, but we avoid reparsing and refolding the string.
>> And that seems like a big win to me.
> 
> I am guessing that the rationale is the startup cost of retry() is
> quite high. By hoisting the folding logic up to find_byclass we cut
> down the calls to regtry, and avoid that setup cost.
> 
> Consider a simple case:
> 
>    ("f" x 1_000_000)=~/foo[0-9]{5}blurb.*blooper/i
> 
> If I understand your proposal right I think we would call retry()
> 1_000_000 times, but with the existing code we wouldn't call it at
> all.
> 
> Maybe there is a way we can precompute a better structure, but I
> assume anything that increases the calls to regtry() in find_byclass
> is a pessimisation - but you would need to benchmark to be sure, the
> code is definitely structured around that assumption.
> 
> FWIW, it is not uncommon, or perhaps even standard to have retry()
> replicate some aspect of the matching that is handled by
> find_byclass(). The AHO-CORASICK code does for instance, if only
> because the thing that the startclass matches might not be the first
> opnode that needs to execute before executing the regnodes that are
> equivalent to the start class.
> 
> Consider something like this:
> 
> /((foo))/i
> 
> In terms of regnodes we need to go through two OPEN's before we
> execute the EXACTF. But the start class with be an EXACTF.
> 
> Avoiding the unnecessary duplicate operation is too tricky, and the
> wins from calling regtry() less often too great to care. (At least in
> theory and code structure.)
> 

Based on this response, I pushed the following commit.  I'll look at 
what can be done for UTF-8 some time in the future.  Others should feel 
free to look sooner.

From: Karl Williamson <khw@cpan.org>
Date: Thu, 1 Feb 2018 20:49:03 -0700
Subject: Speed up finding non-UTF8 EXACTFish initial matches

find_byclass() is used to scan through a target string looking for
something that matches a particular class.  This commit speeds that up
for patterns of the /foo/i type, where neither the target string nor
the pattern are UTF-8.

More precisely, it speeds up only finding the first byte of 'foo'
in the string.  The actual matching speed remains the same, once that
initial character that is a potential match is found.

But finding that first character is sped up immensely by this commit.

It does this by using memchr() when the character is caseless.  For
example in the pattern /:abcde/i, the colon is first, and is caseless.
On my system memchr is extremely fast, so the numbers below for this
case may not be as good on other systems.

And when the first character is cased, such as in /abcde/i, it uses the
techniques added in 2813d4adc971fbaa124b5322d4bccaa73e9df8e2 for the
ANYOFM regnode.  In both ASCII and EBCDIC machines, the case folds of
the cased letters are almost all of the form that these techniques work
on.  There are no tests in our current test suite that don't have this
form.  However, /il (locale) matching may very well violate this, and
will use the per-byte scheme that has been in effect until this commit.

The numbers below are for finding the first letter after a long string
that doesn't include that character.  Doing this isolates the speed up
attributable to this commit from ovehead.

The only downsides of this commit are that on some systems memchr() may
introduce function call overhead that won't pay off if the next
occurrence of the character is close by; and in the other case, a single
extra conditional is required to determine if there is at least a word's
worth of data to look at, plus some masking, shifting, and arithmetic
instructions associated with that conditional.  A static function is
called, so there may or may not be function call overhead, depending on
the compiler optimizer.

Key:
     Ir   Instruction read
     Dr   Data read
     Dw   Data write
     COND conditional branches
     IND  indirect branches

The numbers represent raw counts per loop iteration.

caseless first letter
('b' x 10000) . ':' =~ /:a/i

           blead    fast Ratio %
        -------- ------- -------
     Ir  72109.0  4819.0  1496.3
     Dr  20608.0  1237.5  1665.3
     Dw  10409.0   409.5  2541.9
   COND  20376.0   702.0  2902.6
    IND     15.0    16.0    93.8

cased first letter
('b' x 10000) . 'a' =~ /A/i

           blead    fast Ratio %
        -------- ------- -------
     Ir 103074.0 25704.6   401.0
     Dr  20896.5  2164.9   965.2
     Dw  10587.5   601.9  1759.0
   COND  30516.0  3036.2  1005.1
    IND     22.0    22.0   100.0

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About