develooper Front page | perl.perl5.porters | Postings from February 2018

Re: Another idea for speeding up /i matching on ASCII characters

Thread Previous | Thread Next
From:
Karl Williamson
Date:
February 2, 2018 05:04
Subject:
Re: Another idea for speeding up /i matching on ASCII characters
Message ID:
9c2067d7-24a5-186e-e497-c06ec4d2984c@khwilliamson.com
I'm forwarding this response from Yves to the list, which he sent me 
personally, and I think should be public.  It looks like it came from 
me, but it's really him.

On 25 January 2018 at 21:18, Karl Williamson <public@khwilliamson.com> 
wrote:
> This however is not space efficient.
>
> If we have an EXACTFish node consisting entirely of ASCII characters that
> don't have folds outside of ASCII (so Kk are out because they match the
> KELVIN SIGN), it would be possible to create a parallel string to be used as
> a mask, specially constructed so that each byte is the appropriate mask for
> /i on the corresponding byte in the main string.
>
> When doing the match, you load a words-worth of the string, and the
> corresponding portion of the mask, AND them together and compare.  If then
> fail or go on to the next word full.
>
> Thus you wouldn't have to look up the fold for every byte as we do now. If
> you've ever watched an EXACTFish node execute with backtracking, it's
> depressing how often we have to do this over and over.
>
> This could also work on many UTF-8 strings as well.
>
> It essentially doubles the length of the pattern, but my guess is that it's
> around an order of magnitude faster on a 64-bit machine.

Karl, I think you are over focusing on the node type and inline matching.

I think any and all of your ideas are worth trying as *start-nodes*,
that is nodes which are only executed by S_find_byclass().

Take the AHOCORASICK node type as an example. It is only ever used as
a regstclass node. It is constructed out a TRIE node, but is distinct
from it.

You could try out your optimizations on these nodes only, and thus
considerations like space efficiency and things like that are less of
a concern as there can only be one of these nodes per pattern. It also
give you a good way to see if it makes a difference and is worth
integrating deeper into the regex engine.

Yves
-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About