develooper Front page | perl.perl5.porters | Postings from February 2018

Re: Another idea for speeding up /i matching on ASCII characters

Thread Previous | Thread Next
Karl Williamson
February 23, 2018 06:07
Re: Another idea for speeding up /i matching on ASCII characters
Message ID:
On 02/01/2018 10:03 PM, Karl Williamson wrote:
> I'm forwarding this response from Yves to the list, which he sent me 
> personally, and I think should be public.  It looks like it came from 
> me, but it's really him.
> On 25 January 2018 at 21:18, Karl Williamson <> 
> wrote:
>> This however is not space efficient.
>> If we have an EXACTFish node consisting entirely of ASCII characters that
>> don't have folds outside of ASCII (so Kk are out because they match the
>> KELVIN SIGN), it would be possible to create a parallel string to be 
>> used as
>> a mask, specially constructed so that each byte is the appropriate 
>> mask for
>> /i on the corresponding byte in the main string.
>> When doing the match, you load a words-worth of the string, and the
>> corresponding portion of the mask, AND them together and compare.  If 
>> then
>> fail or go on to the next word full.
>> Thus you wouldn't have to look up the fold for every byte as we do 
>> now. If
>> you've ever watched an EXACTFish node execute with backtracking, it's
>> depressing how often we have to do this over and over.
>> This could also work on many UTF-8 strings as well.
>> It essentially doubles the length of the pattern, but my guess is that 
>> it's
>> around an order of magnitude faster on a 64-bit machine.
> Karl, I think you are over focusing on the node type and inline matching.
> I think any and all of your ideas are worth trying as *start-nodes*,
> that is nodes which are only executed by S_find_byclass().
> Take the AHOCORASICK node type as an example. It is only ever used as
> a regstclass node. It is constructed out a TRIE node, but is distinct
> from it.
> You could try out your optimizations on these nodes only, and thus
> considerations like space efficiency and things like that are less of
> a concern as there can only be one of these nodes per pattern. It also
> give you a good way to see if it makes a difference and is worth
> integrating deeper into the regex engine.
> Yves

In "Programming Perl", there is a pattern which the footnote says 
wouldn't fail this particular match until the heat death of the Universe.

"aa...................a" =~ qr/a*a*a*...a*[Bb]/

Using this technique, it would be possible to fail immediately.  If the 
optimizer were extended to not just have fixed strings, but say the 
longest fixed and floating masked strings, then we could quickly search 
for that during execution, and fail.  Obviously this is a special case, 
but /i of ASCII is common.

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About