perl.perl5.porters | Postings from December 2016

some thoughts on 8-bytes-at-a-time in the regex engine

Dave Mitchell
December 29, 2016 14:26
some thoughts on 8-bytes-at-a-time in the regex engine
I've recently been giving some thought to making parts of the regex engine
use the ability to treat 8 bytes of input string as a single UV to make
(for example) searching for a character class faster.

In particular I can see two main ways of speeding things up.

First, when scanning the string for a byte matching a character class,
we currently read each byte in turn and use it as an index into a
bitmap to see if it matches. For cases where a character class is small
i.e. it holds <= 4 separate values such as /[ACGT]/ or the [bfqt] synthetic
class generated from /(the|quick|brown|fox)/, or contains <= 2 ranges of
characters, such as /[A-Za-z]/, then it's possible to read in 8 bytes as a
single UV and apply a series of bit ops to see whether *any* of the 8
bytes within the UV match that character class, in a way that is faster
than checking each byte individually.

Second, where there are multiple alternatives that could match at a
particular point, such as /(abc|def).../ or /\d+FOO.../, I envisage
constructing a "fingerprint" at that point, which consists of two UVs
(plus a mask in case the fingerprint is short) that together specify for
the next 64-bits of string, which bits *must* be zero, and which bits
*must* be one. Then a particular position can be quickly rejected (with
possible false negatives but no false positives) by doing a couple of
boolean ops, based on the next 8 bytes read into a single UV. (This can be
considered a generalisation of the technique we already use in things like
/(abc)+def/ where we do a quick check that the next char is a 'd' while
iterating the CURLYM.)

Another way of looking at the fingerprint is to consider it as a
distillation of a series of 8 character classes, one for each of the next
8 byte positions, where for each class the allowed byte patterns are
anded together and ored together to determine which (if any) of the 8
bit positions is always 0 or always 1 across all valid characters in the
class.

How effective the fingerprint is at quickly rejecting invalid input
positions will depend on the nature of the alternatives available at that
point. For example /(abc|defg)vwxyz/ would give the 8 character classes
[ad][be][cf][vg][wv][xw][yx][zy], which even when reduced to a 64-bit
fingerprint, would likely reject most non-viable positions. Conversely
/.{8,}/ would be useless.

The effectiveness of a fingerprint depends not just on the number of
bytes allowed at a particular position, but on the particular bit
patterns. For example /[OP]/ contains these two bytes:

    O: 01001111
    P: 01010000

which gives bits that must always be 0:

    10100000

and bits that must always be 1:

    01000000

so any characters in the set

    010xxxxx    (i.e. 0x40-0x5F, '@' through '_')

will match that byte of the fingerprint. Even so, note that it will still
reject lower-case letters and digits.

So a pattern like /\d{2,4}[A-Z]+/ would create a fingerprint where the
first two bytes must be in the range 0x30-0x3f and byte 3 in the range
0x00-0x7f. Even with a relatively poor fingerprint like that, it could
still quickly reject most positions where the first two bytes aren't in
the range 0x30-0x3f.

I've pushed a proof-of-concept branch, davem/regex8 that tries out those
two ideas, but *only* in find_by_class() and *only* where the starting
character class is a trie. This is not meant as a serious effort, but just
to see whether performance improvements are possible. I tried it on the
various patterns in the regexdna shootout benchmark, which looks for
various DNA sequences in a long string of nucleotides. The results I saw
were (in millions of CPU cycles, so lower is better):

         pre    post
      12,783  13,609 /a[act]ggtaaa|tttacc[agt]t/
       6,663   6,778 /ag[act]gtaaa|tttac[agt]ct/
       5,773   5,100 /agg[act]taaa|ttta[agt]cct/
       5,288   2,409 /agggtaaa|tttaccct/

Note that this is in some ways a poor demonstration, since

1) looking for, e.g. the class [at] 8 bytes at a time doesn't speed up
much, since the input string mostly consists only of [acgt] chars, so not
much skipping takes place;

2) I didn't generate a full fingerprint for each trie; I only used the
chars in the trie itself; so /a[act]ggtaaa|tttacc[agt]t/ only generated a
1-byte fingerprint, even though an 8-byte fingerprint would be possible to
construct. For the pattern which did use a full 8-byte fingerprint,
/agggtaaa|tttaccct/, the speed more than doubled.

If we were to use such a scheme in anger, there are a lot of issues which I'm
not sure of, including:

How well does this work under utf8? Would /\x{100}\x{101}/ contribute a
fingerprint of 4 bytes? Would we have a separate fingerprint for utf8 and
non-utf8 strings?

Would anything in this scheme be defeated by EBCDIC?

Rather than just using this scheme in S_find_by_class(), I'm assuming it
could be used all over the place within regmatch(). For example the
various *, + and {,} iterators would all know the fingerprint of what's
supposed to follow them, so could quickly reject iteration counts which
would fail soon after. EXACTish nodes could quickly reject before doing a
full comparison. tries could quickly reject without having to run the
full trie.

But I have no real idea how the fingerprint (3 or 6 UVs: a 0, a 1 and a
mask, maybe doubled for utf8) could be added to existing nodes, or whether
we have to add extra 'fingerprint' nodes.

I guess //i patterns would just increase what the fingerprint matches,
rather than having to be handled separately.

I don't know whether any of the case-folding edge cases (e.g. the /ss/ stuff)
would defeat this scheme.

I have no real idea how fingerprints would be generated - I guess early
on, before the regex is optimised, we would recursively follow the nodes
anywhere a fingerprint would be useful, to determine all possible bytes at
each of the next 8 byte positions.

I don't know how the next 8 bytes would be read in and maintained.
I can envisage in regmatch() a UV var which contains the current 8 bytes,
and maybe another var which contains the next 8 aligned bytes, read in a
single go, and shift one into the other as bytes are processed. Or maybe
on platforms which allow misaligned reads, it would be simpler and cheaper
to just read the next UV regardless of alignment.

I don't know whether this scheme would still be a net performance gain on
platforms where a UV is only 32 bits.

Lots of hand waving above.

Any thoughts?

O Unicef Clearasil!
Gibberish and Drivel!
    -- "Bored of the Rings"
