Front page | perl.perl5.porters |
Postings from January 2018
Name that regnode
Thread Next
From:
Karl Williamson
Date:
January 24, 2018 21:02
Subject:
Name that regnode
Message ID:
21e66a6f-2723-37c7-a738-e2fd0645c5eb@khwilliamson.com
I recently commited changes that sped up the pattern matching
performance of /[[:ascii:]]/ significantly by using word-at-a-time
operations instead of byte.
Since then, I realized that the same techniques can be applied to a
bunch of different sets of characters so that the speed of matching them
or finding the next occurrence of one of them can run in 1/8 the
operations as currently.
Consider a byte 'B' and a mask 'M', and the result of the operation
B & M. Let's call it 'x'. If the set of bytes we want to match
consists of precisely all bytes B for which ANDing with M yields 'x',
then these techniques work.
For example, consider the mask FE, and byte 0x30. The only other byte
ANDED with FE that yields 30 is 0x31. So, if the set we are looking for
is precisely [01], these techniques apply. Similarly the mask F8 and
0x30 works to match precisely [0-7], so the techniques apply
It turns out that these sets are not uncommon. There are about 18K
compilations of such sets when our test suite is run. The most used of
these is [01]. And if your application is looking for binary numbers,
matching can be sped up significantly. As shown above, the octal
numbers are another set, as are the set of all the ASCII characters, as
any ASCII ANDed with 0x80 yields 0. As are any ASCII case pair, like G
and g. And the C0 controls. Some you might not think of are <> and [{.
It's fairly easy to detect these when compiling a bracketed character
class, and to instead generate a different regnode that the re engine
will process much faster when trying to find the next one of and to find
when a span of them end. And there isn't a lot of code added.
But I don't like the name I came up with for this regnode, MASKED. I'm
open to suggestions for a better name.
I haven't studied the implications of using this technique, but one area
where it could be used is in /i matching, and the first character in the
string being matched is such an ASCII case pair. Right now, when
looking for the next occurrence of this string case insensitively, we
look one byte at a time and fold. It would be a lot lot faster to use
this technique on the first string character. Similarly if that first
character has no fold, we should just use memchr to find it.
This technique only works for ASCII characters, as otherwise UTF-8ness
gets in the way. It would be possible to have a node type that uses
this technique if the string wasn't UTF-8, and used the current methods
for handling bracket character classes otherwise.
Thread Next
-
Name that regnode
by Karl Williamson