develooper Front page | perl.perl5.porters | Postings from January 2018

Name that regnode

Thread Next
Karl Williamson
January 24, 2018 21:02
Name that regnode
Message ID:
I recently commited changes that sped up the pattern matching 
performance of /[[:ascii:]]/ significantly by using word-at-a-time 
operations instead of byte.

Since then, I realized that the same techniques can be applied to a 
bunch of different sets of characters so that the speed of matching them 
or finding the next occurrence of one of them can run in 1/8 the 
operations as currently.

Consider a byte 'B' and a mask 'M', and the result of the operation
B & M.  Let's call it 'x'.  If the set of bytes we want to match 
consists of precisely all bytes B for which ANDing with M yields 'x', 
then these techniques work.

For example, consider the mask FE, and byte 0x30.  The only other byte 
ANDED with FE that yields 30 is 0x31.  So, if the set we are looking for 
is precisely [01], these techniques apply.  Similarly the mask F8 and 
0x30 works to match precisely [0-7], so the techniques apply

It turns out that these sets are not uncommon.  There are about 18K 
compilations of such sets when our test suite is run.  The most used of 
these is [01].  And if your application is looking for binary numbers, 
matching can be sped up significantly.  As shown above, the octal 
numbers are another set, as are the set of all the ASCII characters, as 
any ASCII ANDed with 0x80 yields 0.  As are any ASCII case pair, like G 
and g.  And the C0 controls.  Some you might not think of are <> and [{.

It's fairly easy to detect these when compiling a bracketed character 
class, and to instead generate a different regnode that the re engine 
will process much faster when trying to find the next one of and to find 
when a span of them end.  And there isn't a lot of code added.

But I don't like the name I came up with for this regnode, MASKED.  I'm 
open to suggestions for a better name.

I haven't studied the implications of using this technique, but one area 
where it could be used is in /i matching, and the first character in the 
string being matched is such an ASCII case pair.  Right now, when 
looking for the next occurrence of this string case insensitively, we 
look one byte at a time and fold.  It would be a lot lot faster to use 
this technique on the first string character.  Similarly if that first 
character has no fold, we should just use memchr to find it.

This technique only works for ASCII characters, as otherwise UTF-8ness 
gets in the way.  It would be possible to have a node type that uses 
this technique if the string wasn't UTF-8, and used the current methods 
for handling bracket character classes otherwise.

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About