develooper Front page | perl.perl5.porters | Postings from January 2017

Re: some thoughts on 8-bytes-at-a-time in the regex engine

Thread Previous | Thread Next
David Nicol
January 1, 2017 05:53
Re: some thoughts on 8-bytes-at-a-time in the regex engine
Message ID:
> I am not keen on this optimisation, especially as it essentially
> creates a trie out of charclasses.
> I would prefer we just make the TRIE code faster by not making it
> support huge alphabets and eliminating the compressed table
> representation.

I'm surprised to learn that unicode character classes aren't already treated
as tries. I thought they were, and [ab♞] was essentially shorthand for
(?:a|b|♞), with
the multi-byte ♞ treated like any other series of bytes that have to
appear in a particular order,
and [ab♚♝♞] turning into a test that the first byte contains either a,
b, or the appropriate utf8 length prefix byte, followed by the any of
the three possible matching continuations. The trie optimization.

> As I mention in a different reply I have been working on a perl based
> regex optimising parser, which constructs a proper optree and then
> transforms and optimises the optree. If we had such a thing in C then
> we could do MANY things we currently cannot.

Just for a thought experiment, if the regex parser worked ideally,
what kind of situations would doing wider comparisons help with? The
answer would be, comparisons involving more than one character, of
course, but how many characters? How many characters before the effort
of aligning the comparison is smaller than the savings of doing fewer
comparisons? That is, starting from an ideal regex engine that turns
the regex into a finite state automaton (FSA) consisting of
octet-processing nodes that are implemented as tables of 256
where-to-go-next questions.

For that matter, could there be anything gained by treating the input
as nybbles, so each octet-processing point would break into two 16-way
nodes instead of one 256-way node? Probably not. Practically, it's
easy to see situations where pre-testing for a high byte -- c&0x80 --
could eliminate without having to look up in the table, allowing a
128-wide table instead of 256 -- but is having smaller tables worth
it, for speed? maybe selecting from various strategies at each node in
the FSA depending on what it is checking would be ideal.

Given the above, if the regex compiler knows that for a match on c[0]
to result in a match on the whole pattern, c[1] ... c[7] are going to
have to match too, it could make sense to deal with the alignment
issues somehow and then compare 64 bits at once. Outside of that
situation -- long sequences where they all must match to continue on
that path in the FSA -- I don't see doing vector comparisons helping
any, especially when the engine would have to backtrack and recheck
matches when ambiguity might get encoded as a loose test.

Maybe four characters is enough?
     Is a string that must be matched at this point longer than 4
chars? Apply the optimization. Once applied, instead of creating four
single byte nodes, the code would copy c and the three bytes after c
to somewhere aligned, then compare the aligned register against the
required four bytes. This would be instead of having four nodes each
testing one byte, and would be a speed optimization -- fewer branches
for fewer pipeline stalls -- as well as a space optimization -- fewer
256-wide tables. I expect (without benchmarking) that always copying
will be better than testing to see if the input is already aligned,
thereby not having to copy one eighth of the time.

Like I think I made clear above, I haven't done any work on the regex
engine internals, but I imagine a regex compiles into a FSA that
operates on bytes, with side effects of filling capture buffers, and I
do see particular scenarios where testing more than one byte at a time
could make sense as an optimization to shrink the FSA.

Happy new year!!!
"Teaching radical novelties is our main safeguard against
dictatorships" -- Edsger W. Dijkstra

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About