develooper Front page | perl.perl5.porters | Postings from December 2016

Re: some thoughts on 8-bytes-at-a-time in the regex engine

Thread Previous | Thread Next
December 29, 2016 14:59
Re: some thoughts on 8-bytes-at-a-time in the regex engine
Message ID:
Dave Mitchell wrote:
>How well does this work under utf8? Would /\x{100}\x{101}/ contribute a
>fingerprint of 4 bytes? Would we have a separate fingerprint for utf8 and

Yes, /\x{100}\x{101}/ specifies four bytes, and can only match a
UTF8-encoded string.  We would need distinct fingerprints for UTF8-encoded
and Latin1-encoded inputs.

>Would anything in this scheme be defeated by EBCDIC?

No, the same methods apply.

>But I have no real idea how the fingerprint (3 or 6 UVs: a 0, a 1 and a
>mask, maybe doubled for utf8)

I don't see how a fingerprint needs to be three words.  Surely it's
just two: a mask to apply to the input, and a value to compare the
masked input against.  Or, equivalently (but less efficiently), a mask
specifying which bits must be 1 and a mask specifying which bits must
be 0.  Each bit position in the fingerprint can be in three states:
must be 1, must be 0, or don't care.  This doesn't take more than two
bits per bit position to encode.

>I guess //i patterns would just increase what the fingerprint matches,
>rather than having to be handled separately.

That would be the sensible way.  The alternative, of folding case and
comparing the folded input against the fingerprint, would be a lot slower.

>I don't know whether any the case-folding edge cases (e.g. the /ss/ stuff)
>would defeat this scheme.

They would force fingerprints to be looser, but they don't fundamentally
break it.

>on platforms which allow misaligned reads, it would be simpler and cheaper
>to just read the next UV regardless of alignment.

Yes, we should take advantage of that where possible.  But watch out
for a misaligned read that might go past the end of an allocated (and
mapped) block.  Aligned reads don't need to worry about going past the
end of the string into uninitialised memory, though valgrind would gripe.

>Whether this scheme would still be a net performance gain on platforms
>where a UV is only 32 bits.

It's not UV you need to be concerned about, but machine words.  The extent
of a fingerprint should be either a fixed 8 bytes or determined directly
by machine architecture, not determined by UV size.  What we need to
determine is *for 32-bit architectures* whether it's best to have an 8
byte fingerprint, a 4 byte fingerprint, or no fingerprint.


Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About