develooper Front page | perl.perl5.porters | Postings from December 2016

Re: some thoughts on 8-bytes-at-a-time in the regex engine

Thread Previous | Thread Next
Karl Williamson
December 30, 2016 19:17
Re: some thoughts on 8-bytes-at-a-time in the regex engine
Message ID:
On 12/29/2016 07:59 AM, Zefram wrote:
> Dave Mitchell wrote:
>> How well does this work under utf8? Would /\x{100}\x{101}/ contribute a
>> fingerprint of 4 bytes? Would we have a separate fingerprint for utf8 and
>> non-utf8?
> Yes, /\x{100}\x{101}/ specifies four bytes, and can only match a
> UTF8-encoded string.  We would need distinct fingerprints for UTF8-encoded
> and Latin1-encoded inputs.
>> Would anything in this scheme be defeated by EBCDIC?
> No, the same methods apply.
>> But I have no real idea how the fingerprint (3 or 6 UVs: a 0, a 1 and a
>> mask, maybe doubled for utf8)
> I don't see how a fingerprint needs to be three words.  Surely it's
> just two: a mask to apply to the input, and a value to compare the
> masked input against.  Or, equivalently (but less efficiently), a mask
> specifying which bits must be 1 and a mask specifying which bits must
> be 0.  Each bit position in the fingerprint can be in three states:
> must be 1, must be 0, or don't care.  This doesn't take more than two
> bits per bit position to encode.
>> I guess //i patterns would just increase what the fingerprint matches,
>> rather than having to be handled separately.
> That would be the sensible way.  The alternative, of folding case and
> comparing the folded input against the fingerprint, would be a lot slower.
>> I don't know whether any the case-folding edge cases (e.g. the /ss/ stuff)
>> would defeat this scheme.
> They would force fingerprints to be looser, but they don't fundamentally
> break it.
>> on platforms which allow misaligned reads, it would be simpler and cheaper
>> to just read the next UV regardless of alignment.
> Yes, we should take advantage of that where possible.  But watch out
> for a misaligned read that might go past the end of an allocated (and
> mapped) block.  Aligned reads don't need to worry about going past the
> end of the string into uninitialised memory, though valgrind would gripe.
>> Whether this scheme would still be a net performance gain on platforms
>> where a UV is only 32 bits.
> It's not UV you need to be concerned about, but machine words.  The extent
> of a fingerprint should be either a fixed 8 bytes or determined directly
> by machine architecture, not determined by UV size.  What we need to
> determine is *for 32-bit architectures* whether it's best to have an 8
> byte fingerprint, a 4 byte fingerprint, or no fingerprint.
> -zefram

I essentially agree with this.  Here are some further considerations.

Edge cases like /ss/ probably would have to have special handling; 
currently that particular sequence has its own regop under some 

Node boundaries introduce bugs with folding.  qr/[sS][sS]/i currently 
doesn't match ß because of the separate nodes.  So your trick of 
transforming /(abc|defg)vwxyz/ into [ad][be][cf][vg][wv][xw][yx][zy]
is not always valid, depending on the particular things a,b, are.

The ASCII-range alphabetics differ in exactly one bit position from 
their most obvious case-changed version.  So the bit pattern of 'A' and 
'a' are identical except for one bit, they're 32 apart on ASCII 
platforms; 64 on EBCDIC.  That is true also of most Latin1 ranges ones. 
But note that things like k and s also match above-Unicode chars under 
/i.  Folding of above-Latin1 range characters has various patterns, so 
no overall one.

If you watch a /i regex match proceed with backtracking, you will see 
that folding of the same thing gets calculated over and over.  It would 
be tempting to do that fold just once, and then one could have EXACT 
nodes instead of EXACTF.  This could happen if we had fold magic on 
scalars similar to the existing collation magic that gets calculated on 
first need, and recalculated only if the scalar changes.  But it would 
only work on patterns that were entirely /i (that is, no subparts that 
are (?i:...) when the rest of the pattern isn't /i) and would be useful 
only on patterns that had significant EXACTF nodes and backtracking.

Also, some years ago, I proposed using word reads to count UTF-8 
variants when calculating the size needed for converting a string to 
UTF-8.  That was objected to because of cross compilation issues.  Those 
same objections would apply to this scheme, but I suspect that we no 
longer care so much about that.

Not germane to this, but FYI, I am working on our previously agreed on 
goal of compiling in all Unicode properties, and getting rid of runtime 
swashes completely.  Most Unicode properties can be completely described 
by using one to three ranges.  This is not a good fit for ANYOF nodes 
nor swashes, but a new node type could be created for it.  (This fact is 
already used to cut down significantly on the number of disk files 
created in lib/unicore/lib for the Unicode properties.)

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About