Dave Mitchell wrote: >How well does this work under utf8? Would /\x{100}\x{101}/ contribute a >fingerprint of 4 bytes? Would we have a separate fingerprint for utf8 and >non-utf8? Yes, /\x{100}\x{101}/ specifies four bytes, and can only match a UTF8-encoded string. We would need distinct fingerprints for UTF8-encoded and Latin1-encoded inputs. >Would anything in this scheme be defeated by EBCDIC? No, the same methods apply. >But I have no real idea how the fingerprint (3 or 6 UVs: a 0, a 1 and a >mask, maybe doubled for utf8) I don't see how a fingerprint needs to be three words. Surely it's just two: a mask to apply to the input, and a value to compare the masked input against. Or, equivalently (but less efficiently), a mask specifying which bits must be 1 and a mask specifying which bits must be 0. Each bit position in the fingerprint can be in three states: must be 1, must be 0, or don't care. This doesn't take more than two bits per bit position to encode. >I guess //i patterns would just increase what the fingerprint matches, >rather than having to be handled separately. That would be the sensible way. The alternative, of folding case and comparing the folded input against the fingerprint, would be a lot slower. >I don't know whether any the case-folding edge cases (e.g. the /ss/ stuff) >would defeat this scheme. They would force fingerprints to be looser, but they don't fundamentally break it. >on platforms which allow misaligned reads, it would be simpler and cheaper >to just read the next UV regardless of alignment. Yes, we should take advantage of that where possible. But watch out for a misaligned read that might go past the end of an allocated (and mapped) block. Aligned reads don't need to worry about going past the end of the string into uninitialised memory, though valgrind would gripe. >Whether this scheme would still be a net performance gain on platforms >where a UV is only 32 bits. It's not UV you need to be concerned about, but machine words. The extent of a fingerprint should be either a fixed 8 bytes or determined directly by machine architecture, not determined by UV size. What we need to determine is *for 32-bit architectures* whether it's best to have an 8 byte fingerprint, a 4 byte fingerprint, or no fingerprint. -zeframThread Previous | Thread Next