demerphq wrote:
> Hash lookups are done on bytes because that is how pretty much every
> existing hash function works.
>
> So in a language like ours, where we want characters to have the same
> semantics whether they are encoded as binary/latin1 or as unicode, we
> have to do something about the fact that the code points 128 to 255
> may have two different representations.
>
> So we bias towards non-unicode strings, and use a normalization
> strategy as I described.
>
> If we had decided when we introduced Unicode that latin1 "\xDF" was
> not the same as unicode "\xDF", then we would not have to do this, and
> we would just look at the raw byte sequence.
>
> So basically we already had the normalization problem, but at a
> different level, and we already did what you said we should do, albeit
> in a simpler form.
>
> However, that decision is IMO suspect (although not changeable), as it
> leaves other issues: if you store "\xDF" into a hash as latin1, then
> store "\xDF" into the same hash as unicode, you will get back the
> non-unicode key when you call keys. IOW, last store wins as far as the
> utf8 flag goes.

Instead of keeping the utf8 flag in HEK_FLAGS (a 1-byte bitfield), here is a random idea that I'm not sure I support myself: add one byte, either 0x0 or 0x1, to the start or end of each key string to indicate the utf8ness of the key. That byte would not be accessible from PP, but would be part of the HEK struct and would always be fed to the hash function as part of the normal key string. Currently a utf8 key often takes a separate branch: http://perl5.git.perl.org/perl.git/blob/HEAD:/hv.c#l677
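To make the idea concrete, here is a minimal C sketch of "one extra byte, hashed along with the key". This is not Perl's actual HEK layout or hash function: `toy_hek`, `new_hek`, and the FNV-1a hash are stand-ins, and I'm assuming the flag byte goes at the end of the key. Normalization would still have to run before key creation, so that equivalent latin1 and utf8 spellings end up with the same flag byte; the point shown here is only that once the flag is part of the hashed bytes, the hash and comparison paths need no separate utf8 branch.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Toy stand-in for Perl's HEK: the utf8-flag byte is stored inline,
     * immediately after the key bytes, and is covered by the hash. */
    typedef struct {
        uint32_t hash;
        size_t   len;    /* length of the key proper, excluding the flag byte */
        char     key[];  /* len key bytes + 1 trailing utf8-flag byte */
    } toy_hek;

    /* FNV-1a, a stand-in for whatever hash function perl is built with. */
    static uint32_t fnv1a(const unsigned char *p, size_t n) {
        uint32_t h = 2166136261u;
        while (n--) { h ^= *p++; h *= 16777619u; }
        return h;
    }

    /* Hash the key bytes plus the flag byte in one pass: no separate
     * utf8 branch, because the flag is just part of the hashed string. */
    static toy_hek *new_hek(const char *key, size_t len, int is_utf8) {
        toy_hek *hek = malloc(sizeof *hek + len + 1);
        memcpy(hek->key, key, len);
        hek->key[len] = is_utf8 ? 0x1 : 0x0;   /* the extra byte */
        hek->len  = len;
        hek->hash = fnv1a((unsigned char *)hek->key, len + 1);
        return hek;
    }

    int main(void) {
        /* Same byte sequence, different utf8 flags: the flag byte makes
         * them hash (and memcmp) as distinct keys. */
        toy_hek *latin1 = new_hek("\xDF", 1, 0);
        toy_hek *utf8   = new_hek("\xDF", 1, 1);
        printf("latin1: %08x  utf8: %08x\n",
               (unsigned)latin1->hash, (unsigned)utf8->hash);
        free(latin1);
        free(utf8);
        return 0;
    }

A side effect worth noting: key equality could then be a single memcmp over len + 1 bytes, since two HEKs with identical key bytes but different utf8 flags would differ in the trailing byte.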