develooper Front page | perl.perl5.porters | Postings from December 2015

Re: Obsolete text in

December 29, 2015 03:38
demerphq wrote:
> Hash lookups are done on bytes because that is how pretty much every
> existing hash function works.
> So in a language like ours where we want characters to have the same
> semantics regardless of whether they are encoded as binary/latin1 or
> as unicode, we have to do something about the fact that
> the code points 128 to 255 may have two different representations.
> So we bias towards non-unicode strings, and use a normalization
> strategy as I described.
> If we had decided when we introduced Unicode that latin1 "\xDF" was
> not the same as unicode "\xDF", then we would not have to do this, and
> we would just look at the raw byte sequence.
> So basically we already had the normalization problem, but at a
> different level, and we already did what you said we should do,
> albeit in a simpler form.
> However that decision is IMO suspect (although not changeable) as it
> leaves other issues, such as if you store "\xDF" into a hash as
> latin1, then store "\xDF" into a hash as unicode, you will get back
> the non-unicode key when you do keys. IOW, last store wins as far as
> the utf8 flag goes.
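
To make the normalization described above concrete, here is a minimal
sketch in C. It is not Perl's actual code: FNV-1a stands in for the real
hash function, and the names can_downgrade() and hash_key() are
hypothetical. The only point it illustrates is that a UTF-8 key whose
characters all fit in 0x00..0xFF is hashed as if it were the byte form,
so both representations of "\xDF" produce the same hash.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a is only a stand-in here; it is NOT the hash function Perl uses. */
#define FNV_BASIS 2166136261u
#define FNV_PRIME 16777619u

static uint32_t fnv1a_step(uint32_t h, unsigned char c)
{
    return (h ^ c) * FNV_PRIME;
}

/* Hypothetical helper: true if every character of the UTF-8 string is
 * <= 0xFF, i.e. the key can be represented with one byte per character.
 * Only lead bytes 0xC2 and 0xC3 encode code points 0x80..0xFF. */
static bool can_downgrade(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80)
            continue;
        if ((s[i] == 0xC2 || s[i] == 0xC3) && i + 1 < len
            && (s[i + 1] & 0xC0) == 0x80) {
            i++;            /* skip the continuation byte */
            continue;
        }
        return false;
    }
    return true;
}

/* Hash a key, biasing towards the non-unicode (byte) form: a UTF-8 key
 * that can be downgraded is hashed as its downgraded bytes. */
static uint32_t hash_key(const unsigned char *s, size_t len, bool is_utf8)
{
    bool downgrade = is_utf8 && can_downgrade(s, len);
    uint32_t h = FNV_BASIS;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = s[i];
        if (downgrade && c >= 0x80) {
            /* Decode the two-byte sequence into a single byte. */
            c = (unsigned char)(((c & 0x03) << 6) | (s[i + 1] & 0x3F));
            i++;
        }
        h = fnv1a_step(h, c);
    }
    return h;
}
```

With this sketch, hash_key() on the single byte 0xDF with is_utf8 false
and on the bytes 0xC3 0x9F with is_utf8 true give the same value, while
hashing 0xC3 0x9F as plain bytes gives a different one.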

Instead of keeping the utf8 flag in HEK_FLAGS (a 1-byte bitfield), here
is a random idea that IDK if I support: add 1 byte, either 0x0 or 0x1,
to the start or end of each key string to indicate the utf8ness of the
key. That byte would not be accessible from PP, but it would be part of
the HEK struct and would always be fed to the hash func as part of the
normal string. Currently the utf8 key is often a separate branch.
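
A rough sketch of the idea in C, assuming the tag byte goes at the
start of the key. The struct tagged_key and its helper functions are
hypothetical and do not reflect Perl's real HEK layout, and FNV-1a again
stands in for the real hash function. The sketch covers only the
mechanical part (tag byte stored with the key and hashed with it); it
does not address whether keys would still be normalized first.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a hash key; NOT Perl's actual HEK struct. */
struct tagged_key {
    uint32_t      hash;
    size_t        len;     /* length of the visible key, without the tag */
    unsigned char data[];  /* data[0] is the tag: 0x0 bytes, 0x1 utf8 */
};

/* FNV-1a, used only as a placeholder hash function. */
static uint32_t fnv1a(const unsigned char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= s[i];
        h *= 16777619u;
    }
    return h;
}

/* Build a key whose first byte records utf8ness. The tag is hashed as
 * part of the string, so no separate flag branch is needed. Returns
 * NULL if allocation fails; the caller frees the result with free(). */
static struct tagged_key *tagged_key_new(const unsigned char *s,
                                         size_t len, int is_utf8)
{
    struct tagged_key *k = malloc(sizeof *k + len + 1);
    if (!k)
        return NULL;
    k->len = len;
    k->data[0] = is_utf8 ? 0x1 : 0x0;
    memcpy(k->data + 1, s, len);
    k->hash = fnv1a(k->data, len + 1);
    return k;
}

/* The part of the key that would be visible from PP: skips the tag. */
static const unsigned char *tagged_key_str(const struct tagged_key *k)
{
    return k->data + 1;
}

/* Equality is one comparison over tag plus key bytes. */
static int tagged_key_eq(const struct tagged_key *a,
                         const struct tagged_key *b)
{
    return a->hash == b->hash && a->len == b->len
        && memcmp(a->data, b->data, a->len + 1) == 0;
}
```

In this sketch, two keys with identical bytes but different tags hash
differently and compare unequal, so utf8ness becomes part of the key's
identity instead of a flag checked on a separate code path.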
