
Re: Obsolete text in

December 29, 2015 04:00
On 29 December 2015 at 04:37, bulk88 <> wrote:
> demerphq wrote:
>> Hash lookups are done on bytes because that is how pretty much every
>> existing hash function works.
>> So in a language like ours, where we want characters to have the same
>> semantics regardless of whether they are encoded as binary/latin1 or
>> as unicode, we have to do something about the fact that the code
>> points 128 to 255 may have two different representations.
>> So we bias towards non-unicode strings, and use a normalization
>> strategy as I described.
>> If we had decided when we introduced Unicode that latin1 "\xDF" was
>> not the same as unicode "\xDF", then we would not have to do this, and
>> we would just look at the raw byte sequence.
>> So basically we already had the normalization problem, but at a
>> different level, and we already did what you said we should do,
>> albeit in a simpler form.
>> However that decision is IMO suspect (although not changeable) as it
>> leaves other issues, such as if you store "\xDF" into a hash as
>> latin1, then store "\xDF" into a hash as unicode, you will get back
>> the non-unicode key when you do keys. IOW, last store wins as far as
>> the utf8 flag goes.
> Instead of keeping the utf8 flag in HEK_FLAGS (1 byte bitfield), here is a
> random idea that IDK if I support, add 1 byte, that consists of 0x0 or 0x1,
> to the start or end of each key string that indicates the utf8ness of the
> key? The first byte would not be accessible from PP, but is part of the HEK
> struct and is always fed to the hash func as part of the normal string,
> currently the utf8 key is often a separate branch.

Wouldn't that fail due to the fact that we have the "was-utf8" case?

It would mean that u"\xDF" and l"\xDF" would hash differently.[1]

[1] it would be really nice if we had "unicode quotes" of some sort.
How horrible would it be to introduce u"" quotes?
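
To make the difference concrete, here is a rough sketch in C of the two strategies being compared: downgrading a utf8 key to latin1 before hashing (the normalization described in the quoted text), versus feeding a leading tag byte to the hash function (the idea proposed above). This is not the code in hv.c. All names (fnv1a, downgrade_utf8, hash_input_normalized, hash_input_tagged, KEY_MAX) are hypothetical, and FNV-1a is only a stand-in for whatever hash function perl is built with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_MAX 64  /* arbitrary cap for this sketch; buffers hold KEY_MAX + 1 */

/* Hypothetical stand-in for the real hash function: FNV-1a over raw bytes. */
static uint32_t fnv1a(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Try to turn UTF-8 into latin1. This only works if every code point is
 * below 256, i.e. every multi-byte sequence starts with 0xC2 or 0xC3.
 * Returns the latin1 length, or -1 if the key cannot be downgraded. */
static long downgrade_utf8(const unsigned char *in, size_t n,
                           unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[o++] = c;
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < n
                   && (in[i + 1] & 0xC0) == 0x80) {
            out[o++] = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            return -1;  /* code point >= 256, or malformed input */
        }
    }
    return (long)o;
}

/* Strategy A (normalize): a utf8 key is downgraded when possible, so
 * latin1 "\xDF" and utf8 "\xDF" present the same bytes to the hash.
 * Writes the bytes to be hashed into buf and returns their length. */
static size_t hash_input_normalized(const unsigned char *key, size_t n,
                                    int is_utf8, unsigned char *buf)
{
    assert(n <= KEY_MAX);
    if (is_utf8) {
        long m = downgrade_utf8(key, n, buf);
        if (m >= 0)
            return (size_t)m;
    }
    memcpy(buf, key, n);  /* not downgradable, or already bytes */
    return n;
}

/* Strategy B (tag byte): prefix 0x00 or 0x01 for the utf8ness and hash
 * the key bytes as stored, with no normalization step. */
static size_t hash_input_tagged(const unsigned char *key, size_t n,
                                int is_utf8, unsigned char *buf)
{
    assert(n <= KEY_MAX);
    buf[0] = is_utf8 ? 0x01 : 0x00;
    memcpy(buf + 1, key, n);
    return n + 1;
}
```

With the latin1 key { 0xDF } and the utf8 key { 0xC3, 0x9F }, strategy A hands the single byte 0xDF to fnv1a() in both cases, so both land in the same bucket and can compare equal. Strategy B hands it { 0x00, 0xDF } versus { 0x01, 0xC3, 0x9F }, so the two spellings of the same character are no longer the same key, which is the problem I mean.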

perl -Mre=debug -e "/just|another|perl|hacker/"
