Re: Obsolete text in utf8.pm

From: demerphq
Date: December 29, 2015 04:00
Subject: Re: Obsolete text in utf8.pm
Message ID: CANgJU+UPUpeFM5FT4znY0DLdVviWGZOkOW3HAEKQf337Dw=e2Q@mail.gmail.com

On 29 December 2015 at 04:37, bulk88 <bulk88@hotmail.com> wrote:
> demerphq wrote:
>>
>> Hash lookups are done on bytes because that is how pretty much every
>> existing hash function works.
>>
>> So in a language like ours where we want characters to have the same
>> semantics regardless if they are encoded as binary/latin1, or if they
>> are encoded as unicode, we have to do something about the fact that
>> the code points 128 to 255 may have two different representations.
>>
>> So we bias towards non-unicode strings, and use a normalization
>> strategy as I described.
>>
>> If we had decided when we introduced Unicode that latin1 "\xDF" was
>> not the same as unicode "\xDF", then we would not have to do this, and
>> we would just look at the raw byte sequence.
>>
>> So basically we already had the normalization problem, but at a
>> different level, and we already did what you said we should do,
>> albeit in a simpler form.
>>
>> However that decision is IMO suspect (although not changeable) as it
>> leaves other issues, such as if you store "\xDF" into a hash as
>> latin1, then store "\xDF" into a hash as unicode, you will get back
>> the non-unicode key when you do keys. IOW, last store wins as far as
>> the utf8 flag goes.
>
>
>
> Instead of keeping the utf8 flag in HEK_FLAGS (1 byte bitfield), here is a
> random idea that IDK if I support: add 1 byte, consisting of 0x0 or 0x1,
> to the start or end of each key string to indicate the utf8ness of the
> key? That byte would not be accessible from PP, but it is part of the HEK
> struct and is always fed to the hash func as part of the normal string.
> Currently the utf8 key is often a separate branch:
> http://perl5.git.perl.org/perl.git/blob/HEAD:/hv.c#l677 .

Wouldn't that fail due to the fact that we have the "was-utf8" case?

It would mean that u"\xDF" and l"\xDF" would hash differently.[1]
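
To make the difference concrete, here is a toy sketch in C. It is not the
code in hv.c: fnv1a stands in for whatever hash function is in use, and
toy_downgrade, hash_normalized, hash_flag_prefixed and TOY_MAX_KEY are
made-up names for illustration only. It assumes well-formed UTF-8 input.
The first strategy downgrades a utf8 key to one byte per character when
every code point is below 256, then hashes the bytes. The second feeds a
0x0/0x1 flag byte to the hash ahead of the raw key bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define FNV_INIT    2166136261u
#define TOY_MAX_KEY 64          /* toy limit, not a real perl constant */

/* Byte-wise FNV-1a; a stand-in for the real hash function. */
static uint32_t fnv1a(uint32_t h, const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        h ^= s[i];
        h *= 16777619u;
    }
    return h;
}

/* Downgrade UTF-8 to one byte per character if every code point is
 * below 256. Returns 1 on success, 0 if some character does not fit.
 * Lead bytes 0xC2 and 0xC3 are the ones that encode U+0080..U+00FF. */
static int toy_downgrade(const unsigned char *s, size_t len,
                         unsigned char *out, size_t *out_len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80) {
            out[n++] = s[i];
        } else if ((s[i] == 0xC2 || s[i] == 0xC3) && i + 1 < len
                   && (s[i + 1] & 0xC0) == 0x80) {
            out[n++] = (unsigned char)(((s[i] & 0x03) << 6)
                                       | (s[i + 1] & 0x3F));
            i++;
        } else {
            return 0;
        }
    }
    *out_len = n;
    return 1;
}

/* Strategy 1: normalize towards the non-unicode form, then hash. */
static uint32_t hash_normalized(const unsigned char *s, size_t len,
                                int is_utf8)
{
    unsigned char buf[TOY_MAX_KEY];
    size_t n;
    /* Keys longer than the toy buffer are hashed as-is. */
    if (is_utf8 && len <= TOY_MAX_KEY && toy_downgrade(s, len, buf, &n))
        return fnv1a(FNV_INIT, buf, n);
    return fnv1a(FNV_INIT, s, len);
}

/* Strategy 2: hash a utf8ness flag byte followed by the raw bytes. */
static uint32_t hash_flag_prefixed(const unsigned char *s, size_t len,
                                   int is_utf8)
{
    unsigned char flag = is_utf8 ? 0x1 : 0x0;
    return fnv1a(fnv1a(FNV_INIT, &flag, 1), s, len);
}
```

With l"\xDF" as the single byte 0xDF and u"\xDF" as the bytes 0xC3 0x9F,
hash_normalized gives both the same value, so they land on the same key.
hash_flag_prefixed hashes two different byte strings and gives different
values, so the two would become different keys.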

Yves
[1] It would be really nice if we had "unicode quotes" of some sort.
How horrible would it be to introduce u"" quotes?

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
