
Re: Obsolete text in utf8.pm

From: John Imrie
Date: December 28, 2015 18:13
Subject: Re: Obsolete text in utf8.pm
Message ID: 56817BBE.5060701@virginmedia.com

On 28/12/2015 17:22, demerphq wrote:
> On 28 December 2015 at 17:47, Father Chrysostomos <sprout@cpan.org> wrote:
>> John Imrie wrote:
>>> On 28/12/2015 01:01, Father Chrysostomos wrote:
>>>> But that is exactly how symbols are exported (assuming you mean *
>>>> rather than *). Normalization for hash access (which is how we would
>>>> have to do this), even if it is limited to stashes, would be an
>>>> efficiency nightmare.
>>> I was hoping that we could do the normalisation on insert into the
>>> stash, so that the stash itself was normalised. This would make it a
>>> compile-time operation, or one hit for each symbol you are exporting.
>>> I don't know enough of the Perl internals to see why this would have
>>> to be on access rather than insert.
>> If the string "whatever" used to ask for the symbol, as in
>>
>>     use Foo "whatever";
>>
>> is not normalised, but *Foo::whatever is stored in the *Foo:: stash
>> normalised, then the symbol lookup that happens at run time every time
>> the symbol is exported will at some point have to normalise the name
>> provided by the caller.
> Just FYI, this already happens for utf8 keys in hashes.
>
> During fetch or store any utf8 key will trigger a downgrade attempt.
>
> During store, if the downgrade is successful then the key will be
> marked as "was-utf8", so that later when it is fetched it will be
> upgraded. If it is not successful then the key will be looked up by
> its utf8 byte sequence.
>
> Combining characters of course are not "properly" downgraded.
>
> Anyway, the consequence of this is that unicode hash lookups are much
> slower than they could be.
>
> Yves
>
>
OK, so let me see if I've got this straight. Hash lookups are done on
bytes, because by the time the lookup is done the character semantics
have been removed. So a potentially really yucky solution would be to
special-case the stash at that point and normalise the lookup string
prior to the downgrade. Ugh, I don't like it, so I went looking at other
languages to see what they do. C# says identifiers should be in NFC.
Python performs NFKC, which is Compatibility Decomposition followed by
Canonical Composition. That is really yucky in my opinion, as it breaks
up ligatures and makes things like the Angstrom sign match A-ring. After
this I gave up.
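
To make the difference between the two forms concrete, here is a minimal
sketch using Python's standard unicodedata module. The dict named stash
and the helpers insert_symbol, lookup_raw and lookup_normalised are
hypothetical stand-ins, not Perl internals; they only illustrate why a
name that is normalised on insert also has to be normalised when it is
looked up. Note that the Angstrom sign already folds to A-ring under
NFC, because that mapping is canonical; only the ligature needs NFKC.

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"   # ANGSTROM SIGN
A_RING = "\u00c5"          # LATIN CAPITAL LETTER A WITH RING ABOVE
A_COMBINING = "A\u030a"    # "A" followed by COMBINING RING ABOVE
FI_LIGATURE = "\ufb01"     # LATIN SMALL LIGATURE FI


def nfc(name):
    """Canonical decomposition followed by canonical composition."""
    return unicodedata.normalize("NFC", name)


def nfkc(name):
    """Compatibility decomposition followed by canonical composition."""
    return unicodedata.normalize("NFKC", name)


# Hypothetical stand-in for a stash: names are normalised once, on insert.
stash = {}


def insert_symbol(name, value):
    stash[nfc(name)] = value


def lookup_raw(name):
    # Uses the caller's string as-is, so an un-normalised name can miss.
    return stash.get(name)


def lookup_normalised(name):
    # Normalising at lookup time is the per-access cost discussed above.
    return stash.get(nfc(name))


insert_symbol(A_COMBINING, "sub")

print(nfc(FI_LIGATURE) == FI_LIGATURE)   # NFC keeps the ligature intact
print(nfkc(FI_LIGATURE) == "fi")         # NFKC splits it into two letters
print(nfc(ANGSTROM_SIGN) == A_RING)      # canonical mapping, so NFC folds it
print(lookup_raw(A_COMBINING))           # the decomposed spelling misses
print(lookup_normalised(A_COMBINING))    # normalising the request finds it
```

So a stash normalised only on insert would still miss any request that
arrives in a different spelling, which is the point made earlier in the
thread about having to normalise the caller's name at run time.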
