Front page | perl.perl5.porters |
Postings from December 2015
Re: Obsolete text in utf8.pm
December 28, 2015 18:47
Message ID: CANgJU+Uq_okO4cxLssa4pWHVrh1Auvo4zBr9sE2O8E1ig=ZPvA@mail.gmail.com
On 28 December 2015 at 19:13, John Imrie <email@example.com> wrote:
> On 28/12/2015 17:22, demerphq wrote:
>> On 28 December 2015 at 17:47, Father Chrysostomos <firstname.lastname@example.org> wrote:
>>> John Imrie wrote:
>>>> On 28/12/2015 01:01, Father Chrysostomos wrote:
>>>>> But that is exactly how symbols are exported (assuming you mean *
>>> rather than *). Normalization for hash access (which is how we would
>>> have to do this), even if it is limited to stashes, would be an
>>> efficiency nightmare.
>>>> I was
>>>> hoping that we could do the normalisation on insert into the stash, so
>>>> that the stash itself was normalised. This would make it a compile
>>>> time operation, or one hit for each symbol you are exporting. I don't
>>>> know enough of the Perl internals to say why this would have to be on
>>>> access rather than insert.
>>> If the string "whatever" used to ask for the symbol, as in
>>> use Foo "whatever";
>>> is not normalised, but *Foo::whatever is stored in the *Foo:: stash
>>> normalised, then the symbol lookup that happens at run time every time
>>> the symbol is exported will at some point have to normalise the name
>>> provided by the caller.
>> Just FYI, this already happens for utf8 keys in hashes.
>> During fetch or store any utf8 key will trigger a downgrade attempt.
>> During store, if the downgrade is successful then the key will be
>> marked as "was-utf8", so that later when it is fetched it will be
>> upgraded. If it is not successful then the key will be looked up by
>> its utf8 byte sequence.
>> Combining characters of course are not "properly" downgraded.
>> Anyway, the consequence of this is that unicode hash lookups are much
>> slower than they could be.
> OK, so let me see if I've got this straight. Hash lookups are done on
> bytes, because by the time the lookup is done the character semantics
> have been removed.
Hash lookups are done on bytes because that is how pretty much every
existing hash function works.
So in a language like ours, where we want characters to have the same
semantics regardless of whether they are encoded as binary/latin1 or as
unicode, we have to do something about the fact that the code points
128 to 255 may have two different representations.
So we bias towards non-unicode strings, and use a normalization
strategy as I described.
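The dual representation of code points 128 to 255 is easy to demonstrate: the character-level view is identical either way, while the `bytes` pragma exposes the differing internal encodings. A minimal sketch:

```perl
use strict;
use warnings;

my $s  = "\xDF";                # code point 223, one character
my $up = $s;
utf8::upgrade($up);             # same character, UTF-8-encoded internally

# Character semantics are identical:
print length($s),  "\n";        # 1
print length($up), "\n";        # 1
print $s eq $up ? "eq\n" : "ne\n";   # eq

# But the internal byte representations differ:
{
    use bytes;
    print length($s),  "\n";    # 1 byte  (latin1: 0xDF)
    print length($up), "\n";    # 2 bytes (UTF-8: 0xC3 0x9F)
}
```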
If we had decided when we introduced Unicode that latin1 "\xDF" was
not the same as unicode "\xDF", then we would not have to do this, and
we would just look at the raw byte sequence.
So basically we already had the normalization problem, but at a
different level, and we already did what you said we should do,
albeit in a simpler form.
However that decision is IMO suspect (although not changeable), as it
leaves other issues, such as: if you store "\xDF" into a hash as
latin1, then store "\xDF" into the hash as unicode, you will get back
the non-unicode key when you do keys. IOW, last store wins as far as
the utf8 flag goes.
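The collision itself is straightforward to show: both spellings address one entry, and the second store overwrites the first. Which spelling keys() then returns, and what its UTF8 flag says, is the subtlety being described above, so the sketch below prints it rather than asserting it:

```perl
use strict;
use warnings;

my %h;
my $latin1 = "\xDF";
my $utf8   = "\xDF";
utf8::upgrade($utf8);

$h{$latin1} = "latin1 store";
$h{$utf8}   = "utf8 store";     # same entry: the value is overwritten

print scalar(keys %h), "\n";    # 1 -- only one key exists
print $h{$latin1}, "\n";        # "utf8 store"

# Inspect (don't assume) how the surviving key is flagged:
my ($k) = keys %h;
print utf8::is_utf8($k) ? "key is utf8-flagged\n"
                        : "key is not utf8-flagged\n";
```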
> So a potentially really yucky solution would be to
> special case the stash at that point and normalise the lookup string
> prior to the downgrade. Ugh, I don't like it, so I went looking at other
> languages to see what they do. C# says identifiers should be in NFC;
> Python performs NFKC, which is Compatibility Decomposition followed by
> Canonical Composition. That is really yucky in my opinion, as it breaks
> up ligatures and makes things like the Angstrom match A ring. After this
> I gave up.
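For what it's worth, the two effects mentioned are different in kind: breaking up ligatures is a compatibility (NFKC-only) mapping, whereas U+212B ANGSTROM SIGN is *canonically* equivalent to U+00C5, so even NFC unifies those. Both are easy to check with the core Unicode::Normalize module:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

# U+FB01 LATIN SMALL LIGATURE FI has only a compatibility decomposition.
my $lig = "\x{FB01}";
print "NFC keeps the ligature\n"      if NFC($lig)  eq $lig;
print "NFKC breaks it into 'fi'\n"    if NFKC($lig) eq "fi";

# U+212B ANGSTROM SIGN is canonically equivalent to U+00C5
# (A WITH RING ABOVE), so NFC performs that unification too.
my $angstrom = "\x{212B}";
print "NFC maps Angstrom to A-ring\n" if NFC($angstrom) eq "\x{C5}";
```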
In theory we could do whatever type of normalization we choose.
However back-compat will rear its ugly head.
One option, I think, that might be possible would be to normalize source
code before we compile it.
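A sketch of why source normalization comes up at all: today Perl does no canonical normalization of symbol names, so precomposed and decomposed spellings of the "same" name address two different stash entries (shown here via symbolic references rather than literal identifiers):

```perl
use strict;
use warnings;

# "e-acute" spelled two ways: precomposed U+00E9,
# and "e" followed by U+0301 COMBINING ACUTE ACCENT.
my $composed   = "\x{E9}";
my $decomposed = "e\x{301}";

# Canonically equivalent, but unequal as Perl strings...
print $composed eq $decomposed ? "eq\n" : "ne\n";   # ne

# ...so as symbol names they hit two different stash entries:
{
    no strict 'refs';
    ${"main::$composed"} = "set via composed name";
    print defined ${"main::$decomposed"} ? "same symbol\n"
                                         : "different symbols\n";
}
```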
perl -Mre=debug -e "/just|another|perl|hacker/"