Re: [perl #78354] PATCH: Unicode 6.0
From:
karl williamson
Date:
November 16, 2010 17:44
Subject:
Re: [perl #78354] PATCH: Unicode 6.0
Message ID:
4CE3336C.2070904@khwilliamson.com
karl williamson wrote:
> Father Chrysostomos via RT wrote:
>> On Tue Oct 12 21:56:17 2010, public@khwilliamson.com wrote:
>>> This series of commits delivers the Unicode 6.0 db, and upgrades Perl
>>> to use it. There may still be some work to do in Unicode::UCD to
>>> support the new characters (which I'll investigate), but the rest of
>>> the Perl core should fully support it.
>>>
>>> The few code changes are attached to this email, but the bulk of the
>>> changes (along with the attachments here), too large to email, are
>>> located at git://github.com/khwilliamson/perl.git
>>> branch mktables
>>>
>>> Those changes are essentially entirely official Unicode data, except
>>> for the MANIFEST, perldelta, version, and a couple data changes in UCD.t
>>
>> I’ve applied the first patch as 92f9d56c66.
>> With the Unicode 6 database I get a test failure:
>>
>> $ curl http://github.com/khwilliamson/perl/commit/35e84e1c3151243.patch | git am
>> [...]
>> $ cd t
>> $ ./perl harness -v ../lib/charnames.t
>> [...]
>> not ok 17078 - Verify string_vianame("BELL") is chr(0x1F514)
>> # Failed at ../lib/charnames.t line 105
>> # got "\a"
>> # expected "\x{1f514}"
>>
>>
>>
>
> I'm afraid this is what I consider to be a flaw in the new standard,
> though Unicode wouldn't. I regret that I did not find it before it was
> too late; your tests are the first place it surfaced. I'm not sure
> Unicode would have listened to me anyway, but we would have known about
> this earlier.
>
> Your tests showed the problem and mine didn't because my tests sample
> code points at random: going through all of the roughly one million
> possible code points on every run would take too long, and the sampling
> simply had not tried that combination yet.
>
> I'm not sure what to do; suggestions welcome.
>
> The problem stems from the fact that the Standard does not give names to
> the control characters, such as ACK and BEL. It did in version 1.0, and
> it still publishes those names as the "Unicode_1_Name" property. That
> name for character 0x07, known by the acronym BEL, is "BELL". What Perl
> does is use the Unicode 1 name when a character has no current name. All
> was fine until 6.0 came along and reused BELL for a different character.
>
> But as far as Unicode is concerned, there isn't a problem, as BEL has no
> official name. It is Perl that has persisted in using this old name. I
> don't know why Unicode removed the names, and it seems eminently
> reasonable to give the control characters names, but here we are.
>
> The only option I can think of that doesn't violate our stability
> policies is to, in 5.14, keep the old BELL meaning, but deprecate it,
> saying to use BEL instead, which was added in 5.13 as a synonym for it.
> This means that in 5.14 we don't accept that one new Unicode character,
> except by ordinal value. In 5.16, we switch BELL to the Unicode meaning.
>
> In the meantime, I will propose that Unicode adopt a policy of not doing
> this again, and perhaps add an alias that gives the new character a
> somewhat different name, just to clear up future confusion.
>
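The fallback rule described in the quoted text (use the Unicode 1 name only when a code point has no current name) can be sketched roughly as below. This is an illustration only, not Perl's actual implementation: the two hashes are hand-written excerpts and build_lookup is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical, hand-written excerpts; these are NOT Perl's real tables.
my %current_name   = (0x1F514 => 'BELL');                          # new in Unicode 6.0
my %unicode_1_name = (0x0006 => 'ACKNOWLEDGE', 0x0007 => 'BELL');  # legacy names

# Build a name => code point map. A Unicode 1 name is used only for a
# code point that has no current name; any name already taken is
# recorded as a clash instead of being overwritten.
sub build_lookup {
    my ($current, $legacy) = @_;
    my (%by_name, @clashes);

    $by_name{ $current->{$_} } = $_ for keys %$current;

    for my $cp (sort { $a <=> $b } keys %$legacy) {
        next if exists $current->{$cp};        # current name takes priority
        my $name = $legacy->{$cp};
        if (exists $by_name{$name}) {
            push @clashes,
                sprintf('%s: U+%04X vs U+%04X', $name, $cp, $by_name{$name});
            next;
        }
        $by_name{$name} = $cp;
    }
    return (\%by_name, \@clashes);
}

my ($by_name, $clashes) = build_lookup(\%current_name, \%unicode_1_name);
print "clash: $_\n" for @$clashes;
```

In this sketch the current Unicode name wins a clash. The 5.14 plan in the quoted text makes the opposite choice for BELL in order to stay backward compatible.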
The attached patches work around this problem by deprecating \N{BELL}
for 5.14 and giving the new name \N{ALERT} to the control character
(0x07). The new Unicode 6.0 character named BELL (0x1F514) will have no
name in Perl 5.14. This means that Perl 5.14 doesn't quite support
Unicode 6.0.
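
One way to check how a particular perl resolves these names is the small script below. It is a sketch that uses only charnames::vianame from the core charnames module. The result for BELL is assumed to depend on the Perl version: under the plan described here, 5.14 should still report the control character, while later releases should report the new character.

```perl
use strict;
use warnings;
use charnames ':full';

# Look up each name at run time and print the code point it resolves to.
# ALERT and BEL are expected to name the control character 0x07; what
# BELL resolves to is the version-dependent part.
for my $name ('ALERT', 'BEL', 'BELL') {
    my $cp = charnames::vianame($name);
    printf "%-5s => %s\n", $name,
        defined $cp ? sprintf('U+%04X', $cp) : 'unknown';
}
```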
The patches are also available at:
git://github.com/khwilliamson/perl.git
branch uni6
which includes the entire series of Unicode 6.0 patches.