perl.perl5.porters | Postings from November 2010

Re: [perl #78354] PATCH: Unicode 6.0

karl williamson
November 16, 2010 17:44
karl williamson wrote:
> Father Chrysostomos via RT wrote:
>> On Tue Oct 12 21:56:17 2010, wrote:
>>> This series of commits delivers the Unicode 6.0 db, and upgrades Perl 
>>> to use it.  There may still be some work to do in Unicode::UCD to 
>>> support the new characters (which I'll investigate), but the rest of 
>>> the Perl core should fully support it.
>>> The few code changes are attached to this email, but the bulk of the 
>>> changes (along with the attachments here), too large to email, are 
>>> located at git://
>>> branch mktables
>>> Those changes are essentially entirely official Unicode data, except 
>>> for the MANIFEST, perldelta, version, and a couple data changes in UCD.t
>> I’ve applied the first patch as 92f9d56c66.
>> With the Unicode 6 database I get a test failure:
>> $ curl
>> | git am
>> [...]
>> $ cd t
>> $ ./perl harness -v ../lib/charnames.t
>> [...]
>> not ok 17078 - Verify string_vianame("BELL") is chr(0x1F514)
>> # Failed at ../lib/charnames.t line 105
>> #      got "\a"
>> # expected "\x{1f514}"
> I'm afraid this is what I consider to be a flaw in the new standard, 
> though Unicode wouldn't see it that way.  I regret that I did not find 
> it before it was too late; your tests are the first place it surfaced. 
> I'm not sure Unicode would have listened to me anyway, but we would 
> have known about this earlier.
> Your tests showed the problem and mine didn't because mine sample code 
> points at random (it would take too long to go through all of the 
> million-plus possible code points on every run), and they simply hadn't 
> tried that combination yet.
> I'm not sure what to do; suggestions welcome.
> The problem stems from the fact that the Standard does not give names to 
> the control characters, such as ACK and BEL.  It did in version 1.0, and 
> it still publishes those names as the "Unicode_1_Name" property.  That 
> name for character 0x07, known by the acronym BEL, is "BELL".  What Perl 
> does is use the Unicode 1 name when there is no current name.  All was 
> fine until 6.0 came along and re-used BELL for a different character.
> But as far as Unicode is concerned, there isn't a problem, as BEL has no 
> official name.  It is Perl that has persisted in using this old name.  I 
> don't know why Unicode removed the names, and it seems eminently 
> reasonable to give these characters names, but here we are.
> The only option I can think of that doesn't violate our stability 
> policies is, in 5.14, to keep the old meaning of BELL but deprecate it, 
> telling people to use BEL instead, which was added in 5.13 as a synonym. 
> This means that in 5.14 we don't accept that one new Unicode character 
> except by ordinal value.  In 5.16, we switch to the Unicode meaning.
> In the meantime, I will propose that Unicode adopt a policy of not doing 
> this again, and perhaps add an alias with a somewhat different name, 
> just to clear up future confusion.
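
To make the collision in the quoted message concrete, here is a toy model of the lookup it describes, sketched in Python rather than Perl. The tables and the `resolve` helper are hypothetical and hold only the characters needed for the illustration; this is not how charnames is actually implemented.

```python
# Toy model: current Unicode names take priority; the Unicode_1_Name is
# used only for code points that have no current name.
# All tables here are hypothetical, illustrative subsets.

UNICODE_1_NAMES = {0x0007: "BELL"}  # old name of the BEL control character

NAMES_BEFORE = {0x0041: "LATIN CAPITAL LETTER A"}   # before Unicode 6.0
NAMES_6_0 = {**NAMES_BEFORE, 0x1F514: "BELL"}       # 6.0 adds a new BELL


def resolve(name, current_names):
    """Return every code point that `name` could refer to, sorted."""
    hits = {cp for cp, n in current_names.items() if n == name}
    # Fall back to the Unicode 1 name where no current name exists.
    hits |= {
        cp
        for cp, n in UNICODE_1_NAMES.items()
        if n == name and cp not in current_names
    }
    return sorted(hits)


before = resolve("BELL", NAMES_BEFORE)  # one candidate
after = resolve("BELL", NAMES_6_0)      # two candidates: the name is ambiguous

print([f"U+{cp:04X}" for cp in before])
print([f"U+{cp:04X}" for cp in after])
```

With the pre-6.0 table the name resolves to a single code point; once U+1F514 is added, the same name matches two code points, which is the conflict the failing test exposed.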

The attached patches work around this problem by deprecating \N{BELL} 
for 5.14 and giving the control character U+0007 the new name \N{ALERT}. 
The new Unicode 6.0 character named BELL (U+1F514) will have no name in 
Perl.  This means that Perl 5.14 doesn't quite support Unicode 6.0.
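
As a rough illustration of the behaviour described above, the following Python sketch models the 5.14 name table. It is not taken from the patches; `NAME_TO_CP`, `DEPRECATED`, and `via_name` are hypothetical names, and the warning text is invented for the example.

```python
import warnings

BEL_CONTROL = 0x0007   # the control character BEL
BELL_SYMBOL = 0x1F514  # the character Unicode 6.0 names BELL

# Names accepted under the workaround. U+1F514 deliberately has no entry,
# so it can be reached only by ordinal value.
NAME_TO_CP = {
    "ALERT": BEL_CONTROL,  # new name
    "BEL": BEL_CONTROL,    # synonym
    "BELL": BEL_CONTROL,   # old meaning, kept but deprecated
}
DEPRECATED = {"BELL"}


def via_name(name):
    """Return the code point for `name`, or None if the name is unknown."""
    if name in DEPRECATED:
        warnings.warn(
            f"{name} is deprecated; use ALERT or BEL instead",
            DeprecationWarning,
            stacklevel=2,
        )
    return NAME_TO_CP.get(name)


# The new character stays reachable by ordinal value.
new_bell = chr(BELL_SYMBOL)
```

In this model, `via_name("BELL")` still yields 0x0007 but emits a deprecation warning, `via_name("ALERT")` yields 0x0007 silently, and no name maps to U+1F514.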

The patches are also available at:
branch uni6

which includes the entire series of unicode 6 patches.
