perl.perl5.porters | Postings from June 2009

RFC: Unicode 5.2 name clashes

From: karl williamson
Date: June 5, 2009 10:04
Subject: RFC: Unicode 5.2 name clashes
Message ID: 4A295017.1080405@khwilliamson.com

In Unicode 5.1, there are 6 problematic name clashes between
Perl-defined properties and Unicode ones.  (So far, I have asked for
comments on what to do about two of them, and the result was to drop the
Perl version in favor of the Unicode version.  However, I expect that
this will not be the answer for all of the remaining 4, which I will
post about soon.)

In (still in-draft) Unicode 5.2, there are several more clashes, and
they indicate a trend that I have concluded must be dealt with in a more
than piece-meal fashion.

First, some background.  Currently, all Unicode properties are specified
by Unicode as a pair: the property name and a "property value".  So for
example, there is a Numeric_Type property which has 4 values: None,
Decimal, Digit, and Numeric.  (There are also abbreviations for 3 of
them.)  Every Unicode code point is placed into exactly one of those 4
categories.  Although accurate, the term "property value" confuses me,
so I call them subproperties.  (If you can think of a better name,
please let me know.  'category' would work for this example, but not for
all properties.)
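
To make the name/value pairing concrete, here is a minimal sketch that
tests one character against each of the four Numeric_Type values.  It
assumes a Perl that accepts the compound form for this property; the
helper name numeric_type is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report which Numeric_Type value a single character
# has by trying each of the four values.  Exactly one should match.
sub numeric_type {
    my ($char) = @_;
    my @hits;
    push @hits, 'Decimal' if $char =~ /\A\p{Numeric_Type: Decimal}\z/;
    push @hits, 'Digit'   if $char =~ /\A\p{Numeric_Type: Digit}\z/;
    push @hits, 'Numeric' if $char =~ /\A\p{Numeric_Type: Numeric}\z/;
    push @hits, 'None'    if $char =~ /\A\p{Numeric_Type: None}\z/;
    die "expected exactly one value, got: @hits" unless @hits == 1;
    return $hits[0];
}

# DIGIT FIVE, SUPERSCRIPT TWO, VULGAR FRACTION ONE HALF, LATIN CAPITAL A
for my $char ("5", "\x{B2}", "\x{BD}", "A") {
    printf "U+%04X => %s\n", ord($char), numeric_type($char);
}
```

The die inside the helper is the "exactly one of those 4 categories"
claim expressed as a check.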

This is not the way Perl treats many of the Unicode properties.  I 
haven't delved too much into the history, but it looks like Unicode 
evolved in a way that Perl did not foresee.  (And some of the problems 
with the current mktables stem from trying to bridge this difference 
piece-meal.)

In Unicode, to match a binary property, one would say (in Perl syntax) 
\p{property: y}.  The complement would be \p{property: n}.  (One can 
also use t and f instead of y and n, or spell them out, or add 
translations for different languages.)  In Perl-style, one says simply 
\p{property} or \P{property}.
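
As a rough illustration of the two styles side by side, the sketch below
asks the same question about the Uppercase property in several
spellings.  It assumes a Perl that accepts the Unicode-style compound
form for binary properties; the two helper names are invented for this
example.

```perl
use strict;
use warnings;

# Hypothetical helper: every spelling here asks "is this Uppercase?".
sub is_upper_forms {
    my ($char) = @_;
    return (
        ($char =~ /\A\p{Uppercase: y}\z/   ? 1 : 0),   # Unicode-style
        ($char =~ /\A\p{Uppercase: t}\z/   ? 1 : 0),   # t instead of y
        ($char =~ /\A\p{Uppercase: Yes}\z/ ? 1 : 0),   # spelled out
        ($char =~ /\A\p{Uppercase}\z/      ? 1 : 0),   # Perl-style
    );
}

# Hypothetical helper: every spelling here asks for the complement.
sub not_upper_forms {
    my ($char) = @_;
    return (
        ($char =~ /\A\p{Uppercase: n}\z/ ? 1 : 0),     # Unicode-style
        ($char =~ /\A\p{Uppercase: f}\z/ ? 1 : 0),     # f instead of n
        ($char =~ /\A\P{Uppercase}\z/    ? 1 : 0),     # Perl-style
    );
}

for my $char ("A", "a", "7") {
    printf "%s: upper=[%s] not-upper=[%s]\n", $char,
        join(",", is_upper_forms($char)),
        join(",", not_upper_forms($char));
}
```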

Also, in Perl, one can preface the binary property name with 'Is_', so 
\p{isproperty} is the same as \p{property}.  (Keep in mind that case is 
ignored, as are interior underscores.)
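
A small sketch of that equivalence, assuming a Perl in which
'Is_Uppercase' still resolves to the Uppercase property: it counts the
code points in a range on which any of the spellings disagree.  The
range and the name disagreements are arbitrary choices for the example.

```perl
use strict;
use warnings;

# Spellings that should all name the same binary property, differing only
# in the optional 'Is' prefix, letter case, and interior underscores.
my @spellings = (
    qr/\A\p{Uppercase}\z/,
    qr/\A\p{Is_Uppercase}\z/,
    qr/\A\p{IsUppercase}\z/,
    qr/\A\p{isuppercase}\z/,
    qr/\A\p{Upper_Case}\z/,
);

# Hypothetical helper: number of code points in [$first, $last] for which
# the spellings do not all give the same answer.
sub disagreements {
    my ($first, $last) = @_;
    my $count = 0;
    for my $cp ($first .. $last) {
        my $char    = chr $cp;
        my @answers = map { $char =~ $_ ? 1 : 0 } @spellings;
        $count++ if grep { $_ != $answers[0] } @answers;
    }
    return $count;
}

print "code points with disagreement in U+0000..U+024F: ",
    disagreements(0, 0x24F), "\n";
```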

And herein lies the problem.  Unicode has long had properties that begin
with the letters 'is', but in 5.2 (draft) four new ones create clashes.
For example, there (likely) will be an 'Is_Uppercase' property.  But
Perl already says that 'Is_Uppercase' means the 'Uppercase' property
(which is entirely different from the new 'Is_Uppercase' property).

(Clashing isn't entirely new.  Perl has had to deal with this in the
past, when blocks and scripts began to be distinguished but shared the
same names.  In Unicode-style, one writes \p{Block: Thai} and
\p{Script: Thai} and there is no ambiguity.  Perl originally had just
\p{Thai} or \p{Is_Thai}, but when that became ambiguous (I'm somewhat
guessing at the history here), Perl solved it by saying that \p{Thai}
meant the script, and that if you wanted the block you would say
\p{In_Thai}.  I've always had to look up which meant which, because the
'In_' meaning wasn't intuitive to me.  Perl also allows you to omit the
'In_' when there is no current ambiguity, but that is scary because
ambiguity might be introduced in future Unicode releases, and your
program will end up matching something different than you originally
intended.  Although I don't believe it is documented, Perl currently
also lets you use Unicode-style, \p{Block: Thai} and \p{Script: Thai}.)
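
The block/script distinction can be sketched with two characters from
the Thai block, one of which is not in the Thai script.  This is only an
illustration; the helper name describe is made up, and how the bare
\p{Thai} form is resolved may differ between Perl versions (the two
sample characters are chosen so that it should not matter).

```perl
use strict;
use warnings;

my $ko_kai = "\x{0E01}";  # THAI CHARACTER KO KAI: Thai script, Thai block
my $baht   = "\x{0E3F}";  # THAI CURRENCY SYMBOL BAHT: Thai block, but its
                          # script is Common, not Thai

# Hypothetical helper: test one character against both spellings of the
# script and both spellings of the block.
sub describe {
    my ($char) = @_;
    return {
        script_unicode => ($char =~ /\A\p{Script: Thai}\z/ ? 1 : 0),
        script_perl    => ($char =~ /\A\p{Thai}\z/         ? 1 : 0),
        block_unicode  => ($char =~ /\A\p{Block: Thai}\z/  ? 1 : 0),
        block_perl     => ($char =~ /\A\p{In_Thai}\z/      ? 1 : 0),
    };
}

for my $char ($ko_kai, $baht) {
    my $d = describe($char);
    printf "U+%04X: %s\n", ord($char),
        join(", ", map { "$_=$d->{$_}" } sort keys %$d);
}
```

The compound forms name the property explicitly, so they should keep
their meaning even if a later release adds another property with a value
called Thai.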

My proposal for dealing with the new clashes is an extension of this (I
think undocumented) capability.  I propose to accept the Unicode syntax
not just for blocks and scripts, but for all properties usable in \p{}
constructs.  A program that uses the Unicode syntax, like
\p{Uppercase: y} and \p{Is_Uppercase: y}, would be guaranteed never to
have conflicts in any future release.

I am less certain as to what to do about backwards compatibility.  I
suspect that we have to let \p{Is_Uppercase} keep meaning what it
currently means.  And that means that to match the new property, you
have to use Unicode-style \p{Is_Uppercase: y} (or maybe even
\p{Is_Is_Uppercase}, though I don't like this).  So my general proposal
would be that a new property that clashes with an existing Perl shortcut
can only be accessed through the Unicode-style syntax, while a new
property that doesn't clash would also allow the Perl style.

Please comment.
