RFC: Unicode 5.2 name clashes
From:
karl williamson
Date:
June 5, 2009 10:04
Subject:
RFC: Unicode 5.2 name clashes
Message ID:
4A295017.1080405@khwilliamson.com
In Unicode 5.1, there are 6 problematic name clashes between
Perl-defined properties and Unicode ones. (So far, I have asked for comments
on what to do about two of them, and the result was to drop the Perl
version in favor of the Unicode version. However, I expect that this
will not be the answer for all of the remaining 4, which I will post on
soon.)
In (still in-draft) Unicode 5.2, there are several more clashes, and
they indicate a trend that I have concluded must be solved in more than
a piecemeal fashion.
First, some background. Currently, all Unicode properties are specified
by Unicode as a pair: the property name and a "property value". So
for example, there is a Numeric_Type property which has 4 values: None,
Decimal, Digit, and Numeric. (There are also abbreviations for 3 of
them.) Every Unicode code point is placed into exactly one of those 4
categories. Although accurate, the term "property value" confuses me,
so I call them subproperties. (If you can think of a better name,
please let me know. 'category' would work for this example, but not in
all.)
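As a minimal sketch of the property/value pairing, the following
classifies a character by trying each of the four Numeric_Type values
in turn. It assumes a perl that accepts the Unicode-style
"property: value" form for Numeric_Type; the helper name numeric_type
and the four sample code points are illustrative choices, not anything
defined by Perl or Unicode.

```perl
use strict;
use warnings;

# Hypothetical helper: report which Numeric_Type value a single
# character has, by testing the four values one after another.
# Assumes this perl accepts \p{Numeric_Type: ...}.
sub numeric_type {
    my ($char) = @_;
    return 'Decimal' if $char =~ /\A\p{Numeric_Type: Decimal}\z/;
    return 'Digit'   if $char =~ /\A\p{Numeric_Type: Digit}\z/;
    return 'Numeric' if $char =~ /\A\p{Numeric_Type: Numeric}\z/;
    return 'None'    if $char =~ /\A\p{Numeric_Type: None}\z/;
    return;    # not expected: every code point has one of the four
}

# DIGIT SEVEN, SUPERSCRIPT TWO, VULGAR FRACTION ONE HALF, LATIN A
for my $cp (0x37, 0xB2, 0xBD, 0x41) {
    printf "U+%04X => %s\n", $cp, numeric_type(chr $cp);
}
```

Because the four values partition the code points, exactly one of the
four tests succeeds for any given character.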
This is not the way Perl treats many of the Unicode properties. I
haven't delved too much into the history, but it looks like Unicode
evolved in a way that Perl did not foresee. (And some of the problems
with the current mktables stem from trying to bridge this difference
piece-meal.)
In Unicode, to match a binary property, one would say (in Perl syntax)
\p{property: y}. The complement would be \p{property: n}. (One can
also use t and f instead of y and n, or spell them out, or add
translations for different languages.) In Perl-style, one says simply
\p{property} or \P{property}.
Also, in Perl, one can preface the binary property name with 'Is_', so
\p{isproperty} is the same as \p{property}. (Keep in mind that case is
ignored, as are interior underscores.)
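The Perl-style spellings described above can be sketched as follows,
using Uppercase as the binary property. The helper name
upper_spellings is made up for the illustration, and the sketch
assumes a perl whose tables accept all of these spellings.

```perl
use strict;
use warnings;

# Hypothetical helper: test one character against several Perl-style
# spellings of the binary Uppercase property and return the results
# as a string of 1s and 0s.
sub upper_spellings {
    my ($char) = @_;
    my @results = (
        ($char =~ /\p{Uppercase}/)    ? 1 : 0,   # plain name
        ($char =~ /\p{Is_Uppercase}/) ? 1 : 0,   # with the 'Is_' prefix
        ($char =~ /\p{isuppercase}/)  ? 1 : 0,   # case and '_' ignored
        ($char =~ /\P{Uppercase}/)    ? 0 : 1,   # \P is the complement
    );
    return join '', @results;
}

print "A: ", upper_spellings('A'), "\n";
print "a: ", upper_spellings('a'), "\n";
```

All four columns agree for any single character, which is exactly why
a Unicode property whose real name starts with 'Is' has nowhere to go
in this scheme.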
And herein lies the problem. Unicode has long had properties that begin
with the letters 'is', but in 5.2 (draft) four new ones create clashes.
For example there (likely) will be an 'Is_Uppercase' property. But
Perl already says that 'Is_Uppercase' means the 'Uppercase' property
(which is entirely different from the new 'Is_Uppercase' property).
(Clashing isn't entirely new. Perl has had to deal with this in the
past when blocks and scripts began to be distinguished and have the same
names. In Unicode-style, one writes \p{Block: Thai} and \p{Script:
Thai} and there is no ambiguity. Perl originally had just \p{Thai} or
\p{Is_Thai}, but when that became ambiguous (I'm somewhat guessing at
the history here), Perl solved it by saying that \p{Thai} meant the
script, and that if you wanted the block you would say \p{In_Thai}.
(I've always had to look up which meant which, because the 'In_' meaning
wasn't intuitive to me.) (Perl also allows you to omit the 'In_' when
there is no current ambiguity, but that is scary because ambiguity might
be introduced in future Unicode releases, and your program will end up
matching something different than you originally intended.) Although I
don't believe it is documented, Perl currently also lets you use
Unicode-style, \p{Block: Thai} and \p{Script: Thai}.)
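To make the block/script distinction concrete, here is a small sketch
that tries all four spellings on one character. It assumes a perl that
accepts each of them; the helper name thai_matches is hypothetical,
and U+0E3F (the baht sign) is picked as a test character because it
lies in the Thai block while its script is Common.

```perl
use strict;
use warnings;

# Hypothetical helper: report how one character fares against the
# script and block forms, in both Perl style and Unicode style.
sub thai_matches {
    my ($char) = @_;
    return {
        script_perl    => ($char =~ /\p{Thai}/)         ? 1 : 0,
        script_unicode => ($char =~ /\p{Script: Thai}/) ? 1 : 0,
        block_perl     => ($char =~ /\p{In_Thai}/)      ? 1 : 0,
        block_unicode  => ($char =~ /\p{Block: Thai}/)  ? 1 : 0,
    };
}

# THAI CHARACTER KO KAI, THAI CURRENCY SYMBOL BAHT
for my $cp (0x0E01, 0x0E3F) {
    my $m = thai_matches(chr $cp);
    printf "U+%04X script=%d/%d block=%d/%d\n", $cp,
        $m->{script_perl}, $m->{script_unicode},
        $m->{block_perl},  $m->{block_unicode};
}
```

The Unicode-style forms name the property explicitly, so there is
nothing to look up and nothing that a later release can make ambiguous.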
My proposal for dealing with the new clashes is an extension of this
(I think undocumented) capability. I propose to accept the Unicode
syntax not just for blocks and scripts, but for all properties usable in
\p{} constructs. A program that uses the Unicode syntax, like
\p{Uppercase: y} and \p{Is_Uppercase: y}, would be guaranteed never to
have conflicts in any future release.
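A sketch of what the proposed syntax would look like for an existing
binary property, assuming a perl that accepts the "property: value"
form for Uppercase. The helper name upper_unicode_style is
hypothetical. The draft Is_Uppercase property is deliberately left out,
since it is not certain to exist under that name.

```perl
use strict;
use warnings;

# Hypothetical helper: test one character against the Unicode-style
# Y and N values of Uppercase, and against the Perl-style shortcut.
# Assumes this perl accepts \p{Uppercase: Y} and \p{Uppercase: N}.
sub upper_unicode_style {
    my ($char) = @_;
    return {
        yes      => ($char =~ /\p{Uppercase: Y}/) ? 1 : 0,
        no       => ($char =~ /\p{Uppercase: N}/) ? 1 : 0,
        shortcut => ($char =~ /\p{Uppercase}/)    ? 1 : 0,
    };
}

for my $char ('A', 'a') {
    my $r = upper_unicode_style($char);
    print "$char: Y=$r->{yes} N=$r->{no} shortcut=$r->{shortcut}\n";
}
```

The Y form and the shortcut agree, and the N form is the complement, so
a program can be converted to the explicit spelling without changing
what it matches.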
I am less certain as to what to do about backwards compatibility. I
suspect that we have to let \p{Is_Uppercase} keep meaning what it
currently means. And that means that to match the new property, you
have to use Unicode-style \p{Is_Uppercase: y} (or maybe even
\p{Is_Is_Uppercase}, though I don't like this). So my general proposal
is that a new property that clashes with an existing Perl shortcut can
be accessed only through the Unicode-style syntax, while a new property
that doesn't clash would also allow the Perl style.
Please comment.