Front page | perl.perl5.porters |
Postings from September 2009
RFC: Unicode/Perl name clashes
From: karl williamson
Date: September 1, 2009 10:54
Subject: RFC: Unicode/Perl name clashes
Message ID: 4A9D5F9C.1050204@khwilliamson.com
I stopped talking about this during the push to get 5.10.1 out the door,
but here it is again.
To summarize where we were: there are currently 6 property name clashes
between Perl and Unicode (where a property name in one means something
different in the other), and the new Unicode 5.2 beta introduced
several more. The new ones arose because Perl allows an optional Is_
(with or without the underscore) to precede a property name, and Unicode
proposed a number of properties whose names began with Is_, some of
which conflicted.
I'm happy to report that Unicode backed out the 5.2 conflicts, and
changed the names to something else. (The Is_ names were misleading
anyway, not really describing the underlying property correctly.)
Further, Unicode realizes the inadvisability of ever creating names that
begin with Is_. (It turns out that Perl is not the only one that had
this conflict, and Jarkko and I weren't the only ones who complained.)
I should note that Perl also has introduced the prefix In_ to denote
block properties, and that Unicode could in the future create new
properties that conflict with this. In fact, there are already some
Unicode properties that begin with 'In' (which is indistinguishable from
'In_' in practice), but none of them conflicts with any existing block
name, nor is one likely to. The chances of there ever being a conflict
are small, due to the nature of block names, so I'm just documenting the
theoretical possibility.
In Perl, most people matching Unicode properties would write \p{Foo} or
\p{IsFoo} in regular expressions. Unicode-style would have these both
be \p{Foo=Y}. (The complement would be \p{Foo=N}.) The Perl style only
works for binary properties, those that have only true or false as
possible values. There are many non-binary properties in Unicode, and
Perl currently allows the Unicode style for those it handles. For
example, you can write \p{nt=de} in existing Perls to match characters
that have numeric-type decimal. Script, Block, and General Category
(gc) are also non-binary, but Perl has created binary equivalents for
them, such as \p{Greek}. But you can in existing Perls also say
\p{script=greek} to mean the exact same thing.
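To illustrate what a non-binary property looks like, here is a minimal
sketch in Python (not Perl) that classifies characters by Numeric_Type
using the standard unicodedata module. The helper name numeric_type is
made up for this example, and the answers reflect whatever Unicode
version the Python interpreter bundles, not Perl's tables.

```python
import unicodedata

def numeric_type(ch: str) -> str:
    """Classify ch by Unicode Numeric_Type (hypothetical helper).

    Returns 'Decimal', 'Digit', 'Numeric' or 'None'. The order of the
    checks matters: every decimal is also a digit, and every digit is
    also numeric.
    """
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"
    if unicodedata.digit(ch, None) is not None:
        return "Digit"
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"
    return "None"

# DIGIT SEVEN, ARABIC-INDIC DIGIT THREE, SUPERSCRIPT TWO,
# VULGAR FRACTION ONE HALF, LATIN SMALL LETTER X
for ch in ("7", "\u0663", "\u00b2", "\u00bd", "x"):
    print(f"U+{ord(ch):04X} nt={numeric_type(ch)}")
```

Only the first two samples would be matched by a test for numeric-type
decimal such as \p{nt=de}; a single true/false answer cannot express
the other values.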
To get back to the existing name clashes, the 6 are:
Upper
Lower
Cntrl
Alpha
Space
Decomposition_Type=Compat
I had only brought up the first three so far. The consensus on the
first two, Upper and Lower, was to change Perl's definition to match
Unicode's. It wasn't so clear on Cntrl.
During the hiatus of this thread, I received a personal email I found
persuasive from Jarkko. Here is the relevant portion:
> My opinion would be that if Unicode has a definition "X" or "Xyz", Perl
> should implement that with an identical name \p{X} or \p{Xyz}. In other
> words, Unicode wins. In other words, Perl's definitions should be equal
> to Unicode's. The Perl definitions like "cntrl" (which ultimately came
> from the POSIX named character classes, [[:cntrl:]]) were not necessarily
> fully thought through, and it's admissible to break them. Unicode knows
> better than Perl.
So, I propose to essentially follow his advice.
Here's what that means for each of the clashes:
We had decided that Upper and Lower would change to Unicode's
definitions. The Perl versions are proper subsets of Unicode's, so the
end result is that a few code points are added to the Perl definitions.
For cntrl, the Perl definition is a superset of Unicode's. Beyond
Unicode's, it includes private-use code points, surrogates, and 139
other characters, which I listed in an earlier email. 36 of these are
not deprecated by Unicode. So converting to Unicode's definition will
remove 36 "normal" characters from Perl's definition of cntrl. These
include the soft hyphen, the zero-width join/non-join like characters,
and bi-directional algorithm characters, all of which are considered by
Unicode to not be controls, but to be formatting characters.
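The control versus format distinction can be checked against the
General_Category values. The following is a small sketch in Python
(not Perl) using the standard unicodedata module; it relies on the
fact that 'cntrl' is a Unicode alias for General_Category Cc. The
helper name is_unicode_cntrl and the sample list are mine, chosen for
illustration only.

```python
import unicodedata

def is_unicode_cntrl(ch: str) -> bool:
    """True if ch has General_Category Cc (Unicode's Control/cntrl)."""
    return unicodedata.category(ch) == "Cc"

# A few characters of the kinds mentioned above (illustrative only).
samples = {
    "BELL": "\u0007",                    # Cc, a control
    "SOFT HYPHEN": "\u00ad",             # Cf, a format character
    "ZERO WIDTH NON-JOINER": "\u200c",   # Cf
    "ZERO WIDTH JOINER": "\u200d",       # Cf
    "LEFT-TO-RIGHT MARK": "\u200e",      # Cf, used by the bidi algorithm
    "private use U+E000": "\ue000",      # Co
    "surrogate U+D800": "\ud800",        # Cs
}

for label, ch in samples.items():
    gc = unicodedata.category(ch)
    print(f"{label:24} gc={gc} cntrl={is_unicode_cntrl(ch)}")
```

Of these samples, only BELL counts as a control under Unicode's
definition; the others are the sort of code points that would drop out
of Perl's cntrl after the change.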
For alpha, it is more complicated. Even before getting Jarkko's email,
I had come to the belief that Unicode's was the more "correct"
definition. Both include all Letters, but Perl includes all 'Marks' as
well, many of which really aren't alphas. Unicode instead includes the
'Number Letters', which are numbers that are used more like letters;
Unicode also includes other code points that it has decided by hand
should be considered alpha. Almost all of these are marks. So Perl just
takes all marks, whereas Unicode has gone through them manually,
deciding which should be alpha and which should not.
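The two approaches to alpha can be sketched roughly as follows, in
Python (not Perl) with the standard unicodedata module. Both function
names are hypothetical. The second function is only an approximation
of Unicode's definition: it covers Letters and Number Letters but
leaves out the hand-picked code points, which unicodedata does not
expose.

```python
import unicodedata

def perl_style_alpha(ch: str) -> bool:
    """Letters plus all Marks, as the Perl definition is described above."""
    return unicodedata.category(ch)[0] in ("L", "M")

def unicode_style_alpha_approx(ch: str) -> bool:
    """Letters plus Number Letters (gc=Nl).

    Approximation only: the hand-selected extra code points that
    Unicode also treats as alphabetic are NOT included here.
    """
    cat = unicodedata.category(ch)
    return cat[0] == "L" or cat == "Nl"

# LATIN SMALL LETTER A, COMBINING ACUTE ACCENT, ROMAN NUMERAL FOUR
for ch in ("a", "\u0301", "\u2163"):
    print(f"U+{ord(ch):04X} gc={unicodedata.category(ch)} "
          f"perl={perl_style_alpha(ch)} "
          f"unicode~={unicode_style_alpha_approx(ch)}")
```

The combining accent is a mark, so only the first function accepts it;
the Roman numeral is a Number Letter, so only the second one does.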
For space, it is less complicated. This is a synonym in Unicode for
White Space, and the synonym was not introduced until version 4.1.
Since then, the Perl definition and the Unicode definition match the
exact same code points, so it is not a problem for current releases; nor
is it likely to be a problem going forward, as I think Unicode has said
they don't ever plan to add any space-like code points. (But some of
their stability claims in the past have been misleading, so there is an
outside chance they could diverge; but not worth worrying about now.)
The only reason I classify this as one of the 6 conflicts, and bring it
up here, is because the Perl definition of Space did not match the
Unicode definition of White space in version 3.2 of Unicode. There are
people who want to use Perl with this Unicode release. The difference
is a single code point, and it is a bug in 3.2, fixed in the next
release. Here is a case where Unicode did not know better than Perl. I
therefore propose to leave Perl's definition intact. The implications
of this are that there is no difference for any release but 3.2. If
someone is using 3.2, and they use \p{wspace}, it won't match
identically to \p{space}. Keep in mind that there is no Unicode
\p{space} in that release.
And lastly, there is the decomposition type compat (\p{dt=compat}).
Perl's definition is a superset of Unicode's. Decomposition types can
be split into two major categories: Canonical and Non-canonical. Perl
calls the latter 'compat', whereas Unicode splits that into a number of
subcategories, like <super> to indicate that the code point is a
superscript form of another. Unicode's decomposition type compat means
a non-canonical decomposition that doesn't fit neatly into any of
Unicode's other subcategories. It has been around since version 2.0,
1996. The Camel book says that the Unicode version was going to be
called IsDCcompat, and the Perl version was going to be called
IsDecoCompat (p. 170). But none of the decomposition names it mentions
are currently implemented; perhaps they never were, I don't know. I
also don't know how compat came to mean both these things, but it does:
the table that is generated for Perl to use is a malformed combination
of both. I haven't tested it, but it's likely that there are cases
where it gives the wrong results for either definition. I propose to
withdraw Perl's definition in favor of
Unicode's. If we want something that matches all the non-canonical
decompositions, it is trivial to create a new table with a new name.
Suggestions are welcome. Here are a couple: AnyCompat, NonCanonical.
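To make the two meanings of compat concrete, here is a minimal sketch
in Python (not Perl) using the standard unicodedata module, whose
decomposition() function returns the mapping with its <tag> prefix, if
any. All three function names are hypothetical, and this is not how
Perl builds its tables.

```python
import unicodedata

def decomposition_type(ch: str) -> str:
    """Return 'none', 'canonical', or the tag name such as 'super'."""
    decomp = unicodedata.decomposition(ch)
    if not decomp:
        return "none"            # no decomposition mapping at all
    first = decomp.split()[0]
    if first.startswith("<"):
        return first.strip("<>") # tagged, hence non-canonical
    return "canonical"           # untagged mappings are canonical

def is_noncanonical(ch: str) -> bool:
    """The broad meaning: any tagged (non-canonical) decomposition."""
    return decomposition_type(ch) not in ("none", "canonical")

def is_unicode_compat(ch: str) -> bool:
    """The narrow Unicode meaning: the tag is literally <compat>."""
    return decomposition_type(ch) == "compat"

# LATIN SMALL LETTER E WITH ACUTE, SUPERSCRIPT TWO,
# LATIN SMALL LIGATURE FI, LATIN CAPITAL LETTER A
for ch in ("\u00e9", "\u00b2", "\ufb01", "A"):
    print(f"U+{ord(ch):04X} dt={decomposition_type(ch)} "
          f"noncanonical={is_noncanonical(ch)} "
          f"compat={is_unicode_compat(ch)}")
```

SUPERSCRIPT TWO is the case where the two meanings part ways: it has a
non-canonical decomposition, but its tag is <super>, not <compat>.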