develooper Front page | perl.perl5.porters | Postings from September 2009

RFC: Unicode/Perl name clashes

karl williamson
September 1, 2009 10:54
I stopped talking about this during the push to get 5.10.1 out the door, 
but here it is again.

To summarize where we were:  There are currently 6 property name clashes
between Perl and Unicode (where a property name in one means something
different in the other), and the new Unicode 5.2 beta introduced
several more.  The latter arose because Perl allows an optional Is_ (with
or without the underscore) to precede a property name, and Unicode
proposed a number of properties that began with Is_, some of which 

I'm happy to report that Unicode backed out the 5.2 conflicts, and 
changed the names to something else.  (The Is_ names were misleading 
anyway, not really describing the underlying property correctly.) 
Further, Unicode now recognizes the inadvisability of ever creating names
that begin with Is_.  (It turns out that Perl is not the only one that had
this conflict, and Jarkko and I weren't the only ones who complained.)
I should note that Perl also has introduced the prefix In_ to denote
block properties, and that Unicode could in the future create new
properties that conflict with this.  There are already, in fact, some
Unicode properties that begin with 'In' (which is indistinguishable from
'In_' in practice), but none of them conflicts with any existing block
name, nor is one ever likely to.  The chances of there ever being a
conflict are small, due to the nature of block names; so I'm just
documenting the theoretical possibility.

In Perl, most people matching Unicode properties would write \p{Foo} or 
\p{IsFoo} in regular expressions.  Unicode-style would have these both 
be \p{Foo=Y}.  (The complement would be \p{Foo=N}.)  The Perl style only 
works for binary properties, those that have only true or false as 
possible values.  There are many non-binary properties in Unicode, and 
Perl currently allows the Unicode style for those it handles.  For 
example, you can write \p{nt=de} in existing Perls to match characters 
that have numeric-type decimal.  Script, Block, and General Category 
(gc) are also non-binary, but Perl has created binary equivalents for 
them, such as \p{Greek}.  But you can in existing Perls also say 
\p{script=greek} to mean the exact same thing.
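
A minimal sketch of the forms described above, assuming a Perl recent
enough to accept all of them (the sample characters and the variable
names @checks and @results are chosen only for illustration):

```perl
use strict;
use warnings;

# Each entry: label, sample character, compiled pattern.
my @checks = (
    [ '\p{Alpha}',        'A',        qr/\p{Alpha}/        ],  # Perl style
    [ '\p{IsAlpha}',      'A',        qr/\p{IsAlpha}/      ],  # optional Is prefix
    [ '\p{Alpha=Y}',      'A',        qr/\p{Alpha=Y}/      ],  # Unicode style
    [ '\p{Alpha=N}',      '7',        qr/\p{Alpha=N}/      ],  # the complement
    [ '\p{nt=de}',        '7',        qr/\p{nt=de}/        ],  # non-binary property
    [ '\p{Greek}',        "\x{3B1}",  qr/\p{Greek}/        ],  # binary equivalent
    [ '\p{script=greek}', "\x{3B1}",  qr/\p{script=greek}/ ],  # compound form
);

# 1 if the sample character matches the pattern, 0 otherwise.
my @results = map { $_->[1] =~ $_->[2] ? 1 : 0 } @checks;

printf "%-18s %s\n", $checks[$_][0], $results[$_] ? 'matches' : 'no match'
    for 0 .. $#checks;
```

Here "\x{3B1}" is GREEK SMALL LETTER ALPHA and '7' stands in for a
character of numeric-type decimal.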

To get back to the existing name clashes.  The 6 are:

   Upper
   Lower
   Cntrl
   Alpha
   Space
   Compat (the decomposition type, \p{dt=compat})

I had only brought up the first three so far.  The consensus on the 
first two, Upper and Lower, was to change Perl's definition to match 
Unicode's.  It wasn't so clear on Cntrl.

During the hiatus of this thread, I received a personal email from
Jarkko that I found persuasive.  Here is the relevant portion:

 > My opinion would be that if Unicode has a definition "X" or "Xyz", Perl
 > should implement that with an identical name \p{X} or \p{Xyz}.  In other
 > words, Unicode wins.  In other words, Perl's definitions should be equal
 > to Unicode's.  The Perl definitions like "cntrl" (which ultimately came
 > from the POSIX named character classes, [[:cntrl:]]) were not necessarily
 > fully thought through, and it's admissible to break them.  Unicode knows
 > better than Perl.

So, I propose to essentially follow his advice.

Here's what that means for each of the clashes:

We had decided that Upper and Lower would change to Unicode's.  The Perl
versions were proper subsets of Unicode's, so the end result is that a
few code points that aren't already there are added to the Perl
definitions.
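
To illustrate the kind of code point involved, here is a small sketch
that assumes a Perl in which \p{Upper} and \p{Lower} already follow
Unicode's definitions.  The general categories \p{Lu} and \p{Ll} are
used only as a narrower point of comparison; that they approximate the
older Perl definitions is an assumption, not something stated above.

```perl
use strict;
use warnings;

my $roman_upper = "\x{2160}";   # ROMAN NUMERAL ONE (a letter number)
my $roman_lower = "\x{2170}";   # SMALL ROMAN NUMERAL ONE (a letter number)

my %result = (
    upper_property => ($roman_upper =~ /\p{Upper}/ ? 1 : 0),
    upper_category => ($roman_upper =~ /\p{Lu}/    ? 1 : 0),
    lower_property => ($roman_lower =~ /\p{Lower}/ ? 1 : 0),
    lower_category => ($roman_lower =~ /\p{Ll}/    ? 1 : 0),
);

# Unicode treats both as cased, although neither is a cased letter
# by general category.
printf "%-15s %d\n", $_, $result{$_} for sort keys %result;
```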

For cntrl, the Perl definition is a superset of Unicode's.  Beyond
Unicode's, it includes private use, surrogates, and 139 other
characters, which I listed in an earlier email.  36 of these are not
deprecated by Unicode.  So converting to use Unicode's definition will
remove 36 "normal" characters from Perl's definition of cntrl.  These
include the soft hyphen, the zero-width joiner/non-joiner and similar
characters, and bi-directional algorithm characters, all of which
Unicode considers to be formatting characters rather than controls.
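
A short sketch of the distinction, assuming a Perl in which \p{Cntrl}
follows Unicode's definition; \p{Cf} is the general category Format.
The sample list and the hash name %class are illustrative only.

```perl
use strict;
use warnings;

# Sample code points: one control, three of the formatting characters.
my @samples = (
    [ 'U+0007 BEL',                 "\x{07}"   ],
    [ 'U+00AD SOFT HYPHEN',         "\x{AD}"   ],
    [ 'U+200D ZERO WIDTH JOINER',   "\x{200D}" ],
    [ 'U+200E LEFT-TO-RIGHT MARK',  "\x{200E}" ],
);

my %class;
for my $s (@samples) {
    my ($name, $ch) = @$s;
    $class{$name} = $ch =~ /\p{Cntrl}/ ? 'control'
                  : $ch =~ /\p{Cf}/    ? 'format'
                  :                      'other';
    printf "%-28s %s\n", $name, $class{$name};
}
```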

For alpha, it is more complicated.  Even before getting Jarkko's email, 
I had come to the belief that Unicode's was the more "correct" 
definition.  Both include all Letters, but Perl includes all 'Marks' as 
well, many of which really aren't alphas.  Unicode instead includes the
'Letter Numbers', which are numbers that are used more as letters;
Unicode also includes other code points that they have decided by hand
should be considered to be alpha.  Almost all of these are marks.  So
Perl just takes all marks, and Unicode has gone through them manually,
deciding which should be alpha and which not.
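
The difference can be sketched with a few sample code points, assuming
a Perl in which \p{Alpha} follows Unicode's definition.  The two marks
were picked for illustration: one that Unicode counts as alphabetic and
one that it does not.

```perl
use strict;
use warnings;

my @samples = (
    [ 'U+0041 LATIN CAPITAL LETTER A',         "\x{41}"   ],
    [ 'U+2160 ROMAN NUMERAL ONE',              "\x{2160}" ],  # letter number
    [ 'U+0345 COMBINING GREEK YPOGEGRAMMENI',  "\x{345}"  ],  # mark
    [ 'U+0301 COMBINING ACUTE ACCENT',         "\x{301}"  ],  # mark
);

my (%is_alpha, %is_mark);
for my $s (@samples) {
    my ($name, $ch) = @$s;
    $is_alpha{$name} = $ch =~ /\p{Alpha}/ ? 1 : 0;
    $is_mark{$name}  = $ch =~ /\p{M}/     ? 1 : 0;
    printf "%-40s alpha=%d mark=%d\n", $name, $is_alpha{$name}, $is_mark{$name};
}
```

A definition built from "all Letters plus all Marks" would accept both
combining characters, whereas the hand-picked Unicode list accepts only
some marks.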

For space, it is less complicated.  This is a synonym in Unicode for 
White Space, and the synonym was not introduced until version 4.1. 
Since then, the Perl definition and the Unicode definition match the 
exact same code points, so it is not a problem for current releases; nor 
is it likely to be a problem going forward, as I think Unicode has said 
they don't ever plan to add any space-like code points.  (But some of 
their stability claims in the past have been misleading, so there is an 
outside chance they could diverge; but not worth worrying about now.) 
The only reason I classify this as one of the 6 conflicts, and bring it
up here, is that the Perl definition of Space did not match the
Unicode definition of White Space in version 3.2 of Unicode.  There are
people who want to use Perl with this Unicode release.  The difference 
is a single code point, and it is a bug in 3.2, fixed in the next 
release.  Here is a case where Unicode did not know better than Perl.  I 
therefore propose to leave Perl's definition intact.  The implications 
of this are that there is no difference for any release but 3.2.  If 
someone is using 3.2, and they use \p{wspace}, it won't match 
identically to \p{space}.  Keep in mind that there is no Unicode 
\p{space} in that release.
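
One way to check whether the two agree on a given installation is to
compare them over every code point.  This is only a sketch; it assumes
the Perl in use accepts both \p{Space} and \p{WSpace}, and the array
name @differ is illustrative.

```perl
use strict;
use warnings;

# Collect every code point on which the two properties disagree.
my @differ;
for my $cp (0 .. 0x10FFFF) {
    next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
    my $ch     = chr $cp;
    my $space  = $ch =~ /\p{Space}/  ? 1 : 0;
    my $wspace = $ch =~ /\p{WSpace}/ ? 1 : 0;
    push @differ, sprintf("U+%04X", $cp) if $space != $wspace;
}

print @differ
    ? "Space and WSpace differ at: @differ\n"
    : "Space and WSpace match the same code points\n";
```

With tables built from a current Unicode release the list should be
empty; per the discussion above, a difference would be expected only
with the 3.2 tables.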

And lastly, there is the decomposition type compat (\p{dt=compat}).
Perl's definition is a superset of Unicode's.  Decomposition types can 
be split into two major categories: Canonical and Non-canonical.  Perl 
calls the latter 'compat', whereas Unicode splits that into a number of 
subcategories, like <super> to indicate that the code point is a 
superscript form of another.  Unicode's decomposition type compat means 
one of the non-canonical decompositions that doesn't fit neatly into one 
of the other categories for them.  It has been around since version 2.0, 
1996.  The Camel book says that the Unicode version was going to be 
called IsDCcompat, and the Perl version was going to be called 
IsDecoCompat (p. 170.)  But none of the decomposition names it said are 
currently implemented; perhaps never were, I don't know.  I also don't 
know how it came to be that compat came to mean both these things, but 
it does, as the table that is generated for perl to use is a combination 
of both, screwed-up, in improper form.  I haven't tested it, but it's 
likely that there are cases where it gives the wrong results, for either 
definition.  I propose to withdraw Perl's definition in favor of 
Unicode's.  If we want something that matches all the non-canonical 
decompositions, it is trivial to create a new table with a new name. 
Suggestions are welcome.  Here's a couple: AnyCompat, NonCanonical
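
As a rough sketch of what such a table would match, assuming a Perl in
which \p{dt=compat} has Unicode's narrower meaning, the union of all
non-canonical decompositions can be approximated as "has a
decomposition, but not a canonical one".  The name $non_canonical is
hypothetical and stands in for whatever name is eventually chosen.

```perl
use strict;
use warnings;

# Anything whose decomposition type is neither None nor Canonical.
my $non_canonical = qr/[^\p{dt=None}\p{dt=Canonical}]/;

my @samples = (
    [ 'U+0041 LATIN CAPITAL LETTER A',    "\x{41}" ],  # no decomposition
    [ 'U+00C0 LATIN CAPITAL A WITH GRAVE', "\x{C0}" ],  # canonical
    [ 'U+00A8 DIAERESIS',                 "\x{A8}" ],  # <compat>
    [ 'U+00B2 SUPERSCRIPT TWO',           "\x{B2}" ],  # <super>
);

my (%compat, %noncanon);
for my $s (@samples) {
    my ($name, $ch) = @$s;
    $compat{$name}   = $ch =~ /\p{dt=Compat}/ ? 1 : 0;
    $noncanon{$name} = $ch =~ $non_canonical  ? 1 : 0;
    printf "%-36s dt=compat:%d non-canonical:%d\n",
        $name, $compat{$name}, $noncanon{$name};
}
```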
