develooper Front page | perl.perl5.porters | Postings from May 2009

RFC: \p{Upper} and \p{Lower}

From: Karl Williamson
Date: May 30, 2009 11:15
Subject: RFC: \p{Upper} and \p{Lower}
Message ID: 4A217771.60800@khwilliamson.com

Perl defines a number of \p{} properties for regular expression matching 
outside of the ones defined by Unicode.  In the current Unicode release, 
8 of these Perl definitions have the same names as Unicode properties, 
and thus there is a naming conflict that should be considered, and a 
decision made as to what to do.

Actually 2 of the Perl ones are defined to be identical to the 
respective Unicode ones, and so there is no conflict.  (It may even be 
that the Unicode consortium saw our names, liked them and copied them :-))

That leaves 6.  The circumstances vary for these, so I'll send them out 
in different emails.  This email is about Upper and Lower, which have 
similar circumstances.

First, it is unclear to me whether the Unicode property names are 
normative or informative.  (Normative means that any implementation claiming
to implement Unicode has to abide by them; informative means that following
them is strongly recommended but not required.)  The latest draft document
for Unicode 5.2 says that they are normative; other documentation
indicates that at least some are informative.

My understanding is that these Perl properties were originally designed
as replacements for [[:upper:]] et al. that extended to Unicode.
It is unclear to me what the current implications of this might be.

\p{Upper} in Perl is a synonym for \p{Lu}, or Uppercase_Letter.  The
Unicode Upper property is a superset of this, and a synonym for the
property named just "Uppercase", which includes not only letters, but
also code points that match the property Other_Uppercase.  From what I
understand, this latter property is hand-populated to include other
characters the Consortium believes should also be treated as uppercase,
besides those whose general category is Uppercase_Letter.  In Unicode 5.1,
it comprises 42 code points: the capital Roman numerals and the capital
letters enclosed in circles.
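
To make the difference concrete, here is a minimal sketch that spot-checks
three code points against \p{Lu} and the Unicode \p{Uppercase} property.  It
assumes a perl recent enough to know \p{Uppercase}; the helper name classify
is made up for illustration.  It deliberately avoids \p{Upper} itself, since
what that matches is the question at hand.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether one code point matches
# \p{Lu} (General_Category=Uppercase_Letter) and the Unicode
# binary property \p{Uppercase}.
sub classify {
    my ($cp) = @_;
    my $ch = chr $cp;
    return {
        Lu        => ($ch =~ /\A\p{Lu}\z/        ? 1 : 0),
        Uppercase => ($ch =~ /\A\p{Uppercase}\z/ ? 1 : 0),
    };
}

# U+0041 LATIN CAPITAL LETTER A
# U+2160 ROMAN NUMERAL ONE
# U+24B6 CIRCLED LATIN CAPITAL LETTER A
for my $cp (0x0041, 0x2160, 0x24B6) {
    my $r = classify($cp);
    printf "U+%04X  Lu=%d  Uppercase=%d\n", $cp, $r->{Lu}, $r->{Uppercase};
}
```

U+0041 matches both properties.  U+2160 and U+24B6 are not letters by
general category, so they match only \p{Uppercase}.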

I believe that the Perl definition of Upper can be considered a bug, and
should be changed to match the Unicode definition.  This would expand
what \p{Upper} matches by those 42 code points, and would contract what
\P{Upper} matches by the same 42 code points.

Similarly for \p{Lower}.  I believe it should match the Unicode Lower
property, which adds in 159 code points, including small Roman numerals
and circled letters, and also some subscripts and "modifier letters".
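
For anyone who wants to see exactly which code points would be affected, the
following sketch enumerates everything that matches a wider property but not
a narrower one, here \p{Lowercase} versus \p{Ll}.  The function name
only_in_wide is hypothetical, and the count it prints depends on the Unicode
version bundled with the perl that runs it, so it will not necessarily be the
Unicode 5.1 figure quoted above.

```perl
use strict;
use warnings;
no warnings 'utf8';    # precaution: some perls warn about noncharacters

# Hypothetical helper: return the code points matched by $wide
# but not by $narrow (both are precompiled single-character patterns).
sub only_in_wide {
    my ($narrow, $wide) = @_;
    my @extra;
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        my $ch = chr $cp;
        push @extra, $cp if $ch =~ $wide && $ch !~ $narrow;
    }
    return @extra;
}

my @extra = only_in_wide(qr/\p{Ll}/, qr/\p{Lowercase}/);
printf "%d code points are Lowercase but not Ll\n", scalar @extra;
```

Passing qr/\p{Lu}/ and qr/\p{Uppercase}/ instead gives the corresponding list
for Upper.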

Any comments?
