Front page | perl.perl5.porters |
Postings from May 2009
RFC: \p{Upper} and \p{Lower}
From: Karl Williamson
Date: May 30, 2009 11:15
Subject: RFC: \p{Upper} and \p{Lower}
Message ID: 4A217771.60800@khwilliamson.com
Perl defines a number of \p{} properties for regular expression matching
beyond those defined by Unicode. In the current Unicode release,
8 of these Perl definitions have the same names as Unicode properties,
so there is a naming conflict that needs to be considered and a
decision made about what to do.
Actually, 2 of the Perl ones are defined to be identical to the
respective Unicode ones, so there is no conflict for those. (It may even
be that the Unicode Consortium saw our names, liked them and copied them :-))
That leaves 6. The circumstances vary for these, so I'll send them out
in different emails. This email is about Upper and Lower, which have
similar circumstances.
First, it is unclear to me whether the Unicode property names are
normative or informative. (Normative means that any implementation
claiming to implement Unicode has to abide by them, whereas informative
means they are strongly recommended but not required.) The latest draft
document for Unicode 5.2 says that they are normative; other
documentation indicates that some, at least, are informative.
My understanding is that these Perl properties were originally designed
as replacements for [[:upper:]] et al. that extend to Unicode.
It is unclear to me what the current implications of this might be.
\p{Upper} in perl is a synonym for \p{Lu}, or Uppercase_Letter. The
Unicode Upper property is a superset of this, and a synonym for the
property named just "Uppercase", which includes not only letters, but
also code points that match the property Other_Uppercase. From what I
understand, this latter property is hand-populated to include other
characters the Consortium believes should also be treated as Uppercase,
besides those whose general category is a letter. In Unicode 5.1, it
comprises 42 code points: the capital Roman numerals and the capital
letters enclosed in circles.
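To make the difference concrete, here is a minimal Perl sketch; it is
an illustration added for clarity, not part of any patch. It assumes a
perl whose Unicode tables provide \p{Uppercase}, and the helper name
upper_status and the sample code points are chosen here purely for
illustration. It deliberately uses the unambiguous names \p{Lu} and
\p{Uppercase} instead of \p{Upper}, since what \p{Upper} should mean is
exactly the question being asked.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a code point matches the
# letter-only category (Lu) and the wider Unicode Uppercase property.
sub upper_status {
    my ($cp) = @_;
    my $ch = chr($cp);
    return {
        lu        => ($ch =~ /\p{Lu}/        ? 1 : 0),
        uppercase => ($ch =~ /\p{Uppercase}/ ? 1 : 0),
    };
}

# Sample code points (chosen for illustration):
#   U+0041 LATIN CAPITAL LETTER A          (general category Lu)
#   U+2160 ROMAN NUMERAL ONE               (general category Nl)
#   U+24B6 CIRCLED LATIN CAPITAL LETTER A  (general category So)
#   U+0061 LATIN SMALL LETTER A            (general category Ll)
for my $cp (0x41, 0x2160, 0x24B6, 0x61) {
    my $s = upper_status($cp);
    printf "U+%04X  Lu=%d  Uppercase=%d\n",
        $cp, $s->{lu}, $s->{uppercase};
}
```

U+0041 should match both, U+0061 neither, and the Roman numeral and the
circled letter should match only \p{Uppercase}; those last two are the
kind of code point whose treatment depends on which definition is used.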
I believe that the perl definition of Upper can be considered a bug, and
should be changed to match the Unicode definition. This would expand
what \p{Upper} matches by those 42 code points, and would contract what
\P{Upper} matches by the same 42 code points.
Similarly for \p{Lower}. I believe it should match the Unicode Lower
property, which adds 159 code points, including small Roman numerals
and circled letters, as well as some subscripts and "modifier letters".
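The same kind of check can be sketched for the lowercase side. Again
this is only an illustration under assumptions: it presumes a perl
whose tables provide \p{Lowercase}, the helper name
lowercase_but_not_ll is made up for this example, and the sample code
points are my own picks from the groups mentioned above.

```perl
use strict;
use warnings;
use charnames ();    # for charnames::viacode()

# Hypothetical helper: true when a code point has the Unicode Lowercase
# property without being a Lowercase_Letter (Ll).
sub lowercase_but_not_ll {
    my ($cp) = @_;
    my $ch = chr($cp);
    return ($ch =~ /\p{Lowercase}/ && $ch !~ /\p{Ll}/) ? 1 : 0;
}

# One ordinary letter, then one sample from each group: a small Roman
# numeral, a circled letter, a subscript, and a modifier letter.
my @samples = (0x61, 0x2170, 0x24D0, 0x2090, 0x02B0);

for my $cp (@samples) {
    printf "U+%04X %-32s %s\n",
        $cp,
        charnames::viacode($cp),
        lowercase_but_not_ll($cp) ? "Lowercase only" : "no difference";
}
```

Only the plain letter "a" should report no difference; the other four
are examples of code points that the Unicode definition would add.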
Any comments?