RFC: [perl #60156] What to do about [[:posix:]] ?
From:
karl williamson
Date:
August 14, 2010 10:11
Subject:
RFC: [perl #60156] What to do about [[:posix:]] ?
Message ID:
4C66CDD6.8040509@khwilliamson.com
There are a number of problems with the [[:posix:]] character classes.
I thought we had what to do about this settled, but that was before
there was more of an emphasis on strict backwards compatibility, and
before I did some more investigation, so I thought I had better air it
again.
Here are the problems:
1) They do not match the Posix standard. In our attempt to DWIM, we
violate it. For example, [[:alpha:]] is only supposed to match A-Za-z,
unless in a locale that has other alphabetics. But, if the target
string or pattern indicates a utf8 match, it matches \p{alpha}. I
suppose we could argue that we have created a new locale, the Unicode
locale. I don't know if that argument holds water or not.
2) They suffer from "The Unicode Bug", in which the utf8ness of the
pattern or string affects the semantics of the match. [[:alpha:]] will
match "\xe1" if and only if the pattern or target string is in utf8.
3) A number of characters in utf8 match both a class and the complement
of the class. Here's a list from bug #60156:
[[:alnum:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
[[:alpha:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
[[:blank:]] U+A0
[[:cntrl:]] U+80
[[:graph:]] U+A1
[[:lower:]] U+AA U+B5 U+BA U+DF..F6 U+F8
[[:print:]] U+A0
[[:punct:]] U+24 U+2B U+3C..3E U+5E U+60 U+7C U+7E U+A1 U+AB U+B7 U+BB
U+BF
[[:space:]] U+85 U+A0
[[:upper:]] U+C0..D6 U+D8
[[:word:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
Note that some of these are ASCII. These mostly share a root cause with
the Unicode bug, but some arise because, when the string is stored in
utf8, the code re-uses an existing, but not quite corresponding, \p{}
property.
4) Extending the posix definitions was not done consistently. This is
especially noticeable in punct. Unicode splits what Posix considers
punctuation into two classes: punctuation and symbol. But in extending
[[:punct:]] beyond ASCII, Perl doesn't include the Unicode symbols.
The result is inconsistent: the ASCII-range symbols are included, but
no others.
It is less clear what to do about other extensions. Should [[:cntrl:]]
include other things that Unicode considers control-like, namely the
surrogates, the formats (soft hyphen et al.), and private use
characters? What about title case, fractions, super- and subscripts?
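Problems 2) and 3) can be checked mechanically. What follows is only a
rough sketch, not code from the bug report or from any patch:
probe_class() is a hypothetical helper name. For each code point
below 256 it compares the match result for the native and the utf8
forms of the same character, and it flags any character that matches
both a class and its complement. The output depends on the Perl
version in use, so none is shown.

```perl
use strict;
use warnings;

my @classes = qw(alnum alpha blank cntrl graph lower print punct space
                 upper word);

# probe_class() is a hypothetical helper, not part of Perl or any patch.
# It returns two array refs of code points in 0..255:
#   1. those whose match result differs between native and utf8 storage
#   2. those that match both [[:class:]] and [[:^class:]]
sub probe_class {
    my ($class) = @_;
    my $pos = qr/\A[[:${class}:]]\z/;
    my $neg = qr/\A[[:^${class}:]]\z/;
    my (@encoding_dependent, @contradictory);

    for my $cp (0 .. 255) {
        my $native = chr $cp;
        utf8::downgrade($native);    # one byte per character
        my $utf8 = chr $cp;
        utf8::upgrade($utf8);        # same character, utf8 internally

        my $m_native = $native =~ $pos ? 1 : 0;
        my $m_utf8   = $utf8   =~ $pos ? 1 : 0;
        push @encoding_dependent, $cp if $m_native != $m_utf8;

        # A character must never satisfy a class and its complement.
        for my $str ($native, $utf8) {
            if ($str =~ $pos && $str =~ $neg) {
                push @contradictory, $cp;
                last;
            }
        }
    }
    return (\@encoding_dependent, \@contradictory);
}

for my $class (@classes) {
    my ($dep, $both) = probe_class($class);
    printf "[[:%s:]] encoding-dependent: %s\n", $class,
        join(' ', map { sprintf 'U+%02X', $_ } @$dep) || '(none)';
    printf "[[:%s:]] class and complement: %s\n", $class,
        join(' ', map { sprintf 'U+%02X', $_ } @$both) || '(none)';
}
```

The patterns are compiled without any feature or locale in scope, so
the sketch exercises the default semantics only.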
Before, it seemed like the obvious solution to all this was to just go
back to the formal Posix definition of what they should match, not
having a "Unicode locale", and that was done via #ifdefs for a while in
5.11. But it was part of a larger patch that it was decided to revert.
Now the #ifdefs remain, defined in the other direction, and
perlrecharclass.pod in 5.12 says that it is proposed to make these match
the Posix standard exactly, asking anyone who disagrees to notify us.
So far no one has.
If we were to just reinstate those #ifdefs, it would fix all the above
problems in one fell swoop. But it seems to me that we will break too
much existing code. I think it was a mistake extending these
definitions to a made-up "Unicode locale" in the first place, but that
ship has sailed, I think, in spite of what we thought we had decided
earlier.
I have done some investigation, and it appears that I can easily solve
problem 3) by creating more properties in mktables tailored just for
these posix character classes; and easily solve 2) for regexes compiled
under feature unicode_strings, by extending what I'm already about to
submit a patch for regarding [\w\s]. I think I should do this, ripping
out the #ifdefs.
If we want to restrict the posix classes to strict posix definitions, I
think it probably should be done with a pragma: 'use feature
"strict_posix"' or 'use re "strict_posix"'. This is not as
high-priority in my view; and I'm not certain it even needs to be done
at all if 2) and 3) are fixed.
I think, for consistency, especially if we don't add the strict posix
interpretations, that punct should change to include the Unicode symbols
as well. I think the other inconsistencies are not something to worry
about, but I am less confident in this.
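To make the punct suggestion concrete, here is a minimal sketch of
what "punctuation plus symbols" would match. It assumes the change
amounts to the union of the Unicode general categories P and S; the
names $proposed_punct and is_proposed_punct() are hypothetical and
illustrate the idea only, not how [[:punct:]] is implemented.

```perl
use strict;
use warnings;

# Hypothetical pattern: Unicode punctuation (P) plus Unicode symbols (S).
# \p{} always uses Unicode rules, whatever the string's internal encoding.
my $proposed_punct = qr/\A[\p{P}\p{S}]\z/;

sub is_proposed_punct {
    my ($char) = @_;
    return $char =~ $proposed_punct ? 1 : 0;
}

# Printable ASCII characters that Unicode classes as symbols rather
# than punctuation.
my @ascii_symbols = grep { chr($_) =~ /\A\p{S}\z/ } 0x21 .. 0x7E;
print "ASCII symbols: ", join(' ', map { chr } @ascii_symbols), "\n";

# A few sample code points: dollar sign, exclamation mark, copyright
# sign, left guillemet, and the letter A.
for my $cp (0x24, 0x21, 0xA9, 0xAB, 0x41) {
    printf "U+%04X %s\n", $cp,
        is_proposed_punct(chr $cp) ? 'matches' : 'does not match';
}
```

The ASCII symbols it lists should correspond to the ASCII entries in
the [[:punct:]] line of the list from bug #60156.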
Comments?