
RFC: [perl #60156] What to do about [[:posix:]] ?

From: karl williamson
Date: August 14, 2010 10:11
Subject: RFC: [perl #60156] What to do about [[:posix:]] ?
Message ID: 4C66CDD6.8040509@khwilliamson.com

There are a number of problems with the [[:posix:]] character classes. 
I thought we had what to do about this settled, but that was before 
there was more of an emphasis on strict backwards compatibility, and 
before I did some more investigation, so I thought I had better air it 
again.

Here are the problems:

1) They do not match the Posix standard.  In our attempt to DWIM, we 
violate it.  For example, [[:alpha:]] is only supposed to match A-Za-z, 
unless in a locale that has other alphabetics.  But, if the target 
string or pattern indicates a utf8 match, it matches \p{alpha}.  I 
suppose we could argue that we have created a new locale, the Unicode 
locale.  I don't know if that argument holds water or not.

2) They suffer from "The Unicode Bug", in which the utf8ness of the 
pattern or string affects the semantics of the match.  [[:alpha:]] will 
match "\xe1" if and only if the pattern or target string is in utf8.

3) A number of characters in utf8 match both a class and the complement 
of the class.  Here's a list from bug #60156:
  [[:alnum:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
  [[:alpha:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
  [[:blank:]] U+A0
  [[:cntrl:]] U+80
  [[:graph:]] U+A1
  [[:lower:]] U+AA U+B5 U+BA U+DF..F6 U+F8
  [[:print:]] U+A0
  [[:punct:]] U+24 U+2B U+3C..3E U+5E U+60 U+7C U+7E U+A1 U+AB U+B7 U+BB U+BF
  [[:space:]] U+85 U+A0
  [[:upper:]] U+C0..D6 U+D8
  [[:word:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8

Note that some of these are ASCII.  Most of these have the same root 
cause as the Unicode bug, but some arise because, when the string is 
stored in utf8, the code re-uses an existing, but not quite 
corresponding, \p{} property.

4) Extending the posix definitions was not done consistently.  This is 
especially noticeable in punct.  Unicode splits what Posix considers 
punctuation into two classes: punctuation and symbol.  But in extending 
[[:punct:]] beyond ASCII, Perl doesn't include the Unicode symbols. 
The result is inconsistent: the ASCII-range symbols are included, but 
no others.
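
The split can be checked outside of Perl.  What follows is only an 
illustrative sketch in Python, using the standard unicodedata module; 
the helper name split_posix_punct is made up for this example.  It 
sorts the 32 ASCII characters that Posix calls punctuation by their 
Unicode general category.

```python
import string
import unicodedata

def split_posix_punct():
    """Partition ASCII Posix punctuation by Unicode general category."""
    punct, symbol = [], []
    for ch in string.punctuation:        # the 32 ASCII punct characters
        cat = unicodedata.category(ch)   # two-letter code, e.g. 'Po', 'Sm'
        # Categories starting with 'S' are symbols; the rest here are 'P*'.
        (symbol if cat.startswith("S") else punct).append(ch)
    return punct, symbol

punct, symbol = split_posix_punct()
print("Unicode punctuation:", "".join(punct))
print("Unicode symbol:     ", "".join(symbol))
```

The characters it reports as symbols are the same ASCII code points 
that appear in the [[:punct:]] row of the list in problem 3.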

It is less clear about other extensions.  Should [[:cntrl:]] include 
other things that Unicode considers control-like, namely the surrogates, 
the formats (soft hyphen et al.), and private use characters?  What about 
title case, fractions, super and subscripts?
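
For reference, the Unicode general categories behind these questions 
can be looked up directly.  This is a small Python sketch using the 
standard unicodedata module, not anything in Perl itself; the sample 
characters are chosen only for illustration.

```python
import unicodedata

# One sample character per kind mentioned above (illustrative picks).
samples = {
    "control (U+0085)":      "\u0085",
    "format, soft hyphen":   "\u00ad",
    "surrogate (U+D800)":    "\ud800",
    "private use (U+E000)":  "\ue000",
    "title case (U+01C5)":   "\u01c5",
    "fraction one half":     "\u00bd",
    "superscript two":       "\u00b2",
}

# Map each label to its two-letter general category.
categories = {label: unicodedata.category(ch) for label, ch in samples.items()}

for label, cat in categories.items():
    print(f"{label:24} {cat}")
```

Control, format, surrogate and private use are distinct categories 
(Cc, Cf, Cs, Co), so including any of them in [[:cntrl:]] would be a 
separate decision for each.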

Before, it seemed like the obvious solution to all this was to just go 
back to the formal Posix definition of what they should match, not 
having a "Unicode locale", and that was done via #ifdefs for a while in 
5.11.  But it was part of a larger patch that it was decided to revert. 
Now the #ifdefs remain, defined in the other direction, and 
perlrecharclass.pod in 5.12 says that it is proposed to make these match 
the Posix standard exactly, asking anyone who disagrees to notify us. 
So far no one has.

If we were to just reinstate those #ifdefs, it would fix all the above 
problems in one fell swoop.  But it seems to me that we will break too 
much existing code.  I think it was a mistake to extend these 
definitions to a made-up "Unicode locale" in the first place, but that 
ship has sailed, in spite of what we thought we had decided earlier.

I have done some investigation, and it appears that I can easily solve 
problem 3) by creating more properties in mktables tailored just for 
these posix character classes; and easily solve 2) for regexes compiled 
under feature unicode_strings, by extending what I'm already about to 
submit a patch for regarding [\w\s].  I think I should do this, ripping 
out the #ifdefs.

If we want to restrict the posix classes to strict posix definitions, I 
think it probably should be done with a pragma: 'use feature 
"strict_posix"' or 'use re "strict_posix"'.  This is not as 
high-priority in my view; and I'm not certain it even needs to be done 
at all if 2) and 3) are fixed.

I think that, for consistency, punct should change to include the 
Unicode symbols as well, especially if we don't add the strict posix 
interpretations.  I think the other inconsistencies are not something 
to worry about, but I am less confident in this.

Comments?
