perl.perl5.porters

Re: RFC: [perl #60156] What to do about [[:posix:]] ?

November 30, 2010 06:07
On Fri, Aug 20, 2010 at 04:41:48PM +0200, demerphq wrote:
> On 14 August 2010 19:09, karl williamson wrote:
> > There are a number of problems with the [[:posix:]] character classes. I
> > thought we had what to do about this settled, but that was before there was
> > more of an emphasis on strict backwards compatibility, and before I did some
> > more investigation, so I thought I had better air it again.
> >
> > Here are the problems:
> >
> > 1) They do not match the Posix standard.  In our attempt to DWIM, we violate
> > it.  For example, [[:alpha:]] is only supposed to match A-Za-z, unless in a
> > locale that has other alphabetics.  But, if the target string or pattern
> > indicates a utf8 match, it matches \p{alpha}.  I suppose we could argue that
> > we have created a new locale, the Unicode locale.  I don't know if that
> > argument holds water or not.
> >
> > 2) They suffer from "The Unicode Bug", in which the utf8ness of the pattern
> > or string affects the semantics of the match.  [[:alpha:]] will match "\xe1"
> > if and only if the pattern or target string is in utf8.
> >
> > 3) A number of characters in utf8 match both a class and the complement of
> > the class.  Here's a list from bug #60156:
> >  [[:alnum:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
> >  [[:alpha:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
> >  [[:blank:]] U+A0
> >  [[:cntrl:]] U+80
> >  [[:graph:]] U+A1
> >  [[:lower:]] U+AA U+B5 U+BA U+DF..F6 U+F8
> >  [[:print:]] U+A0
> >  [[:punct:]] U+24 U+2B U+3C..3E U+5E U+60 U+7C U+7E U+A1 U+AB U+B7 U+BB U+BF
> >  [[:space:]] U+85 U+A0
> >  [[:upper:]] U+C0..D6 U+D8
> >  [[:word:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
> >
> > Note that some of these are ASCII.  These mostly stem from the same causes
> > as the Unicode bug, but also arise because, when the strings are stored in
> > utf8, the code re-uses an existing, but not quite corresponding, \p{}
> > property.
> >
> > 4) Extending the posix definitions was not done consistently.  This is
> > especially noticeable in punct.  Unicode splits what Posix considers
> > punctuation into two classes: punctuation and symbol.  But in extending
> > [[:punct:]] beyond ASCII, Perl doesn't include the Unicode symbols.  The
> > result is inconsistent: the ASCII-range symbols are included, but no others.
> >
> > Other extensions are less clear.  Should [[:cntrl:]] include other
> > things that Unicode considers control-like, namely the surrogates, the
> > formats (such as soft hyphen), and private use characters?  What about title
> > case, fractions, super and subscripts?
> >
> > Before, it seemed like the obvious solution to all this was to just go back
> > to the formal Posix definition of what they should match, not having a
> > "Unicode locale", and that was done via #ifdefs for a while in 5.11.  But it
> > was part of a larger patch that it was decided to revert.  Now the #ifdefs
> > remain defined in the other direction, and perlrecharclass.pod in 5.12 says
> > that it is proposed to make these match the Posix standard exactly, asking
> > anyone who disagrees to notify us.  So far, no one has.
> >
> > If we were to just reinstate those #ifdefs, it would fix all the above
> > problems in one fell swoop.  But it seems to me that we will break too much
> > existing code.  I think it was a mistake extending these definitions to a
> > made-up "Unicode locale" in the first place, but that ship has sailed, I
> > think, in spite of what we thought we had decided earlier.
> >
> > I have done some investigation, and it appears that I can easily solve
> > problem 3) by creating more properties in mktables tailored just for these
> > posix character classes; and easily solve 2) for regexes compiled under
> > feature unicode_strings, by extending what I'm already about to submit a
> > patch for regarding [\w\s].  I think I should do this, ripping out the
> > #ifdefs.
> >
> > If we want to restrict the posix classes to strict posix definitions, I
> > think it probably should be done with a pragma: 'use feature "strict_posix"'
> > or 'use re "strict_posix"'.  This is not as high-priority in my view; and
> > I'm not certain it even needs to be done at all if 2) and 3) are fixed.
> >
> > I think, for consistency, especially if we don't add the strict posix
> > interpretations, that punct should change to include the Unicode symbols as
> > well.  I think the other inconsistencies are not something to worry about,
> > but I am less confident in this.
> >
> > Comments?
> POSIX is a standard. It is NOT up to us to redefine that standard. Had
> we realized that we were breaking the standard and the massive can of
> worms involved at the time I do not think we would have gone the way
> we did. I think it would be a HUGE benefit to return to the correct
> interpretation of POSIX charclasses, and I do not think that backcompat
> will be impacted beyond a bunch of buggy programs ceasing to be
> buggy.

It's a bit late, but I agree with Yves. POSIX is a standard. It defines
what goes in [[:posix:]]. Unicode may be a newer standard, but it has
its own set of properties, \p{Property}. By "extending" POSIX character
classes so they are (more or less) equivalent to Unicode properties, 
we've actually taken away functionality.
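
To make point 4) of the quoted message concrete, here is a minimal sketch
(in Python rather than Perl, only because its standard unicodedata module
exposes Unicode general categories directly; it does not exercise Perl's
regex engine). It assumes "POSIX punct" means the C-locale definition, that
is, printable ASCII that is neither alphanumeric nor space, and it shows
which of those characters Unicode files under Symbol (S*) rather than
Punctuation (P*). The names posix_punct, ascii_symbols and listed_in_bug
are illustrative only.

```python
import unicodedata

# Assumption: C-locale [[:punct:]] is the printable ASCII range 0x21..0x7E
# minus letters and digits (32 characters in total).
posix_punct = [cp for cp in range(0x21, 0x7F) if not chr(cp).isalnum()]

# Unicode general categories starting with "S" are symbols (Sc, Sk, Sm, So);
# those starting with "P" are punctuation.
ascii_symbols = [cp for cp in posix_punct
                 if unicodedata.category(chr(cp)).startswith("S")]

# The ASCII code points listed under [[:punct:]] in the table from
# bug #60156: U+24 U+2B U+3C..3E U+5E U+60 U+7C U+7E.
listed_in_bug = [0x24, 0x2B, 0x3C, 0x3D, 0x3E, 0x5E, 0x60, 0x7C, 0x7E]

for cp in ascii_symbols:
    # Show each character with its code point and general category.
    print(f"U+{cp:02X} {chr(cp)!r} {unicodedata.category(chr(cp))}")

# The ASCII symbols are exactly the ASCII entries in the bug's punct list.
print(ascii_symbols == listed_in_bug)  # prints True
```

The ASCII characters that POSIX calls punctuation but Unicode calls symbols
are exactly the ASCII entries in the [[:punct:]] row of the list above, which
is consistent with the observation that the mismatch comes from re-using a
Unicode property that does not quite correspond to the POSIX class.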

