Re: RFC: Unicode/Perl name clashes
From: karl williamson
Date: September 1, 2009 15:33
Subject: Re: RFC: Unicode/Perl name clashes
Message ID: 4A9DA0D5.4080809@khwilliamson.com
demerphq wrote:
snip
> I just want to register a concern... Character classes, negation, and
> Unicode property names, POSIX character classes, and Perl specific
> character classes all play together. So we can't just say "cntrl" will
> start meaning whatever unicode says it should mean, without at least
> defining PosixCntrl as a replacement. Likewise for the other
> character classes where perl/posix and unicode disagree. So for me it's
> fine to say that \p{IsCntrl} is the unicode definition, and
> [[:CNTRL:]] is made to map to \p{IsPosixCntrl}. Likewise \d should map
> to [[:digit:]] should map to \p{IsPerlDigit} and not \p{IsDigit}, and
> similar for \w. Etc. As far as I understand this doesn't contradict
> what you are saying, as what you are saying would strictly affect what
> \p{IsDigit} would do, not prevent us from defining \p{IsPerlDigit}. I
> just wanted to point out that it's not quite as simple as saying "let's
> make them match up", you also have to reconcile some overlapping
> issues too.
snip
I did not think of this as a contradiction, so didn't bring it up. I
believe we are in agreement. Here in my words is what I think the plan
is: Right now, [[:CNTRL:]] is the same thing as \p{Cntrl}, but after
the switch is pulled it will instead be the same thing as the current
\p{PosixCntrl} that you added last November. I'm only proposing
changing \p{Cntrl}, not \p{PosixCntrl}. Likewise \d maps to [[:digit:]]
both before and after the switch is pulled. But right now they are
equivalent to \p{Digit} which is the same thing as \p{Nd}. After the
switch, they will instead both be equivalent to \p{PosixDigit}, again a
property you created.
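
To make the Cntrl half of that concrete, here is a minimal sketch that contrasts \p{Cntrl} with \p{PosixCntrl} on a few code points. It does not demonstrate the switch itself, only the two properties the switch would choose between. It assumes a perl recent enough to provide \p{PosixCntrl}; the helper name cntrl_membership is made up for illustration.

```perl
use strict;
use warnings;

# Report whether a single character matches each of the two properties.
# Returns a two-element list: (matches \p{Cntrl}, matches \p{PosixCntrl}).
# cntrl_membership is a hypothetical helper, not part of perl.
sub cntrl_membership {
    my ($ch) = @_;
    return (
        ($ch =~ /\A\p{Cntrl}\z/      ? 1 : 0),
        ($ch =~ /\A\p{PosixCntrl}\z/ ? 1 : 0),
    );
}

# BEL (ASCII control), NEXT LINE (a C1 control above ASCII), and "A".
for my $cp (0x07, 0x85, 0x41) {
    my ($uni, $posix) = cntrl_membership(chr $cp);
    printf "U+%04X  Cntrl=%d  PosixCntrl=%d\n", $cp, $uni, $posix;
}
```

The interesting row is U+0085: it is a control character to Unicode but lies outside ASCII, so only the Unicode property should claim it.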
For those of you who don't know, these are two of the new properties that
Yves created, each identical to the Posix class its name suggests.
In most cases they are merely the Unicode ones
restricted to the ASCII range, but PosixPrint and PosixPunct are
different because Unicode and Posix don't agree on these in the ASCII range.
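
As an illustration of the Punct disagreement, the following sketch lists the ASCII characters that \p{PosixPunct} accepts but \p{Punct} (Unicode's General_Category=Punctuation) does not. It assumes a perl that provides \p{PosixPunct}; the function name posix_only_punct is hypothetical.

```perl
use strict;
use warnings;

# Collect the ASCII characters that Posix counts as punctuation but that
# Unicode's General_Category=Punctuation (\p{Punct}) leaves out.
# posix_only_punct is a hypothetical helper name.
sub posix_only_punct {
    my @found;
    for my $cp (0x00 .. 0x7F) {
        my $ch = chr $cp;
        push @found, $ch
            if $ch =~ /\p{PosixPunct}/ && $ch !~ /\p{Punct}/;
    }
    return @found;
}

print join(' ', posix_only_punct()), "\n";
```

The characters it reports are ones Unicode classifies as symbols rather than punctuation, such as "$" and "+".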
The bottom line is that, for example, \p{digit} will mean the same thing
before and after the switch; \p{posixdigit} will mean the same thing
before and after the switch, but the switch will cause \d and
[[:digit:]] to change to be the same thing as posixdigit, namely 0..9.
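
A small sketch of the two meanings of "digit" that the switch would choose between for \d and [[:digit:]]. It only tests the \p{} properties, whose meaning does not depend on the switch, and it assumes a perl that provides \p{PosixDigit}. The helper classify_digit and its hash keys are names invented for this example.

```perl
use strict;
use warnings;

# Return a hash ref saying which digit properties a single character matches.
# classify_digit and its keys are hypothetical names for illustration.
sub classify_digit {
    my ($ch) = @_;
    return {
        digit       => ($ch =~ /\A\p{Digit}\z/      ? 1 : 0),
        nd          => ($ch =~ /\A\p{Nd}\z/         ? 1 : 0),
        posix_digit => ($ch =~ /\A\p{PosixDigit}\z/ ? 1 : 0),
    };
}

# An ASCII digit and ARABIC-INDIC DIGIT THREE (U+0663), a non-ASCII Nd.
my @samples = (
    [ 'ASCII 7',                  "7"        ],
    [ 'ARABIC-INDIC DIGIT THREE', "\x{0663}" ],
);

for my $sample (@samples) {
    my ($name, $ch) = @$sample;
    my $r = classify_digit($ch);
    printf "%-26s Digit=%d Nd=%d PosixDigit=%d\n",
        $name, $r->{digit}, $r->{nd}, $r->{posix_digit};
}
```

Both characters are \p{Digit} and \p{Nd}, but only the ASCII one is \p{PosixDigit}, which is the set \d and [[:digit:]] would be limited to after the switch.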
This brings up a question I wrote to you about back when you were
extremely busy, and did not get a reply to. It seems to me that for
efficiency you'd want to eventually hard-code these values into regcomp
instead of having to go through utf8_heavy. In the short run, though,
it makes sense to use these Posix properties so that the switch can be
flipped back and forth easily. Am I missing something?
Karl