Re: RFC: Unicode/Perl name clashes

From: karl williamson
Date: September 1, 2009 15:33
Subject: Re: RFC: Unicode/Perl name clashes
Message ID: 4A9DA0D5.4080809@khwilliamson.com

demerphq wrote:
snip

> I just want to register a concern...  Character classes, negation, and
> Unicode property names, POSIX character classes, and Perl specific
> character classes all play together. So we can't just say "cntrl" will
> start meaning whatever unicode says it should mean, without at least
> defining PosixCntrl as a replacement. Likewise for the other
> character classes where perl/posix and unicode disagree. So for me it's
> fine to say that \p{IsCntrl} is the unicode definition, and
> [[:CNTRL:]] is made to map to \p{IsPosixCntrl}. Likewise \d should map
> to [[:digit:]] should map to \p{IsPerlDigit} and not \p{IsDigit}, and
> similar for \w. Etc. As far as I understand this doesn't contradict
> what you are saying, as what you are saying would strictly affect what
> \p{IsDigit} would do, not prevent us from defining \p{IsPerlDigit}. I
> just wanted to point out that it's not quite as simple as saying "let's
> make them match up", you also have to reconcile some overlapping
> issues too.

snip

I did not think of this as a contradiction, so I didn't bring it up.  I
believe we are in agreement.  Here, in my own words, is what I think the plan
is:  Right now, [[:CNTRL:]] is the same thing as \p{Cntrl}, but after
the switch is pulled it will instead be the same thing as the current 
\p{PosixCntrl} that you added last November.  I'm only proposing 
changing \p{Cntrl}, not \p{PosixCntrl}. Likewise \d maps to [[:digit:]] 
both before and after the switch is pulled.  But right now they are 
equivalent to \p{Digit} which is the same thing as \p{Nd}.  After the 
switch, they will instead both be equivalent to \p{PosixDigit}, again a 
property you created.
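
To make the Cntrl case concrete, here is a minimal sketch that compares the two properties on a few code points.  It assumes a reasonably recent perl in which both \p{Cntrl} and \p{PosixCntrl} are available; the helper name cntrl_matches is made up for illustration and is not part of any patch under discussion.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a code point matches the Unicode
# control property and the ASCII-restricted POSIX one.
sub cntrl_matches {
    my ($cp) = @_;
    my $ch = chr $cp;
    utf8::upgrade($ch);    # make sure the string is stored as UTF-8 internally
    return (
        ($ch =~ /\A\p{Cntrl}\z/      ? 1 : 0),
        ($ch =~ /\A\p{PosixCntrl}\z/ ? 1 : 0),
    );
}

# BEL, DEL, NEL (a C1 control) and the letter 'A'
for my $cp (0x07, 0x7F, 0x85, 0x41) {
    my ($uni, $posix) = cntrl_matches($cp);
    printf "U+%04X  \\p{Cntrl}=%d  \\p{PosixCntrl}=%d\n", $cp, $uni, $posix;
}
```

The interesting row is U+0085: it is a control character as far as Unicode is concerned, but it lies outside the ASCII range, so only \p{Cntrl} should match it.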

For those who don't know, these are two of the new properties that Yves
created, each identical to the POSIX class its name suggests.  In most cases
they are merely the Unicode ones restricted to the ASCII range, but PosixPrint
and PosixPunct are different because Unicode and POSIX don't agree on these
even within the ASCII range.
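
The ASCII-range disagreement is easiest to see with punctuation.  The sketch below lists which ASCII characters match \p{PosixPunct} but not \p{Punct}; it assumes a perl where \p{Punct} means General_Category=Punctuation, and ascii_matching is a hypothetical helper written just for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the ASCII characters (U+0000..U+007F)
# that match a compiled pattern.
sub ascii_matching {
    my ($re) = @_;
    return grep { $_ =~ $re } map { chr } 0 .. 0x7F;
}

my @posix   = ascii_matching(qr/\A\p{PosixPunct}\z/);
my @unicode = ascii_matching(qr/\A\p{Punct}\z/);

# Characters POSIX calls punctuation but Unicode classifies otherwise
my %in_unicode = map { $_ => 1 } @unicode;
my @only_posix = grep { !$in_unicode{$_} } @posix;

printf "PosixPunct: %d chars, Punct within ASCII: %d chars\n",
    scalar @posix, scalar @unicode;
print "POSIX only: @only_posix\n";
```

The POSIX-only characters should be the ones Unicode treats as symbols rather than punctuation, such as $ and +.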

The bottom line is that, for example, \p{digit} will mean the same thing
before and after the switch, and so will \p{posixdigit}; but the switch will
cause \d and [[:digit:]] to change to be the same thing as posixdigit,
namely 0..9.
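
As a rough illustration of the digit case, the following sketch checks an ASCII digit and a non-ASCII decimal digit against each construct.  It is not the switch discussed here: as a stand-in for the "after" behaviour of \d and [[:digit:]] it uses the /a modifier, which assumes Perl 5.14 or later, and digit_matches is a hypothetical helper.

```perl
use 5.014;      # the /a modifier needs Perl 5.14 or later
use strict;
use warnings;

# Hypothetical helper: report which digit constructs match one character.
sub digit_matches {
    my ($ch) = @_;
    return {
        p_digit      => ($ch =~ /\A\p{Digit}\z/      ? 1 : 0),
        p_nd         => ($ch =~ /\A\p{Nd}\z/         ? 1 : 0),
        p_posixdigit => ($ch =~ /\A\p{PosixDigit}\z/ ? 1 : 0),
        d_ascii      => ($ch =~ /\A\d\z/a            ? 1 : 0),   # ASCII-only \d
        class_ascii  => ($ch =~ /\A[[:digit:]]\z/a   ? 1 : 0),   # ASCII-only class
    };
}

# U+0037 is DIGIT SEVEN; U+0663 is ARABIC-INDIC DIGIT THREE
for my $cp (0x37, 0x0663) {
    my $r = digit_matches(chr $cp);
    printf "U+%04X  %s\n", $cp, join '  ', map { "$_=$r->{$_}" } sort keys %$r;
}
```

For U+0663 only the two Unicode properties, \p{Digit} and \p{Nd}, should match; the POSIX property and the ASCII-restricted forms accept 0..9 only.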

This brings up a question I wrote to you about back when you were 
extremely busy, and did not get a reply to.  It seems to me that for 
efficiency you'd want to eventually hard-code these values into regcomp 
instead of having to go through utf8_heavy.  In the short run, though, 
it makes sense to use these Posix properties so that the switch can be 
flipped back and forth easily.  Am I missing something?

Karl
