Re: RFC: Unicode/Perl name clashes
From: demerphq
Date: September 1, 2009 15:51
Subject: Re: RFC: Unicode/Perl name clashes
Message ID: 9b18b3110909011551y49c47eb5teda5ec955e24879e@mail.gmail.com
2009/9/2 karl williamson <public@khwilliamson.com>:
> demerphq wrote:
> snip
>
>> I just want to register a concern... Character classes, negation,
>> Unicode property names, POSIX character classes, and Perl-specific
>> character classes all play together. So we can't just say "cntrl" will
>> start meaning whatever Unicode says it should mean, without at least
>> defining PosixCntrl as a replacement. Likewise for the other
>> character classes where Perl/POSIX and Unicode disagree. So for me it's
>> fine to say that \p{IsCntrl} is the Unicode definition, and
>> [[:cntrl:]] is made to map to \p{IsPosixCntrl}. Likewise \d and
>> [[:digit:]] should both map to \p{IsPerlDigit} and not \p{IsDigit}, and
>> similarly for \w, etc. As far as I understand, this doesn't contradict
>> what you are saying, as what you are saying would strictly affect what
>> \p{IsDigit} would do, not prevent us from defining \p{IsPerlDigit}. I
>> just wanted to point out that it's not quite as simple as saying "let's
>> make them match up"; you also have to reconcile some overlapping
>> issues too.
>
> snip
>
> I did not think of this as a contradiction, so didn't bring it up. I
> believe we are in agreement.
Great, I thought as much but I wanted to make sure.
> Here in my words is what I think the plan is:
> Right now, [[:cntrl:]] is the same thing as \p{Cntrl}, but after the switch
> is pulled it will instead be the same thing as the current \p{PosixCntrl}
> that you added last November. I'm only proposing changing \p{Cntrl}, not
> \p{PosixCntrl}. Likewise \d maps to [[:digit:]] both before and after the
> switch is pulled. But right now they are equivalent to \p{Digit} which is
> the same thing as \p{Nd}. After the switch, they will instead both be
> equivalent to \p{PosixDigit}, again a property you created.
>
> For those of you who don't know, these are 2 of the new properties that Yves
> created that are identical to the Posix ones you'd think they'd be
> equivalent to. In most cases they are merely the Unicode ones restricted to
> the ASCII range, but PosixPrint and PosixPunct are different because Unicode
> and Posix don't agree on these in the ASCII range.
>
> The bottom line is that, for example, \p{digit} will mean the same thing
> before and after the switch; \p{posixdigit} will mean the same thing before
> and after the switch, but the switch will cause \d and [[:digit:]] to change
> to be the same thing as posixdigit, namely 0..9.
Correct, and as a consequence it will fix the bugs with negated POSIX
charclasses that we have encountered in the past.
We can actually change the mappings right now by just patching the
default define in regcomp.h, and then you are pretty much free to do
what you want elsewhere. We then have some TODO/contradictory tests to
resolve as well.
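
To make the distinction between the Unicode and Posix properties concrete, here is a minimal sketch. It assumes a perl that ships the PosixDigit and PosixPunct properties named in this thread, and it deliberately tests only the \p{} properties, not \d or [[:digit:]], since what those map to is exactly what is under discussion. The sample characters and the %seen hash are illustrative choices, not anything from the core.

```perl
use strict;
use warnings;

# ARABIC-INDIC DIGIT THREE: a decimal digit (Nd) outside the ASCII range.
my $arabic_three = "\x{0663}";

# Record which property matches which sample character (1 = match, 0 = no match).
my %seen = (
    nd_matches_arabic         => ($arabic_three =~ /\A\p{Nd}\z/)         ? 1 : 0,
    posixdigit_matches_arabic => ($arabic_three =~ /\A\p{PosixDigit}\z/) ? 1 : 0,
    posixdigit_matches_ascii  => ('7' =~ /\A\p{PosixDigit}\z/)           ? 1 : 0,
    # "$" is a currency symbol (Sc) to Unicode, but punctuation to POSIX,
    # which is one of the in-ASCII disagreements mentioned above.
    punct_matches_dollar      => ('$' =~ /\A\p{Punct}\z/)                ? 1 : 0,
    posixpunct_matches_dollar => ('$' =~ /\A\p{PosixPunct}\z/)           ? 1 : 0,
);

printf "%-26s %d\n", $_, $seen{$_} for sort keys %seen;
```

The point of the sketch is only that the Posix* properties stay inside ASCII and follow the POSIX definitions, while the unprefixed names follow Unicode.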
>
> This brings up a question I wrote to you about back when you were extremely
> busy, and did not get a reply to.
Sorry. I'll have to hunt through and see if there are any others.
> It seems to me that for efficiency you'd
> want to eventually hard-code these values into regcomp instead of having to
> go through utf8_heavy.
> In the short-run, though, it makes sense to use
> these Posix properties so that the switch can be flipped back and forth
> easily. Am I missing something?
With regard to optimizing these, it makes sense for the long run but
isn't IMO so important in the short run, as utf8 is slow regardless. I
guess we could use the DEFINE generator in Porting that I wrote,
although it was intended for a different purpose. Maybe you have
better ideas?
One open project is something Jarkko pointed out to me a while back:
using inversion lists as character set representations could make
things a lot nicer in general, as they support boolean operations
reasonably efficiently and have a modest overhead.
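
For anyone who has not met inversion lists, a rough sketch of the idea follows. This is an illustration only, not code from regcomp.c, and the names invlist_contains and invlist_complement are made up for the example. An inversion list is a sorted array of code points at which membership flips, so lookup is a binary search and negation is a one-element change.

```perl
use strict;
use warnings;

# [0x30, 0x3A] reads as: "in" from 0x30 up to, but not including, 0x3A.
my @digits = (0x30, 0x3A);    # ASCII 0..9

# Count the boundaries <= $cp with a binary search.
# An odd count means $cp lies inside one of the "in" ranges.
sub invlist_contains {
    my ($list, $cp) = @_;
    my ($lo, $hi) = (0, scalar @$list);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($list->[$mid] <= $cp) { $lo = $mid + 1 }
        else                      { $hi = $mid }
    }
    return $lo % 2;
}

# Negation toggles a boundary at 0: drop it if present, otherwise add it.
sub invlist_complement {
    my ($list) = @_;
    return @$list && $list->[0] == 0
        ? [ @{$list}[1 .. $#$list] ]
        : [ 0, @$list ];
}

my $not_digits = invlist_complement(\@digits);
for my $ch ('5', 'x') {
    printf "%s: digit=%d non-digit=%d\n", $ch,
        invlist_contains(\@digits,    ord $ch),
        invlist_contains($not_digits, ord $ch);
}
```

Union and intersection can be done with a single merge pass over two such lists, which is where the "boolean operations reasonably efficiently" part comes from.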
cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"