Re: RFC: Restatement of /a regex proposal
From: karl williamson
Date: December 5, 2010 21:03
Subject: Re: RFC: Restatement of /a regex proposal
Message ID: 4CFC6E0F.6070204@khwilliamson.com
karl williamson wrote:
> Abigail wrote:
>> On Sat, Dec 04, 2010 at 10:18:19AM -0700, karl williamson wrote:
>>> I realized as I got further into the design that there were some
>>> unstated things about what I'm proposing. So here is a complete
>>> statement, AFAIK:
>>>
>>> Using /a will have the following effects:
>>> 1) \s, \d, \w will match only the appropriate ASCII characters
>>> 2) [:posix:] will match only (the appropriate) ASCII characters
>>> 3) /i of ASCII characters will match only ASCII characters. eg. the
>>> Kelvin sign will not match 'k'
>>> 4) /i of non-ASCII characters will obey Unicode semantics, eg, a
>>> capital and lower case Greek beta will match, as will the Angstrom
>>> sign and an A with a circle above.
>
> To make it clear, 4) includes the 128-255 range characters.
>
>>> 5) \p{} will match in the full Unicode range, so that \p{Nd} will
>>> match many more characters than the 10 matched by \d.
>>> 6) All of the above is true as well on EBCDIC platforms whose native
>>> character set is Latin1. ie. under /a they would behave identically
>>> as an ASCII platform would.
>>
>>
>> I'm confused by 3). Considering that the Kelvin sign isn't ASCII, I'm
>> not sure what you mean by this.
>
> perl5.8.9 -Mcharnames=:full -E 'say "\N{KELVIN SIGN}" =~ /k/i'
> 1
>
> Unicode rules say that the Kelvin sign and k are supposed to match case
> insensitively, and perl has done that for a long time, since the target
> string in the example above is utf8. Previous comments on this topic
> said that people didn't want ASCII characters matching anything outside
> ASCII, and that seems the right thing to me.
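
The Kelvin sign case in point 3 can be tried directly. The snippet below is only a sketch, and it rests on one assumption that is not part of the proposal text: in the Perl releases that later shipped a modifier of this kind (5.14 and later), the "ASCII never matches non-ASCII under /i" rule appears to be spelled /aa, with a single /a leaving /i alone. The variable names are illustrative.

```perl
use v5.14;          # assumption: a perl new enough to know /a and /aa
use warnings;
use charnames ':full';

my $kelvin = "\N{KELVIN SIGN}";    # U+212A, case-folds to ASCII 'k'

# Unicode rules: the Kelvin sign and 'k' match case-insensitively.
my $unicode_i = ($kelvin =~ /k/i)   ? 1 : 0;

# With /aa, an ASCII character in the pattern may not match a
# non-ASCII character in the target, even under /i.
my $ascii_i   = ($kelvin =~ /k/iaa) ? 1 : 0;

say "plain /i:  ", $unicode_i ? "match" : "no match";
say "/i + /aa:  ", $ascii_i   ? "match" : "no match";
```

If the assumption about /aa holds, the first line reports a match and the second does not, which is the behavior point 3 asks for.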
>
>>
>> And to clarify 1), you mean that:
>>
>> \s matches \x09 (CHARACTER TABULATION), \x0A (LINE FEED),
>> \x0C (FORM FEED), \x0D (CARRIAGE RETURN), and
>> \x20 (SPACE), with \x0B (LINE TABULATION) not included?
>>
>> \x0B is a rare enough character that I don't care much either way, but
>> since it was never included, it probably shouldn't be now.
>
> I do not propose to change the meaning of \s from what it had before
> Unicode came along. This is on p.37 of Camel v3.
>
> \s = [ \t\n\r\f]
> \w = [a-zA-Z_0-9]
> \d = [0-9]
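
A small sketch of how points 1, 2 and 5 would look in practice, assuming a perl that implements /a as described (5.14 and later, as far as the released documentation goes). The test characters and hash keys are chosen for illustration only.

```perl
use v5.14;          # assumption: a perl new enough to know /a
use warnings;

my $digit = "\x{0665}";    # ARABIC-INDIC DIGIT FIVE, general category Nd
my $beta  = "\x{03B2}";    # GREEK SMALL LETTER BETA

my %result = (
    # Without /a, \d follows Unicode rules for this string.
    digit_unicode    => ($digit =~ /^\d$/           ? 1 : 0),
    # Point 1: under /a, \d and \w are limited to ASCII.
    digit_ascii      => ($digit =~ /^\d$/a          ? 1 : 0),
    word_ascii       => ($beta  =~ /^\w$/a          ? 1 : 0),
    # Point 2: POSIX classes are limited to ASCII as well.
    posix_ascii      => ($beta  =~ /^[[:alpha:]]$/a ? 1 : 0),
    # Point 5: \p{} still covers the full Unicode range under /a.
    prop_under_a     => ($digit =~ /^\p{Nd}$/a      ? 1 : 0),
    # An ASCII digit is of course still matched.
    ascii_digit_5    => ('5'    =~ /^\d$/a          ? 1 : 0),
);

say "$_: $result{$_}" for sort keys %result;
```

The \s case is left out on purpose, since whether \x0B belongs in \s is exactly what is being discussed here.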
>
>>
>> Does your proposal also say something about locales? Personally, I
>> think that a /a should imply that locales are ignored.
>
> /a would override any locale. The characters it would match are those
> defined in the native character set, eg ord('A') = 65 on ASCII
> platforms; ord('A') = 193 on EBCDIC. If the locale effectively
> redefined 'A' to be something else, that change would be ignored.
>>
>>
>>
>> Other than that, I fully endorse the proposal.
>>
>>
>> Abigail
>>
>
Another wrinkle. In looking through the code I identified several more
escapes that perhaps ought to be restricted to ASCII by /a. Does
anyone have an opinion on these?
\h
\v
\R
\X
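
For context on what restricting these four would mean, here is a sketch of what each one matches outside ASCII today, with no /a involved. It assumes perl 5.10 or later, where \h, \v and \R are available; the sample characters and hash keys are picked for illustration.

```perl
use v5.10;          # \h, \v and \R need 5.10 or later
use strict;
use warnings;

my %beyond_ascii = (
    # \h: horizontal whitespace, here EM SPACE (U+2003)
    'h' => ("\x{2003}"  =~ /\A\h\z/ ? 1 : 0),
    # \v: vertical whitespace, here LINE SEPARATOR (U+2028)
    'v' => ("\x{2028}"  =~ /\A\v\z/ ? 1 : 0),
    # \R: generic line break, which also accepts LINE SEPARATOR
    'R' => ("\x{2028}"  =~ /\A\R\z/ ? 1 : 0),
    # \X: one grapheme cluster, here 'e' + COMBINING ACUTE ACCENT
    'X' => ("e\x{0301}" =~ /\A\X\z/ ? 1 : 0),
);

say "\\$_ matches its non-ASCII sample: $beyond_ascii{$_}"
    for sort keys %beyond_ascii;
```

Each of these currently accepts characters above 127, so an ASCII-only /a would have to narrow all four if they were included.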