karl williamson
December 4, 2010 15:12
Re: RFC: Restatement of /a regex proposal
Message ID:
Abigail wrote:
> On Sat, Dec 04, 2010 at 10:18:19AM -0700, karl williamson wrote:
>> I realized as I got further into the design that there were some  
>> unstated things about what I'm proposing.  So here is a complete  
>> statement, AFAIK:
>> Using /a will have the following effects:
>> 1) \s, \d, \w will match only the appropriate ASCII characters
>> 2) [:posix:] will match only (the appropriate) ASCII characters
>> 3) /i of ASCII characters will match only ASCII characters.  eg. the  
>> Kelvin sign will not match 'k'
>> 4) /i of non-ASCII characters will obey Unicode semantics, eg, a capital  
>> and lower case Greek beta will match, as will the Angstrom sign and an A  
>> with a circle above.

To make it clear, 4) includes the 128-255 range characters.

>> 5) \p{} will match in the full Unicode range, so that \p{Nd} will match  
>> many more characters than the 10 matched by \d.
>> 6) All of the above is true as well on EBCDIC platforms whose native  
>> character set is Latin1. ie. under /a they would behave identically as  
>> an ASCII platform would.
> I'm confused by 3). Considering that the Kelvin sign isn't ASCII, I'm
> not sure what you mean by this.

perl5.8.9 -Mcharnames=:full -le 'print "\N{KELVIN SIGN}" =~ /k/i'

Unicode rules say that the Kelvin sign and k are supposed to match case 
insensitively, and perl has done that for a long time, since the target 
string in the example above is utf8.  Previous comments on this topic 
said that people didn't want ASCII characters matching anything outside 
ASCII, and that seems the right thing to me.
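
A minimal sketch of points 3) and 4), for readers who want to try them. It rests on an assumption not stated in this thread: that it is run on Perl 5.14 or later, where the released form of this proposal spells the ASCII/non-ASCII restriction as a doubled modifier, /aa. The variable names are illustrative only.

```perl
use v5.14;      # assumption: /aa is only available from Perl 5.14 on
use warnings;

my $kelvin   = "\x{212A}";   # U+212A KELVIN SIGN
my $angstrom = "\x{212B}";   # U+212B ANGSTROM SIGN

# Unicode rules: KELVIN SIGN folds to 'k', so plain /i matches.
my $kelvin_unicode = $kelvin =~ /k/i ? 1 : 0;

# Point 3: an ASCII 'k' must not match a non-ASCII character.
my $kelvin_ascii = $kelvin =~ /k/iaa ? 1 : 0;

# Point 4: both characters are non-ASCII, so Unicode folding still
# applies; ANGSTROM SIGN folds to U+00E5 (a with ring above).
my $angstrom_fold = $angstrom =~ /\A\xE5\z/iaa ? 1 : 0;

say "KELVIN SIGN   =~ /k/i      : $kelvin_unicode";
say "KELVIN SIGN   =~ /k/iaa    : $kelvin_ascii";
say "ANGSTROM SIGN =~ /\\xE5/iaa : $angstrom_fold";
```

The code points are written as \x{...} escapes rather than \N{...} names so the sketch does not depend on charnames being loaded.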

> And to clarify 1), you mean that:
>   \s matches \x09 (CHARACTER TABULATION), \x0A (LINE FEED), 
>              \x0C (FORM FEED), \x0D (CARRIAGE RETURN), and
>              \x20 (SPACE), with \x0B (LINE TABULATION) not included?
> \x0B is a rare enough character that I don't care much either way, but
> since it was never included, it probably shouldn't be now.

I do not propose to change the meaning of \s from what it had before 
Unicode came along.  This is on p.37 of Camel v3.

\s = [ \t\n\r\f]
\w = [a-zA-Z_0-9]
\d = [0-9]
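
As a rough illustration of the difference between these classes and the Unicode ones, the sketch below writes the three definitions out as explicit bracketed classes and compares them with \d and \p{Nd}. It assumes Perl 5.14 or later for the /a modifier as eventually released; the variable names and the choice of ARABIC-INDIC DIGIT ONE as a test character are mine, not part of the proposal.

```perl
use v5.14;      # assumption: /a is only available from Perl 5.14 on
use warnings;

# The pre-Unicode definitions, written as explicit classes.
my $ascii_space = qr/[ \t\n\r\f]/;
my $ascii_word  = qr/[a-zA-Z_0-9]/;
my $ascii_digit = qr/[0-9]/;

my $vt         = "\x0B";       # LINE TABULATION, not in the \s list
my $arabic_one = "\x{0661}";   # ARABIC-INDIC DIGIT ONE

my $vt_is_space    = $vt         =~ /\A$ascii_space\z/ ? 1 : 0;
my $digit_explicit = $arabic_one =~ /\A$ascii_digit\z/ ? 1 : 0;
my $digit_default  = $arabic_one =~ /\A\d\z/           ? 1 : 0;
my $digit_a        = $arabic_one =~ /\A\d\z/a          ? 1 : 0;
my $digit_nd       = $arabic_one =~ /\A\p{Nd}\z/a      ? 1 : 0;

say "x0B in [ \\t\\n\\r\\f]  : $vt_is_space";
say "U+0661 in [0-9]      : $digit_explicit";
say "U+0661 =~ \\d         : $digit_default";
say "U+0661 =~ \\d under /a: $digit_a";
say "U+0661 =~ \\p{Nd} /a  : $digit_nd";
```

Only the explicit class is tested against \x0B here, because whether a bare \s matches it depends on the Perl version.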

> Does your proposal also say something about locales? Personally, I 
> think that a /a should imply that locales are ignored.

/a would override any locale.  The characters it would match are those 
defined in the native character set, eg ord('A') = 65 on ASCII 
platforms; ord('A') = 193 on EBCDIC.  If the locale effectively 
redefined 'A' to be something else, that change would be ignored.
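
A small sketch of what "native character set" means here: the code point of 'A' tells the two families of platform apart. The "unknown" branch and the variable names are illustrative assumptions.

```perl
use strict;
use warnings;

# ord('A') is 65 on ASCII platforms and 193 on EBCDIC ones.
my $code     = ord('A');
my $platform = $code == 65  ? 'ASCII'
             : $code == 193 ? 'EBCDIC'
             :                'unknown';

print "ord('A') = $code ($platform)\n";
```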

> Other than that, I fully endorse the proposal.
> Abigail

