
Re: RFC: Restatement of /a regex proposal

karl williamson
December 5, 2010 21:03
karl williamson wrote:
> Abigail wrote:
>> On Sat, Dec 04, 2010 at 10:18:19AM -0700, karl williamson wrote:
>>> I realized as I got further into the design that there were some  
>>> unstated things about what I'm proposing.  So here is a complete  
>>> statement, AFAIK:
>>> Using /a will have the following effects:
>>> 1) \s, \d, \w will match only the appropriate ASCII characters
>>> 2) [:posix:] will match only (the appropriate) ASCII characters
>>> 3) /i of ASCII characters will match only ASCII characters, e.g. the
>>> Kelvin sign will not match 'k'
>>> 4) /i of non-ASCII characters will obey Unicode semantics, e.g. a
>>> capital and lower case Greek beta will match, as will the Angstrom
>>> sign and an A with a circle above.
> To make it clear, 4) includes the 128-255 range characters.
>>> 5) \p{} will match in the full Unicode range, so that \p{Nd} will
>>> match many more characters than the 10 matched by \d.
>>> 6) All of the above is true as well on EBCDIC platforms whose native
>>> character set is Latin1, i.e. under /a they would behave identically
>>> to an ASCII platform.
>> I'm confused by 3). Considering that the Kelvin sign isn't ASCII, I'm
>> not sure what you mean by this.
> perl5.8.9 -Mcharnames=:full -le 'print "\N{KELVIN SIGN}" =~ /k/i'
> 1
> Unicode rules say that the Kelvin sign and k are supposed to match case 
> insensitively, and perl has done that for a long time, since the target 
> string in the example above is utf8.  Previous comments on this topic 
> said that people didn't want ASCII characters matching anything outside 
> ASCII, and that seems the right thing to me.
>> And to clarify 1), you mean that:
>>   \s matches \x09 (CHARACTER TABULATION), \x0A (LINE FEED), 
>>              \x0C (FORM FEED), \x0D (CARRIAGE RETURN), and
>>              \x20 (SPACE), with \x0B (LINE TABULATION) not included?
>> \x0B is a rare enough character that I don't care much either way, but
>> since it was never included, it probably shouldn't be now.
> I do not propose to change the meaning of \s from what it had before
> Unicode came along.  This is on p.37 of Camel v3.
> \s = [ \t\n\r\f]
> \w = [a-zA-Z_0-9]
> \d = [0-9]
>> Does your proposal also say something about locales? Personally, I 
>> think that a /a should imply that locales are ignored.
> /a would override any locale.  The characters it would match are those 
> defined in the native character set, eg ord('A') = 65 on ASCII 
> platforms; ord('A') = 193 on EBCDIC.  If the locale effectively 
> redefined 'A' to be something else, that change would be ignored.
>> Other than that, I fully endorse the proposal.
>> Abigail
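
As a rough illustration of the definitions quoted above (a sketch only, not
part of the proposal itself): the pre-Unicode classes can be spelled out as
explicit bracketed classes, which need no regex modifier at all. The helper
name matches_class and the sample characters are assumptions chosen purely
for illustration.

```perl
use strict;
use warnings;

# Pre-Unicode definitions quoted above (Camel v3, p.37), written as
# explicit bracketed classes so that no regex modifier is involved.
my %ascii_class = (
    'space' => qr/\A[ \t\n\r\f]\z/,
    'word'  => qr/\A[a-zA-Z_0-9]\z/,
    'digit' => qr/\A[0-9]\z/,
);

# Hypothetical helper: is the single character $char in the named class?
sub matches_class {
    my ($name, $char) = @_;
    return $char =~ $ascii_class{$name} ? 1 : 0;
}

my $arabic_one  = "\x{661}";     # ARABIC-INDIC DIGIT ONE, category Nd
my $kelvin_sign = "\x{212A}";    # KELVIN SIGN

printf "%-34s %d\n", q{'7' in [0-9]},            matches_class('digit', "7");
printf "%-34s %d\n", q{U+0661 in [0-9]},         matches_class('digit', $arabic_one);
printf "%-34s %d\n", q{U+0661 matches \p{Nd}},   ($arabic_one =~ /\p{Nd}/ ? 1 : 0);
printf "%-34s %d\n", q{\x0B in [ \t\n\r\f]},     matches_class('space', "\x0B");
printf "%-34s %d\n", q{KELVIN SIGN matches /k/i},   ($kelvin_sign =~ /k/i   ? 1 : 0);
printf "%-34s %d\n", q{KELVIN SIGN matches /[kK]/}, ($kelvin_sign =~ /[kK]/ ? 1 : 0);
```

The last two lines contrast the existing /k/i behaviour (the match shown in
the one-liner quoted above) with an explicit [kK] class, which is roughly
what item 3 asks /a to guarantee for ASCII letters under /i.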

Another wrinkle.  In looking through the code I identified several more 
possible things that might ought to be restricted to ASCII by /a.  Does 
anyone have an opinion on these?:




