perl.perl5.porters | Postings from September 2010

RFC: m//a restrict matching to ASCII

From: karl williamson
Date: September 21, 2010 11:28
Subject: RFC: m//a restrict matching to ASCII
Message ID: 4C98F926.5010204@khwilliamson.com

I've been thinking about the issue often raised here of, e.g., making 
\d match only 0-9, and now have a concrete proposal.

I'm proposing a '/a' regex modifier that would restrict matches of \d, 
\s, \w, and the POSIX character classes (e.g., [[:alpha:]]) to 
characters in the ASCII character set.  This would hold even for 
utf8-encoded patterns and targets.
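
To make the intended effect concrete, here is a minimal sketch. It assumes a perl in which the proposal is available (5.14 or later accepts an "a" flag), and it uses the inline (?a:...) form rather than a suffix. The variable names are illustrative only.

```perl
use strict;
use warnings;
use 5.014;   # assumption: a perl new enough to accept the (?a) flag

# ARABIC-INDIC DIGIT ONE: a digit under Unicode rules, but not ASCII.
my $arabic_one = "\x{0661}";

# Without the flag, \d follows Unicode rules for this string.
my $default = $arabic_one =~ /\d/      ? 1 : 0;

# With the flag, \d is restricted to the ASCII digits 0-9.
my $ascii   = $arabic_one =~ /(?a:\d)/ ? 1 : 0;

# An ordinary ASCII digit matches either way.
my $plain   = "7" =~ /(?a:\d)/         ? 1 : 0;

print "default=$default ascii=$ascii plain=$plain\n";
```

The only line that should change its answer because of the flag is the middle match; ASCII input is unaffected.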

I'm leaning against having this affect case-insensitive matching, thus 
'"\N{LATIN SMALL LETTER LONG S}" =~ /s/ai' would still be true.
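
A short sketch of that distinction, again assuming a perl (5.14 or later) that accepts the "a" flag. The code point is written as \x{017F} instead of \N{...} so the snippet does not depend on charnames being loaded.

```perl
use strict;
use warnings;
use 5.014;   # assumption: a perl new enough to accept the (?a) flag

my $long_s = "\x{017F}";   # LATIN SMALL LETTER LONG S

# Case-insensitive matching is left alone, so "s" still folds to LONG S.
my $folds = $long_s =~ /(?ai)s/ ? 1 : 0;

# The character classes are restricted, so LONG S is not a \w character.
my $word  = $long_s =~ /(?a)\w/ ? 1 : 0;

print "folds=$folds word=$word\n";
```

Under the behaviour described above, the first match should succeed and the second should fail.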

The modifier would be added automatically when a regex is compiled in 
the scope of something like 'use re "ascii"'.  It could also be 
explicitly stated in a (?a...) construct.  It could not be expressed as 
a /suffix in 5.14, unless Jesse changes his mind.
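
As a rough illustration of the scoped form: the pragma name 'use re "ascii"' above is tentative, and this sketch instead assumes the spelling that released perls (5.14 and later) accept, use re '/a'. The effect is lexical, so only regexes compiled inside the block pick up the flag.

```perl
use strict;
use warnings;
use 5.014;   # assumption: a perl that supports use re '/flags'

my $digit = "\x{0661}";   # ARABIC-INDIC DIGIT ONE, not an ASCII digit

# Compiled outside the scope: \d follows Unicode rules.
my $outside = $digit =~ /\d/ ? 1 : 0;

my $inside;
{
    use re '/a';          # flag is added to regexes compiled in this block
    $inside = $digit =~ /\d/ ? 1 : 0;
}

print "outside=$outside inside=$inside\n";
```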

I think there are 3 features that this could interact with, i.e., what 
happens if a regex is compiled in the scope of any combination of:
    use re 'ascii'
    use bytes
    use locale
    use feature 'unicode_strings'

1) bytes. I don't think that there is any conflict, as all ascii chars 
are single bytes anyway.

2) locale.  I believe locale should have precedence.

3) unicode_strings.  I believe ascii should have precedence.

If you have an inquiring mind, the current implementation is that bytes 
has highest precedence, followed by locale, then unicode_strings.  I 
propose to insert ascii between locale and unicode_strings in the 
pecking order.
