Re: RFC: Restatement of /a regex proposal

karl williamson
December 4, 2010 11:21
Tom Christiansen wrote:
> Karl wrote:
>> I realized as I got further into the design that there were some 
>> unstated things about what I'm proposing.  So here is a complete 
>> statement, AFAIK:
>> Using /a will have the following effects:
>> 1) \s, \d, \w will match only the appropriate ASCII characters
>> 2) [:posix:] will match only (the appropriate) ASCII characters
> Your reference to POSIX reminds me that I'm not entirely sure how the /l
> or (?l) locale flag quite works out.  The /a would override a use re /l
> or /u that was in scope, right?  I guess the only forbidden thing would
> be to try to specify more than one of those in the same pattern or use
> re declaration.  Is that so, or have I misunderstood the way /a and /l
> and /u are envisioned to behave?

Thanks for pointing out this missing bit of information.  Currently, all 
three of /d, /u, and /l are mutually exclusive, and the only error in 
using them is to specify more than one at the same time, as in 
"(?dl:...)" or 'use re "/lu"'.  My proposal is to add /a to this list, 
mutually exclusive with the others.  The same overriding rules would 
apply, and they are not very clear-cut.  Currently, 'use locale' has 
precedence over any 'use feature "unicode_strings"', and 'use re "/foo"' 
has precedence over them both.  (I can't remember how 'use bytes' fits 
into it; I think it has higher precedence than the others.)  But /a is 
actually simpler, as there would be only one way to specify that it is 
to be the default, namely 'use re "/a"'.  (The others have multiple ways 
because they affect more operations than just regexes.)
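
To make the exclusivity and scoping rules concrete, here is a minimal 
sketch.  It assumes a Perl in which /a exists alongside /d, /u, and /l 
(released Perls from 5.14 on); the helper compiles() and the variable 
names are made up for illustration.

```perl
use strict;
use warnings;
require 5.014;    # assumption: /a and 'use re "/flags"' are available

# Hypothetical helper: true if the pattern string compiles as a regex.
sub compiles {
    my ($pattern) = @_;
    return defined eval { qr/$pattern/ } ? 1 : 0;
}

# U+0663 ARABIC-INDIC DIGIT THREE: a digit under Unicode rules, not in ASCII.
my $arabic_three = "\x{0663}";

# No charset modifier: the string is wide, so Unicode rules apply to \d.
my $default_match = $arabic_three =~ /^\d$/ ? 1 : 0;

my $scoped_match;
{
    use re '/a';    # /a is the default for regexes compiled in this block
    $scoped_match = $arabic_three =~ /^\d$/ ? 1 : 0;
}

# One charset modifier at a time is fine; two together is a compile error.
my $single_ok   = compiles('(?a:\d)');
my $combined_ok = compiles('(?al:\d)');

print "default \\d matches U+0663:      $default_match\n";
print "under use re '/a':              $scoped_match\n";
print "(?a:\\d) compiles:               $single_ok\n";
print "(?al:\\d) compiles:              $combined_ok\n";
```

The block-scoped 'use re "/a"' only changes the default; an explicit 
modifier on an individual pattern still takes effect for that pattern.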

>> 3) /i of ASCII characters will match only ASCII characters.  
>>    eg. the Kelvin sign will not match 'k'
>> 4) /i of non-ASCII characters will obey Unicode semantics, eg, a
>>    capital and lower case Greek beta will match, as will the Angstrom
>>    sign and an A with a circle above.
>> 5) \p{} will match in the full Unicode range, so that \p{Nd} will
>>    match many more characters than the 10 matched by \d.
>> 6) All of the above is true as well on EBCDIC platforms whose native
>>    character set is Latin1. ie. under /a they would behave identically
>>    as an ASCII platform would.
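
A minimal sketch of points 1 through 5, written against a released Perl 
(5.14 or later) rather than against the proposal text.  One assumption 
to flag loudly: in released Perl the /i restriction of point 3 is 
requested with the doubled /aa modifier, so the case-insensitive checks 
below use /aai.  The labels and the @checks structure are illustrative 
only.

```perl
use strict;
use warnings;
require 5.014;    # assumption: the /a and /aa modifiers are available

# Each entry is [ description, 1 if the behaviour was observed, else 0 ].
my @checks = (
    # Points 1 and 2: \d and POSIX classes are ASCII-only under /a.
    [ '\d under /a rejects U+0663',
        ( "\x{0663}" =~ /^\d$/a ? 0 : 1 ) ],
    [ '[[:alpha:]] under /a rejects U+00E9',
        ( "\x{E9}" =~ /^[[:alpha:]]$/a ? 0 : 1 ) ],
    [ '[[:alpha:]] under /u accepts U+00E9',
        ( "\x{E9}" =~ /^[[:alpha:]]$/u ? 1 : 0 ) ],

    # Point 3: KELVIN SIGN (U+212A) folds to "k" under plain Unicode /i,
    # but the ASCII-restricted form keeps ASCII and non-ASCII apart.
    [ 'KELVIN SIGN matches k under /i',
        ( "\x{212A}" =~ /^k$/i ? 1 : 0 ) ],
    [ 'KELVIN SIGN does not match k under /aai',
        ( "\x{212A}" =~ /^k$/aai ? 0 : 1 ) ],

    # Point 4: non-ASCII characters still case-fold among themselves.
    [ 'Greek beta matches capital beta under /aai',
        ( "\x{3B2}" =~ /^\x{392}$/aai ? 1 : 0 ) ],
    [ 'ANGSTROM SIGN matches a-with-ring under /aai',
        ( "\x{212B}" =~ /^\x{E5}$/aai ? 1 : 0 ) ],

    # Point 5: \p{} keeps the full Unicode range even under /a.
    [ '\p{Nd} under /a accepts U+0663',
        ( "\x{0663}" =~ /^\p{Nd}$/a ? 1 : 0 ) ],
);

# Count the checks whose expected behaviour was not observed.
my $failed = grep { !$_->[1] } @checks;

printf "%-48s %s\n", $_->[0], ( $_->[1] ? 'ok' : 'NOT ok' ) for @checks;
print "failed checks: $failed\n";
```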
> I no longer recall enough about EBCDIC to say anything about it at all. 
> My last experience may have been thirty years ago with the Sperry UNIVAC,
> where we just as often packed up six 6-bit RAD-50 characters into one
> 36-bit word.  Or maybe that was for the DEC machines?  Possibly both.
> Except for the muddying points above that need clarifying, I believe that
> all makes good sense, that it is desirable, and probably also that it is
> necessary.
> I can report that I am unfond of the typed-strings and the typed-
> patterns that you need to use in some of the other languages,
> especially Python, where things blow up with an exception if you
> ever apply the wrong type of pattern to the wrong type of string.
> It's very annoying to forever have to hold in your mind which flavor
> you are or are not using. It's like all the many kinds of pointers
> in C++'s Boost library: no thanks!
> Java's (?u) flag in patterns does nothing more than enable Unicode case
> matching, and then only in conjunction with (?i), so you usually see
> it written (?iu) if they don't use the flags argument to Pattern.compile.
> (?iu) makes things like "\u017F" (LATIN SMALL LETTER LONG S)
> match "s" or "S", and it works both ways, so   
>     Pattern.compile("(?iu)s").matcher("\u017F").find()
> returns true, as does 
>     Pattern.compile("(?iu)\\u017F").matcher("s").find()
> --tom
