perl.perl5.porters
Re: RFC: Restatement of /a regex proposal
From: karl williamson
Date: December 4, 2010 11:21
Subject: Re: RFC: Restatement of /a regex proposal
Message ID: 4CFA946D.5040009@khwilliamson.com
Tom Christiansen wrote:
> Karl wrote:
>
>> I realized as I got further into the design that there were some
>> unstated things about what I'm proposing. So here is a complete
>> statement, AFAIK:
>
>> Using /a will have the following effects:
>
>> 1) \s, \d, \w will match only the appropriate ASCII characters
>
>> 2) [:posix:] will match only (the appropriate) ASCII characters
>
> Your reference to POSIX reminds me that I'm not entirely sure how the /l
> or (?l) locale flag quite works out. The /a would override a use re /l
> or /u that was in scope, right? I guess the only forbidden thing would
> be to try to specify more than one of those in the same pattern or use
> re declaration. Is that so, or have I misunderstood the way /a and /l
> and /u are envisioned to behave?
Thanks for pointing out this missing bit of information. Currently, the
three modifiers /d, /u, and /l are mutually exclusive, and the only error in
using them is to specify more than one at the same time, as in
"(?dl:...)" or 'use re "/lu"'. My proposal is to add /a to this
list, mutually exclusive with the others. The same overriding rules
would apply, though they are not very clear-cut. Currently, 'use locale' has
precedence over any 'use feature "unicode_strings"', and 'use re "/foo"'
has precedence over them both. (I can't remember how 'use bytes' fits
into it; I think it has higher precedence than the others.) But /a
is actually simpler, as there would be only one way to specify that it
is to be the default, namely 'use re "/a"'. (The others have multiple
ways because they affect more operations than just regexes.)
>
>> 3) /i of ASCII characters will match only ASCII characters,
>> e.g., the Kelvin sign will not match 'k'.
>
>> 4) /i of non-ASCII characters will obey Unicode semantics, e.g., a
>> capital and lower case Greek beta will match, as will the Angstrom
>> sign and an A with a circle above.
>
>> 5) \p{} will match in the full Unicode range, so that \p{Nd} will
>> match many more characters than the 10 matched by \d.
>
>> 6) All of the above is true as well on EBCDIC platforms whose native
>> character set is Latin1, i.e., under /a they would behave identically
>> to an ASCII platform.
>
> I no longer recall enough about EBCDIC to say anything about it at all.
> My last experience may have been thirty years ago with the Sperry UNIVAC,
> where we just as often packed up six 6-bit RAD-50 characters into one
> 36-bit word. Or maybe that was for the DEC machines? Possibly both.
>
> Except for those muddying points above that need clarifying, I believe that all makes
> good sense, that it is desirable, and probably also that it is necessary.
>
> I can report that I am unfond of the typed-strings and the typed-
> patterns that you need to use in some of the other languages,
> especially Python, where things blow up with an exception if you
> ever apply the wrong type of pattern to the wrong type of string.
> It's very annoying to forever have to hold in your mind which flavor
> you are or are not using. It's like all the many kinds of pointers
> in C++'s Boost library: no thanks!
>
> Java's (?u) flag in patterns does nothing more than enable Unicode case
> matching, and then only in conjunction with (?i), so you usually see
> it written (?iu) if they don't use the flags argument to Pattern.compile.
>
> (?iu) makes things like
>
> 017F LATIN SMALL LETTER LONG S
>
> match "s" or "S", and it works both ways, so
>
> Pattern.compile("(?iu)s").matcher("\u017F").find()
>
> returns true, as does
>
> Pattern.compile("(?iu)\\u017F").matcher("s").find()
>
> --tom
>
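
For anyone who wants to try the Java behavior Tom describes, here is a minimal sketch. The class name UnicodeCaseDemo and the helper found() are made up for illustration; only Pattern.compile, Matcher.find, and the CASE_INSENSITIVE and UNICODE_CASE flags come from java.util.regex. It contrasts (?i) alone, which as far as I know folds case only within US-ASCII, with (?iu).

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of any library.
class UnicodeCaseDemo {
    // Hypothetical helper: true if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Same check as "(?iu)", but using the flags argument to Pattern.compile.
    static boolean foundWithFlags(String regex, String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String longS = "\u017F"; // U+017F LATIN SMALL LETTER LONG S

        // (?i) alone: case-insensitive matching restricted to US-ASCII.
        System.out.println("(?i)s  vs long s: " + found("(?i)s", longS));

        // (?iu): Unicode-aware case-insensitive matching, in both directions.
        System.out.println("(?iu)s vs long s: " + found("(?iu)s", longS));
        System.out.println("(?iu)long s vs s: " + found("(?iu)\\u017F", "s"));

        // Equivalent to (?iu), expressed through the flags argument.
        System.out.println("flags  vs long s: " + foundWithFlags("s", longS));
    }
}
```

The first line should report false and the other three true, which is the distinction between ASCII-only and Unicode case folding that items 3 and 4 of the proposal deal with on the Perl side.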