develooper Front page | perl.perl5.porters | Postings from December 2010

Re: RFC: Restatement of /a regex proposal

From: Tom Christiansen
Date: December 4, 2010 10:39
Subject: Re: RFC: Restatement of /a regex proposal
Message ID: 12474.1291487921@chthon

Karl wrote:

> I realized as I got further into the design that there were some 
> unstated things about what I'm proposing.  So here is a complete 
> statement, AFAIK:

> Using /a will have the following effects:

> 1) \s, \d, \w will match only the appropriate ASCII characters

> 2) [:posix:] will match only (the appropriate) ASCII characters

Your reference to POSIX reminds me that I'm not entirely sure how the /l
or (?l) locale flag quite works out.  The /a would override a use re /l
or /u that was in scope, right?  I guess the only forbidden thing would
be to try to specify more than one of those in the same pattern or use
re declaration.  Is that so, or have I misunderstood the way /a and /l
and /u are envisioned to behave?

> 3) /i of ASCII characters will match only ASCII characters.  
>    eg. the Kelvin sign will not match 'k'

> 4) /i of non-ASCII characters will obey Unicode semantics, eg, a
>    capital and lower case Greek beta will match, as will the Angstrom
>    sign and an A with a circle above.

> 5) \p{} will match in the full Unicode range, so that \p{Nd} will
>    match many more characters than the 10 matched by \d.

> 6) All of the above is true as well on EBCDIC platforms whose native
>    character set is Latin1. ie. under /a they would behave identically
>    as an ASCII platform would.

I no longer recall enough about EBCDIC to say anything about it at all. 
My last experience may have been thirty years ago with the Sperry UNIVAC,
where we just as often packed up six 6-bit RAD-50 characters into one
36-bit word.  Or maybe that was for the DEC machines?  Possibly both.

Except for the muddy points above that I'd like clarified, I believe that all makes 
good sense, that it is desirable, and probably also that it is necessary.

I can report that I am unfond of the typed-strings and the typed-
patterns that you need to use in some of the other languages,
especially Python, where things blow up with an exception if you
ever apply the wrong type of pattern to the wrong type of string.
It's very annoying to forever have to hold in your mind which flavor
you are or are not using. It's like all the many kinds of pointers
in C++'s Boost library: no thanks!

Java's (?u) flag in patterns does nothing more than enable Unicode case
matching, and then only in conjunction with (?i), so you usually see
it written (?iu) when people don't use the flags argument to Pattern.compile.

(?iu) makes things like

    017F  LATIN SMALL LETTER LONG S

match "s" or "S", and it works both ways, so

    Pattern.compile("(?iu)s").matcher("\u017F").find()

returns true, as does 

    Pattern.compile("(?iu)\\u017F").matcher("s").find()
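
For anyone who wants to run those two calls, here is a minimal self-contained
sketch. The class name CaseFoldDemo and the helper finds() are made up for
illustration and are not part of any library. It also tries plain (?i) without
the u, which by default folds case only for US-ASCII characters, so the long s
should not match there.

```java
import java.util.regex.Pattern;

public class CaseFoldDemo {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String longS = "\u017F"; // LATIN SMALL LETTER LONG S

        // (?iu): case-insensitive with Unicode case matching, both directions.
        System.out.println(finds("(?iu)s", longS));     // true
        System.out.println(finds("(?iu)\\u017F", "s")); // true

        // (?i) alone: only US-ASCII case folding, so no match either way.
        System.out.println(finds("(?i)s", longS));      // false
        System.out.println(finds("(?i)\\u017F", "s"));  // false
    }
}
```

The same flags can be passed as the second argument to Pattern.compile, as
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, instead of being embedded in
the pattern.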

--tom
