
Re: [RFC] Regular expression character classes and unicode.

From: demerphq
Date: November 10, 2008 13:07
Subject: Re: [RFC] Regular expression character classes and unicode.
Message ID: 9b18b3110811101307x3f10c83dh2c25b974af88618f@mail.gmail.com

2008/11/10 Abigail <abigail@abigail.be>:
> On Mon, Nov 10, 2008 at 12:06:43AM +0000, Ben Morrow wrote:
>>
>> Quoth demerphq@gmail.com (demerphq):
>> >
>> > I propose the following:
[snip]

>
>> > 2. Add a new special escape shortcut to mean "unicode word character",
>> > this would have the same semantics as \w does now on a unicode string,
>> > regardless of the internal representation of the string being matched
>> > or the pattern being matched against. I have no idea what this
>> > "unicode word character" should be. In another universe \u \U would be
>> > perfect candidates IMO, but in our universe, well, \U is taken. So
>> > what this is called is an open question.
>
>
> There are 13 unclaimed escape shortcuts: \F, \i, \I, \j, \J, \m, \M,
> \o, \O, \q, \T, \y and \Y. And only five pairs if you want to negate as
> well. But if you're going to use a unicode equivalent of \w, shouldn't
> there be unicode equivalents of \s and \d as well?

Well, we only have "room" for one more pair, and I figured a Unicode
word character was the most likely to be useful to the greatest number
of people.
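
For reference, the representation dependence that item 2 is meant to
avoid can be shown with a minimal sketch. It assumes default regex
semantics: no locale, no unicode_strings feature, and no explicit
charset modifier on the pattern. The variable names are only
illustrative.

```perl
use strict;
use warnings;

# U+00E9 (e with acute) held as a single native byte, not as UTF-8.
my $native = "\xe9";

# The same character, but with the internal UTF-8 representation forced.
my $upgraded = "\xe9";
utf8::upgrade($upgraded);

# Under the assumed default semantics, \w is applied with ASCII rules to
# the native copy and with Unicode rules to the upgraded copy.
my $native_matches   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_matches = $upgraded =~ /\w/ ? 1 : 0;

print "strings compare equal: ", ($native eq $upgraded ? "yes" : "no"), "\n";
print "native copy matches \\w:   $native_matches\n";
print "upgraded copy matches \\w: $upgraded_matches\n";
```

The two strings compare equal, yet under those assumptions only the
upgraded copy matches \w. A dedicated "unicode word character" escape
would give the same answer for both.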

>> We could instead create Unicode properties with single-character names,
>> so that the Unicode version of \w is \pw, and likewise for \d, \s, &c.
>
> I like this. And, if you disbelieve the documentation that says user
> defined properties must have names starting with 'Is' or 'In', you can
> use \pw right now:
>
>    sub w {<<'--'}
>    +utf8::Alphabetic
>    +utf8::DecimalNumber
>    5F
>    --
>
>    say "abc_123" =~ /^\pw+$/;
>    __END__
>    1

Hmm, you know, for me this virtually kills the need for a new syntax
for custom-defined "standard class" overrides.
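
The same idea also works while staying inside the documented naming
rule. A rough sketch, where IsUniWord is a made-up name (only the "Is"
prefix comes from the documentation) and the property lines are copied
from Abigail's example above:

```perl
use strict;
use warnings;

# Hypothetical user-defined property; the name IsUniWord is an assumption.
# Each line adds a built-in property or a single hex code point.
sub IsUniWord {
    return <<'END';
+utf8::Alphabetic
+utf8::DecimalNumber
5F
END
}

# Labels are printed instead of the strings to avoid wide-character output.
my %samples = (
    ascii_word   => "abc_123",
    greek_word   => "\x{3b1}\x{3b2}_1",   # Greek alpha, beta, "_", "1"
    with_a_hyphen => "abc-123",
);

my %result;
for my $label (sort keys %samples) {
    $result{$label} = $samples{$label} =~ /^\p{IsUniWord}+$/ ? 1 : 0;
    print "$label: ", ($result{$label} ? "match" : "no match"), "\n";
}
```

Since \p{...} goes through the property lookup rather than the \w
shortcut, this does not need any new escape letter, at the cost of a
longer spelling.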

>> > There is one serious downside to the pragma idea. What should happen
>> > if you embed a pattern compiled under one set of charclass semantics
>> > into a pattern with a different set of semantics? The only way to
>> > tackle this currently is to use regex modifiers, but this then leads
>> > to problems, like for instance how does one embed this information
>> > into the pattern itself without having really insane syntax.

See above.

> In my opinion, it's best (from the programmer's POV) if patterns use the
> semantics of the environment they are compiled under - and keep them
> when interpolated (or used) in a different environment.

Yeah, but that's tough to do right now unless they are stringifiable
somehow. Which kinda limits our options.
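
To illustrate what "stringifiable" buys us: modifiers such as /i
already survive interpolation because they are written into the
stringified form of the qr// object. A small sketch (the pattern text
and variable names are just for illustration):

```perl
use strict;
use warnings;

# The inner pattern is compiled case-insensitively; the outer one is not.
my $inner = qr/perl/i;
my $outer = qr/^just another ${inner} hacker$/;

# Case differences are tolerated only inside the interpolated part.
my $inner_case_differs = "just another PERL hacker" =~ $outer ? 1 : 0;
my $outer_case_differs = "JUST another perl hacker" =~ $outer ? 1 : 0;

# The exact stringified form varies between perl versions, but it
# carries the i flag as an embedded modifier group.
print "stringified inner: $inner\n";
print "inner case differs: ", ($inner_case_differs ? "match" : "no match"), "\n";
print "outer case differs: ", ($outer_case_differs ? "match" : "no match"), "\n";
```

Charclass semantics chosen by a pragma would need a comparable
in-pattern spelling to be carried along in the same way, which is where
the syntax problem comes from.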

Yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
