Re: [RFC] Regular expression character classes and unicode.
From: demerphq
Date: November 10, 2008 13:07
Subject: Re: [RFC] Regular expression character classes and unicode.
Message ID: 9b18b3110811101307x3f10c83dh2c25b974af88618f@mail.gmail.com
2008/11/10 Abigail <abigail@abigail.be>:
> On Mon, Nov 10, 2008 at 12:06:43AM +0000, Ben Morrow wrote:
>>
>> Quoth demerphq@gmail.com (demerphq):
>> >
>> > I propose the following:
[snip]
>
>> > 2. Add a new special escape shortcut to mean "unicode word character",
>> > this would have the same semantics as \w does now on a unicode string,
>> > regardless of the internal representation of the string being matched
>> > or the pattern being matched against. I have no idea what this
>> > "unicode word character" should be. In another universe \u \U would be
>> > perfect candidates IMO, but in our universe, well, \U is taken. So
>> > what this is called is an open question.
>
>
> There are 13 unclaimed escape shortcuts: \F, \i, \I, \j, \J, \m, \M,
> \o, \O, \q, \T, \y and \Y. And only five pairs if you want to negate as
> well. But if you're going to use a unicode equivalent of \w, shouldn't
> there be unicode equivalents of \s and \d as well?
Well, we only have "room" for one more pair, and I figured a Unicode
word character was the most likely to be useful to the greatest number
of people.
>> We could instead create Unicode properties with single-character names,
>> so that the Unicode version of \w is \pw, and likewise for \d, \s, &c.
>
> I like this. And, if you disbelieve the documentation that says user
> defined properties must have names starting with 'Is' or 'In', you can
> use \pw right now:
>
> sub w {<<'--'}
> +utf8::Alphabetic
> +utf8::DecimalNumber
> 5F
> --
>
> say "abc_123" =~ /^\pw+$/;
> __END__
> 1
Hmm, you know, for me this virtually kills the need for a new syntax
for custom-defined overrides of the "standard classes".
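
For readers who would rather stay within the documented naming rule, here is a minimal sketch of the same idea using a property name that starts with 'Is'. The name IsUniWord is hypothetical and chosen only for illustration, and "+utf8::Nd" is used here as the short form of the decimal-number category from the quoted example.

```perl
use strict;
use warnings;

# Hypothetical user-defined property. The "Is" prefix follows the
# documented naming rule for user-defined properties.
sub IsUniWord {
    return <<'END';
+utf8::Alphabetic
+utf8::Nd
5F
END
}

# Plain ASCII word characters, including the underscore (U+005F).
my $ascii_ok = "abc_123" =~ /^\p{IsUniWord}+$/ ? 1 : 0;

# Greek letters, an underscore, and an Arabic-Indic digit.
my $unicode_ok = "\x{3b1}\x{3b2}_\x{663}" =~ /^\p{IsUniWord}+$/ ? 1 : 0;

# A hyphen is in none of the listed sets, so this should not match.
my $punct_ok = "abc-123" =~ /^\p{IsUniWord}+$/ ? 1 : 0;

print "$ascii_ok $unicode_ok $punct_ok\n";
```

Because \p{...} always applies Unicode rules, the match should not depend on how the string happens to be stored internally, which is the property the proposed escape was meant to provide.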
>> > There is one serious downside to the pragma idea. What should happen
>> > if you embed a pattern compiled under one set of charclass semantics
>> > into a pattern with a different set of semantics? The only way to
>> > tackle this currently is to use regex modifiers, but this then leads
>> > to problems, like for instance how does one embed this information
>> > into the pattern itself without having really insane syntax.
See above.
> In my opinion, it's best (from the programmer's POV) if patterns use the
> semantics of the environment they are compiled under - and keep them
> when interpolated (or used) in a different environment.
Yeah, but that's tough to do right now unless they are stringifiable
somehow, which kind of limits our options.
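
To make the stringification point concrete, here is a small sketch of how ordinary modifiers already survive interpolation: a qr// object stringifies with its flags embedded, so the inner pattern keeps its own semantics inside a differently compiled outer pattern. The variable names are illustrative only. A pragma-selected charclass mode would need some comparable stringified form to travel the same way.

```perl
use strict;
use warnings;

# Inner pattern compiled with /i.
my $inner = qr/perl/i;

# Stringifying it embeds the flag in the text, for example
# "(?^i:perl)" on newer perls or "(?i-xsm:perl)" on older ones.
my $as_string = "$inner";

# Outer pattern compiled without /i; $inner is interpolated as a string.
my $outer = qr/^just another $inner hacker$/;

# The inner part stays case-insensitive...
my $inner_kept = "just another PERL hacker" =~ $outer ? 1 : 0;

# ...while the outer part stays case-sensitive.
my $outer_kept = "JUST another perl hacker" =~ $outer ? 1 : 0;

print "$as_string $inner_kept $outer_kept\n";
```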
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"