Re: [RFC] Regular expression character classes and unicode.
From:
karl williamson
Date:
November 13, 2008 21:41
Subject:
Re: [RFC] Regular expression character classes and unicode.
Message ID:
491D0F5E.7070409@khwilliamson.com
karl williamson wrote:
> demerphq wrote:
>> 2008/11/11 karl williamson <public@khwilliamson.com>:
>>> demerphq wrote:
>>>> 2008/11/10 Abigail <abigail@abigail.be>:
>>>>> On Mon, Nov 10, 2008 at 12:06:43AM +0000, Ben Morrow wrote:
>>>>>> Quoth demerphq@gmail.com (demerphq):
>>>>>>> I propose the following:
>>>> [snip]
>>>>
>>>>>>> 2. Add a new special escape shortcut to mean "unicode word
>>>>>>> character",
>>>>>>> this would have the same semantics as \w does now on a unicode
>>>>>>> string,
>>>>>>> regardless of the internal representation of the string being
>>>>>>> matched
>>>>>>> or the pattern being matched against. I have no idea what this
>>>>>>> "unicode word character" should be. In another universe \u \U
>>>>>>> would be
>>>>>>> perfect candidates IMO, but in our universe, well, \U is taken. So
>>>>>>> what this is called is an open question.
>>>>> There are 13 unclaimed escape shortcuts: \F, \i, \I, \j, \J, \m,
>>>>> \M,
>>>>> \o, \O, \q, \T, \y and \Y. And only five pairs if you want to
>>>>> negate as
>>>>> well. But if you're going to use a unicode equivalent of \w, shouldn't
>>>>> there be unicode equivalents of \s and \d as well?
>>>> Well, we only have "room" for one more pair, and I figured a Unicode
>>>> word character was the most likely to be useful to the greatest number
>>>> of people.
>>>>
>>>>>> We could instead create Unicode properties with single-character
>>>>>> names,
>>>>>> so that the Unicode version of \w is \pw, and likewise for \d, \s,
>>>>>> &c.
>>>>> I like this. And, if you disbelieve the documentation that says user
>>>>> defined properties must have names starting with 'Is' or 'In', you can
>>>>> use \pw right now:
>>>>>
>>>>> sub w {<<'--'}
>>>>> +utf8::Alphabetic
>>>>> +utf8::DecimalNumber
>>>>> 5F
>>>>> --
>>>>>
>>>>> say "abc_123" =~ /^\pw+$/;
>>>>> __END__
>>>>> 1
>>>> Hmm, you know, for me this virtually kills the need for a new syntax
>>>> for custom-defined "standard classes" overrides.
>>>>
>>>>>>> There is one serious downside to the pragma idea. What should happen
>>>>>>> if you embed a pattern compiled under one set of charclass semantics
>>>>>>> into a pattern with a different set of semantics? The only way to
>>>>>>> tackle this currently is to use regex modifiers, but this then leads
>>>>>>> to problems, like for instance how does one embed this information
>>>>>>> into the pattern itself without having really insane syntax.
>>>> See above.
>>>>
>>>>> In my opinion, it's best (from the programmer's POV) if patterns use
>>>>> the
>>>>> semantics of the environment they are compiled under - and keep them
>>>>> when interpolated (or used) in a different environment.
>>>> Yeah, but that's tough to do right now unless they are stringifiable
>>>> somehow, which kind of limits our options.
>>>>
>>>> Yves
>>>>
>>>>
>>> I'm not sure I understand the proposal completely.
>>>
>>> As I understand it, the problem really stems from the fact that what
>>> perl defines for things like \p{Graph} differs from the POSIX
>>> [:graph:].
>>>
>>> Here's what I'd like to see. The old-style character class shortcuts
>>> should
>>> match what they did before unicode came along. So that \w matches
>>> only an
>>> ASCII word character unless changed by using a locale. Same for the
>>> posix-style classes.
>>
>> That's what my proposal amounts to, and what the patch I applied does
>> if the appropriate define is set in regcomp.h.
>>
>>> I like the idea of using \pw to match a Unicode word
>>> character. But what about the POSIX style? Does one type
>>> '\p{[:graph:]}', or can we change what \p{IsGraph} means so that, in
>>> the ASCII range, it is the same as the POSIX class from which I think
>>> it was meant to come?
>>
>> One would type [[:graph:]] or \p{IsPosixGraph}; they would work the
>> same on Unicode strings as they would on non-Unicode ones.
>>
>> I don't plan to redefine \p{IsGraph}, as doing so would just introduce
>> avoidable breakage.
>
> I agree. And I was wrong to imply that \p{...} derives from [:...:].
> Some of them probably do, like \p{Graph}, but most come directly from
> Unicode, and there are inconsistencies between those and any similar
> posix one. The biggest one is that Unicode splits [:punct:] into
> \p{Punct} and \p{Symbol}.
> [snip]
One thing to consider, and I don't claim to know the answer, is what to
do about the newer single-character shortcuts that have always (in their
short lives) been defined as matching non-ASCII as well: \v, \h, their
complements, and, I think (without checking), \R. Would they match only
ASCII, with non-ASCII obtained by writing, e.g., \pv?
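
To make the question concrete, here is a minimal sketch that probes how
\h, \v and \R treat a few ASCII and non-ASCII code points. The helper
name matches_class and the sample list are made up for illustration, and
the sketch assumes perl 5.10 or later, where these escapes exist. It only
reports current behaviour; it says nothing about what the behaviour
should become.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the one-character string $ch is matched
# in full by the escape named in $class ('h', 'v' or 'R').
sub matches_class {
    my ($class, $ch) = @_;
    my %re = (
        h => qr/\A\h\z/,    # horizontal whitespace
        v => qr/\A\v\z/,    # vertical whitespace
        R => qr/\A\R\z/,    # generic line break
    );
    return $ch =~ $re{$class} ? 1 : 0;
}

# Code points chosen for illustration: one ASCII and one non-ASCII
# candidate for \h and \v, plus a non-ASCII candidate for \R.
my @samples = (
    [ 'h', "\x{09}",   'TAB (ASCII)' ],
    [ 'h', "\x{2003}", 'EM SPACE (non-ASCII)' ],
    [ 'v', "\x{0A}",   'LINE FEED (ASCII)' ],
    [ 'v', "\x{2028}", 'LINE SEPARATOR (non-ASCII)' ],
    [ 'R', "\x{2029}", 'PARAGRAPH SEPARATOR (non-ASCII)' ],
);

for my $s (@samples) {
    my ($class, $ch, $name) = @$s;
    printf "\\%s %-32s %s\n", $class, $name,
        matches_class($class, $ch) ? 'match' : 'no match';
}
```

If these shortcuts were restricted to ASCII, the non-ASCII rows above
would presumably stop matching, which is the behaviour change being
asked about.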