Re: [RFC] Regular expression character classes and unicode.
From: demerphq
Date: November 12, 2008 01:04
Subject: Re: [RFC] Regular expression character classes and unicode.
Message ID: 9b18b3110811120104r1faa8976ke2aad1d877ad5a2a@mail.gmail.com

2008/11/11 karl williamson <public@khwilliamson.com>:
> demerphq wrote:
>>
>> 2008/11/10 Abigail <abigail@abigail.be>:
>>>
>>> On Mon, Nov 10, 2008 at 12:06:43AM +0000, Ben Morrow wrote:
>>>>
>>>> Quoth demerphq@gmail.com (demerphq):
>>>>>
>>>>> I propose the following:
>>
>> [snip]
>>
>>>>> 2. Add a new special escape shortcut to mean "unicode word character",
>>>>> this would have the same semantics as \w does now on a unicode string,
>>>>> regardless of the internal representation of the string being matched
>>>>> or the pattern being matched against. I have no idea what this
>>>>> "unicode word character" should be. In another universe \u \U would be
>>>>> perfect candidates IMO, but in our universe, well, \U is taken. So
>>>>> what this is called is an open question.
>>>
>>> There are 13 unclaimed escape shortcuts: \F, \i, \I, \j, \J, \m, \M,
>>> \o, \O, \q, \T, \y and \Y. And only five pairs if you want to negate as
>>> well. But if you're going to use a unicode equivalent of \w, shouldn't
>>> there be unicode equivalents of \s and \d as well?
>>
>> Well, we only have "room" for one more pair, and I figured unicode
>> word character was the most likely to be useful to the greatest number
>> of people.
>>
>>>> We could instead create Unicode properties with single-character names,
>>>> so that the Unicode version of \w is \pw, and likewise for \d, \s, &c.
>>>
>>> I like this. And, if you disbelieve the documentation that says user
>>> defined properties must have names starting with 'Is' or 'In', you can
>>> use \pw right now:
>>>
>>> sub w {<<'--'}
>>> +utf8::Alphabetic
>>> +utf8::DecimalNumber
>>> 5F
>>> --
>>>
>>> say "abc_123" =~ /^\pw+$/;
>>> __END__
>>> 1
>>
>> Hmm, you know, for me, this virtually kills the need for a new syntax
>> for custom defined "standard classes" overrides.
>>
>>>>> There is one serious downside to the pragma idea. What should happen
>>>>> if you embed a pattern compiled under one set of charclass semantics
>>>>> into a pattern with a different set of semantics? The only way to
>>>>> tackle this currently is to use regex modifiers, but this then leads
>>>>> to problems, like for instance how does one embed this information
>>>>> into the pattern itself without having really insane syntax.
>>
>> See above.
>>
>>> In my opinion, it's best (from the programmer's POV) if patterns use the
>>> semantics of the environment they are compiled under - and keep them
>>> when interpolated (or used) in a different environment.
>>
>> Yeah, but that's tough to do right now unless they are stringifiable
>> somehow, which kinda limits our options.
>>
>> Yves
>>
>>
>
> I'm not sure I understand the proposal completely.
>
> As I understand it, the problem really stems from the fact that what perl
> defines for things like \p{Graph} differs from the posix [:graph:].
>
> Here's what I'd like to see. The old-style character class shortcuts should
> match what they did before unicode came along. So that \w matches only an
> ASCII word character unless changed by using a locale. Same for the
> posix-style classes.
That's what my proposal amounts to, and what the patch I applied does
if the appropriate define is set in regcomp.h.
> I like the idea of using \pw to match a unicode word
> character. But what about the posix style? Does one type '\p{[:graph:]}',
> or can we change what \p{IsGraph} means to be the same in the ASCII range as
> the posix class from which I think it was meant to come?
One would type [[:graph:]] or \p{IsPosixGraph}; they would work the
same on unicode strings as on non-unicode ones.
I don't plan to redefine \p{IsGraph}, as doing so would just introduce
avoidable breakage.
> A goal should be to not have the utf8ness of a string matter in the
> semantics. Consequently, I think that no utf8 upgrades should occur when a
> locale is specified, or at least a warning issued.
I'm not following your last comment here. Does "use locale" sometimes
cause string upgrades?
> I was taught the name "C locale". It appears that a more modern equivalent
> name is the "POSIX locale". I looked it up, and it is defined only on the
> range 0-127, with semantics of characters outside that range "unspecified".
> This means that one can define semantics for any code point one wishes
> above 127 and still call it the POSIX locale, as long as it follows certain
> rules that are specified, eg, no character can be both alphabetic and
> punctuation. I mention this FYI. I suspect that when people do use the
> term POSIX locale (or C), that they think that the characters above 127 are
> undefined. Am I right?
I think so, yes.
> So what about characters above 127? Well, in a locale they should behave as
> they should for that locale. For non-locale, I think, as I said above, that
> \w should not match any of them without regard to utf8ness, but \W should
> match all of them, etc. The same for [:graph:] none; [:^graph:] all. \p
> should match as in Unicode semantics regardless of utf8ness.
Agreed.
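
A minimal sketch of the utf8ness dependence under discussion, assuming a perl with neither "use locale" nor the unicode_strings feature in effect. The helper name is_word_char is made up for illustration and is not part of any proposal in this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the one-character string matches \w, else 0.
sub is_word_char {
    my ($str) = @_;
    return $str =~ /^\w\z/ ? 1 : 0;
}

my $native   = "\xE9";      # U+00E9, held internally as a single byte
my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # same character, UTF-8 internal representation

print "same string:  ", ($native eq $upgraded ? "yes" : "no"), "\n";
print "native \\w:    ", is_word_char($native), "\n";
print "upgraded \\w:  ", is_word_char($upgraded), "\n";
```

Under those assumptions the two strings compare equal, yet only the upgraded one matches \w. That difference is what "without regard to utf8ness" is meant to eliminate.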
> And I guess, we should have a pragma that allows the current behavior to be
> retained. I don't see the need for three different states. Keep in mind
> that I'm planning a pragma to cause utf8ness to not matter, which would
> become the default in 5.12. Does the behavior need to be separable so that
> we need two pragmas? If so, I like combining them so, for example to modify
> Rafael's idea to have:
>
> use legacy qw(re_charclass utf8_strings);
I don't want to muddy the waters of this discussion with other aspects
of this problem, as I don't think we necessarily want, nor need, an "all
in one" pragma.
That is why I proposed using the re pragma to control the regex
related behaviour.
> But there are problems with pragmas. As I've discovered, the charnames
> pragma goes away in an eval.
I think this is a bug.
> That this should happen was not obvious to me,
> and I suspect not so to the average Perl programmer. It doesn't DWIM, and
> I'm not convinced it is the right thing to do. It means that complementing
> the default from release to release can cause programs to have to add pragma
> calls to their evals.
I think we should fix this bug.
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"