
Re: [RFC] Regular expression character classes and unicode.

From:
demerphq
Date:
November 12, 2008 01:04
Subject:
Re: [RFC] Regular expression character classes and unicode.
Message ID:
9b18b3110811120104r1faa8976ke2aad1d877ad5a2a@mail.gmail.com
2008/11/11 karl williamson <public@khwilliamson.com>:
> demerphq wrote:
>>
>> 2008/11/10 Abigail <abigail@abigail.be>:
>>>
>>> On Mon, Nov 10, 2008 at 12:06:43AM +0000, Ben Morrow wrote:
>>>>
>>>> Quoth demerphq@gmail.com (demerphq):
>>>>>
>>>>> I propose the following:
>>
>> [snip]
>>
>>>>> 2. Add a new special escape shortcut to mean "unicode word character",
>>>>> this would have the same semantics as \w does now on a unicode string,
>>>>> regardless of the internal representation of the string being matched
>>>>> or the pattern being matched against. I have no idea what this
>>>>> "unicode word character" should be. In another universe \u \U would be
>>>>> perfect candidates IMO, but in our universe, well, \U is taken. So
>>>>> what this is called is an open question.
>>>
>>> There are 13 unclaimed escape shortcuts: \F, \i, \I, \j, \J, \m, \M,
>>> \o, \O, \q, \T, \y and \Y. And only five pairs if you want to negate as
>>> well. But if you're going to use a unicode equivalent of \w, shouldn't
>>> there be unicode equivalents of \s and \d as well?
>>
>> Well, we only have "room" for one more pair, and I figured unicode
>> word character was the most likely to be useful to the greatest number
>> of people.
>>
>>>> We could instead create Unicode properties with single-character names,
>>>> so that the Unicode version of \w is \pw, and likewise for \d, \s, &c.
>>>
>>> I like this. And, if you disbelieve the documentation that says user
>>> defined properties must have names starting with 'Is' or 'In', you can
>>> use \pw right now:
>>>
>>>   sub w {<<'--'}
>>>   +utf8::Alphabetic
>>>   +utf8::DecimalNumber
>>>   5F
>>>   --
>>>
>>>   say "abc_123" =~ /^\pw+$/;
>>>   __END__
>>>   1
>>
>> Hmm, you know, for me, this virtually kills the need for a new syntax
>> for custom-defined "standard classes" overrides.
>>
>>>>> There is one serious downside to the pragma idea. What should happen
>>>>> if you embed a pattern compiled under one set of charclass semantics
>>>>> into a pattern with a different set of semantics? The only way to
>>>>> tackle this currently is to use regex modifiers, but this then leads
>>>>> to problems, like for instance how does one embed this information
>>>>> into the pattern itself without having really insane syntax.
>>
>> See above.
>>
>>> In my opinion, it's best (from the programmer's POV) if patterns use the
>>> semantics of the environment they are compiled under - and keep them
>>> when interpolated (or used) in a different environment.
>>
>> Yeah, but that's tough to do right now unless they are stringifiable
>> somehow. Which kinda limits our options.
>>
>> Yves
>>
>>
>
> I'm not sure I understand the proposal completely.
>
> As I understand it, the problem really stems from the fact that what perl
> defines for things like \p{Graph} differs from the posix [:graph:].
>
> Here's what I'd like to see.  The old-style character class shortcuts should
> match what they did before unicode came along.  So that \w matches only an
> ASCII word character unless changed by using a locale. Same for the
> posix-style classes.

That's what my proposal amounts to, and what the patch I applied does
if the appropriate define is set in regcomp.h.

> I like the idea of using \pw to match a unicode word
> character.   But what about the posix style?  Does one type '\p{[:graph:]}',
> or can we change what \p{IsGraph} means to be the same in the ASCII range as
> the posix class from which I think it was meant to come?

One would type [[:graph:]] or \p{IsPosixGraph}; they would work the
same on Unicode strings as on non-Unicode strings.

I don't plan to redefine \p{IsGraph}, as doing so would just introduce
avoidable breakage.
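
To make the distinction concrete, here is a minimal sketch. It assumes a
reasonably recent perl with no locale or feature pragmas in effect. The
property name IsAsciiGraph is made up for illustration (it is not the
proposed \p{IsPosixGraph}, only a user-defined stand-in for "graph,
restricted to ASCII"), built the same way as the user-defined \pw
example quoted above.

```perl
use strict;
use warnings;

# Hypothetical user-defined property (name is an assumption, not part of
# the proposal): printable, non-space ASCII, i.e. code points 0x21..0x7E.
sub IsAsciiGraph { return "21\t7E\n" }

my $smiley = "\x{263A}";    # WHITE SMILING FACE, well outside ASCII

# Current behaviour: on a string holding code points above 255 the POSIX
# class follows Unicode rules, so the smiley counts as a graph character.
my $posix_class_smiley = $smiley =~ /^[[:graph:]]$/      ? 1 : 0;

# The ASCII-restricted property gives the same answer regardless of how
# the string is stored: only 0x21..0x7E match.
my $ascii_prop_smiley  = $smiley =~ /^\p{IsAsciiGraph}$/ ? 1 : 0;
my $ascii_prop_a       = "a"     =~ /^\p{IsAsciiGraph}$/ ? 1 : 0;

print "[[:graph:]] on smiley:      $posix_class_smiley\n";
print "\\p{IsAsciiGraph} on smiley: $ascii_prop_smiley\n";
print "\\p{IsAsciiGraph} on 'a':    $ascii_prop_a\n";
```

Under the proposal the first test would behave like the second one,
while \p{IsGraph} would keep its current Unicode meaning.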

> A goal should be to not have the utf8ness of a string matter in the
> semantics.  Consequently, I think that no utf8 upgrades should occur when a
> locale is specified, or at least a warning issued.

I'm not following your last comment here. Does "use locale" sometimes
cause string upgrades?

> I was taught the name "C locale".  It appears that a more modern equivalent
> name is the "POSIX locale".  I looked it up, and it is defined only on the
> range 0-127; with semantics of characters outside that range "unspecified".
>  This means that one can define semantics for any code point one wishes
> above 127 and still call it the POSIX locale, as long as it follows certain
> rules that are specified, eg, no character can be both alphabetic and
> punctuation.  I mention this FYI.  I suspect that when people do use the
> term POSIX locale (or C), that they think that the characters above 127 are
> undefined.  Am I right?

I think so, yes.

> So what about characters above 127?  Well, in a locale they should behave as
> they should for that locale.  For non-locale, I think, as I said above, that
> \w should not match any of them without regard to utf8ness, but \W should
> match all of them, etc.  The same for [:graph:] none; [:^graph:] all.  \p
> should match as in Unicode semantics regardless of utf8ness.

Agreed.
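
For anyone who has not run into the utf8ness dependence being discussed,
a minimal sketch of it follows. It assumes a reasonably recent perl (5.14
or later) with no locale or feature pragmas in effect; older perls may
differ in detail. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $native   = "\xe9";      # e-acute, stored as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);   # same character, UTF-8 internal representation

# The two strings compare equal...
my $same_string = $native eq $upgraded ? 1 : 0;

# ...yet \w gives different answers depending on the representation.
my $w_native   = $native   =~ /^\w$/ ? 1 : 0;
my $w_upgraded = $upgraded =~ /^\w$/ ? 1 : 0;

# \p{...} follows Unicode semantics for both representations.
my $p_native   = $native   =~ /^\p{Alphabetic}$/ ? 1 : 0;
my $p_upgraded = $upgraded =~ /^\p{Alphabetic}$/ ? 1 : 0;

print "eq: $same_string\n";
print "\\w native/upgraded: $w_native/$w_upgraded\n";
print "\\p{Alphabetic} native/upgraded: $p_native/$p_upgraded\n";
```

With the semantics agreed above, both \w tests would fail (e-acute is
not an ASCII word character) and both \p tests would succeed.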

> And I guess, we should have a pragma that allows the current behavior to be
> retained.  I don't see the need for three different states.  Keep in mind
> that I'm planning a pragma to cause utf8ness to not matter, which would
> become the default in 5.12.  Does the behavior need to be separable so that
> we need two pragmas?  If so, I like combining them so, for example to modify
> Rafael's idea to have:
>
> use legacy qw(re_charclass utf8_strings);

I don't want to muddy the waters of this discussion with other aspects
of this problem, as I don't think we necessarily want, nor need, an "all
in one" pragma.

Which is why I proposed using the re pragma to control the regex
related behaviour.
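
As a rough illustration of what lexically scoped control through the re
pragma can look like, the sketch below uses the "use re '/a'" form found
in later perl releases (5.14 onward), which restricts \w, \d, \s and the
POSIX classes to ASCII inside its scope. It is shown only as an analogy
and is not the interface proposed in this thread.

```perl
use strict;
use warnings;

my $s = "\xe9";             # e-acute
utf8::upgrade($s);          # force the UTF-8 internal representation

# Outside the pragma's scope: the string is UTF-8 internally, so \w
# follows Unicode rules and matches.
my $w_default = $s =~ /^\w$/ ? 1 : 0;

my ($w_ascii, $p_ascii);
{
    use re '/a';            # applies to patterns compiled in this block

    # \w is now restricted to [A-Za-z0-9_], so e-acute no longer matches.
    $w_ascii = $s =~ /^\w$/ ? 1 : 0;

    # \p{...} is unaffected and keeps its Unicode meaning.
    $p_ascii = $s =~ /^\p{Alphabetic}$/ ? 1 : 0;
}

print "default \\w: $w_default\n";
print "/a \\w: $w_ascii, /a \\p{Alphabetic}: $p_ascii\n";
```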

> But there are problems with pragmas.  As I've discovered, the charnames
> pragma goes away in an eval.

I think this is a bug.

> That this should happen was not obvious to me,
> and I suspect not so to the average Perl programmer.  It doesn't DWIM, and
> I'm not convinced it is the right thing to do.  It means that complementing
> the default from release to release can cause programs to have to add pragma
> calls to their evals.

I think we should fix this bug.

Yves




-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


