Re: [perl #122853] Guarantee 0-9, A-Z, a-z character classes

From: Jarkko Hietaniemi
Date: October 30, 2014 13:08
Subject: Re: [perl #122853] Guarantee 0-9, A-Z, a-z character classes
Message ID: CAJueppvxhaDUVL2bctBvYUJ53nWYYhaPZ6tQdsaCnqjbEh8EQg@mail.gmail.com

>  [\x{04}-\N{U+09}]

I think people who ask for weird things like this should be expecting
weird results.  In other words, I wouldn't feel bad outlawing them.

The start of the range says "the 0x4 in native", the end of the range
is "the U+09, in Unicode".  It makes no sense.  If they wanted
native-native, they can write that.  If they wanted Unicode-Unicode,
they can write that.
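
To make the two unambiguous spellings concrete, here is a small sketch
(the variable names are mine, purely for illustration).  It assumes an
ASCII platform, where both classes should match the same code points;
the two forms would only be expected to differ where native code
points are not Unicode code points, as on EBCDIC.

```perl
use strict;
use warnings;

# Both ends written as native code points.
my $native  = qr/[\x{04}-\x{09}]/;
# Both ends written as Unicode code points.
my $unicode = qr/[\N{U+04}-\N{U+09}]/;

# List which of the first 16 native code points each class matches.
my $native_hits  = join ',', grep { chr($_) =~ $native  } 0x00 .. 0x0F;
my $unicode_hits = join ',', grep { chr($_) =~ $unicode } 0x00 .. 0x0F;

print "native-native:   $native_hits\n";
print "Unicode-Unicode: $unicode_hits\n";
```

Either spelling states its intent on its own; it is only the mixed
form that leaves the reader guessing.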

Similarly, think of ranges like [A-z] (that's upper-A-to-lower-z), or
[0-z] (zero-to-lower-z).  Just think in ASCII.  Should these mean
0x41-0x7a, and 0x30-0x7a?  If so, they *will* contain the [\[\\\]^_`] in
the first case, and the [:;<=>?@\[\\\]^_`] in the second.
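
A quick way to see what such a range drags in is to enumerate it.  The
sketch below assumes an ASCII platform; the helper name extras() is
made up for this example.  It lists every 7-bit character that a class
matches but that is neither a letter nor a digit.

```perl
use strict;
use warnings;

# Return the 7-bit characters matched by $class that are not
# ASCII letters or digits.
sub extras {
    my ($class) = @_;
    return join '',
        grep { $_ =~ $class && $_ !~ /[A-Za-z0-9]/ }
        map  { chr($_) } 0x00 .. 0x7f;
}

my $extras_A_z = extras(qr/[A-z]/);
my $extras_0_z = extras(qr/[0-z]/);

print "[A-z] also matches: $extras_A_z\n";
print "[0-z] also matches: $extras_0_z\n";
```

Writing [A-Za-z] or [0-9A-Za-z] instead leaves both lists empty.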

There's a lot of magic in Perl, but I think there are limits in trying
to always understand what the heck the user meant.  Aborting or
warning at least lets the user be more explicit (many a time the
better solution is to use character classes, like \p{Alpha}), instead
of relying on our guesswork.
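
As a rough illustration of being more explicit, the sketch below
compares the sloppy range with an explicit ASCII class and with
\p{Alpha}.  The sample strings and variable names are mine, chosen
only for the example.

```perl
use strict;
use warnings;

my $sloppy   = qr/^[A-z]+\z/;      # range that spans punctuation too
my $explicit = qr/^[A-Za-z]+\z/;   # ASCII letters only
my $property = qr/^\p{Alpha}+\z/;  # any Unicode alphabetic character

my $with_underscore = 'foo_bar';
my $greek           = "\x{3b1}\x{3b2}";  # Greek small alpha, beta

# Record each outcome as 1 (matches) or 0 (does not match).
my $sloppy_underscore   = $with_underscore =~ $sloppy   ? 1 : 0;
my $explicit_underscore = $with_underscore =~ $explicit ? 1 : 0;
my $property_underscore = $with_underscore =~ $property ? 1 : 0;
my $explicit_greek      = $greek =~ $explicit ? 1 : 0;
my $property_greek      = $greek =~ $property ? 1 : 0;

print "foo_bar: sloppy=$sloppy_underscore explicit=$explicit_underscore ",
      "property=$property_underscore\n";
print "Greek:   explicit=$explicit_greek property=$property_greek\n";
```

The explicit class and the property each say which set of letters is
wanted, so nothing has to be guessed from the range endpoints.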

As for the non-English-speaking view, and how locales would affect
things: well, it's complicated... (surprised?).  a-z *could* probably
mean "all the lowercase letters" for the languages where z sorts last.
But for languages where z doesn't come last, a-z doesn't feel like
"all the lowercase letters".




On Thu, Oct 30, 2014 at 7:19 AM, Father Chrysostomos via RT
<perlbug-followup@perl.org> wrote:
> On Thu Oct 30 01:25:13 2014, aristotle wrote:
>> * Father Chrysostomos via RT <perlbug-followup@perl.org> [2014-10-30 06:05]:
>> > On Wed Oct 29 21:44:19 2014, public@khwilliamson.com wrote:
>> > > The remaining question I have is what happens if only one end of the
>> > > range is a Unicode construct?
>> > >
>> > > [\N{U+04}-\x{09}]
>> > > [\x{04}-\N{U+09}]
>> > >
>> > > I think this should be deprecated,
>> >
>> > I don’t think it should be deprecated. Most of us don’t care whether
>> > our code runs on EBCDIC, so things that just work on ASCII platforms
>> > should not be deprecated or removed because of EBCDIC-accommodating
>> > reasoning.
>>
>> Are you arguing a principle here
>
> That.
>
>> or do you have code that would break?
>> (In which case, how much?)
>>
>> To me the principle behind this deprecation is not “this would not port
>> to EBCDIC so you should not be doing this” but “we are making \x and \N
>> mean different things that cannot semantically be mixed”.
>
> But on ASCII systems character ranges are simple (start at the
> Unicode codepoint specified by the left-hand character and iterate
> through them to the right-hand character).  I don’t think making them
> more complex brings any benefit.  On EBCDIC, due to the model that
> Perl follows, they are naturally complex, but that complexity needn’t
> affect code and programmers that never come in contact with EBCDIC.
>
> --
>
> Father Chrysostomos
>
>
> ---
> via perlbug:  queue: perl5 status: resolved
> https://rt.perl.org/Ticket/Display.html?id=122853



-- 
There is this special biologist word we use for 'stable'. It is
'dead'. -- Jack Cohen
