Re: [perl #122853] Guarantee 0-9, A-Z, a-z character classes

From: demerphq
Date: September 29, 2014 18:26
Subject: Re: [perl #122853] Guarantee 0-9, A-Z, a-z character classes
Message ID: CANgJU+UZEJfcDHYi6mepn5Hmbru8xKMjZE1=0kC9NYMXAnnf5w@mail.gmail.com

On 29 September 2014 19:34, Karl Williamson <public@khwilliamson.com> wrote:

> On 09/29/2014 10:53 AM, Karl Williamson wrote:
>
>> On 09/29/2014 07:13 AM, Abigail wrote:
>>
>>> On Mon, Sep 29, 2014 at 12:55:15PM +0200, demerphq wrote:
>>>
>>>> On 29 September 2014 12:43, Abigail <abigail@abigail.be> wrote:
>>>>
>>>>  On Mon, Sep 29, 2014 at 10:13:21AM +0000, Ed Avis wrote:
>>>>>
>>>>>> Abigail <abigail <at> abigail.be> writes:
>>>>>>
>>>>>>  I've added a remark in perlrecharclass.pod. See commit
>>>>>>> 2a2f23e4f8a50bdcdd10563dc5d933684cb70954
>>>>>>>
>>>>>>
>>>>>> Thanks.  That adds
>>>>>>
>>>>>> +The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
>>>>>> +they always match exactly the 26 upper/lower case letters, regardless
>>>>>> +of the platform (this only effects EBCDIC, which would otherwise include
>>>>>> +some non-letters).
>>>>>>
>>>>>> I would also add
>>>>>>
>>>>>>      Digit sequences are and will be consecutive on all platforms Perl
>>>>>>      supports, so C<< [0-3] >> always matches the digits 0123, and so on.
>>>>>>
>>>>>> just to cover all the bases.
>>>>>>
>>>>>
>>>>>
>>>>> I disagree.
>>>>>
>>>>> Because that gives the expectation that C<< [D-N] >> will do that as well,
>>>>> but it does not.
>>>>>
>>>>>
>>>> But it probably should.
>>>>
>>>
>>>
>>> Well, that's another whole kettle of fish.
>>>
>>>
>>> For now, I'm just concerned about documenting what Perl currently does,
>>> and if it does something DWIM for [A-Z] and [a-z] on EBCDIC, than it should
>>> be documented, independent on whether we want to change to meaning of [D-N]
>>> in the future or no.
>>>
>>>
>>>
>>> Abigail
>>>
>>>
>> [D-N] means [DEFGHIJKLMN] on EBCDIC platforms, and that is how it has
>> worked, according to perlebcdic, since 5.005_03
>>
>>
> I'm not understanding where the idea that we currently have horrible
> behavior is coming from.
>
>
The docs aren't very clear on this. I don't see anything that spells this
issue out like you have below.


> Any subset of the ranges [a-z] and [A-Z] is (and has been) specially
> handled to match on EBCDIC platforms the same equivalent characters it
> matches on ASCII platforms.  Hence qr/[i-j]/i, matches [ijIJ] on both ASCII
> and EBCDIC platforms.
>
>
I think this is the problem. Why does this apply to [a-z] and [A-Z] only?
Why not to all literals?
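
To make the quoted claim concrete, here is a minimal sketch that lists what
qr/[i-j]/i matches among code points 0-255. It assumes an ASCII platform;
per the description above, an EBCDIC platform should print the same four
letters. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Case-insensitive range whose endpoints are both literal letters.
my $re = qr/[i-j]/i;

# Collect every character in 0x00-0xFF that the class matches.
my @matched = grep { $_ =~ $re } map { chr } 0 .. 255;

print join('', sort @matched), "\n";   # IJij on an ASCII platform
```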


> The special handling is only valid if both ends of the range are
> literals.  In EBCDIC, \xC9 is 'I' and \xD1 is 'J'.  If you specify any of
> [\xC9-J], [I-\xD1] , or [\xC9-\xD1], you get all the code points C9, CA,
> CB, CC, CD, CE, CF, and D1.  This is how it has worked since apparently
> 5.005_03, and is how I think it should continue to work.  In other words, I
> think we got the design right.
>

For ranges involving non-literals I agree. But I don't think this design is
sane for literals.

In other words, I think a rule that said that "literals in character
classes will be interpreted according to the Unicode specification" is a
better rule than what you described.

I don't suppose we can change it now but the current rules seem
unnecessarily confusing.
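
A hedged sketch of the literal versus non-literal distinction: it compares
[I-J] with the same endpoints written as \x escapes for whatever code points
'I' and 'J' have on the running platform. The helper matched_hex is a
hypothetical name, not something from perl itself. On an ASCII platform both
lines come out identical; on EBCDIC, going by the description quoted above,
the escaped form would also list the code points lying between the two
letters.

```perl
use strict;
use warnings;

# List, in hex, the code points in 0x00-0xFF that a pattern matches.
sub matched_hex {
    my ($re) = @_;
    return join ' ', map { sprintf '%02X', $_ }
                     grep { chr($_) =~ $re } 0 .. 255;
}

# Range written with literal endpoints.
my $literal = qr/[I-J]/;

# The same endpoints written as \x escapes of this platform's code points.
my $pattern = sprintf '[\\x%02X-\\x%02X]', ord('I'), ord('J');
my $escaped = qr/$pattern/;

print "literal: ", matched_hex($literal), "\n";   # 49 4A on ASCII
print "escaped: ", matched_hex($escaped), "\n";   # 49 4A on ASCII
```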

The docs on ranges in perlrecharclass.pod say this:

       Character Ranges

       It is not uncommon to want to match a range of characters. Luckily,
       instead of listing all characters in the range, one may use the
       hyphen ("-").  If inside a bracketed character class you have two
       characters separated by a hyphen, it's treated as if all characters
       between the two were in the class. For instance, "[0-9]" matches any
       ASCII digit, and "[a-m]" matches any lowercase letter from the first
       half of the old ASCII alphabet.

       Note that the two characters on either side of the hyphen are not
       necessarily both letters or both digits. Any character is possible,
       although not advisable.  "['-?]" contains a range of characters, but
       most people will not know which characters that means.  Furthermore,
       such ranges may lead to portability problems if the code has to run
       on a platform that uses a different character set, such as EBCDIC.

       If a hyphen in a character class cannot syntactically be part of a
       range, for instance because it is the first or the last character of
       the character class, or if it immediately follows a range, the
       hyphen isn't special, and so is considered a character to be matched
       literally.  If you want a hyphen in your set of characters to be
       matched and its position in the class is such that it could be
       considered part of a range, you must escape that hyphen with a
       backslash.

       Examples:

        [a-z]       #  Matches a character that is a lower case ASCII letter.
        [a-fz]      #  Matches any letter between 'a' and 'f' (inclusive) or
                    #  the letter 'z'.
        [-z]        #  Matches either a hyphen ('-') or the letter 'z'.
        [a-f-m]     #  Matches any letter between 'a' and 'f' (inclusive), the
                    #  hyphen ('-'), or the letter 'm'.
        ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
                    #  (But not on an EBCDIC platform).
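
The documented examples can be checked with a short sketch like the one
below. It assumes an ASCII platform (the last class is documented as
behaving differently on EBCDIC), and the helper name members is made up for
this illustration.

```perl
use strict;
use warnings;

# Return, as one string in code point order, every ASCII character
# (0x00-0x7F) that the given pattern matches.
sub members {
    my ($re) = @_;
    return join '', grep { $_ =~ $re } map { chr } 0 .. 127;
}

print members(qr/[a-fz]/),  "\n";   # abcdefz
print members(qr/[-z]/),    "\n";   # -z
print members(qr/[a-f-m]/), "\n";   # -abcdefm
print members(qr/['-?]/),   "\n";   # '()*+,-./0123456789:;<=>?
```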

If I read this carefully, with your mails fully in mind, I can see how what
you say and what it says agree, or perhaps better, do not disagree. However,
a quick reading of the second paragraph might lead someone to think that
character class ranges are in general not portable, or to miss the
significance of ASCII in the descriptions.

Also in perlre:

       (The following all specify the same class of three characters:
       "[-az]", "[az-]", and "[a\-z]".  All are different from "[a-z]",
       which specifies a class containing twenty-six characters, even on
       EBCDIC-based character sets.)  Also, if you try to use the character
       classes "\w", "\W", "\s", "\S", "\d", or "\D" as endpoints of a
       range, the "-" is understood literally.

       Note also that the whole range idea is rather unportable between
       character sets--and even within character sets they may cause
       results you probably didn't expect.  A sound principle is to use
       only ranges that begin from and end at either alphabetics of equal
       case ([a-e], [A-E]), or digits ([0-9]).  Anything else is unsafe.
       If in doubt, spell out the character sets in full.
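
A small sketch of the perlre parenthetical, assuming an ASCII platform for
the enumeration; class_members is a hypothetical helper name. It shows the
three spellings yielding the same three-character class, while the
unescaped range yields twenty-six characters.

```perl
use strict;
use warnings;

# Return every ASCII character (0x00-0x7F) matched by the pattern,
# joined in code point order.
sub class_members {
    my ($re) = @_;
    return join '', grep { $_ =~ $re } map { chr } 0 .. 127;
}

# Three spellings of the same class: hyphen, 'a', 'z'.
for my $re (qr/[-az]/, qr/[az-]/, qr/[a\-z]/) {
    print class_members($re), "\n";              # -az each time
}

# An unescaped hyphen between 'a' and 'z' forms a range instead.
print length(class_members(qr/[a-z]/)), "\n";    # 26
```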

Now again, when I read that with what you said in mind I understand that
they are in agreement.

But your mail spelled it out a whole lot more clearly than any of the docs I
found.

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
