
Re: [perl #122853] Guarantee 0-9, A-Z, a-z character classes

From:
Karl Williamson
Date:
September 29, 2014 17:34
Subject:
Re: [perl #122853] Guarantee 0-9, A-Z, a-z character classes
Message ID:
54299823.40402@khwilliamson.com
On 09/29/2014 10:53 AM, Karl Williamson wrote:
> On 09/29/2014 07:13 AM, Abigail wrote:
>> On Mon, Sep 29, 2014 at 12:55:15PM +0200, demerphq wrote:
>>> On 29 September 2014 12:43, Abigail <abigail@abigail.be> wrote:
>>>
>>>> On Mon, Sep 29, 2014 at 10:13:21AM +0000, Ed Avis wrote:
>>>>> Abigail <abigail <at> abigail.be> writes:
>>>>>
>>>>>> I've added a remark in perlrecharclass.pod. See commit
>>>>>> 2a2f23e4f8a50bdcdd10563dc5d933684cb70954
>>>>>
>>>>> Thanks.  That adds
>>>>>
>>>>> +The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
>>>>> +they always match exactly the 26 upper/lower case letters, regardless
>>>>> +of the platform (this only effects EBCDIC, which would otherwise include
>>>>> +some non-letters).
>>>>>
>>>>> I would also add
>>>>>
>>>>>      Digit sequences are and will be consecutive on all platforms Perl
>>>>>      supports, so C<< [0-3] >> always matches the digits 0123, and so on.
>>>>>
>>>>> just to cover all the bases.
>>>>
>>>>
>>>> I disagree.
>>>>
>>>> Because that gives the expectation that C<< [D-N] >> will do that as well,
>>>> but it does not.
>>>>
>>>
>>> But it probably should.
>>
>>
>> Well, that's another whole kettle of fish.
>>
>>
>> For now, I'm just concerned about documenting what Perl currently does,
>> and if it does something DWIM for [A-Z] and [a-z] on EBCDIC, then it should
>> be documented, independent of whether we want to change the meaning of
>> [D-N] in the future or not.
>>
>>
>>
>> Abigail
>>
>
> [D-N] means [DEFGHIJKLMN] on EBCDIC platforms, and that is how it has
> worked, according to perlebcdic, since 5.005_03
>

I don't understand where the idea that our current behavior is horrible
is coming from.

Any subrange of [a-z] or [A-Z] is (and has been) specially handled so
that on EBCDIC platforms it matches the same characters it matches on
ASCII platforms.  Hence qr/[i-j]/i matches [ijIJ] on both ASCII and
EBCDIC platforms.
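
A minimal sketch of how one might check that claim on whatever platform
is at hand.  It only tests the single-byte range 0..255, and the
variable names are mine, not anything from the core:

```perl
use strict;
use warnings;

# The literal range from the example above, with case-insensitivity.
my $re = qr/[i-j]/i;

# Collect every character in the native range 0..255 that the pattern matches.
my @matched = grep { $_ =~ $re } map { chr } 0 .. 255;

# Compare as a set, because the native sort order differs between
# ASCII and EBCDIC platforms.
my $got      = join '', sort @matched;
my $expected = join '', sort qw(i j I J);

print $got eq $expected
    ? "matches exactly i, j, I and J\n"
    : "matched something else\n";
```

Comparing sorted strings rather than printing the matches directly keeps
the check independent of the platform's collating order.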

The special handling applies only if both ends of the range are
literals.  In EBCDIC, \xC9 is 'I' and \xD1 is 'J'.  If you specify any
of [\xC9-J], [I-\xD1], or [\xC9-\xD1], you get all the code points C9,
CA, CB, CC, CD, CE, CF, D0, and D1.  This is how it has worked,
apparently since 5.005_03, and is how I think it should continue to
work.  In other words, I think we got the design right.
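
To see what sits between 'I' and 'J', here is a rough sketch that can be
run on an ASCII box.  It assumes the cp1047 code page as a stand-in for
"EBCDIC" (other EBCDIC code pages differ in places) and uses the core
Encode module to decode each byte from C9 through D1; %char_at is just
an illustrative name:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Map each byte in the range \xC9-\xD1 to the character it represents
# in cp1047 (an assumption: one particular EBCDIC code page).
my %char_at = map { $_ => decode('cp1047', chr($_)) } 0xC9 .. 0xD1;

for my $byte (sort { $a <=> $b } keys %char_at) {
    my $char = $char_at{$byte};
    my $kind = $char =~ /\A[A-Za-z]\z/ ? 'letter' : 'not an ASCII letter';
    printf "0x%02X  U+%04X  %s\n", $byte, ord($char), $kind;
}
```

Only the two endpoints decode to letters; everything in between is the
gap that the literal form [I-J] skips and the escaped forms include.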

No special handling is required for 0-9, as they are contiguous in both
ASCII and EBCDIC.  This is likely true in any native character set.  The
POSIX standard effectively mandates that the digits in any locale be in
1 or 2 groups of 10 consecutive code points whose numerical values are
also consecutive, starting with zero.  Unicode now does the same.
(There was an exception to this that I brought to their attention, and
they quickly changed it, without the usual drama.)
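
The contiguity of the digits is easy to spot-check.  The sketch below
looks at the native code points and, as an assumption, uses cp1047 via
the core Encode module to represent EBCDIC; is_consecutive is a helper
made up for this example:

```perl
use strict;
use warnings;
use Encode qw(encode);

my @digits = ('0' .. '9');

# Native code points, and the bytes the same digits occupy in cp1047.
my @native = map { ord } @digits;
my @ebcdic = map { my $d = $_; ord(encode('cp1047', $d)) } @digits;

# True if every value is exactly one more than its predecessor.
sub is_consecutive {
    my @cp = @_;
    my $gaps = grep { $cp[$_] != $cp[0] + $_ } 0 .. $#cp;
    return $gaps == 0;
}

printf "native: %s\n", is_consecutive(@native) ? 'consecutive' : 'has gaps';
printf "cp1047: %s\n", is_consecutive(@ebcdic) ? 'consecutive' : 'has gaps';
```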

