
Re: [perl #122853] Guarantee 0-9, A-Z, a-z character classes

Karl Williamson
October 2, 2014 03:41
On 09/29/2014 12:26 PM, demerphq wrote:
>     Any subset of the ranges [a-z] and [A-Z] is (and has been) specially
>     handled to match on EBCDIC platforms the same equivalent characters
>     it matches on ASCII platforms.  Hence qr/[i-j]/i, matches [ijIJ] on
>     both ASCII and EBCDIC platforms.
> I think this is the problem. Why does this apply to [a-z] and [A-Z]
> only? Why not to all literals?
>     The special handling is only valid if both ends of the range are
>     literals.  In EBCDIC, \xC9 is 'I' and \xD1 is 'J'.  If you specify
>     any of [\xC9-J], [I-\xD1] , or [\xC9-\xD1], you get all the code
>     points C9, CA, CB, CC, CD, CE, CF, and D1.  This is how it has
>     worked since apparently 5.005_03, and is how I think it should
>     continue to work.  In other words, I think we got the design right.
> For ranges involving non-literals I agree. But I don't think this design
> is sane for literals.
> In other words, I think a rule that said that "literals in character
> classes will be interpreted according to the Unicode specification" is a
> better rule than what you described.
> I don't suppose we can change it now but the current rules seem
> unnecessarily confusing.

I'm not sure I understand your point here.  [%] matches an ASCII percent 
on an ASCII platform, and an EBCDIC percent on an EBCDIC platform.  The 
code is perfectly portable.  All literal characters match properly on 
both platforms, and would continue to do so if Perl were ever ported to 
yet another platform.  (The odds of that happening are infinitesimal, I 

But there are only three cases where it is obvious what should be in a 
range of literals.  Those are any subsets of A-Z, a-z, and 0-9.  Perl 
takes special action to handle those as DWIM.
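
A minimal sketch of what that special handling guarantees, using only core Perl. The helper name class_members is made up for illustration; it lists which of the 256 native code points a pattern matches, so the same checks can be run on both an ASCII and an EBCDIC build:

```perl
use strict;
use warnings;

# Hypothetical helper: return the native characters (code points 0..255)
# matched by the given compiled pattern, in code point order.
sub class_members {
    my ($re) = @_;
    return join '', grep { $_ =~ $re } map { chr } 0 .. 255;
}

# Literal end points within a-z, A-Z, or 0-9: the intent is unambiguous.
print length(class_members(qr/\A[a-z]\z/)), "\n";   # number of lowercase letters
print length(class_members(qr/\A[0-9]\z/)), "\n";   # number of digits
print class_members(qr/\A[i-j]\z/), "\n";           # members of a small subrange
```

On an ASCII build this prints 26, 10, and ij; the point of the special handling is that an EBCDIC build should report the same members even though 'i' and 'j' are not adjacent code points there.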

The only other ASCII literal characters are punctuation and space. 
There is no natural language intrinsic ordering of them, and hence 
ranges with these as end points are obfuscations of what is really 

Perl need not take special efforts to handle obfuscated code.  I doubt 
that there is anybody on this list who knows immediately what [%-{] 
matches, or [|-&].  These match differently on EBCDIC than on ASCII.  It 
is too late to change this behavior, nor do I think it would be 
desirable to do so.
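
For anyone who wants to see what such a range matches on the platform at hand, here is a rough sketch (the helper name range_members is hypothetical). It compiles the class at run time inside eval, because a range whose end points are in descending order on the current platform is a fatal regex compilation error:

```perl
use strict;
use warnings;

# Hypothetical helper: list the native characters (0..255) matched by
# [$class], or return an empty list if the class does not compile here.
sub range_members {
    my ($class) = @_;
    my $re = eval { qr/\A[$class]\z/ };
    return unless defined $re;
    return grep { $_ =~ $re } map { chr } 0 .. 255;
}

for my $class ('%-{', '|-&') {
    my @members = range_members($class);
    printf "[%s]: %s\n", $class,
        @members ? scalar(@members) . " characters" : "not a valid range here";
}
```

On an ASCII build the first class runs from 0x25 to 0x7B, so it takes in every digit and letter along with assorted punctuation, and the second is rejected because '|' (0x7C) sorts after '&' (0x26).  The code points differ on EBCDIC, so the results do too.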

This from the docs you quoted is right: "A sound principle is to use 
only ranges that begin from and end at either alphabetics of equal case 
([a-e], [A-E]), or digits ([0-9])"  Perl should support doing that, but 
no more, at least in the ASCII range.

Above ASCII, there may be scripts where there are ranges that might 
benefit from similar handling.  One possibility is Greek, where there is 
a tradition of viewing things as a range ("I am the alpha and the 
omega", for example).  And there is a hole in the upper case version of 
these, which Perl could exclude from matches in subsets of [Α-Ω].  But 
we run into trouble with the lowercase ones, as there are two versions 
of sigma in the middle (which are really glyph variants of each other, 
and so should not have been encoded separately in Unicode, but were for 
compatibility with earlier standards).  I think that probably the number 
of scripts where this makes sense is relatively small, so it might 
create more confusion than it's worth to take special action for just 
those.  So, I'm certainly not going to propose doing it.
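
To make the Greek case concrete, a small sketch that only inspects the Unicode data and does not reflect any special handling in Perl.  The sub names are invented for illustration, and the end points are written as code point numbers rather than literal letters so the file stays ASCII:

```perl
use strict;
use warnings;

# Code points between GREEK CAPITAL LETTER ALPHA (U+0391) and
# GREEK CAPITAL LETTER OMEGA (U+03A9) that Unicode leaves unassigned.
sub greek_upper_holes {
    return map  { sprintf 'U+%04X', ord($_) }
           grep { /\p{Cn}/ }
           map  { chr($_) } 0x0391 .. 0x03A9;
}

# Code points between GREEK SMALL LETTER ALPHA (U+03B1) and
# GREEK SMALL LETTER OMEGA (U+03C9) whose uppercase is capital sigma (U+03A3).
sub greek_lower_sigmas {
    return map  { sprintf 'U+%04X', ord($_) }
           grep { uc($_) eq "\x{03A3}" }
           map  { chr($_) } 0x03B1 .. 0x03C9;
}

print "hole in upper case: @{[ greek_upper_holes() ]}\n";
print "lower case sigmas:  @{[ greek_lower_sigmas() ]}\n";
```

This reports U+03A2 as the hole in the upper case range, and U+03C2 (final sigma) and U+03C3 as the two lowercase sigmas.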
