Re: [perl #122853] Guarantee 0-9, A-Z, a-z character classes

October 2, 2014 07:30
On 2 October 2014 05:41, Karl Williamson <> wrote:

> On 09/29/2014 12:26 PM, demerphq wrote:
>>     Any subset of the ranges [a-z] and [A-Z] is (and has been) specially
>>     handled to match on EBCDIC platforms the same equivalent characters
>>     it matches on ASCII platforms.  Hence qr/[i-j]/i, matches [ijIJ] on
>>     both ASCII and EBCDIC platforms.
>> I think this is the problem. Why does this apply to [a-z] and [A-Z]
>> only? Why not to all literals?
>>     The special handling is only valid if both ends of the range are
>>     literals.  In EBCDIC, \xC9 is 'I' and \xD1 is 'J'.  If you specify
>>     any of [\xC9-J], [I-\xD1] , or [\xC9-\xD1], you get all the code
>>     points C9, CA, CB, CC, CD, CE, CF, D0, and D1.  This is how it has
>>     worked since apparently 5.005_03, and is how I think it should
>>     continue to work.  In other words, I think we got the design right.
>> For ranges involving non-literals I agree. But I don't think this design
>> is sane for literals.
>> In other words, I think a rule that said that "literals in character
>> classes will be interpreted according to the Unicode specification" is a
>> better rule than what you described.
>> I don't suppose we can change it now but the current rules seem
>> unnecessarily confusing.
> I'm not sure I understand your point here.  [%] matches an ASCII percent
> on an ASCII platform, and an EBCDIC percent on an EBCDIC platform.  The
> code is perfectly portable.  All literal characters match properly on both
> platforms, and would continue to do so if Perl were ever ported to yet
> another platform.  (The odds of that happening are infinitesimal, I
> realize.)
> But there are only three cases where it is obvious what should be in a
> range of literals.  Those are any subsets of A-Z, a-z, and 0-9.  Perl takes
> special action to handle those as DWIM.
> The only other ASCII literal characters are punctuation and space. There
> is no natural language intrinsic ordering of them, and hence ranges with
> these as end points are obfuscations of what is really happening.
Whether or not they are an obfuscation is a personal aesthetic opinion. And
since there are many natural language orderings of the characters in A-Z, I
don't feel you are on particularly firm ground suggesting there is something
intrinsically more sensible about A-Z than %-{.

> Perl need not take special efforts to handle obfuscated code.

I think this is a terrible justification for the language not being well
defined.

I mean, this case is rather different from "The CPU does math in a
different endianness than your code expects" type undefined behaviour that
cannot be avoided.  With character class ranges the damage is
self-inflicted. I think that is sad and unnecessary.

> I doubt that there is anybody on this list who knows immediately what
> [%-{] matches, or [|-&].

I don't think it is relevant whether people know offhand how many characters
are in the Unicode character set [%-{]. The point is that once you have
looked it up you should be able to rely on it everywhere Perl runs. And if
you took this kind of argument to the extreme it would lead to seriously
bizarre consequences.
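
As a rough illustration of "looking it up", the sketch below enumerates
[%-{] in Unicode code point order and checks what a regex engine does with
[|-&] under the same ordering. It uses Python purely as a stand-in for
Perl on an ASCII platform; the helper name unicode_range is hypothetical.

```python
import re

def unicode_range(lo, hi):
    """Characters from lo to hi inclusive, ordered by Unicode code point."""
    return [chr(cp) for cp in range(ord(lo), ord(hi) + 1)]

# Enumerate the range instead of trying to recall it.
percent_to_brace = unicode_range("%", "{")
print(len(percent_to_brace), "".join(percent_to_brace))

# Cross-check against the regex engine over the 7-bit ASCII repertoire.
pattern = re.compile(r"[%-{]")
matched = [chr(cp) for cp in range(128) if pattern.fullmatch(chr(cp))]
engine_agrees = matched == percent_to_brace
print(engine_agrees)

# '|' sorts after '&' in this ordering, so the range is reversed and the
# pattern is rejected at compile time rather than matching anything.
try:
    re.compile(r"[|-&]")
    reversed_range_accepted = True
except re.error:
    reversed_range_accepted = False
print(reversed_range_accepted)
```

Once enumerated, the set is fixed on any platform that orders the end
points the same way, which is the property being asked for here.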

Heck, I'm not sure that many people could tell you how many characters there
are between "P" and "W" off the top of their head, and I bet a lot of
people from non-English backgrounds would *disagree* on the subject.

IOW, I think the position you take differentiating between A-Z and %-{ is
rooted in the fact that you and ASCII share a common cultural background.
If you were Icelandic you would expect to find "á" after "a", but ASCII
doesn't do that. In fact strictly speaking ASCII can't even represent "á".

So I think you are manufacturing a distinction between A-Z and %-{ that is
not really there, and to the extent that it does exist, is culturally

I think that is a pretty terrible basis to decide that one part of a regex
pattern is well defined and others are not.

> These match differently on EBCDIC than ASCII.

Yes, well that is the problem right? They are only poorly defined *because*
they are different on EBCDIC and ASCII.
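
To make the difference concrete without an EBCDIC machine, here is a
minimal sketch that expands the same two end points under each native
ordering. It assumes Python's "cp037" codec is an acceptable stand-in for
the EBCDIC code page in question (other EBCDIC code pages differ in
places); the function names are hypothetical.

```python
CODEC = "cp037"  # assumption: one common EBCDIC code page

def ascii_order(lo, hi):
    """Set of characters between lo and hi in ASCII/Unicode order."""
    return {chr(cp) for cp in range(ord(lo), ord(hi) + 1)}

def ebcdic_order(lo, hi):
    """Set of characters between lo and hi in native EBCDIC byte order."""
    first, last = lo.encode(CODEC)[0], hi.encode(CODEC)[0]
    return {bytes([b]).decode(CODEC) for b in range(first, last + 1)}

on_ascii = ascii_order("%", "{")
on_ebcdic = ebcdic_order("%", "{")

print(len(on_ascii), len(on_ebcdic))
print("only on ASCII: ", "".join(sorted(on_ascii - on_ebcdic)))
print("only on EBCDIC:", "".join(sorted(c for c in on_ebcdic - on_ascii
                                        if c.isprintable())))

# The other example from the quoted text: empty (reversed) in ASCII order,
# but a valid two-character range in this EBCDIC ordering.
print(sorted(ascii_order("|", "&")), sorted(ebcdic_order("|", "&")))
```

Under these assumptions the digits and upper-case letters fall inside the
range only in ASCII order, while characters such as '#' fall inside it only
in EBCDIC order.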

> It would be too late to change this behavior, nor do I think it would be
> desirable to do so.
Yes, I suspect you are right. Sadly.

On the other hand what would we do if we targeted a different platform that
also used a different native character set? IMO we would be *nuts* to
repeat this design decision for said hypothetical platform.

> This from the docs you quoted is right: "A sound principle is to use only
> ranges that begin from and end at either alphabetics of equal case ([a-e],
> [A-E]), or digits ([0-9])"  Perl should support doing that, but no more, at
> least in the ASCII range.

In an ideal world we would delete that sentence and replace it with
"character class ranges composed of literals are always interpreted
according to the Unicode standard, so [%-{] will always match 87 characters
regardless of native encoding, although the actual codepoints matched may
differ from Unicode where appropriate".
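
A minimal sketch of that rule, again using Python's "cp037" codec as an
assumed stand-in for the native EBCDIC encoding: the literal range is
expanded in Unicode order first, and only then is each member translated to
its native code point. This illustrates the proposal only, not how the regex
engine is implemented; both function names are hypothetical.

```python
CODEC = "cp037"  # assumption: stand-in for the native EBCDIC code page

def literal_range_native(lo, hi):
    """Proposed rule: expand lo-hi by Unicode code point, then map each
    member to its native byte value."""
    members = [chr(cp) for cp in range(ord(lo), ord(hi) + 1)]
    return sorted(ch.encode(CODEC)[0] for ch in members)

def native_range(lo, hi):
    """Contrast: treat the end points as native bytes and take everything
    in between, whatever characters those bytes happen to be."""
    first, last = lo.encode(CODEC)[0], hi.encode(CODEC)[0]
    return list(range(first, last + 1))

# [I-J]: two native code points under the proposed rule, nine if the
# native byte range is taken literally.
print([hex(b) for b in literal_range_native("I", "J")])
print([hex(b) for b in native_range("I", "J")])

# [%-{]: the same number of members as on ASCII, at different code points.
percent_to_brace_native = literal_range_native("%", "{")
print(len(percent_to_brace_native))
```

The point of the design is that the count and membership come from the
pattern as written, and only the final byte values depend on the platform.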

IOW, the problem here is that when we ported the regex engine to EBCDIC we
did not properly separate out "code points in the pattern as expressed as
literals" and "native representation of those code points".  Which I
suppose is natural given our EBCDIC port predates Unicode, but it is still

I do not think we should have any platform specific behaviour other than
that which is forced upon us.

And I do not think it is good that a *scripting* language like Perl has
portability issues which are not forced upon us.


perl -Mre=debug -e "/just|another|perl|hacker/"
