develooper Front page | perl.perl5.porters | Postings from October 2014

Re: [perl #122853] Guarantee 0-9, A-Z, a-z character classes

From: Karl Williamson
Date: October 30, 2014 04:43
Subject: Re: [perl #122853] Guarantee 0-9, A-Z, a-z character classes
Message ID: 5451C1F8.3010807@khwilliamson.com

On 10/02/2014 01:30 AM, demerphq wrote:
> On 2 October 2014 05:41, Karl Williamson <public@khwilliamson.com
> <mailto:public@khwilliamson.com>> wrote:
>
>     On 09/29/2014 12:26 PM, demerphq wrote:
>
>              Any subset of the ranges [a-z] and [A-Z] is (and has been)
>         specially
>              handled to match on EBCDIC platforms the same equivalent
>         characters
>              it matches on ASCII platforms.  Hence qr/[i-j]/i, matches
>         [ijIJ] on
>              both ASCII and EBCDIC platforms.
>
>
>         I think this is the problem. Why does this apply to [a-z] and [A-Z]
>         only? Why not to all literals?
>
>              The special handling is only valid if both ends of the
>         range are
>              literals.  In EBCDIC, \xC9 is 'I' and \xD1 is 'J'.  If you
>         specify
>              any of [\xC9-J], [I-\xD1] , or [\xC9-\xD1], you get all the
>         code
>              points C9, CA, CB, CC, CD, CE, CF, D0, and D1.  This is how it has
>              worked since apparently 5.005_03, and is how I think it should
>              continue to work.  In other words, I think we got the
>         design right.
>
>
>         For ranges involving non-literals I agree. But I don't think
>         this design
>         is sane for literals.
>
>         In other words, I think a rule that said that "literals in character
>         classes will be interpreted according to the Unicode
>         specification" is a
>         better rule than what you described.
>
>         I don't suppose we can change it now but the current rules seem
>         unnecessarily confusing.
>
>
>     I'm not sure I understand your point here.  [%] matches an ASCII
>     percent on an ASCII platform, and an EBCDIC percent on an EBCDIC
>     platform.  The code is perfectly portable.  All literal characters
>     match properly on both platforms, and would continue to do so if
>     Perl were ever ported to yet another platform.  (The odds of that
>     happening are infinitesimal, I realize.)
>
>     But there are only three cases where it is obvious what should be in
>     a range of literals.  Those are any subsets of A-Z, a-z, and 0-9.
>     Perl takes special action to handle those as DWIM.
>
>     The only other ASCII literal characters are punctuation and space.
>     There is no natural language intrinsic ordering of them, and hence
>     ranges with these as end points are obfuscations of what is really
>     happening.
>
>
> Whether or not they are an obfuscation is a personal aesthetic opinion.
> And since there are many natural language orderings of characters in A-Z
> I don't feel you are on particularly firm ground suggesting there is
> something intrinsically more sensible about A-Z than %-{.
>
>     Perl need not take special efforts to handle obfuscated code.
>
>
> I think this is a terrible justification for the language not being well
> defined.
>
> I mean, this case is rather different from "The CPU does math in a
> different endianness than your code expects" type undefined behaviour
> that cannot be avoided.  With character class ranges the damage is self
> inflicted. I think that is sad and unnecessary.
>
>     I doubt that there is anybody on this list who knows immediately
>     what [%-{] matches, or [|-&].
>
>
> I don't think whether people offhand know how many characters are in the
> Unicode character set [%-{] is relevant. The point is that once you
> looked it up you should be able to rely on it everywhere Perl runs. And
> if you took this kind of argument to the extreme it would lead to
> seriously bizarre consequences.
>
> Heck, I'm not sure that many people could tell you how many characters
> there are between "P" and "W" off the top of their head, and I bet a lot
> of people from non-english backgrounds would *disagree* on the subject.
>
> IOW, I think the position you take differentiating between A-Z and %-{
> is rooted in the fact that you and ASCII share a common cultural
> background. If you were Icelandic you would expect to find "á" after
> "a", but ASCII doesn't do that. In fact strictly speaking ASCII can't
> even represent "á".
>
> So I think you are manufacturing a distinction between A-Z and %-{ that
> is not really there, and to the extent that it does exist, is culturally
> specific.
>
> I think that is a pretty terrible basis to decide that one part of a
> regex pattern is well defined and others are not.
>
>     These match differently on EBCDIC than ASCII.
>
>
> Yes, well that is the problem right? They are only poorly defined
> *because* they are different on EBCDIC and ASCII.
>
>     It would be too late to change this behavior, nor do I think it
>     would be desirable to do so.
>
>
> Yes, I suspect you are right. Sadly.
>
> On the other hand what would we do if we targeted a different platform
> that also used a different native character set? IMO we would be *nuts*
> to repeat this design decision for said hypothetical platform.
>
>     This from the docs you quoted is right: "A sound principle is to use
>     only ranges that begin from and end at either alphabetics of equal
>     case ([a-e], [A-E]), or digits ([0-9])"  Perl should support doing
>     that, but no more, at least in the ASCII range.
>
>
> In an ideal world we would delete that sentence and replace it with
> "character class ranges composed of literals are always interpreted
> according to the Unicode standard, so [%-{] will always match 87
> characters regardless of native encoding, although the actual codepoints
> matched may differ from Unicode where appropriate".
>
> IOW, the problem here is that when we ported the regex engine to EBCDIC
> we did not properly separate out "code points in the pattern as
> expressed as literals" and "native representation of those code
> points".  Which I suppose is natural given our EBCDIC port predates
> Unicode, but it is still unfortunate.
>
> I do not think we should have any platform specific behaviour other than
> that which is forced upon us.
>
> And I do not think it is good that a *scripting* language like Perl has
> portability issues which are not forced upon us.
>
> Yves
> --
> perl -Mre=debug -e "/just|another|perl|hacker/"

I agree that it would be nice to be able to portably specify ranges. 
But before I get to that, I have a couple of points to make, moot as 
they might be.

If one has to look up what's exactly in a range when coding, then that 
person is unfairly burdening whoever might take up the maintenance of 
that code in the future.

You may very well be right about my cultural bias about what's in A-Z. 
I've tried to imagine what I would think if my first language had had 
other characters, but I can't really.

But your idealized solution effectively says to people on EBCDIC that 
they have to use a foreign character set, and that is just as 
chauvinistic as my A-Z bias.  There are people who code solely on and 
for EBCDIC, and Perl should accommodate their native way of thinking. 
So \x04 has to mean the character whose code point is natively 4 on 
whatever platform the code is being run on.  If you want to specify the 
character whose *Unicode* code point is 4, you can use \N{U+04}.
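
A minimal sketch of that distinction, assuming it is run on an ASCII platform, where the two escapes happen to coincide. The variable names are illustrative only; on EBCDIC the two values would be expected to differ.

```perl
use strict;
use warnings;

# \x04 names the character whose *native* code point is 4.
my $native = "\x04";

# \N{U+04} names the character whose *Unicode* code point is 4,
# whatever its native code point turns out to be.
my $unicode = "\N{U+04}";

# On ASCII the two are the same character; on EBCDIC they would not be.
my $same = $native eq $unicode ? 1 : 0;

printf "native ord: 0x%02X, unicode ord: 0x%02X, same: %d\n",
    ord($native), ord($unicode), $same;
```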

But then what about this range?

	[\N{U+04}-\N{U+09}]

It seems obvious to me that what the coder meant is

	[\N{U+04}\N{U+05}\N{U+06}\N{U+07}\N{U+08}\N{U+09}]

But on EBCDIC it currently doesn't mean that; it is an error because 
\N{U+04} is 0x37 and \N{U+09} is 0x05, so we have a range whose first 
value is larger than the second value, which is not allowed.  I think 
this is a bug, and I propose to fix it.  The fix is not hard.  The 
paradigm is that a range in any platform which is specified in terms of 
Unicode end-points should follow Unicode rules.  That gives portability 
across all platforms.
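
One way to picture the proposed rule is sketched below. It is not the actual regex compiler change, just a model of it: the helper name unicode_range_members is hypothetical, and the two EBCDIC values are the ones quoted above. The sketch enumerates the range in Unicode order and only then translates each member to its native code point, using the core function utf8::unicode_to_native (an identity mapping on ASCII platforms).

```perl
use strict;
use warnings;

# End points as given in the message: on EBCDIC, U+04 is native 0x37
# and U+09 is native 0x05.
my %ebcdic_native = (0x04 => 0x37, 0x09 => 0x05);

# Read natively, the range would run from 0x37 down to 0x05: inverted.
my $inverted = $ebcdic_native{0x04} > $ebcdic_native{0x09} ? 1 : 0;

# Hypothetical helper modelling the proposed rule: walk the range in
# Unicode order, then map each member to the platform's native value.
sub unicode_range_members {
    my ($first, $last) = @_;
    return map { utf8::unicode_to_native($_) } $first .. $last;
}

my @members = unicode_range_members(0x04, 0x09);

# Build an explicit class from the native values; \x{...} is native.
my $class = join '', map { sprintf '\\x{%X}', $_ } @members;
my $re    = qr/[$class]/;

print "inverted on EBCDIC: $inverted\n";
print "native members here: @members\n";
```

Because the members are listed individually, the resulting class never depends on the native ordering of the end points.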

By extension, I think that using the Unicode name syntax should act 
identically to the U+ syntax.  The above range could be specified using 
that syntax as

	[\N{EOT}-\N{HT}]

and should include EOT (4 on ASCII), HT (9 on ASCII) plus U+05..U+08
(ENQ, ACK, BEL and BS; 5, 6, 7, 8 respectively in ASCII).

So, by specifying a range in Unicode terminology, one could get the 
portability Yves wants.  [\N{PERCENT SIGN}-\N{LEFT CURLY BRACKET}] would 
match the same characters on all platforms that [%-{] does on ASCII.
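
A small check of that equivalence, assuming an ASCII platform (where both spellings should cover native 0x25 through 0x7B). It only demonstrates that the two classes agree there; it says nothing about what an EBCDIC build does today.

```perl
use strict;
use warnings;
use charnames ':full';

# The literal range and the same range spelled with Unicode names.
my $literal = qr/\A[%-{]\z/;
my $named   = qr/\A[\N{PERCENT SIGN}-\N{LEFT CURLY BRACKET}]\z/;

# Collect every single-byte code point each class accepts.
my @by_literal = grep { chr($_) =~ $literal } 0 .. 255;
my @by_named   = grep { chr($_) =~ $named }   0 .. 255;

my $agree = "@by_literal" eq "@by_named" ? 1 : 0;

printf "literal matches %d code points, named matches %d, agree: %d\n",
    scalar(@by_literal), scalar(@by_named), $agree;
```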

The remaining question I have is what happens if only one end of the 
range is a Unicode construct?

	[\N{U+04}-\x{09}]
	[\x{04}-\N{U+09}]

I think this should be deprecated, and in the meantime, the non-Unicode 
endpoint should be treated as a Unicode value.   There are no such 
usages currently in CPAN.  In fact, there are only 2 modules that use 
\N{} in ranges, and both look to be wanting the behavior I'm proposing here.

http://grep.cpan.me/?q=\[.*\\N{[^}]*}-+-file%3A%22\.pod%24%22
http://grep.cpan.me/?q=-\\N{[^}]*}+-file%3A%22\.pod%24%22
