On 09/29/2014 12:26 PM, demerphq wrote:
> Any subset of the ranges [a-z] and [A-Z] is (and has been) specially
> handled to match on EBCDIC platforms the same equivalent characters
> it matches on ASCII platforms. Hence qr/[i-j]/i, matches [ijIJ] on
> both ASCII and EBCDIC platforms.
>
> I think this is the problem. Why does this apply to [a-z] and [A-Z]
> only? Why not to all literals?
>
> The special handling is only valid if both ends of the range are
> literals. In EBCDIC, \xC9 is 'I' and \xD1 is 'J'. If you specify
> any of [\xC9-J], [I-\xD1], or [\xC9-\xD1], you get all the code
> points C9, CA, CB, CC, CD, CE, CF, and D1. This is how it has
> worked since apparently 5.005_03, and is how I think it should
> continue to work. In other words, I think we got the design right.
>
> For ranges involving non-literals I agree. But I don't think this design
> is sane for literals.
>
> In other words, I think a rule that said that "literals in character
> classes will be interpreted according to the Unicode specification" is a
> better rule than what you described.
>
> I don't suppose we can change it now but the current rules seem
> unnecessarily confusing.

I'm not sure I understand your point here. [%] matches an ASCII percent
on an ASCII platform, and an EBCDIC percent on an EBCDIC platform. The
code is perfectly portable. All literal characters match properly on
both platforms, and would continue to do so if Perl were ever ported to
yet another platform. (The odds of that happening are infinitesimal, I
realize.)

But there are only three cases where it is obvious what should be in a
range of literals. Those are any subsets of A-Z, a-z, and 0-9. Perl
takes special action to handle those as DWIM.

The only other ASCII literal characters are punctuation and space.
There is no natural language intrinsic ordering of them, and hence
ranges with these as end points are obfuscations of what is really
happening.
Perl need not take special efforts to handle obfuscated code. I doubt
that there is anybody on this list who knows immediately what [%-{]
matches, or [|-&]. These match differently on EBCDIC than ASCII.

It would be too late to change this behavior, nor do I think it would be
desirable to do so. This from the docs you quoted is right:

    "A sound principle is to use only ranges that begin from and end at
    either alphabetics of equal case ([a-e], [A-E]), or digits ([0-9])"

Perl should support doing that, but no more, at least in the ASCII
range.

Above ASCII, there may be scripts where there are ranges that might
benefit from similar handling. One possibility is Greek, where there is
a tradition of viewing things as a range ("I am the alpha and the
omega", for example). And there is a hole in the upper case version of
these, which Perl could exclude from matches in subsets of [Α-Ω]. But we
run into trouble with the lowercase ones, as there are two versions of
sigma in the middle (which are really glyph variants of each other, and
so should not have been encoded separately in Unicode, but were for
compatibility with earlier standards).

I think that probably the number of scripts where this makes sense is
relatively small, so it might create more confusion than it's worth to
take special action for just those. So, I'm certainly not going to
propose doing it.
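
For anyone who wants to check the claim about [%-{] and [|-&] without
access to an EBCDIC machine, here is a minimal sketch that runs on an
ASCII platform. It rests on some assumptions: the Encode module's cp1047
table stands in for an EBCDIC platform's native character set,
iso-8859-1 stands in for an ASCII platform, and range_members is a
hypothetical helper written for this illustration, not anything in the
Perl core. It compares plain code point order only and does not model
the special handling of subsets of [a-z], [A-Z] and [0-9].

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper, not part of Perl: list the characters that fall
# between two literal end points when code points are ordered as in
# $encoding. It models plain code point order only.
sub range_members {
    my ($encoding, $from, $to) = @_;
    my $lo = ord(encode($encoding, $from));
    my $hi = ord(encode($encoding, $to));
    return () if $lo > $hi;    # reversed end points give no valid range
    return map { decode($encoding, chr($_)) } $lo .. $hi;
}

# Assumption: iso-8859-1 stands in for an ASCII platform and cp1047 for
# an EBCDIC one.
for my $platform (['ASCII', 'iso-8859-1'], ['EBCDIC', 'cp1047']) {
    my ($label, $encoding) = @$platform;
    for my $range (['%', '{'], ['|', '&']) {
        my @members = range_members($encoding, @$range);
        my %in = map { $_ => 1 } @members;
        # Report the size of the range and whether 'A', 'a' and '0'
        # happen to fall inside it.
        printf "%-6s [%s-%s]: %2d characters, A %s, a %s, 0 %s\n",
            $label, @$range, scalar(@members),
            map { $in{$_} ? 'in' : 'out' } qw(A a 0);
    }
}
```

Under these assumptions the two platforms disagree on both ranges. On
the ASCII side [%-{] covers the digits and both cases of letters, while
on the EBCDIC side it covers the lowercase letters but neither the
uppercase letters nor the digits. [|-&] has its end points in reverse
order on ASCII, but is a two-character range on EBCDIC.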