On Tue, Feb 06, 2007 at 01:40:43PM -0500, mark@mark.mielke.cc wrote:
> On Tue, Feb 06, 2007 at 06:27:59PM +0100, Gerard Goossen wrote:
> > > This is not a matter of context, by the way. Instead, the value "\xFF"
> > > is polymorphic. It's both a unicode string representing code point
> > > U+00FF, and the single byte 0xFF.
> >
> > No. \xFF creates a character represented by FF according to the native
> > encoding. If your native encoding is EBCDIC this does NOT correspond to
> > U+00FF (instead it corresponds to U+007E or U+009F, depending on the
> > flavor of EBCDIC you're on).
> >
> > You assume (like everybody else) that in the native encoding a
> > character corresponds to a byte with the same numeric value.
> > This assumption is what makes the transition to UTF-8 so difficult,
> > because in the UTF-8 encoding, the assumption is NOT correct.
>
> I think you are saying that UTF-EBCDIC should be the internal
> representation for strings in Perl on EBCDIC platforms if any character
> in the string has a value >= 0x80.

I would suggest making UTF-EBCDIC the representation in Perl7 on EBCDIC
platforms, regardless of what is in the string.

> If this is what you are saying, then I can see why I, and other people,
> cannot understand you. We're not on the same page. I don't believe
> UTF-EBCDIC makes sense, as UTF-EBCDIC is not an encoding of UNICODE.
> It is an encoding of a mix between EBCDIC/UNICODE. Although UTF-8
> is only an encoding scheme, most people assume that the internal
> representation for a language that claims to support UNICODE should
> be UNICODE, and therefore the UTF-8 should be encoding UNICODE code
> points. Not EBCDIC/UNICODE code points.
>
> Perhaps this would represent a performance degradation for systems
> that use EBCDIC natively? Is this why you would focus on UTF-EBCDIC?

UTF-EBCDIC is an encoding of UNICODE, but a strange one in the sense
that for code points below 0x80 the bytes do NOT correspond to the code
points. The bytes do, however, correspond to the native EBCDIC encoding:
code point U+0041 (ASCII 'A') is encoded in UTF-EBCDIC as the byte 0xC1,
and in the native EBCDIC encoding 0xC1 is also an 'A'. So although the
code points are not the same in UNICODE and EBCDIC, with UTF-EBCDIC the
bytes are.

Like you, my initial idea was also to use UTF-8 on EBCDIC platforms, but
SADAHIRO pointed out that on an EBCDIC platform '\n' in C is a valid LF
in UTF-EBCDIC, but not in UTF-8.

> Anyways - I've not shared people's opinions that Perl's implementation
> of UNICODE or UTF-8 is excellent. I've avoided it wherever possible.
> I prefer Java's approach or GTK's approach. Java uses UTF-16 internal
> representation, but never confuses internal representation with
> external representation. If portability is a concern, this seems
> an excellent approach.

Having a strict separation is certainly a valid approach, but I think it
is much more in the style of Perl to have a transparent conversion from
one type of scalar to another, be it a number, a text-string or a
byte-string.
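
Coming back to the UTF-EBCDIC point for a moment: you can see the byte
correspondence even from an ASCII platform. As far as I know Encode does
not ship a UTF-EBCDIC encoding, but its single-byte cp1047 EBCDIC code
page agrees with UTF-EBCDIC for these characters, so as an untested
sketch:

    use Encode qw(encode);

    my $char   = "A";                      # code point U+0041
    my $ebcdic = encode('cp1047', $char);  # the EBCDIC byte for 'A'
    printf "U+%04X -> 0x%02X\n", ord($char), ord($ebcdic);
    # prints: U+0041 -> 0xC1

With UTF-8 the same 'A' would of course stay 0x41, which is not an 'A'
to any native EBCDIC program.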
A transparent conversion is also how things used to work when there was
only latin1, where the byte representation would be identical to the
numeric value of the character, leading to people using \xFF to create
the byte 0xFF.

Back to why I think a transparent conversion would be convenient. For
example, when doing:

    use Socket;
    use IO::Handle;

    sub identity {
        my $string = shift;
        my ($sock1, $sock2);
        # a connected pair of sockets within this one process
        socketpair($sock1, $sock2, AF_UNIX, SOCK_STREAM, PF_UNSPEC) or die;
        $sock1->print($string);    # write the string to one end ...
        $sock1->close;
        local $/ = undef;          # slurp mode
        return $sock2->getline();  # ... and read it all back from the other
    }

I don't care whether $string is a text-string or a byte-string, I just
want it to return the same string. The conversion should not be a
problem as long as there is one encoding which is always used (of course
the text-string must be able to be encoded using this encoding). The
problem with the current Perl 5 is that it uses two encodings, latin1
and UTF-8. The above identity holds in Perl 5, for both text-strings and
byte-strings, as long as you don't use any unicode characters, which
leads to people avoiding unicode :-(

Gerard Goossen
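
PS: to be concrete about what "returns the same string" means, here is a
sketch of how I would want to be able to use identity(). The text-string
case is what breaks in current Perl 5: print warns "Wide character in
print" and the UTF-8 bytes come back undecoded.

    my $bytes = "\xFF";                 # byte-string
    my $text  = "caf\x{E9} \x{263A}";   # text-string with a character above 0xFF

    print "bytes ok\n" if identity($bytes) eq $bytes;   # works today
    print "text ok\n"  if identity($text)  eq $text;    # fails today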