-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Moin,

On Saturday 31 March 2007 00:20:52 Marc Lehmann wrote:
> On Sat, Mar 31, 2007 at 01:39:06AM +0000, Tels <nospam-abuse@bloodgate.com> wrote:
> > My question was posed because I wanted to know how to *keep* a KOI8 (or
> > any other random binary) string in Perl without converting it to
> > Unicode. It seems to me this is not easily possible because there are
> > literally dozens of places where your KOI8 string might get suddenly
> > upgraded to UTF-8 (and thus get corrupted because Perl treats it as
> > ISO-8859-1). Or did I get this wrong?
>
> Yes, you did get that wrong, likely because Juerd wants users to care
> about that. But in fact, if you try it, nothing will get corrupted unless
> you use unpack "C" to get the first byte of your KOI8-string. Then you
> might get surprised (current perl) or an exception (Juerd's idea).

I should have said "random binary data", not "KOI8". "KOI8" implies the
data is some sort of text that can be "upgraded" to UTF-8.

Now, you can *always* treat random binary data as, for instance,
ISO-8859-1, upgrade it to UTF-8 and then downgrade it again, since this
is a lossless transformation. But that doesn't mean it is a good idea,
because of:

* speed (useless transcodings)
* memory (UTF-8 needs more memory, and so does the transcoding)
* pack/unpack or any other "peeking" at the data might leak the fact
  that Perl suddenly converted "\xfc" to "\xc3\xbc" underneath (as
  Marc's bug report showed).

So, yes, if Perl works perfectly in every place, always converting your
data on the fly whenever you look at it, you could stuff "KOI8" or any
other random binary data in, have it (maybe) converted to UTF-8, and on
output or inspection get back the exact bytes you stuffed in.

However, as you demonstrated yourself, Perl doesn't work perfectly :)

What I was trying to get at is that there are different types of data.
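The round trip and the memory cost described above can be sketched roughly as follows. This is only an illustration: the sample bytes and variable names are made up, and it uses bytes::length to look at the internal buffer instead of unpack, because what unpack shows depends on the Perl version.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

# Hypothetical sample: three arbitrary bytes, not meant as text.
my $data = "\xfc\x00\xff";
my $copy = $data;

# Switch the internal representation of the copy to UTF-8.
utf8::upgrade($copy);

# At the character level both strings still compare equal ...
my $same_chars = ($copy eq $data) ? 1 : 0;
my $flag_on    = utf8::is_utf8($copy) ? 1 : 0;

# ... but the internal buffer grew: "\xfc" is now stored as "\xc3\xbc"
# and "\xff" as "\xc3\xbf".
my $bytes_before = bytes::length($data);   # 3
my $bytes_after  = bytes::length($copy);   # 5

# Downgrading restores one byte per character and clears the flag.
utf8::downgrade($copy);
my $roundtrip_ok = ($copy eq $data && !utf8::is_utf8($copy)) ? 1 : 0;

printf "same=%d flag=%d before=%d after=%d roundtrip=%d\n",
    $same_chars, $flag_on, $bytes_before, $bytes_after, $roundtrip_ok;
```

So the transformation itself loses nothing; the cost is the extra copy and the larger buffer, plus the risk that some code looks at the upgraded bytes directly.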
Before any encoding or data examination goes on, you have:

** random binary data (see the notes above for why you do not want this
   treated as ISO-8859-1 and "text"). Basically, you never want Perl to
   encode/decode it, and any attempt to do so should result in a
   warning/exception. (utf-8 flag off)
** ASCII 7-bit data (utf-8 flag off)
** 8-bit data with an encoding (ISO-8859-1 is assumed, but the user can
   specify another encoding in a call to "decode") (utf-8 flag off)
** UTF-8 data (utf-8 flag on)

As you can see, there are four different types of data, but Perl has
only a one-bit flag to distinguish them. So whenever you have data
without the utf-8 flag, Perl needs to decide between the first three
cases above. And since it cannot store the decision "already seen 7-bit
ASCII", it needs to decide again sometime later.

This is costly (scanning for high-bit characters to distinguish between
7-bit ASCII and 8-bit "something else"), and it is overly simple,
because Perl cannot distinguish between "text data in ISO-8859-1 or
whatever encoding is in effect" and "binary data which shouldn't be
treated as text".

As an author who inherited software that deals with random binary data
(e.g. JPEGs), this deficiency concerns me. Unfortunately, I am in no
position to do anything about it except bitch on some random mailing
list :( Where is a time machine when you need one?

[snip]

> > As you said, the current warnings::encode can't decide between the case
> > of "BINARY + UTF_8" and "ISO-8859-1 + UTF_8" as Perl makes no
> > distinction between binary data and ISO-8859-1. And this missing
> > distinction is certainly a bother :)
>
> Only when you hit bugs, or unpack.

<sarcasm> and you never hit bugs, or use unpack </sarcasm> :)

All the best,

Tels

- --
 Signed on Sat Mar 31 11:28:57 2007 with key 0x93B84C15.
 View my photo gallery: http://bloodgate.com/photos
 PGP key on http://bloodgate.com/tels.asc or per email.

 "Duke Nukem Forever will come out before Unreal 2."
  -- George Broussard, 2001 (http://tinyurl.com/6m8nh)

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iQEVAwUBRg5J2ncLPEOTuEwVAQJTzwf/TH9JUUnoTOq8+sRpROPhb17oWRjLmNs4
+S+vuSldaCk0qxG6LB8NvoJW8BEX7ldz+4zTaEn0/WKi3e+v9YmWFMqblqnRLm5H
lEH7FbVCY+TAINJfVj24JJNaBtZc6ptqqYNzStuVD0T2aNutv5vIVgTdKtkgdYHM
gLuG53iqN70zqwOSnn/Acq91zC56/LvEkGRZzdBwwj+qWbC7UXLJhRtc3ZuCCI9m
DblbMiKoGzorDF7dQVeguBnyohvdCEvKqMPOvs6Wp/ZVReN/DDXhlsGh7kJ3Pjl2
9C9Nmds9KuFkmvsleXZEy5KPmGIKyJVX33llQKPj9woe0g2Iyjeaeg==
=4lLh
-----END PGP SIGNATURE-----