Tels wrote 2007-03-31 11:45 (+0000):
> I should have said "random binary data" not "KOI8". "KOI8" implies the data
> is some sort of text that can be "upgraded" to utf-8.

Not if "upgrading" refers to the process that Perl has when it goes from
latin1 to utf-8, because this doesn't handle arbitrary encodings like
koi8r.

Your best bet is to treat koi8r encoded data as binary data. (And as such,
not mix it with text data.) If this is confusing, you may want to gzip it
first, and ungzip it afterwards ;)

> Now, you can *always* treat random binary data as, for instance, ISO-8859-1,
> upgrade it to UTF-8 and then downgrade it again, since this is a
> lossless transformation. But that doesn't mean it is a good idea

Exactly.

> * speed - useless transcodings
> * memory (utf-8 needs more memory, and the transcoding, too)
> * pack/unpack or any other "peeking" at the data might leak the fact that
>   Perl suddenly converted "\xfc" to "\xc3\xbc" underneath

Good summary. Also, if you output it to an encodingless filehandle before
downgrading it again, the value may contain characters greater than 127,
and you'll get output that you probably did not intend.

> ** random binary data (see notes above why you do not want this treated as
> ISO-8859-1 and "text"). Basically, you never want Perl to encode/decode it,
> and any attempt in doing so should result in a warning/exception. (utf-8
> flag off)

Yep.

> ** ascii 7 bit data (utf-8 flag off)

The UTF8 flag can also be off for 8 bit data. For ASCII data it will
typically be off, but it wouldn't matter if it were on. (That is, if you
treat ASCII data like text. You don't want to treat UTF8 carrying data as
binary, though, because you will want to mix binary data with other binary
data, without having it upgraded.)

> As you can see, there are four different types of data, but Perl has only
> one bit flag to distinguish them.

I'd say it has two types of data, and indeed that one bit.
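
To make that one bit concrete, here is a minimal sketch. It assumes a Perl
of 5.8.1 or later, where the utf8:: functions are available without loading
anything; the variable names are only illustrative, and bytes::length is
used purely to peek at the internal buffer size, which ordinary code should
not do.

```perl
use strict;
use warnings;
use bytes ();   # makes bytes::length callable without enabling the pragma

# A single octet, 0xFC. Nothing here says whether it is text or binary.
my $plain = "\xfc";

# The same value, with the internal buffer forced to UTF-8 (flag on).
my $upgraded = $plain;
utf8::upgrade($upgraded);

my $flag_plain        = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded     = utf8::is_utf8($upgraded) ? 1 : 0;
my $equal             = ($plain eq $upgraded)    ? 1 : 0;
my $chars_upgraded    = length $upgraded;          # logical length
my $internal_plain    = bytes::length($plain);     # buffer size, flag off
my $internal_upgraded = bytes::length($upgraded);  # buffer size, flag on

# Downgrading restores the one-octet buffer, so the round trip is lossless.
my $downgrade_ok      = utf8::downgrade($upgraded) ? 1 : 0;
my $flag_restored     = utf8::is_utf8($upgraded)   ? 1 : 0;
my $internal_restored = bytes::length($upgraded);

printf "flag: plain=%d upgraded=%d restored=%d\n",
    $flag_plain, $flag_upgraded, $flag_restored;
printf "equal=%d chars=%d\n", $equal, $chars_upgraded;
printf "internal octets: plain=%d upgraded=%d restored=%d\n",
    $internal_plain, $internal_upgraded, $internal_restored;
```

The two strings compare equal and have the same length; only the flag and
the size of the internal buffer differ, which is exactly what "peeking"
code can trip over.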

With the bit on, it's unicode data that internally is encoded as UTF-8.
You're not supposed to access the UTF-8 encoded octet buffer. This string
should never be used with octet operations like vec or unpack "C" or "n".

With the bit off, it's either unicode data that internally is encoded as
ISO-8859-1, or it is binary data. This string can safely be used for octet
operations (but of course, that doesn't make sense if the string was
intended as text, with the exception of some ancient 8bit things like
crypt()).

> So whenever you have data without the utf-8 flag, Perl needs to decide
> between the three cases mentioned above.

It doesn't do that. Every UTF8less string is treated the same.

> This is costly (scanning for high bit characters to distinguish between 7bit
> ascii and 8bit "something else")

I'm not aware of Perl scanning for high bit characters in UTF8less strings,
or any performance loss caused by that.

> As an author who inherited software that deals with random binary data (e.g.
> JPEGs), this deficiency concerns me.

I'm not aware of such a deficiency, and my Perl handles JPEG data just fine
as long as I don't let it touch unicode text data.
-- 
kind regards,  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
convolution:     ict solutions and consultancy <sales@convolution.nl>

I don't trust voting computers. See <http://www.wijvertrouwenstemcomputersniet.nl/>.