develooper Front page | perl.perl5.porters | Postings from March 2007

Re: perl, the data, and the tf8 flag

Juerd Waalboer
March 31, 2007 03:03
Tels wrote 2007-03-31 11:45 (+0000):
> I should have said "random binary data" not "KOI8". "KOI8" implies the data 
> is some sort of text that can be "upgraded" to utf-8.

Not if "upgrading" refers to the internal conversion Perl performs when
it goes from latin1 to utf-8, because this doesn't handle arbitrary
encodings like koi8r.

Your best bet is to treat koi8r encoded data as binary data. (And as
such, not mix it with text data.) If this is confusing, you may want to
gzip it first, and ungzip it afterwards ;)
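
A minimal sketch of the difference, assuming the Encode module and a single
made-up sample octet. Perl's internal upgrade keeps the character value it
would have under ISO-8859-1; only an explicit decode interprets the octets as
KOI8-R:

```perl
use strict;
use warnings;
use Encode qw(decode);

# One sample octet: 0xC1 is CYRILLIC SMALL LETTER A (U+0430) in KOI8-R.
my $octets = "\xC1";

# The internal upgrade only changes how the string is stored. The character
# value stays 0xC1, which is what an ISO-8859-1 reading gives.
my $upgraded = $octets;
utf8::upgrade($upgraded);

# An explicit decode is what actually reads the octets as KOI8-R.
my $text = decode('koi8-r', $octets);

printf "upgraded: U+%04X\n", ord $upgraded;   # upgraded: U+00C1
printf "decoded:  U+%04X\n", ord $text;       # decoded:  U+0430
```

Until the data has been decoded explicitly like this, keeping it away from
text strings avoids the latin1 interpretation altogether.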

> Now, you can *always* treat random binary data as, f.i., ISO-8859-1,
> upgrade it to UTF-8 and then downgrade it again, since this is a
> lossless transformation. But that doesn't mean it is a good idea:


> * speed - useless transcodings
> * memory (utf-8 needs more memory, and the transcoding, too)
> * pack/unpack or any other "peeking" at the data might leak the fact that 
> Perl suddenly converted "\xfc" to "\xc3\xbc" underneath 

Good summary. Also, if you output it to an encodingless filehandle
before downgrading it again, the value may contain characters greater
than 127, and you'll get output that you probably did not intend.
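
The lossless round trip and the "leak" from the list above can be observed
directly. This is a rough sketch that uses bytes::length as the way to peek at
the internal buffer; that choice is an assumption made for illustration, and
which operations expose the buffer differs between Perl versions:

```perl
use strict;
use warnings;
use bytes ();   # makes bytes::length available without enabling byte semantics

my $s = "\xfc";

my $before = bytes::length($s);   # 1: stored as the single octet FC
utf8::upgrade($s);
my $after  = bytes::length($s);   # 2: stored internally as C3 BC
my $chars  = length($s);          # 1: still one character to normal code

# Downgrading restores the one-octet storage; it returns true on success.
my $ok = utf8::downgrade($s);

print "internal octets: $before -> $after, characters: $chars\n";
print "round trip is lossless\n" if $ok && $s eq "\xfc";
```

The value never changes; only the storage does, which is why the cost shows
up as time and memory rather than as wrong results.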

> ** random binary data (see notes above why you do not want this treated as
> ISO-8859-1 and "text"). Basically, you never want Perl to encode/decode it,
> and any attempt at doing so should result in a warning/exception. (utf-8
> flag off)


> ** ascii 7 bit data (utf-8 flag off)

The UTF8 flag can also be off for 8 bit data. For ASCII data it will
typically be off, but it wouldn't matter if it were on. (That is, if you
treat ASCII data like text. You don't want to treat UTF8 carrying data
as binary, though, because you will want to mix binary data with other
binary data, without having it upgraded.)

> As you can see, there are four different types of data, but Perl has only
> one bit flag to distinguish them.

I'd say it has two types of data, and indeed that one bit.

With the bit on, it's unicode data that internally is encoded as UTF-8.
You're not supposed to access the UTF-8 encoded octet buffer. This
string should never be used with octet operations like vec or unpack "C"
or "n".

With the bit off, it's either unicode data that internally is encoded as
ISO-8859-1, or it is binary data. This string can safely be used for
octet operations (but of course, that doesn't make sense if the string
was intended as text, with the exception of some ancient 8bit things

> So whenever you have data without the utf-8 flag, Perl needs to decide 
> between the three cases mentioned above. 

It doesn't do that. Every UTF8less string is treated the same.
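
To illustrate, here is a small sketch using utf8::is_utf8 on three made-up
sample strings. The variable names are only for illustration; the point is
that 7-bit text, 8-bit text, and arbitrary octets all carry the same flag
state, and that flipping the flag by upgrading does not change the value:

```perl
use strict;
use warnings;

my $ascii  = "abc";        # 7-bit text
my $octets = "\xFF\xD8";   # arbitrary 8-bit data
my $latin1 = "caf\xE9";    # 8-bit text (ISO-8859-1)

# All three have the flag off; Perl keeps no further distinction between them.
my @flags = map { utf8::is_utf8($_) ? 1 : 0 } $ascii, $octets, $latin1;
print "flags: @flags\n";   # flags: 0 0 0

# Upgrading changes the storage (flag on), not the value.
my $upgraded = $latin1;
utf8::upgrade($upgraded);
print "flag on\n"     if utf8::is_utf8($upgraded);
print "still equal\n" if $upgraded eq $latin1;
```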

> This is costly (scanning for high-bit characters to distinguish between 7bit
> ascii and 8bit "something else")

I'm not aware of Perl scanning for high bit characters in UTF8less
strings, or any performance loss caused by that.

> As an author who inherited software that deals with random binary data (e.g.
> JPEGs), this deficiency concerns me.

I'm not aware of such a deficiency, and my Perl handles JPEG data just
fine as long as I don't let it touch unicode text data.
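
A hedged sketch of what that looks like in practice, assuming a temporary file
and four sample octets standing in for real image data. Reading and writing
through raw handles leaves the octets alone; concatenating them with a
character above 255 is what forces the upgrade:

```perl
use strict;
use warnings;
use bytes ();
use File::Temp qw(tempfile);

# Stand-in for image data: the start-of-image and APP0 markers of a JFIF file.
my $data = "\xFF\xD8\xFF\xE0";

# Write and read back without any encoding layer.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $data;
close $out or die "close: $!";

open my $in, '<:raw', $path or die "open: $!";
my $copy = do { local $/; <$in> };
close $in;

print "round trip ok\n" if $copy eq $data;

# Mixing the octets with unicode text upgrades the whole string.
my $mixed = $data . "\x{263A}";
printf "flag: %d -> %d\n",
    utf8::is_utf8($data)  ? 1 : 0,
    utf8::is_utf8($mixed) ? 1 : 0;             # flag: 0 -> 1
printf "characters: %d, internal octets: %d\n",
    length($mixed), bytes::length($mixed);     # characters: 5, internal octets: 11
```

The original `$data` is untouched by the concatenation; only the mixed result
is stored as UTF-8.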

Kind regards,

  juerd waalboer:  perl hacker
  convolution:     ict solutions and consultancy

I do not trust voting computers.
