perl, the data, and the utf8 flag
March 31, 2007 02:45
Message ID: firstname.lastname@example.org
On Saturday 31 March 2007 00:20:52 Marc Lehmann wrote:
> On Sat, Mar 31, 2007 at 01:39:06AM +0000, Tels wrote:
> > My question was posed because I wanted to know how to *keep* a KOI8 (or
> > any other random binary) string in Perl without converting it to
> > Unicode. It seems to me this is not easily possible because there are
> > literally dozens of places where your KOI8 string might get suddenly
> > upgraded to UTF-8 (and thus get corrupted because Perl treats it as
> > ISO-8859-1). Or did I get this wrong?
> Yes, you did get that wrong, likely because Juerd wants users to care
> about that. But in fact, if you try it, nothing will get corrupted unless
> you use unpack "C" to get the first byte of your KOI8-string. Then you
> might get surprised (current perl) or an exception (Juerd's idea).
I should have said "random binary data" not "KOI8". "KOI8" implies the data
is some sort of text that can be "upgraded" to utf-8.
Now, you can *always* treat random binary data as, e.g., ISO-8859-1, upgrade it
to UTF-8 and then downgrade it again, since this is a lossless
transformation. But that doesn't mean it is a good idea because:
* speed - useless transcodings
* memory (utf-8 needs more memory, and the transcoding, too)
* pack/unpack or any other "peeking" at the data might leak the fact that
Perl suddenly converted "\xfc" to "\xc3\xbc" underneath (as Marc's bug report shows)
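
A minimal sketch of the memory and "peeking" points, assuming a reasonably modern Perl 5 with the core `bytes` module. The variable names are made up for illustration, and `bytes::length` / `bytes::ord` are debugging helpers that look at the internal buffer rather than at the characters:

```perl
use strict;
use warnings;
use bytes ();    # load the bytes:: helpers without enabling byte semantics

my $data = "\xfc\x00\xff";    # three arbitrary bytes, UTF8 flag off
my $copy = $data;

utf8::upgrade($copy);         # same characters, internal storage is now UTF-8

my $chars_after    = length $copy;             # character count is unchanged
my $internal_after = bytes::length($copy);     # size of the internal buffer
my $first_internal = bytes::ord($copy);        # peek at the first internal byte
my $still_equal    = $copy eq $data ? 1 : 0;   # string comparison sees no change

printf "chars=%d internal=%d first=0x%02x equal=%d\n",
    $chars_after, $internal_after, $first_internal, $still_equal;

utf8::downgrade($copy);       # the way back, possible while all chars <= 0xff
printf "after downgrade: internal=%d flag=%s\n",
    bytes::length($copy), utf8::is_utf8($copy) ? "on" : "off";
```

At the character level nothing changed, yet the buffer grew and anything that reads the raw buffer sees "\xc3" where "\xfc" was stored.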
So, yes, if Perl works perfectly in every place, converting your data
on the fly whenever you look at it, you could stuff "KOI8" or any other
random binary data in, have it (maybe) converted to utf-8, and on
output or inspection have it converted back to the exact bytes you stuffed in.
However, as you demonstrated yourself, Perl doesn't work perfectly :)
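
To make the "stuff bytes in, get the exact bytes back" scenario concrete, here is a hedged sketch of one way a binary string can end up upgraded without anyone asking for it (concatenation with a string that carries the flag), and of the explicit way back. The sample data and variable names are assumptions for illustration, not taken from the thread:

```perl
use strict;
use warnings;

my $bytes  = "\xff\xd8\xff\xe0";    # binary data, UTF8 flag off
my $marker = "\x{263a}";            # a character above 0xff, so its flag is on

my $joined = $bytes . $marker;      # concatenation upgrades the whole result
my $head   = substr($joined, 0, length $bytes);    # our four "bytes" again

my $flag_before = utf8::is_utf8($bytes) ? 1 : 0;
my $flag_after  = utf8::is_utf8($head)  ? 1 : 0;
my $same_chars  = $head eq $bytes ? 1 : 0;

# utf8::downgrade returns true on success; by default it dies if the string
# holds a character above 0xff.
my $restored = utf8::downgrade($head) ? 1 : 0;

print "flag before=$flag_before after=$flag_after ",
      "same=$same_chars restored=$restored\n";
```

As long as every consumer works on characters, the round trip is invisible; the trouble described in this thread starts when some code path looks at the internal representation instead.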
What I was trying to get at is there are different types of data. Before any
encoding or data examination goes on you have:
** random binary data (see notes above why you do not want this treated as
ISO-8859-1 and "text"). Basically, you never want Perl to encode/decode it,
and any attempt to do so should result in a warning/exception. (utf-8
flag off)
** 7-bit ASCII data (utf-8 flag off)
** 8bit data with an encoding (assumed is ISO-8859-1, but user can specify
other types of encoding during a call to "decode") (utf-8 flag off)
** utf-8 data (utf-8 flag on)
As you can see, there are four different types of data, but Perl has only
one bit flag to distinguish them.
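
A small sketch of that point, using one made-up sample string per category (the hash keys and sample values are illustrative assumptions). `utf8::is_utf8` reports the internal flag, and three of the four kinds look identical to it:

```perl
use strict;
use warnings;

my %sample = (
    binary => "\xff\xd8\xff\xe0",          # e.g. the start of a JPEG file
    ascii  => "hello",                     # 7-bit only
    latin1 => "gr\xfc\xdf",                # ISO-8859-1 encoded text
    utf8   => "gr\x{fc}\x{df}\x{263a}",    # holds a character above 0xff
);

my %flag;
for my $kind (sort keys %sample) {
    $flag{$kind} = utf8::is_utf8($sample{$kind}) ? "on" : "off";
    printf "%-6s UTF8 flag %s\n", $kind, $flag{$kind};
}
```

Only the last string has the flag set; nothing on the other three records whether they are binary, ASCII, or encoded text.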
So whenever you have data without the utf-8 flag, Perl needs to decide
between the three cases mentioned above. And since it cannot store the
decision of "already seen 7bit ASCII", it needs to do this again sometime.
This is costly (scanning for high-bit characters to distinguish between 7-bit
ASCII and 8-bit "something else"), and it is overly simple, because Perl cannot
distinguish between "text data in ISO-8859-1 or whatever encoding is in
effect" and "binary data which shouldn't be treated as text".
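
As a rough illustration of that scan, here is a sketch with a hypothetical helper, `has_high_bit` (not a Perl built-in). It separates 7-bit ASCII from everything else, but gives the same answer for binary data and for ISO-8859-1 text:

```perl
use strict;
use warnings;

# Hypothetical helper: true if any character lies outside the 7-bit range.
# The regex has to walk the string until it finds one, so for pure ASCII
# input the cost grows with the length of the string.
sub has_high_bit {
    my ($string) = @_;
    return $string =~ /[^\x00-\x7f]/ ? 1 : 0;
}

my $ascii     = "hello";
my $latin1    = "gr\xfc\xdf";          # ISO-8859-1 encoded text
my $jpeg_head = "\xff\xd8\xff\xe0";    # binary data

printf "ascii=%d latin1=%d binary=%d\n",
    has_high_bit($ascii), has_high_bit($latin1), has_high_bit($jpeg_head);
```

The scan cannot tell the last two apart, which is exactly the missing distinction between "text in some 8-bit encoding" and "not text at all".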
As an author who inherited software that deals with random binary data (e.g.
JPEGs), this deficiency concerns me.
Unfortunately, I am in no position to do anything about it except bitch on
some random mailing list :( Where is a time machine when you need one?
> > As you said, the current warnings::encode can't decide between the case
> > of "BINARY + UTF_8" and "ISO-8859-1 + UTF_8" as Perl makes no
> > distinction between binary data and ISO-8859-1. And this missing
> > distinction is certainly a bother :)
> Only when you hit bugs, or unpack.
<sarcasm> and you never hit bugs, or use unpack </sarcasm> :)
All the best,
Signed on Sat Mar 31 11:28:57 2007 with key 0x93B84C15.
View my photo gallery: http://bloodgate.com/photos
PGP key on http://bloodgate.com/tels.asc or per email.
"Duke Nukem Forever will come out before Unreal 2."
-- George Broussard, 2001 (http://tinyurl.com/6m8nh)