perl.perl5.porters | Postings from March 2007

perl, the data, and the tf8 flag

From:
Tels
Date:
March 31, 2007 02:45
Subject:
perl, the data, and the tf8 flag
Message ID:
200703311145.30860@bloodgate.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Moin,

On Saturday 31 March 2007 00:20:52 Marc Lehmann wrote:
> On Sat, Mar 31, 2007 at 01:39:06AM +0000, Tels <nospam-abuse@bloodgate.com> wrote:
> > My question was posed because I wanted to know how to *keep* a KOI8 (or
> > any other random binary) string in Perl without converting it to
> > Unicode. It seems to me this is not easily possible because there are
> > literally dozens of places where your KOI8 string might get suddenly
> > upgraded to UTF-8 (and thus get corrupted because Perl treats it as
> > ISO-8859-1). Or did I get this wrong?
>
> Yes, you did get that wrong, likely because Juerd wants users to care
> about that. But in fact, if you try it, nothing will get corrupted unless
> you use unpack "C" to get the first byte of your KOI8-string. Then you
> might get surprised (current perl) or an exception (Juerd's idea).

I should have said "random binary data" not "KOI8". "KOI8" implies the data 
is some sort of text that can be "upgraded" to utf-8.

Now, you can *always* treat random binary data as, for instance, ISO-8859-1, 
upgrade it to UTF-8 and then downgrade it again, since this is a lossless 
transformation. But that doesn't mean it is a good idea, because of:

* speed - useless transcodings
* memory (utf-8 needs more memory, and the transcoding, too)
* pack/unpack or any other "peeking" at the data might leak the fact that 
Perl suddenly converted "\xfc" to "\xc3\xbc" underneath (as Marc's bug report 
showed).
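
The round trip and its memory cost can be sketched in a few lines. This is 
only an illustration: storage_info() is a hypothetical helper name, and 
bytes::length() is used purely to peek at the size of the internal buffer.

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length() without switching on byte semantics

# Hypothetical helper: (utf-8 flag, length in characters, bytes of internal storage)
sub storage_info {
    my ($s) = @_;
    return (utf8::is_utf8($s) ? 1 : 0, length($s), bytes::length($s));
}

my $bin  = "\xfc";        # one byte of "binary" data, utf-8 flag off
my $copy = $bin;

utf8::upgrade($copy);     # same character, now stored as "\xc3\xbc" internally
my @up = storage_info($copy);

utf8::downgrade($copy);   # back to one byte per character
my @down = storage_info($copy);

printf "upgraded:   flag=%d chars=%d internal bytes=%d\n", @up;
printf "downgraded: flag=%d chars=%d internal bytes=%d\n", @down;
print  "round trip is lossless\n" if $copy eq $bin;
```

The character count never changes, only the internal storage does, which is 
why the conversion stays invisible until something looks at the raw buffer.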

So, yes, if Perl works perfectly in every place, converting your data 
on the fly whenever you look at it, you could stuff "KOI8" or any other 
random binary data in, have it (maybe) converted to utf-8, and on output 
or inspection have it converted back to the exact bytes you stuffed in.

However, as you demonstrated yourself, Perl doesn't work perfectly :)
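
As a hedged illustration of one of the places where such a silent upgrade 
can happen: appending a wide character to a byte string upgrades the whole 
buffer, and removing that character again does not undo the upgrade. The 
variable names are made up for this sketch.

```perl
use strict;
use warnings;

my $bin = "\xfc\x00\xff";     # stand-in for binary data, utf-8 flag off
my $buf = $bin;

$buf .= "\x{263a}";           # appending a wide character upgrades $buf
chop $buf;                    # the wide character is gone, the upgrade is not

my $flag_after = utf8::is_utf8($buf) ? 1 : 0;
my $same_chars = ($buf eq $bin) ? 1 : 0;   # still equal at the character level

# The second argument makes utf8::downgrade return false instead of dying
# if the string contains characters above 0xFF.
my $restored = utf8::downgrade($buf, 1) ? 1 : 0;

print "flag after chop: $flag_after\n";
print "compares equal to the original: $same_chars\n";
print "downgrade succeeded: $restored\n";
```

At the character level nothing is lost here; the trouble starts only where 
code looks at the internal bytes instead of the characters.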

What I was trying to get at is there are different types of data. Before any 
encoding or data examination goes on you have:

** random binary data (see notes above why you do not want this treated as 
ISO-8859-1 and "text"). Basically, you never want Perl to encode/decode it, 
and any attempt to do so should result in a warning/exception. (utf-8 
flag off)

** ascii 7 bit data (utf-8 flag off)

** 8bit data with an encoding (ISO-8859-1 is assumed, but the user can specify 
another encoding in a call to "decode") (utf-8 flag off)

** utf-8 data (utf-8 flag on)

As you can see, there are four different types of data, but Perl has only 
a one-bit flag to distinguish them.
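
A small sketch of the four cases, with made-up sample values, shows that the 
flag alone only separates the last case from the first three:

```perl
use strict;
use warnings;

# Sample values are illustrative only.
my @samples = (
    [ binary => "\xff\xd8\xff\xe0" ],   # typical first bytes of a JPEG/JFIF file
    [ ascii  => "hello" ],              # 7-bit text
    [ latin1 => "gr\xfc\xdf" ],         # 8-bit text, ISO-8859-1 bytes
    [ utf8   => "5 \x{20ac}" ],         # contains U+20AC, so the flag is on
);

my %flag;
for my $pair (@samples) {
    my ($kind, $data) = @$pair;
    $flag{$kind} = utf8::is_utf8($data) ? 1 : 0;
    printf "%-6s utf-8 flag %s\n", $kind, $flag{$kind} ? "on" : "off";
}
```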

So whenever you have data without the utf-8 flag, Perl needs to decide 
between the three cases mentioned above. And since it cannot store the 
decision of "already seen 7bit ASCII", it needs to do this again sometime 
later.

This is costly (scanning for high-bit characters to distinguish between 7bit 
ascii and 8bit "something else"), and it is overly simple, because Perl cannot 
distinguish between "text data in ISO-8859-1 or whatever encoding is in 
effect" and "binary data which shouldn't be treated as text".
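
The scan described above can be approximated at the Perl level as follows. 
This is only a sketch of the idea, not how the interpreter does it 
internally; has_high_bit() is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if any character is outside 7-bit ASCII.
# The regex has to walk the string, so the cost grows with its length,
# and the answer is not remembered anywhere for the next caller.
sub has_high_bit {
    my ($data) = @_;
    return $data =~ /[^\x00-\x7f]/ ? 1 : 0;
}

my $ascii  = "hello";
my $latin1 = "gr\xfc\xdf";           # text in ISO-8859-1
my $jpeg   = "\xff\xd8\xff\xe0";     # binary data

printf "ascii:  %d\n", has_high_bit($ascii);
printf "latin1: %d\n", has_high_bit($latin1);
printf "binary: %d\n", has_high_bit($jpeg);
```

The scan separates 7bit ascii from everything else, but it gives the same 
answer for 8bit text and for binary data.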

As an author who inherited software that deals with random binary data (e.g. 
JPEGs), this deficiency concerns me.

Unfortunately, I am in no position to do anything about it except bitch on 
some random mailing list :( Where is a time machine when you need one?

[snip]

> > As you said, the current warnings::encode can't decide between the case
> > of "BINARY + UTF_8" and "ISO-8859-1 + UTF_8" as Perl makes no
> > distinction between binary data and ISO-8859-1. And this missing
> > distinction is certainly a bother :)
>
> Only when you hit bugs, or unpack.

<sarcasm> and you never hit bugs, or use unpack </sarcasm> :)

All the best,

Tels

- -- 
 Signed on Sat Mar 31 11:28:57 2007 with key 0x93B84C15.
 View my photo gallery: http://bloodgate.com/photos
 PGP key on http://bloodgate.com/tels.asc or per email.

 "Duke Nukem Forever will come out before Unreal 2."

  -- George Broussard, 2001 (http://tinyurl.com/6m8nh)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iQEVAwUBRg5J2ncLPEOTuEwVAQJTzwf/TH9JUUnoTOq8+sRpROPhb17oWRjLmNs4
+S+vuSldaCk0qxG6LB8NvoJW8BEX7ldz+4zTaEn0/WKi3e+v9YmWFMqblqnRLm5H
lEH7FbVCY+TAINJfVj24JJNaBtZc6ptqqYNzStuVD0T2aNutv5vIVgTdKtkgdYHM
gLuG53iqN70zqwOSnn/Acq91zC56/LvEkGRZzdBwwj+qWbC7UXLJhRtc3ZuCCI9m
DblbMiKoGzorDF7dQVeguBnyohvdCEvKqMPOvs6Wp/ZVReN/DDXhlsGh7kJ3Pjl2
9C9Nmds9KuFkmvsleXZEy5KPmGIKyJVX33llQKPj9woe0g2Iyjeaeg==
=4lLh
-----END PGP SIGNATURE-----
