develooper Front page | perl.perl5.porters | Postings from March 2007

Re: perl, the data, and the tf8 flag

March 31, 2007 03:40


On Saturday 31 March 2007 10:03:12 Juerd Waalboer wrote:
> Tels skribis 2007-03-31 11:45 (+0000):
> > I should have said "random binary data" not "KOI8". "KOI8" implies the
> > data is some sort of text that can be "upgraded" to utf-8.
> Not if "upgrading" refers to the process that Perl has when it goes from
> latin1 to utf-8, because this doesn't handle arbitrary encodings like
> koi8r.
> Your best bet is to treat koi8r encoded data as binary data. (And as
> such, not mix it with text data.) 

The "do not mix it" part is where I am currently having problems. 
As far as I can see, there is nothing in Perl that prevents this from 
happening, nor can I enable a warning when it happens. All you get, at 
some point, is corrupted data or very inefficient code (since Perl internally 
uses UTF-8 where it could just use the raw bytes).
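
To make the complaint concrete, here is a minimal sketch of the silent 
mixing (the variable names and sample values are made up for illustration; 
bytes::length is used only to peek at the internal buffer size):

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length without enabling byte semantics

# Illustrative data only: three octets of "binary" and one wide character.
my $binary = "\xfc\x00\xff";   # UTF8 flag off
my $text   = "\x{263A}";       # character above 255, so the UTF8 flag is on

# Concatenation upgrades the binary part; no warning and no exception.
my $mixed = $binary . $text;

for my $pair ([binary => $binary], [mixed => $mixed]) {
    my ($name, $str) = @$pair;
    printf "%-6s flag=%d chars=%d internal octets=%d\n",
        $name,
        utf8::is_utf8($str) ? 1 : 0,
        length($str),
        bytes::length($str);
}
```

Even with warnings enabled, nothing is reported; the only visible trace is 
that the internal octet count no longer matches the character count.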

> If this is confusing, you may want to 
> gzip it first, and ungzip it afterwards ;)

It is not confusing to me, but gzip wouldn't actually help when Perl 
helpfully upgrades the gzipped data to utf-8 :)

> > Now, you can *always* treat random binary data as, e.g., ISO-8859-1,
> > upgrade it to UTF-8 and then downgrade it again, since this is a
> > lossless transformation. But that doesn't mean it is a good idea
> Exactly.
> > * speed - useless transcodings
> > * memory (utf-8 needs more memory, and the transcoding, too)
> > * pack/unpack or any other "peeking" at the data might leak the fact
> > that Perl suddenly converted "\xfc" to "\xc3\xbc" underneath
> Good summary. Also, if you output it to an encodingless filehandle
> before downgrading it again, the value may contain characters greater
> than 127, and you'll get output that you probably did not intend.
> > ** random binary data (see notes above why you do not want this treated
> > as ISO-8859-1 and "text"). Basically, you never want Perl to
> > encode/decode it, and any attempt to do so should result in a
> > warning/exception. (utf-8 flag off)
> Yep.
> > ** ascii 7 bit data (utf-8 flag off)
> The UTF8 flag can also be off for 8 bit data. For ASCII data it will
> typically be off, but it wouldn't matter if it were on. (That is, if you
> treat ASCII data like text. You don't want to treat UTF8 carrying data
> as binary, though, because you will want to mix binary data with other
> binary data, without having it upgraded.)
> > As you can see, there are four different types of data, but Perl has
> > only one bit flag to distinguish them.
> I'd say it has two types of data, and indeed that one bit.
> With the bit on, it's unicode data that internally is encoded as UTF-8.
> You're not supposed to access the UTF-8 encoded octet buffer. This
> string should never be used with octet operations like vec or unpack "C"
> or "n".

I know what you mean, but the problem is that you are also proposing that 
the UTF-8 flag should be hidden from the user. So, how can I "not access 
the UTF-8 encoded" buffer when I don't know if the buffer I access is UTF-8 
or not?

I think this is also the problem Marc is having with your POV. You can't 
hide the internal encoding from the user and then tell him "do not mix 
these two different things even though you do not know which one is which". 
That's a bit, er, unrealistic.
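
For what it is worth, the only guard I can think of today is to peek at the 
flag yourself. A rough sketch, assuming a hypothetical helper octets_or_die() 
(not a core function) built on utf8::downgrade with its FAIL_OK argument:

```perl
use strict;
use warnings;

# Hypothetical helper: return a flag-off copy of $data that is safe for
# octet operations, or die if it holds characters above 0xFF.
sub octets_or_die {
    my ($data) = @_;
    my $copy = $data;
    # With a true second argument, utf8::downgrade returns false on failure
    # instead of dying, so we can raise our own error.
    utf8::downgrade($copy, 1)
        or die "string contains characters above 0xFF; not binary data\n";
    return $copy;
}

my $upgraded = "\xfc";
utf8::upgrade($upgraded);   # same single character, UTF8 flag now on

my $octets = octets_or_die($upgraded);
printf "flag before=%d, after=%d\n",
    utf8::is_utf8($upgraded) ? 1 : 0,
    utf8::is_utf8($octets)   ? 1 : 0;

# A string with a genuinely wide character cannot be treated as octets.
my $ok = eval { octets_or_die("\x{263A}"); 1 };
print "wide string rejected\n" unless $ok;
```

That works, but it means looking at exactly the flag that is supposed to be 
hidden from the user.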

> With the bit off, it's either unicode data that internally is encoded as
> ISO-8859-1, or it is binary data. This string can safely be used for
> octet operations (but of course, that doesn't make sense if the string
> was intended as text, with the exception of some ancient 8bit things
> like crypt()).
> > So whenever you have data without the utf-8 flag, Perl needs to decide
> > between the three cases mentioned above.
> It doesn't do that. Every UTF8less string is treated the same.

And that is inefficient :)

> > This is costly (scanning for high bit characters to distinguish between
> > 7bit ascii and 8bit "something else")
> I'm not aware of Perl scanning for high bit characters in UTF8less
> strings, or any performance loss caused by that.

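	use Benchmark;
	use Encode qw/decode/;
	my $a = 'a' x 100_000_000;      # 7-bit data, utf-8 flag off
	my $b = 'b' x 100_000_000;      # 7-bit data, utf-8 flag off
	my $c = 'c' x 100_000_000;      # 7-bit data, utf-8 flag off until decoded
	$c = decode('ISO-8859-1', $c);  # same 7-bit data, utf-8 flag now on
	timethese (-3, {
	  'a eq b' => sub { $a eq $b; },  # both operands have the flag off
	  'a eq c' => sub { $a eq $c; },  # flag off compared with flag on
	  } );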

Benchmark: running a eq b, a eq c for at least 3 CPU seconds...
   a eq b: 4s (4.72 usr + -0.02 sys = 4.70 CPU) @7218655.96/s (n=33927683)
   a eq c: 3s (2.80 usr +  0.46 sys = 3.26 CPU) @ 2.76/s (n=9)

I rest my case. :)

All the best,


-- 

 "Blogebrity: Wow, guess what this one stands for? Too easy. Hey, anyone
 can do it: take a blogger who's a chef, and you get: BLEF. A blogger
 who's a dentist? BENTIST. A female blogger with an itch? You guessed it:
 a BITCH."

  -- maddox from xmission

