Re: perl, the data, and the tf8 flag
From: Tels
Date: March 31, 2007 03:40
Subject: Re: perl, the data, and the tf8 flag
Message ID: 200703311240.32014@bloodgate.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Moin,
On Saturday 31 March 2007 10:03:12 Juerd Waalboer wrote:
> Tels skribis 2007-03-31 11:45 (+0000):
> > I should have said "random binary data" not "KOI8". "KOI8" implies the
> > data is some sort of text that can be "upgraded" to utf-8.
>
> Not if "upgrading" refers to the process that Perl has when it goes from
> latin1 to utf-8, because this doesn't handle arbitrary encodings like
> koi8r.
>
> Your best bet is to treat koi8r encoded data as binary data. (And as
> such, not mix it with text data.)
The "do not mix it" part is where I am currently having problems.
As far as I can see, there is nothing in Perl that prevents this from
happening, nor can I enable a warning when it happens. All you get is
corrupted data at some point, or very inefficient code (since Perl internally
uses UTF-8 where it could use just the raw bytes).
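A minimal sketch of the silent mixing described above. The variable names are made up for illustration; only the core utf8:: functions are used. Concatenating a byte string with a string that has the UTF-8 flag on upgrades the bytes without any warning, and the internal buffer then holds "\xc3\xbc" where the original held "\xfc":

```perl
use strict;
use warnings;

my $binary = "\xfc\x00\xff";          # raw octets, UTF-8 flag off
my $text   = "caf\x{e9} \x{263a}";    # has a character above 0xFF, so the flag is on

my $mixed = $binary . $text;          # no warning, no exception

printf "binary flagged: %s\n", utf8::is_utf8($binary) ? "yes" : "no";
printf "mixed flagged:  %s\n", utf8::is_utf8($mixed)  ? "yes" : "no";

# Look at the UTF-8 octets that now represent the first character.
my $internal = $mixed;
utf8::encode($internal);              # replace characters by their UTF-8 octets
printf "first character is stored as: %s\n",
    unpack("H*", substr($internal, 0, 2));
```

The character values in $mixed are still correct; what changed is the storage, which is exactly what octet-level code can trip over.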
> If this is confusing, you may want to
> gzip it first, and ungzip it afterwards ;)
It is not confusing to me, but gzip wouldn't actually help when Perl
helpfully upgrades the gzipped data to utf-8 :)
> > Now, you can *always* treat random binary data as, e.g., ISO-8859-1,
> > upgrade it to UTF-8 and then downgrade it again, since this is a
> > lossless transformation. But that doesn't mean it is a good idea:
>
> Exactly.
>
> > * speed - useless transcodings
> > * memory (utf-8 needs more memory, and the transcoding, too)
> > * pack/unpack or any other "peeking" at the data might leak the fact
> > that Perl suddenly converted "\xfc" to "\xc3\xbc" underneath
>
> Good summary. Also, if you output it to an encodingless filehandle
> before downgrading it again, the value may contain characters greater
> than 127, and you'll get output that you probably did not intend.
>
> > ** random binary data (see notes above why you do not want this treated
> > as ISO-8859-1 and "text"). Basically, you never want Perl to
> > encode/decode it, and any attempt at doing so should result in a
> > warning/exception. (utf-8 flag off)
>
> Yep.
>
> > ** ascii 7 bit data (utf-8 flag off)
>
> The UTF8 flag can also be off for 8 bit data. For ASCII data it will
> typically be off, but it wouldn't matter if it were on. (That is, if you
> treat ASCII data like text. You don't want to treat UTF8 carrying data
> as binary, though, because you will want to mix binary data with other
> binary data, without having it upgraded.)
>
> > As you can see, there are four different types of data, but Perl has
> > only one bit flag to distinguish them.
>
> I'd say it has two types of data, and indeed that one bit.
>
> With the bit on, it's unicode data that internally is encoded as UTF-8.
> You're not supposed to access the UTF-8 encoded octet buffer. This
> string should never be used with octet operations like vec or unpack "C"
> or "n".
I know what you mean, but the problem is that you are also proposing that
the UTF-8 flag should be hidden from the user. So, how can I "not access
the UTF-8 encoded" buffer when I don't know if the buffer I access is UTF-8
or not?
I think this is also the problem Marc is having with your POV. You can't
hide the internal encoding from the user and then tell him "do not mix
these two different things even though you do not know which one is which".
That's a bit, er, unrealistic.
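For illustration, here is one way a caller could guard octet operations today, assuming it is acceptable to look at the flag indirectly through the core utf8::downgrade function. The helper name as_octets is hypothetical and not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper: return a byte-string copy of the argument, or die
# if it holds characters above 0xFF and so cannot be a byte string.
sub as_octets {
    my ($s) = @_;                      # work on a copy
    utf8::downgrade($s, 1)             # second argument 1 = FAIL_OK, return false
        or die "string contains characters above 0xFF\n";
    return $s;
}

my $upgraded = "\xfc";
utf8::upgrade($upgraded);              # same character, stored as UTF-8 internally

my $bytes = as_octets($upgraded);
print unpack("H*", $bytes), "\n";      # one octet again

my $ok = eval { as_octets("\x{263a}"); 1 };
print $ok ? "accepted\n" : "rejected: $@";
```

This only works because the caller explicitly normalizes before every octet operation, which is the burden being complained about here.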
> With the bit off, it's either unicode data that internally is encoded as
> ISO-8859-1, or it is binary data. This string can safely be used for
> octet operations (but of course, that doesn't make sense if the string
> was intended as text, with the exception of some ancient 8bit things
> like crypt()).
>
> > So whenever you have data without the utf-8 flag, Perl needs to decide
> > between the three cases mentioned above.
>
> It doesn't do that. Every UTF8less string is treated the same.
And that is inefficient :)
> > This is costly (scanning for high bit characters to distinguish between
> > 7bit ascii and 8bit "something else")
>
> I'm not aware of Perl scanning for high bit characters in UTF8less
> strings, or any performance loss caused by that.
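use strict;
use warnings;
use Benchmark qw(timethese);
use Encode qw(decode);

# Each string is 100 MB, so this needs several hundred MB of RAM.
my $a = 'a' x 100_000_000;       # 7bit, utf-8 flag off
my $b = 'b' x 100_000_000;       # 7bit, utf-8 flag off
my $c = 'c' x 100_000_000;       # 7bit, utf-8 flag still off here
$c = decode('ISO-8859-1', $c);   # same 7bit data, utf-8 flag now on

timethese(-3, {
    'a eq b' => sub { $a eq $b },   # both operands have the flag off
    'a eq c' => sub { $a eq $c },   # one operand off, one on
});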
Benchmark: running a eq b, a eq c for at least 3 CPU seconds...
a eq b: 4s (4.72 usr + -0.02 sys = 4.70 CPU) @7218655.96/s (n=33927683)
a eq c: 3s (2.80 usr + 0.46 sys = 3.26 CPU) @ 2.76/s (n=9)
I rest my case. :)
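A smaller variation on the same measurement, as a sketch only: it normalizes the flagged operand once with utf8::downgrade before comparing, so both operands share one representation. The variable names and the 1 MB size are assumptions chosen to keep it quick, and the timings will depend on the Perl version, since other perls may handle the mixed comparison differently.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

my $plain    = 'a' x 1_000_000;    # utf-8 flag off
my $other    = 'c' x 1_000_000;    # utf-8 flag off
my $upgraded = $other;
utf8::upgrade($upgraded);          # same characters, utf-8 flag on

# Normalize once, outside the hot loop. utf8::downgrade would die on
# characters above 0xFF, which cannot occur in this data.
my $normalized = $upgraded;
utf8::downgrade($normalized);

timethese(1000, {
    'mixed flags'    => sub { $plain eq $upgraded },
    'both flags off' => sub { $plain eq $normalized },
});
```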
All the best,
Tels
- --
Signed on Sat Mar 31 12:28:15 2007 with key 0x93B84C15.
View my photo gallery: http://bloodgate.com/photos
PGP key on http://bloodgate.com/tels.asc or per email.
"Blogebrity: Wow, guess what this one stands for? Too easy. Hey, anyone
can do it: take a blogger who's a chef, and you get: BLEF. A blogger
who's a dentist? BENTIST. A female blogger with an itch? You guessed it:
a BITCH."
-- maddox from xmission
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
iQEVAwUBRg5Wv3cLPEOTuEwVAQKVMQf9G1RLUfo+fY+H8dn4Qa+ggbL/IRnOz3wi
sR4KAw32xrCvHPZYkQRPm1xVJiDwpMDgEgdVSEo6Ot9qA3TLXGadF4F9PMzPQRWM
4509df7yoEulvKsKNiqHFJSbxO8KlVaX4CO8Zr/8aCnM4IIajBuISRQUtLARRl/d
VQacgTOJwHCkaRqB8T+9kdP3U9OV72xXoYDHRXRbJOiav7QVGmmVib5M2ZQWj5zv
H8r1daSG7mFg3qCUE/KKYLAC2hmMMvC31zhMzWveAxlFE5hWg+EyYFzxbPk9sisT
69seb4XaXXrpM/jn7C3Gq2GKeEggeRDrAhw3DvlPrO0r1VZYvmFDwQ==
=YMp0
-----END PGP SIGNATURE-----