Re: perl, the data, and the tf8 flag
April 2, 2007 04:12
Message ID: email@example.com
-----BEGIN PGP SIGNED MESSAGE-----
On Sunday 01 April 2007 23:30:20 Marc Lehmann wrote:
> On Sat, Mar 31, 2007 at 11:45:20AM +0000, Tels wrote:
> > I should have said "random binary data" not "KOI8". "KOI8" implies the
> > data is some sort of text that can be "upgraded" to utf-8.
> Well, KOI8 is a good example because you want to handle that, too, and it
> needs different treatment than binary data.
> > Now, you can *always* treat random binary data as, f.i., ISO-8859-1,
> > upgrade it to UTF-8 and then downgrade it again, since this is a
> > lossless transformation. But that doesn't mean it is a good idea
> > because:
> Sure. But do you care that much about your scalars being in integer or in
> string form (much slower for arithmetics)?
> No, because you trust perl to do its best in avoiding conversions. You do
> not even have a way of knowing what's in your scalar (number or string)
> from perl, and that's the right thing.
> So you should trust perl on getting it right in enough cases, and perl
> should try hard to avoid unnecessary upgrades/downgrades (of course,
> you can always watch out for that so important optimisations are being
> implemented, after all, that big character stuff is quite new).
I agree that the "do not convert data needlessly" is a second issue. It just
plays into this because *if* the data gets "upgraded", it also affects
unpack with C.
Plus, for large data it is *very* slow. Not only the conversion, but all
subsequent accesses to the data (owing to the fact that utf8 has
potentially more than one byte per character).
Converting a 7-bit ASCII string to UTF-8 is just wasteful.
> > * pack/unpack or any other "peeking" at the data might leak the fact
> > that Perl suddenly converted "\xfc" to "\xc3\xbc" underneath (as Marc's
> > bug report showed).
> And eventually it will be fixed, I am sure.
I hope so :)
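To make the "\xfc" versus "\xc3\xbc" point concrete, here is a minimal sketch (not taken from Marc's bug report) that upgrades a one-character string and compares the character count with the size of the internal buffer. It only uses the core utf8:: functions and bytes::length; it deliberately does not call unpack, since the behaviour of unpack with C is exactly what is under discussion.

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length without enabling byte semantics

my $s = "\xfc";                         # one character, stored as one byte

my $chars_before = length($s);          # length in characters
my $bytes_before = bytes::length($s);   # length of the internal buffer

utf8::upgrade($s);                      # switch internal representation to UTF-8

my $chars_after = length($s);
my $bytes_after = bytes::length($s);
my $flag_after  = utf8::is_utf8($s) ? 1 : 0;

printf "chars: %d -> %d, internal bytes: %d -> %d, utf8 flag: %d\n",
    $chars_before, $chars_after, $bytes_before, $bytes_after, $flag_after;
```

The character count stays the same while the buffer grows, so any code that peeks at the buffer instead of the characters will see the difference.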
> > So, yes, if Perl works perfectly in every place, converting your data
> > always on the fly whenever you look at it, you could stuff "KOI8" or
> > any other random binary data in, have it (maybe) converted to utf-8,
> > and on output, or when you look at it, have it converted back to the
> > exact bytes you stuffed in.
> > However, as you demonstrated yourself, Perl doesn't work perfectly :)
> Yeah. It never does and never will, but it should work well enough that
> people hit bugs rarely enough (I hit bugs with integer/string conversion
> in my life, but that doesn't mean I worry about it when using perl :).
Actually, in a related topic I do care very much about conversion, namely
BigInt vs integer. Here it is also *very* wasteful to convert "1" to
I agree that if you only write short scripts that deal with little data,
this shouldn't concern you. However, if you care about making Perl more
efficient so it can handle larger data without choking itself to death, then
you do worry. Like me :-)
> > ** random binary data (see notes above why you do not want this treated
> > as ISO-8859-1 and "text"). Basically, you never want Perl to
> > encode/decode it, and any attempt at doing so should result in a
> > warning/exception. (utf-8 flag off)
> Perl never encodes/decodes it without you knowing it (by calling a
> function to do it).
Only if I control all the data all the time. Unfortunately, I don't :)
All these can happen in random remote code places:
# example 1
$binary_7bit_data = 'a'; # this is probably upgraded
# example 2
$binary_7bit_data = 'A';
$binary_data_2 = decode('ISO-8859-1', $binary_7bit_data);
Both are now upgraded. Or might not be, depending on how clever Perl is.
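Rather than guessing, the flag can simply be inspected. The following is only a sketch: whether decode() turns the flag on for pure 7-bit input may depend on the Encode version, so the script prints what it finds and makes no claim about the result. The variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $plain   = 'A';                            # plain literal, utf8 flag off
my $decoded = decode('ISO-8859-1', $plain);   # may or may not come back flagged

# utf8::is_utf8 reports only the internal flag, not the content.
printf "plain:   utf8 flag = %d\n", utf8::is_utf8($plain)   ? 1 : 0;
printf "decoded: utf8 flag = %d\n", utf8::is_utf8($decoded) ? 1 : 0;

# Whatever the flag says, both strings hold the same characters.
my $same = ($plain eq $decoded) ? 1 : 0;
print "same characters: $same\n";
```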
> In fact, even for KOI8-R, for speed, some people might want their regexes
> to work in KOI8-space, not unicode-space.
Yes, this is another issue, that Perl can natively only handle basically
ISO-8859-1 and Unicode. However, one thing at a time :)
> > ** 8bit data with an encoding (assumed is ISO-8859-1, but user can
> > specify other types of encoding during a call to "decode") (utf-8 flag
> > off)
> > ** utf-8 data (utf-8 flag on)
> > As you can see, there are four different types of data, but Perl has
> > only one bit flag to distinguish them.
> There are many more types of data perl cannot differentiate. It is, in
> general, not productive to talk about types with perl, as perl has no
> types whatsoever for scalars, in the very language.
> Typelessness is the defining aspect of perl (and many scripting languages
> in fact).
However, Perl is able to differentiate between subtypes of data, f.i.
integer and float. And this distinction is *very* important. Not so much to
the user, but to Perl internally. (Otherwise we could just put
everything into a quad-float and call it a day, likewise stuff every string
into utf8 and go home :)
> > So whenever you have data without the utf-8 flag, Perl needs to decide
> > between the three cases mentioned above. And since it cannot store the
> > decision of "already seen 7bit ASCII", it needs to do this again
> > sometime later.
> No, perl never needs to decide. The programmer always has to tell it
(The following problem is strictly a performance (memory and CPU) problem,
it should work "transparently" for the programmer - except the visible
But if you have a string like "hello world", and compare it to an UTF-8
string, then Perl *needs* to decide whether "hello world" is only 7-bit
data, or not. If it already knew it was only 7-bit, then it could skip the
conversion. Since it can't, Perl "upgrades" the string temporarily. And does so
again and again and again. This even happens if you compare "hello world"
to "hello perl" (with the UTF-8 flag on).
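As an illustration of the decision described above, here is a small sketch. The helper is_7bit() is hypothetical (it is not a Perl builtin); it shows the kind of scan that has to be repeated because the "already seen to be 7-bit" result cannot be stored on the scalar. The comparison itself stays transparent to the programmer; only the internal work differs.

```perl
use strict;
use warnings;

# Hypothetical helper: true if no character in the string is above 0x7F.
sub is_7bit {
    my ($str) = @_;
    return $str !~ /[^\x00-\x7f]/ ? 1 : 0;
}

my $plain   = 'hello world';    # utf8 flag off
my $flagged = 'hello perl';
utf8::upgrade($flagged);        # same characters, utf8 flag now on

# Mixed comparison: the programmer sees an ordinary string compare.
my $equal = ($plain eq $flagged) ? 1 : 0;

# An upgraded copy still compares equal to its non-upgraded original.
my $copy = $plain;
utf8::upgrade($copy);
my $copy_equal = ($plain eq $copy) ? 1 : 0;

printf "7bit(plain)=%d equal=%d copy_equal=%d\n",
    is_7bit($plain), $equal, $copy_equal;
```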
There are three solutions to this problem:
* don't upgrade "hello world" to utf8 if 7bit
* always upgrade it
* have another flag to distinguish this data from other data
The "solutions" have different problems of their own:
* if you don't upgrade it, the data needs to be examined again, later
* if you upgrade it, other data that comes in contact with it needs to be
* a new flag is well, hard
The second solution (as it is currently implemented by my understanding)
means that a single "A" with the utf8-flag set might cause all your
strings to get upgraded, eventually, in a chain-like reaction. F.i.:
use Encode qw(decode);
# or any other way to set the utf8 flag
$a = decode('ISO-8859-1','a');
$b = "$a"; # upgraded, too
$c = 'hello world' . $a; # too
if ($d eq $a) { } # $d temp. upgraded
print "hello $a"; # upgrades, then probably downgrades for output
and so on. Now imagine that $b, $c, and $d are long strings, and you can see
that Perl will spend a considerable time in upgrading and downgrading
strings, up to the point where all the conversions take more time than the
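The spreading of the flag can be observed directly. The sketch below uses utf8::upgrade() as a stand-in for "any other way to set the utf8 flag" so that the starting state does not depend on what decode() returns, and the variable names differ from the example above to keep it clean under strict. It shows only where the flag ends up, not how much time the conversions cost.

```perl
use strict;
use warnings;

my $seed = 'a';
utf8::upgrade($seed);                 # one flagged single-character string

my $copy   = "$seed";                 # interpolated copy
my $joined = 'hello world' . $seed;   # unflagged literal meets flagged string

my %flag = (
    seed   => utf8::is_utf8($seed)   ? 1 : 0,
    copy   => utf8::is_utf8($copy)   ? 1 : 0,
    joined => utf8::is_utf8($joined) ? 1 : 0,
);

printf "%-6s utf8 flag = %d\n", $_, $flag{$_} for qw(seed copy joined);
```

Every string that touched the flagged one ends up flagged as well, even though all of the data is plain 7-bit ASCII.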
> > As an author who inherited software that deals with random binary data
> > (e.g. JPEGs), this deficiency concerns me.
> It shouldn't, unless you have evidence that you encounter a problem.
Well, yeah, the not-yet-fixed C issue comes to mind :-)
> > > Only when you hit bugs, or unpack.
> > <sarcasm> and you never hit bugs, or use unpack </sarcasm> :)
> Well, bugs will be fixed eventually, and bugs are a common phenomenon,
> perl is buggy as hell, but it works quite fine. Just as gcc is buggy as
> hell, but still is the basis for quite a lot of useful software, and
> nobody worries much about gcc bugs.
Yeah, but some people have to or the bugs never get fixed :)
All the best,
Signed on Mon Apr 2 12:51:16 2007 with key 0x93B84C15.
Get one of my photo posters: http://bloodgate.com/posters
PGP key on http://bloodgate.com/tels.asc or per email.
How to avoid sex in space: "Just send a married couple, two gays, two
lesbians, the Pope and Darl McBride on the mission. Since no one loves
Darl, and the Pope loves everyone but does not have sex, relationships
are stable." RedLaggedTeut (216304) on /. on 2005-10-22
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
-----END PGP SIGNATURE-----