
Re: perl, the data, and the tf8 flag

April 2, 2007 04:12


On Sunday 01 April 2007 23:30:20 Marc Lehmann wrote:
> On Sat, Mar 31, 2007 at 11:45:20AM +0000, Tels wrote:
> > I should have said "random binary data" not "KOI8". "KOI8" implies the
> > data is some sort of text that can be "upgraded" to utf-8.
> Well, KOI8 is a good example because you want to handle that, too, and it
> needs different treatment than binary data.
> > Now, you can *always* treat random binary data as f.i. ISO-8859-1,
> > upgrade it to UTF-8 and then downgrade it again, since this is a
> > lossless transformation. But that doesn't mean it is a good idea
> > because:
> Sure. But do you care that much about your scalars being in integer or in
> string form (much slower for arithmetics)?
> No, because you trust perl to do its best in avoiding conversions. You do
> not even have a way of knowing what's in your scalar (number or string)
> from perl, and that's the right thing.
> So you should trust perl on getting it right in enough cases, and perl
> should try hard to avoid unnecessary upgrades/downgrades (of course,
> you can always watch out for that so important optimisations are being
> implemented, after all, that big character stuff is quite new).

I agree that the "do not convert data needlessly" is a second issue. It just 
plays into this because *if* the data gets "upgraded", it also affects 
unpack with C.

Plus, for large data it is *very* slow. Not only the conversion, but all 
subsequent accesses to the data (owing to the fact that utf8 has 
potentially more than one byte per character).

Converting a 7 bit ASCII string to UTF-8 is just wasteful.
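
To put a number on the memory side of this, here is a minimal sketch that measures the internal storage of a string before and after an explicit upgrade. The helper name `storage_report` and the 1000-character test strings are assumptions made for illustration; `utf8::upgrade`, `utf8::is_utf8` and `bytes::length` are the stock Perl functions.

```perl
use strict;
use warnings;
use bytes ();    # only for bytes::length(); does not enable byte semantics

# Return (characters, bytes of internal storage, flag state) for a string.
# storage_report is a hypothetical helper, not part of any module.
sub storage_report {
    my ($str) = @_;
    return (length($str), bytes::length($str), utf8::is_utf8($str) ? 1 : 0);
}

my $ascii   = 'a' x 1_000;       # pure 7-bit data
my $highbit = "\xfc" x 1_000;    # every byte has the high bit set

my @ascii_before   = storage_report($ascii);
my @highbit_before = storage_report($highbit);

utf8::upgrade($ascii);
utf8::upgrade($highbit);

my @ascii_after   = storage_report($ascii);
my @highbit_after = storage_report($highbit);

printf "ascii:   %d chars, %d -> %d bytes, flag %d -> %d\n",
    $ascii_after[0], $ascii_before[1], $ascii_after[1],
    $ascii_before[2], $ascii_after[2];
printf "highbit: %d chars, %d -> %d bytes, flag %d -> %d\n",
    $highbit_after[0], $highbit_before[1], $highbit_after[1],
    $highbit_before[2], $highbit_after[2];
```

The 7-bit string keeps its size and only gains the flag, while the high-bit string doubles in storage. The character count stays the same in both cases, which is why the cost is invisible from the Perl level.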

> > * pack/unpack or any other "peeking" at the data might leak the fact
> > that Perl suddenly converted "\xfc" to "\xc3\xbc" underneath (as Marc's
> > bug report showed).
> And eventually it will be fixed, I am sure.

I hope so :)
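
The "\xfc" versus "\xc3\xbc" difference can be made visible without relying on any particular unpack behaviour. The following is a sketch only: `internal_hex` is a made-up helper, and it uses `Encode::_utf8_off`, which the Encode documentation describes as internal, on a copy of the string so the original is left alone.

```perl
use strict;
use warnings;
use Encode ();

# Return the internal storage of a string as hex.
# internal_hex is a hypothetical helper for illustration.
sub internal_hex {
    my ($str) = @_;             # $str is a copy; the caller's string is untouched
    Encode::_utf8_off($str);    # drop the flag on the copy, exposing the raw storage
    return unpack('H*', $str);
}

my $plain    = "\xfc";
my $upgraded = "\xfc";
utf8::upgrade($upgraded);

print internal_hex($plain),    "\n";    # fc
print internal_hex($upgraded), "\n";    # c3bc
print $plain eq $upgraded ? "equal\n" : "different\n";    # equal
```

Both scalars compare equal as strings, so any code that sees a difference between them is looking at the storage rather than at the characters.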

> > So, yes, if Perl works perfectly in every place, converting your data
> > always on the fly whenever you look at it, you could stuff "KOI8" or
> > any other random binary data in, have it (maybe) converted to utf-8,
> > and on output/looking at converted back to the exact bytes you stuffed
> > in.
> >
> > However, as you demonstrated yourself, Perl doesn't work perfectly :)
> Yeah. It never does and never will, but it should work well enough that
> people hit bugs rarely enough (I hit bugs with integer/string conversion
> in my life, but that doesn't mean I worry about it when using perl :).

Actually, in a related topic I do care very much about conversion, namely 
BigInt vs integer. Here it is also *very* wasteful to convert "1" to 

I agree that if you only write short scripts that deal with little data, 
this shouldn't concern you. However, if you care about making Perl more 
efficient so it can handle larger data without choking itself to death, then 
you do worry. Like me :-)

> > ** random binary data (see notes above why you do not want this treated
> > as ISO-8859-1 and "text"). Basically, you never want Perl to
> > encode/decode it, and any attempt in doing so should result in an
> > warning/exception. (utf-8 flag off)
> Perl never encodes/decodes it without you knowing it (by calling a
> function to do it).

Only if I control all the data all the time. Unfortunately, I don't :)
All these can happen in random remote code places:

	# example 1
	use utf8;
	$binary_7bit_data = 'a';	# this is probably upgraded

	# example 2
	use Encode qw(decode);		# decode() comes from Encode
	$binary_7bit_data = 'A';
	$binary_data_2 = decode('ISO-8859-1', $binary_7bit_data);

Both are now upgraded. Or might not be. Depending on cleverness of Perl.
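
Whether the flag really ends up set depends on the Perl and Encode versions in use, so the simplest thing is to ask. A small sketch, with variable names chosen for illustration, that reports the flag state for both cases and deliberately makes no claim about what a given installation will print:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $literal;
{
    use utf8;          # changes how literals inside this block are parsed
    $literal = 'a';
}

my $octets  = 'A';
my $decoded = decode('ISO-8859-1', $octets);

# Report the flag for each scalar; the result may differ between versions.
for my $pair ([literal => $literal], [octets => $octets], [decoded => $decoded]) {
    my ($name, $value) = @$pair;
    printf "%-8s flag=%s\n", $name, utf8::is_utf8($value) ? 'on' : 'off';
}
```

Whatever the flags turn out to be, the scalars still compare equal to their 7-bit originals, so only the storage and the speed are affected.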

> In fact, even for KOI8-R, for speed, some people might want their regexes
> to work in KOI8-space, not unicode-space.

Yes, this is another issue, that Perl can natively only handle basically 
ISO-8859-1 and Unicode. However, one thing at a time :)

> > ** 8bit data with an encoding (assumed is ISO-8859-1, but user can
> > specify other types of encoding during a call to "decode") (utf-8 flag
> > off)
> >
> > ** utf-8 data (utf-8 flag on)
> >
> > As you can see, there are four different types of data, but Perl has
> > only one bit flag to distinguish them.
> There are many more types of data perl cannot differentiate. It is, in
> general, not productive to talk about types with perl, as perl has no
> types whatsoever for scalars, in the very language.
> typelessness is the defining aspect of perl (and many scripting languages
> in fact).

However, Perl is able to differentiate between subtypes of data, f.i. 
integer and float. And this distinction is *very* important. Not so much to 
the user, but to Perl internally. (Otherwise we could just put 
everything into a quad-float and call it a day, likewise stuff every string 
into utf8 and go home :)
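
The internal integer/float/string distinction can be inspected with the core B module. This is a sketch for illustration only: the `slots` helper is made up, and it reads the private "value is valid" flags (`SVp_IOK`, `SVp_NOK`, `SVp_POK`) of a scalar.

```perl
use strict;
use warnings;
use B ();

# List which internal value slots of a scalar are currently valid.
# slots is a hypothetical helper; it takes a reference to the scalar.
sub slots {
    my ($ref) = @_;
    my $flags = B::svref_2object($ref)->FLAGS;
    my @slots;
    push @slots, 'integer' if $flags & B::SVp_IOK();
    push @slots, 'float'   if $flags & B::SVp_NOK();
    push @slots, 'string'  if $flags & B::SVp_POK();
    return join ',', @slots;
}

my $int   = 42;
my $float = 1.5;
my $str   = 'hello';

print slots(\$int),   "\n";    # integer
print slots(\$float), "\n";    # float
print slots(\$str),   "\n";    # string

my $both = '10';
my $sum  = $both + 1;          # numeric use caches an integer next to the string
print slots(\$both),  "\n";    # integer,string
```

The last case is the interesting one: Perl keeps the converted value around so the conversion is not repeated, which is the kind of caching the utf8 flag alone cannot provide for "already seen 7-bit" strings.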

> > So whenever you have data without the utf-8 flag, Perl needs to decide
> > between the three cases mentioned above. And since it cannot store the
> > decision of "already seen 7bit ASCII", it needs to do this again
> > sometime later.
> No, perl never needs to decide. The programmer always has to tell it
> explicitly.

(The following problem is strictly a performance (memory and CPU) problem; 
it should work "transparently" for the programmer, apart from the visible 
speed issues:)

But if you have a string like "hello world", and compare it to a UTF-8 
string, then Perl *needs* to decide whether "hello world" is only 7 bit 
data, or not. If it already knew it was only 7-bit, then it could skip the 
conversion. Since it can't, Perl "upgrades" the string temporarily. And does so 
again and again and again. This even happens if you compare "hello world" 
to "hello perl" (with the UTF-8 flag on).
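
The point about 7-bit data can be checked from Perl space. A rough sketch, assuming two made-up helpers, `is_7bit` and `storage_sizes`: for a 7-bit string the storage is the same with or without the flag, so a conversion buys nothing, while a string with high-bit characters grows.

```perl
use strict;
use warnings;
use bytes ();    # for bytes::length() only

# True if the string contains nothing above code point 0x7f.
sub is_7bit {
    my ($str) = @_;
    return $str =~ /[^\x00-\x7f]/ ? 0 : 1;
}

# Internal storage size before and after an upgrade, measured on a copy.
sub storage_sizes {
    my ($str) = @_;
    my $before = bytes::length($str);
    utf8::upgrade($str);
    return ($before, bytes::length($str));
}

for my $sample ('hello world', "h\xfcllo") {
    my ($before, $after) = storage_sizes($sample);
    printf "7bit=%d storage %d -> %d\n", is_7bit($sample), $before, $after;
}
```

The scan in `is_7bit` is exactly the work that has to be redone on every comparison as long as its result cannot be remembered.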

There are three solutions to this problem:

	* don't upgrade "hello world" to utf8 if 7bit
	* always upgrade it
	* have another flag to distinguish this data from other data

The "solutions" each have problems of their own:

	* if you don't upgrade it, the data needs to be examined again, later
	* if you upgrade it, other data that comes in contact with it needs to be
	  upgraded, too
	* a new flag is, well, hard
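
To illustrate what the third option would buy, here is a Perl-space sketch of remembering the "is 7-bit" decision. Everything in it is hypothetical (the `SevenBitCache` package, its fields and methods); a real flag would live in the scalar itself, not in a wrapper object.

```perl
use strict;
use warnings;

package SevenBitCache;    # hypothetical wrapper, for illustration only

sub new {
    my ($class, $data) = @_;
    return bless { data => $data, seven_bit => undef, scans => 0 }, $class;
}

# Scan the data once, then answer from the cached result.
sub is_7bit {
    my ($self) = @_;
    unless (defined $self->{seven_bit}) {
        $self->{scans}++;
        $self->{seven_bit} = $self->{data} =~ /[^\x00-\x7f]/ ? 0 : 1;
    }
    return $self->{seven_bit};
}

sub scans { return $_[0]{scans} }

package main;

my $cached = SevenBitCache->new('hello world' x 1000);
$cached->is_7bit for 1 .. 5;    # only the first call scans the data
printf "7bit=%d after %d scan(s)\n", $cached->is_7bit, $cached->scans;
```

The wrapper also shows why this is hard: the cached answer has to be invalidated on every modification of the data, which this sketch does not attempt.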

The second solution (as it is currently implemented, by my understanding) 
means that even a single "A" with the utf8-flag set might cause all your 
strings to get upgraded, eventually, in a chain-reaction-like way. F.i.:

	# or any other way to set the utf8 flag
	$a = decode('ISO-8859-1','a');

	$b = "$a";			# upgraded, too
	$c = 'hello world' . $a;	# too

	if ($d eq $a)			# $d temp. upgraded

	print "hello $a";		# upgrades, then probably downgrades
					# for output

and so on. Now imagine that $b, $c, and $d are long strings, and you can see 
that Perl will spend a considerable time in upgrading and downgrading 
strings, up to the point where all the conversions take more time than the 

> > As an author who inherited software that deals with random binary data
> > (e.g. JPEGs), this deficiency concerns me.
> It shouldn't, unless you have evidence that you encounter a problem.

Well, yeah, the not-yet-fixed C issue comes to mind :-)

> > > Only when you hit bugs, or unpack.
> > <sarcasm> and you never hit bugs, or use unpack </sarcasm> :)
> Well, bugs will be fixed eventually, and bugs are a common phenomenon,
> perl is buggy as hell, but it works quite fine. Just as gcc is buggy as
> hell, but still is the basis for quite a lot of useful software, and
> nobody worries much about gcc bugs.
> :)

Yeah, but some people have to or the bugs never get fixed :)

All the best,


- -- 
 Signed on Mon Apr  2 12:51:16 2007 with key 0x93B84C15.

 How to avoid sex in space: "Just send a married couple, two gays, two
 lesbians, the Pope and Darl McBride on the mission. Since no one loves
 Darl, and the Pope loves everyone but does not have sex, relationships
 are stable." RedLaggedTeut (216304) on /. on 2005-10-22 


