Re: perl, the data, and the tf8 flag
From: Marc Lehmann
April 2, 2007 09:30
Message ID: 20070402162953.GC1403@schmorp.de
On Mon, Apr 02, 2007 at 01:11:20PM +0000, Tels <email@example.com> wrote:
> > you can always watch out for that so important optimisations are being
> > implemented, after all, that big character stuff is quite new).
> I agree that the "do not convert data needlessly" is a second issue. It just
> plays into this because *if* the data gets "upgraded", it also affects
> unpack with C.
> Plus, for large data it is *very* slow. Not only the conversion, but all
> subsequent accesses to the data (owing to the fact that utf8 has
> potentially more than one byte per character).
Nah, it is not *very* slow. Really. And it rarely happens: perl really does
just the right thing in most circumstances, and chances are pretty low it will
upgrade against your assumptions. Which is also why unpack "C" breakage is
rarely a practical concern, because most binary data still stays binary data.
It is a large concern when it is exposed in a module interface, because
users are bound to pass more upgraded data into such modules than
they do now, and expect that to work.
The problem here is that the module user is not the module author, so he/she
can only assume what kind of scalar they get out of a module or put into a
I very often happen to have binary data that is upgraded, and I also know
what to do when. Other users might not have upgraded data in the first
place, and still others might not know they have it and hit spurious
And the premise is that nobody should need to know, perl must just do the
right thing(tm), and should do so efficiently.
> Converting a 7 bit ASCII string to UTF-8 is just wasteful.
Actually, not really. It can be made a simple memcpy with no change in
interpretation. But of course 8-bit data wastes some 50% or so on memory
(and something similar in cpu).
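
The size effect can be measured with a minimal sketch like the one below. It assumes utf8::upgrade() is used to force the internal UTF-X form (in normal code perl decides this on its own), and the two sample strings are made up for illustration. For pure 7-bit data the byte count stays the same; for data consisting only of octets above 0x7F it doubles, which is the worst case.

```perl
use strict;
use warnings;
use bytes ();   # load only, for bytes::length(); byte semantics are NOT enabled

my $ascii  = "a" x 10;       # 7-bit data (illustrative)
my $binary = "\xe9" x 10;    # 8-bit data, every octet above 0x7F (illustrative)

for my $ref (\$ascii, \$binary) {
    my $before = bytes::length($$ref);   # size of the internal buffer
    utf8::upgrade($$ref);                # force the internal UTF-X form
    my $after  = bytes::length($$ref);
    printf "%d characters: %d bytes before, %d bytes after\n",
        length($$ref), $before, $after;
}
```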
> > Yeah. It never does and never will, but it should work well enough that
> > people hit bugs rarely enough (I hit bugs with integer/string conversion
> > in my life, but that doesn't mean I worry about it when using perl :).
> Actually, in a related topic I do care very much about conversion, namely
> BigInt vs integer. Here it is also *very* wasteful to convert "1" to
Yes, that's a very good example. The same applies: there are good reasons
for wanting that conversion done transparently (premise: Math::BigInt has
semantics identical to built-in scalars), and good reasons for keeping it
a normal scalar as long as possible.
Everybody likely agrees that if Math::BigInt doesn't work like an integer
in perl (with more bits) then Math::BigInt is buggy (even if that bug is
in fact a limitation in perl not being able to implement that semantics
And most people would agree that it is a good thing if that conversion was
kept to a minimum, for efficiency reasons.
> I agree that if you only write short scripts that deal with little data,
> this shouldn't concern you. However, if you care about making Perl more
> efficient so it can handle larger data without choking itself to death, then
> you do worry. Like me :-)
The point is you do worry needlessly, as perl is very good at not hitting
that speed problem. Sure you can worry, but if chances are low enough then
your worrying is a waste of your precious time :) (Ok, I do not know whether
your time is precious, but mine is :->)
Compare that with the recent substr-eats-memory thread: you probably
used substr a lot in your life, relying on its non-copying semantics for
speed. Did you worry about that? Probably not, but somebody did, and it
is considered a deficiency that should be fixed (whether it can easily and
when is another thing).
Fact is, you do not worry constantly about that kind of problem, and the
same thing applies to upgrades.
> > Perl never encodes/decodes it without you knowing it (by calling a
> > function to do it).
> Only if I control all the data all the time. Unfortunately, I don't :)
> All these can happen in random remote code places:
Ah, but I meant *encode* or *decode*, not upgrade or downgrade.
(Of course, perl programmers should not need to know about down/upgrades anyways).
The problem with calling it encoding/decoding (which is almost exactly the
same thing in perl) is that the semantics do not change, so you do not
really encode or decode anything on the perl level.
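
As a small sketch of what "the semantics do not change" means: the code below forces an upgrade with utf8::upgrade() and then compares the result with the original on the perl level. The sample string is an assumption chosen for illustration. Only the internal flag differs, so all four lines are expected to print.

```perl
use strict;
use warnings;

my $octets   = "caf\xe9";    # four characters, the last one is 0xE9
my $upgraded = $octets;
utf8::upgrade($upgraded);    # changes the internal representation only

print "internal flag differs\n"
    if utf8::is_utf8($upgraded) && !utf8::is_utf8($octets);
print "strings compare equal\n" if $octets eq $upgraded;
print "same length\n"           if length($octets) == length($upgraded);
print "same last character\n"   if ord(substr($upgraded, -1)) == 0xE9;
```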
> # example 1
> use utf8;
> $binary_7bit_data = 'a'; # this is probably upgraded
> # example 2
> $binary_7bit_data = 'A';
> $binary_data_2 = decode('ISO-8859-1', $binary_7bit_data);
> Both are now upgraded. Or might be not. Depending on cleverness of Perl.
Neither are upgraded, and *likely* never will be. No guarantees, but if
you implement it, you will find it likely doesn't make much sense so it is
not done (as long as the choice is UTF-X vs. octets).
Of course, if speed *were* the paramount problem, then one could store the
scalar twice (if needed), once in octet form and once in utf-x form, and
then use whatever is the most efficient form. Just as perl does when you
use a scalar as a string and an integer: perl will remember both values as
long as possible to save a conversion between them.
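
The string/integer caching mentioned above can be observed with the core Devel::Peek module. This is only a sketch: the exact dump format and flag names depend on the perl version, so no particular output is promised here.

```perl
use strict;
use warnings;
use Devel::Peek ();    # core module; Dump() writes to STDERR

my $s = "42";
Devel::Peek::Dump($s);    # at this point only the string (PV) value is present

my $n = $s + 0;           # use $s as a number once
Devel::Peek::Dump($s);    # typically shows a cached integer (IV) next to the PV
```

Comparing the two dumps shows whether your perl kept both representations of $s after the numeric use.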
> > In fact, even for KOI8-R, for speed, some people might want their regexes
> > to work in KOI8-space, not unicode-space.
> Yes, this is another issue, that Perl can natively only handle basically
> ISO-8859-1 and Unicode. However, one thing at a time :)
No. No. No. Everybody says that, but, frankly, I am convinced that's not just
bullshit, but very detrimental.
You could say that perl 5.005 only handles latin1, and it would be wrong as
    $koi8r = <STDIN>;
Nothing in perl assumes $koi8r is latin1. Nor koi8-r. That's up to you. Perl
only interprets something when you tell it to do so:
    $koi8r = <STDIN>;
    $koi8r =~ /ü/ or die;
Now perl interprets $koi8r as unicode. Even in 5.005, it did so, except it
couldn't handle more than the first 255 unicode characters.
Perl really doesn't support *any* encoding for scalars. Assuming so (and
separating the world into byte vs. text, 8-bit-text vs. unicode) is, IMHO,
detrimental to using it.
Perl just stores strings as concatenation of character indices. How it
interprets them is the job of the programmer, and the programmer has to
specify that explicitly:
    read $fh, my $data, 64;
    my $unicode = Encode::decode "iso-2022-jp", $data;
Here you *tell* Perl explicitly to interpret your data scalar as iso-2022-jp
and return something.
That something can then later be interpreted as unicode values *iff* and when
you tell Perl to do that, e.g. by matching it against a regex.
In the meantime, even though $unicode very likely contains unicode
characters, Perl does not assume so.
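
A runnable variant of the same idea, as a sketch only: instead of reading from a filehandle, the octets are written inline, and KOI8-R is used as the encoding (both are assumptions made for illustration, not part of the example above). Before the decode call the scalar is just six character indices below 256; afterwards it is six character indices that happen to be Cyrillic codepoints.

```perl
use strict;
use warnings;
use Encode ();

# Six octets; they spell the Russian word "Privet" if read as KOI8-R.
my $data = "\xf0\xd2\xc9\xd7\xc5\xd4";
printf "before: %d characters, first is 0x%X\n", length($data), ord($data);

# Only here do we tell perl how to interpret the octets.
my $unicode = Encode::decode("koi8-r", $data);
printf "after:  %d characters, first is 0x%X\n", length($unicode), ord($unicode);
```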
    my $ch = chr 2**30;
Here, $ch does not contain a unicode (5.0) character. Never ever, because
it is not in range for a unicode codepoint (the highest codepoint that is
defined, and that UTF-16 can store, is U+10FFFF, and UTF-16 would break down
if that ever needed to be exceeded).
But Perl complies, and does what I assume to be the right thing: it stores
that value in your string. How you interpret it is your job, by telling
Understand that, and that very few parts of perl are buggy (unpack "C"
:), and you will have a lot fewer problems worrying about that unicode
string stuff, because it really is only "perl's bytes got larger", but old
code using koi8-r or binary data will just work.
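
The chr 2**30 case from above can be checked with a short sketch. The "no warnings 'utf8'" line is an assumption added so that newer perls stay quiet about the non-Unicode value; it is not needed for the point being made.

```perl
use strict;
use warnings;
no warnings 'utf8';      # keep newer perls quiet about non-Unicode values

my $ch = chr(2**30);     # 0x40000000, far above U+10FFFF

print "stored as one character\n" if length($ch) == 1;
print "value is preserved\n"      if ord($ch) == 2**30;
print "not a unicode codepoint\n" if ord($ch) > 0x10FFFF;
```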
> > typelessness is the defining aspect of perl (and many scripting languages
> > in fact).
> However, Perl is able to differentiate between subtypes of data, f.i.
> integer and float.
Neither can perl, the interpreter, nor can Perl, the language. If so, tell me
    $x = 1.0;
    $x += 0;
$x will now contain both an integer value and a double. Now tell me how to
decide which one is it, if even perl, the interpreter, cannot do so
(This relies very much on internals and versions of perl, but perl has no
problems with scalars that are both doubles *and* integers at the same
time, or scalars that are *both* integers, strings *and* doubles at the
The lesson to learn is that Perl really doesn't know. As an optimisation,
it might store integers in IVs, but it might store it in an NV just as
well and it will still be an integer. Earlier perl versions actually were
buggy in that, but those numerical bugs were (completely or at least
mostly) squashed in an effort by Nicholas, so Perl code has a hard time
knowing what perl really stores.
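
To look at what the interpreter currently has stored, one can peek at the flags with the core B module. This is a sketch under assumptions: slots() is a hypothetical helper written for this example, and which slots are reported differs between perl versions, so no specific output is claimed. Whatever the flags say, the value still behaves as both the number 1 and the string "1".

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: report which value slots perl marks as valid.
sub slots {
    my $flags = B::svref_2object(\$_[0])->FLAGS;
    my @set;
    push @set, "IV" if $flags & B::SVf_IOK();
    push @set, "NV" if $flags & B::SVf_NOK();
    push @set, "PV" if $flags & B::SVf_POK();
    return "@set";
}

my $x = 1.0;
print "after assignment: ", slots($x), "\n";
$x += 0;
print "after += 0:       ", slots($x), "\n";
my $str = "$x";                       # use it as a string as well
print "after stringify:  ", slots($x), "\n";
print "still the same value\n" if $x == 1 && $x eq "1";
```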
> And this distinction is *very* important. Not much to
If it is important, then tough game: perl itself cannot do it, and does not
do it, and sees no need to do it.
> the user, but to the Perl internally. (Otherwise we could just put
> everything into a quad-float and call it a day, likewise stuff every string
> into utf8 and go home :)
Perl might actually do something like that when your integers become very
large and your nv's can store them while your iv's cannot.
> > No, perl never needs to decide. The programmer always has to tell it
> > explicitly.
> (The following problem is strictly a performance (memory and CPU) problem,
> it should work "transparently" for the programmer - except the visible
> speed issues:)
> But if you have a string like "hello world", and compare it to an UTF-8
> string, then Perl *needs* to decide whether "hello world" is only 7 bit
> data, or not.
No. From a performance standpoint, detecting whether a string is 7 bit, then
doing an optimised compare is not clearly faster than e.g. converting while
comparing. It might be faster, or slower, and in any case I would not expect
much of a difference.
> The "solutions" have different problems in their own:
> * if you don't upgrade it, the data needs to examined again, later
> * if you upgrade it, other data that comes in contact with it needs to be
> * a new flag is well, hard
Well, as it is very fast to (naively) compare two UTF-X strings for exact
equality (or even codepoint-wise less than etc.), it might be very profitable
to upgrade your data, especially if you do other operations requiring UTF-X.
A flag will not help. Looking into the future or some very keen global
optimisation pass might improve things.
But the assumption that if you do high-character operations to your string
means that you will do a lot of them is a very sane one. It is likely
faster than trying to avoid it at all costs.
> The second solution (as it is currently implemented by my understanding)
> means that a very single "A" with the utf8-flag set might cause all your
> strings to get upgraded, eventually, in a chain like reaction. F.i.:
> $a = decode('ISO-8859-1','a');
> $b = "$a"; # upgraded, too
> $c = 'hello world' . $a; # too
> if ($d eq $a) # $d temp. upgraded
> print "hello $a"; # upgrades, then probably downgrades
> # for output
> and so on. Now imagine that $b, $c, and $d are long strings, and you can see
> that Perl will spend a considerable time in upgrading and downgrading
> strings, up to the point where all the conversions take more time than the
Except that perl will likely not downgrade in the above example, and
there will likely be much fewer upgrades than you expect. Consider the
above code in a loop, then you will have one upgrade of $d for the whole
loop. That's likely more efficient than doing a specialised comparison or
optimisation every time through the loop.
Disclaimer: my explanation of "the perl string model" as _I_ call it,
above, clashes with many other persons' explanation of the "perl unicode
model". _I_ found that keeping unicode separate from strings causes much
less confusion with people who do not want to operate on unicode, or learn
how to do it. For them, telling them you can just store more characters
in a character than in older perls, you only need to worry about unicode
when *you* tell perl to do so, end of story, is more workable than any
"yeah, latin1 vs. unicode, but there is also that UTF-8 flag you have to
take into account, sometimes, and be very careful about your regexes"
model that is so often repeated in that or a less drastic version, even in
That's my *personal* interpretation.
And current perls do *not* force unicode or latin1 interpretation on any
of your scalars (modulo bugs, excluding unpack "C", which is a different
kind of problem).
That's a fact, I believe.
The choice of a
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ firstname.lastname@example.org
--==---/ / _ \/ // /\ \/ / http://schmorp.de/
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE