Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
From: Marc Lehmann
March 30, 2007 16:34
Message ID: 20070330233355.GG18872@schmorp.de
On Sat, Mar 31, 2007 at 01:03:35AM +0200, Juerd Waalboer <email@example.com> wrote:
> I maintain a short list of some modules at
> http://juerd.nl/perluniadvice. If you encounter modules that I can test
> easily without setting up complete environments, please let me know!
> Compress::Zlib sounds like it uses zlib, which compresses byte streams.
> i.e. don't give it unicode strings, because unicode strings have no
> bytes (the bytes are internal only, but you don't know what encoding is
> used there). Encode explicitly.
The difference between us, and that's what it boils down to, is that you give
the internal UTF-X bit meaning. You equate UTF-X flag set == Unicode string.
To me, a unicode string is a concept outside of perl. I would consider any
text string using the unicode codepoints a unicode string. For example:
"hallo" is a unicode string. And any binary string is not a unicode string.
The problem with your approach is that you have to expose the UTF-X flag
to users. Which comes with a lot of problems.
Please note that in the actual problem, nobody is passing unicode to
compress::zlib. Instead, a binary string is passed to Compress::Zlib that
happens to be UTF-X encoded internally because it was transferred using a
protocol that encodes bytes as UTF-8 (namely JSON), and the decoder opted
not to make another copy of the data for speed reasons.
Compress::Zlib is not buggy. Neither is the caller. The bug is that unpack
treats the same string differently depending on an internal flag that might
be set for a variety of reasons outside the programmer's control.
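
A minimal sketch of how one might check this on a given perl. The payload
is hypothetical, and utf8::upgrade merely stands in for a decoder that
left the string UTF-X encoded internally. Whether the two unpack results
match depends on the perl in use, so no expected output is shown:

```perl
use strict;
use warnings;

# Hypothetical payload: three octets, two of them above 127.
my $plain = "\x01\xe9\xff";

# Same characters, but force the internal UTF-X representation.
my $upgraded = $plain;
utf8::upgrade($upgraded);

# At the Perl level the two strings compare equal.
print "eq: ", ($plain eq $upgraded ? "yes" : "no"), "\n";

# Compare what unpack returns for each internal representation.
my @from_plain    = unpack("C*", $plain);
my @from_upgraded = unpack("C*", $upgraded);
print "plain:    @from_plain\n";
print "upgraded: @from_upgraded\n";
```

If the last two lines differ, unpack is looking at the internal encoding
rather than at the characters.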
Initially I thought you, too, wanted a unicode model where the UTF-X bit is
not exposed to the perl level. But in fact the opposite is true: you force
knowledge of the UTF-X bit on users, even though it should be transparent.
That's the problem. As long as you call UTF-X-encoded strings Unicode
strings and something else byte strings and try to give them meaning
the programmer has to know about it, as functions behave semantically
differently depending on that flag.
All I want is a perl that behaves semantically consistently, regardless
of some internal flag that is documented not to be of concern to a Perl
> Their code is probably broken because they mix text strings with byte
> strings. This can be solved most easily by explicitly encoding your text
> string as soon as you feel you must join it with a byte string. The
> joined string as a byte string. Decoding it to make a text string may or
> may not make sense, depending on the data format.
use Encode;
my $bytestring = "zlib-encoded string";
my $transfer = Encode::encode_utf8($bytestring);
my $bytes = Encode::decode_utf8($transfer);
$bytes is the same string, but depending on implementation details of
Perl, it is treated differently in different contexts: sometimes it is
treated like the binary string it is, sometimes it is treated as if it were
utf-8 encoded, which it isn't, as I decoded it.
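
To make the round trip concrete, here is a small sketch. The four-octet
payload is a made-up stand-in for real zlib output. It only shows that the
string survives encode/decode unchanged at the Perl level, and prints the
internal flag of both scalars so they can be compared; what the flag ends
up being is exactly the implementation detail in question:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical binary payload standing in for zlib output.
my $bytestring = "\x78\x9c\x00\xff";

my $transfer = Encode::encode_utf8($bytestring);  # octets for the wire
my $bytes    = Encode::decode_utf8($transfer);    # back to the original characters

# Same characters at the Perl level ...
print "same string: ", ($bytes eq $bytestring ? "yes" : "no"), "\n";
print "same length: ", (length($bytes) == length($bytestring) ? "yes" : "no"), "\n";

# ... but the internal flag may differ between the two scalars.
print "flag before: ", (utf8::is_utf8($bytestring) ? 1 : 0), "\n";
print "flag after:  ", (utf8::is_utf8($bytes) ? 1 : 0), "\n";
```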
> > I find "text strings" and "byte strings" not adequate either, as Perl
> > makes no difference between those two concepts (being typeless)
> Indeed. Programmers have to track this themselves. Sometimes that sucks,
> but in my experience, you need to know what kind of data your variable
> contains anyway.
The problem is you want them to track the UTF-X flag in addition to that.
Because putting a "byte string" into unpack should not work if that bit
happens to be set. So you force people who want to use unpack to learn about
that flag, when it is set, when they have to downgrade etc. etc.
> If you ++ a reference, you're in for trouble too. How come that's never
> been a problem?
Because perl treats it consistently.
> It's just that this is something you haven't needed to know before, so
> you're not /trained/ yet to think about it. But you can't go from 256
> characters to several thousands without changing the way you think :)
Yes. That's not a problem, I understand unicode quite well, and I understand
quite well how Perl stores unicode.
What the problem is is that I separate internal encoding (unicode can be
encoded both in UTF-X as well as in octets, as can byte strings) from the
unicode model in Perl, while you mix them together, forcing the user to
know their UTF-X bits on their scalars in addition to tracking whether they
are binary or not.
> > they do not map well to encoded/decoded text either
> Oh, but they do. Please read perlunitut, which tries to redefine the
> universe into four important definitions (and succeeds).
I do not have that manpage.
> 1. Byte strings (aka binary strings)
> 2. Text strings (aka unicode strings or "internal format" strings)
> 3. Decoding is byte --> text
> 4. Encoding is text --> byte
That doesn't reflect reality, of course, if it were so.
However, those four definitions, as I said, do not map well to
encoded/decoded text. Because "internal format" strings can store binary
data just as well, and often do.
I am talking purely about the perl level strings. If perlunitut confused
the issue by talking about internal encoding it completely failed its
> I don't get the causal connection you're illustrating.
> utf8::encode takes any text string (or unicode string, if you prefer
> that term) and turns it into a UTF-8 encoded byte string in place.
No. It converts characters to UTF-X encoded octets. Whether my characters
are bytes or not is of no consequence.
> Note that whenever a string has an encoding attach to it, conceptually,
> it's automatically a byte string.
Yes. And that encoding is completely independent of the internal UTF-X
flag. Or should be, but isn't, in current perls.
> Text strings don't have encodings,
> because encodings are a byte thing, and text strings don't have bytes;
> they have characters. (Text strings have encodings and bytes
Perl doesn't know about that. It only knows about characters. The problem
is that some parts of perl make a difference between the very same string,
depending on how it is encoded internally, _even if the encoding is the
same on the Perl level_.
> /internally/, just like numbers do have bytes /internally/, encoded in
> one way or another, that allows values greater than 255 or less than 0.)
Exactly. But nothing in perl forces those indices to be unicode characters.
Certainly not the indices 0..255. Yet still, the UTF-X flag might be set or
cleared, resulting in changes in interpretation.
I want those to go away and make perl treat my binary data as binary data,
regardless of how the interpreter treats them.
> utf8::encode is a text operation. It will assume that whatever you give
> it, is a text string. Its characters are considered Unicode codepoints.
Where does it say so?
> You shouldn't give it a byte string.
Please leave it up to me what I should or should not do. This whole
discussion of what I should or should not do is completely beside the
The point is that Perl treats my strings the same in utf8::encode, regardless
of how the UTF-X flag is set, because upgrading or downgrading does not
change the semantics of my characters.
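
A short sketch of that claim, using a made-up four-character string. One
copy is left as it is, the other is upgraded to UTF-X internally, and both
are then passed through utf8::encode:

```perl
use strict;
use warnings;

my $one = "caf\xe9";     # four characters, all below 256
my $two = $one;
utf8::upgrade($two);     # same characters, UTF-X encoded internally

# Encode both copies in place.
utf8::encode($one);
utf8::encode($two);

print "identical: ", ($one eq $two ? "yes" : "no"), "\n";
print "octets:    ", join(" ", map { sprintf "%02x", ord } split //, $one), "\n";
```

Both copies end up as the same octets, whatever the internal flag was
beforehand.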
But in unpack, it does. That's the problem. Not what I should or should not
do. The problem is that giving unpack a binary string makes it return garbage
sometimes (if the binary string happens to be encoded internally in UTF-X).
This whole "force the user to track the UTF-X bit" approach is useless. If you
really want that, then go back to 5.005_5x, which forces you to track
your UTF-8 on your own. The whole point of the big change in 5.6 was that
programmers should not care about how perl internally encodes stuff, and I
certainly do not want to give this up. That's what makes perl so good.
> To understand what happens if you do give utf8::encode a byte string,
A byte string is a string containing only octets, that is, values between 0
Without knowing any internals, utf8::encode will encode it into a UTF-8
> you need to know some internals.
Wrong. I need know no internals, the result is always well-defined: put
characters into utf8::encode, and get utf-8-encoded characters. No need for
internals knowledge, regardless of whether my characters are 0..255 or some of
them happen to be larger. Perl doesn't care, nor does UTF-8 care, nor do I
The problem is, perl cares in unpack, and when handing strings over to XS
> That makes no sense, because UTF-8 is a means of representing
> characters. Byte strings consist of bytes, not characters.
Not in C, which is what the documentation constantly refers to, mind
you. And no, a byte always has been a character. It is the very definition
of byte in C, regardless of how many bits it has. And the same is true in
perl: a single byte is represented by a single character, having an index
no higher than 255.
> > (or my programs either). It might be a good and simplified advice to a
> > beginner
> The theory is very simple, but not simplified. It just isn't any harder.
It doesn't map to reality.
> I'm sorry if you want a more complex programming tool. But apparently
> you have found ways to make it hard for yourself already :)
Just stop your ad-hominem, please. I told you before that I find it
rather easy, but users of my module find it rather hard, for example. I
worked around a lot of bugs in 5.6 easily, and can slap an occasional
utf8::up/downgrade into my code. But I think it's simply wrong to force
every programmer to know as much about the internals as I do.
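
For reference, the kind of workaround meant here might look like the
sketch below. The helper name as_octets is made up for illustration, and
utf8::upgrade again only simulates a decoder that left the flag set:

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy that is stored as plain octets,
# or die if the string holds characters above 255.
sub as_octets {
    my ($string) = @_;
    utf8::downgrade($string, 1)
        or die "string contains characters above 255\n";
    return $string;
}

my $data = "\x00\x80\xff";
utf8::upgrade($data);    # simulate a decoder that left the flag set

my $octets = as_octets($data);
print "same characters: ", ($octets eq $data ? "yes" : "no"), "\n";
print "flag now:        ", (utf8::is_utf8($octets) ? 1 : 0), "\n";
```

It works, but every caller of a byte-oriented function has to know that
such a step may be needed.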
> > The perl unicode model is rather simple, but leaves you in control,
> > and I found teaching people about how perl just allows more than
> > 0..255 for a character index works best (although people differ).
> That's a great explanation of how unicode strings work.
You think so? Then why do you want to force people to know about how
128..255 is encoded internally then? Because you do when you say that UTF-X
always means text (which is not true in reality, mind you), and you want
unpack to fail on binary strings that happen to be UTF-X encoded?
> we never did all use exactly the same encoding. We've just chosen to
> remain ignorant all this time. Explicit re-encoding, or decoding and
> encoding has been necessary all this time. It's just that with more than
> 256 codepoints, it became much more apparent :)
Right. But at least when dealing with decoded stuff (such as binary data),
Perl should behave consistently and correctly, but it doesn't.
The choice of a
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ firstname.lastname@example.org
--==---/ / _ \/ // /\ \/ / http://schmorp.de/
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE