Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Marc Lehmann
March 30, 2007 16:53
On Sat, Mar 31, 2007 at 01:27:14AM +0200, Juerd Waalboer wrote:
> If a downgrade is "needed", it means that your byte string was
> accidentally upgraded. This should only happen if you mix it with a text
> string. If it happens without mixing it with a text string, that is a
> bug. Please report.

That's extremely far from reality. Lots of things can cause a text string
to be upgraded. Forcing people to learn all that is just stupid when you
could just make it work logically without telling people about internals
(note that the internals come into play by your peculiar definition of
"text strings" having the UTF-X bit set, which isn't reality and in my
opinion is an extremely stupid limitation that 96% of perl does not

> Instead: "your code is broken, don't mix text strings with byte strings"
> or "it is a bug in perl that your string got upgraded in the first
> place."

See my json example. Nothing gets mixed.

> > Exactly. But "C" somehow works on UTF-8, while it shouldn't. 
> Agreed!
> Things that specifically handle bytes, and bytes only, should DIE (or at
> least warn) when used with a string that has the UTF-8 flag on.

So you force people to know about the internal flag, otherwise they cannot
avoid the die.

This completely contradicts your claim that you want to abstract the UTF-X
flag away from the Perl level.

> still lets users get away with naively assuming that byte == character
> for latin1 strings, as designed, but at least catches the cases when you
> know that the user does something stupid.

But the user does not do anything stupid when feeding binary strings (my
definition, indices 0..255) into Compress::Zlib. It is only your request
for a die that makes problems. Zlib would work just fine if perl gave
downgraded data to perl and XS code that wants it.
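
To make "gave downgraded data to code that wants it" concrete, here is a
minimal sketch. The helper name octets_for_xs is made up for this example,
and it only shows how a wrapper could behave; it is not what Compress::Zlib
or perl itself actually does.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): return a copy of a binary
# string (all indices 0..255) in the downgraded, one-octet-per-character form.
sub octets_for_xs {
    my ($str) = @_;
    my $copy = $str;
    # With a true second argument, utf8::downgrade returns false instead of
    # dying when the string contains a character above 255.
    utf8::downgrade($copy, 1)
        or die "not a binary string: contains a character > 255\n";
    return $copy;
}

my $binary   = "\x00\xfc\xff";
my $upgraded = $binary;
utf8::upgrade($upgraded);      # same characters, different internal encoding

my $for_xs = octets_for_xs($upgraded);
print "same string\n" if $for_xs eq $binary;
print "flag is off\n" unless utf8::is_utf8($for_xs);
```

The caller never has to look at the flag; only the code that needs raw
octets asks for them, and it fails only when a character really is > 255.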

> > It should work on characters, as documented (just like in C, char
> > array[]; array[i] is one character, regardless of how many bits a
> > character in C has, or how it is encoded).
> A C "char" is a byte, not a multibyte character, ever.

Exactly. The same as in Perl, I would assume: Perl uses characters to
store bytes; it doesn't use multibyte characters on the Perl level.

Hope you get it this time :)

> Besides that, the "C" in Perl's pack() is documented as a single byte.

"A C "char" is a byte".

Your words.

But here you say a byte is not a character. That's a contradiction.

You are deeply confusing the internal encoding Perl uses (which might be
single octets per character, or UTF-X encoded octets per character)
with the language proper.

In C, a single byte is a character, even if it happens to have a value
higher than 255. Very few compilers allow that, as a byte is usually an
octet, although it is common on DSPs to have 32 bit bytes.

Even if Perl encoded a single character into multiple C bytes/octets, that
does not mean it is more than a single character.

The documentation is completely contradictory when it comes to "C" and can
easily be interpreted to mean a single character in the C sense.

Fact is, "even under Unicode" it doesn't work as advertised, because Unicode
can be internally represented in multiple ways in Perl.
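
A minimal sketch of what "represented in multiple ways" means, using only
the utf8:: functions and bytes::length. The variable names are mine; the
point is that both strings are the same single character at the Perl level.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

my $narrow = chr 252;
utf8::downgrade($narrow);      # stored internally as the single octet 0xFC
my $wide   = chr 252;
utf8::upgrade($wide);          # stored internally as the two octets 0xC3 0xBC

printf "equal: %s\n", ($narrow eq $wide ? "yes" : "no");
printf "characters: %d vs %d\n", length($narrow), length($wide);
printf "internal octets: %d vs %d\n",
    bytes::length($narrow), bytes::length($wide);
```

Anything that looks at the internal octets instead of the characters can
give different answers for two strings that compare equal.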

> I think that "char value" should be either removed from perlfunc, or
> explained in more detail. It's NOT OBVIOUS to those who don't know C.

To those who do know C it has perfectly clear meaning, namely a single

> The earlier Perl versions didn't support character values greater than
> 255, and if you never have those characters, C still works perfectly.

Nothing in C limits you to 256 characters. A byte in C is exactly a
character. It can store at least 256 different values, but nothing in C
limits you to that, many compilers use larger bytes. And the same is true
in Perl: Perl only supported bytes 0..255 in earlier versions, and now the
perl byte can be up to 64 bits (or maybe a bit less, I forgot).

> But yes, if you're dealing with characters and want your program to be
> able to handle those fancy new >255 characters, you should change that C
> to a U.

I do not want to handle those fancy >255 characters. I only want to handle
a single octet. But unpack doesn't do that.

In fact, that's the problem: all old code that uses unpack "C" would need
to be changed to use "U". That's the compatibility breakage I was talking
about. Code that uses "C" expects the single-octet meaning from perl
5.005; it does not expect the "sometimes returns half of a utf-x encoded
character, sometimes not" meaning it has in current perls.

It is especially weird as it suddenly has become incompatible with regards
to the other template characters such as "n", which correctly decode
bytes regardless of internal encoding.
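
The comparison can be run like this. It is only a sketch with a made-up
three-octet packet, and I deliberately give no expected output for the
upgraded case, since that is exactly the part that depends on the perl you
run it with.

```perl
use strict;
use warnings;

my $packet   = "\xfc\x01\x02";     # one octet, then a 16-bit big-endian value
my $upgraded = $packet;
utf8::upgrade($upgraded);          # same string, UTF-X encoded internally

# Unpack both copies with the same template and print the results side by side.
for my $case ([downgraded => $packet], [upgraded => $upgraded]) {
    my ($label, $str) = @$case;
    my ($c, $n) = unpack "Cn", $str;
    printf "%-10s C=%d n=%d\n", $label, $c, $n;
}
```

Both strings compare equal with eq, so old code has every reason to expect
the two lines to be identical.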

> > Besides, perl 5.8 does not follow that description:
> >    perl -e '$x = "\xc3\xbc"; die unpack "U*", $x'
> > This gives me 195188, two characters, although it is a single UTF-8
> > character, so why does it wrongly give me two? $x certainly is utf-8-encoded
> > (try Encode::encode_utf8 chr 252, it results in the above string).
> You asked for the codepoints U+00C3 and U+00BC, and got them.

No, I asked for UTF-8 encoded characters. Again, read the documentation:

          *       If the pattern begins with a "U", the resulting string will
          *       be treated as UTF-8-encoded Unicode.

That's for pack, unfortunately.

          U   A Unicode character number.  Encodes to UTF-8

Uh, that internal thing again. So how many characters will pack "U", 200
give me? According to the documentation, 2, as UTF-8 requires that. That
is not what happens, though.
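
This is easy to check. A minimal sketch, assuming the Encode module is
available; it counts the characters of the packed string at the Perl level
and, separately, the octets you get when you explicitly encode it as UTF-8.

```perl
use strict;
use warnings;
use Encode ();

my $packed = pack "U", 200;

# What Perl code sees: the string itself.
printf "characters: %d\n", length($packed);
printf "code point: %d\n", ord($packed);

# What an explicit UTF-8 encoding of that string looks like.
printf "octets when encoded as UTF-8: %d\n",
    length(Encode::encode_utf8($packed));
```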

That's the problem. Perfectly working code using unpack "CN" suddenly
stops working because "N" works on bytes, while "C" works on the internal
encoding, regardless of what that might be.

> It's a UTF-8 encoded byte string, alright, but "U" is for Unicode, not
> UTF-8.

You can store Unicode in UTF-8. If you say "UTF-8 encoded Unicode" then you
very well have UTF-8, even though it still is Unicode.
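
For the string from my example, a small sketch with Encode (variable names
are mine) shows both views: two octets of UTF-8 on one side, one Unicode
character on the other.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xc3\xbc";                       # UTF-8 encoding of U+00FC
my $text   = Encode::decode_utf8($octets);     # the character it encodes

printf "octets: %d, characters: %d, code point: U+%04X\n",
    length($octets), length($text), ord($text);

# Round trip: encoding the character again yields the original octets.
print "round trip ok\n"   if Encode::encode_utf8($text) eq $octets;
print "matches chr 252\n" if Encode::encode_utf8(chr 252) eq $octets;
```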

> > Ok, so I will tell people to replace "C" by "U" in their code then.
> If they do Unicode text strings, that's indeed very good advice.

Unfortunately, that's what they have to do when dealing with binary
strings, as C doesn't work on them.

> But you still want C for byte strings, simply because some protocols or
> formats expect a byte value. :)

Exactly. And then I have to use "U" to get it. Because a byte in perl is a
character. Is and always has been, just as in C.

And to get those bytes for use in such protocols you have to use "U" now,
instead of "C" as in earlier versions.

> > Right, while the documentation on unpack "U" disagrees with it, as it talks
> > about UTF-8.
> That would be a bug, but I can't find it in my copy (5.8.8). It only
> says "Encodes to UTF-8 internally" for pack(), which as far as I can
> tell, is true.

So it talks about using UTF-8, so, according to you, it is a bug. Fine
with me.

                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __
      --==---/ / _ \/ // /\ \/ /
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
