develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Marc Lehmann
March 30, 2007 22:56
On Sat, Mar 31, 2007 at 03:53:25AM +0200, Juerd Waalboer <> wrote:
> Juerd Waalboer skribis 2007-03-30 21:53 (+0200):
> > Personally, I think that unpack with a byte-specific signature should
> > die, or at least warn, when its operand has the UTF8 flag set.
> I've since this post changed my mind, and think it should only warn if

We are making progress, and I would actually be content with that
solution, but it does break "U". The real solution is to treat "C" as
one octet, just as "n" is treated as two octets. That does not break
existing code and is what many perl programmers naturally expect.

Since so many people are confused about why the unpack change breaks code, I
will explain it differently:

   my $k = "\x10\x00";
   die unpack "n", $k;

This gives 4096: "n" is documented to consume exactly 16 bits, two octets.

I get 4096 regardless of how perl chooses to represent the string
internally: even if perl switched to UCS-4 (something that certainly
won't happen, but has been mentioned before to remind people that the
internal encoding can change), it would still work.
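A small sketch of that point; the utf8::upgrade call here is just a
convenient way to change the internal representation without changing
the string's contents:

```perl
use strict;
use warnings;

# "\x10" and "\x00" have the same value in either internal
# representation, so "n" sees the same two octets both times.
my $k = "\x10\x00";
printf "%d\n", unpack "n", $k;   # 4096

utf8::upgrade($k);               # flip only the internal encoding
printf "%d\n", unpack "n", $k;   # still 4096
```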

Same thing for "L", which is documented to be exactly 32 bit.

Now, when people want an 8-bit value followed by a 16-bit big-endian
value, they used "Cn" in the old days. In fact, they still use that, as
"C" has always been the octet companion to the 16-bit and 32-bit formats
sSlLnNvV etc.
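For example, parsing a hypothetical record with a one-octet type field
followed by a 16-bit big-endian length (the field layout is made up for
illustration):

```perl
use strict;
use warnings;

# Hypothetical binary record: type byte 0x05, then length 0x1000.
my $record = "\x05\x10\x00";
my ($type, $len) = unpack "Cn", $record;
printf "type=%d len=%d\n", $type, $len;   # type=5 len=4096
```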

However, in a weird stroke, somebody decided that "C" no longer gives
you a single octet of your string but, depending on an internal flag,
either that octet or one byte of the character's internal encoding.

So what was unpack "CCV" in perl 5.005 must be written as unpack
"UUV" in perl 5.8, because "U" now has the semantics needed to decode a
single octet out of a binary string.

That's weird, because now code that _doesn't_ want to deal with unicode
at all, and in fact only deals with binary data, must use this unicode
thingy "U", even though the documentation for "C" clearly says it's an
octet, and even says it's an octet as in the C language, which is exactly
what people decoding structures or network packets want.

That is the problem.

Now, I don't mind at all getting a die when trying "C" on a
byte=character that is >255 (i.e. not representable as an octet), or a
die when attempting "n" on a string of two such byte=characters.
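One defensive idiom along those lines (a sketch, not part of the
proposal): force the byte representation up front with utf8::downgrade,
which dies in exactly the >255 case described above:

```perl
use strict;
use warnings;

my $buf = "\x05\xff\x10\x00";
utf8::upgrade($buf);     # simulate data that picked up the UTF8 flag

# Back to the octet representation; this dies if any "character"
# in $buf is above 255, i.e. if it isn't really binary data.
utf8::downgrade($buf);

my ($type, $byte, $len) = unpack "CCn", $buf;
printf "type=%d byte=%d len=%d\n", $type, $byte, $len;
```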

I personally dislike the warning, because the warning only ever comes up
when there is a bug. It doesn't matter much to me personally, though.

What matters to me is that binary-only code now needs to use "U" where
"C" was formerly meant, to get correct behaviour. This *needs* to be fixed.

                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __
      --==---/ / _ \/ // /\ \/ /
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
