develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Juerd Waalboer
March 31, 2007 02:49
Marc Lehmann wrote on 2007-03-31 at 7:55 (+0200):
> > > Personally, I think that unpack with a byte-specific signature should
> > > die, or at least warn, when its operand has the UTF8 flag set.
> > I've since this post changed my mind, and think it should only warn if
> We are making progress, and I would actually be content with that
> solution, but it does break "U".

No, U does not break, because it's not in my list of byte-specific
(un)pack templates. U is for Unicode characters.

> The solution, really, is to treat C like
> an octet in the same way "n" is treated like two octets. 

It does that, but we have very different understandings of the word
"octet", and my hands hurt, so I'm not going through it all again.

> Since so many people are confused about why the unpack change breaks code, I
> will explain it differently:
>    my $k = "\x10\x00";
>    die unpack "n", $k;
> this gives me 4096. "n" is documented to take exactly 16 bits, two octets.

juerd@lanova:~$ perl -le'print unpack "n", "\x{20ac}"'

"\x{20ac}" is one character, but "n" works on octets, not characters.
This uses the internal buffer without warning, and picks the first two
octets of the three-octet sequence e2 82 ac. This octet sequence should
be hidden from the programmer, but it is too late for that. So instead,
let's warn the programmer that what's going on is very probably not what
they intended.

juerd@lanova:~$ perl -le'print unpack "n", "\xe2\x82"'

The annoying thing, for people who don't know when Perl upgrades
strings, is when you start with a nice 2-octet byte string and it gets
upgraded somewhere. Here, forced for illustration, and using the same
2-octet sequence so that the difference in results is obvious:

juerd@lanova:~$ perl -le'$foo = "\xe2\x82"; utf8::upgrade($foo); print unpack "n", $foo'

A warning about the wide characters here would be in order and save
people's butts.

> I get 4096 regardless of how perl chooses to represent it internally

Because Perl always uses latin1 or utf8 internally, in both of which
\x10 and \x00 are octets 0x10 and 0x00 respectively. 

> If perl goes to using UCS-4 (something that won't happen for sure, but
> has been stated before to remind people that internal encoding can
> change), it would still work.

Not as far as I can tell, because Perl uses the raw octets of the
internal encoding whenever you do byte-specific operations, and the
internal encoding for U+0010 and U+0000 changes when you go from UTF-8
to UCS-4.

That's why it's so darn useful to use latin1 when possible, because you
can then be pretty sure that "\x10\x00" will be the two octets you
expect. (Note that breaking this is the main breakage caused by implicit
upgrades.)

> However, in a weird stroke, somebody decided that "C" no longer gives
> you a single octet of your string, but, depending on internal encoding,
> depending on an internal flag, part of that octet or the octet.

What you call "octet", I call "character". And I'll never call that
"octet" or "byte" because then none of the documentation about all this
would still be right, and Perl would suddenly indeed be broken.

If you insist on calling the value of "\x{20ac}" a single octet, then
indeed pack/unpack will not do what you want, because what you want is
just not how it works.

"\x{20ac}" is one character. Internally, represented by three octets.
The internal representation is used if you unpack with byte-specific
templates like "C" or "n".

Byte strings, i.e. strings with no character values >255 that have never
been in contact with UTF-8 encoded strings, may be interpreted as latin1
and internally converted to UTF-8 when you join them with text strings.
This causes unpack to see very different values, and that's one of the
reasons one should avoid mixing byte strings and text strings.

Note that my definition of "text string" excludes byte encoded strings,
such as the results of encode() or utf8::encode().

> Now, what has been unpack "CCV" in perl 5.005 must be written as unpack
> "UUV" in perl 5.8, as "U" has the right semantics for decoding a single
> octet out of a binary string.
> That's weird

Weird only because you choose to use a different meaning of the word
"octet" than much of the rest of the world.

> Now, I don't mind at all if I get a die when trying "C" on a
> byte=character that is >255 (i.e. not representable as an octet).

Just so other people know: since Perl has had Unicode support, there has
been a consistent effort to teach people that character != byte, and
that a single character may consist of several bytes.

In fact, this effort has been present in larger parts of computing than
just Perl, but for clarity's sake, I'm sticking to Perl because
sometimes Perl's definitions differ. (For example, in Perl, a character
is a single code point, while in Unicode, a character can be composed
out of several combining code points.)

Also, values greater than 255 do not fit in a single byte, according to
the computer-science convention that byte==octet==8 bits. 8 bits simply
hold only 2**8==256 values. Hence the need for a distinction between
bytes, and things that *are* able to hold larger values.

> I personally dislike the warning, because the warning only ever comes up
> when there is a bug.

I love warnings that only ever come up when I have a bug. In fact, I
generally dislike warnings that don't follow that pattern.
Kind regards,

  juerd waalboer:  perl hacker  <>  <>
  convolution:     ict solutions and consultancy <>

I do not trust voting computers.
See <>.
