perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Juerd Waalboer
March 30, 2007 16:27
Marc Lehmann wrote 2007-03-31 0:41 (+0200):
> The reason I wanna know is because I want to know what to tell
> people. Either it is "your code is broken, unpack "C" without downgrade
> is a bug in your code" or "it is a bug in perl, you can work around by
> enabling ->shrink for the time being".

If a downgrade is "needed", it means that your byte string was
accidentally upgraded. This should only happen if you mix it with a text
string. If it happens without mixing it with a text string, that is a
bug. Please report.
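To make "accidentally upgraded" concrete, here is a minimal sketch of a byte string picking up the UTF-8 flag by being mixed with a text string. The variable names are made up for illustration; only the core utf8:: functions are used.

```perl
use strict;
use warnings;

# A byte string: one octet, 0xFC, with no UTF-8 flag.
my $bytes = "\xfc";

# A text string that needs the UTF-8 flag (U+263A is above 255).
my $text = "\x{263a}";

# Mixing the two upgrades the result. The value 0xFC is still one
# character, but the string as a whole now carries the flag.
my $mixed = $bytes . $text;

printf "bytes flagged: %s\n", utf8::is_utf8($bytes) ? "yes" : "no";
printf "mixed flagged: %s\n", utf8::is_utf8($mixed) ? "yes" : "no";
printf "mixed length:  %d\n", length $mixed;

# Taking the first character back out and downgrading it recovers the
# original byte string. utf8::downgrade would die here if the character
# were above 255.
my $first = substr $mixed, 0, 1;
utf8::downgrade($first);
printf "round trip ok: %s\n", $first eq $bytes ? "yes" : "no";
```

If the flag shows up on a string that was never mixed with text, that is the case worth reporting.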

So, neither "your code is broken, unpack "C" without downgrade is a bug
in your code" nor "it is a bug in perl".

Instead: "your code is broken, don't mix text strings with byte strings"
or "it is a bug in perl that your string got upgraded in the first place".

> Exactly. But "C" somehow works on UTF-8, while it shouldn't. 


Things that specifically handle bytes, and bytes only, should DIE (or at
least warn) when used with a string that has the UTF-8 flag on. This
still lets users get away with naively assuming that byte == character
for latin1 strings, as designed, but at least catches the cases when you
know that the user does something stupid.
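As a sketch of what "die when used with a flagged string" could look like in user code: unpack_octets below is a hypothetical helper, not something Perl provides, and Perl's own unpack does not behave this way.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper, not part of Perl: unpack octets, but refuse any
# string that carries the UTF-8 flag instead of silently guessing.
sub unpack_octets {
    my ($string) = @_;
    croak "unpack_octets: string has the UTF-8 flag on"
        if utf8::is_utf8($string);
    return unpack "C*", $string;
}

# A plain byte string passes through.
my @ok = unpack_octets("\x01\xff");
print "octets: @ok\n";

# A text string (flag on, because of U+263A) is rejected.
my $text = "caf\x{e9}\x{263a}";
my $died = !eval { unpack_octets($text); 1 };
print "text string rejected: ", ($died ? "yes" : "no"), "\n";
```

Latin-1 strings without the flag still pass, so the naive byte == character assumption keeps working for them.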

> It should work on characters, as documented (just like in C, char
> array[]; array[i] is one character, regardless of how many bits a
> character in C has, or how it is encoded).

A C "char" is a byte, not a multibyte character, ever.

Besides that, the "C" in Perl's pack() is documented as a single byte.

I think that "char value" should be either removed from perlfunc, or
explained in more detail. It's NOT OBVIOUS to those who don't know C.

> > * The chr and ord functions work on characters
> >     chr(1).chr(20).chr(300).chr(4000) eq v1.20.300.4000
> >   In other words, chr and ord are like pack("U") and unpack("U"), not like
> >   pack("C") and unpack("C"). In fact, the latter two are how you now emulate
> >   byte-orientated chr and ord if you're too lazy to use bytes.
> So due to that documentation insanity it is now suggested that all code that
> used "C" before must use "U" now to get the same effect as in earlier perl
> versions?

The earlier Perl versions didn't support character values greater than
255, and if you never have those characters, C still works perfectly.

But yes, if you're dealing with characters and want your program to be
able to handle those fancy new >255 characters, you should change that C
to a U.
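A small sketch of the character view, using the same values as the documentation excerpt quoted above. Nothing here is specific to any module; the variable names are arbitrary.

```perl
use strict;
use warnings;

# Four characters, two of them above 255.
my $s = chr(1) . chr(20) . chr(300) . chr(4000);

# chr/ord see characters, not octets.
my @via_ord = map { ord } split //, $s;
print "length:  ", length($s), "\n";
print "via ord: @via_ord\n";

# pack "U" builds the same string from code points, and unpack "U"
# gives the code points back.
my $packed = pack "U*", 1, 20, 300, 4000;
my @via_unpack = unpack "U*", $packed;
print "same string: ", ($packed eq $s ? "yes" : "no"), "\n";
print "via unpack:  @via_unpack\n";
```

With "C" in place of "U", the values 300 and 4000 could not be represented, since they do not fit in one byte.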

> Besides, perl 5.8 does not follow that description:
>    perl -e '$x = "\xc3\xbc"; die unpack "U*", $x'
> This gives me 195188, two characters, although it is a single UTF-8
> character, so why does it wrongly give me two? $x certainly is utf-8-encoded
> (try Encode::encode_utf8 chr 252, it results in the above string).

You asked for the codepoints U+00C3 and U+00BC, and got them.

It's a UTF-8 encoded byte string, alright, but "U" is for Unicode, not
UTF-8.

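The difference between the encoded byte string and the character it encodes can be sketched with Encode. This deliberately looks at each character with ord instead of unpack "U*", on the assumption that unpack "U*" on an unflagged string may not behave identically in every Perl version.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# One character, U+00FC.
my $char = chr 252;

# Encoding gives a byte string of two octets, 0xC3 0xBC.
my $encoded = encode_utf8($char);
my @octets  = map { ord } split //, $encoded;
print "encoded length: ", length($encoded), "\n";
print "encoded values: @octets\n";

# To Perl this byte string is just two characters, U+00C3 and U+00BC.
# Only an explicit decode turns it back into the single character.
my $decoded = decode_utf8($encoded);
print "decoded length: ", length($decoded), "\n";
print "decoded value:  ", ord($decoded), "\n";
```

So a string that happens to contain UTF-8 has to be decoded first if you want the code point 252 out of it.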
> Ok, so I will tell people to replace "C" by "U" in their code then.

If they do Unicode text strings, that's indeed very good advice.

But you still want C for byte strings, simply because some protocols or
formats expect a byte value. :)
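For the byte string case, a minimal sketch with a made-up frame layout (one type byte, one length byte, then the payload). The layout is invented for illustration and does not belong to any real protocol.

```perl
use strict;
use warnings;

# Hypothetical wire format: type octet, length octet, payload octets.
my $payload = "hi";
my $frame   = pack "C C a*", 0x01, length($payload), $payload;

# Reading it back: "C" yields the byte values, "a*" the rest.
my ($type, $len, $body) = unpack "C C a*", $frame;

print "frame length: ", length($frame), "\n";
print "type=$type len=$len body=$body\n";
```

Everything in $frame is an octet, so "C" is the right template here and "U" would be wrong.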

> Right, while the documentation on unpack "U" disagrees with it, as it talks
> about UTF-8.

That would be a bug, but I can't find it in my copy (5.8.8). It only
says "Encodes to UTF-8 internally" for pack(), which as far as I can
tell, is true.
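What "encodes to UTF-8 internally" means for pack("U") can be checked with a short sketch. bytes::length is used here only to peek at the internal representation, which is normally none of your business.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling the pragma

# One character, code point 252, built with "U".
my $s = pack "U", 252;

print "characters:     ", length($s), "\n";
print "code point:     ", ord($s), "\n";
print "UTF-8 flag:     ", (utf8::is_utf8($s) ? "on" : "off"), "\n";
print "internal bytes: ", bytes::length($s), "\n";
```

The string is one character with value 252 as far as the program is concerned; the two internal bytes are an implementation detail.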
Kind regards,

  juerd waalboer:  perl hacker
  convolution:     ict solutions and consultancy

I don't trust voting computers.
