perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Ben Carter
March 31, 2007 03:16
On Sat, Mar 31, 2007 at 01:53:48AM +0200, Marc Lehmann wrote:
> In C, a single byte is a character, even if it happens to have a value
> higher than 255 (although very few compilers allow that, usually, a byte
> is an octet, although it is common on DSPs to have 32 bit bytes).
> Even if Perl encoded a single character into multiple C bytes/octets, that
> does not mean it's more than a single character.
> The documentation is completely contradictory when it comes to "C" and can
> easily be interpreted to mean a single character in the C sense.
> Fact is "even under Unicode" it doesn't work as advertised, because Unicode
> can be internally represented in multiple ways in Perl.
> > I think that "char value" should be either removed from perlfunc, or
> > explained in more detail. It's NOT OBVIOUS to those who don't know C.
> To those who do know C it has perfectly clear meaning, namely a single
> character.

But that is not really relevant to the discussion.

Communication is difficult if you cannot express clearly what you are
trying to say.  Terminology is important to get correct, and it is easy
to confuse others or yourself if you are not precise when you need to be.

Unicode does not even HAVE characters, it has codepoints.  This did not
happen by accident and is an important distinction to make.

  $x = "ABCD";
  $x = "\x41\x42\x43\x44";
  $x = chr(65) . chr(66) . chr(67) . chr(68);
  $x = pack("C*", 65, 66, 67, 68);

All of these put the same data into $x. [1]  We can reasonably assume
that $x contains a sequence of 4 bytes, each 8 bits wide.  We do not
know anything about what $x is, if it has an encoding, if it is actually
the output of pack "V", or maybe it came after "HTTP/1.1 GET ".  The
only reasonable thing to assume is that it is just a sequence of octets,
aka binary data.
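
A minimal sketch of the claim above, assuming an ASCII platform (so
the EBCDIC caveat in [1] still applies).  The names @forms, $same and
$flagged are illustrative only.  It checks that the four spellings
give the same four octets and that none of them carries the internal
flag:

```perl
use strict;
use warnings;

# Four spellings of the same four octets (ASCII platform assumed).
my @forms = (
    "ABCD",
    "\x41\x42\x43\x44",
    chr(65) . chr(66) . chr(67) . chr(68),
    pack("C*", 65, 66, 67, 68),
);

# How many forms are identical to the first one?
my $same = grep { $_ eq $forms[0] } @forms;

# How many forms carry the internal flag?  Every value fits in one
# octet, so none of them should.
my $flagged = grep { utf8::is_utf8($_) } @forms;

printf "%d of %d identical, length %d, %d flagged\n",
    $same, scalar(@forms), length($forms[0]), $flagged;
# prints "4 of 4 identical, length 4, 0 flagged"
```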

Now consider the case of

  $y = chr(1000);

Clearly whatever is in $y cannot be a single octet.  The way Perl
currently works (and this is my limited understanding here - someone
with more knowledge can feel free to step in and correct my errors)
is that now $y is considered to be a string of Unicode codepoints.  So
$y contains a single codepoint, U+03E8.  The internal flag is used to
indicate that the internal data pointer points to something that is a
"Unicode codepoint string".

What can we do with such a string?  We can try to print it, but if we
have not converted it we get a message like

  Wide character in print at - line 1.

and we get the bytes "cf a8" as output because that is the internal
encoding.  Likewise,

  print unpack("H*", $y);

produces "cfa8" as output, again because we have been given access to
the string as it exists upgraded.
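
As a sketch of where "cf a8" comes from: it is the UTF-8 encoding of
U+03E8.  How unpack treats an upgraded string has not been the same in
every Perl release, so the example below does not rely on that and
instead makes the encoding step explicit with utf8::encode on a copy.
The variable names are illustrative only.

```perl
use strict;
use warnings;

my $y = chr(1000);

my $length    = length($y);                  # counted in codepoints
my $codepoint = sprintf "U+%04X", ord($y);
my $flag      = utf8::is_utf8($y) ? 1 : 0;   # the internal flag

# Encode a copy: utf8::encode replaces the codepoints with their
# UTF-8 octets in place and leaves a plain byte string behind.
my $octets = $y;
utf8::encode($octets);
my $hex = unpack("H*", $octets);

print "length=$length codepoint=$codepoint flag=$flag octets=$hex\n";
# prints "length=1 codepoint=U+03E8 flag=1 octets=cfa8"

# $octets is ordinary binary data, so printing it gives no
# "Wide character" warning.
print $octets, "\n";
```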

On the other hand,

  print unpack("H*", pack("C", 1000));

produces "e8".
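
The "e8" is simply 1000 reduced to a single octet: 1000 = 3 * 256 + 232,
and 232 is 0xE8.  A small sketch of that arithmetic follows; the
warning that pack issues for the out-of-range value is silenced here
only to keep the demonstration quiet.

```perl
use strict;
use warnings;

my $packed;
{
    no warnings 'pack';         # 1000 does not fit in "C"; pack wraps it
    $packed = pack("C", 1000);
}

my $hex      = unpack("H*", $packed);
my $expected = sprintf "%02x", 1000 % 256;   # keep the low 8 bits

print "$hex $expected\n";       # prints "e8 e8"
```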

So consider again:

  unpack("C*", $y);

This currently produces the list (207, 168) which is again the internal
encoding.  What else should it do?  If you expect values over 255, then
you should not use "C".  If you don't have values over 255, then why is
your string not just a sequence of bytes?  Something must have occurred
to upgrade it to "sequence of Unicode codepoints".

Of course if you have values over 255 you have to use "U" in unpack,
that only makes sense!  On the other hand, if you are agnostic to your
string and just treat it as "data" then it will never get upgraded.  So
where is the issue?
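
A sketch of both halves of that argument, with illustrative variable
names: "U" recovers the codepoint, a string built only from values
below 256 stays plain data, and it is joining it with a wide
codepoint that upgrades the result.

```perl
use strict;
use warnings;

my $y = chr(1000);

# "U" yields codepoints, so values over 255 come back intact.
my @codepoints = unpack("U*", $y);

# Data built only from values below 256 is never upgraded.
my $data        = "ABCD" . pack("C", 200);
my $flag_before = utf8::is_utf8($data) ? 1 : 0;

# Joining it with a wide codepoint is what upgrades the result.
my $mixed      = $data . $y;
my $flag_after = utf8::is_utf8($mixed) ? 1 : 0;

print "codepoints=@codepoints before=$flag_before after=$flag_after ",
    "length=", length($mixed), "\n";
# prints "codepoints=1000 before=0 after=1 length=6"
```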

It sounds to me that what you are trying to suggest is something along
the lines of another type of Sv for the case of "Unicode codepoint
sequence", so that SvPV implicitly means "This scalar is not upgraded
and is just data" and SvP_UnicodeArrayValue_ would contain the upgraded
value.  Then for anything that wanted a SvPV (XS code, unpack "C") the
only sensible thing would be to try to downgrade the string at that
point and then emit a warning in the case of "wide characters" being
present.

This is the point at which someone more familiar with internals chimes
in and says "This has problems [backwards compatibility, tuits, other]."
And of course this would preclude being able to inspect Perl's internal
Unicode representation using unpack "C".  :)
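
The downgrade-and-warn behaviour described above can be approximated
in pure Perl with utf8::downgrade.  This is only a sketch: the helper
octets_or_warn is hypothetical and not part of Perl, and the wording
of its warning is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the string as plain octets, or warn and
# return nothing when a codepoint above 255 makes that impossible.
sub octets_or_warn {
    my ($string) = @_;          # a copy; the caller's scalar is untouched
    # A true second argument makes utf8::downgrade return false on
    # failure instead of dying.
    unless (utf8::downgrade($string, 1)) {
        warn "Wide character in string, cannot downgrade\n";
        return;
    }
    return $string;
}

my $upgraded = "\xe8";
utf8::upgrade($upgraded);       # same character, upgraded storage

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };  # collect, don't print

my $plain = octets_or_warn("\xe8");       # already plain octets
my $back  = octets_or_warn($upgraded);    # downgrades cleanly
my $wide  = octets_or_warn(chr(1000));    # cannot be downgraded

printf "%s %s %s warnings=%d\n",
    unpack("H*", $plain), unpack("H*", $back),
    defined $wide ? "defined" : "undef", scalar @warnings;
# prints "e8 e8 undef warnings=1"
```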

-Ben Carter
Human beings, who are almost unique in having the ability to learn from
the experience of others, are also remarkable for their apparent
disinclination to do so. - Douglas Adams, "Last Chance to See" 

[1] I am deliberately ignoring the box in the corner labeled "EBCDIC".
