perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Marc Lehmann
March 30, 2007 17:29
On Sat, Mar 31, 2007 at 02:16:49AM +0200, Juerd Waalboer wrote:
> Marc Lehmann skribis 2007-03-31  2:12 (+0200):
> > Yes, and the exact same is true for unicode (both have a 1-1 mapping
> > between 0..255 and octets), trivially, of course, as unicode explicitly is
> > a superset of latin1.
> Unicode is a character set, not a character encoding.

As is latin1.

> A unicode string is a sequence of codepoints, not octets.

Nope. You can encode unicode codepoints into UTF-8 and still end up with a
unicode string. Encoding doesn't change the fact that it is unicode that
you are storing.

Since it seems hard to grasp, here is an example:

   use Encode;
   my $s = "Hello, World!";
   $s = Encode::encode_utf8($s);   # pure ASCII: same octets before and after

$s contains the famous greeting before and after the encoding. It is still
an ASCII string, an iso-8859-15 string, a unicode string, and a text
string, regardless of whether it is encoded or not. Encoding does not change
the fact that the string contains the message "Hello, World!".

If you drop ASCII, the same is true for "Hallöchen!", which looks
different in UTF-8 than in an unencoded string, but it is still the same
message. And it is still using unicode to represent the characters.

The fact that you encode something does not change the something that you
encode. Making an arbitrary difference only confuses the issue.
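
To make the "Hallöchen!" case concrete, here is a minimal sketch (not part
of the original message). It assumes only the core Encode module; the
variable names are made up for illustration, and the "ö" is written as
\x{f6} so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# "Hallöchen!" as a character string; \x{f6} is "ö" (U+00F6).
my $text = "Hall\x{f6}chen!";

# Encoding turns the one character "ö" into the two octets 0xC3 0xB6.
my $octets = encode_utf8($text);

printf "characters: %d\n", length($text);     # 10
printf "octets:     %d\n", length($octets);   # 11

# Decoding gives back exactly the same message.
my $back = decode_utf8($octets);
print "same message after round trip\n" if $back eq $text;
```

The encoded form is one element longer, so the two strings compare
unequal, yet decoding recovers the original text unchanged.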

> They don't map 1:1 to octets either. To express a unicode string
> in octets, you need to encode it. For this, there are several
> possibilities, including UTF-8, UTF-16, ...

Sure. Octets are just things that store numbers between 0 and 255. The
most compact way to do that in Perl is using a string. That's also the most
natural way to represent bytes in Perl, closely followed by integers for
single bytes.

You do not store octets in latin1, or unicode, or whatever else in that
string. You are just using the most natural way to represent octets. And that
just happens to work, because Perl was designed to work that way.

The mapping between perl bytes and octets is 1:1. ord and chr do it for
you, for example, and unpack "n" does it for you in case you encode/decode
two byte entities. unpack "C", however, does not map to octets in
perl. That's the bug.
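
As a rough illustration of the 1:1 mapping described above (a sketch added
for clarity, not code from the original message), the following checks the
chr/ord round trip for 0..255 and a pack/unpack "n" round trip for one
two-octet value. The variable names are hypothetical. The unpack "C"
behaviour complained about is deliberately not shown, since it depends on
the perl version and on the string's internal representation.

```perl
use strict;
use warnings;

# Single octets: chr maps 0..255 to one-character strings, ord maps back.
my @bad = grep { ord(chr($_)) != $_ } 0 .. 255;
print "chr/ord round trip: ", (@bad ? "FAILED" : "ok"), "\n";

# Two-octet entities: pack "n" writes a 16-bit big-endian value,
# unpack "n" reads it back.
my $wire  = pack("n", 0x1234);
my @bytes = map { ord($_) } split //, $wire;
printf "octets: %02x %02x\n", @bytes;

my ($value) = unpack("n", $wire);
printf "value:  0x%04x\n", $value;
```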

> Unicode is a superset of the latin1 character set, not the latin1
> character encoding. We'd need bigger bytes for the latter :)

Right. And Perl has those bigger bytes.
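
A small sketch of what "bigger bytes" means in practice (added for
illustration, with made-up variable names): one element of a Perl string
can hold a value above 255, and length still counts it as one.

```perl
use strict;
use warnings;

my $small = chr(0xE4);     # fits in an octet
my $big   = chr(0x20AC);   # EURO SIGN, does not fit in an octet

# Both are strings of exactly one element.
printf "lengths: %d %d\n", length($small), length($big);

# ord returns the stored value unchanged in both cases.
printf "values:  %#x %#x\n", ord($small), ord($big);
```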

                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __
      --==---/ / _ \/ // /\ \/ /
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
