develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From: Juerd Waalboer
Date: March 30, 2007 16:04
Subject: Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID: 20070330230335.GB31277@c4.convolution.nl

Marc Lehmann wrote 2007-03-31  0:20 (+0200):
> > > If perl had the abstract model juerd dreams of
> > and uses in day-to-day coding, without encountering ANY of the problems
> > that you describe
> Frankly, that is not a very good sign. It means either you are extremely
> lucky or you don't use any of the many XS modules that silently break, or
> even the Perl modules (such as the example from Compress::Zlib) that break
> less silently, but more miraculously.

Most of the time, it's a question of realising that the module doesn't
do the Perl unicode model, and treating all communication with the
module as I/O, i.e. only feed it bytes, and only get bytes back. Encode
and decode as appropriate.

I maintain a short list of some modules at
http://juerd.nl/perluniadvice. If you encounter modules that I can test
easily without setting up complete environments, please let me know!

Compress::Zlib sounds like it uses zlib, which compresses byte streams.
That is, don't give it unicode strings, because unicode strings have no
bytes (the bytes are internal only, and you don't know what encoding is
used there). Encode explicitly.
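
A minimal sketch of what "encode explicitly" could look like here,
assuming the compress() and uncompress() functions exported by
Compress::Zlib and assuming UTF-8 as the encoding (any encoding would
do, as long as both sides agree on it):

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Compress::Zlib qw(compress uncompress);

# A text string: characters, including some above 255.
my $text = "na\x{EF}ve caf\x{E9} \x{263A}";

my $bytes      = encode("UTF-8", $text);   # text --> bytes, explicitly
my $compressed = compress($bytes);         # zlib only ever sees bytes
my $restored   = decode("UTF-8", uncompress($compressed));  # bytes --> text

print $restored eq $text ? "round trip ok\n" : "round trip FAILED\n";
```

The module is treated as I/O: bytes go in, bytes come out, and the
decision about the encoding stays with the caller.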

> And they do so for some of my other modules doing that, too. And there are
> two options to me: either tell them perl is broken w.r.t. e.g. "C", or
> their code is broken because they do not call downgrade.

Their code is probably broken because they mix text strings with byte
strings. This can be solved most easily by explicitly encoding your text
string as soon as you feel you must join it with a byte string. The
joined string is a byte string. Decoding it to make a text string may or
may not make sense, depending on the data format.
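
As a rough illustration, assuming a made-up binary format with a 2-byte
big-endian length prefix followed by a UTF-8 payload (the format and the
variable names are hypothetical, only there to have a byte string to
join with):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $label   = "\x{20AC}5";               # text string: EURO SIGN, then "5"
my $payload = encode("UTF-8", $label);   # encode first: now a byte string

# Only now join it with other bytes. The length prefix counts bytes,
# so it has to be taken from the encoded payload, not from the text.
my $packet = pack("n", length $payload) . $payload;

print length($packet), " bytes\n";       # 6 bytes: 2 for the prefix, 4 payload
```

The result is a byte string through and through. Decoding the whole
packet would make no sense, because the length prefix is not text.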

> I find "text strings" and "byte strings" not adequate either, as Perl
> makes no difference between those two concepts (being typeless)

Indeed. Programmers have to track this themselves. Sometimes that sucks,
but in my experience, you need to know what kind of data your variable
contains anyway.

If you ++ a reference, you're in for trouble too. How come that's never
been a problem? Probably because programmers are pretty good at knowing
what functions their variables have.

It's just that this is something you haven't needed to know before, so
you're not /trained/ yet to think about it. But you can't go from 256
characters to several thousands without changing the way you think :)

> they do not map well to encoded/decoded text either

Oh, but they do. Please read perlunitut, which tries to redefine the
universe into four important definitions (and succeeds).

1. Byte strings (aka binary strings)

2. Text strings (aka unicode strings or "internal format" strings)

3. Decoding is byte --> text

4. Encoding is text --> byte
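
A small sketch of the four definitions, assuming the Encode module and
UTF-8 as the encoding of the incoming bytes:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# 1. A byte string, as it might arrive from a file or a socket.
my $bytes = "caf\xC3\xA9";

# 3. Decoding is byte --> text.
# 2. The result is a text string; length() now counts characters.
my $text = decode("UTF-8", $bytes);
print length($bytes), " bytes, ", length($text), " characters\n";
# prints "5 bytes, 4 characters"

# 4. Encoding is text --> byte.
my $again = encode("UTF-8", $text);
print $again eq $bytes ? "same bytes again\n" : "different bytes\n";
```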

> Perl only knows how to concatenate characters, it does not know
> anything about byte or text, so utf8::encode does not necessarily
> create a byte string out of a text string.

I don't get the causal connection you're illustrating.

utf8::encode takes any text string (or unicode string, if you prefer
that term) and turns it into a UTF-8 encoded byte string in place.

That is,

    utf8::encode($foo);

is the efficient equivalent of:

    $foo = encode("utf8", $foo);
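
To check that claim on a single character, a sketch (the variable names
are only for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "\x{263A}";                   # one character, a text string

my $in_place = $text;
utf8::encode($in_place);                 # modifies its argument, returns nothing useful

my $via_encode = encode("utf8", $text);  # returns a new string instead

print $in_place eq $via_encode ? "same bytes\n" : "different bytes\n";
print join(" ", map { sprintf "%02X", ord } split //, $in_place), "\n";
# prints "E2 98 BA"
```

Note that utf8::encode is always available; it does not need "use utf8".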

Note that whenever a string has an encoding attached to it, conceptually,
it's automatically a byte string. Text strings don't have encodings,
because encodings are a byte thing, and text strings don't have bytes;
they have characters. (Text strings have encodings and bytes
/internally/, just like numbers do have bytes /internally/, encoded in
one way or another, that allows values greater than 255 or less than 0.)

> It could just as well create a text string out of a byte string (think
> JSON, which creates json _text_ out of e.g. byte strings by encoding
> them to UTF-8).

utf8::encode is a text operation. It will assume that whatever you give
it is a text string. Its characters are considered Unicode codepoints.

You shouldn't give it a byte string.

To understand what happens if you do give utf8::encode a byte string,
you need to know some internals. But I stress that this is not required
knowledge, because it's so much easier to just remember not to do this
weird thing. Why would you try to encode a byte string to UTF-8, anyway?
That makes no sense, because UTF-8 is a means of representing
characters. Byte strings consist of bytes, not characters.

    Here's what happens internally: Any byte string used as a text
    string is considered to be encoded in latin1, because Perl doesn't
    know the difference.
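
    A sketch of the consequence, assuming the byte string already
    holds UTF-8: each byte is taken to be a latin1 character and is
    encoded once more.

```perl
use strict;
use warnings;

my $bytes = "\xC3\xA9";          # already the UTF-8 bytes for U+00E9

my $oops = $bytes;
utf8::encode($oops);             # treats C3 and A9 as characters U+00C3 and U+00A9

print join(" ", map { sprintf "%02X", ord } split //, $oops), "\n";
# prints "C3 83 C2 A9": four bytes, the familiar double encoding
```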

> (or my programs either). It might be a good and simplified advice to a
> beginner

The theory is very simple, but not simplified. It just isn't any harder.

I'm sorry if you want a more complex programming tool. But apparently
you have found ways to make it hard for yourself already :)

> though, although I prefer to never tell people simplified (but wrong)
> things.

I agree. Whenever I use a simplified view, that will be obvious or
mentioned. Metadata ("this information is wrong, but useful anyway") is
very important.

> The perl unicode model is rather simple, but leaves you in control,
> and I found teaching people about how perl just allows more than
> 0..255 for a character index works best (although people differ).

That's a great explanation of how unicode strings work. But when people
write programs, these programs typically accept input and also have some
output. And then you're doing I/O, which is done with bytes, and
requires character encodings in order to communicate characters. You
used to be able to ignore this fact when everyone still used iso-8859-1,
I mean CP437, I mean CP850, I mean koi8-r, I mean Windows-1252. Right,
we never did all use exactly the same encoding. We've just chosen to
remain ignorant all this time. Explicit re-encoding, or decoding and
encoding has been necessary all this time. It's just that with more than
256 codepoints, it became much more apparent :)
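
One way to make that explicit at the I/O boundary is an :encoding layer
on the filehandle. A sketch, assuming UTF-8 and using an in-memory
filehandle so that it runs without touching any real file:

```perl
use strict;
use warnings;

my $buffer = '';

# Output: characters go in, the layer encodes them to bytes.
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} "\x{263A}\n";
close $out or die "close: $!";

print length($buffer), " bytes written\n";   # 4: three for U+263A, one for "\n"

# Input: bytes come in, the layer decodes them to characters.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $line = <$in>;
close $in;
chomp $line;

print length($line), " character read\n";    # 1
```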
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
