develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Marc Lehmann
March 30, 2007 15:20
On Fri, Mar 30, 2007 at 10:09:29PM +0200, Juerd Waalboer wrote:
> Marc Lehmann skribis 2007-03-30 14:24 (+0200):
> > In fact, I teach a lot of people about unicode in perl.
> At the German Perl Workshop, I saw your unicode presentation. I don't
> know if this is a good representation for your teaching of unicode, but

It is, if a bit short (and I consider it a matter of taste).

> > If perl had the abstract model juerd dreams of
> and uses in day-to-day coding, without encountering ANY of the problems
> that you describe

Frankly, that is not a very good sign. It means either you are extremely
lucky or you don't use any of the many XS modules that silently break, or
even the Perl modules (such as the example from Compress::Zlib) that break
less silently, but more miraculously.

> It kind of makes one wonder if this dream might be reality (and your
> reality a dream?)

The dream isn't reality. If it were, people would not report bugs against
JSON::XS because it happens to create scalar values with the UTF-X bit set.

And they do so for some of my other modules doing that, too. And there are
two options to me: either tell them perl is broken w.r.t. e.g. "C", or
their code is broken because they do not call downgrade.

Obviously, I prefer the former over the latter, but last time I was told
unpack "C" was mentioned to break the abstraction in the camelbook, so its

Which suddenly invalidates a lot of code.
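
A minimal sketch of the situation described above, using only the core
utf8:: functions. The variable names are made up for illustration; the
point is that a scalar with the flag set still holds the same string at
the Perl level, and that utf8::downgrade is the workaround callers are
being asked to apply:

```perl
use strict;
use warnings;

# Four characters, all with an index below 256.
my $plain   = "caf\xe9";
my $flagged = $plain;

# Switch the internal representation to UTF-8. This sets the flag
# but does not change which characters the string contains.
utf8::upgrade($flagged);
my $flag_before = utf8::is_utf8($flagged) ? 1 : 0;

# At the Perl level the two scalars are the same string.
my $same_string = ($plain eq $flagged) ? 1 : 0;
my $same_length = (length($plain) == length($flagged)) ? 1 : 0;

# The workaround: force the single-byte representation again before
# handing the value to code that inspects the buffer directly.
# utf8::downgrade dies if a character above 255 is present, unless a
# true second argument is passed.
my $ok = utf8::downgrade($flagged);
my $flag_after = utf8::is_utf8($flagged) ? 1 : 0;

print "same=$same_string len=$same_length flag: $flag_before -> $flag_after\n";
```

What byte-oriented code (XS, or unpack "C" on older perls) sees for
$flagged before the downgrade depends on the perl version, so the
sketch deliberately does not show it.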

> > then perl would have a very easy unicode model that boils down to
> > what I talked about on the perl workshop: encode/decode when doing
> > I/O, otherwise, enjoy.
> And keep text strings and byte strings separate!!!!!!!!!!!!!eleven

I find "text strings" and "byte strings" not adequate either, as Perl
makes no difference between those two concepts (being typeless), and
they do not map well to encoded/decoded text either. Perl only knows
how to concatenate characters, it does not know anything about byte or
text, so utf8::encode does not necessarily create a byte string out of a
text string. It could just as well create a text string out of a byte
string (think JSON, which creates json _text_ out of e.g. byte strings by
encoding them to UTF-8).
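
To illustrate the point about utf8::encode, here is a small sketch (the
variable names are hypothetical). The result of encoding is still just a
string of characters, each with an index in 0..255, and it concatenates
like any other string:

```perl
use strict;
use warnings;

my $s = "\x{20ac}";        # one character with index 0x20AC (EURO SIGN)

my $encoded = $s;
utf8::encode($encoded);    # in place: replace the string by its UTF-8 encoding

# Still a string of characters, now three of them, all below 256.
my @indices = map { ord } split //, $encoded;
printf "%d characters: %s\n",
    scalar(@indices), join(" ", map { sprintf "%02x", $_ } @indices);

# Perl happily concatenates it with other characters; nothing marks
# the result as "bytes" or "text".
my $quoted = '"' . $encoded . '"';
print length($quoted), "\n";
```

Whether $quoted is then a "byte string" or a "text string" is a matter
of how the program interprets it, not something perl records.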

> So, recap: encode/decode when doing I/O, keep text strings and byte
> strings separate, otherwise, enjoy.

I do not think that maps clearly to Perl (or my programs either). It might
be good, simplified advice for a beginner, though, although I prefer
to never tell people simplified (but wrong) things. The perl unicode model
is rather simple, but leaves you in control, and I found teaching people
about how perl just allows more than 0..255 for a character index works
best (although people differ).
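
A short sketch of that model, assuming the core Encode module and a
made-up input of five UTF-8 octets: decode on input, work on characters
whose index may exceed 255, encode on output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Pretend these five octets were read from a file or socket.
# They are the UTF-8 encoding of U+00E4 followed by U+263A.
my $input = "\xc3\xa4\xe2\x98\xba";

# Decode on input: five characters in 0..255 become two characters,
# the second with an index above 255.
my $text = decode('UTF-8', $input);
my @idx  = map { ord } split //, $text;
printf "%d characters: %s\n",
    scalar(@idx), join(" ", map { sprintf "U+%04X", $_ } @idx);

# String operations work on characters, not on the encoded form.
my $reversed = reverse $text;

# Encode on output.
my $output = encode('UTF-8', $reversed);
print join(" ", map { sprintf "%02x", ord } split //, $output), "\n";
```

Reversing the undecoded $input instead would shuffle the individual
octets and produce invalid UTF-8, which is the usual argument for
decoding first.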

                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __
      --==---/ / _ \/ // /\ \/ /
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
