Re: use encoding 'utf8' bug for Latin-1 range

Juerd Waalboer
February 20, 2008 18:52
Jarkko Hietaniemi wrote 2008-02-20 21:23 (-0500):
> use encoding 'utf8';
> my $x = "\x{ff}";

encoding.pm has a broken design, and for that reason, any fix will
probably break almost all existing code using it.

Unfortunately, it applies \x escapes 00..ff before it decodes the source.
This means that for 8bit encodings, you can only use characters in the
latin1 range if the same character happens to be in the 0..255 range for
your chosen encoding. E.g. with "use encoding 'koi8r';" it is no longer
possible to have a literal é (U+00e9, eacute), not even with chr().
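
The order of operations described above can be imitated with the Encode
module, without loading the pragma itself. This is only a sketch:
escape_then_decode is a hypothetical helper, and it assumes the pragma
behaves like "produce the byte first, decode it afterwards".

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of encoding.pm): the \x escape yields a
# raw byte first, and only then is that byte decoded with the source
# encoding.
sub escape_then_decode {
    my ($encoding, $byte_value) = @_;
    my $raw = chr($byte_value);        # what "\xe9" produces before decoding
    return decode($encoding, $raw);    # what the program ends up with
}

my $char = escape_then_decode('koi8-r', 0xE9);

# In KOI8-R the byte 0xE9 is a Cyrillic letter, so U+00E9 is out of reach.
printf "wanted U+00E9, got U+%04X\n", ord $char;
```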

Because there are other problems with encoding.pm that also cannot be
fixed without breaking backward compatibility, I suggest the following
simple 4 step plan for the future, that is backwards compatible:

0. keep encoding.pm and ${^ENCODING} (the actual problem) broken
1. deprecate encoding.pm; complain loudly with a mandatory warning
2. do the same for ${^ENCODING}
3. advocate the use of utf8 and "use utf8" for non-latin1 source code
4. strongly discourage the use of non-latin1 non-utf8 source code
5. modify to provide a way to set *only* STDIN and STDOUT
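
For steps 3 and 5, a script can already get close with existing tools. A
minimal sketch, assuming the file is saved as UTF-8; using binmode for
the two handles is my assumption and not necessarily the mechanism
step 5 has in mind:

```perl
use strict;
use warnings;
use utf8;    # the source code itself is UTF-8; \x escapes are left alone

# Set layers on STDIN and STDOUT only; nothing else is affected.
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

my $literal = "é";         # literal character in the source
my $escaped = "\x{e9}";    # the same character written as an escape

printf "length=%d, literal %s escape\n",
    length $literal, ($literal eq $escaped ? 'equals' : 'differs from');
```

Here the literal and the escape stay in agreement, because only the
source text is decoded and the escape keeps meaning "code point E9".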

> The \x{fffd} is the Unicode "lost in translation" character,
> in case people are wondering.

It gets scarier once you know *why* it was lost in translation:

    juerd@lanova:~$ perl -Mencoding=utf8 -e'my $foo = "\xe2\x82\xac"; printf "length=%d ord=%d (U+%04X)\n", length $foo, ord $foo, ord $foo'
    length=1 ord=8364 (U+20AC)

I have seen code that depends on this behavior. Admittedly, this code is
fundamentally broken in other ways too, regarding Perl's Unicode model.
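
Both results can be reproduced with an explicit decode, which is
roughly what happens to the escaped bytes under the pragma. A sketch
only: describe_utf8 is a hypothetical helper, and treating the pragma as
a plain Encode::decode call is a simplification.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a byte string as UTF-8 and report on the
# first character of the result.
sub describe_utf8 {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);    # malformed bytes become U+FFFD
    return sprintf 'length=%d ord=%d (U+%04X)',
        length $text, ord $text, ord $text;
}

print describe_utf8("\xe2\x82\xac"), "\n";  # three bytes, one valid sequence
print describe_utf8("\xff"), "\n";          # a byte that never occurs in UTF-8
```

The first call prints length=1 ord=8364 (U+20AC), the second
length=1 ord=65533 (U+FFFD).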

Given that it's not the only flaw in encoding.pm, and that the world is
rapidly converging on the idea of using UTF-8 as the de facto standard,
I think it would be unwise to waste tuits on fixing the issues.
Deprecation sounds like a better idea to me; we already have a good way
to use non-latin1: "use utf8;", and I don't think it's necessary to have

Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker
  Convolution:     ICT solutions and consultancy
