Re: use encoding 'utf8' bug for Latin-1 range

From: Juerd Waalboer
Date: February 20, 2008 18:52
Subject: Re: use encoding 'utf8' bug for Latin-1 range
Message ID: 20080221025046.GV32395@c4.convolution.nl

Jarkko Hietaniemi wrote 2008-02-20 21:23 (-0500):
> use encoding 'utf8';
> my $x = "\x{ff}";

encoding.pm has a broken design, and for that reason, any fix will
probably break almost all existing code using it.

Unfortunately, it applies \x escapes 00..ff before it decodes the source.
This means that for 8-bit encodings, you can only use a character in the
latin1 range if your chosen encoding happens to contain that same
character. E.g. with "use encoding 'koi8r';" it is no longer possible to
have a literal é (U+00E9, eacute), not even with chr().
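
The ordering problem can be illustrated without loading encoding.pm at
all. The following is only a sketch, assuming the core Encode module and
its 'koi8-r' encoding name; the variable names are made up for the
example. It takes the single byte that "\xe9" produces and decodes it as
KOI8-R, which is roughly what happens when the escape is applied before
the source is decoded:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The escape "\xe9" yields one byte, 0xE9. Read as Latin-1 that byte is
# U+00E9 (eacute), which is what the programmer meant.
my $byte = "\xe9";

# Decoding that same byte as KOI8-R (an assumption standing in for what
# the pragma does) gives a Cyrillic letter, not eacute.
my $char = decode('koi8-r', $byte);

printf "intended U+%04X, got U+%04X\n", ord($byte), ord($char);
```

KOI8-R has no é at all, so no byte value can bring it back once the
escape is interpreted this way.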

Because there are other problems with encoding.pm that also cannot be
fixed without breaking backward compatibility, I suggest the following
simple six-step plan for the future, which is backward compatible:

0. keep encoding.pm and ${^ENCODING} (the actual problem) broken
1. deprecate encoding.pm; complain loudly with a mandatory warning
2. do the same for ${^ENCODING}
3. advocate the use of utf8 and "use utf8" for non-latin1 source code
4. strongly discourage the use of non-latin1 non-utf8 source code
5. modify open.pm to provide a way to set *only* STDIN and STDOUT
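
For what it's worth, steps 3 and 5 might look roughly like this in a
script today. This is a sketch only: it assumes the file is saved as
UTF-8, and it uses plain binmode as a stand-in for the open.pm interface
proposed in step 5, which is not what open.pm offered at the time. The
variable name is made up for the example.

```perl
use strict;
use warnings;
use utf8;    # step 3: the source file itself is UTF-8

# With "use utf8", a literal is already a character string.
my $price = "€ 10";

# Step 5, approximated: put an encoding layer on the standard handles
# only, without changing the default layers for other filehandles.
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

printf "%s has %d characters\n", $price, length $price;
```

Nothing here touches \x escapes or chr(), so "\xe9" keeps meaning
U+00E9 regardless of how the source is encoded.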

> The \x{fffd} is the Unicode "lost in translation" character,
> in case people are wondering.

It gets scarier once you know *why* it was lost in translation:

    juerd@lanova:~$ perl -Mencoding=utf8 -e'my $foo = "\xe2\x82\xac"; printf "length=%d ord=%d (U+%04X)\n", length $foo, ord $foo, ord $foo'
    length=1 ord=8364 (U+20AC)
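
For contrast, here is a sketch of the same strings handled without the
pragma, assuming the core Encode module (variable names are made up for
the example). The three escapes stay three separate characters until
they are decoded explicitly, and the "\xff" from the original report
turns into U+FFFD because it is not valid UTF-8 on its own:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Without the pragma, three \x escapes are three characters (bytes).
my $bytes = "\xe2\x82\xac";

# Decoding is an explicit, visible step.
my $euro = decode('UTF-8', $bytes);

# A lone 0xFF byte is malformed UTF-8; with the default CHECK value the
# decoder substitutes U+FFFD instead of dying.
my $lost = decode('UTF-8', "\xff");

printf "bytes=%d chars=%d (U+%04X) lost=U+%04X\n",
    length $bytes, length $euro, ord $euro, ord $lost;
```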

I have seen code that depends on this behavior. Admittedly, this code is
fundamentally broken in other ways too, regarding Perl's Unicode model.

Given that it's not the only flaw in encoding.pm, and that the world is
rapidly converging on the idea of using UTF-8 as the de facto standard,
I think it would be unwise to waste tuits on fixing the issues.
Deprecation sounds like a better idea to me; we already have a good way
to use non-latin1: "use utf8;", and I don't think it's necessary to have
another.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####@juerd.nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales@convolution.nl>
