develooper Front page | perl.perl5.porters | Postings from February 2008

Re: use encoding 'utf8' bug for Latin-1 range

Juerd Waalboer
February 26, 2008 18:19
Re: use encoding 'utf8' bug for Latin-1 range
Glenn Linderman wrote 2008-02-26 15:16 (-0800):
> Perhaps all uses in source code of characters outside of the ASCII range 
> should produce warnings in the 5.12, unless there is a pragma to specify 
> what locale/encoding.

Sounds useful, but I personally don't think that just assuming "use
utf8;" by default would be a problem if that would interpret invalid
UTF-8 as latin1. Really, actual latin1 data that happens to also be
valid UTF-8 is immensely rare in my experience. (Counter examples,
anyone?) To further reduce the risk, the fallback could be done per line
or per file, instead of per invalid sequence itself.

(e.g. utf8::decode($_) for @source_lines;)
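A minimal sketch of that per-line fallback (the @source_lines name is just
illustrative): utf8::decode() converts in place and returns false on invalid
UTF-8, leaving the bytes untouched -- and latin1 bytes left untouched are
already the right codepoints.

```perl
use strict;
use warnings;

# Per-line fallback sketch: utf8::decode() converts in place and returns
# false on invalid UTF-8, leaving the string's bytes untouched -- and
# latin1 bytes left alone are already the right codepoints.
my @source_lines = ("caf\xC3\xA9\n", "caf\xE9\n");  # valid UTF-8, then latin1
utf8::decode($_) for @source_lines;  # the latin1 line is a silent no-op

# Both lines now hold the same five characters: c, a, f, U+00E9, newline.
```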

In any case, I think that in 5.12, non-ASCII byte data should either
warn (as you suggest) or be interpreted as utf8 with latin1 fallback
(dmq's suggestion, but applied elsewhere), maybe also with a warning.

> But maybe a replacement for "use encoding" should be implemented
> simultaneously.

I do not object to this, but I do question whether it's worth the tuits.
Only the actual implementers can judge that.

> Implementing a special version of Perl on EBCDIC seems like a waste of
> programmer productivity...

Agreed, but again: those who implement things get to decide. It does,
however, sometimes keep me from contributing! I'm glad that perlunitut
and perlunifaq were accepted even though they pay no attention to EBCDIC
at all. (It did delay my work, before I decided to simply ignore the
entire EBCDIC world. I have not received even a single complaint about
that.)

> just default on EBCDIC platforms to "use encoding(EBCDIC);", decode
> the source (and data) from EBCDIC to UTF-8, and charge onward with
> UTF-8 internally.)

I was told that it's not that simple, but I forgot why.

> With the above in mind, it sounds simple to specify that \x defines a 
> character in the character set of the source code...

There are a few problems with this. First of all, in Perl we don't
usually talk about charsets. The only charset that Perl really supports
is Unicode. Other character sets are implemented as *encodings* of
Unicode. That's why we talk about encodings, not charsets, in Perl. All
translations are done with Unicode in between, at least conceptually.

This is where \x under ${^ENCODING} went wrong, too. \x is used with
character numbers (according to the documentation), which are charset
thingies, and would thus be Unicode codepoints if it were compatible with
the rest of Perl. Instead, with ${^ENCODING} in effect, they're seen as
*bytes*, and then converted to Unicode through *decoding*. Except, "of
course", if the given number is > 0xff, then decoding is skipped and the
value is used as a Unicode codepoint directly.

chr suffers from the same problem. However, ord is unaffected, and
reports unicode codepoints. The symmetry between chr and ord is broken.
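Since the encoding pragma itself has long been deprecated, here is a hedged
emulation, using the Encode module, of what chr does under it; KOI8-R is
just an example encoding:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Emulating what chr(0xC1) does under "use encoding 'koi8-r'": the number
# is taken as a *byte* and then decoded.  KOI8-R byte 0xC1 is U+0430,
# CYRILLIC SMALL LETTER A.
my $c = decode('koi8-r', chr(0xC1));

# ord always reports the Unicode codepoint, so the chr/ord round trip
# breaks: you put 0xC1 in and get 0x430 back out.
printf "%#x\n", ord($c);   # 0x430
```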

(Oh, in case anyone doesn't know:

charset (e.g. unicode) is  character <=> number
encoding (e.g. UTF-8)  is     number <=> byte sequence)
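In code, using the Encode module (the EURO SIGN is an arbitrary example
character):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = chr(0x20AC);             # charset step:  number 0x20AC <=> character
my $bytes = encode('UTF-8', $char);  # encoding step: number <=> byte sequence
printf "U+%04X takes %d bytes in UTF-8\n", ord($char), length($bytes);
```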

> this is probably what most programmers meant when they coded \x...

I think it is more useful to build on documentation than what people
"probably meant". It's incredibly hard to find out what people probably
meant, and it is also incredibly hard to change what people will
probably mean in the future. However, finding out what documentation
said and says is trivial, and changing documentation of the future
version is also pretty easy, in comparison to changing people's minds.

The documentation of \x, in perlop, defines \x as "hex char" and "wide
hex char". It does not say if this is a unicode codepoint or a character
number in whatever "charset" that is loaded. Again, though, I stress
that currently there is no way to express the requested charset, just
the encoding. The character set of Perl appears to be Unicode,
unconfigurably. The documentation of chr, in perlfunc, mentions ASCII
and Unicode. Note especially that chr's documentation specifies that ord
is the reverse of chr! So let's see ord's doc. Again, no mention of
legacy charsets -- only ASCII, EBCDIC and Unicode.

I should point out that I do not exactly know how EBCDIC works in Perl.
I fear that it's horribly incompatible with documentation, older perls,
and newer perls, for any given version of Unicode-supporting Perl.

encoding's documentation does not explicitly say what \x should do. It
does, however, give several examples that clearly and strongly suggest
that \x under "use encoding" creates BYTES, that are then DECODED. This
is consistent with what it actually does.

So what to do? Maybe indeed give up support for encodings the way it's
done now, and add "use charset", to indeed provide support for a
different charset. All character number reporting and taking operations
(including ord) should then use the given charset. Charset and encoding
can be specified separately:

    use charset 'unicode', encoded_as => 'utf-8';
    use charset 'latin1';  # implies encoded_as => 'latin1';
    use charset 'utf-8';   # warning: utf-8 is not a charset, did you
                           # mean: use charset 'unicode', encoded_as =>
                           # 'utf-8';?
    use charset 'CP1251', encoded_as => 'utf-8';
                           # may not immediately appear to make sense,
                           # but I think this falls under Jarkko's "The
                           # Perl Way", where Perl does not restrict you
                           # in your choices. After all, there's no
                           # reason that you could not encode your
                           # non-unicode 0 .. 255 as UTF-8. Apparently
                           # doing so is popular in the JSON community
                           # too.

So yes, \x may mean "character in the currently selected charset", but
we'd first need a pragma to define the charset! Currently we only have a
pragma to define the encoding, which at some points also changes the
charset, and at others does not, and in some weird way makes \x mean
"byte" rather than "char".

A huge gap in this idea is that most legacy encodings do not actually
define any kind of semantics, so which semantics would you use to
uppercase an é (eacute) under the CP1252 charset? CP1252 is not defined
in terms of Unicode codepoints (as far as I know; anyone have specs for
me?), so using Unicode semantics would be a bit weird. ISO-8859-1,
however, has been explicitly (retro-)defined in terms of Unicode
codepoints, but the specification is still not clear on what semantics
should be used. It depends on how recursively you read U+ numbers.

But really, I honestly think it'd be a waste of tuits to design and
implement all this. It'll probably just get designed and/or implemented
wrong again anyway, because it's hard to oversee everything.

Let's instead just deprecate ${^ENCODING} and be done with it. If a new
mechanism is needed, it's much easier to make it encoding-based rather
than charset-based, but this time implement that in a pure way: don't
let charset stuff creep in, so keep chr and \x in their unaffected
unicody state.

> Modern usage of \x to specify Unicode characters is probably erroneous, 
> as \N{U+} should have been used.

Regardless of whether this design was correct, it is there, widely used,
and I strongly object against changing it now. Instead, I'd rather see
perlop more explicitly state what \x does as a post-facto definition.
I'll volunteer. Might as well change the misleading "wide char"
definition of \x{} too -- "wide chars" elsewhere in Perl do not refer to
the number of hex digits used when creating them :)

\N{U+} is way too much typing too, by the way. PHP 6 has \uXXXX and
\UXXXXXX that I secretly really like. Too bad our \u is taken already :)
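For comparison, both notations spell the same character today (a small
sketch):

```perl
use strict;
use warnings;

my $x = "\x{E9}";       # "hex char" per perlop
my $y = "\N{U+00E9}";   # the explicit Unicode spelling
print $x eq $y ? "same\n" : "different\n";
```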

> Perl text strings are one-byte or multi-bytes encoded number sequences. 

(Please excuse my liberal use of capital letters in the following.)

No. Perl text strings are sequences of characters.

The numbers, and their encoding, are INTERNAL. You can explicitly
request the number (ordinal value) of any character with "ord". The
characters themselves, are, in Perl's string model, NOT NUMBERS.

If 'A' were the number 65, then 'A' == 65 would be true, and 'A' + 1
would be equal to 'B'. This is true in C, but not in Perl.
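That difference is easy to demonstrate:

```perl
use strict;
use warnings;
no warnings 'numeric';   # 'A' numifies to 0; silence the warning for the demo

print 'A' == 65 ? "equal\n" : "not equal\n";  # not equal
print 'A' + 1, "\n";                          # 1, not 'B'
print chr(ord('A') + 1), "\n";                # B: ord/chr is the explicit route
```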

> When all the characters are numerically less than 256, they may be 
> one-byte encoded sequences; when any of the characters are numerically 
> greater than 255, they must be multi-bytes encoded sequences.

Encoding is INTERNAL. In the programming language, we have Unicode text
strings, not UTF-8, not latin1, not ASCII. We don't have bytes, we have
characters. Internally, yes, there certainly are bytes. This shines
through in several places, and if the unpack discussion has given us
anything other than headaches and backwards incompatibility, it's
affirmation that perl5-porters thinks that such leakage of the internals
is wrong and ought to be repaired.

> The semantics are either ASCII (for one-byte encoded sequences) or UTF-8 
> (for multi-bytes encoded sequences).

Not everywhere, but only in some places. This is a bug in the string
model's design. It's certainly historically explicable, but nowadays
causes more trouble than it prevents.

Note that between 5.6 and 5.8 the string model was changed: (Caveat
porter: here follows a post-facto *simplified* view of history) unicode
semantics were made standard (removing the need for "use utf8") and
automatic internal encoding upgrading was added so that
internally-not-yet-utf8-strings could be used with
internally-already-utf8-strings. Theoretically, this removed the
difference between internally-not-utf8-encoded and
internally-utf8-encoded, but several operations, specifically those
mentioned in Unicode::Semantics' documentation, still lag behind.
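The lag is observable: under the old semantics, an internally-latin1 string
and its internally-utf8 twin give different uc() results. This sketch
assumes a perl *without* `use feature 'unicode_strings'` in effect, which
is what reproduces the pre-5.12 behaviour:

```perl
use strict;
use warnings;

my $latin1 = "\xE9";     # é, stored internally as a single latin1 byte
my $utf8   = "\xE9";
utf8::upgrade($utf8);    # same text, internally UTF-8 encoded now

# Without unicode_strings, uc() gives these two identical strings
# different results -- purely because of their internal encoding:
printf "%vX vs %vX\n", uc($latin1), uc($utf8);   # E9 vs C9
```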

Whether those operations were intentionally left to operate according to
5.6's string model, or if they were just forgotten, isn't really
important anymore. Given the current string model, and Perl's defaulting
to unicode semantics in all other places, lc, uc, charclasses, etcetera,
should be changed to also support the new model. Even unpack was
changed, and that didn't even necessarily have ANYTHING to do with text
strings.

> Expecting latin1 semantics for one-byte encoded sequences is a bug.

Only because the ISO-8859 standard does not specify semantics.

Assuming you meant to say "Expecting unicode semantics for one-byte
encoded sequences is a bug", I strongly and loudly say NO.

Do remember that strings that are internally encoded as latin1, are
(should be) Unicode strings for all text operations.

> This would be a useful extension: could "use utf8 semantics;" be 
> implemented which would affect this stuff?

Perl 5.6 had this, and it was called "use utf8;". This is gone. "use
utf8;" now only changes the way Perl interprets your source code, it no
longer changes semantics. This is good, because doing that lexically
really makes it hard to combine binary and string semantics in a single
program.

Note that in Perl >= 5.8, the unicode-ness of a string is not stored
internally. Instead, operations are either binary oriented or string
oriented. This makes the type of the string CONTEXT SENSITIVE, rather
than stored within the string.

Looks like numbers, doesn't it? If you use the string "123" with a
numeric operation, it's automatically converted to a number. Not the
internal flags and representation, but the CONTEXT defines the
semantics. This model is also broken in a few places, specifically in
bitwise operators, and that can hurt a lot and force people to fall back
to type casting ("$foo" and 0+$foo). Perl 6 acknowledges that and
introduces separate bitwise operators for strings and numbers.
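A quick demonstration of that operand-driven (rather than context-driven)
behaviour of the bitwise ops:

```perl
use strict;
use warnings;

my $s = "8" | "2";          # both operands are strings: byte-wise OR
my $n = 0 + "8" | 0 + "2";  # force numbers first: numeric OR

print "$s\n";   # ':' -- chr(0x38 | 0x32)
print "$n\n";   # 10
```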
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <>  <>
  Convolution:     ICT solutions and consultancy <>
