
Re: use encoding 'utf8' bug for Latin-1 range

Glenn Linderman
February 26, 2008 21:33
On approximately 2/26/2008 6:15 PM, came the following characters from 
the keyboard of Juerd Waalboer:
> Glenn Linderman skribis 2008-02-26 15:16 (-0800):
>> Perhaps all uses in source code of characters outside of the ASCII range 
>> should produce warnings in the 5.12, unless there is a pragma to specify 
>> what locale/encoding.
> Sounds useful, but I personally don't think that just assuming "use
> utf8;" by default would be a problem if that would interpret invalid
> UTF-8 as latin1. Really, actual latin1 data that happens to also be
> valid UTF-8 is immensely rare in my experience. (Counter examples,
> anyone?) To further reduce the risk, the fallback could be done per line
> or per file, instead of per invalid sequence itself.
> (e.g. utf8::decode($_) for @source_lines;)
> In any case, I think that in 5.12, non-ASCII byte data should either
> warn (as you suggest) or be interpreted as utf8 with latin1 fallback
> (dmq's suggestion, but applied elsewhere), maybe also with a warning.

We're in pretty close agreement on this point.  The unfortunate part is 
that people with different locales may use character values of 128..255 
without telling Perl.  When I speak of character values 128..255, I refer 
not only to \x sequences but also to the literal characters in the source file.

>> But maybe a replacement for "use encoding" should be implemented
>> simultaneously.
> I do not object to this, but I do question whether it's worth the tuits.
> Only the actual implementers can judge that.

We're in total agreement on this.  I think the only practical way 
forward is Unicode; UTF-8 being one encoding of Unicode.  A bit more 
support for other Unicode encodings would be nice, but hard to put into 
one bit, I guess.  So the program(mer) has to keep track of that part.

>> Implementing a special version of Perl on EBCDIC seems like a waste of
>> programmer productivity...
> Agreed, but again: those who implement things get to decide. It does,
> however, sometimes keep me from contributing! I'm glad that perlunitut
> and perlunifaq were accepted even though they pay no attention to EBCDIC
> at all. (It did delay my work, before I decided to simply ignore the
> entire EBCDIC world. I have not received even a single complaint about
> that.)
>> just default on EBCDIC platforms to "use encoding(EBCDIC);", decode
>> the source (and data) from EBCDIC to UTF-8, and charge onward with
>> UTF-8 internally.)
> I was told that it's not that simple, but I forgot why.

We're in total agreement here too.  I'd like to know why it is not that 
simple; I understand there would be a performance penalty... but if 
there is anything else, I'd like to know what.  People that use EBCDIC 
systems should be glad to get out from the curse of EBCDIC by migrating 
to Unicode, just like people that have to deal with 40 other 
language-specific code pages are glad to get out from under the curse of 
code page juggling.

I was talking to a friend the other day, and she mentioned that one 
piece of software was producing those funny rectangular blocks for 
all accented characters when she was reading books in Spanish.  A bit of 
discussion revealed that she had set her system to a Russian code page so 
that her Russian language tutoring software would work... but that broke 
her Spanish software.  So both are on Windows, but apparently both are 
using legacy encodings, apparently one or both of them not doing it 
properly, and so they don't coexist very well.  Were both packages using 
Unicode properly, they would coexist fine.

>> With the above in mind, it sounds simple to specify that \x defines a 
>> character in the character set of the source code...
> There are a few problems with this. First of all, in Perl we don't
> usually talk about charsets. The only charset that Perl really supports
> is Unicode. Other character sets are implemented as *encodings* of
> Unicode. That's why we talk about encodings, not charsets, in Perl. All
> translations are done with Unicode in between, at least conceptually.

You can't have an encoding without a charset.

Perl supports ASCII and Unicode, and they are compatible.  ASCII is the 
default.  Unicode is turned on with "use utf8;".

"use encoding" is broken, but it attempts to support other charsets, 
defined as subsets of Unicode.

People sometimes use non-ASCII characters without declaring "use utf8;" 
because the default charset on their system is an ASCII superset (but 
not numerically a Unicode subset, unless it happens to be Latin-1), and 
that can be a problem.  However, it is a solvable problem: declare the 
charset, or the encoding, in terms of ASCII characters.  This 
declaration would govern not only the encoding of the source file, but 
also the interpretation of \xXX.  \x{} semantics could stay at Unicode, 
as it is likely only used for characters outside the \xXX range.
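To make the stakes concrete: the same source byte names different 
characters under different charsets, which is exactly why the 
declaration matters.  A sketch using the Encode module (the byte 0xE9 
is an arbitrary example):

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same byte, 0xE9, decoded under two different charsets:
my $latin1 = decode('ISO-8859-1', "\xE9");  # LATIN SMALL LETTER E WITH ACUTE
my $cp1251 = decode('CP1251',     "\xE9");  # CYRILLIC SMALL LETTER SHORT I

printf "Latin-1: U+%04X\n", ord($latin1);   # U+00E9
printf "CP1251:  U+%04X\n", ord($cp1251);   # U+0439
```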

The definition of a single-byte charset is a simple lookup table with 
256 entries.  Double-byte and shift-in-shift-out charsets are harder.
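The 256-entry table can be sketched directly.  The Cyrillic entry below 
is invented purely for illustration, not taken from any real charset:

```perl
use strict;
use warnings;

# A single-byte charset is nothing more than a 256-entry map from
# byte value to Unicode codepoint.  Start from the identity mapping
# (which is Latin-1) and override one entry for demonstration.
my @to_unicode = (0 .. 255);
$to_unicode[0xE9] = 0x0439;   # pretend this charset puts Cyrillic here

sub decode_single_byte {
    my ($bytes) = @_;
    return join '', map { chr $to_unicode[$_] } unpack 'C*', $bytes;
}

printf "U+%04X\n", ord(decode_single_byte("\xE9"));   # U+0439
```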

However, it seems that Dan, the Encode maintainer, already has all the 
conversions nearly everyone needs... the issue is to get people to 
declare the charset in which their source file is encoded, so that it 
can be translated appropriately before use.

If "use encoding" is broken, then either fix it or abandon it, and 
replace it with something else.

I'd agree that it is sufficient to replace it with "use utf8;" but it 
might be friendlier to replace it with something that supports a variety 
of other charset encodings.

> This is where the ${^ENCODING} \x went wrong, too. \x is used with
> character numbers (according to documentation), which are charset
> thingies, and would thus be Unicode codepoints if it was compatible with
> the rest of Perl. Instead, with ^ENCODING in effect, they're seen as
> *bytes*, and then through *decoding* converted to unicode. Except, "of
> course", if the given number is > 0xff, then decoding is skipped and the
> value is used as a unicode codepoint directly.

That's not so bad for single-byte charsets; it gets more obscure for 
multi-byte and shift-in-shift-out charsets, but it is certainly workable.

But what the documentation _actually_ says ("encoding" page, "Do not mix 
multiple encodings" section), is that if, for any string, any \x{} is 
used with a value > 255, then all \x encodings will be Unicode codepoints.

So it appears (implementation-wise) that first the character string is 
parsed, and then, once it is fully known, if the utf8 flag is off, then 
the decoding operation is performed, else no decoding is performed.

This is workable, although confusing.  And it only happens under "use 
encoding", so if we scrap that and create something new, then we can 
document what is expected for the something new.  Let's call the 
something new "use stnew;" for now, where stnew stands for SomeThing 
NEW, and would probably be some combination of charset and encoding to 
document what the source code actually is.

So with "use stnew;" in effect, all \xXX and \x{} would be treated as 
from the source charset encoding, and any attempt to use a number 
greater than 255 with \x{}, and any use of \N, would be disallowed... 
except a new q operator would be created to allow Unicode... qu{} would 
assume that the whole string is Unicode.

> chr suffers from the same problem. However, ord is unaffected, and
> reports unicode codepoints. The symmetry between chr and ord is broken.

I'd agree this is a bug.  chr and ord should deal only in decoded 
character values.

No generality is lost for source code; chr( constant ) could be replaced 
mechanically with "\xXX" with no semantic loss.  And for variable 
parameters, having chr deal with anything except Unicode is bogus.
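What that symmetry looks like when chr and ord both speak Unicode 
codepoints, along with the mechanical chr-to-\xXX rewrite:

```perl
use strict;
use warnings;

# chr and ord as exact inverses over Unicode codepoints:
for my $cp (65, 0xE9, 0x263A) {
    die "asymmetric at $cp" unless ord(chr($cp)) == $cp;
}
print "chr/ord are symmetric\n";

# The mechanical rewrite: chr(constant) is the same as "\xXX".
die unless chr(0x41) eq "\x41";
```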

> (Oh, in case anyone doesn't know:
> charset (e.g. unicode) is  character <=> number
> encoding (e.g. UTF-8)  is     number <=> byte sequence)

Yep.  And the numbers in those two lines are the same.  And most 
charsets have only one encoding.  A few bigger ones have several encodings.

So for most charsets, single-byte ones, you have

character == number == byte

And this is exactly why people using ASCII+ single-byte charsets found 
it so easy to simply go ahead and use their charset rather than ASCII in 
source without declaring it.
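The two layers can be separated explicitly with the Encode module: chr 
and ord work at the charset layer (character <=> number), while encode 
and decode work at the encoding layer (number <=> byte sequence):

```perl
use strict;
use warnings;
use Encode qw(encode);

# charset layer: number -> character
my $char = chr(0xE9);                 # e-acute as an abstract character

# encoding layer: character -> byte sequence (here, UTF-8)
my $bytes = encode('UTF-8', $char);
printf "U+%04X encodes to: %s\n", ord($char),
    join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
# U+00E9 encodes to: C3 A9
```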

For all the rest of the charsets in the world, except Unicode, all the 
encodings were in terms of byte sequences (if this isn't true, please 
point me to the documentation of a charset that used something other 
than byte sequences; please omit the extinct IBM 6-bit code, it isn't 
relevant here).

Unicode is the first to suggest that bytes aren't good enough, and 
numbers should be bigger (and it is a great simplifying idea).

>> this is probably what most programmers meant when they coded \x...
> I think it is more useful to build on documentation than what people
> "probably meant". It's incredibly hard to find out what people probably
> meant, and it is also incredibly hard to change what people will
> probably mean in the future. However, finding out what documentation
> said and says is trivial, and changing documentation of the future
> version is also pretty easy, in comparison to changing people's
> intentions.

I totally agree that if the documentation is clear, that it should be 
used to interpret what people probably meant.

However, since the documentation says Perl source code is ASCII, but 
allowed arbitrary bytes, when people use non-ASCII, there is nothing 
left but to guess.  But for all the faults of "use encoding" it did give 
some guidelines... don't mix encodings, and \xXX is characters from the 
source encoding.

Pre-Perl-Unicode people didn't use \x{} or \N, so they had only \xXX, 
and I still suspect that old code that uses \xXX and is written in 
ASCII+ (for some charset definition), uses \xXX byte sequences that 
represent characters in the same charset as the source code... which 
fits the "use encoding" guidelines.

> The documentation of \x, in perlop, defines \x as "hex char" and "wide
> hex char". It does not say if this is a unicode codepoint or a character
> number in whatever "charset" that is loaded. Again, though, I stress
> that currently there is no way to express the requested charset, just
> the encoding. The character set of Perl appears to be Unicode,
> unconfigurably. The documentation of chr, in perlfunc, mentions ASCII
> and Unicode. Note especially that chr's documentation specifies that ord
> is the reverse of chr! So let's see ord's doc. Again, no mention of
> legacy charsets -- only ASCII, EBCDIC and Unicode.

Right, but the encoding pragma's man page gives guidelines that are 
reasonable, even if its implementation is broken.

> I point out that I do not exactly know how EBCDIC works in Perl. I fear
> that it's horribly incompatible with documentation, older perls, and
> newer perls, for any given version of unicode supporting Perl.
> encoding's documentation does not explicitly say what \x should do. It
> does, however, give several examples that clearly and strongly suggest
> that \x under "use encoding" creates BYTES, that are then DECODED. This
> is consistent with what it actually does.

I found the section "Do not mix multiple encodings" to be quite explicit 
about what \x should do, and does do.

> So what to do? Maybe indeed give up support for encodings the way it's
> done now, and add "use charset", to indeed provide support for a
> different charset. 

Agreed to here.

> All character number reporting and taking operations
> (including ord) should then use the given charset. 

Disagree with this.  chr and ord should deal only in Unicode.  All 
source code and string constants should be in the specified source encoding.

> Charset and encoding
> can be specified separately:
>     use charset 'unicode', encoded_as => 'utf-8';
>     use charset 'latin1';  # implies encoded_as => 'latin1';
>     use charset 'utf-8';   # warning: utf-8 is not a charset, did you
>                            # mean: use charset 'unicode', encoded_as =>
>                            # 'utf-8';?
>     use charset 'CP1251', encoded_as => 'utf-8';
>                            # may not immediately appear to make sense,
>                            # but I think this falls under Jarkko's "The
>                            # Perl Way", where Perl does not restrict you
>                            # in your choices. After all, there's no
>                            # reason that you could not encode your
>                            # non-unicode 0 .. 255 as UTF-8. Apparently
>                            # doing so is popular in the JSON community
>                            # too.

That might be a workable syntax; I'll continue to refer to "use stnew;" 
in this email.

Note that "use charset 'unicode', encoded_as => 'utf-16';" might not be 
possible, because if the file is encoded as utf-16, perl might not 
understand it well enough to read the pragma.

> So yes, \x may mean "character in the currently selected charset", but
> we'd first need a pragma to define the charset! Currently we only have a
> pragma to define the encoding, which at some points also changes the
> charset, and at others does not, and in some weird way makes \x mean
> "byte" rather than "char".
> A huge gap in this idea is that most legacy encodings do not actually
> define any kind of semantics, so which semantics would you use to
> uppercase an é (eacute) under the CP1252 charset? CP1252 is not defined
> in terms of Unicode codepoints (as far as I know; anyone have specs for
> me?), so using Unicode semantics would be a bit weird. ISO-8859-1,
> however, has been explicitly (retro-)defined in terms of Unicode
> codepoints, but the specification is still not clear on what semantics
> should be used. It depends on how recursively you read U+ numbers.
> But really, I honestly think it'd be a waste of tuits to design and
> implement all this. It'll probably just get designed and/or implemented
> wrong again anyway, because it's hard to oversee everything.

You might be right about the tuits, but most of the encoding/decoding 
code already exists, and that is the hardest part.  Note that the only 
semantics we need is "translation from that there charset/encoding to 
UTF-8 internally"... from there on, the use of Unicode semantics is 
very obvious, practical, and well-defined.

> Let's instead just deprecate ${^ENCODING}. If a new
> mechanism is needed, it's much easier to make it encoding-based rather
> than charset based, but this time implement that in a pure way: don't
> let charset-stuff creep in, so keep chr and \x in their unaffected
> unicody state.
>> Modern usage of \x to specify Unicode characters is probably erroneous, 
>> as \N{U+} should have been used.
> Regardless of whether this design was correct, it is there, widely used,
> and I strongly object against changing it now. Instead, I'd rather see
> perlop more explicitly state what \x does as a post-facto definition.
> I'll volunteer. Might as well change the misleading "wide char"
> definition of \x{} too -- "wide chars" elsewhere in Perl do not refer to
> the number of hex digits used when creating them :)
> \N{U+} is way too much typing too, by the way. PHP 6 has \uXXXX and
> \UXXXXXX that I secretly really like. Too bad our \u is taken already :)

Yeah, \N{U+} is cumbersome, but \x{} isn't much better (2 chars), and I 
think \x{} should be Unicode only, \xXX should be source encoding only, 
and the mixture prohibited in the same constant string.

>> Perl text strings are one-byte or multi-bytes encoded number sequences. 
> (Please excuse my liberal use of capital letters in the following
> paragraphs.)
> No. Perl text strings are sequences of characters.

From past discussions, I know you have different opinions about that. 
My viewpoint is more liberal than yours, in what is allowed and 
disallowed; your viewpoint is a workable and useful subset.

> The numbers, and their encoding, are INTERNAL. You can explicitly
> request the number (ordinal value) of any character with "ord". The
> characters themselves, are, in Perl's string model, NOT NUMBERS.

Yeah, they are internal until you use ord to get them and chr to put 
them back.  It is like putting dogs in a string of cages.  They are 
still dogs, but they can't bite me.  So when you put your numbers in a 
string (chr), they can't be compared to numbers any more, but they are 
still numbers.  When you let one of the dogs out of its cage (ord), 
then it can bite me... and the character can be used as a number.

> If 'A' was the number 65, then 'A' == 65 would be true, and 'A' + 1
> would be equal to 'B'. This is true in C, but not in Perl.

Just a matter of syntax (or cages).
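The cage metaphor in code (the numeric-context warning is deliberately 
silenced, since triggering it is the point):

```perl
use strict;
use warnings;
no warnings 'numeric';

# In Perl, 'A' is not the number 65: in numeric context the string
# 'A' numifies to 0, so the comparison is false.
print "'A' == 65 is ", ('A' == 65 ? "true" : "false"), "\n";  # false

# Let the dog out of its cage with ord, and it bites like a number:
print chr(ord('A') + 1), "\n";   # B -- the C-style 'A' + 1
```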

>> When all the characters are numerically less than 256, they may be 
>> one-byte encoded sequences; when any of the characters are numerically 
>> greater than 255, they must be multi-bytes encoded sequences.
> Encoding is INTERNAL. In the programming language, we have Unicode text
> strings, not UTF-8, not latin1, not ASCII. We don't have bytes, we have
> characters. Internally, yes, there certainly are bytes. This shines
> through in several places, and if the unpack discussion has given us
> anything other than headaches and backwards incompatibility, it's
> affirmation that perl5-porters thinks that such leakage of the internals
> is wrong and ought to be repaired.

Yes, the leakage of the internals is poor form.

>> The semantics are either ASCII (for one-byte encoded sequences) or UTF-8 
>> (for multi-bytes encoded sequences).
> Not everywhere, but only in some places. This is a bug in the string
> model's design. It's certainly historically explicable, but nowadays
> causes more trouble than it prevents.

You've never been able to give me an example of any place that has other 
than ASCII or Unicode/UTF-8 semantics.

There are certainly examples of weird, documented behavior when 
attempting to obtain Unicode semantics on a string that is internally 
one-byte encoded sequences.

It would be nice to have a pragma to fix all those places, and apply 
Unicode semantics universally, regardless of the internal encoding.
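For what it's worth, Perl did eventually grow exactly such a pragma: 
"use feature 'unicode_strings'", added in 5.12, applies Unicode 
semantics lexically regardless of the internal encoding.  A sketch 
(requires Perl 5.12 or later):

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # Unicode semantics, whatever the storage

my $s = "\xE9";                  # e-acute, internally a single byte
# Without the feature, uc() on a one-byte string historically did
# nothing here; with it, we get the Unicode answer:
print uc($s) eq "\xC9" ? "uppercased to E-acute\n" : "unchanged\n";
```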

> Note that between 5.6 and 5.8 the string model was changed: (Caveat
> porter: here follows a post-facto *simplified* view of history) unicode
> semantics were made standard (removing the need for "use utf8") and
> automatic internal encoding upgrading was added so that
> internally-not-yet-utf8-strings could be used with
> internally-already-utf8-strings. Theoretically, this removed the
> difference between internally-not-utf8-encoded and
> internally-utf8-encoded, but several operations, specifically those
> mentioned in Unicode::Semantics' documentation, still lag behind.
> Whether those operations were intentionally left to operate according to
> 5.6's string model, or if they were just forgotten, isn't really
> important anymore. Given the current string model, and Perl's defaulting
> to unicode semantics in all other places, lc, uc, charclasses, etcetera,
> should be changed to also support the new model. Even unpack was
> changed, and that didn't even necessarily have ANYTHING to do with text
> data!

Pack and unpack do have text data operations.

>> Expecting latin1 semantics for one-byte encoded sequences is a bug.
> Only because the ISO-8859 standard does not specify semantics.
> Assuming you meant to say "Expecting unicode semantics for one-byte
> encoded sequences is a bug", I strongly and loudly say NO.

With the current implementation, my statement stands.  I agree with your 
goal of implementing something so that Unicode semantics are used 
universally regardless of the internal encoding of strings.

> Do remember that strings that are internally encoded as latin1, are
> (should be) Unicode strings for all text operations.

Wrong.  Strings are either byte sequences or Unicode.  Some operations 
implement ASCII semantics, others Unicode semantics.  There is nothing 
Latin-1 in Perl, except the implied conversion of byte sequences to 
Unicode: when non-ASCII characters are encountered, the numeric value 
at that character position is used directly as the Unicode codepoint.  
That happens to be equivalent to a Latin-1-to-Unicode conversion, but 
only because of how Latin-1 is defined, not because Perl implements any 
Latin-1 semantics.

>> This would be a useful extension: could "use utf8 semantics;" be 
>> implemented which would affect this stuff?
> Perl 5.6 had this, and it was called "use utf8;". This is gone. "use
> utf8;" now only changes the way Perl interprets your source code, it no
> longer changes semantics. This is good, because doing that lexically
> really makes it hard to combine binary and string semantics in a single
> program.

Seems like it would only be a problem if you attempted to apply 
character semantics to a binary string ... but you don't admit to the 
existence of binary strings, so your code is safe.  But for those of us 
that admit to the existence of binary strings, applying character 
operations to them is just silly.  Uppercasing 43 is meaningless.

Perhaps I'm missing your point here, though.  What specifically do you 
see as hard to combine?

> Note that in Perl >= 5.8, the unicode-ness of a string is not stored
> internally. Instead, operations are either binary oriented or string
> oriented. This makes the type of the string CONTEXT SENSITIVE, rather
> than stored within the string.

Well, most code gets it wrong.  Implementing "use utf8 semantics" to 
apply to a particular scope seems better than playing the guessing game 
about "how is my data internally encoded, so what will this operator 
do".  People that already play the guessing game, and get it right (if 
there are any that do), won't have to use the pragma.  People that want 
simpler code, and uniform semantics, can use the pragma.

> Looks like numbers, doesn't it? If you use the string "123" with a
> numeric operation, it's automatically converted to a number. Not the
> internal flags and representation, but the CONTEXT defines the
> semantics. This model is also broken in a few places, specifically in
> bitwise operators, and that can hurt a lot and force people to fall back
> to type casting ("$foo" and 0+$foo). Perl 6 acknowledges that and
> introduces separate bitwise operators for strings and numbers.
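The bitwise-operator breakage Juerd mentions is easy to demonstrate:

```perl
use strict;
use warnings;

my $x = "11";
my $y = "22";

# Bitwise ops look at the operands' current representation, not the
# context: two strings are OR'd byte by byte.
print $x | $y, "\n";           # "33"  (0x31|0x32 = 0x33 = '3', twice)

# The "type cast" with 0+ forces numeric bitwise OR instead:
print 0 + $x | 0 + $y, "\n";   # 31   (0b01011 | 0b10110 = 0b11111)
```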

Glenn --
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
