
Re: use encoding 'utf8' bug for Latin-1 range

From: Juerd Waalboer
Date: February 27, 2008 02:23
Subject: Re: use encoding 'utf8' bug for Latin-1 range
Message ID: 20080227101914.GT13615@c4.convolution.nl
Glenn Linderman wrote on 2008-02-26 21:32 (-0800):
> >>With the above in mind, it sounds simple to specify that \x defines a 
> >>character in the character set of the source code...
> >There are a few problems with this. First of all, in Perl we don't
> >usually talk about charsets. The only charset that Perl really supports
> >is Unicode. Other character sets are implemented as *encodings* of
> >Unicode. That's why we talk about encodings, not charsets, in Perl. All
> >translations are done with Unicode in between, at least conceptually.
> You can't have an encoding without an charset.

While this is true, Perl currently handles only encodings and keeps the
charset at Unicode, except for \x and chr under ${^ENCODING}. Note that
many, many other operations use Unicode codepoints and semantics; \x and
chr are unfortunate exceptions.

> Perl supports ASCII and Unicode, they are compatible.  ASCII is the 
> default.  Unicode is turned on with "use utf8;".

No, Unicode is always on. "use utf8" only tells Perl that it should
decode your source code as UTF-8, and enables non-ASCII identifiers. It
has nothing to do with string values in general. You appear to be stuck
in the Perl 5.6 era.
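
For illustration, a minimal sketch of what "use utf8" does and does not
do (the accented word is only an example; the file would have to be
saved as UTF-8 for it to work):

    use strict;
    use warnings;
    use utf8;                     # declares only that this source file is UTF-8

    my $word = "naïve";           # the two UTF-8 octets of "ï" become one character
    print length($word), "\n";    # prints 5 (characters), not 6 (bytes)

    # Strings built at run time, e.g. via Encode::decode, are unaffected
    # either way: there is no switch that "turns Unicode on" for string values.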

> The definition of a single-byte charset is a simple lookup table with 
> 256 entries.  Double-byte charsets are harder, shift-in-shift-out 
> charsets are harder.

My point was that character sets do not have bytes. Encodings have bytes.

Specifically, ${^ENCODING} makes \x with a value under 0x100 define a
byte; complete sequences of such bytes are then decoded using the
specified encoding. Here, encoding and charset only happen to match each
other when you have a legacy (at most 8-bit) encoding.
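
A sketch of that behaviour under the encoding pragma (CP-1252 is chosen
only because its byte 0x80 maps to a character outside Latin-1, which
makes the decoding step visible):

    use strict;
    use warnings;
    use encoding 'cp1252';        # sets ${^ENCODING}

    my $euro = "\x80";            # a *byte*, decoded via CP-1252 => U+20AC (EURO SIGN)
    my $same = chr 0x80;          # chr below 0x100 goes through the encoding as well

    print "both are U+20AC\n"     # \x{} above 0xFF stays a Unicode codepoint
        if $euro eq "\x{20AC}" and $same eq "\x{20AC}";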

> If "use encoding" is broken, then either fix it or abandon it, and 
> replace it with something else.

Well, yes. I think I covered all three options, and my opinion is clear.

> But what the documentation _actually_ says ("encoding" page, "Do not mix 
> multiple encodings" section), is that if, for any string, any \x{} is 
> used with a value > 255, then all \x encodings will be Unicode codepoints.

Indeed, thanks for clarifying this.
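
For what it's worth, a sketch of the rule you quote, under the same
pragma as above:

    use encoding 'cp1252';

    my $a = "\x80";            # no \x{} above 255 in this literal:
                               # 0x80 is a CP-1252 byte, so $a is U+20AC
    my $b = "\x80\x{263A}";    # \x{263A} is above 255, so every \x escape in
                               # this literal is a Unicode codepoint: the first
                               # character of $b is U+0080, not U+20AC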

> So with "use stnew;" in effect, all \xXX and \x{} would be treated as 
> from the source charset encoding

Do you mean bytes or characters, this time? "charset encoding" is
ambiguous at best.

> except a new q operator would be created to allow Unicode... qu{} would 
> assume that the whole string is Unicode.

Don't forget to add qqu//, mu//, su///, tru///, ...

Unicodeness is not part of the string value; it is context sensitive. If
you use something with codepoints above 0xff in binary operations (like
print), you get warnings and UTF-8 is used instead of latin1 bytes.
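
A minimal sketch of that context sensitivity, assuming no encoding layer
has been set on STDOUT:

    my $narrow = "caf\x{E9}";    # U+00E9 fits in a byte: print emits the byte 0xE9
    print $narrow, "\n";         # no warning

    my $wide = "\x{263A}";       # U+263A does not fit in a byte
    print $wide, "\n";           # warns "Wide character in print" and emits the
                                 # UTF-8 octets E2 98 BA instead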

> And for variable parameters, having chr deal with anything except
> Unicode is bogus.

Either using legacy encodings/charsets is supported, or it is not.
Saying that it makes sense for the source encoding, but not for the
ordinal character values given to chr, makes little sense to me.

> Unicode is the first to suggest that bytes aren't good enough, and 
> numbers should be bigger (and it is a great simplifying idea).

Such a statement makes one wonder how Chinese could ever be used on a
computer without Unicode. Unicode was not the first character set with
more than 256 characters. And obviously every encoding that supports
more than 256 characters will need to support sequences of multiple
bytes to encode certain or all characters.

> Pre-Perl-Unicode people didn't use \x{} nor \N, so they had only \xXX, 
> and I still suspect that old code that uses \xXX and is written in 
> ASCII+ (for some charset definition), uses \xXX byte sequences that 
> represent characters in the same charset as the source code... which 
> fits the "use encoding" guidelines.

In other words: the current design with \x is not a mistake, not a
problem, and we can keep "use encoding" the way it is. I'm entirely fine
with that, but I will continue to vocally discourage its use.

However, if anyone is interested in fixing the problem, then by all
means please do it right, and make \x mean "character number" again.

> Note that "use charset 'unicode', encoded_as => 'utf-16';" might not be 
> possible, because if the file is encoded as utf-16, perl might not 
> understand it well enough to read the pragma.

Correct: both the encoding and the charset must be ASCII-compatible, at
least to a certain extent. UTF-16 and UTF-32, with all their null bytes,
are not sufficiently ASCII-compatible, although Perl could be made to
skip null bytes in source code.

Again: I think it's a waste of tuits to support non-ASCII, non-UTF-8
literal text strings and/or identifiers.

> >\N{U+} is way too much typing too, by the way. PHP 6 has \uXXXX and
> >\UXXXXXX that I secretly really like. Too bad our \u is taken already :)
> Yeah, \N{U+} is cumbersome, but \x{} isn't much better (2 chars), and I 
> think \x{} should be Unicode only, \xXX should be source encoding only, 
> and the mixture prohibited in the same constant string.

Regardless of what you think, this is simply not how \x works without
"use encoding", and I would really hate to see its meaning get turned
around to create bytes like that -- that might break insane amounts of
code all at once.
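
For reference, a minimal illustration of that current behaviour, with no
pragma in effect:

    my $a = "\xE9";       # character number 0xE9, LATIN SMALL LETTER E WITH ACUTE
    my $b = "\x{E9}";     # the same character, braced form
    my $c = chr 0xE9;     # the same character again
    print "all equal\n" if $a eq $b and $b eq $c;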

> >No. Perl text strings are sequences of characters.
> From past discussions, I know you have different opinions about that. 
> My viewpoint is more liberal than yours, in what is allowed and 
> disallowed

I refuse to argue over this again. The documentation is pretty clear
about this: modern Perl text strings consist of characters, not bytes.

Byte strings are also still supported, although they may not work
correctly anymore once you use them in text context, because they may
get upgraded if you do that.
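
A sketch of how that typically goes wrong, using Encode:

    use Encode qw(encode);

    my $octets = encode('UTF-8', "\x{263A}");   # binary: the three octets E2 98 BA
    my $text   = "smiley: \x{263A}";            # a text string

    my $joined = $text . $octets;   # the byte string is implicitly upgraded as if
                                    # it were latin1 text: E2 98 BA become the
                                    # characters U+00E2 U+0098 U+00BA, which is
                                    # almost never what was meant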

> > > The semantics are either ASCII (for one-byte encoded sequences) or
> > > UTF-8 (for multi-bytes encoded sequences).
> > Not everywhere, but only in some places. This is a bug in the string
> > model's design. It's certainly historically explicable, but nowadays
> > causes more trouble than it prevents.
> You've never been able to give me an example of any place that has other 
> than ASCII or Unicode/UTF-8 semantics.

This is not what I meant. I referred to the parts in parens: semantics
are chosen based on the state of the UTF8 flag only in some places.
Indeed all text operations do use ASCII or Unicode semantics.
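
One of those places, as it behaves on current perls (5.8/5.10):

    my $e_acute = "\xE9";      # stored as a single byte, UTF8 flag off
    print uc $e_acute;         # ASCII semantics: prints "\xE9" unchanged

    utf8::upgrade($e_acute);   # same string value, different internal encoding
    print uc $e_acute;         # Unicode semantics: prints "\xC9", LATIN CAPITAL
                               # LETTER E WITH ACUTE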

> It would be nice to have a pragma to fix all those places, and apply 
> Unicode semantics universally, regardless of the internal encoding.

I'd rather enable Unicode semantics by default, as per the 5.8 string
model changes. If needed, a pragma could be used to force ASCII
semantics for some operations. I am not at all convinced that such
functionality is needed; it would only be necessary if people do
lower/upper casing or charset stuff on combined ASCII/binary strings.

Combining text and binary in a single string is a bad idea and I would
hate to see people waste tuits on stretching Perl to support it in some
places. It can never be supported everywhere anyway, simply because it
fails to make sense. Marc Lehmann will probably disagree; I will not
discuss it because it's tiring. If anyone wants to implement support for
mixed binary/text strings, by all means go ahead. Just don't break
support for pure binary strings and for pure text strings, and I won't
even notice the difference.

> Pack and unpack do have text data operations.

Unfortunately, yes. I agree with the changes made, but didn't think they
were entirely necessary. Over time I have slightly changed my mind.
I still do think that one mess was replaced with another, but perhaps
this was the best the porters could do without introducing new keywords.

In any case, the changes were made and programmers will have to deal
with it. The easiest way to deal with it is to continue to fully
separate binary strings from text operations.

> > Do remember that strings that are internally encoded as latin1, are
> > (should be) Unicode strings for all text operations.
> Wrong. Strings are either byte sequences, or Unicode.

Indeed, I should have said:

    text strings that are internally encoded as latin1

instead of:

    strings that are internally encoded as latin1

for added clarity.

> but you don't admit to the existence of binary strings, so your code
> is safe.

You may want to read my document "perlunitut" again, which clearly
"admits" that binary strings exist, and defines them as:

    Binary strings, or byte strings are made of bytes. Here, you don't
    have characters, just bytes. All communication with the outside
    world (anything outside of your current Perl process) is done in
    binary.

See also "What about binary data, like images?" in perlunifaq.

> Perhaps I'm missing your point here, though.  What specifically do you 
> see as hard to combine?

Binary data with Unicode text data or semantics. The text data must be
*encoded* in some way to be binary, or the binary data will have to be
interpreted as text by *decoding* it. In any case, one of the two values
must concede to being coerced. In the absence of explicit decoding and
encoding, Perl will assume that binary data is latin1-encoded text data.
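
Spelled out with Encode, the explicit version is only a few lines (a
minimal sketch; the UTF-8 octets stand in for whatever arrives from
outside):

    use Encode qw(decode encode);

    my $octets_in  = "caf\xC3\xA9";                 # UTF-8 octets for "café"
    my $text       = decode('UTF-8', $octets_in);   # bytes in, characters out
    $text          = uc $text;                      # text operations see characters
    my $octets_out = encode('UTF-8', $text);        # characters in, bytes out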

Except, again, the buggy operations listed in Unicode::Semantics.

> Well, most code gets it wrong.  Implementing "use utf8 semantics" to 
> apply to a particular scope seems better than playing the guessing game 
> about "how is my data internally encoded, so what will this operator 
> do".

I agree that this guessing game is very problematic, but not with your
suggested way of solving it.

> People that already play the guessing game, and get it right (if 
> there are any that do), won't have to use the pragma.

Or they could explicitly force Unicode semantics on a per-string basis
by upgrading the strings. See also Unicode::Semantics.
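
The upgrade itself is one core call; on today's perls it changes which
semantics some operations pick:

    my $word = "fa\xE7ade";    # "ç" stored as a single byte: \w does not match it
    utf8::upgrade($word);      # force the internal UTF-8 representation
    print "all word characters\n" if $word =~ /^\w+$/;   # Unicode semantics now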

This is, however, a workaround for a bug.

> People that want simpler code, and uniform semantics, can use the
> pragma.

What you want to put into a pragma, I want enabled by default, just as
Unicode semantics are already the default in the rest of Perl. These
things were just left out, but that mistake can be corrected.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####@juerd.nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales@convolution.nl>
