Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs

From: demerphq
Date: May 20, 2008 05:10
Subject: Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs
Message ID: 9b18b3110805200510u78e88996tf7446ab0e541b196@mail.gmail.com

2008/5/20 John Peacock <john.peacock@havurah-software.org>:
> Marc Lehmann wrote:
..
>> Of course, this gets you in trouble:
>>
>>   my $s = chr 200; # not unicode, but native 8-bit(??)
>>   substr $s, 0, 0, chr 500;
>>   $s =~ /ü/; # now interpreted as unicode
>>
>> This is the insane part - I wouldn't expect even an expert perl programmer
>> to predict how $s gets interpreted here.
>
> This is a contrived example because you are going out of your way to
> manufacture bad code.  Just because you *can* use chr() with values > 255
> and Perl turns on the UTF8 flag in the supreme hope that you knew what you
> were doing, doesn't make this irredeemably broken.  You broke $s by mixing
> your string-types using a low-level function that has no knowledge of
> unicode semantics, *nor should it*.
>
> A more realistic example is a PV containing ASCII text has a UTF8 string
> concatenated to it.  This works as designed - the original string is
> upgraded to UTF8 and the second string appended and well-formed UTF8 is
> assured.

I think Marc's point was that Perl really has no business assuming the
string is actually Latin-1.

As Glen said, part of the problem with discussing this subject with
Marc is that he doesn't use the terms most commonly used here, or he
uses them in ways different from how they tend to be used here, and
he doesn't explain precisely how he is using them until after the
debate has become heated.

Hopefully I can summarize his point, which I think I finally get
(with help from Glen and Ben).

He says: string data has no character set association at all. It is
either an array of octets or an array of integers encoded as utf8.
The fact that the string may be encoded using utf8 sequences does not
mean that it actually contains Unicode data.

So, for instance, if I took a string containing the bytes which
represent "Hello World" in Chinese using Big5 and concatenated a
string containing chr(256) to it, the octets would be re-encoded as
utf8 directly, octet for octet, without any understanding of how Big5
actually represents strings. On an abstract level the string still
contains Big5, just now strangely double encoded as utf8.
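
Something like this minimal sketch shows what I mean (the Big5 byte
values are just an illustration of "some non-Latin-1 octets"):

  use strict;
  use warnings;

  # Raw Big5 octets for a short Chinese greeting -- to Perl they are
  # just bytes, not Latin-1 and not Unicode.
  my $big5 = "\xA7\x41\xA6\x6E";

  printf "before: utf8 flag %s\n", utf8::is_utf8($big5) ? "on" : "off";

  # Appending a character above 255 upgrades the whole string.  Each
  # Big5 octet is re-encoded as though it were a Latin-1 character,
  # so the Big5 data is now effectively double encoded as utf8.
  my $mixed = $big5 . chr(256);

  printf "after:  utf8 flag %s\n", utf8::is_utf8($mixed) ? "on" : "off";
  printf "code points: %s\n",
      join " ", map { sprintf "U+%04X", ord } split //, $mixed;
  # code points: U+00A7 U+0041 U+00A6 U+006E U+0100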

Where this gets confusing is that Perl does in fact assume Latin-1
semantics for its octet-based strings in a number of common cases,
such as case-insensitive matching and upper- and lower-casing, etc.
This is OK because these are places where the programmer explicitly
says "assume that this is character data encoded somehow or another".
But the "auto upgrade" behaviour is dangerous, as it means that
binary data is sometimes blindly re-encoded as utf8, even though it
may have been pure binary data.
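
For instance, here is a small sketch of the kind of thing that can
bite (the PNG signature bytes just stand in for any binary data):

  use strict;
  use warnings;
  use Encode ();

  # Four octets of pure binary data -- the start of the PNG
  # signature, standing in for anything read from a binary file.
  my $binary = "\x89\x50\x4E\x47";

  # Concatenating it onto a string that already carries a wide
  # character upgrades it blindly: the octet 0x89 becomes the
  # character U+0089.
  my $msg = "\x{263A} " . $binary;

  # If $msg is later written out as utf8, what used to be the single
  # octet 0x89 comes out as the two bytes C2 89 -- the binary data
  # has been silently mangled.
  printf "%s\n", unpack "H*", Encode::encode_utf8($msg);
  # e298ba20c289504e47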

The core of the problem is that the old C habit of conflating arrays
of octets with strings of characters has carried over to Perl in such
a way that we have a big mess, and it doesn't look easily resolvable.
Although I suspect that we are making a mountain out of a molehill
about the Win32 aspect of this problem.

I think Marc is right: the utf8 flag being off doesn't say "this data
is Latin-1" and the utf8 flag being on doesn't say "this data is
Unicode". The flag instead says "this is an array of octets" (when
off) or "this is an array of integers encoded as utf8" (when on).
The additional step of ascribing a character set to the encoding is
incorrect, and one that evolved out of the heritage of supporting
character-set style operations on pure octet encodings.
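
A quick way to see that, using nothing beyond the core utf8::
helpers:

  use strict;
  use warnings;

  # The same logical string stored two ways; the flag tracks the
  # internal encoding form, not any character set.
  my $octets = "caf\xE9";
  my $wide   = "caf\xE9";
  utf8::upgrade($wide);    # now stored internally as utf8

  printf "flags: %s / %s\n",
      map { utf8::is_utf8($_) ? "on" : "off" } $octets, $wide;
  # flags: off / on

  # Perl treats them as equal, because the flag makes no claim about
  # what the characters mean -- only about how they are stored.
  print "equal\n" if $octets eq $wide;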

Basically we have to remember that encoding and character set are
different things. ANSI is a character set, Latin-1 is a character
set, and Unicode is a character set. Octets are an encoding, and utf8
is an encoding.  We can have Latin-1 data encoded as utf8 that is
indistinguishable from Unicode encoded as utf8, and we can have ANSI
data encoded as utf8, which is not the same thing as converting ANSI
to Unicode stored as utf8.
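
For example (treating "ANSI" as Windows-1252 for the sake of the
sketch, and leaning on the Encode module):

  use strict;
  use warnings;
  use Encode ();

  # The octet 0x80 is the euro sign in Windows-1252.
  my $ansi = "\x80";

  # Blind upgrade: the octet becomes U+0080, an unrelated control
  # character -- this is "ANSI data encoded as utf8".
  my $upgraded = $ansi;
  utf8::upgrade($upgraded);

  # Real conversion: decode the octet *as* Windows-1252, giving
  # U+20AC -- the ANSI data actually converted to Unicode.
  my $decoded = Encode::decode("cp1252", $ansi);

  printf "upgraded: U+%04X\n", ord $upgraded;   # U+0080
  printf "decoded:  U+%04X\n", ord $decoded;    # U+20AC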

It's all very ripe for confusion. I think Marc is right; we should
really think about this. We have different parts of the code base
thinking about these issues in different ways, and a lot of confusion
involved. I personally think that if we can sort them out, even in a
not-100%-backwards-compatible way, then we will have made good
progress.

The issues I see are these:

1. We don't have a binary data type.  (We don't distinguish character
data from octet data, and it's easy to inadvertently cause one to be
treated as the other, with surprising results.)
2. We don't associate a character set with a string; we associate an
encoding with a string. Character set and encoding are orthogonal
concepts, despite being related. (See the sketch after this list.)
3. We use the name of an encoding of Unicode (utf8) as the name for
the internal encoding of a string, causing confusion.
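
The sketch mentioned in point 2, with the two concepts handled
separately by hand, which is about all we can do today (the Big5
octets are the same illustrative ones as above):

  use strict;
  use warnings;
  use Encode ();

  # Raw Big5 octets, as they might come off a disk or a socket.
  my $big5_octets = "\xA7\x41\xA6\x6E";

  # Decode: octets plus a *declared* character set give abstract
  # characters.  Perl can't guess this; the programmer has to say it.
  my $chars = Encode::decode("big5", $big5_octets);

  # Encode: the same characters can then be written in any encoding.
  my $as_utf8 = Encode::encode("UTF-8", $chars);
  my $as_big5 = Encode::encode("big5",  $chars);

  printf "utf8 octets: %s\n", unpack "H*", $as_utf8;
  printf "big5 octets: %s\n", unpack "H*", $as_big5;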

I'm not sure how we get out of this mess. Maybe by making PVs store
more information about their character set. With that information we
could convert strings correctly to Unicode when we need to.

Cheers,
yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
