2008/5/20 John Peacock <john.peacock@havurah-software.org>:
> Marc Lehmann wrote:
..
>> Of course, this gets you in trouble:
>>
>>    my $s = chr 200;          # not unicode, but native 8-bit(??)
>>    substr $s, 0, 0, chr 500;
>>    $s =~ /ü/;                # now interpreted as unicode
>>
>> This is the insane part - I wouldn't expect even an expert perl
>> programmer to predict how $s gets interpreted here.
>
> This is a contrived example because you are going out of your way to
> manufacture bad code. Just because you *can* use chr() with values > 255,
> and Perl turns on the UTF8 flag in the supreme hope that you knew what
> you were doing, doesn't make this irredeemably broken. You broke $s by
> mixing your string types using a low-level function that has no knowledge
> of unicode semantics, *nor should it*.
>
> A more realistic example is a PV containing ASCII text that has a UTF8
> string concatenated to it. This works as designed - the original string
> is upgraded to UTF8, the second string is appended, and well-formed UTF8
> is assured.

I think Marc's point was that Perl really has no business assuming the string is actually Latin-1.

As Glen said, part of the problem with dealing with Marc on this subject is that he doesn't use the terms most commonly used here, or he uses them in different ways than they tend to be used here, and he doesn't explain precisely how he is using them until after the debate has become heated. Hopefully I can try to summarize his point, which I think I finally get (with help from Glen and Ben).

He says: string data has no character set association at all. It is either an array of octets, or it is an array of integers encoded as utf8. The fact that the string may be encoded using utf8 sequences does not mean that it actually contains Unicode data.
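To make the flag flip concrete, here is a minimal sketch of Marc's example. It assumes only core perl; utf8::is_utf8() is always available and reports the internal flag, nothing else.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = chr 200;                      # a single octet, UTF8 flag off
printf "flag: %s\n", utf8::is_utf8($s) ? "on" : "off";    # flag: off

substr $s, 0, 0, chr 500;             # chr 500 carries the UTF8 flag
printf "flag: %s\n", utf8::is_utf8($s) ? "on" : "off";    # flag: on

# The octet 200 was silently upgraded as if it were Latin-1, so it now
# matches U+00C8 (LATIN CAPITAL LETTER E WITH GRAVE) - whether or not
# "Latin-1" was ever the programmer's intent.
printf "matches: %s\n", ( $s =~ /\x{C8}/ ) ? "yes" : "no"; # matches: yes
```

Nothing in the program ever said the octet 200 meant U+00C8; the upgrade assumed it.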
So for instance, if I took a string that contained the bytes representing "Hello World" in Chinese using Big5, and concatenated a string containing chr(256) to it, the octets would be re-encoded as utf8 directly, octet for octet, without any understanding of how Big5 actually represents strings. On an abstract level the string still contains Big5 data, just now strangely double-encoded as utf8.

Where this gets confusing is that Perl does in fact assume Latin-1 semantics for its octet-based strings in a number of common cases, such as case-insensitive matching and upper- and lower-casing, etc. This is OK, because these are places where the programmer explicitly says "assume that this is character data encoded somehow or other". But the "auto upgrade" behaviour is dangerous, as it means that binary data is sometimes blindly re-encoded as utf8, even though it may have been pure binary data.

The core of the problem is that the old C habit of conflating arrays of octets with strings of characters has carried over to Perl in such a way that we have a big mess, and it doesn't look easily resolvable. Although I suspect that we are making a mountain out of a molehill about the Win32 aspect of this problem.

I think Marc is right: the utf8 flag being off doesn't say "this data is Latin-1", and the utf8 flag being on doesn't say "this data is Unicode". The flag instead says "this is an array of octets" (when off) or "this is an array of integers encoded as utf8" (when on). The additional step of ascribing a character set to the encoding is incorrect, one that evolves out of the heritage of supporting character-set-style operations on pure octet encodings.

Basically we have to remember that encoding and character set are different things. ANSI is a character set, Latin-1 is a character set, Unicode is a character set. Octets are an encoding, and utf8 is an encoding.
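Here is a sketch of the Big5 example above, using the two octets 0xA4 0x40 (the Big5 encoding of 一, U+4E00) as a stand-in for the Chinese text; it assumes the core Encode module is available.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets that form ONE Big5 character; to Perl they are just octets.
my $big5 = "\xA4\x40";

# Concatenating a wide character upgrades each octet as if it were a
# Latin-1 code point: the Big5 data is now "double encoded" as utf8.
my $mixed = $big5 . chr 256;
printf "%vd\n", $mixed;     # 164.64.256 - still two code points of junk

# What the programmer meant requires naming the character set explicitly:
my $chars = decode( 'big5', $big5 ) . chr 256;
printf "%vd\n", $chars;     # 19968.256 - U+4E00 followed by U+0100
```

The only difference between the broken result and the correct one is that decode() was told which character set the octets were in - exactly the information the PV itself never carries.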
We can have Latin-1 data encoded as utf8, indistinguishable from Unicode encoded as utf8; and we can have ANSI data encoded as utf8, which is not the same thing as converting ANSI to Unicode stored as utf8. It's all very ripe for confusion.

I think Marc is right. We should really think about this. We have different parts of the code base thinking about these issues in different ways, and a lot of confusion involved. I personally think that if we can sort them out, even in a not-100%-backwards-compatible way, then we will have made good progress.

The issues I see are these:

1. We don't have a binary data type. (We don't distinguish character
   data from octet data, and it's easy to inadvertently cause one to be
   treated as the other, with surprising results.)

2. We don't associate a character set with a string; we associate an
   encoding with a string. Character set and encoding are orthogonal
   concepts, despite being related.

3. We use the name of an encoding of Unicode as the name for the
   encoding of a string, causing confusion.

I'm not sure how we get out of this mess. Maybe by making PVs store more information about their character set. With that information we could convert strings correctly to Unicode when we need to.

Cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"