demerphq wrote 2008-05-20 14:10 (+0200):
> Where this gets confusing is that Perl does in fact assume Latin-1
> semantics for its octet based strings in a number of common cases,

I think you mean "ASCII semantics" there. In these cases, the second
half of latin1 is ignored and left alone.

Latin1 was (re)defined as a Unicode encoding in 1998, which means that
0xE9 is no longer just something that looks like é, but defined as
U+00E9 LATIN SMALL LETTER E WITH ACUTE. This does, of course, have the
implication that the letter now has an uppercase variant in U+00C9
LATIN CAPITAL LETTER E WITH ACUTE, which is encoded as the 0xC9 byte in
latin1.

Perl ignores this part of the specification, and that's why I think
it's incorrect to call what Perl does "Latin-1 semantics".

In fact, latin1 semantics are pretty hard to describe: uc("\xff") (\xFF
is U+00FF LATIN SMALL LETTER Y WITH DIAERESIS) cannot be expressed in
latin1, because the uppercase of U+00FF is U+0178, which has no
representation in latin1.

> I think Marc is right, the utf8 flag being off doesn't say "this data
> is latin1" and the utf8 flag being on doesn't say "this data is
> Unicode". The flag instead says (when off) "this is array of
> characters" or "this is an array of integers encoded as utf8" (when
> on).

You're making a distinction between "characters" (SvUTF8 off) and
"integers" (SvUTF8 on) that I don't understand. Could you explain why
there is a difference and what that is?

> Latin-1 is a character set.

Latin-1 is both a character set and an encoding. The character set is
defined as equal to the first 256 characters in Unicode (U+0000 ..
U+00FF), and the encoding is defined as a straightforward 8-bit
encoding: U+0000 => 0x00 .. U+00FF => 0xFF. They even went as far as
describing how the individual bits are to be laid out in the byte.
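To make the points so far concrete, here is a minimal sketch (my own
illustration, not taken from the Perl source or from any specification).
It assumes that neither "use locale" nor "use feature 'unicode_strings'"
is in effect, and all variable names are made up for the example. It
shows the one-to-one Latin-1 encoding, how uc() on 0xE9 depends on the
SvUTF8 flag, and why the uppercase of U+00FF cannot be encoded as
latin1.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Latin-1 as an encoding: U+0000..U+00FF map one-to-one onto bytes 0x00..0xFF.
my $chars      = join '', map { chr } 0 .. 255;
my $bytes      = encode('ISO-8859-1', $chars);
my $one_to_one = join(',', unpack 'C*', $bytes) eq join(',', 0 .. 255);

# The same character in two internal representations.
my $plain    = "\xE9";       # SvUTF8 off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # SvUTF8 on, still the same single character

# Assumption: no locale and no unicode_strings feature in effect.
my $uc_plain    = uc $plain;       # ASCII semantics: 0xE9 is left alone
my $uc_upgraded = uc $upgraded;    # Unicode semantics: becomes U+00C9

# The uppercase of U+00FF is U+0178, which has no latin1 representation.
my $y = "\xFF";
utf8::upgrade($y);
my $uc_y = uc $y;

# FB_CROAK makes encode() die instead of substituting a replacement.
# Encode may modify its source argument when a check is given, so use a copy.
my $copy = $uc_y;
my $fits = eval { encode('ISO-8859-1', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;

printf "latin1 bytes equal code points: %s\n", $one_to_one ? 'yes' : 'no';
printf "uc, SvUTF8 off: U+%04X\n", ord $uc_plain;
printf "uc, SvUTF8 on:  U+%04X\n", ord $uc_upgraded;
printf "uc of U+00FF:   U+%04X, fits in latin1: %s\n",
    ord $uc_y, $fits ? 'yes' : 'no';
```

The two uc() calls get identical characters and differ only in the
internal flag, which is the behaviour under discussion in this thread.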
Not surprisingly, the 8 bits have weights from 128 to 1, where each
subsequent bit is half the value of the one before it :)

The specification uses the term "coded representation" rather than
"encoding".

> The issues i see are this:
> 1. We don't have a binary data type.

I intend to release a module that handles this in Perl space in a way
that is backward compatible to 5.000. Its name is BLOB.

One thing it doesn't do is prevent concatenation with non-BLOBs. I'd
like to learn whether this can be done at all.

> 3. We use the name of an encoding of Unicode as the name of for the
> encoding of a string causing confusion.

Indeed. Maybe it would be wise to start calling the internal
representation SvUTF8 encoding, rather than UTF8 encoding. Or maybe a
wholly different name.

> Maybe by making PV's store more information about their character set.

The Encode suite treats character sets as properties of encodings; the
user only has to deal with a single character set, namely Unicode. I
think that's the only sane approach. Information about the
charset/encoding does not have to be in the string, but belongs to
operations, as Marc aptly describes in the first post carrying this
subject.
-- 
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;