demerphq wrote on 2008-05-20 18:03 (+0200):
> > The Encode suite treats character sets as properties of encodings;
> Given how perl works internally does Encode have any other choice?

Sure. Since a string in Perl is just a sequence of numbered characters,
it could theoretically be used to represent any character set, not just
Unicode.

We tend to call Perl strings Unicode strings, but in reality the
Unicode-ness is not part of the string, but of the operation done on it.

It's a fair coincidence that the multibyte encoding chosen happens to be
a Unicode encoding ;)

> > the user only has to deal with a single character set, namely Unicode.
> Except er, they dont. As weve been discussing for ages now.

Encode combines "character set" and "byte encoding" into a single
mapping, which it calls "encoding".

Perl users can treat binary data as encoded text. A Perl programmer
decodes the binary data, and later encodes the text data back to binary.
They only specify the "encoding", and the character set is handled
transparently.

Let's call the latin1 character set "l1cs" and the latin1 encoding
"l1enc". The real transformation from UTF-8 to l1enc would be:

    UTF-8 -> Unicode -> l1cs -> l1enc

However, Perl provides a unified view of encodings, and bundles the
charset in them. What you're actually doing is

    UTF-8 -> (string of Unicode codepoints) -> latin1

And you don't have to care about the difference between l1cs and l1enc.

That's what I meant by: the character set is Unicode, and all other
character sets are handled by their encoding implementations.

> > I think that's the only sane approach. Information about the
> > charset/encoding does not have to be in the string, but belongs to
> > operations as Marc aptly describes the first post carrying this subject.
> I dont get you really. If you dont know what type of a data is
> contained in a string how can you know what the correct behaviour is
> for it for a given operation?

By declaring what you expect, so you don't have to know or guess.

Perl operators would expect Unicode text. uc(), lc(), character classes,
etcetera are all text operations. You don't use them on binary data.

Perl assumes that the character set of the string is Unicode, and uses
Unicode semantics. Or, it should.

In fact, I couldn't even *find* any other character set with clearly
defined semantics for things like upper/lower case. Unicode appears to be
unique in that. Oh, and ASCII of course :).
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####@juerd.nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales@convolution.nl>
1;
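
P.S. A minimal sketch of the round trip described in this message: decode
binary data into a string of Unicode codepoints, do the text operation on
that string, then encode back to binary. The sample bytes and the variable
names are made up for illustration; only decode() and encode() from the
Encode module are real API here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Binary input (illustrative): the UTF-8 encoding of "cafe" with an
# e-acute, where the e-acute is the two bytes C3 A9.
my $utf8_bytes = "caf\xC3\xA9";

# Decode: bytes -> string of Unicode codepoints.
my $text = decode('UTF-8', $utf8_bytes);

# Text operations such as uc() belong here, on the decoded string.
# The e-acute (U+00E9) becomes E-acute (U+00C9).
my $upper = uc $text;

# Encode: codepoints -> bytes. Only the encoding is named; the latin1
# character set mapping is bundled into it, so there is no separate
# "l1cs" step to write.
my $latin1_bytes = encode('ISO-8859-1', $upper);

printf "%d bytes in, %d characters, %d bytes out\n",
    length $utf8_bytes, length $text, length $latin1_bytes;
```

Note that uc() is applied to the decoded text, not to $utf8_bytes. That is
the "declare what you expect" point: the operation assumes Unicode text,
and the programmer is responsible for handing it text rather than bytes.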