On Sat, Mar 31, 2007 at 02:16:49AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:
> Marc Lehmann skribis 2007-03-31 2:12 (+0200):
> > Yes, and the exact same is true for unicode (both have a 1-1 mapping
> > between 0..255 and octets), trivially, of course, as unicode explicitly is
> > a superset of latin1.
>
> Unicode is a character set, not a character encoding.

As is latin1.

> A unicode string is a sequence of codepoints, not octets.

Nope. You can encode unicode codepoints into UTF-8 and still end up with a
unicode string. Encoding doesn't change the fact that it is unicode that you
are storing.

Since it seems hard to grasp, here is an example:

   my $s = "Hello, World!";
   $s = Encode::encode_utf8 $s;

$s contains the famous greeting before and after the encoding. It is still
an ASCII string, an iso-8859-15 string, a unicode string, and a text string.
Whether it is encoded or not does not change the fact that the string
contains the message "Hello, World!".

If you drop ASCII, the same is true for "Hallöchen!", which looks different
in UTF-8 than in an unencoded string, but it is still the same message. And
it is still using unicode to represent the characters.

The fact that you encode something does not change the something that you
encode. Making an arbitrary distinction only confuses the issue.

> They don't map 1:1 to octets either. To express a unicode string
> in octets, you need to encode it. For this, there are several
> possibilities, including UTF-8, UTF-16, ...

Sure. Octets are just things that store numbers between 0 and 255. The most
compact way to do that in Perl is using a string. That's also the most
natural way to represent bytes in Perl, closely followed by integers for
single bytes.

You do not store octets in latin1, or unicode, or whatever else in that
string. You are just using the most natural way to represent octets. And
that just happens to work, because Perl was designed to work that way.
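
The encode_utf8 example earlier in this message can be run as a small
script. What follows is only a minimal sketch, assuming the core Encode
module; the variable names are made up for illustration, and "\x{f6}" is
used for the "ö" so that the result does not depend on how the source file
is saved.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Pure ASCII: encoding to UTF-8 leaves the string byte-for-byte identical.
my $ascii   = "Hello, World!";
my $encoded = encode_utf8($ascii);
print "ASCII string unchanged by encoding\n" if $encoded eq $ascii;

# Non-ASCII: "\x{f6}" is the code point of "ö" (U+00F6).
# The encoded form is one element longer, because the single character
# "ö" becomes two octets in UTF-8.
my $text   = "Hall\x{f6}chen!";
my $octets = encode_utf8($text);
printf "characters: %d, octets: %d\n", length($text), length($octets);

# Decoding recovers the same sequence of code points, i.e. the same message.
print "round trip ok\n" if decode_utf8($octets) eq $text;

# Each element of the encoded string is just a number between 0 and 255.
print join(" ", map { sprintf "%02x", ord } split //, $octets), "\n";
```

The last line prints the octets as hex numbers, which shows the encoded
string being used as nothing more than a container for values in 0..255.
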
The mapping between perl bytes and octets is 1:1. ord and chr do it for
you, for example, and unpack "n" does it for you in case you encode/decode
two-byte entities.

unpack "C", however, does not map to octets in perl. That's the bug.

> Unicode is a superset of the latin1 character set, not the latin1
> character encoding. We'd need bigger bytes for the latter :)

Right. And Perl has those bigger bytes.

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcg@goof.com
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE