Moin,

On Monday 19 May 2008 17:26:55 Marc Lehmann wrote:
> On Sat, May 17, 2008 at 10:50:12AM -0700, Jan Dubois <jand@activestate.com> wrote:

[snip]

> > The brokenness right now is that when Perl automatically upgrades
> > this data to UTF8, it assumes that the data is Latin1 instead of
> > ANSI,
>
> Uhm, no, you are totally confused about how character handling is
> done in perl, and I cannot blame you (the many bugs and documentation
> mistakes combined make it hard to see what is meant).
>
> Strings in perl are simply concatenated characters, which in turn are
> represented by numbers.
>
> Perl doesn't store an encoding together with strings, only the
> programmer knows the encoding of strings.
>
> This is the correct way to approach unicode because it frees the
> programmer from tracking both external and internal encodings.

Uhm, excuse me? I don't think this actually frees the programmer from
tracking internal encodings, and especially not from tracking external
encodings.

Perl's "one-encoding-for-all" approach has the real-world problem that
you cannot easily mix strings without being very, very careful, or you
get garbage. Automatically, and without warning.

Most of the problems when you want to work with Unicode (even if you
_only_ want to use UTF-8, without even throwing UTF-16 into the mix)
come from how easy it is to have data that is encoded in neither UTF-8
nor Latin-1, mix it with UTF-8 (or encode it twice, or whatever), and
end up with garbage. Which is usually bad, as this very discussion
about ANSI shows :)

Or in other words: Perl "frees the programmer from tracking encodings"
by making him carefully track all strings as they come in and go out,
and then track which strings are internally in which encoding. And even
then you sometimes mix fire with water unintentionally. I don't think
that is ideal, as the many, many bugs I have found in my own (supposedly
working, bug-free) UTF-8-using Perl code show.
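To make the "mix fire with water" problem concrete, here is a minimal
sketch (the string content is just illustrative): concatenating a
not-yet-decoded UTF-8 byte string with an already-decoded character
string makes Perl silently upgrade the bytes as if they were Latin-1,
producing mojibake with no warning:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same text twice: once as raw UTF-8 octets (not decoded),
# once as a proper character string (decoded).
my $bytes = "Quarant\xC3\xA4ne";                  # 11 octets, "ä" is 2 bytes
my $chars = decode('UTF-8', "Quarant\xC3\xA4ne"); # 10 characters

# Both "contain" Quarantäne, but Perl sees different lengths:
print length($bytes), "\n";   # 11
print length($chars), "\n";   # 10

# Mixing them silently upgrades $bytes byte-by-byte as if it were
# Latin-1, so the two octets 0xC3 0xA4 become the two characters
# "Ã" and "¤" instead of the single character "ä". No warning.
my $mixed = $chars . $bytes;
print length($mixed), "\n";   # 21, not the 20 you might expect
```

Note that nothing in $bytes or $chars records which encoding the
programmer intended; the only thing Perl tracks is the internal
one-bit UTF8 flag, which is exactly the guesswork complained about
above.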
Not to mention that you actually lose the information about which
encoding the string originally had: "aa" looks the same in Latin-1 and
UTF-8, but depending on which encoding it "has", it acts differently
(at least that is what I remember from the regexp discussions).

It would be _much_ easier if all strings in Perl carried their encoding
with them, and Perl could simply mix two strings by automatically
upgrading them according to their encodings. Then you could also query
the encoding, btw. No more guesswork based on a single bit.

The current way (everything is either Latin-1 or UTF-8, and we only
have a single bit to distinguish between these two cases) is just a
pain, especially if you need something other than UTF-8.

Here is an example of what bit me today, just in case people think this
is a theoretical discussion. You have a UTF-8 regexp like the
following:

  my $skip = qr/Quarantäne/i;

You read in data and manually decode it from UTF-8 to match it against
the regexp:

  my $data = decode('utf-8', from_file());

  # much later in the file
  if ($data =~ $skip) { ... do something ... }

Now, some time later (maybe much later, and a different person),
somebody replaces the hand-rolled from_file() routine with something
that pre-parses the data. As a side effect, the data now arrives
already decoded from UTF-8. The second decode() then destroys the
data, because Perl does not know that the data was already decoded and
decodes it twice. Oops, new bug.

And this bug could have been prevented entirely if the string had been
properly tagged with its encoding; a double decode would then never
have been possible.

So while the current situation is "working" somehow, please do not
describe it as "ideal" :)

All the best,

Tels

-- 
 Signed on Mon May 19 18:11:35 2008 with key 0x93B84C15.
 Get one of my photo posters: http://bloodgate.com/posters
 PGP key on http://bloodgate.com/tels.asc or per email.

 "My glasses, my glasses. I cannot see without my glasses." -
 "My glasses, my glasses.
I cannot be seen without my glasses."
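To make the double-decode failure mode above reproducible, here is a
minimal, self-contained sketch of the described refactoring bug. The
hypothetical from_file() is simulated by a string literal, and the
exact corruption (replacement with U+FFFD) assumes Encode's default
lenient check mode:

```perl
use strict;
use warnings;
use Encode qw(decode);

# What the original from_file() returned: raw UTF-8 octets.
my $raw = "Quarant\xC3\xA4ne";

# Correct, original behaviour: decode exactly once.
my $correct = decode('UTF-8', $raw);    # "Quarantäne", 10 characters

# After the refactoring, from_file() already returns decoded
# characters, but the caller still calls decode() on the result.
my $already_decoded = decode('UTF-8', $raw);
my $broken = decode('UTF-8', $already_decoded);

# decode() expects octets. Given characters, it sees the byte 0xE4
# ("ä" in Latin-1), which is not valid UTF-8, so with Encode's
# default check mode the malformed byte is silently replaced by
# U+FFFD. No error, no warning, corrupted data.
print $broken eq $correct ? "ok\n" : "corrupted\n";
```

Because the string itself carries no encoding information, decode()
cannot notice that its input was already character data; it can only
mangle it.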