On Sat, May 17, 2008 at 10:50:12AM -0700, Jan Dubois <jand@activestate.com> wrote:
> actual strings without SvUTF8 set are encoded in the system default ANSI
> codepage

Not in perl, no. Strings in Perl aren't encoded at all. That's a basic fact.
They are encoded *inside* the perl interpreter, but on the Perl level,
strings are simply not encoded in any way. That's the whole point of Unicode
(and string handling in general) in Perl.

> text returned by qx() will be ANSI encoded,

Why? Programs can output whatever they want, it doesn't need to be
ANSI-encoded.

> filenames returned by readdir() will be ANSI encoded and so on.

Possibly (but of course only on Windows).

> This is just the nature of the 8-bit OS API.

Well, it isn't. On Windows, some parts of the API use encodings, and others
do not. On Unix, it's simpler in that low-level OS interfaces generally do
not care about character encodings.

> The brokenness right now is that when Perl automatically upgrades this
> data to UTF8, it assumes that the data is Latin1 instead of ANSI,

Uhm, no, you are totally confused about how character handling is done in
perl, and I cannot blame you (the many bugs and documentation mistakes
combined make it hard to see what is meant).

Strings in perl are simply concatenated characters, which in turn are
represented by numbers. Perl doesn't store an encoding together with
strings, only the programmer knows the encoding of strings. This is the
correct way to approach Unicode because it frees the programmer from
tracking both external and internal encodings.

Perl *cannot* know the encoding of a string unless the user tells it on a
case-by-case basis.

> potentially garbling the data if it contained codepoints where the
> current ANSI codepage and Latin1 are different.

I don't understand this at all.

> How would you want to "fix" this then?

There is nothing to fix. Your suggested change would completely break perl,
it is that simple.
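
A minimal sketch of the model described above, assuming only the built-in
utf8:: functions and the core Encode module. The choice of cp1252 as the
"ANSI" codepage is a hypothetical example, not something stated in this
thread. It shows that upgrading a string changes only its internal
representation, and that bytes are interpreted in a particular encoding only
when the programmer decodes them explicitly:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Four characters, all with code points below 256.
my $s = "caf\xE9";
my $t = $s;
utf8::upgrade($t);    # changes the internal representation only

# On the Perl level both strings hold the same characters.
my $same    = ($s eq $t && length($s) == length($t)) ? "same" : "different";
my $flag_s  = utf8::is_utf8($s) ? 1 : 0;    # internal flag before upgrade
my $flag_t  = utf8::is_utf8($t) ? 1 : 0;    # internal flag after upgrade
print "$same, flags: $flag_s $flag_t\n";

# Bytes from outside become text in a given encoding only through an
# explicit decode; cp1252 here is an assumption made by the programmer.
my $bytes = "\x80";                          # the euro sign in cp1252
my $text  = decode("cp1252", $bytes);
printf "U+%04X\n", ord($text);               # U+20AC
```

The internal flag differs between $s and $t, yet eq and length() treat the
two strings identically, which is why the flag says nothing about any
external encoding.
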
You assume perl knows how strings are encoded, which isn't reasonable (and
certainly not how it is implemented). Strings in perl are just
concatenations of codepoints.

> Only code making the explicit assumption that 8-bit strings are encoded
> in Latin1 is going to break.

No, you don't understand how perl treats strings at all, sorry.

And this is one of the problems with perl5-porters: too many people who have
no clue about the perl Unicode model have opinions, and that's why perl is
currently so broken: parts of the API (win32) effectively implement a model
that the perl core doesn't support, and vice versa.

> All code relying on the implicit conversion between 8-bit and UTF8 will
> actually be fixed and not broken by this change. :)

I really don't have the time to lecture on Unicode again and again to the
masses who don't understand it. I suggest working with Unicode and encodings
in perl for a few years *first*, as I did. I learned a great deal during
that time, as I use these features daily.

And as long as perl5-porters have little clue about Unicode, perl will
always stay too buggy for anything but experts to use (the experts
supposedly working around the bugs, which is extremely annoying).

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      pcg@goof.com
      -=====/_/_//_/\_,_/ /_/\_\