On Mon, 19 May 2008, Marc Lehmann wrote:
> On Sat, May 17, 2008 at 10:50:12AM -0700, Jan Dubois
> <jand@activestate.com> wrote:
> > actual strings without SvUTF8 set are encoded in the system default
> > ANSI codepage
>
> Not in perl, no.
>
> Strings in Perl aren't encoded at all. That's a basic fact. They are
> encoded *inside* the perl interpreter, but on the Perl level, strings
> are simply not encoded in any way.

I guess this is where the confusion originates. I'm not talking about
the Perl language level at all; I'm only talking about Perl interpreter
internals. I thought this was obvious from the reference to SvUTF8,
which doesn't exist at the Perl language level.

Inside the Perl internals there is an implicit assumption about the
encoding, at least on operating systems that care about this stuff at
the low level (e.g. Windows). On Windows all 8-bit APIs to the
filesystem expect filenames to be encoded in the CP_ACP (ANSI) codepage
(yes, I know you can switch this for some APIs to the OEM codepage,
CP_OEMCP, but please let's ignore that).

Since Perl (interpreter internals) doesn't perform any encoding changes
for filenames passed to and from the operating system APIs, it follows
that strings internally are assumed to be ANSI encoded. If you call

    open(my $fh, "<", $filename) or die;

the octets stored in $filename are interpreted according to the ANSI
codepage. This is documented in perluniintro.pod as using "native"
encoding for 8-bit strings.

With the introduction of the SvUTF8 flag we started having an alternate
encoding in the internals. Problems arise when we have to combine
strings in the native encoding with strings in the UTF8 encoding. Since
UTF8 is able to encode all code points encoded by the native encoding,
but not vice versa, we have to re-encode the native strings into UTF8
before we can perform operations like concatenation. This re-encoding
cannot be done without knowing (or assuming) the encoding of the native
strings.
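
To make the dependence on the assumed encoding concrete, here is a
minimal sketch. It assumes the ANSI codepage is cp1252 (an assumption
for illustration only; the real codepage depends on the system locale)
and uses utf8::upgrade() to stand in for the implicit re-encoding the
interpreter performs when it mixes an 8-bit string with a UTF8-flagged
one. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One octet as it might come back from an 8-bit Windows API.
# Assumption: the ANSI codepage is cp1252, where 0x80 is the euro sign.
my $native = "\x80";

# Upgrade under the Latin1 assumption: utf8::upgrade() maps each octet
# to the code point of the same value (on ASCII-based platforms).
my $as_latin1 = $native;
utf8::upgrade($as_latin1);

# Upgrade under the ANSI assumption: decode the octet as cp1252.
my $as_ansi = decode("cp1252", $native);

printf "Latin1 assumption: U+%04X\n", ord($as_latin1);   # U+0080
printf "cp1252 assumption: U+%04X\n", ord($as_ansi);     # U+20AC
```

The same octet ends up as two different code points, so the result of
a concatenation with a UTF8-flagged string depends entirely on which
encoding is assumed for the 8-bit side.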
Currently Perl doesn't have any code to treat the native encoding on
Windows correctly and therefore (incorrectly) assumes that all 8-bit
strings are Latin1 encoded. This is the part that needs to be fixed if
strings with SvUTF8 are going to be correct.

[...]

> I really don't have the time to lecture on unicode again and again to
> the masses who don't understand it.

I guess I don't understand why you are engaging in a discussion about
Unicode implementation on perl5-porters if you don't have the time to
actually argue your points, instead of handwaving in general terms
about how things should work at the high level and then calling people
who point out problems in the actual implementation clueless. Are you
just trolling?

Cheers,
-Jan