
RE: on the almost impossibility to write correct XS modules

From: Jan Dubois
Date: May 19, 2008 11:46
Subject: RE: on the almost impossibility to write correct XS modules
Message ID: 06b201c8b9e0$8ff54d50$afdfe7f0$@com

On Mon, 19 May 2008, Marc Lehmann wrote:
> On Sat, May 17, 2008 at 10:50:12AM -0700, Jan Dubois
> <jand@activestate.com> wrote:
> > actual strings without SvUTF8 set are encoded in the system default
> > ANSI codepage
>
> Not in perl, no.
>
> Strings in Perl aren't encoded at all. That's a basic fact. They are
> encoded *inside* the perl interpreter, but on the Perl level, strings
> are simply not encoded in any way.

I guess this is where the confusion originates. I'm not talking about
the Perl language level at all, I'm only talking about Perl interpreter
internals. I thought this was obvious from the reference to SvUTF8,
which doesn't exist at the Perl language level.

Inside the Perl internals there is an implicit assumption about the
encoding, at least on operating systems that care about this stuff at
the low level (e.g. Windows). On Windows all 8-bit APIs to the
filesystem expect filenames to be encoded in the CP_ACP (ANSI) codepage
(yes, I know you can switch this for some APIs to CP_OEM, but please
let's ignore that).

Since Perl (interpreter internals) doesn't perform any encoding changes
for filenames passed to and from the operating system APIs, it
therefore follows that strings internally are assumed to be ANSI
encoded. If you call

    open(my $fh, "<", $filename) or die;

the octets stored in $filename are interpreted according to the ANSI
codepage. This is documented in perluniintro.pod as using "native"
encoding for 8-bit strings.

With the introduction of the SvUTF8 flag we started having an alternate
encoding in the internals.  Problems arise when we have to combine strings
from the native encoding with the UTF8 encoding.  Since UTF8 is able to encode
all code points encoded by the native encoding, but not vice versa, we
have to re-encode the native strings into UTF8 before we can perform
operations like concatenation.  This re-encoding cannot be done without
knowing (or assuming) the encoding of the native strings.

Currently Perl doesn't have any code to treat the native encoding on
Windows correctly and therefore (incorrectly) assumes that all 8-bit
strings are Latin1 encoded.  This is the part that needs to be fixed
if strings with SvUTF8 are going to be correct.
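
The Latin1 assumption described above can be observed from the Perl
level with a minimal sketch. It assumes the octet 0x80 was meant as the
euro sign in Windows codepage 1252 (chosen here as one example of an
ANSI codepage), and it uses the core Encode module to show what an
explicit decode would give instead. The variable names are illustrative
only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One octet without SvUTF8. In CP1252 (an assumed ANSI codepage) the
# octet 0x80 is the euro sign, U+20AC.
my $native = "\x80";

# A string that needs SvUTF8 because it holds a code point above 0xFF.
my $wide = "\x{263A}";

# Concatenation upgrades $native to UTF8 and treats its octet as Latin1,
# so the first character becomes U+0080 rather than U+20AC.
my $joined = $native . $wide;
printf "implicit upgrade: U+%04X\n", ord($joined);    # U+0080

# Decoding explicitly with the intended codepage gives the euro sign.
my $decoded = decode('cp1252', "$native");
printf "explicit decode:  U+%04X\n", ord($decoded);   # U+20AC
```

The two results differ only for octets where the ANSI codepage and
Latin1 disagree, which for CP1252 is the range 0x80 to 0x9F.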

[...]

> I really don't have the time to lecture on unicode again and again to
> the masses who don't understand it.

I guess I don't understand why you are engaging in a discussion about
Unicode implementation on perl5-porters if you don't have the time to
actually argue your points instead of your general handwaving about
how things should work at the high level and then calling people who 
point out problems in the actual implementation clueless.  Are you
just trolling?

Cheers,
-Jan


