Re: on the almost impossibility to write correct XS modules

From: Marc Lehmann
Date: May 19, 2008 08:27
Subject: Re: on the almost impossibility to write correct XS modules
Message ID: 20080519152655.GD28206@schmorp.de

On Sat, May 17, 2008 at 10:50:12AM -0700, Jan Dubois <jand@activestate.com> wrote:
> actual strings without SvUTF8 set are encoded in the system default ANSI
> codepage

Not in perl, no.

Strings in Perl aren't encoded at all. That's a basic fact. They are encoded
*inside* the perl interpreter, but on the Perl level, strings are simply not
encoded in any way.

That's the whole point of unicode (and string handling in general) in Perl.
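
A minimal sketch of the distinction between the interpreter's internal storage and what is visible on the Perl level. The variable names are made up for illustration; utf8::upgrade and utf8::is_utf8 are the core functions for changing and inspecting the internal representation.

```perl
use strict;
use warnings;

# Four characters; the last one is code point 0xE9.
my $plain    = "caf\xE9";
my $upgraded = $plain;

# Switch the copy to the internal UTF-8 representation.
# This changes how the interpreter stores it, not which characters it holds.
utf8::upgrade($upgraded);

my $same_string  = $plain eq $upgraded ? 1 : 0;
my $same_length  = length($plain) == length($upgraded) ? 1 : 0;
my $flag_differs = (utf8::is_utf8($plain) xor utf8::is_utf8($upgraded)) ? 1 : 0;

print "eq=$same_string len=$same_length flag_differs=$flag_differs\n";
# prints: eq=1 len=1 flag_differs=1
```

The two scalars differ only in the internal flag; compared as Perl strings they are the same four characters.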

> text returned by qx() will be ANSI encoded, 
   
Why? Programs can output whatever they want; the output doesn't need to be
ANSI-encoded.

> filenames returned by readdir() will be ANSI encoded and so on.

Possibly (but of course only on windows).

> This is just the nature of the 8-bit OS API.

Well, it isn't. On windows, some parts of the API use encodings, and others
do not. On unix, it's simpler in that low-level OS interfaces generally do
not care about character encodings.

> The brokenness right now is that when Perl automatically upgrades this
> data to UTF8, it assumes that the data is Latin1 instead of ANSI,

Uhm, no, you are totally confused about how character handling is done in
perl, and I cannot blame you (the many bugs and documentation mistakes
combined make it hard to see what is meant).

Strings in perl are simply concatenated characters, which in turn are
represented by numbers.
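
To make that concrete, here is a small sketch (the code points are arbitrary examples, not taken from this thread) that builds a string from numbers with chr and reads the same numbers back with ord:

```perl
use strict;
use warnings;

# Build a string from three numbers: 0x41, 0xE9 and 0x263A.
my $str = join '', map { chr($_) } 0x41, 0xE9, 0x263A;

# Read the numbers back out, one per character.
my @codepoints = map { ord($_) } split //, $str;

printf "length=%d codepoints=%s\n",
    length($str),
    join(' ', map { sprintf 'U+%04X', $_ } @codepoints);
# prints: length=3 codepoints=U+0041 U+00E9 U+263A
```

Nothing in this round trip mentions an encoding; the string is just the three numbers in order.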

Perl doesn't store an encoding together with strings, only the programmer
knows the encoding of strings.

This is the correct way to approach unicode because it frees the programmer
from tracking both external and internal encodings.

Perl *cannot* know the encoding of a string unless the user tells it on a
case-by-case basis.
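
As an illustration of "telling it", a sketch using the Encode module (my choice here, as are the encoding names and the sample byte; none of them come from the message being replied to). The same octet becomes a different character depending on which encoding the programmer names:

```perl
use strict;
use warnings;
use Encode qw(decode);

# One octet as it might arrive from a file, a pipe or an OS call.
my $bytes = "\x80";

# Only the programmer can say which encoding those octets are in.
my $as_cp1252 = decode('cp1252',     $bytes);
my $as_latin1 = decode('iso-8859-1', $bytes);

printf "cp1252: U+%04X, latin1: U+%04X\n", ord($as_cp1252), ord($as_latin1);
# prints: cp1252: U+20AC, latin1: U+0080
```

After decoding, each result is an ordinary Perl string of code points, and no record of the source encoding is kept with it.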

> potentially garbling the data if it contained codepoints where the
> current ANSI codepage and Latin1 are different.

I don't understand this at all.

> How would you want to "fix" this then?

There is nothing to fix. Your suggested change would completely break
perl, it is that simple.

You assume perl knows how strings are encoded, which isn't reasonable
(and certainly not how it is implemented). Strings in perl are just
concatenations of codepoints.

> Only code making the explicit assumption that 8-bit strings are encoded
> in Latin1 is going to break. 

No, you don't understand how perl treats strings at all, sorry.

And this is one of the problems with perl5-porters: too many people who have
no clue about the perl unicode model have opinions, and that's why perl is
currently so broken: parts of the API (win32) effectively implement a model
that the perl core doesn't support, and vice versa.

> All code relying on the implicit conversion between 8-bit and UTF8 will
> actually be fixed and not broken by this change. :)

I really don't have the time to lecture on unicode again and again to
the masses who don't understand it. I suggest working with unicode and
encodings in perl for a few years *first*, as I did. I learned a great
deal during that time, as I use these features daily.

And as long as perl5-porters have little clue about unicode, perl will
always stay too buggy for anything but experts to use (the experts
supposedly working around the bugs, which is extremely annoying).

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      pcg@goof.com
      -=====/_/_//_/\_,_/ /_/\_\
