
Re: on the almost impossibility to write correct XS modules

From: Marc Lehmann
Date: May 19, 2008 13:20
Subject: Re: on the almost impossibility to write correct XS modules
Message ID: 20080519201911.GC28949@schmorp.de
On Mon, May 19, 2008 at 04:50:42AM +0200, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> > > encoding whereas they really are ANSI encoded. So once the
> > > automatic upgrading assumes ANSI encoding instead of Latin-1,
> > > everything should be working correctly, no?
> > 
> > Uhm.... that one can even suggest such brokenness :)
> > 
> > Of course basically everything will break, you mean, because
> > the assumption that its not latin1 of course breaks roughly all
> > code dealing with unicode in perl, which doesn't expect that
> > perl suddenly uses ANSI instead of unicode codepoints (they
> > differ!).
> 
> Backtracking a bit here, why would this break anything? For
> strings coming out of the Win32 API, immediately decode them to
> characters; for strings going in, upgrade them to characters if
> necessary, then encode them to ANSI at the last moment.

Great idea, but the basic question is "what are characters"? Obviously,
you cannot mean characters in the sense of "letters/glyphs, character
codepoints etc.", because perl doesn't store this information (for example,
when you load a jpg image into some scalar, you don't have a string
composed of "characters", but only octets).

If you mean "codepoints/numerical values" with "characters", then you lose
the information about their encoding.

However, *your* idea would mostly work *iff* you only ever used operating
system interfaces when dealing with filenames.

This is, however, not the case: consider prompting the user for a filename
with a Gtk+ entry, or using a commandline argument as a filename.

In all those cases, perl cannot know that those strings are filenames, and
when asked to "open" them, might assume they are encoded in "characters"
(whatever they are), when in fact, they are encoded in "utf-8", "koi8-r",
"euc-jp" or so.

Such a model is workable, but there would need to be a defined way to convert
external filenames (e.g. on the commandline) into something perl's open
understands.
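
To make the ambiguity concrete, here is a rough sketch (in Python rather
than Perl, purely for illustration) of what such an explicit conversion
step might look like. The helper name decode_external_filename and the
sample octets are assumptions made up for this example; nothing like it
exists in perl today.

```python
# Hypothetical helper: the name and signature are illustrative only.
def decode_external_filename(raw: bytes, encoding: str) -> str:
    """Turn octets from the outside world into a character string."""
    return raw.decode(encoding)

# Two octets as they might arrive on a command line or from a GUI toolkit.
raw = b"\xd0\xa4"

# The same octets mean different things depending on the assumed encoding.
as_utf8 = decode_external_filename(raw, "utf-8")      # one character, U+0424
as_koi8r = decode_external_filename(raw, "koi8-r")    # two other characters
as_latin1 = decode_external_filename(raw, "latin-1")  # U+00D0 U+00A4
```

The point is only that the caller has to say which encoding applies; the
octets themselves don't carry that information.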

> That way, no one ever needs to care that filenames are in ANSI,
> because as far as Perl code is concerned it always gets them as
> character strings.

If that were possible, sure.

_however_

note that Jan didn't propose that. He said the "automatic upgrading"
should change its interpretation from the current one, where 0..255 become
0..255, to something else, where e.g. character values suddenly change
codepoints (or, equally bad, change their interpretation).
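
A small sketch of why the two interpretations differ (Python, for
illustration only; taking Windows-1252 as the "ANSI" code page is an
assumption, the actual code page depends on the system):

```python
octet = b"\x80"

# Latin-1 interpretation: the value 0x80 stays 0x80.
latin1_cp = ord(octet.decode("latin-1"))

# Windows-1252 interpretation: the same octet becomes U+20AC (EURO SIGN).
ansi_cp = ord(octet.decode("cp1252"))

# Under Latin-1, every value 0..255 maps to the codepoint of the same number.
identity = all(ord(bytes([i]).decode("latin-1")) == i for i in range(256))
```

So a value that was 0x80 before the upgrade would no longer be 0x80
afterwards under the ANSI interpretation.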

As the name "automatic" implies, perl does this kind of upgrading
automatically, and to my knowledge it is not documented anywhere where this
happens (nor are there any guarantees that it doesn't happen). This is
because "automatic upgrading" is assumed to be something that doesn't change
the string itself.
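
As a rough model of that assumption (Python, illustrative only; the two
byte strings merely stand in for perl's two internal storage forms):

```python
s = "caf\xe9"                 # four characters, all in the range 0..255

narrow = s.encode("latin-1")  # one octet per character: 4 octets
wide = s.encode("utf-8")      # "upgraded" storage: 5 octets

# The stored octets differ, but the codepoints they represent do not.
codepoints_narrow = [ord(c) for c in narrow.decode("latin-1")]
codepoints_wide = [ord(c) for c in wide.decode("utf-8")]
```

Only the storage changes; code looking at the string's values sees the
same thing before and after.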

Jan proposes to actually change the string itself (on the perl level) on
those automatic upgrades, and this is what breaks perl, because suddenly
all the internals are exposed to perl code and, worse, your string
interpretation changes at undocumented points that you have to track
yourself.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      pcg@goof.com
      -=====/_/_//_/\_,_/ /_/\_\


