perl.perl5.porters | Postings from May 2008

Re: on the almost impossibility to write correct XS modules

Marc Lehmann
May 19, 2008 13:40
On Sun, May 18, 2008 at 08:40:33AM -0700, Jan Dubois wrote:
> > So you would have
> >     "\xff"
> >     substr "\xff\x{100}", 0, 1
> > be different?
> Potentially, yes.  That's what you get for mixing byte and character semantics.
> > Or would you have ord("\xff") != 0xff?
> No, "\xff" is guaranteed to have byte semantics for backwards compatibility:

So the \xff in the substr example has different semantics for backwards
compatibility but still you get different results?

Come on, it cannot work that way.
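
To make the objection concrete, here is a minimal sketch comparing the
literal with the substr result. The variable names are made up for
illustration, and it assumes no "use utf8" or locale pragma is in effect.
The internal flag is printed only to show that it can differ while the
strings themselves do not:

```perl
use strict;
use warnings;

# hypothetical names, for illustration only
my $plain = "\xff";                       # a one-character literal
my $cut   = substr("\xff\x{100}", 0, 1);  # the same character, cut from a wide string

# at the perl level the two are the same string
print "same string: ", ($plain eq $cut ? "yes" : "no"), "\n";   # yes
print "ord: ", ord($plain), " / ", ord($cut), "\n";             # 255 / 255

# the internal flag typically differs, which is exactly why it says
# nothing about what the string means
print "utf8 flag: ", (utf8::is_utf8($plain) ? 1 : 0), " / ",
                     (utf8::is_utf8($cut)   ? 1 : 0), "\n";
```

If these two values had different semantics, every string operation that
happens to go through a wide string would silently change meaning.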

> perluniintro.pod:

perluniintro is simply wrong w.r.t. the current implementation, and
inconsistent overall. quoting it is not helpful.

> > I would say Win32 should exclusively use the Unicode APIs, and treat
> > 8-bit strings the same as their upgraded equivalents (that is, as
> > ISO8859-1). This may break code that reads ANSI-encoded data from a file
> > under the assumption it will be passed to the 8-bit API, of course.
> This is exactly the problem: the C runtime library on Windows assumes
> that every char* argument is encoded in the system's ANSI codepage, and
> not in Latin1.

Note that unix basically does the same (figure in "locale"/"user
expectancy" etc. instead of "ANSI").

This issue has nothing to do with windows.

> So every XS extension would have to not only check for the UTF8 flag and
> use a Unicode API when available, but also convert all strings passed without
> UTF8 flag from Latin1 to ANSI when calling third-party libraries that don't
> provide a Unicode API.

The problem is that the utf-8 flag simply has no meaning w.r.t. encoding or
not. It is an internal flag, and forcing the user to track its state is just

Note that perl doesn't implement this either: perl's open, for example,
treats filenames without the utf-8 flag not as latin1 on unix, but as octet
strings (in the encoding the user wants), while it treats filenames with
the utf-8 flag set as "utf-8" encoded.

This is in direct contradiction to perluniintro, like so many other things.

Quoting perluniintro when it's so fundamentally broken w.r.t. the existing
implementation is not helpful.

> > Perl explicitly documents that 8-bit data is treated as ISO8859-1,
> > except on EBCDIC platforms.
> I know that this is the way it works now, but that was not the original
> intent.  If you read perluniintro, you'll see:

again, perluniintro doesn't describe the original model, nor current

> Note the explicit reference to "whatever the native character set is".

Yes, the reference is there, but the manpage is simply wrong, no matter how
often you quote it.

Here is a typical example of what confuses users when they read such crap:

"well, how do I handle koi8-r data?"

well, you cannot, because perl, according to that manpage, forces all strings
to either native character encoding or unicode. This is fortunately not so:
the manpage is wrong.

another gem:

   The principle is that Perl tries to keep its data
   as eight-bit bytes for as long as possible, but as soon as
   Unicodeness cannot be avoided, the data is transparently upgraded to

"how does perl know how my string is encoded when it transparently upgrades

well, it doesn't, this is why perl doesn't change the string w.r.t. the
perl level when it "transparently" upgrades. Oh yes, except in open (where
it suddenly becomes utf-8, although it is not utf-8 in perl), many xs
modules (where you don't know what it does) or in the Win32 module, where
it interprets filenames either as unicode (not utf-8 as open does) or a
local encoding.
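
The perl-level half of that claim can be checked with a minimal sketch.
The variable names are illustrative, and utf8::upgrade is used here as a
stand-in for whatever operation triggers the "transparent" upgrade:

```perl
use strict;
use warnings;

# hypothetical names, for illustration only
my $octets   = "\xe4\xf6\xfc";   # three octets; perl has no idea what they encode
my $upgraded = $octets;
utf8::upgrade($upgraded);        # changes the internal representation only

# nothing observable at the perl level has changed
print "equal:  ", ($octets eq $upgraded ? "yes" : "no"), "\n";      # yes
print "length: ", length($octets), " / ", length($upgraded), "\n";  # 3 / 3
print "ord:    ", ord($octets), " / ", ord($upgraded), "\n";        # 228 / 228
```

Code that looks at the internal flag and picks an encoding from it will
treat these two equal strings differently, which is the inconsistency
described above.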

perluniintro is confusing, self-contradicting, and not helpful. you can do
a lot with it, but not prove a point.

> If we expected all external data to always be converted to Latin1, then
> we could have saved us the trouble of having 2 different internal
> representations and always gone straight to UTF8.

Welcome to the real world: handling binary data as utf-8 is extremely
inefficient. The trouble was made so perl is still capable of producing
something that's not extremely slow by design, not to allow different
