Re: on the almost impossibility to write correct XS modules

From: Robin Redeker
Date: May 22, 2008 17:22
Subject: Re: on the almost impossibility to write correct XS modules
Message ID: 20080522145302.GA6341@elmex
On Mon, May 19, 2008 at 06:37:11PM +0200, Tels wrote:
> Hi,
> 
> On Monday 19 May 2008 17:26:55 Marc Lehmann wrote:
> > On Sat, May 17, 2008 at 10:50:12AM -0700, Jan Dubois
> > <jand@activestate.com> wrote:
> [snip]
> 
> > > The brokenness right now is that when Perl automatically upgrades
> > > this data to UTF8, it assumes that the data is Latin1 instead of
> > > ANSI,
> >
> > Uhm, no, you are totally confused about how character handling is
> > done in perl, and I cannot blame you (the many bugs and documentation
> > mistakes combined make it hard to see what is meant).
> >
> > Strings in perl are simply concatenated characters, which in turn are
> > represented by numbers.
> >
> > Perl doesn't store an encoding together with strings, only the
> > programmer knows the encoding of strings.
> >
> > This is the correct way to approach unicode because it frees the
> > programmer from tracking both external and internal encodings.
> 
> Uhm, excuse me? I don't think this actually frees the programmer from 
> tracking internal encodings and especially not tracking of external 
> encodings.
> 
> Perl's "one-encoding-for-all" approach has the real world problem that 
> you cannot easily mix strings without being very very very very 
> careful, or you get garbage. Automatically and without warning.
> 

I have always understood strings at the Perl level as very powerful
"Unicode" strings: strings whose characters can have values above 255,
and even above 32-bit integers.

And I think this is the only way Perl can go. All other solutions, such
as introducing implicit or explicit typing, would feel very awkward IMO.

Tracking internal encodings should not and MUST NOT be necessary, not
even for an XS programmer who just wants to write a simple binding.

Also, all the noise about the utf8 flag just confuses users. It is,
after all, just an optimization that lets us store character values in
the range 0-255 with less overhead INTERNALLY.
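
To illustrate what "just an optimization" means, here is a minimal
sketch (my illustration, not anything from core): the same logical
string compares equal no matter which internal representation perl
happens to use.

   use strict;
   use warnings;

   my $native   = "caf\x{e9}";    # E WITH ACUTE, fits in one byte
   my $upgraded = $native;
   utf8::upgrade($upgraded);      # force the internal UTF-8 representation

   # The internal flag differs, but the strings are (and must be) equal:
   print utf8::is_utf8($native)   ? "native: flag on\n"   : "native: flag off\n";
   print utf8::is_utf8($upgraded) ? "upgraded: flag on\n" : "upgraded: flag off\n";
   print $native eq $upgraded     ? "equal\n" : "BROKEN\n"; # prints "equal"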

So, why not just fix all the places where this simple and useful
model breaks?

Tracking the external encoding will always be necessary; programmers
should know what data they process, and if they don't, they need to
write code to track it anyway:

  use Encode qw(encode);

  binmode STDIN;
  my $jpeg = do { local $/; <STDIN> };
  $jpeg = uc $jpeg; # programmer's fault: uc interprets $jpeg as a
                    # Unicode string and uppercases its characters

  my $unicodestring = ...;
  my $string = encode ('cp1250', $unicodestring);
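
To spell out the full round trip (my sketch; the byte values are just
an example): decode on the way in, encode on the way out, with the
programmer supplying both encodings.

   use strict;
   use warnings;
   use Encode qw(decode encode);

   my $bytes   = "Gr\xFC\xDFe";                 # Latin-1 bytes from outside
   my $unicode = decode('ISO-8859-1', $bytes);  # now a character string
   my $out     = encode('cp1250', $unicode);    # bytes for a cp1250 consumer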

It's not perlish to introduce typing or anything like that, because we
already have to take care of what our scalars contain:

   my $string = 213492;
   $string = uc $string; # makes no sense, does it?

   my $string = "Hello there!";
   $string = ($string * 30) + ($string / 20); # no sense either

Those cases don't die or croak or warn the programmer; Perl is not Java.
And I expect similar treatment for my binary data and Unicode strings.

The model of strings that contain characters/integers in a very big
range, leaving the interpretation to the programmer, feels like exactly
the right model to me.

Regexes should also work on these strings. Let's try to interpret this
construct:

   if ($jpeg =~ /Adobe/s) {
      # ...
   }

On the left we have a JPEG with 'characters' in the range 0-255,
representing a JPEG-encoded image. On the right we have a regex with the
(Unicode) characters 'Adobe', which came from the source code, encoded
in the source code encoding, and were stored in a string. All characters
are in the range 0-255, no surprises. One could also have written:

   my $mark = 'Adobe';
   if ($jpeg =~ /$mark/s) {
      # ...
   }

$mark is a string. $jpeg is a string. Everything is fine. The programmer
just has to remember that the first 128 code points of Unicode have to
match the text in the $jpeg image, which is probably ASCII-encoded, or
Latin-1, or whatever. Imagine that he wants to search for a Japanese
character sequence encoded in Shift-JIS in the JPEG:

   use utf8; # source code is UTF-8 encoded
   use Encode qw(encode);

   my $mark = encode ('shiftjis', 'におんじん');
   if ($jpeg =~ /$mark/s) {
      # ...
   }

$mark contains Shift-JIS-encoded data, all in the range 0-255.
The programmer assumes that somewhere in the JPEG such a Shift-JIS
encoded sequence is present.
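
If you want to verify that claim (just a throwaway check of mine, not
part of the example):

   use strict;
   use warnings;
   use utf8; # source code is UTF-8 encoded
   use Encode qw(encode);

   my $mark = encode('shiftjis', 'におんじん');
   my ($max) = sort { $b <=> $a } map { ord } split //, $mark;
   printf "%d encoded bytes, highest value %d\n", length $mark, $max;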

No surprises: the programmer knows what he is doing when dealing with
such data. If he doesn't, I would have serious doubts about his ability
to write correct programs.

Regexes also hold no surprises in our simple string model:

  my $string = 'Hello There!';

  if ($string =~ /\P{IsWord}/) {
     # ...
  }

The regex searches for characters/integers that fall into the Unicode
character class described by \P{IsWord}, i.e. non-word characters.
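
For instance, the first thing \P{IsWord} hits in that string is the
space between the two words (a quick check of mine):

   use strict;
   use warnings;

   my $string = 'Hello There!';
   if ($string =~ /(\P{IsWord})/) {
      printf "first non-word character: U+%04X\n", ord $1; # U+0020, the space
   }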

IMO, wherever this model breaks, it should be fixed.
I/O with the outside world can only happen with characters in the byte
range 0-255, so we have to encode our data with the Encode module or by
setting the encoding for the file handle with binmode:


   use utf8; # source code is in UTF-8
   my $string = 'あたり'; # this string is UTF-8 encoded in the source
                          # code and decoded by the parser
   binmode STDOUT, ':encoding(shiftjis)';
                          # but the terminal expects Shift-JIS encoded
                          # strings
   print "$string\n";
  

This is the model that Marc and Juerd seem to describe, or at least
that's how I understand it.
What about 'open'? Let's have a look at what semantics a Unicode string
used as a filename has:

   use utf8;
   my $filename = 'ほし.txt';
   open my $fh, '<', $filename or die;

Do you know an OS that has real Unicode filenames? Maybe Windows does, I
don't know, but at least my Linux box and all others I have worked with
had no Unicode filenames. So writing the above should not give me the
desired result, if any. I need to know the encoding of my filenames.

Unfortunately, Linux doesn't know anything about encodings; it just has
some restrictions on the bytes that can appear in a filename (e.g. no '/').
Does this model seem familiar? Linux doesn't enforce any filename
encoding on me. I can have UTF-8 encoded filenames and Latin-1 encoded
filenames side by side. Isn't it great? The freedom!
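
To make that concrete, here is a sketch (mine, assuming a Unix-ish
filesystem that treats filenames as raw bytes): the same logical name
can exist twice, once per encoding.

   use strict;
   use warnings;
   use Encode qw(encode);

   my $name   = "h\x{e9}llo.txt";             # e-acute as a character
   my $utf8   = encode('UTF-8',      $name);  # bytes 68 C3 A9 ...
   my $latin1 = encode('ISO-8859-1', $name);  # bytes 68 E9 ...

   # Both creates succeed; the kernel never asks for an encoding.
   open my $fh1, '>', $utf8   or die "open: $!";
   open my $fh2, '>', $latin1 or die "open: $!";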

But OK, I need to know what data I'm processing, and in this case let's
assume I only have to deal with filenames encoded in UTF-8. I need to
adjust the example above:

   use utf8;
   use Encode qw(encode);

   my $filename = encode ('utf8', 'ほし.txt');
   open my $fh, '<', $filename or die;

Nice, as long as my assumption about my filenames holds, this works!

OK, what does my code do on Windows and other platforms? What will
happen to the string in $filename?

Assume Windows has full-blown, real Unicode filenames. There my first
example, where I didn't encode anything, was completely right, and
suddenly my code doesn't work anymore.

Fixing those portability issues without breaking the string model
seems hard. But IMO a 'default encoding' for filenames could work in
'most' cases: UTF-8 or maybe the locale encoding on Linux platforms,
and no encoding on platforms with Unicode filenames, so that I can write:

   use utf8;
   my $filename = 'ほし.txt';
   open my $fh, '<', $filename or die;

when I don't care about my platform. But if I know I have to deal
with a particular platform with specially encoded filenames,
I need a way to either disable the implicit encoding or set the correct
encoding.
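
Something like this hypothetical helper would do (a sketch only; the
function name, the platform test and the per-platform encodings are all
my assumptions, not an existing or proposed API):

   use strict;
   use warnings;
   use utf8; # source code is UTF-8 encoded
   use Encode qw(encode);

   # Encode a Unicode filename for the current platform before open()
   # sees it.
   sub platform_filename {
      my ($name) = @_;
      return $name if $^O eq 'MSWin32';  # assumption: native Unicode filenames
      return encode('UTF-8', $name);     # assumption: UTF-8 filenames elsewhere
   }

   my $filename = platform_filename('ほし.txt');
   open my $fh, '<', $filename or die;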

The same goes for readdir: it should maybe also use this 'default
encoding', and if I disable the encoding, I should get strings with
characters/integers in the range 0-255, so that I can decode the
filenames myself if I know which encoding they have (or find out whether
they already come as Unicode, on platforms with Unicode filenames).
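
Under today's model that decoding step would look like this (my sketch,
assuming the filenames are known to be UTF-8 encoded):

   use strict;
   use warnings;
   use Encode qw(decode);

   binmode STDOUT, ':encoding(UTF-8)'; # assume a UTF-8 terminal

   opendir my $dh, '.' or die "opendir: $!";
   my @names = map { decode('UTF-8', $_) } readdir $dh;
   closedir $dh;
   print "$_\n" for @names;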

That's all for now. IMO this model makes a lot of sense, and it leaves
the things the programmer HAS to know up to the programmer without
enforcing weird restrictions. It is clean in most, if not all, cases,
and where it isn't, for example when dealing with portability issues, a
solution that does not break the model is needed.

So the main topic should revolve around these questions:

Can Perl be fixed to support this model better?
Can the places where it does not support this model be fixed?
What about XS code? What about the broken XS code on CPAN?
Who will write the patches? Who will rewrite or update 'perluniintro'?


Robin

-- 
Robin Redeker                         | Deliantra, the free code+content MORPG
elmex@ta-sa.org / r.redeker@gmail.com | http://www.deliantra.net
http://www.ta-sa.org/                 |
