develooper Front page | perl.perl5.porters | Postings from May 2008

RE: on the almost impossibility to write correct XS modules

From: Jan Dubois
Date: May 18, 2008 08:40
Subject: RE: on the almost impossibility to write correct XS modules
Message ID: 05da01c8b8fd$83811290$8a8337b0$@com
On Sat, 17 May 2008, Ben Morrow wrote:
> Quoth jand@activestate.com ("Jan Dubois"):
> >
> > The brokenness right now is that when Perl automatically upgrades this
> > data to UTF8, it assumes that the data is Latin1 instead of ANSI,
> > potentially garbling the data if it contained codepoints where the
> > current ANSI codepage and Latin1 are different.
> 
> So you would have
> 
>     "\xff"
> 
> and
> 
>     substr "\xff\x{100}", 0, 1
> 
> be different?

Potentially, yes.  That's what you get for mixing byte and character semantics.
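
A quick check of the current (Latin-1) behavior under discussion: today the
unflagged byte string "\xff" and the character extracted from a UTF8-flagged
string compare equal, because Perl upgrades the byte string as Latin-1 for
the comparison. Under an ANSI interpretation of 8-bit strings they could
compare differently on a codepage where 0xFF is not U+00FF.

```perl
my $byte = "\xff";                       # 8-bit string, no UTF8 flag
my $char = substr "\xff\x{100}", 0, 1;   # one char of a UTF8-flagged string

print ord($byte), "\n";                  # 255
print ord($char), "\n";                  # 255
print $byte eq $char ? "equal\n" : "different\n";   # equal today
```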

> Or would you have ord("\xff") != 0xff?

No, "\xff" is guaranteed to have byte semantics for backwards compatibility:

perluniintro.pod:
| Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
| and C<chr(...)> for arguments less than C<0x100> (decimal 256)
| generate an eight-bit character for backward compatibility with older
| Perls.  For arguments of C<0x100> or more, Unicode characters are
| always produced. If you want to force the production of Unicode
| characters regardless of the numeric value, use C<pack("U", ...)>
| instead of C<\x..>, C<\x{...}>, or C<chr()>.
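
The distinction the pod describes is easy to observe: chr() below 0x100
yields an unflagged eight-bit string, while pack("U", ...) always yields a
UTF8-flagged Unicode string, even for the same codepoint.

```perl
my $native  = chr(0xFF);        # eight-bit, native (byte) semantics
my $unicode = pack("U", 0xFF);  # forced Unicode character

print utf8::is_utf8($native)  ? "flagged\n" : "unflagged\n";  # unflagged
print utf8::is_utf8($unicode) ? "flagged\n" : "unflagged\n";  # flagged
print ord($native) == ord($unicode) ? "same codepoint\n" : "differ\n";
```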

> More generally,
> how would you handle the relationship between bytes, characters and
> Unicode codepoints, if not as it is done now?

The same as it is being done now, just using a different encoding for
the 8-bit character strings.

> If integers less than 256
> are mapped to their ANSI codepoints rather than their ISO8859-1
> codepoints, how do you get those characters in Latin1 that aren't in the
> ANSI codepage?

As perluniintro.pod points out above, the only reliable way to do this
is pack("U", $codepoint).  Or you can use named characters via charnames.pm.
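
For example, requesting a Latin-1 character by its Unicode name gives you
that codepoint regardless of how the surrounding eight-bit data is
interpreted:

```perl
use charnames ':full';

# U+00FF by name, independent of any codepage assumption
my $y_diaeresis = "\N{LATIN SMALL LETTER Y WITH DIAERESIS}";
print ord($y_diaeresis), "\n";   # 255
```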
 
> > How would you want to "fix" this then? Translate all 8-bit data when it
> > is read from the OS from ANSI to Latin1? That seems a lot harder, and
> > will also be quite unintuitive.
> 
> I would say Win32 should exclusively use the Unicode APIs, and treat
> 8-bit strings the same as their upgraded equivalents (that is, as
> ISO8859-1). This may break code that reads ANSI-encoded data from a file
> under the assumption it will be passed to the 8-bit API, of course.

This is exactly the problem: the C runtime library on Windows assumes
that every char* argument is encoded in the system's ANSI codepage, and
not in Latin1.

So every XS extension would have to not only check for the UTF8 flag and
use a Unicode API when available, but also convert all strings passed without
UTF8 flag from Latin1 to ANSI when calling third-party libraries that don't
provide a Unicode API.
 
> > Only code making the explicit assumption that 8-bit strings are encoded
> > in Latin1 is going to break. All code relying on the implicit conversion
> > between 8-bit and UTF8 will actually be fixed and not broken by this
> > change. :)
> 
> Perl explicitly documents that 8-bit data is treated as ISO8859-1,
> except on EBCDIC platforms.

I know that this is the way it works now, but that was not the original
intent.  If you read perluniintro, you'll see:

| =head2 Perl's Unicode Model
|
| Perl supports both pre-5.6 strings of eight-bit native bytes, and
| strings of Unicode characters.  The principle is that Perl tries to
| keep its data as eight-bit bytes for as long as possible, but as soon
| as Unicodeness cannot be avoided, the data is transparently upgraded
| to Unicode.
|
| Internally, Perl currently uses either whatever the native eight-bit
| character set of the platform (for example Latin-1) is, defaulting to
| UTF-8, to encode Unicode strings. Specifically, if all code points in
| the string are C<0xFF> or less, Perl uses the native eight-bit
| character set.  Otherwise, it uses UTF-8.

Note the explicit reference to "whatever the native character set is".

If we expected all external data to always be converted to Latin1, then
we could have saved ourselves the trouble of having two different internal
representations and gone straight to UTF8.
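
The upgrade mechanism the quoted pod describes can be observed directly:
upgrading changes how a string is stored internally, not what it contains.

```perl
my $s = "\xe9";                 # stored as a single eight-bit byte
my $was_utf8 = utf8::is_utf8($s);
utf8::upgrade($s);              # now stored as UTF-8 internally

print $was_utf8 ? "utf8\n" : "bytes\n";          # bytes
print utf8::is_utf8($s) ? "utf8\n" : "bytes\n";  # utf8
print ord($s), "\n";                             # still 233
```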

Cheers,
-Jan




