develooper Front page | perl.perl5.porters | Postings from May 2008

Re: on the almost impossibility to write correct XS modules

Ben Morrow
May 17, 2008 11:48

Quoth "Jan Dubois":
> The brokenness right now is that when Perl automatically upgrades this
> data to UTF8, it assumes that the data is Latin1 instead of ANSI,
> potentially garbling the data if it contained codepoints where the
> current ANSI codepage and Latin1 are different.

So you would have

    substr "\xff\x{100}", 0, 1

be different? Or would you have ord("\xff") != 0xff ? More generally,
how would you handle the relationship between bytes, characters and
Unicode codepoints, if not as it is done now? If integers less than 256
are mapped to their ANSI codepoints rather than their ISO8859-1
codepoints, how do you get those characters in Latin1 that aren't in the
ANSI codepage?
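
For reference, a minimal sketch of the behaviour as it stands today, using
only core builtins. The variable names are illustrative and not taken from
the thread. It shows that a character below 256 is the same character
whether it came from an 8-bit string or from an upgraded one:

```perl
use strict;
use warnings;

# A string holding one character below 256 and one above it.
# The \x{100} forces Perl to store the whole string in upgraded form.
my $wide  = "\xff\x{100}";
my $first = substr $wide, 0, 1;

# A plain 8-bit string holding the same first character.
my $narrow = "\xff";

# Under the current semantics both are the single character U+00FF.
my $same_char  = ($first eq $narrow) ? 1 : 0;
my $first_ord  = ord $first;
my $narrow_ord = ord $narrow;

printf "equal=%d ord(first)=0x%x ord(narrow)=0x%x\n",
    $same_char, $first_ord, $narrow_ord;
```

If upgrading went through the ANSI codepage instead, at least one of these
three values would have to change on a system whose codepage disagrees with
ISO8859-1 at that position.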

> How would you want to "fix" this then? Translate all 8-bit data when it
> is read from the OS from ANSI to Latin1? That seems a lot harder, and
> will also be quite unintuitive.

I would say Win32 should exclusively use the Unicode APIs, and treat
8-bit strings the same as their upgraded equivalents (that is, as
ISO8859-1). This may break code that reads ANSI-encoded data from a file
under the assumption it will be passed to the 8-bit API, of course.
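
A sketch of what such code would have to do instead: decode the ANSI bytes
explicitly as they are read, so that the resulting string holds real
characters. This assumes the ANSI codepage is Windows-1252, which is only
one possibility, and uses the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the data's "ANSI" codepage is Windows-1252; substitute the real one.
my $ansi_codepage = 'cp1252';

# Byte 0x80 is the euro sign in cp1252 but a C1 control character in ISO8859-1.
my $raw_bytes = "\x80";

# Explicit decode: the result is the character the data was meant to contain.
my $decoded = decode($ansi_codepage, $raw_bytes);

# No decode: Perl treats the byte as the ISO8859-1 character U+0080.
my $undecoded = $raw_bytes;

printf "decoded=U+%04X undecoded=U+%04X\n", ord $decoded, ord $undecoded;
```

With the explicit decode in place the string no longer depends on which API
it is eventually passed to.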

> Only code making the explicit assumption that 8-bit strings are encoded
> in Latin1 is going to break. All code relying on the implicit conversion
> between 8-bit and UTF8 will actually be fixed and not broken by this
> change. :)

Perl explicitly documents that 8-bit data is treated as ISO8859-1,
except on EBCDIC platforms. 
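
A small sketch of that documented behaviour, assuming a non-EBCDIC platform.
Upgrading an 8-bit string leaves its character value alone, and the UTF-8
form of that character is the one ISO8859-1 predicts:

```perl
use strict;
use warnings;

# One 8-bit character: 0xE9 is e-acute in ISO8859-1.
my $str    = "\xe9";
my $before = ord $str;

# Switch the internal representation to UTF-8; the string's value is unchanged.
utf8::upgrade($str);
my $after = ord $str;

# Encode a copy to UTF-8 octets to see what the character becomes.
my $copy = $str;
utf8::encode($copy);
my $octets = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $copy;

printf "before=0x%X after=0x%X utf8=%s\n", $before, $after, $octets;
```

The octets C3 A9 are the UTF-8 encoding of U+00E9, so byte 0xE9 has been
read as the ISO8859-1 character and not as whatever the local ANSI codepage
assigns to 0xE9.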


_)}};*_=sub{for(@_){$|=(!$|||$_||&)(q) )));&((q:\:\::,q,,,\$_);$_&&&)("\n")}}}_
$J::u::s::t, $a::n::o::t::h::e::r, $P::e::r::l, $h::a::c::k::e::r, $,
