Quoth jand@activestate.com ("Jan Dubois"):
> The brokenness right now is that when Perl automatically upgrades this
> data to UTF8, it assumes that the data is Latin1 instead of ANSI,
> potentially garbling the data if it contained codepoints where the
> current ANSI codepage and Latin1 are different.

So you would have "\xff" and substr "\xff\x{100}", 0, 1 be different? Or
would you have ord("\xff") != 0xff?

More generally, how would you handle the relationship between bytes,
characters and Unicode codepoints, if not as it is done now? If integers
less than 256 are mapped to their ANSI codepoints rather than their
ISO8859-1 codepoints, how do you get those characters in Latin1 that
aren't in the ANSI codepage?

> How would you want to "fix" this then? Translate all 8-bit data when it
> is read from the OS from ANSI to Latin1? That seems a lot harder, and
> will also be quite unintuitive.

I would say Win32 should exclusively use the Unicode APIs, and treat
8-bit strings the same as their upgraded equivalents (that is, as
ISO8859-1). This may break code that reads ANSI-encoded data from a file
under the assumption it will be passed to the 8-bit API, of course.

> Only code making the explicit assumption that 8-bit strings are encoded
> in Latin1 is going to break. All code relying on the implicit conversion
> between 8-bit and UTF8 will actually be fixed and not broken by this
> change. :)

Perl explicitly documents that 8-bit data is treated as ISO8859-1,
except on EBCDIC platforms.

Ben

--
BEGIN{*(=sub{$,=*)=sub{print@_};local($#,$;,$/)=@_;for(keys%{ #ben@morrow.me.uk
$#}){/m/&&next;**=${$#}{$_};/(\w):/&&(&(($#.$_,$;.$+,$/),next);$/==\$*&&&)($;.$
_)}};*_=sub{for(@_){$|=(!$|||$_||&)(q) )));&((q:\:\::,q,,,\$_);$_&&&)("\n")}}}_
$J::u::s::t, $a::n::o::t::h::e::r, $P::e::r::l, $h::a::c::k::e::r, $,
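
P.S. To make the first two questions above concrete, here is a small
script of my own (an illustration of what a stock 5.8+ perl does today,
not anything from a patch) showing that the implicit upgrade currently
treats bytes as ISO8859-1:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $bytes = "\xff";                      # byte string, UTF8 flag off
    my $chars = substr "\xff\x{100}", 0, 1;  # \x{100} forces the UTF8 flag on

    printf "ord(bytes) = 0x%x\n", ord $bytes;   # prints 0xff
    printf "ord(chars) = 0x%x\n", ord $chars;   # prints 0xff
    print  "same character\n" if $bytes eq $chars;  # true: upgrade is Latin1

    # Under an ANSI reading of byte strings, "\xff" would instead denote
    # whatever 0xff means in the current codepage (U+044F in cp1251, say),
    # and the two could compare unequal.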