2008/5/17 Ben Morrow <ben@morrow.me.uk>:
>
> Quoth jand@activestate.com ("Jan Dubois"):
>>
>> The brokenness right now is that when Perl automatically upgrades this
>> data to UTF8, it assumes that the data is Latin1 instead of ANSI,
>> potentially garbling the data if it contained codepoints where the
>> current ANSI codepage and Latin1 are different.
>
> So you would have
>
>     "\xff"
>
> and
>
>     substr "\xff\x{100}", 0, 1
>
> be different? Or would you have ord("\xff") != 0xff ? More generally,
> how would you handle the relationship between bytes, characters and
> Unicode codepoints, if not as it is done now? If integers less than 256
> are mapped to their ANSI codepoints rather than their ISO8859-1
> codepoints, how do you get those characters in Latin1 that aren't in the
> ANSI codepage?
>
>> How would you want to "fix" this then? Translate all 8-bit data when it
>> is read from the OS from ANSI to Latin1? That seems a lot harder, and
>> will also be quite unintuitive.
>
> I would say Win32 should exclusively use the Unicode APIs, and treat
> 8-bit strings the same as their upgraded equivalents (that is, as
> ISO8859-1). This may break code that reads ANSI-encoded data from a file
> under the assumption it will be passed to the 8-bit API, of course.
>
>> Only code making the explicit assumption that 8-bit strings are encoded
>> in Latin1 is going to break. All code relying on the implicit conversion
>> between 8-bit and UTF8 will actually be fixed and not broken by this
>> change. :)
>
> Perl explicitly documents that 8-bit data is treated as ISO8859-1,
> except on EBCDIC platforms.

I don't know about that. We make such assumptions in the regex engine,
and possibly about the expected encoding of source files without
use locale, but I don't think we actually mandate that it is Latin-1
generally. And I'm unconvinced that the suggestion made by Jan is as
problematic as either you or Marc have said.

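To make the mismatch Jan describes concrete, here is a minimal sketch. It is written in Python purely to show the byte-level effect and is not Perl's actual upgrade code. It assumes the current ANSI codepage is Windows-1252, and the two helper names are made up for illustration.

```python
# Byte 0x80 is the Euro sign in Windows-1252 (assumed here to be the
# current ANSI codepage), but a C1 control character in ISO 8859-1.
ansi_bytes = "\u20acuro".encode("cp1252")  # what an 8-bit Win32 API might return


def upgrade_as_latin1(data: bytes) -> str:
    """Hypothetical helper: widen 8-bit data assuming it is Latin-1."""
    return data.decode("latin-1")


def upgrade_as_ansi(data: bytes, codepage: str = "cp1252") -> str:
    """Hypothetical helper: widen 8-bit data using the ANSI codepage."""
    return data.decode(codepage)


garbled = upgrade_as_latin1(ansi_bytes)
intact = upgrade_as_ansi(ansi_bytes)

# Print the first code point under each interpretation.
print(f"as Latin-1: U+{ord(garbled[0]):04X}")
print(f"as ANSI:    U+{ord(intact[0]):04X}")
```

Bytes where the two encodings agree, such as 0xFF, come out the same either way. Only the code points where the codepage and Latin-1 differ are affected.
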
If we used Win32 API calls to convert/access system data as widechar
(UTF-16) and then converted the result to UTF-8, then we should be in the
clear. And I don't believe that the problem is in reading data from a
*file*. That type of issue is a) not Win32 specific, b) extremely common,
and c) well solved by the proper application of Encode and friends. (We
DO NOT document that all data files operated on by perl must be in
Latin-1.)

The problem is how we access the file APIs and other system APIs (the
most commonly used registry code is not widechar aware, for instance).
Currently, as far as I know, there is no way using perl to use the Win32
widechar APIs to create Unicode filenames and directories. And if I
understood Jan right, then his suggestion would resolve that problem.

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"