
Re: on the almost impossibility to write correct XS modules

From:
Glenn Linderman
Date:
May 19, 2008 13:34
Subject:
Re: on the almost impossibility to write correct XS modules
Message ID:
4831E445.4010706@NevCal.com
On approximately 5/19/2008 11:45 AM, came the following characters from 
the keyboard of Jan Dubois:
> On Mon, 19 May 2008, Marc Lehmann wrote:
>> On Sat, May 17, 2008 at 10:50:12AM -0700, Jan Dubois
>> <jand@activestate.com> wrote:

> Currently Perl doesn't have any code to treat the native encoding on
> Windows correctly and therefore (incorrectly) assumes that all 8-bit
> strings are Latin1 encoded.  This is the part that needs to be fixed
> if strings with SvUTF8 are going to be correct.

The gist of the problem here is that:

1) The "automatic" conversion of 8-bit to UTF-8 "assumed" Latin1 because
that (a) was numerically easy and (b) worked well on platforms that use
Latin1 as their native encoding.

2) Windows assumes ANSI code page for 8-bit data, but Perl on Windows, 
for quite a few releases now, has not... instead, it "assumes" Latin1 
when "automatically" converting 8-bit to UTF-8.
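
To make points 1 and 2 concrete, here is a small illustration. It is
written in Python rather than Perl, purely because its codecs make the
two readings easy to compare side by side. It assumes the ANSI code page
is Windows-1252 (only one of many possible ANSI code pages), and the
helper names are invented for this sketch.

```python
# Illustration only: compare the Latin1 reading of an 8-bit string with
# the reading under one ANSI code page (cp1252 is an assumption here).
raw = bytes([0x80, 0xE9])  # two octets as an 8-bit API might return them

def upgrade_as_latin1(octets: bytes) -> str:
    # Latin1 is "numerically easy": octet N becomes code point N.
    return "".join(chr(b) for b in octets)

def upgrade_as_ansi(octets: bytes, codepage: str = "cp1252") -> str:
    # An ANSI reading needs a per-code-page lookup table.
    return octets.decode(codepage)

latin1_text = upgrade_as_latin1(raw)
ansi_text = upgrade_as_ansi(raw)

# 0xE9 agrees in both readings (U+00E9), but 0x80 does not:
# Latin1 gives the control character U+0080, cp1252 gives U+20AC (euro).
print([hex(ord(c)) for c in latin1_text])
print([hex(ord(c)) for c in ansi_text])
```

Any octet in the 0x80-0x9F range shows the same kind of disagreement
under cp1252; other code pages disagree with Latin1 far more widely.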

Retrofitting Perl on Windows to assume 8-bit data is ANSI will break all
code that attempts to work within the constraints of 1 and 2.  Had Perl
on Windows been supplied with that assumption when, or shortly after,
UTF-8 was implemented in Perl, it might have been the best solution,
albeit with somewhat lower performance than assuming Latin1.  And it
would possibly have prevented, by the example of a widely-used platform,
the implicit assumption throughout lots of Perl code that all 8-bit
data is Latin1.

Since Perl on Windows has for some years now been making the assumption 
that all 8-bit data being converted to UTF-8 is Latin1, changing that 
assumption seems inappropriate.

It might be possible, for a list of documented 8-bit Windows APIs, to
make changes so that Perl automatically converts ANSI to UTF-8, or the
reverse, whenever those 8-bit APIs are used by the Perl internals.  But
this would break all the code that is already explicitly working around
the Latin1 assumption by encoding from UTF-8 to ANSI before passing data
to the 8-bit APIs, and decoding from ANSI to UTF-8 after receiving data
from the 8-bit APIs.
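
One way such breakage could show up is sketched below, again in Python
as a neutral illustration.  It rests on two assumptions that are mine,
not a description of any actual Perl change: the ANSI code page is
cp1252, and the changed runtime would read an unflagged 8-bit string as
Latin1 characters and convert those to the code page on the way out.
Both function names are hypothetical.

```python
# Sketch of the breakage, under two assumptions: the ANSI code page is
# cp1252, and the changed runtime treats an 8-bit string as Latin1
# characters and converts those to the code page on the way out.
def existing_workaround(text: str) -> bytes:
    # What careful code does today: encode to ANSI by hand.
    return text.encode("cp1252")

def hypothetical_auto_convert(octets: bytes) -> bytes:
    # Hypothetical new behaviour: read octets as Latin1, re-encode as ANSI.
    return octets.decode("latin-1").encode("cp1252")

already_ansi = existing_workaround("\u20ac")  # euro sign -> b"\x80"

try:
    hypothetical_auto_convert(already_ansi)
    converted_twice_ok = True
except UnicodeEncodeError:
    # U+0080 has no cp1252 encoding, so the second conversion fails.
    converted_twice_ok = False

print(already_ansi, converted_twice_ok)
```

Data that was already converted by hand gets converted a second time,
and the second conversion either fails or silently changes the bytes.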

In my opinion, if incompatible changes are going to be made to the way
Perl for Windows interacts with Windows APIs, it would be better not to
attempt to work within the constraints of the ANSI encoding at all.
Instead, Perl for Windows should begin using the UTF-16 APIs, converting
the data to/from UTF-8 for internal use by Perl, and setting the UTF8
flag.
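
The conversion involved is mechanical.  As a minimal sketch (in Python,
with function names invented for the example, and ignoring details such
as the NUL terminator that real wide-character calls expect):

```python
# Sketch: convert Perl-style internal UTF-8 octets to and from the
# UTF-16LE form that the wide-character Windows APIs expect.
def utf8_to_utf16le(octets: bytes) -> bytes:
    return octets.decode("utf-8").encode("utf-16-le")

def utf16le_to_utf8(wide: bytes) -> bytes:
    return wide.decode("utf-16-le").encode("utf-8")

name = "caf\u00e9 \U0001D11E"          # includes a character outside the BMP
internal = name.encode("utf-8")       # what would carry the UTF8 flag
wide = utf8_to_utf16le(internal)

# The round trip is lossless, unlike a trip through an ANSI code page.
assert utf16le_to_utf8(wide) == internal
print(len(internal), len(wide))
```

The point of the example is the round trip: nothing is lost in either
direction, whatever characters the string contains.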

The only way I can see to make either change without breaking lots of
existing code is to invent a new variety of Perl for Windows, in
parallel with today's "assume Latin1 except for certain file operations"
variety, that either implements default ANSI encoding for 8-bit APIs and
internal Perl use, or (preferred) implements UTF-16 API usage and
converts data to and from UTF-8 as appropriate.  Or do both, by
implementing two new varieties of Perl.

The other "solution" would be to put your head in the sand, assume that 
all code written for Perl for Windows since UTF-8 support was added, 
really only used ASCII for all the ANSI interfaces, rather than 
ANSI-with-conversions, and make an incompatible version of Perl for 
Windows based on that assumption.

The ramifications of adopting any of the above solutions for existing
application and CPAN code are frightening.  And the status quo is
extremely limiting for multi-lingual applications built with Perl for
Windows.  I continually chafe at the restriction to ANSI characters for
Windows interfaces.

The only compatible solution I can see would be somewhat complex, but 
made somewhat easier by the recent implementation of lexically scoped 
pragmas.  Envision a Perl that:

A) implements a lexically scoped pragma that enables proper behavior, as 
defined by the remaining points here, possibly requiring 3 values in 
order to support Windows 9x without Unicode.

The three settings (I really don't care what they are called) I will 
herein call "compatibility", "8bit", and "Unicode".

The default, initial setting of the pragma would be "compatibility", on
Windows platforms.  I'm not sure what the appropriate default would be
on other platforms, possibly "Unicode".  The needs of other platforms
might also justify creating additional settings for this pragma.

For this usage, the pragma would be set to reflect the desired semantics 
for API calls within the scope of the setting.

The goal would be to eventually migrate all Perl application and CPAN 
code to work properly with the pragma set to "Unicode"... any other 
settings of the pragma would be for compatibility, or special cases.


B) If the pragma is set to "compatibility", Perl would continue 
implementing today's broken semantics for Windows ANSI APIs, 
interpreting string parameters passed to Windows APIs as ANSI, 
regardless of the setting of the UTF8 flag, and passing the data 
directly to the Windows 8-bit APIs, and returning the data directly from 
the Windows 8-bit APIs.  This would allow users that are already working
around the broken features to continue doing so until they can upgrade
their code to use one of the other, better, values for the pragma.  API
operations attempted on non-Windows platforms with this setting of the 
pragma should fail.

C) If the pragma is set to "8bit", then 8-bit Windows APIs would be used 
(allowing use on all versions of Windows from 95 onward, but also being 
restricted to characters from the current code page on the APIs), and 
string data would be automatically translated from Latin1 or UTF-8 to 
the current code page for use with the APIs, and API results would be 
translated from the current code page back to Latin1 (if possible) or 
UTF-8.  There is the potential for data loss in this mode of operation, 
if characters outside the subset defined by the current code page are 
passed into the APIs.  However, that is not an unexpected feature of 
8-bit Windows APIs.  API operations attempted on non-Windows platforms 
with this setting of the pragma may fail or succeed, depending on the 
nature of the platform, and if it has a "native 8bit" character set, and 
if the operations have been appropriately coded to convert the data 
to/from the native character set.  (This would be the right way to 
support EBCDIC, if there is anyone out there that wants to support EBCDIC.)

D) If the pragma is set to "Unicode", then 16-bit (UTF-16) Windows APIs
would be used (allowing use on all versions of Windows NT, and on some
versions of Windows 9x with Unicode extensions, although there may be
some restrictions on the characters and APIs usable on such systems),
and string data would be automatically translated from Latin1 or UTF-8
to UTF-16 for use with the APIs, and API results would be translated
from UTF-16 back to Latin1 (if possible) or UTF-8.  API operations
attempted on a non-Unicode version of Windows with this setting of the
pragma should fail.
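
As a rough sketch of how points B, C and D differ in what actually
reaches the API, here is a toy model in Python.  Everything in it is
illustrative: PerlString and to_api are invented names, cp1252 stands in
for "the current code page", and the "?" replacement is just one way
the data loss described in C might surface.

```python
from dataclasses import dataclass

@dataclass
class PerlString:
    """Toy model of a Perl scalar: raw octets plus the UTF8 flag."""
    octets: bytes
    utf8: bool

    def text(self) -> str:
        # Flag on: octets are UTF-8. Flag off: octets are Latin1.
        return self.octets.decode("utf-8" if self.utf8 else "latin-1")

def to_api(s: PerlString, mode: str, codepage: str = "cp1252") -> bytes:
    """Return the bytes a Windows API would receive under each setting."""
    if mode == "compatibility":
        # Point B: hand the octets over untouched, whatever the flag says.
        return s.octets
    if mode == "8bit":
        # Point C: convert to the code page; unmappable characters are
        # replaced with "?", which is the data loss mentioned above.
        return s.text().encode(codepage, errors="replace")
    if mode == "Unicode":
        # Point D: convert to UTF-16 for the wide-character APIs.
        return s.text().encode("utf-16-le")
    raise ValueError(f"unknown mode: {mode}")

euro = PerlString("\u20ac".encode("utf-8"), utf8=True)
greek = PerlString("\u03a9".encode("utf-8"), utf8=True)  # not in cp1252

for mode in ("compatibility", "8bit", "Unicode"):
    print(mode, to_api(euro, mode), to_api(greek, mode))
```

Under "compatibility" the API sees raw UTF-8 octets and misreads them as
ANSI; under "8bit" the euro sign survives but the Greek letter does not;
under "Unicode" both arrive intact.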



The above discussion only deals with Windows APIs.  However, since the
Perl internals have a list of operations ("\U" "\u" "\L" "\l", certain
regexp matching operations and case-insensitivity; search the archives
for the complete list) that are likewise broken and presently depend on
the UTF8 flag to guide their functionality, it would be possible to use
this same pragma to deal with those cases.  In this case, the pragma 
would be set to reflect the desired semantics for this list of 
operations within the scope of the setting.

E) If the pragma is set to "compatibility" or "8bit", today's surprising 
semantics would be used, based on the setting of the UTF8 flag on 
various string parameters to the various operations in the list.

F) If the pragma is set to "Unicode", all strings would have Unicode
semantics applied, regardless of the setting of the UTF8 flag, although
the UTF8 flag would still indicate whether each character is stored
internally as a single octet or as a UTF-8 encoded code point.

G) Perhaps an extra setting of the pragma, "binary", would be useful for
certain regexp operations, where matching is desired, but no character
semantics should be applied in any way.  This setting would be expected
to be used only in very local scopes containing regexp operations.  API 
operations with character string parameters attempted with this setting 
should fail, as this setting would preclude interpreting strings as 
characters.
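
The difference between E and F can be sketched with uppercasing.  This
is a toy model in Python, not Perl's implementation: uc here is an
invented function, and the flag is passed in explicitly to stand for
the UTF8 flag on the string.

```python
def uc(octets: bytes, utf8_flag: bool, mode: str) -> str:
    """Toy model of uppercasing under the proposed settings."""
    if mode == "Unicode" or utf8_flag:
        # Point F (or a flagged string today): full Unicode case mapping.
        text = octets.decode("utf-8" if utf8_flag else "latin-1")
        return text.upper()
    # Point E with the flag off: only ASCII letters change case.
    return octets.upper().decode("latin-1")

e_acute = "\u00e9"
as_latin1 = e_acute.encode("latin-1")   # one octet, flag off
as_utf8 = e_acute.encode("utf-8")       # two octets, flag on

today_unflagged = uc(as_latin1, False, "compatibility")
today_flagged = uc(as_utf8, True, "compatibility")
proposed = uc(as_latin1, False, "Unicode")

print(today_unflagged, today_flagged, proposed)
```

The same logical character gives two different results today depending
only on its internal representation; under "Unicode" the representation
stops mattering.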


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


