Re: on the almost impossibility to write correct XS modules
From:
Glenn Linderman
Date:
May 19, 2008 13:34
Subject:
Re: on the almost impossibility to write correct XS modules
Message ID:
4831E445.4010706@NevCal.com
On approximately 5/19/2008 11:45 AM, came the following characters from
the keyboard of Jan Dubois:
> On Mon, 19 May 2008, Marc Lehmann wrote:
>> On Sat, May 17, 2008 at 10:50:12AM -0700, Jan Dubois
>> <jand@activestate.com> wrote:
> Currently Perl doesn't have any code to treat the native encoding on
> Windows correctly and therefore (incorrectly) assumes that all 8-bit
> strings are Latin1 encoded. This is the part that needs to be fixed
> if strings with SvUTF8 are going to be correct.
The gist of the problem here is that:
1) The "automatic" conversion of 8-bit to UTF-8 "assumed" Latin1 because
that (a) was easy numerically and (b) worked well on platforms that use
Latin1 as their native encoding.
2) Windows assumes the ANSI code page for 8-bit data, but Perl on
Windows, for quite a few releases now, has not; instead, it "assumes"
Latin1 when "automatically" converting 8-bit to UTF-8.
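To make the mismatch in points 1 and 2 concrete, here is a minimal sketch.
It assumes the ANSI code page is 1252 (a common one, but only an
assumption here) and uses the core Encode module for the explicit
conversion.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte 0x80 is the euro sign in code page 1252 (assumed ANSI code page).
my $ansi_bytes = "\x80";

# Perl's implicit upgrade maps each byte to the Latin1 code point of the
# same numeric value, whatever the platform's code page is.
my $upgraded = $ansi_bytes;
utf8::upgrade($upgraded);

# An explicit decode honours the code page instead.
my $decoded = decode('cp1252', $ansi_bytes);

printf "implicit upgrade: U+%04X\n", ord $upgraded;   # U+0080
printf "explicit decode:  U+%04X\n", ord $decoded;    # U+20AC
```

The same byte ends up as two different characters, which is why data
from an 8-bit Windows API cannot simply be left to the automatic
conversion.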
Retrofitting Perl on Windows to assume 8-bit data is ANSI will break all
code that attempts to work within the constraints of points 1 and 2. Had
Perl on Windows come supplied with that assumption when, or shortly
after, UTF-8 was implemented in Perl, it might have been the best
solution, albeit with somewhat lower performance than assuming Latin1.
And it would possibly have prevented, by the example of a widely-used
platform, the assumption throughout lots of Perl code that all 8-bit
data is implicitly Latin1.
Since Perl on Windows has for some years now been making the assumption
that all 8-bit data being converted to UTF-8 is Latin1, changing that
assumption seems inappropriate.
It might be possible, for a list of documented 8-bit Windows APIs, to
make changes that automatically convert ANSI to UTF-8 or the reverse
when the 8-bit APIs are used by the Perl internals, but this would break
all the code that is already explicitly working around the Latin1
assumption by encoding from UTF-8 to ANSI before passing data to the
8-bit APIs, and decoding from ANSI to UTF-8 after receiving data from
the 8-bit APIs.
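The kind of explicit workaround meant here might look roughly like the
following sketch. The helper names to_ansi and from_ansi are
hypothetical, and code page 1252 is again only an assumed stand-in for
whatever the ANSI code page really is on a given system.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: the ANSI code page is 1252; real code would look it up.
my $ANSI = 'cp1252';

# Hypothetical helpers wrapping the manual conversions.
sub to_ansi   { my ($chars) = @_; return encode($ANSI, $chars) }
sub from_ansi { my ($bytes) = @_; return decode($ANSI, $bytes) }

my $name  = "caf\x{e9} \x{20ac}5.txt";   # character string
my $bytes = to_ansi($name);              # what is handed to an 8-bit API
my $back  = from_ansi($bytes);           # what is done to data coming back

printf "bytes passed to the API: %s\n", unpack('H*', $bytes);
print  "round trip ok\n" if $back eq $name;
```

If Perl started doing the same conversion automatically, code like this
would convert twice and corrupt anything outside ASCII.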
It would be better, in my opinion, if incompatible changes are to be
made to the way Perl for Windows interacts with Windows APIs, that
instead of attempting to work within the constraints of the ANSI
encoding, Perl for Windows should begin using the UTF-16 APIs,
converting the data to/from UTF-8 for internal use by Perl, and setting
the UTF8 flag.
The only way I can see to make either change without breaking lots of
existing code is to invent a new variety of Perl for Windows in
parallel with today's "assume Latin1 except for certain file operations"
variety, that either implements default ANSI encoding for 8-bit APIs and
internal Perl use, or (preferred) that implements UTF-16 API usage and
converts data to and from UTF-8 as appropriate, or both by implementing
two new versions of Perl.
The other "solution" would be to put your head in the sand, assume that
all code written for Perl for Windows since UTF-8 support was added,
really only used ASCII for all the ANSI interfaces, rather than
ANSI-with-conversions, and make an incompatible version of Perl for
Windows based on that assumption.
The ramifications of any of the above solutions for existing
application and CPAN code are frightening. And the status quo is
extremely limiting for multi-lingual applications built with Perl for
Windows. I continually chafe at the restrictions of ANSI characters for
Windows interfaces.
The only compatible solution I can see would be somewhat complex, but
made somewhat easier by the recent implementation of lexically scoped
pragmas. Envision a Perl that:
A) implements a lexically scoped pragma that enables proper behavior, as
defined by the remaining points here, possibly requiring 3 values in
order to support Windows 9x without Unicode.
The three settings (I really don't care what they are called) I will
herein call "compatibility", "8bit", and "Unicode".
The default, initial setting of the pragma would be "compatibility" on
Windows platforms. I'm not sure what the appropriate default would be
on other platforms, possibly "Unicode". The needs of other platforms
might well justify additional settings for this pragma.
For this usage, the pragma would be set to reflect the desired semantics
for API calls within the scope of the setting.
The goal would be to eventually migrate all Perl application and CPAN
code to work properly with the pragma set to "Unicode"... any other
settings of the pragma would be for compatibility, or special cases.
B) If the pragma is set to "compatibility", Perl would continue
implementing today's broken semantics for Windows ANSI APIs,
interpreting string parameters passed to Windows APIs as ANSI,
regardless of the setting of the UTF8 flag, passing the data directly
to the Windows 8-bit APIs, and returning the data directly from the
Windows 8-bit APIs. This would allow users who are already working
around the broken features to continue doing so until they can upgrade
their code to use one of the other, better, values for the pragma. API
operations attempted on non-Windows platforms with this setting of the
pragma should fail.
C) If the pragma is set to "8bit", then 8-bit Windows APIs would be used
(allowing use on all versions of Windows from 95 onward, but also being
restricted to characters from the current code page on the APIs), and
string data would be automatically translated from Latin1 or UTF-8 to
the current code page for use with the APIs, and API results would be
translated from the current code page back to Latin1 (if possible) or
UTF-8. There is the potential for data loss in this mode of operation,
if characters outside the subset defined by the current code page are
passed into the APIs. However, that is not an unexpected feature of
8-bit Windows APIs. API operations attempted on non-Windows platforms
with this setting of the pragma may fail or succeed, depending on the
nature of the platform, and if it has a "native 8bit" character set, and
if the operations have been appropriately coded to convert the data
to/from the native character set. (This would be the right way to
support EBCDIC, if there is anyone out there that wants to support EBCDIC.)
D) If the pragma is set to "Unicode", then 16-bit Windows APIs would be
used (allowing use on all versions of Windows NT, and some versions of
Windows 9x with Unicode extensions, although there may be some
restrictions of characters and APIs used on some such systems), and
string data would be automatically translated from Latin1 or UTF-8 to
UTF-16 for use with the APIs, and API results would be translated from
UTF-16 back to Latin1 (if possible) or UTF-8. API operations attempted
on a non-Unicode version of Windows with this setting of the pragma
should fail.
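Points A through D could be sketched along the following lines. This is
only an illustration under stated assumptions: the pragma name
apisemantics, its hint key, and the helper api_bytes are all
hypothetical, code page 1252 stands in for "the current code page", and
the "compatibility" branch simply passes the string through. It relies
on the %^H hints hash and the hints element returned by caller, which
are what lexically scoped pragmas are built on.

```perl
use strict;
use warnings;
use Encode qw(encode);

{
    # Hypothetical pragma; the name and hint key are illustrative only.
    package apisemantics;

    sub import {
        my ($class, $mode) = @_;
        $mode = 'compatibility' unless defined $mode;
        die "unknown setting '$mode'\n"
            unless $mode =~ /\A(?:compatibility|8bit|Unicode)\z/;
        $^H{'apisemantics/mode'} = $mode;   # scoped to the enclosing block
    }

    # Setting in effect at the call site $level frames up the stack.
    sub in_effect {
        my ($level) = @_;
        my $hints = (caller($level || 0))[10];
        return $hints && $hints->{'apisemantics/mode'} || 'compatibility';
    }
}

# Assumption: code page 1252 stands in for "the current code page".
my $CODE_PAGE = 'cp1252';

# Hypothetical helper: the bytes an API wrapper would hand to Windows.
sub api_bytes {
    my ($string) = @_;
    my $mode = apisemantics::in_effect(1);   # setting at our caller's scope
    return encode($CODE_PAGE, $string) if $mode eq '8bit';    # lossy
    return encode('UTF-16LE', $string) if $mode eq 'Unicode';
    return $string;                          # "compatibility": untouched
}

my $text = "\x{4e2d}";                       # a character outside cp1252

my $default = api_bytes($text);              # no pragma in scope
my ($narrow, $wide);
{
    # Stands in for "use apisemantics '8bit';" in a single-file sketch.
    BEGIN { apisemantics->import('8bit') }
    $narrow = api_bytes($text);              # substituted: data loss
}
{
    BEGIN { apisemantics->import('Unicode') }
    $wide = api_bytes($text);                # UTF-16LE code units
}

printf "8bit: %s  Unicode: %s\n", $narrow, unpack('H*', $wide);
```

The "8bit" call shows the data loss mentioned in C: the character has no
place in the code page and is replaced. A real implementation would also
have to convert results coming back from the APIs, which this sketch
leaves out.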
The above discussion only deals with Windows APIs. However, since the
Perl internals have a list of operations ("\U" "\u" "\L" "\l", certain
regexp matching operations and case-insensitivity; search the archives
for the complete list) that are likewise broken and presently depend on
the UTF8 flag to guide their functionality, it would be possible to use
this same pragma to deal with those cases. In this case, the pragma
would be set to reflect the desired semantics for this list of
operations within the scope of the setting.
E) If the pragma is set to "compatibility" or "8bit", today's surprising
semantics would be used, based on the setting of the UTF8 flag on
various string parameters to the various operations in the list.
F) If the pragma is set to "Unicode", all strings would have Unicode
semantics applied, regardless of the setting of the UTF8 flag, although
the UTF8 flag would still indicate whether each character is stored
internally as a single octet or as a UTF-8 encoded sequence.
G) Perhaps an extra setting of the pragma "binary" would be useful for
certain regexp operations, where matching is desired, but no character
semantics should be applied in any way. This setting would be expected
to be used only in very local scopes containing regexp operations. API
operations with character string parameters attempted with this setting
should fail, as this setting would preclude interpreting strings as
characters.
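The "surprising semantics" that E keeps and F would replace can be seen
in a small sketch. It assumes perl is run with default settings (no
locale in effect, and nothing that forces Unicode semantics on 8-bit
strings).

```perl
use strict;
use warnings;

my $native = "\xe9";       # e-acute, stored as a single octet
my $wide   = $native;
utf8::upgrade($wide);      # same character, now UTF-8 encoded internally

# The two strings compare equal as far as Perl code can tell...
my $equal = $native eq $wide ? "yes" : "no";

# ...yet uc() treats them differently, guided only by the UTF8 flag.
my $uc_native = uc $native;
my $uc_wide   = uc $wide;

printf "equal: %s\n", $equal;                   # yes
printf "uc, flag off: %02X\n", ord $uc_native;  # E9 (unchanged)
printf "uc, flag on:  %02X\n", ord $uc_wide;    # C9 (E-acute)
```

Under F, both calls would give the second result within the scope of
the pragma.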
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking