Re: on the almost impossibility to write correct XS modules
From: Glenn Linderman
May 19, 2008 18:28
Message ID: 4832292C.6060900@NevCal.com
On approximately 5/19/2008 2:22 PM, came the following characters from
the keyboard of Marc Lehmann:
> On Mon, May 19, 2008 at 01:34:13PM -0700, Glenn Linderman <perl@NevCal.com> wrote:
>> The gist of the problem here is that
>> 1) The "automatic" conversion of 8-bit to UTF-8 "assumed" Latin1 because
>> it was (a) easy numerically (b) worked well on platforms that use Latin1
>> as their native encoding.
> Which platform is that? I really don't know *any* such platform.
You don't have to know of one to figure out that the present scheme
works fine on such a platform if it exists.
Since it was done this way, I would assume it must have been useful
somewhere... but perhaps it was just ASCII platforms for which it worked
> Note also that the automatic conversion in perl doesn't assume any
> encoding *at all*, so this is simply not true.
Perl assumes an encoding for various operations; you've stated that. My
saying that Perl assumes an encoding is simply shorthand for the set of
all Perl operations that assume an encoding.
The conversion of internal string formats does assume that all the
characters representable by various numbers in the octet format
(internal UTF8 flag turned off) convert to the same number in the
multi-byte format (internal UTF8 flag turned on). This is equivalent
to converting from Latin1 to Unicode (UTF-8) for the range of numbers
corresponding to Unicode code points (which applies to all the numbers
that are representable in the octet format).
If you are able to disagree with that, then you are simply being
disagreeable, which doesn't help get the bugs fixed.
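A minimal sketch of the "same number" conversion described above, written in Python rather than Perl and not taken from Perl's implementation. The helper names upgrade and downgrade are hypothetical; they only borrow the thread's terminology. The sketch models the octet format as raw bytes and the multi-byte format as a UTF-8 buffer, and shows that the numbers in the string survive the round trip while the storage changes:

```python
def upgrade(octets: bytes) -> bytes:
    """Model of octet -> multi-byte: byte value N becomes code point N, stored as UTF-8."""
    return octets.decode("latin-1").encode("utf-8")


def downgrade(utf8_buf: bytes) -> bytes:
    """Model of multi-byte -> octet: only possible when every code point is below 256."""
    return utf8_buf.decode("utf-8").encode("latin-1")


octets = bytes([0x41, 0xE9, 0xFF])
upgraded = upgrade(octets)

# The list of numbers making up the string is unchanged by the conversion.
assert [ord(c) for c in upgraded.decode("utf-8")] == list(octets)

# The storage is not: values of 0x80 and above now occupy two bytes each.
assert upgraded.hex() == "41c3a9c3bf"

# Converting back restores the original octets.
assert downgrade(upgraded) == octets
```

Decoding with "latin-1" is used here precisely because it maps byte N to code point N, which is the equivalence the paragraph above argues for.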
>> 2) Windows assumes ANSI code page for 8-bit data, but Perl on Windows,
>> for quite a few releases now, has not... instead, it "assumes" Latin1
>> when "automatically" converting 8-bit to UTF-8.
> This is not what happens. Perl simply does not assume any encoding. If
> you have an 8-bit filename encoded in latin1 then perl doesn't treat it
> any different than an 8-bit filename encoded in koi8-r (another "ANSI"
The conversion of numeric characters from an 8-bit representation to a
UTF8 multi-byte representation within Perl is often referred to as
"assuming a latin1 encoding" in many discussions on this list. You
know, and I know, that it is simply two different representations of the
list of numbers that make up a string. But describing it the other way
helps other people understand it, and it is not particularly false. If
you want to convince people of things, you should attempt to use their
terminology as much as possible, and explain the problems in a way
they'll understand, rather than telling them they don't know what
they are talking about...
> upgrading and downgrading doesn't change that, or at least shouldn't
> change that. where it does, it affects unix as much as any other platform.
It could; are you referring to a particular version of Unix here? And
what is its native 8-bit encoding? I can neither agree nor disagree
with your statement here, without knowing more facts about the unix you
are referring to. And that is why I allowed for other possible settings
for the pragma in my design...
>> Retrofitting Perl on Windows to assume 8-bit data is ANSI will break all
>> code that attempts to work with the constraints of 1 and 2.
> This would probably be true if 1) and 2) were real, but they are not.
They are real; they are just stated in different terms than you prefer
>> somewhat lower performance than assuming Latin1. And it would possibly
>> have prevented, by example of a widely-used platform, the assumption
>> throughout lots of Perl code, that all 8-bit data is assumed to be
>> Latin1 implicitly.
> Perl doesn't do that anywhere on any platform, to my knowledge. Make an
> example of a platform that expects filenames as latin1.
Every time Perl alters the internal UTF8 flag, and correspondingly the
representation of the string data, it makes the assumption that there is
no numeric difference between the octet encoding and the multi-byte
encoding. The only character sets for which this is true are Latin1 and
Unicode, AFAIK; hence the description of Perl assuming Latin1 encoding
for octet data is equivalent, even if you don't care for that
description. I also prefer the description of "numeric equality" for the
numbers in strings, but that is not widely used in this forum, and is
therefore less useful for general discussion.
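The claim that numeric equality holds only for Latin1 can be checked mechanically. The following is a rough Python sketch, not anything Perl does internally; the function name preserves_numbers is hypothetical, and the two non-Latin1 code pages were picked as examples (koi8-r is mentioned in the quoted text, cp1252 is an assumed stand-in for a Windows "ANSI" code page):

```python
def preserves_numbers(codec: str) -> bool:
    """True if every byte value N decodes, under this codec, to code point N."""
    for n in range(256):
        try:
            ch = bytes([n]).decode(codec)
        except UnicodeDecodeError:
            # The codec leaves this byte undefined, so it cannot map to N.
            return False
        if ord(ch) != n:
            return False
    return True


assert preserves_numbers("latin-1")      # byte N is always code point N
assert not preserves_numbers("cp1252")   # e.g. 0x80 is U+20AC, not U+0080
assert not preserves_numbers("koi8_r")   # e.g. 0xC1 is U+0430, not U+00C1
```

Under any 8-bit encoding other than Latin1, converting "by number" therefore yields different characters than decoding by that encoding would.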
> (you can select this under unix, yes, but you can do so under windows as
So there you have answered your own question about platforms. But the
issue arises because Perl for Windows does not require Windows to be
configured to use Latin1 as the default code page; neither does it
convert to or from Latin1 (or anything else) when calling Windows APIs;
but it does assume numerical equality when converting between octet and
multi-byte strings, and that is only valid for Latin1 and Unicode.
Hence, it assumes Latin1 during that conversion.
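A concrete illustration of that mismatch, again as a Python sketch under stated assumptions: the ANSI code page is taken to be cp1252 (the message does not name one), and the filename is invented for the example. The same octets mean one thing to a system that reads them through the code page and another after a conversion by numeric equality:

```python
# Hypothetical 8-bit filename as a cp1252-configured system would store it.
ansi_bytes = "\u20ac.txt".encode("cp1252")   # EURO SIGN followed by ".txt"
assert ansi_bytes == b"\x80.txt"

# What the code page says these octets mean.
by_codepage = ansi_bytes.decode("cp1252")

# What a conversion by numeric equality (byte N -> code point N) produces.
by_number = ansi_bytes.decode("latin-1")

assert ord(by_codepage[0]) == 0x20AC   # EURO SIGN
assert ord(by_number[0]) == 0x80       # a C1 control character
assert by_codepage != by_number
```

Only the first character differs here because the remaining bytes are ASCII, where all of these encodings agree.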
> (the rest of the mail is either true, or depends on these critical but
> wrong assumptions. It is still use that decodes encoding).
I think you omitted a word or more from that last sentence, which I
cannot comprehend. However, I will take your first statement, that the
rest of the email might be true, as an encouraging sign that at least
you read it... but I would be interested to know whether, setting aside
the disagreements you stated above, you think a scheme such as I outlined
could be a helpful solution for Perl, using your mental model of
strings, implicit internal format conversions, and such, which I think
is reasonably accurate, even if it doesn't use the same terminology that
most people on this forum use.
Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking