perl.perl5.porters | Postings from May 2008

Re: on the almost impossibility to write correct XS modules

From: Glenn Linderman
Date: May 19, 2008 18:28
Subject: Re: on the almost impossibility to write correct XS modules
Message ID: 4832292C.6060900@NevCal.com

On approximately 5/19/2008 2:22 PM, came the following characters from 
the keyboard of Marc Lehmann:
> On Mon, May 19, 2008 at 01:34:13PM -0700, Glenn Linderman <perl@NevCal.com> wrote:
>> The gist of the problem here is that
>>
>> 1) The "automatic" conversion of 8-bit to UTF-8 "assumed" Latin1 because 
>> it was (a) easy numerically (b) worked well on platforms that use Latin1 
>> as their native encoding.
> 
> Which platform is that? I really don't know *any* such platform.


You don't have to know of one to figure out that the present scheme 
works fine on such a platform if it exists.

Since it was done this way, I would assume it must have been useful 
somewhere... but perhaps it was just ASCII platforms for which it worked 
well.


> Note also that the automatic conversion in perl doesn't assume any
> encoding *at all*, so this is simply not true.


Perl assumes an encoding for various operations; you've stated that 
yourself.  When I say that Perl assumes an encoding, I am simply 
referring collectively to the set of all Perl operations that assume an 
encoding.

The conversion of internal string formats does assume that every 
character number representable in the octet format (internal UTF8 flag 
turned off) converts to the same number in the multi-byte format 
(internal UTF8 flag turned on).  This is equivalent to converting from 
Latin1 to Unicode (UTF-8), because Latin1 assigns each of the numbers 
0 through 255 to the Unicode code point with the same number, and those 
are all the numbers representable in the octet format.
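
The equivalence described above can be checked with a small model.  This 
is a sketch in Python rather than Perl, and it does not call any Perl 
internals: upgrade_octets() is a hypothetical stand-in for a 
number-preserving format change, and latin1_to_utf8() is an explicit 
transcoding that names Latin1 as the source.

```python
# Hypothetical model of an internal-format upgrade: every octet keeps its
# numeric value and is merely re-expressed as a UTF-8 byte sequence.
def upgrade_octets(octets: bytes) -> bytes:
    return "".join(chr(b) for b in octets).encode("utf-8")


# Explicit transcoding that names Latin1 as the source encoding.
def latin1_to_utf8(octets: bytes) -> bytes:
    return octets.decode("latin-1").encode("utf-8")


# Compare the two conversions over every value an octet can hold.
all_octets = bytes(range(256))
same = upgrade_octets(all_octets) == latin1_to_utf8(all_octets)
print(same)
```

Under this model the two functions agree for all 256 octet values, which 
is the sense in which a number-preserving conversion and a Latin1 
conversion are the same operation.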

If you are able to disagree with that, then you are simply being 
disagreeable, which doesn't help get the bugs fixed.


>> 2) Windows assumes ANSI code page for 8-bit data, but Perl on Windows, 
>> for quite a few releases now, has not... instead, it "assumes" Latin1 
>> when "automatically" converting 8-bit to UTF-8.
> 
> This is not what happens. Perl simply does not assume any encoding. If
> you have an 8-bit filename encoded in latin1 then perl doesn't treat it
> any different than an 8-bit filename encoded in koi8-r (another "ANSI"
> encoding).


The conversion of numeric characters from an 8-bit representation to a 
UTF8 multi-byte representation within Perl is often referred to as 
"assuming a latin1 encoding" in many discussions on this list.  You 
know, and I know, that it is simply two different representations of the 
list of numbers that make up a string.  But describing it the other way 
helps other people understand it, and it is not particularly false.  If 
you want to convince people of things, you should attempt to use their 
terminology as much as possible, and explain the problems in a way 
they'll understand, rather than telling them they don't know what 
they are talking about...


> upgrading and downgrading doesn't change that, or at least shouldn't
> change that. where it does, it affects unix as much as any other platform.


It could; are you referring to a particular version of Unix here?  And 
what is its native 8-bit encoding?  I can neither agree nor disagree 
with your statement here, without knowing more facts about the Unix you 
are referring to.  And that is why I allowed for other possible settings 
for the pragma in my design...


>> Retrofitting Perl on Windows to assume 8-bit data is ANSI will break all 
>> code that attempts to work with the constraints of 1 and 2.
> 
> This would probably be true if 1) and 2) were real, but they are not.


They are real; they are just stated in different terms than you prefer 
to use.


>> somewhat lower performance than assuming Latin1.  And it would possibly 
>> have prevented, by example of a widely-used platform, the assumption 
>> throughout lots of Perl code, that all 8-bit data is assumed to be 
>> Latin1 implicitly.
> 
> Perl doesn't do that anywhere on any platform, to my knowledge. Make an
> example of a platform that expects filenames as latin1.


Every time Perl alters the internal UTF8 flag, and correspondingly the 
representation of the string data, it makes the assumption that there is 
no numeric difference between the octet encoding and the multi-byte 
encoding.  The only character sets for which this is true are Latin1 and 
Unicode, AFAIK; hence the description of Perl assuming Latin1 encoding 
for octet data is equivalent, even if you don't care for that 
description.  I also prefer the description of "numeric equality" for 
the numbers in strings, but that is not widely used in this forum, and 
is therefore less useful for general discussion.
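
The "numeric equality" property can be tested for a few 8-bit encodings. 
This is a Python sketch, not Perl, and it relies on Python's own codec 
tables; mismatches() is a hypothetical helper, and the three encodings 
are merely examples (Latin1, a common Windows ANSI code page, and the 
koi8-r mentioned earlier in the thread).

```python
# Count the octet values whose character in `encoding` is NOT the Unicode
# code point with the same number.  Octets the encoding leaves undefined
# are counted as mismatches too.
def mismatches(encoding: str) -> int:
    count = 0
    for value in range(256):
        try:
            decoded = bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            count += 1  # octet has no assigned character in this encoding
            continue
        if decoded != chr(value):
            count += 1
    return count


results = {name: mismatches(name) for name in ("latin-1", "cp1252", "koi8-r")}
print(results)
```

Of these three, only Latin1 reports zero mismatches; this says nothing 
about encodings not listed, so the "AFAIK" above still applies.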


> (you can select this under unix, yes, but you can do so under windows as
> well).


So there you have answered your own question about platforms.  But the 
issue arises because Perl for Windows does not require Windows to be 
configured to use Latin1 as the default code page; neither does it 
convert to or from Latin1 (or anything else) when calling Windows APIs; 
but it does assume numerical equality when converting between octet and 
multi-byte strings, and that is only valid for Latin1 and Unicode. 
Hence, it assumes Latin1 during that conversion.
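
A concrete case of that mismatch, again modelled in Python rather than 
Perl: assume the ANSI code page is cp1252 (only one possible setting) and 
the octet string holds a euro sign.  The variable names are illustrative 
only.

```python
# The euro sign as a single octet in the cp1252 code page (assumed here to
# be the Windows ANSI code page; other settings are possible).
euro_cp1252 = "\u20ac".encode("cp1252")

# Number-preserving conversion: the octet value becomes the code point.
upgraded = "".join(chr(b) for b in euro_cp1252)

# Conversion that honours the code page the octets were written in.
transcoded = euro_cp1252.decode("cp1252")

print(euro_cp1252, ascii(upgraded), ascii(transcoded))
print(upgraded.encode("utf-8"), transcoded.encode("utf-8"))
```

The number-preserving path turns octet 0x80 into U+0080, a control 
character, whereas the code-page-aware path yields U+20AC; the two 
results differ in their UTF-8 form as well.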


> (the rest of the mail is either true, or depends on these critical but
> wrong assumptions. It is still use that decodes encoding).


I think you omitted a word or more from that last sentence, which I 
cannot comprehend.  However, I will take your first statement, that the 
rest of the email might be true, as an encouraging sign that at least 
you read it... but I would be interested to know whether, setting aside 
the disagreements you stated above, you think a scheme such as I 
outlined could be a helpful solution for Perl, using your mental model 
of strings, implicit internal format conversions, and such, which I 
think is reasonably accurate, even if it doesn't use the same 
terminology that most people on this forum use.


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
