
Re: on the almost impossibility to write correct XS modules

From:
Glenn Linderman
Date:
May 20, 2008 02:10
Subject:
Re: on the almost impossibility to write correct XS modules
Message ID:
48329586.702@NevCal.com
On approximately 5/19/2008 9:20 PM, came the following characters from 
the keyboard of Marc Lehmann:

>> This is equivalent to converting from Latin1 to Unicode (UTF-8) for the
>> range of numbers corresponding to Unicode code points (which applies to
>> all the numbers that are representable in the octet format).
> 
> No, it is not. If the source data isn't latin1-encoded to begin with then
> converting from latin1 to unicode is not a sensible operation to apply.
> 
> automatic upgrade, however, is, and that's because it does not apply any such
> interpretation to the scalar. this is a subtle but crucial difference.


I see the point you are making -- semantics only.

The operations of "automatic upgrade" and "conversion of Latin1 to 
Unicode UTF-8" are equivalent -- numerically, they are not only 
equivalent, but identical transformations -- it is only the semantics of 
them that is different.
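
A minimal sketch of that numeric identity, assuming a Perl 5.8 or later 
with the core Encode module (variable names are illustrative only). It 
runs every octet value through both paths and compares the results:

```perl
use strict;
use warnings;
use Encode ();

# Every octet value 0..255, held in Perl's non-UTF8 (octet) representation.
my $octets = join '', map { chr } 0 .. 255;

# Path 1: upgrade; only the internal representation changes.
my $upgraded = $octets;
utf8::upgrade($upgraded);

# Path 2: explicitly decode the same octets as ISO-8859-1 (Latin1).
my $decoded = Encode::decode('ISO-8859-1', $octets);

# Both paths yield the same sequence of code points...
my $same_codepoints = ($upgraded eq $decoded) ? 1 : 0;

# ...and the same byte sequence when those code points are written as UTF-8.
my $same_utf8_bytes = do {
    my ($x, $y) = ($upgraded, $decoded);
    utf8::encode($x);
    utf8::encode($y);
    $x eq $y ? 1 : 0;
};

print "same code points: $same_codepoints\n";
print "same UTF-8 bytes: $same_utf8_bytes\n";
```

The two paths differ only in what the caller claims about the data, not 
in the resulting numbers.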

So the point you are making is that using precise semantics is important 
to true understanding of the issue; the point I am making is that the 
semantics used by most participants of this forum is sufficient to 
describe the problem in a way that might encourage bugs to be fixed, 
without attempting to fully re-educate them all in a flamewar... their 
model of the world, and of the use of strings, may be more limited 
than yours and mine, regarding strings being sequences of numbers 
instead of sequences of characters, but that is OK... the bugs, and the 
fixes, are the same in both cases!


>> If you are able to disagree with that, then you are simply being 
>> disagreeable, which doesn't help get the bugs fixed.
> 
> "If you don't agree with me you are not helpful"? Now that's a nice strawman
> argument :/


Yep :)  Thought you'd like that one!


>>> This is not what happens. Perl simply does not assume any encoding. If
>>> you have an 8-bit filename encoded in latin1 then perl doesn't treat it
>>> any different than an 8-bit filename encoded in koi8-r (another "ANSI"
>>> encoding).
>> The conversion of numeric characters from an 8-bit representation to a 
>> UTF8 multi-byte representation within Perl is often referred to as 
>> "assuming a latin1 encoding" by many discussions on this list.
> 
> In an informal way, you may well do that. When talking about unicode
> semantics in perl, then being so sloppy will not do, however, because it
> is important that the upgrade process works regardless of any encoding
> (and is reversible).
> 
>> know, and I know, that it is simply two different representations of the 
>> list of numbers that make up a string.
> 
> Unfortunately, perl doesn't really handle it that way. regexes for example
> treat the same number on the perl level differently depending on how it's
> encoded internally.
> 
> And this is a problem.


Yep, that's one of them.  And that is possibly addressed by the proposal 
I made.
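
A small sketch of the regex problem being described, assuming a Perl 
that runs with its default (non-locale) regex rules and without any 
later feature that changes string semantics. The code point 0xE9 is 
just an illustrative choice:

```perl
use strict;
use warnings;

# Code point 0xE9, stored internally as a single octet.
my $native = "\xE9";

# The same code point, stored internally as multi-byte UTF-8.
my $upgraded = $native;
utf8::upgrade($upgraded);

# At the Perl level the two strings compare equal...
my $equal = ($native eq $upgraded) ? 1 : 0;

# ...yet \w may classify them differently, depending on the internal form.
my $native_is_word   = ($native   =~ /\w/) ? 1 : 0;
my $upgraded_is_word = ($upgraded =~ /\w/) ? 1 : 0;

print "equal: $equal\n";
print "native matches \\w: $native_is_word\n";
print "upgraded matches \\w: $upgraded_is_word\n";
```

If the two match results differ while the strings compare equal, the 
invisible internal flag has leaked into program behaviour, which is the 
complaint above.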


> 
>> But describing it the other way 
>> helps other people understand it, and it is not particularly false.
> 
> In my (not small) experience in explaining it to people, telling them
> "not particularly wrong" things about perl's unicode handling scares them,
> because they do not want that perl interprets their, say, koi8-r data as
> latin1 in any way.


I think it is fine for you to teach Perl programmers the one true string 
model inside Perl.  I just wish there was one :)  And so do you.  And so 
your students will be smarter than the rest of the Perl programmers.  We 
need more smart Perl programmers.

Maybe someday we can fix the documentation.  I'm more interested in 
seeing if we can explain the problem in a manner that is understandable, 
so that we can get Perl fixed.  I wish I was smart enough and had time 
enough to fix Perl myself, but failing that, I attempt to communicate 
with those that are smart enough to fix Perl.


>> you want to convince people of things, you should attempt to use their 
>> terminology as much as possible, and explain the problems in a way 
>> they'll understand it, rather than telling them they don't know what 
>> they are talking about...
> 
> Well, some people, like jan, clearly don't understand the issues.


Well, it seems Jan is smart enough to produce Perl for Windows, buggy 
though it is... and I bet he's smart enough to fix this problem, 
although I don't know if he has the time or inclination.


> Also, my terminology is their terminology. Perl simply doesn't interpret
> your string as latin1 when upgrading. That's a fact. In your or my
> terminology.


The upgrade operation is exactly that of upgrading Latin1 to Unicode 
UTF-8.  The semantics may or may not be, depending on the encoding of 
the underlying data.  But since folks want to talk about Latin1 to 
Unicode UTF-8 upgrade operations, I'm happy to point out the problems we 
encounter in those terms.



>> Every time Perl alters the internal UTF8 flag, and correspondingly the 
>> representation of the string data, it makes the assumption that there is 
>> no numeric difference between the octet encoding and the multi-bytes 
>> encoding.
> 
> Exactly. It makes no assumption about the character encoding itself, because
> the function is encoding-agnostic. It doesn't interpret your data as latin1.
> 
>> The only character sets for which this is true is Latin1 and 
>> Unicode, AFAIK
> 
> It is true for other encodings as well, such as ascii.


Yet you discarded my ASCII platform argument earlier...  I'm well aware 
that a subset of a subset is a subset of the original... that ASCII is a 
subset of Latin1, and Latin1 is a subset of Unicode, and therefore ASCII 
is a subset of Unicode.


> In fact, here is a good example why forcing an encoding interpretation on
> upgrading/downgrading is wrong: the assumption of no numeric difference
> between upgrading and downgrading is true for *any* 8-bit encoding
> and it is also true for *any* codeset, simply because the numbers do not
> change.
> 
> If you have koi8-r data (which is not compatible to latin1), then
> upgrading and downgrading will not alter the fact that it is koi8-r data
> (in current perls, and outside e.g. the buggy win32 module which enforces
> different interpretation and breaks if strings get upgraded).


If you have koi8-r data, you don't ever need to upgrade or downgrade it, 
and if you do, you don't have koi8-r data any more.  koi8-r is an 8-bit 
encoding.  http://en.wikipedia.org/wiki/KOI8-R

So if you upgrade a string containing koi8-r data, you have "numbers 
expressed in Unicode UTF-8 structural format (multibytes) that 
correspond to the 8-bit encoding of koi8-r", not koi8-r itself.  Nor do 
you have Unicode UTF-8.  And if you downgrade it, you have koi8-r string 
data again.

However, if you have Latin1, and perform the same transformation, you 
have Unicode UTF-8.

And if you have numbers, you still have the same numeric values.

All three are equivalent transformations.


> This is why your enforcing of such an interpretation is scary, because many
> people still handle such data, and they need the safety that perl doesn't
> tinker with their characters silently.


Sure.  But it does tinker with them, if you upgrade and/or downgrade. 
The nice thing about Perl's upgrade and downgrade is that for Latin1 (or 
the set of numbers between 0 and 255 inclusive) the operations are 
inverses... if you perform both, you have the same thing you started with.
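
A sketch of that round trip, assuming the core Encode module is 
available; the sample word and variable names are hypothetical. It 
takes KOI8-R octets, upgrades them, downgrades them again, and only 
then interprets them as KOI8-R:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample: three Cyrillic letters encoded as KOI8-R octets.
my $chars = "\x{43C}\x{438}\x{440}";
my $koi8  = Encode::encode('KOI8-R', $chars);
my $copy  = $koi8;

# Upgrade: the internal bytes change, the numbers 0..255 do not.
utf8::upgrade($copy);
my $flag_after_upgrade   = utf8::is_utf8($copy) ? 1 : 0;
my $equal_while_upgraded = ($copy eq $koi8) ? 1 : 0;

# Downgrade: back to one octet per number.
utf8::downgrade($copy);
my $flag_after_downgrade = utf8::is_utf8($copy) ? 1 : 0;

# Interpretation happens only here, where the caller names the encoding.
my $round_trip_ok = (Encode::decode('KOI8-R', $copy) eq $chars) ? 1 : 0;

print "upgraded flag: $flag_after_upgrade, equal: $equal_while_upgraded\n";
print "downgraded flag: $flag_after_downgrade, round trip ok: $round_trip_ok\n";
```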



> The transformation as it is was not chosen because of your two reasons. It
> was chosen because it doesn't alter the string on the perl level. If you take
> a string and upgrade it and dissect it, it will contain the same codepoints.


I don't know this for certain; I wasn't the one that made the decision, 
nor was I involved with any prior discussion, nor have I read the 
discussion from that timeframe.  I doubt you have either.

I strongly suspect that the real reason for choosing this transformation 
was because it was easy, and could be justified as a Latin1 to Unicode 
UTF-8 conversion, when the transformation is perceived as operating on a 
string, rather than a sequence of small numbers.


> Any other transformation (like the one proposed by jan) doesn't have this
> property, and since it isn't documented when perl does these upgrades and
> downgrades, this is exactly why that proposal is broken by design.


Jan's basic proposal could have been made to work, had it been done when 
Unicode was introduced to Perl.  It seems to be much too late to 
implement Jan's proposal now, many years later, as too much code would 
break.  That is why, when I perceived that there could be a proposal 
that could work, with minimal to no breakage, and providing a migration 
path to a usable "Perl for Windows supporting Unicode", that I spent the 
time to propose it.

I am working on applications now that I may have to abandon Perl for, 
because of the limitations of Perl for Windows handling of Unicode. 
This saddens me, because otherwise Perl is a great language.


> The problem is mainly bugs, so while valid, I don't see how one could keep
> compatibility, because the question is what to keep compatibility to -
> 5.8, 5.6, 5.005, 5.10? choose one, all are different.


Seems like the primary concern is compatibility with the prior release. 
So 5.10 is the target.  And the secondary concern is compatibility 
with CPAN, and there again, 5.10 is the target.

Sure there are bugs.  And of course, Perl for Windows inherits the other 
bugs in Perl at large with respect to Unicode handling, such as regexp, 
and toupper, etc.  However, it seems to be that the primary set of 
Windows-specific bugs that prevents Perl for Windows from handling 
Unicode properly is in the use of APIs that deal with file names, and 
the primary reason is a character encoding mismatch, and the secondary 
reason (which limits the set of characters that can be used in file 
names in Perl for Windows programs) is the use of the 8-bit API instead 
of the 16-bit API.  My proposal addressed both of those issues.



> I am also not sure whether programs rely on the broken semantics - my
> experience is that e.g. reading a filename (%APPDATA%) from an environment
> variable and trying to access files that way doesn't work when ansi and
> unicode disagree on encoding (which is the case even on my latin1
> system, btw.)


I seem to recall discussions on this list about how to work around some 
of these file naming issues, in a variety of ways; pre-transcoding from 
local character set to ANSI, and using special packages that use 16-bit 
Windows APIs are two "workarounds" that I recall, which address parts of 
the problems in different ways.  So I think you can be sure that there 
are programs that attempt to work around the broken semantics.


> But then, perl on windows is differently broken depending on which perl
> you use - activestate has a really broken fork for example, and handles
> filenames differently than other perls on windows.
> 
> I am not sure how many people really rely on that behaviour, and I am not
> sure if this couldn't be just fixed by enforcing a single encoding.


As far as I'm concerned, I'd be willing/delighted to convert my code 
when I upgrade to a fixed Perl for Windows. But it seems that retaining 
compatibility with the prior release is important in this forum.



> But my experience is limited - I know the windows APIs and the problems
> associated with not having a single format in which to store filenames.
> 
> On unix, this is positively better, as there is only ever a single format
> to store filenames in that works regardless of locale (the problems start
> when you interpret these filenames).


To the inexperienced (I haven't used Unix versions that support Unicode, 
my last Unix box was about 6 years ago, although I hope to start to play 
with Linux soon), it sounds like the Windows scheme of requiring 
filenames to be in Unicode, or some known/configured encoding that can 
be converted to Unicode, is well-defined.  You don't describe enough 
details of the Unix scheme, but indicating that there are problems 
attempting to interpret the file names sounds frightening.


> So I will only comment on E and F.
> 
> I think the pragma already exists, namely "use locale".


"locale" certainly exists, but it seems that it doesn't solve this 
problem with filenames.  Hence my suggestion for a different pragma.


> If I "use locale" in my program, I would expect perl to apply the current
> locale to any strings, in regexes or elsewhere (to the extent possible).


That would be a reasonable expectation, but I hear locale is even more 
broken than Unicode in Perl for Windows.


> If I don't "use locale", then I would expect regexes to interpret my strings
> as unicode, regardless of the utf-8 flag, which I can't see in my source.
> (the "surprising" behaviour).


The goal of my proposal was to migrate to achieving this.  We don't have 
it now, but it is a good goal.


> Regarding filenames, this is very easy on unix: all filenames are
> interpreted as octet strings, no specific encoding (perl cannot know the
> encoding of filenames on unix), so the functions all have to downgrade,
> and if that fails, we have a bug (filenames are not locale-dependent
> on unix, they are simply octet strings where only "/" and \000 are
> interpreted).
> 
> (if it does not fail, it might still be a bug, but we cannot detect this).


Ah, more details about filenames.  Well, this sounds positively weird. 
Octet strings are not particularly user-friendly, if you can't interpret 
them as characters reliably.

From what you say, and what I think I've heard elsewhere, Unix filename 
interpretation is a mess.  Seems like the only bigger mess I've heard 
about is VMS file handling, where they seem to have a choice of several 
messes.
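
For reference, a rough sketch of the "downgrade, and treat failure as 
a bug" rule for Unix filenames as described in the quoted text. The 
helper name filename_octets is hypothetical, not an existing Perl API; 
it relies only on utf8::downgrade with its FAIL_OK argument:

```perl
use strict;
use warnings;

# Hypothetical helper: return the octet form of a filename, or undef when
# the string holds a code point above 255 and so cannot be an octet string.
sub filename_octets {
    my ($name) = @_;
    my $copy = $name;                        # leave the caller's scalar alone
    return utf8::downgrade($copy, 1) ? $copy : undef;   # 1 means FAIL_OK
}

# The same numbers in both internal representations give the same octets.
my $plain    = "caf\xE9.txt";
my $upgraded = $plain;
utf8::upgrade($upgraded);

my $from_plain    = filename_octets($plain);
my $from_upgraded = filename_octets($upgraded);

# A code point above 255 has no octet form; the caller must encode it first.
my $wide = filename_octets("\x{263A}.txt");

print "same octets: ", ($from_plain eq $from_upgraded ? 1 : 0), "\n";
print "wide name rejected: ", (defined $wide ? 0 : 1), "\n";
```

As the quoted text notes, a successful downgrade does not prove the 
octets mean what the program intended; it only shows that they fit.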


> I know "use locale" has weird side effects, but it basically boils down to
> what perluniintro calls "native 8-bit encoding" (fortunately, it is not
> even limited to 8-bit).
> 
> even if there were need for a new pragma, I wouldn't call it
> "compatibility", because both behaviours are useful. The difference is
> that I can control which interpretation is applied to my strings and do
> not have to rely on an invisible flag on my scalars.


"compatibility" was a name I used for one value of the pragma, not the 
pragma itself.  But I don't care about the names... I didn't even give 
the pragma a name.

But your comment makes me wonder; if you think the "compatibility" mode 
is useful, why are you complaining about bugs?  I suggested the 
"compatibility" mode as a way to retain compatibility with the 
workarounds for the existing bugs, so that code could be slowly migrated 
to "Unicode" mode, which would fix all the bugs.


> But then, "locale" maps exactly on the concept of "native encoding",
> because my unix process might run in a locale using koi8-r, and then I
> would want a way to take advantage of the locale w.r.t. interpreting my
> koi8-r data. (do not get confused by the mention of POSIX in the locale
> manpage, locales are an ISO-C thing and ought to exist on windows as
> well.)


Locales do exist on Windows.


> So for me, this is not a compatibility issue - right now, I don't think
> anybody relies on the utf-8 flag behaviour in perl (a great deal has
> changed between 5.6 and 5.8, and less has changed between 5.8 and 5.10, so
> those programs need fixing already).


Every Perl for Windows program that attempts to access file names 
containing extra-ASCII characters contains some sort of workaround, as I 
understand it.  So there is a compatibility issue... one I'd be willing 
to deal with as a one-time cost of converting to a fixed Perl for 
Windows that supports Unicode well... but for which I can see a 
migration path which I outlined.


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


