Re: on the almost impossibility to write correct XS modules
From: Glenn Linderman
Date: May 20, 2008 02:10
Subject: Re: on the almost impossibility to write correct XS modules
Message ID: 48329586.702@NevCal.com
On approximately 5/19/2008 9:20 PM, came the following characters from
the keyboard of Marc Lehmann:
>> This is equivalent to converting from Latin1 to Unicode (UTF-8) for the
>> range of numbers corresponding to Unicode code points (which applies to
>> all the numbers that are representable in the octet format).
>
> No, it is not. If the source data isn't latin1-encoded to begin with then
> converting from latin1 to unicode is not a sensible operation to apply.
>
> automatic upgrade, however, is, and that's because it does not apply any such
> interpretation to the scalar. this is a subtle but crucial difference.
I see the point you are making -- semantics only.
The operations of "automatic upgrade" and "conversion of Latin1 to
Unicode UTF-8" are equivalent -- numerically, they are not only
equivalent, but identical transformations -- it is only the semantics of
them that is different.
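
The numerical claim can be checked with a small model. The sketch below is in Python rather than Perl and does not touch Perl's internals; `upgrade` and `latin1_to_utf8` are hypothetical helper names, and modelling an upgrade as "store each number 0..255 in UTF-8's multi-byte form" is an assumption taken from the description above.

```python
def upgrade(octets: bytes) -> bytes:
    """Model of an upgrade: keep each number 0..255, store it in UTF-8 form."""
    return "".join(chr(n) for n in octets).encode("utf-8")


def latin1_to_utf8(octets: bytes) -> bytes:
    """Conventional transcoding: read the octets as Latin-1, write UTF-8."""
    return octets.decode("latin-1").encode("utf-8")


all_octets = bytes(range(256))

# The two operations produce byte-for-byte identical output.
same_bytes = upgrade(all_octets) == latin1_to_utf8(all_octets)

# The sequence of numbers in the string is unchanged by the operation.
numbers_kept = [ord(c) for c in upgrade(all_octets).decode("utf-8")] == list(range(256))

print(same_bytes, numbers_kept)  # prints: True True
```

Under this model the two readings differ only in what the numbers are taken to mean, not in the bytes produced.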
So the point you are making is that using precise semantics is important
to true understanding of the issue; the point I am making is that the
semantics used by most participants of this forum is sufficient to
describe the problem in a way that might encourage bugs to be fixed,
without attempting to fully re-educate them all in a flamewar... their
model of the world, and of the use of strings, may be more limited
than yours and mine, regarding strings being sequences of numbers
instead of sequences of characters, but that is OK... the bugs, and the
fixes, are the same in both cases!
>> If you are able to disagree with that, then you are simply being
>> disagreeable, which doesn't help get the bugs fixed.
>
> "If you don't agree with me you are not helpful"? Now that's a nice strawman
> argument :/
Yep :) Thought you'd like that one!
>>> This is not what happens. Perl simply does not assume any encoding. If
>>> you have an 8-bit filename encoded in latin1 then perl doesn't treat it
>>> any different than an 8-bit filename encoded in koi8-r (another "ANSI"
>>> encoding).
>> The conversion of numeric characters from an 8-bit representation to a
>> UTF8 multi-byte representation within Perl is often referred to as
>> "assuming a latin1 encoding" by many discussions on this list.
>
> In an informal way, you may well do that. When talking about unicode
> semantics in perl, then being so sloppy will not do, however, because it
> is important that the upgrade process works regardless of any encoding
> (and is reversible).
>
>> know, and I know, that it is simply two different representations of the
>> list of numbers that make up a string.
>
> Unfortunately, perl doesn't really handle it that way. regexes for example
> treat the same number on the perl level differently depending on how it's
> encoded internally.
>
> And this is a problem.
Yep, that's one of them. And that is possibly addressed by the proposal
I made.
>
>> But describing it the other way
>> helps other people understand it, and it is not particularly false.
>
> In my (not small) experience in explaining it to people, telling them
> "not particularly wrong" things about perl's unicode handling scares them,
> because they do not want that perl interprets their, say, koi8-r data as
> latin1 in any way.
I think it is fine for you to teach Perl programmers the one true string
model inside Perl. I just wish there was one :) And so do you. And so
your students will be smarter than the rest of the Perl programmers. We
need more smart Perl programmers.
Maybe someday we can fix the documentation. I'm more interested in
seeing if we can explain the problem in a manner that is understandable,
so that we can get Perl fixed. I wish I was smart enough and had time
enough to fix Perl myself, but failing that, I attempt to communicate
with those that are smart enough to fix Perl.
>> you want to convince people of things, you should attempt to use their
>> terminology as much as possible, and explain the problems in a way
>> they'll understand it, rather than telling them they don't know what
>> they are talking about...
>
> Well, some people, like jan, clearly don't understand the issues.
Well, it seems Jan is smart enough to produce Perl for Windows, buggy
though it is... and I bet he's smart enough to fix this problem,
although I don't know if he has the time or inclination.
> Also, my terminology is their terminology. Perl simply doesn't interpret
> your string as latin1 when upgrading. That's a fact. In your or my
> terminology.
The upgrade operation is exactly that of upgrading Latin1 to Unicode
UTF-8. The semantics may or may not be, depending on the encoding of
the underlying data. But since folks want to talk about Latin1 to
Unicode UTF-8 upgrade operations, I'm happy to point out the problems we
encounter in those terms.
>> Every time Perl alters the internal UTF8 flag, and correspondingly the
>> representation of the string data, it makes the assumption that there is
>> no numeric difference between the octet encoding and the multi-bytes
>> encoding.
>
> Exactly. It makes no assumption about the character encoding itself, because
> the function is encoding-agnostic. It doesn't interpret your data as latin1.
>
>> The only character sets for which this is true is Latin1 and
>> Unicode, AFAIK
>
> It is true for other encodings as well, such as ascii.
Yet you discarded my ASCII platform argument earlier... I'm well aware
that a subset of a subset is a subset of the original... that ASCII is a
subset of Latin1, and Latin1 is a subset of Unicode, and therefore ASCII
is a subset of Unicode.
> In fact, here is a good example why forcing an encoding interpretation on
> upgrading/downgrading is wrong: the assumption of no numeric difference
> between upgrading and downgrading is true for *any* 8-bit encoding
> and it is also true for *any* codeset, simply because the numbers do not
> change.
>
> If you have koi8-r data (which is not compatible to latin1), then
> upgrading and downgrading will not alter the fact that it is koi8-r data
> (in current perls, and outside e.g. the buggy win32 module which enforces
> different interpretation and breaks if strings get upgraded).
If you have koi8-r data, you don't ever need to upgrade or downgrade it,
and if you do, you don't have koi8-r data any more. koi8-r is an 8-bit
encoding. http://en.wikipedia.org/wiki/KOI8-R
So if you upgrade a string containing koi8-r data, you have "numbers
expressed in Unicode UTF-8 structural format (multibytes) that
correspond to the 8-bit encoding of koi8-r", not koi8-r itself. Nor do
you have Unicode UTF-8. And if you downgrade it, you have koi8-r string
data again.
However, if you have Latin1, and perform the same transformation, you
have Unicode UTF-8.
And if you have numbers, you still have the same numeric values.
All three are equivalent transformations.
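
A small sketch of the KOI8-R case, written in Python as a model rather than with Perl's own upgrade and downgrade routines. The helper names `upgrade` and `downgrade` and the sample word are assumptions made for illustration only.

```python
def upgrade(octets: bytes) -> bytes:
    """Model: store each number 0..255 in UTF-8's multi-byte form."""
    return octets.decode("latin-1").encode("utf-8")


def downgrade(multibyte: bytes) -> bytes:
    """Model of the inverse: back to one octet per number (needs all values <= 255)."""
    return multibyte.decode("utf-8").encode("latin-1")


koi8 = "файл".encode("koi8_r")      # four octets of KOI8-R data
upgraded = upgrade(koi8)

numbers_before = list(koi8)
numbers_after = [ord(c) for c in upgraded.decode("utf-8")]

# Read as Unicode, the upgraded form spells Latin-1 letters, not Cyrillic ones.
as_unicode = upgraded.decode("utf-8")

# Downgrading restores the original KOI8-R octets exactly.
round_trip = downgrade(upgraded)

print(numbers_before == numbers_after)   # prints: True
print(as_unicode == "файл")              # prints: False
print(round_trip.decode("koi8_r"))       # prints: файл
```

The numbers survive in both directions; what changes is whether the stored bytes can be handed to something that expects KOI8-R octets.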
> This is why your enforcing of such an interpretation is scary, because many
> people still handle such data, and they need the safety that perl doesn't
> tinker with their characters silently.
Sure. But it does tinker with them, if you upgrade and/or downgrade.
The nice thing about Perl's upgrade and downgrade is that for Latin1 (or
the set of numbers between 0 and 255 inclusive) the operations are
inverses... if you perform both, you have the same thing you started with.
> The transformation as it is was not chosen because of your two reasons. It
> was chosen because it doesn't alter the string on the perl level. If you take
> a string and upgrade it and dissect it, it will contain the same codepoints.
I don't know this for certain; I wasn't the one that made the decision,
nor was I involved with any prior discussion, nor have I read the
discussion from that timeframe. I doubt you have either.
I strongly suspect that the real reason for choosing this transformation
was because it was easy, and could be justified as a Latin1 to Unicode
UTF-8 conversion, when the transformation is perceived as operating on a
string, rather than a sequence of small numbers.
> Any other transformation (like the one proposed by jan) doesn't have this
> property, and since it isn't documented when perl does these upgrades and
> downgrades, this is exactly why that proposal is broken by design.
Jan's basic proposal could have been made to work, had it been done when
Unicode was introduced to Perl. It seems to be much too late to
implement Jan's proposal now, many years later, as too much code would
break. That is why, when I perceived that there could be a proposal
that could work, with minimal to no breakage, and providing a migration
path to a usable "Perl for Windows supporting Unicode", I spent the
time to propose it.
I am working on applications now that I may have to abandon Perl for,
because of the limitations of Perl for Windows handling of Unicode.
This saddens me, because otherwise Perl is a great language.
> The problem is mainly bugs, so while valid, I don't see how one could keep
> compatibility, because the question is what to keep compatibility to -
> 5.8, 5.6, 5.005, 5.10? choose one, all are different.
Seems like the primary concern is compatibility with the prior release.
So 5.10 is the target. And the secondary concern is compatibility
with CPAN, and there again, 5.10 is the target.
Sure there are bugs. And of course, Perl for Windows inherits the other
bugs in Perl at large with respect to Unicode handling, such as regexp,
and toupper, etc. However, it seems to me that the primary set of
Windows-specific bugs that prevents Perl for Windows from handling
Unicode properly is in the use of APIs that deal with file names, and
the primary reason is a character encoding mismatch, and the secondary
reason (which limits the set of characters that can be used in file
names in Perl for Windows programs) is the use of the 8-bit API instead
of the 16-bit API. My proposal addressed both of those issues.
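
Both reasons can be illustrated with a rough sketch in Python; it does not call any Windows API. The code page `cp1252` is an assumed example of a system's 8-bit encoding, and `fits_8bit_api` and `for_16bit_api` are hypothetical helper names.

```python
ANSI_CODEPAGE = "cp1252"  # assumption: a Western European 8-bit code page


def fits_8bit_api(name: str, codepage: str = ANSI_CODEPAGE) -> bool:
    """True if every character of the name exists in the 8-bit code page."""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


def for_16bit_api(name: str) -> bytes:
    """UTF-16 code units, the form a 16-bit file-name API would take."""
    return name.encode("utf-16-le")


# Secondary reason: the 8-bit form limits which names can be expressed at all.
print(fits_8bit_api("résumé.txt"))   # prints: True
print(fits_8bit_api("файл.txt"))     # prints: False

# Primary reason: an encoding mismatch. Multi-byte (upgraded) storage read as
# if it were 8-bit code page data yields a different name.
upgraded_buffer = "é".encode("utf-8")
misread = upgraded_buffer.decode(ANSI_CODEPAGE)
print(misread)                       # prints: Ã©
```

A name converted with `for_16bit_api` avoids both problems in this model, since every character has a UTF-16 form.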
> I am also not sure whether programs rely on the broken semantics - my
> experience is that e.g. reading a filename (%APPDATA%) from an environment
> variable and trying to access files that way doesn't work when ansi and
> unicode disagree on encoding (which is the case even on my latin1
> system, btw.)
I seem to recall discussions on this list about how to work around some
of these file naming issues, in a variety of ways; pre-transcoding from
local character set to ANSI, and using special packages that use 16-bit
Windows APIs are two "workarounds" that I recall, which address parts of
the problems in different ways. So I think you can be sure that there
are programs that attempt to work around the broken semantics.
> But then, perl on windows is differently broken depending on which perl
> you use - activestate has a really broken fork for example, and handles
> filenames differently than other perls on windows.
>
> I am not sure how many people really rely on that behaviour, and I am not
> sure if this couldn't be just fixed by enforcing a single encoding.
As far as I'm concerned, I'd be willing/delighted to convert my code
when I upgrade to a fixed Perl for Windows. But it seems that retaining
compatibility with the prior release is important in this forum.
> But my experience is limited - I know the windows APIs and the problems
> associated with not having a single format in which to store filenames.
>
> On unix, this is positively better, as there is only ever a single format
> to store filenames in that works regardless of locale (the problems start
> when you interpret these filenames).
To the inexperienced (I haven't used Unix versions that support Unicode,
my last Unix box was about 6 years ago, although I hope to start to play
with Linux soon), it sounds like the Windows scheme of requiring
filenames to be in Unicode, or some known/configured encoding that can
be converted to Unicode, is well-defined. You don't describe enough
details of the Unix scheme, but indicating that there are problems
attempting to interpret the file names sounds frightening.
> So I will only comment on E and F.
>
> I think the pragma already exists, namely "use locale".
"locale" certainly exists, but it seems that it doesn't solve this
problem with filenames. Hence my suggestion for a different pragma.
> If I "use locale" in my program, I would expect perl to apply the current
> locale to any strings, in regexes or elsewhere (to the extent possible).
That would be a reasonable expectation, but I hear locale is even more
broken than Unicode in Perl for Windows.
> If I don't "use locale", then I would expect regexes to interpret my strings
> as unicode, regardless of the utf-8 flag, which I can't see in my source.
> (the "surprising" behaviour).
The goal of my proposal was to migrate to achieving this. We don't have
it now, but it is a good goal.
> Regarding filenames, this is very easy on unix: all filenames are
> interpreted as octet strings, no specific encoding (perl cannot know the
> encoding of filenames on unix), so the functions all have to downgrade,
> and if that fails, we have a bug (filenames are not locale-dependent
> on unix, they are simply octet strings where only "/" and \000 are
> interpreted).
>
> (if it does not fail, it might still be a bug, but we cannot detect this).
Ah, more details about filenames. Well, this sounds positively weird.
Octet strings are not particularly user-friendly, if you can't interpret
them as characters reliably.
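
The difficulty of interpreting octet-string names can be shown with a short Python sketch. The sample name is an assumption, and nothing here touches a real file system.

```python
# One sequence of octets, as a Unix file system would store the name.
name_octets = "файл".encode("koi8_r")

# The same octets read under two different 8-bit encodings.
as_koi8 = name_octets.decode("koi8_r")
as_latin1 = name_octets.decode("latin-1")

# They are not valid UTF-8 at all, so a UTF-8 locale cannot display them.
try:
    name_octets.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Only "/" and NUL are interpreted; neither occurs in this name.
legal_component = b"/" not in name_octets and b"\x00" not in name_octets

print(as_koi8)           # prints: файл
print(as_latin1)         # prints: ÆÁÊÌ
print(valid_utf8)        # prints: False
print(legal_component)   # prints: True
```

The name is a legal file name in every case; which characters a user sees depends entirely on an encoding the file system does not record.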
From what you say, and what I think I've heard elsewhere, Unix filename
interpretation is a mess. Seems like the only bigger mess I've heard
about is VMS file handling, where they seem to have a choice of several
messes.
> I know "use locale" has weird side effects, but it basically boils down to
> what perluniintro calls "native 8-bit encoding" (fortunately, it is not
> even limited to 8-bit).
>
> even if there were need for a new pragma, I wouldn't call it
> "compatibility", because both behaviours are useful. The difference is
> that I can control which interpretation is applied to my strings and do
> not have to rely on an invisible flag on my scalars.
"compatibility" was a name I used for one value of the pragma, not the
pragma itself. But I don't care about the names... I didn't even give
the pragma a name.
But your comment makes me wonder; if you think the "compatibility" mode
is useful, why are you complaining about bugs? I suggested the
"compatibility" mode as a way to retain compatibility with the
workarounds for the existing bugs, so that code could be slowly migrated
to "Unicode" mode, which would fix all the bugs.
> But then, "locale" maps exactly on the concept of "native encoding",
> because my unix process might run in a locale using koi8-r, and then I
> would want a way to take advantage of the locale w.r.t. interpreting my
> koi8-r data. (do not get confused by the mention of POSIX in the locale
> manpage, locales are an ISO-C thing and ought to exist on windows as
> well.)
Locales do exist on Windows.
> So for me, this is not a compatibility issue - right now, I don't think
> anybody relies on the utf-8 flag behaviour in perl (a great deal has
> changed between 5.6 and 5.8, and less has changed between 5.8 and 5.10, so
> those programs need fixing already).
Every Perl for Windows program that attempts to access file names
containing extra-ASCII characters contains some sort of workaround, as I
understand it. So there is a compatibility issue... one I'd be willing
to deal with as a one-time cost of converting to a fixed Perl for
Windows that supports Unicode well... but for which I can see a
migration path which I outlined.
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking