Re: on the almost impossibility to write correct XS modules
From: Marc Lehmann
Date: May 19, 2008 21:20
Subject: Re: on the almost impossibility to write correct XS modules
Message ID: 20080520042028.GA16896@schmorp.de
On Mon, May 19, 2008 at 06:28:12PM -0700, Glenn Linderman <perl@NevCal.com> wrote:
> >>1) The "automatic" conversion of 8-bit to UTF-8 "assumed" Latin1 because
> >>it was (a) easy numerically (b) worked well on platforms that use Latin1
> >>as their native encoding.
> >
> >Which platform is that? I really don't know *any* such platform.
>
> You don't have to know of one to figure out that the present scheme
> works fine on such a platform if it exists.
True, but you stated that those platforms were the reason why the automatic
conversion worked that way.
If no such platform exists, your argument is moot, because nobody would
implement a scheme because it is useful on no platform.
> Since it was done this way, I would assume it must have been useful
> somewhere... but perhaps it was just ASCII platforms for which it worked
> well.
Whatever an ascii platform is, it would be fine with just about any other
such conversion.
> >Note also that the automatic conversion in perl doesn't assume any
> >encoding *at all*, so this is simply not true.
>
> Perl assumes an encoding for various operations; you've stated that. My
Yes, but automatic upgrade is _not_ one of them.
> saying that Perl assumes an encoding, is simply a collection: the set of
> all Perl operations that assume an encoding.
Fine, but automatic upgrade, which is what you were talking about, is not in
that set. Point being?
> The conversion of internal string formats does assume that all the
> characters representable by various numbers in the octet format
> (internal UTF8 flag turned off) convert to the same number in the
> multi-bytes format (internal UTF8 flag turned on).
Yes.
> This is equivalent to converting from Latin1 to Unicode (UTF-8) for the
> range of numbers corresponding to Unicode code points (which applies to
> all the numbers that are representable in the octet format).
No, it is not. If the source data isn't latin1-encoded to begin with, then
converting from latin1 to unicode is not a sensible operation to apply.
Automatic upgrade, however, is, and that's because it does not apply any such
interpretation to the scalar. This is a subtle but crucial difference.
> If you are able to disagree with that, then you are simply being
> disagreeable, which doesn't help get the bugs fixed.
"If you don't agree with me you are not helpful"? Now that's a nice strawman
argument :/
> >This is not what happens. Perl simply does not assume any encoding. If
> >you have an 8-bit filename encoded in latin1 then perl doesn't treat it
> >any different than an 8-bit filename encoded in koi8-r (another "ANSI"
> >encoding).
>
> The conversion of numeric characters from an 8-bit representation to a
> UTF8 multi-byte representation within Perl is often referred to as
> "assuming a latin1 encoding" by many discussions on this list.
In an informal way, you may well do that. When talking about unicode
semantics in perl, however, being so sloppy will not do, because it
is important that the upgrade process works regardless of any encoding
(and is reversible).
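
To make the "regardless of any encoding, and reversible" point concrete, here
is a minimal sketch. The byte values are arbitrary examples chosen for
illustration; they could be latin1, koi8-r or anything else. It only uses the
utf8::upgrade, utf8::downgrade and utf8::is_utf8 functions that ship with perl.

```perl
use strict;
use warnings;

# Three arbitrary octets; perl is not told which 8-bit encoding they are in.
my $s = join '', map { chr } 0xC1, 0xB1, 0xE9;
my @before = map { ord } split //, $s;

utf8::upgrade($s);                        # switch to the internal UTF-8 form
my $flag_up  = utf8::is_utf8($s) ? 1 : 0;
my @upgraded = map { ord } split //, $s;

utf8::downgrade($s);                      # switch back to the octet form
my $flag_down = utf8::is_utf8($s) ? 1 : 0;
my @after     = map { ord } split //, $s;

print "before:     @before\n";
print "upgraded:   @upgraded (flag=$flag_up)\n";
print "downgraded: @after (flag=$flag_down)\n";
```

All three lists hold the same numbers; only the internal flag changes.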
> know, and I know, that it is simply two different representations of the
> list of numbers that make up a string.
Unfortunately, perl doesn't really handle it that way. Regexes, for example,
treat the same number on the perl level differently depending on how it is
encoded internally.
And this is a problem.
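
A small sketch of that regex problem, assuming a perl where neither
"use locale" nor any feature that changes string semantics is in effect. The
two scalars compare equal, yet a character class such as \w may give
different answers for them (typically the downgraded copy does not match and
the upgraded one does).

```perl
use strict;
use warnings;

my $down = "\xe9";        # one character, code point 233, octet form
my $up   = $down;
utf8::upgrade($up);       # same code point, internal UTF-8 form

my $equal   = $down eq $up ? "yes" : "no";
my $down_w  = $down =~ /\w/ ? "yes" : "no";
my $up_w    = $up   =~ /\w/ ? "yes" : "no";

print "equal as strings:      $equal\n";
print "downgraded matches \\w: $down_w\n";
print "upgraded matches \\w:   $up_w\n";
```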
> But describing it the other way
> helps other people understand it, and it is not particularly false.
In my (not small) experience in explaining it to people, telling them
"not particularly wrong" things about perl's unicode handling scares them,
because they do not want perl to interpret their, say, koi8-r data as
latin1 in any way.
> you want to convince people of things, you should attempt to use their
> terminology as much as possible, and explain the problems in a way
> they'll understand it, rather than telling them they don't know what
> they are talking about...
Well, some people, like jan, clearly don't understand the issues.
Also, my terminology is their terminology. Perl simply doesn't interpret
your string as latin1 when upgrading. That's a fact. In your or my
terminology.
> >upgrading and downgrading doesn't change that, or at least shouldn't
> >change that. where it does, it affects unix as much as any other platform.
>
> It could; are you referring to a particular version of Unix here?
No, all versions are the same here, right down to good old POSIX, or even
ISO-C.
> And what is its native 8-bit encoding?
Unix, by specification, has no native (or preferred) 8-bit encoding,
just like windows. There really isn't much of a difference in the 8-bit
department, except that unix interprets your data much less than windows
(for example, filesystem interaction neither checks nor cares about character
encodings).
> I can neither agree nor disagree with your statement here, without
> knowing more facts about the unix you are referring to.
There is only one, really, because all work the same.
> >>Retrofitting Perl on Windows to assume 8-bit data is ANSI will break all
> >>code that attempts to work with the constraints of 1 and 2.
> >
> >This would probably be true if 1) and 2) were real, but they are not.
>
> They are real; they are just stated in different terms than you prefer
> to use.
Sorry, but that's bullshit. 1) for example claims this was implemented for the
sake of platforms that don't exist, which is not a sensible argument. This
has nothing to do with terminology.
It also has nothing to do with my person.
> >>have prevented, by example of a widely-used platform, the assumption
> >>throughout lots of Perl code, that all 8-bit data is assumed to be
> >>Latin1 implicitly.
> >
> >Perl doesn't do that anywhere on any platform, to my knowledge. Make an
> >example of a platform that expects filenames as latin1.
>
> Every time Perl alters the internal UTF8 flag, and correspondingly the
> representation of the string data, it makes the assumption that there is
> no numeric difference between the octet encoding and the multi-bytes
> encoding.
Exactly. It makes no assumption about the character encoding itself, because
the function is encoding-agnostic. It doesn't interpret your data as latin1.
> The only character sets for which this is true is Latin1 and
> Unicode, AFAIK
It is true for other encodings as well, such as ascii.
In fact, here is a good example why forcing an encoding interpretation on
upgrading/downgrading is wrong: the assumption of no numeric difference
between upgrading and downgrading is true for *any* 8-bit encoding
and it is also true for *any* codeset, simply because the numbers do not
change.
If you have koi8-r data (which is not compatible with latin1), then
upgrading and downgrading will not alter the fact that it is koi8-r data
(in current perls, and outside e.g. the buggy win32 module which enforces
different interpretation and breaks if strings get upgraded).
This is why your enforcing of such an interpretation is scary, because many
people still handle such data, and they need the safety that perl doesn't
tinker with their characters silently.
The transformation as it is was not chosen because of your two reasons. It
was chosen because it doesn't alter the string on the perl level. If you take
a string and upgrade it and dissect it, it will contain the same codepoints.
Any other transformation (like the one proposed by jan) doesn't have this
property, and since it isn't documented when perl does these upgrades and
downgrades, this is exactly why that proposal is broken by design.
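
As an illustration of a silent upgrade that leaves the codepoints alone, here
is a minimal sketch. The octets and the U+263A character are arbitrary
examples, not taken from any real program. Concatenating with a string that
needs the UTF-8 form upgrades the octet data behind the scenes, and the data
can still be recovered unchanged afterwards.

```perl
use strict;
use warnings;

my $octets = "\xC1\xC2\xB1";       # e.g. koi8-r text read from somewhere
my $wide   = "\x{263A}";           # a code point above 255

# The concatenation stores everything in the internal UTF-8 form.
my $joined = $octets . $wide;
my $head   = substr($joined, 0, length($octets));

my $same_chars = $head eq $octets ? 1 : 0;

# With a true second argument, downgrade returns false instead of dying.
my $downgraded = utf8::downgrade($head, 1) ? 1 : 0;
my $same_bytes = $head eq $octets ? 1 : 0;

print "same characters after silent upgrade: $same_chars\n";
print "downgrade succeeded:                  $downgraded\n";
print "same octets after downgrade:          $same_bytes\n";
```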
If it was accompanied by making upgrades and downgrades *explicit*, i.e. perl
would die when you concatenated an upgraded and a downgraded string, or would
never silently upgrade/downgrade on its own and you always have to force it
manually, then this model would become workable, at the expense of making
perl strings type-ful, as there will be two incompatible string types.
Of course, this wouldn't be very perl.
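
For illustration only, the "die on mixing" variant could be sketched like
this. strict_concat is a hypothetical helper, not something perl provides,
and it shows why such a model makes strings type-ful: two strings with the
same characters can no longer be combined freely.

```perl
use strict;
use warnings;

# Hypothetical helper: concatenate only strings that share the same internal
# representation, and die otherwise.
sub strict_concat {
    my ($x, $y) = @_;
    die "refusing to mix upgraded and downgraded strings\n"
        if (utf8::is_utf8($x) xor utf8::is_utf8($y));
    return $x . $y;
}

my $down = "abc\xe9";
my $up   = "\xe9";
utf8::upgrade($up);

my $ok  = eval { strict_concat($down, $down) };
my $err = eval { strict_concat($down, $up); 1 } ? "" : $@;

print "same representation: ", (defined $ok ? "ok" : "died"), "\n";
print "mixed representation: ", ($err ? "died" : "ok"), "\n";
```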
> >(you can select this under unix, yes, but you can do so under windows as
> >well).
>
> So there you have answered your own question about platforms.
Yes, all humans are chinese because all chinese are humans. I said as a
special case, you can make it true, but in general it is not.
> issue arises because Perl for Windows does not require Windows to be
> configured to use Latin1 as the default code page;
And neither does it under unix. So your argument is wrong again, because the
issue does not arise because of "anything windows" at all.
> neither does it
> convert to or from Latin1 (or anything else) when calling Windows APIs;
And neither does it on unix.
> but it does assume numerical equality when converting between octet and
> multibytes strings, and that is only valid for Latin1 and Unicode.
No, numeric equality is true for every encoding in the world that uses
codepoints <256 - think about it. For example, the number 177 means the same
koi-8 character, regardless of whether it was upgraded or not.
This is the property that is useful. Having latin1 is not that useful, and
it is not implemented in perl anyway (see regexes, for example).
> Hence, it assumes Latin1 during that conversion.
Wrong. It doesn't do so. The conversion used was chosen because it doesn't
change codepoints - character 177 stays character 177, not because latin1 is
particularly important for any specific platform.
> you read it... but I would be interested, if, setting aside the
> disagreements you stated above, if you think a scheme such as I outlined
> could be a helpful solution for Perl, using your mental model of
> strings, implicit internal format conversions, and such, which I think
> is reasonably accurate, even if it doesn't use the same terminology that
> most people on this forum use.
Well, to me, this is a mailinglist, but maybe my terminology is wrong
there, too. I do think I use the same terminology as everybody else here,
I am just being more exact in what I say, because if you are sloppy, you
fail to communicate the important differences and fall into traps like you
did above, because you couldn't escape the "character encoding" mental
model.
As for your points...
As I outlined, the problem is not so much backwards compatibility - perl 5.6
is totally different from 5.8, and 5.8 differs from 5.10 in many such
encoding issues.
The problem is mainly bugs, so while valid, I don't see how one could keep
compatibility, because the question is what to keep compatibility to -
5.8, 5.6, 5.005, 5.10? choose one, all are different.
I am also not sure whether programs rely on the broken semantics - my
experience is that e.g. reading a filename (%APPDATA%) from an environment
variable and trying to access files that way doesn't work when ansi and
unicode disagree on encoding (which is the case even on my latin1
system, btw.)
But then, perl on windows is differently broken depending on which perl
you use - activestate has a really broken fork for example, and handles
filenames differently than other perls on windows.
I am not sure how many people really rely on that behaviour, and I am not
sure if this couldn't be just fixed by enforcing a single encoding.
But my experience is limited - I know the windows APIs and the problems
associated with not having a single format in which to store filenames.
On unix, this is positively better, as there is only ever a single format
to store filenames in that works regardless of locale (the problems start
when you interpret these filenames).
So I will only comment on E and F.
I think the pragma already exists, namely "use locale".
If I "use locale" in my program, I would expect perl to apply the current
locale to any strings, in regexes or elsewhere (to the extent possible).
If I don't "use locale", then I would expect regexes to interpret my strings
as unicode, regardless of the utf-8 flag, which I can't see in my source.
(the "surprising" behaviour).
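
A sketch of what "interpret my strings as unicode, regardless of the utf-8
flag" looks like in practice. This assumes a perl newer than the ones
discussed in this thread (5.14 or later), where the /u regex modifier is
available; it is not something the perls mentioned here offer.

```perl
use strict;
use warnings;

my $down = "\xe9";        # code point 233, octet form
my $up   = "\xe9";
utf8::upgrade($up);       # code point 233, internal UTF-8 form

# /u asks for Unicode rules whatever the internal representation is.
my $down_u = $down =~ /\w/u ? 1 : 0;
my $up_u   = $up   =~ /\w/u ? 1 : 0;

print "downgraded matches \\w under /u: $down_u\n";
print "upgraded matches \\w under /u:   $up_u\n";
```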
Regarding filenames, this is very easy on unix: all filenames are
interpreted as octet strings, with no specific encoding (perl cannot know the
encoding of filenames on unix), so the functions all have to downgrade,
and if that fails, we have a bug (filenames are not locale-dependent
on unix, they are simply octet strings where only "/" and \000 are
interpreted).
(If it does not fail, it might still be a bug, but we cannot detect this.)
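
The "downgrade, and treat failure as a bug" rule can be sketched as follows.
filename_octets is a hypothetical helper used only for illustration, and the
filenames are made up; the only real API used is utf8::downgrade with its
optional second argument.

```perl
use strict;
use warnings;

# Hypothetical helper: return the octet form of a filename, or die if the
# string holds code points above 255 and so has no octet form.
sub filename_octets {
    my ($name) = @_;
    my $copy = $name;
    utf8::downgrade($copy, 1)
        or die "filename has characters above 255; encode it explicitly\n";
    return $copy;
}

my $name = "caf\xe9.txt";
utf8::upgrade($name);                  # simulate a silent upgrade
my $octets = filename_octets($name);
my $flag   = utf8::is_utf8($octets) ? "on" : "off";
print length($octets), " octets, utf-8 flag $flag\n";

my $err = eval { filename_octets("smiley-\x{263A}"); 1 } ? "" : $@;
print "wide name rejected: ", ($err ? "yes" : "no"), "\n";
```

The caller's own scalar is left untouched because the helper downgrades a
copy.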
I know "use locale" has weird side effects, but it basically boils down to
what perluniintro calls "native 8-bit encoding" (fortunately, it is not
even limited to 8-bit).
Even if there were a need for a new pragma, I wouldn't call it
"compatibility", because both behaviours are useful. The difference is
that I can control which interpretation is applied to my strings and do
not have to rely on an invisible flag on my scalars.
But then, "locale" maps exactly onto the concept of "native encoding",
because my unix process might run in a locale using koi8-r, and then I
would want a way to take advantage of the locale w.r.t. interpreting my
koi8-r data. (Do not get confused by the mention of POSIX in the locale
manpage; locales are an ISO-C thing and ought to exist on windows as
well.)
So for me, this is not a compatibility issue - right now, I don't think
anybody relies on the utf-8 flag behaviour in perl (a great deal has
changed between 5.6 and 5.8, and less has changed between 5.8 and 5.10, so
those programs need fixing already).
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / pcg@goof.com
-=====/_/_//_/\_,_/ /_/\_\