
Re: on the almost impossibility to write correct XS modules

Marc Lehmann
May 19, 2008 14:17
On Mon, May 19, 2008 at 06:37:11PM +0200, Tels <> wrote:
> > This is the correct way to approach unicode because it frees the
> > programmer from tracking both external and internal encodings.
> Uhm, excuse me? I don't think this actually frees the programmer from 
> tracking internal encodings and especially not tracking of external 
> encodings.

I didn't claim that. I said it frees him from tracking *both*.

> Perl's "one-encoding-for-all" approach has the real world problem that 
> you cannot easily mix strings without being very very very very 
> careful, or you get garbage. Automatically and without warning.

Yes, and these bugs need to be fixed. If I have a string, then its
interpretation must not silently change just because I did some operation
that forced perl to upgrade it internally, without documentation of where
exactly this happens or what I can do about it.

Or to make an example, if I have a string that contains a single character
with codepoint 200, then I do not want it to change in any way (on the
perl level), regardless of any upgrades or downgrades that perl *silently*
performs.

Wherever and whenever this happens, this is a bug.

One could fix this bug by putting the burden on the programmer, by
documenting all functions that cause such silent encoding changes (that
includes xs module documentation).

Another way is to apply the same interpretation everywhere: a string is
simply a concatenation of codepoints, and the (external) encoding is
supplied by the programmer.

> And most of the problems when you want to work with Unicode (even if you 
> _only_ want to use UTF-8, not even throwing UTF-16 into the mix)

One of the major problems in this discussion is that people confuse
"unicode" with an encoding such as "utf-8" or "latin-1": I can encode
(some) unicode in latin1, and that doesn't make it less unicode, and I can
encode all unicode in utf-8 or utf-16, but the resulting bytes are not
themselves valid unicode codepoints.

For perl utf-8 and utf-16 are just byte strings. Yes, when you mix them
together, you might shoot yourself in the foot (or not, there are valid
reasons to do so), but the programmer is responsible for it.

Now throw some of the "transparent ansi encoding" into it, and you will
find that perl sometimes re-interprets your utf-8 bytes as, say, koi8-r
while upgrading it. In extreme cases this transformation might not even be
reversible.

A programmer must not have to track this; this is insane. A programmer
has his hands full tracking his own string encodings; he must not be
bothered to also track what perl did or did not do to his strings when
he isn't expecting it (because this is not documented).

> Or in other words, Perl "frees the programmer from tracking encodings" 
> by making him carefully track all strings as they come in and go out 

exactly. perl itself does _not_ attach an encoding to its strings, that's
what I have been saying all along.

Attaching one (namely ANSI, depending on internal flags) is just plain
broken, as this would make it impossible to handle "the" non-ansi encoding
in a defined way.

> Not to mention that you actually lose the information what original 
> encoding the string had - "aa" looks the same in latin1 and utf-8, but 
> depending on which encoding it "has", acts differently. (at least that's 
> what I remember from regexp discussions)

exactly, that's why forcing the interpretation to ansi is simply wrong -
my binary data string isn't ansi-encoded, my utf-8 string from that file
isn't either, etc. etc.

> It would be _much_ easier if all strings in Perl carried their encoding 
> with them, and Perl would be able to simple mix two strings by 
> automatically upgrading them according to their encoding. Then you'd 
> also be able to query the encoding, btw. No more guesswork based upon a 
> single bit.

That would be interesting, but you would still have to mark all such data
accordingly, so the program still has to track changes.

It would also break perl w.r.t. earlier versions completely, as suddenly
you would need to tag your data according to whether it is a string or not.

Perl's current (mostly-implemented) unicode model is to treat strings as
concatenations of codepoints, and it is up to you to interpret them.

regexes are simply *buggy* when they use different interpretations of my
string data depending on some earlier silent/transparent upgrade operation
that isn't reflected in my code.

> The current way (everything is either Latin-1 or UTF-8 and we only have 

This is simply not the current way. If it were, I couldn't handle euc-jp
or binary data in perl, but obviously, I can.

> a single bit to distinguish between these two cases) is just a pain, 
> especially if you need something else than utf-8.

It would be a pain, I totally agree, but this doesn't reflect reality.

> You have a UTF-8 regexp like the following:
> 	my $skip = qr/Quarantäne/i;

What is a "utf-8 regexp"? From the code, one cannot tell (is the source
encoded in 8-bit or utf-8, does it use utf-8 or not?)

> You read in data and manually decode it to utf-8 to match it against the 
> regexp:

You decode it *from* utf-8, not *to* utf-8 in perl.

> pre-parses the data. As a side-effect, the data now comes already 
> decoded in UTF-8 format.

You completely get it backwards, the data starts as utf-8 encoded on the
perl level and stops being encoded as utf-8 after the decode. It simply
isn't utf-8 encoded anymore.

> The second decode() then destroys the data, 

Actually, it might also croak, because the input you are decoding as utf-8
must not have any characters >255.

> because Perl does not know that the data was already in UTF-8 and 
> encodes it twice. 

But the data isn't already in utf-8.

> Oops, new bug.

No, just a gross misunderstanding on your part. And one cannot blame you -
so many people get it wrong.

> And this bug could have been prevented entirely if the string was
> properly tagged with its encoding, and thus a double encoding would have
> been never possible.

Indeed, at the cost of losing backwards compatibility to all earlier
versions and all XS modules.

It would also force programmers to declare everything, not a very perlish
approach.
But it would solve that issue.

> So while the current situation is "working" somehow, please do not 
> describe it as "ideal" :)

I didn't describe the current situation as ideal at all, please read my
postings and you will see that the opposite is the case. If the situation
were ideal I wouldn't ask for a lot of changes and wouldn't point out the
problems we have.

If the model as originally planned and mostly implemented in 5.005_x
(_after_ the clearly bad camel model with no flags at all) were completely
implemented, then it would be easy to explain the perl unicode model to
anybody:
1) strings are basically lists of characters
2) a character is an integer in the range 0..2**63 or wherever
   perl's functions officially stop (no mention of utf-8).
3) if you use a function or construct that deals with character
   encodings, then that function defines what happens. examples:

   open ..., $filename        character values must be in ANSI (windows)
                              unicode (possible alternative on windows),
                              or whatever encoding your filesystem/env expects
                              (unix), just as in any other language.
   regex match                perl's regexes interpret your characters as
                              unicode (alternative: unicode or, with use
                              locale, locale-specific encoding).
   JSON::XS::decode_json $s   $s must be in utf-8
   print                      whatever either the file expects (raw)
                              or unicode (when you set an encoding).
   $a = $b . $c               it just concatenates characters, no encoding required
   $a = substr $b, ...        it just gives you a substr, no encoding required

this is how it *mostly* works nowadays. while not perfect, it gives you
a very simple string model (basically, in perl 5.005 you could have
characters 0..255, in 5.6 and higher the range was simply extended).

You can explain this to anybody in just a few minutes. Sure, he will need to
find out what open wants on his platform, and how to convert, but it is far
simpler than all the crap perluniintro throws at the unsuspecting user, *and*
it would completely get rid of that mysterious utf-8 flag:

      The principle is that Perl tries to keep its data as eight-bit
      bytes for as long as possible, but as soon as Unicodeness cannot be
      avoided, the data is transparently upgraded to Unicode.

   "how the hell should I know when it becomes necessary"?

      Specifically, if all code points in the string are 0xFF or less,
      Perl uses the native eight-bit character set.  Otherwise, it uses
      UTF-8.

   "how about other encodings, or binary data?"
   "so if I shave off one character of a string it might suddenly change its
   encoding?"

       [This] produces a fairly useless mixture of native bytes and UTF-8,
       as well as a warning:

       perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

   "how do I mix latin1 characters such as ä in Quarantäne with unicode
   characters outside the latin1 range when it is useless?"
   "is this per character per string?"
   "I don't get this warning?" (%ENV!)

etc. etc.

Note that perluniintro has lots of stuff like the "as long as possible" that
is completely incomprehensible to most perl programmers, yet still it claims
that it depends on such internal magic when it goes to interpret your string.

This is a totally broken design that cannot be explained to *anybody* because
it isn't logical at all.

So, simply extending strings to higher character ranges while still
asking the user to keep track of the encoding he wants (which is limited
to input and output places) is *far* better than forcing "one 8-bit
encoding on everybody", or forcing the user to *also* keep track of what
"as long as possible" means in perl, or of when the perl interpreter
silently changes his data from ansi to unicode, with a corresponding
change in interpretation.

                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_    
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /
      -=====/_/_//_/\_,_/ /_/\_\