perl.perl5.porters | Postings from May 2008

Re: on the almost impossibility to write correct XS modules

From: Tels
Date: May 19, 2008 09:37
Subject: Re: on the almost impossibility to write correct XS modules
Message ID: 200805191837.19067@bloodgate.com

Moin,

On Monday 19 May 2008 17:26:55 Marc Lehmann wrote:
> On Sat, May 17, 2008 at 10:50:12AM -0700, Jan Dubois <jand@activestate.com> wrote:
[snip]

> > The brokenness right now is that when Perl automatically upgrades
> > this data to UTF8, it assumes that the data is Latin1 instead of
> > ANSI,
>
> Uhm, no, you are totally confused about how character handling is
> done in perl, and I cannot blame you (the many bugs and documentation
> mistakes combined make it hard to see what is meant).
>
> Strings in perl are simply concatenated characters, which in turn are
> represented by numbers.
>
> Perl doesn't store an encoding together with strings, only the
> programmer knows the encoding of strings.
>
> This is the correct way to approach unicode because it frees the
> programmer from tracking both external and internal encodings.

Uhm, excuse me? I don't think this actually frees the programmer from
tracking internal encodings, and especially not from tracking external
encodings.

Perl's "one-encoding-for-all" approach has the real world problem that 
you cannot easily mix strings without being very very very very 
careful, or you get garbage. Automatically and without warning.
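
A minimal sketch of that kind of silent garbage, assuming the core
Encode module. The sample words and variable names are made up for
illustration. An undecoded UTF-8 byte string is concatenated with a
decoded character string, and Perl upgrades the bytes as if they were
Latin-1:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical inputs: one string was decoded on the way in, one was not.
my $chars = decode('UTF-8', "Stra\xc3\x9fe");   # character string
my $bytes = "Quarant\xc3\xa4ne";                # still raw UTF-8 octets

# No warning here: each octet of $bytes is taken as one Latin-1 character.
my $mixed = "$chars $bytes";

# Encoding on the way out now encodes the two octets of the umlaut again.
my $out = encode('UTF-8', $mixed);

# Show the octets in hex so the doubled sequence is visible.
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $out;
print "$hex\n";
```

In the hex dump the umlaut should show up as c3 83 c2 a4 instead of
c3 a4, while the properly decoded sharp s survives as c3 9f.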

Most of the problems when you want to work with Unicode (even if you
_only_ want to use UTF-8, not even throwing UTF-16 into the mix) come
from the fact that it is very, very easy to have data that is encoded
in neither UTF-8 nor latin1. You mix it with UTF-8 (or encode it twice,
or whatever) and you end up with garbage. Which is usually bad, as this
very discussion about ANSI shows :)

Or in other words, Perl "frees the programmer from tracking encodings"
by making him carefully track all strings as they come in and go out,
and then track which strings internally are in which encoding, and even
then sometimes you mix fire with water unintentionally. I don't think
that is ideal, as the many, many bugs I have found in my own
(supposedly working, bug-free) UTF-8-using Perl code show.

Not to mention that you actually lose the information about which
encoding the string originally had - "aa" looks the same in latin1 and
utf-8, but depending on which encoding it "has", it acts differently
(at least that's what I remember from the regexp discussions).
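
A small sketch of the regexp behaviour recalled above, using a
non-ASCII character, since that is where the effect is easy to see.
It assumes a Perl where the unicode_strings feature is not in effect
(made explicit below); the variable names are illustrative only. The
two strings compare equal, yet differ in the internal flag and in how
\w treats them:

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # assumption: keep the historical default rules

my $native   = "\xe4";          # U+00E4, stored as a single byte
my $upgraded = "\xe4";
utf8::upgrade($upgraded);       # same character, now stored as UTF-8 internally

my $same        = ($native eq $upgraded)    ? 1 : 0;
my $flag_native = utf8::is_utf8($native)    ? 1 : 0;
my $flag_upgr   = utf8::is_utf8($upgraded)  ? 1 : 0;
my $word_native = ($native   =~ /\w/)       ? 1 : 0;
my $word_upgr   = ($upgraded =~ /\w/)       ? 1 : 0;

print "equal: $same\n";
print "flag:  native=$flag_native upgraded=$flag_upgr\n";
print "\\w:    native=$word_native upgraded=$word_upgr\n";
```

The flag only says how the string is stored, not which encoding the
data came from, which is the guesswork complained about here.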

It would be _much_ easier if all strings in Perl carried their encoding
with them, and Perl were able to simply mix two strings by
automatically upgrading them according to their encoding. Then you'd
also be able to query the encoding, btw. No more guesswork based upon a
single bit.

The current way (everything is either Latin-1 or UTF-8 and we only have
a single bit to distinguish between these two cases) is just a pain,
especially if you need something other than UTF-8.

Here is an example of what bit me today, just in case people think this
is a theoretical discussion:

You have a UTF-8 regexp like the following:

	my $skip = qr/Quarantäne/i;

You read in data and manually decode it from UTF-8 to match it against
the regexp:

	my $data = decode('utf-8',from_file());

	# much later in the file
	if ($data =~ $skip) { ... do something ... };

Now, some time later (maybe much later), someone (maybe a different
person) replaces the hand-rolled from_file() routine with something
that pre-parses the data. As a side effect, the data now arrives
already decoded from UTF-8. The second decode() then destroys the data,
because Perl does not know that the data was already decoded and
decodes it a second time.

Oops, new bug. And this bug could have been prevented entirely if the
string had been properly tagged with its encoding, so that a double
decoding would never have been possible.
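
The scenario can be reproduced in a few lines. This is only a sketch:
from_file_old() and from_file_new() are hypothetical stand-ins for the
routine before and after the change, and the regexp is written with an
escape so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $skip = qr/Quarant\x{e4}ne/i;

# Hypothetical stand-ins for the old and new versions of from_file().
sub from_file_old { return "Quarant\xc3\xa4ne" }                   # raw octets
sub from_file_new { return decode('UTF-8', "Quarant\xc3\xa4ne") }  # already decoded

# The original, correct case: decode exactly once.
my $ok = decode('UTF-8', from_file_old());

# The caller still decodes. eval, because decode() may die on some input
# (for example when the string contains characters above 0xFF).
my $broken = eval { decode('UTF-8', from_file_new()) };

printf "old: %s\n", ($ok =~ $skip ? 'match' : 'no match');
printf "new: %s\n",
    (defined $broken && $broken =~ $skip ? 'match' : 'no match');
```

Nothing in the second call tells decode() that its input is already a
character string, so the regexp quietly stops matching.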

So while the current situation is "working" somehow, please do not 
describe it as "ideal" :)

All the best,

Tels

-- 
 Signed on Mon May 19 18:11:35 2008 with key 0x93B84C15.
 Get one of my photo posters: http://bloodgate.com/posters
 PGP key on http://bloodgate.com/tels.asc or per email.

 "My glasses, my glasses. I cannot see without my glasses."
 - "My glasses, my glasses. I cannot be seen without my glasses."



