Hi!
I recently found out that it is almost impossible to write XS modules that
deal with Unicode correctly, and here is why:
First, the long-known issue:
In XS parameters, the type "char *" is utterly useless, as you have no clue
about the encoding of the characters. This even breaks backward compatibility
with existing XS modules, which do not expect character values > 255.
A lot of modules on CPAN were broken by this incompatible change in
5.6 or so.
Now, how about fixing it?
Some modules started to use different typemap entries to work around this
issue, for example:

    void LOG (utf8_string msg)

    T_OCTETS
            $var = SvPVbyte_nolen ($arg)
    T_UTF8          // == utf8_string
            $var = SvPVutf8_nolen ($arg)

Unfortunately, unlike other similar functions (SvIV, SvPV etc.), this
easily destroys the scalar value:

    LOG ("see this object:");
    LOG ($obj);
    # $obj no longer an object here, it became a string

So unlike other accessor functions such as SvPV, SvPVutf8 changes the
contents of the SV in a very visible way (while SvIV doesn't destroy the
string, for example).
I can understand why it does so, but the problem is that there is simply no
good way to deal with UTF-8 in XS, as the API is extremely hostile at the moment.
To get it right, I think one has to do something like this (this can be
optimised of course, but that makes it even more complicated):

    T_UTF8
            $var = SvPVutf8_nolen (sv_mortalcopy ($arg))

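For concreteness, a complete typemap file along these lines might look like the sketch below. The type name octet_string is made up for illustration (only utf8_string appears above), and both names would need a matching "typedef char *..." in the C part of the .xs file. Treat it as a rough sketch under those assumptions, not something tested against a particular perl version.

```
# hypothetical typemap sketch; octet_string is an illustrative name
TYPEMAP
octet_string    T_OCTETS
utf8_string     T_UTF8

INPUT
T_OCTETS
        $var = SvPVbyte_nolen ($arg)
T_UTF8
        $var = SvPVutf8_nolen (sv_mortalcopy ($arg))
```

With that in place, "void LOG (utf8_string msg)" would receive UTF-8 bytes taken from a mortal copy, so the caller's $obj is left alone, at the cost of copying the scalar on every call.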
I think the situation with Unicode and CPAN modules cannot improve
as long as it is so difficult to do something as simple as getting at the
string data in a non-random/godgiven encoding.
Also, even though we are at 5.10 now, it should be *seriously* considered to
replace the almost completely useless char * typemap entry with something
that gives you octets (preferably non-destructively). Or could somebody explain
to me when "char *" does something useful in current perl versions without
having to re-examine ST(x) manually...
Just my 0.02€.
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / pcg@goof.com
-=====/_/_//_/\_,_/ /_/\_\