
Re: on the almost impossibility to write correct XS modules

Ben Morrow
May 20, 2008 18:33

Quoth "Rafael Garcia-Suarez":
> 2008/5/20 Ben Morrow:
> >> But sometimes we want perl to magically switch between Unicode and
> >> non-Unicode semantics depending on the data it's handling. Does that
> >> mean that we need to add a new kind of data to perl, "Unicode SV" ?
> >> Will that solve problems ? What problems will this create ?
> >
> > This seems sane to me. While we're there we can make the new type (QV?
> > UPV?) a wchar_t* instead of a utf8-encoded char*. That way we get
> > autoconversion as needed, with cacheing. We also get the ability to
> > declare 'all my 8-bit strings are in $encoding' rather than being fixed
> > to ISO8859-1. (This must be a *different* option from the one that says
> > 'my source code is in $encoding', though they could default to the same
> > thing. I don't know what to do about literal strings: reencode them,
> > probably.)
> You're mixing Unicode and encodings, there.

If you're converting a string from bytes to Unicode or vice versa (as
part of sv_upgrade) you are doing so according to some encoding.
Currently Perl only allows that encoding to be ISO8859-1, for good
reasons; it seemed to me that your proposal allowed that to change, but
I think I may have misunderstood you, given that below you said...

> And, assuming we add a new flag on SV (let's call it UPOK like you did
> below) for Unicode strings, and that a new quotelike operator qu// is
> added to create them, have "uc qu/ß/" return an UPOK SV containing "SS"
> in the PV slot. That PV slot could be SvUTF8 or not, that should not
> matter and should not be visible from perl. ("SS" is perfectly
> representable in pure ASCII so SvUTF8 isn't needed there.)
> On the other hand C<use unicode; uc qq/ß/> would return "SS" without
> the UPOK flag set.

...which isn't what I thought you meant at all.

OK, new proposal; this is how I have thought Unicode *ought* to work in
Perl for some time. It's entirely possible there's some serious flaw
with it, of course... I was (assuming you were) intending UPOK to
represent a new entry in the SV, so struct xpv would become something
like
    struct xpv {
        char *      xpv_pv;     /* byte string */
        STRLEN      xpv_cur;
        STRLEN      xpv_len;
        wchar_t *   xpv_upv;    /* Unicode string */
        STRLEN      xpv_ucur;
        STRLEN      xpv_ulen;
    };

(or perhaps xpv would stay as-is, and we'd have an xpvupv like that).
POK says the PV slot is valid, UPOK says the UPV slot is valid, and you
can have both valid at once so you don't have to keep converting a given
string between bytes and characters. You have to keep track of which
representation is canonical, of course, exactly as with string<->number
conversions.
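
A minimal sketch of the two-slot idea, in plain C rather than real perl
internals. The names dual_str, DS_POK, DS_UPOK, dual_from_bytes and
dual_upv are made up for illustration and do not exist in sv.h; the
byte-to-character step assumes ISO8859-1, as perl does now.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical flags: which slots currently hold valid data. */
enum { DS_POK = 1, DS_UPOK = 2 };

/* Hypothetical stand-in for the extended struct xpv. */
typedef struct {
    unsigned  flags;
    char     *pv;    /* byte string, valid if DS_POK */
    size_t    cur;
    wchar_t  *upv;   /* Unicode string, valid if DS_UPOK */
    size_t    ucur;
} dual_str;

/* Create a byte-only string. Returns 0 on success, -1 if malloc fails. */
static int dual_from_bytes(dual_str *s, const char *bytes, size_t len)
{
    s->pv = malloc(len + 1);
    if (!s->pv)
        return -1;
    memcpy(s->pv, bytes, len);
    s->pv[len] = '\0';
    s->cur = len;
    s->upv = NULL;
    s->ucur = 0;
    s->flags = DS_POK;
    return 0;
}

/* Return the Unicode slot, filling it from the byte slot on first use.
 * Assumption: bytes are ISO8859-1, so each byte maps to the code point
 * with the same numeric value. */
static const wchar_t *dual_upv(dual_str *s)
{
    assert(s->flags & (DS_POK | DS_UPOK));
    if (!(s->flags & DS_UPOK)) {
        s->upv = malloc((s->cur + 1) * sizeof *s->upv);
        if (!s->upv)
            return NULL;
        for (size_t i = 0; i < s->cur; i++)
            s->upv[i] = (wchar_t)(unsigned char)s->pv[i];
        s->upv[s->cur] = L'\0';
        s->ucur = s->cur;
        s->flags |= DS_UPOK;   /* both slots valid: the conversion is cached */
    }
    return s->upv;
}

/* Release both slots. */
static void dual_free(dual_str *s)
{
    free(s->pv);
    free(s->upv);
    s->flags = 0;
}
```

Calling dual_upv twice on the same string returns the same pointer, so
the conversion is done once and then reused. The sketch does not track
which slot is canonical; a real version would have to invalidate the
other slot on every write.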

Then you have two forms of (say) 'eq', each of which sv_upgrades its
arguments to the appropriate type; except that (for compatibility) if
you specify neither 'bytes' nor 'unicode' it guesses which you wanted. A
new quote-like qu// would be useful, but it would just be a shortcut for
do { use unicode; qq// }; similarly, a qb// would be useful to get
binary strings when under 'unicode'.

> Here's my position :
> - to deal with encodings, use Encode.
> - no encoding-aware strings in core perl. (of course, you can still
>   use magic, ties, etc. to add behaviour)

I agree here: while strings that knew what encoding they started out as
sound like a cool idea, I suspect it would quickly become unmanageable.

> - the "Unicodeness" of a string would be independent of its SvUTF8 flag.
>   It will just indicate that <some list of perl built-ins> must apply
>   Unicode semantics when dealing with it.

This feels wrong, to me. Perl has always had polymorphic values and
monomorphic operators; allowing the string to choose which version of
the operator it gets seems like going the other way. In an ideal world,
I would advocate a new set of operators: ueq, ult, u., usubstr, and so
on; since this is obviously impractical, a pragma to choose which 'eq'
you want seems like the way to go.

> - the "unicode" pragma (or whatever name is chosen) will be needed to
>   say that <same list of perl built-ins> in its scope must apply
>   Unicode semantics to Perl strings. (as opposed to newfangled Unicode
>   strings)

Converting said strings to Unicode how? ISO8859-1, as perl does now?

> Currently we have :
>     $ bleadperl -wle 'print uc "ß"'
>     ß
>     $ bleadperl -wle 'use utf8; print uc "ß"'
>     SS
> That's wrong: the pragma utf8 indicates internal encoding, but modifies
> semantics. What I've in mind is : make those two one-liners output an ß.
> Under the "unicode" pragma, make them both output SS.

Yes, the conflation of 'source-file encoding' with 'operator semantics'
was clearly a mistake.

> >    - by default, if any argument of a string operation is UPOK then all
> >      of them are upgraded to UPOK and the operation occurs on SvUPV; if
> >      all are !UPOK then they are all upgraded to POK and the operation
> >      occurs on SvPV. (This assumes all numbers can be represented in
> >      the current character set :).)
> With my proposed outlined implementation, that's upgraded as per
> sv_upgrade.

Yes, that was what I meant.
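
The default rule quoted above could be sketched roughly as follows. This
is an illustration only: ARG_POK, ARG_UPOK, enum repr and choose_repr are
hypothetical names, not anything in the perl source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-operand flags mirroring the POK/UPOK idea. */
enum { ARG_POK = 1, ARG_UPOK = 2 };

/* Which slot a string operation should work on. */
enum repr { REPR_BYTES, REPR_UNICODE };

/* If any operand is UPOK, everything is upgraded and the operation runs
 * on the Unicode slot; if none is, it runs on the byte slot. */
static enum repr choose_repr(const unsigned *flags, size_t n)
{
    assert(flags != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (flags[i] & ARG_UPOK)
            return REPR_UNICODE;
    return REPR_BYTES;
}
```

Two byte strings give REPR_BYTES; a byte string and a Unicode string give
REPR_UNICODE. If both slots can be valid at once, the test would
presumably need to look at the canonical representation rather than the
raw flag.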

> Do we really want this upgrade to be done transparently ? Like, in
> concatenating an SV and a USV ? Remember why we needed
> encoding::warnings ? Because we can't know what encoding a
> Perl string is in.
> We could do it the hard way (also known as the python way) : forbid
> any mix between Unicode strings and Perl strings. Force people to
> write C<$foo = qu/$foo/> to get Unicode strings. (*regardless* of
> any encoding issue or PerlIO layer or locale or pragma.) Make qq/$foo/
> warn if $foo is a Unicode string (being thus downgraded to a Perl
> string).

I would want to forbid $foo = qu/$foo/ as well, assuming $foo was a byte
string to start with. If we do things this way, the only way to convert
between the two types of string should be with Encode.

<snip my stuff>
> You're confusing Unicode and encoding again.

Any conversion between a Unicode string and a string of bytes involves
an encoding, no? You seem to be saying the two are not related: am I
completely misunderstanding something, or are you simply stating 'Perl's
byte<->character conversions will always use ISO8859-1, as it's a subset
of Unicode[0], and if you want anything else use Encode' as a decision
you've made?

[0] Yes, yes, all character sets are subsets of Unicode... I mean
Unicode-the-numbered-list rather than -the-unordered-set. AFAIK there
isn't a separate name for it.
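
To make the point concrete: the same code point comes out as different
bytes depending on which encoding is chosen, and some encodings cannot
represent it at all. A small sketch, where encode_latin1 and encode_utf8
are hypothetical helper names (not perl or Encode API):

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point as ISO8859-1.
 * Returns 1 on success, 0 if the code point is not representable. */
static size_t encode_latin1(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp > 0xFF)
        return 0;
    out[0] = (unsigned char)cp;   /* code point and byte value coincide */
    return 1;
}

/* Encode one code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 for surrogates and
 * values above U+10FFFF. */
static size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+00DF (ß) encode_latin1 writes the single byte 0xDF while
encode_utf8 writes 0xC3 0x9F. For U+0102 encode_latin1 returns 0, since
ISO8859-1 has no such character.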

> chr moduloes its argument under bytes, and I'd like to keep that:
>     $ bleadperl -Mbytes -le 'print ord chr 258'
>     2


> To my understanding "use bytes" means "don't look at the SvUTF8 flag".
> See in utf8.h :
>     #define IN_BYTES (CopHINTS_get(PL_curcop) & HINT_BYTES)
>     #define DO_UTF8(sv) (SvUTF8(sv) && !IN_BYTES)

Well, that's what it means now. But that's *truly* evil: the user should
never be able to see the raw bytes perl happens to use to store
characters. That's like letting you see the bytes that make up a float.
Attempting to apply byte semantics to a Unicode string should either
auto-convert it or fail.
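
As a sketch of what 'auto-convert or fail' might look like, assuming
ISO8859-1 is the byte encoding: the conversion succeeds only if every
character fits in one byte, and it never exposes the internal storage.
downgrade_latin1 is a hypothetical name, not a perl function.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Convert n Unicode characters to ISO8859-1 bytes.
 * Returns 0 on success. Returns -1, leaving out untouched, if the buffer
 * is too small or any character is above U+00FF. */
static int downgrade_latin1(const wchar_t *in, size_t n,
                            unsigned char *out, size_t outsz)
{
    assert(in != NULL && out != NULL);
    if (n > outsz)
        return -1;

    /* First pass: refuse the whole string if anything does not fit. */
    for (size_t i = 0; i < n; i++)
        if ((unsigned long)in[i] > 0xFF)
            return -1;

    /* Second pass: the conversion is now known to be lossless. */
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)in[i];
    return 0;
}
```

A string holding U+00DF and 'a' converts to the bytes 0xDF 0x61; a string
holding U+0102 is rejected rather than being silently truncated to 2.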

> >    - 'use locale' can probably be made to work again, if it is only
> >      applied to SvPV and never to SvUPV. 'use locale' should probably
> >      imply 'use bytes', and set the current encoding.
> I'd be happy to set locale to rest. In peace.

OK. It's kinda handy to do things like sort correctly, but that kind of
thing is arguably better handled by a module (which can read the
standard locale database if it wants, of course).


           All persons, living or dead, are entirely coincidental.
                                                  Kurt Vonnegut