develooper Front page | perl.perl5.porters | Postings from May 2008

Re: on the almost impossibility to write correct XS modules

Rafael Garcia-Suarez
May 21, 2008 01:29
2008/5/21 Ben Morrow <>:
> OK, new proposal; this is how I have thought Unicode *ought* to work in
> Perl for some time. It's entirely possible there's some serious flaw
> with it, of course... I was (assuming you were) intending UPOK to
> represent a new entry in the SV, so struct xpv would become something
> like
>    struct xpv {
>        char *      xpv_pv;     /* byte string */
>        STRLEN      xpv_cur;
>        STRLEN      xpv_len;
>        wchar_t *   xpv_upv;    /* Unicode string */

(I'm under the impression that wchar_t is not portable and not suitable
for storing Unicode, since its size is implementation-defined. However,
we could use UTF-16 or UTF-32 here, if that's more convenient.)
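
To make the size concern concrete, here is a minimal C sketch. It is not
code from perl's source; the names codepoint_t, MAX_CODEPOINT,
wchar_holds_any_codepoint and utf16_encode are hypothetical and exist
only for this illustration. It shows that a 32-bit unit always holds one
code point, while a 16-bit unit (which is what wchar_t is on some
platforms) needs a surrogate pair for anything above U+FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width code point type: one UTF-32 code unit. */
typedef uint32_t codepoint_t;

#define MAX_CODEPOINT 0x10FFFFu

/* True when this platform's wchar_t can hold any code point in a
 * single unit. The answer differs between platforms, which is the
 * portability problem. */
static int wchar_holds_any_codepoint(void)
{
    return sizeof(wchar_t) >= 4;
}

/* Encode one code point as UTF-16 into out[0..1].
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * not a valid scalar value (a surrogate, or above U+10FFFF). */
static size_t utf16_encode(codepoint_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > MAX_CODEPOINT || (cp >= 0xD800u && cp <= 0xDFFFu))
        return 0;
    if (cp < 0x10000u) {
        out[0] = (uint16_t)cp;                     /* fits in one unit */
        return 1;
    }
    cp -= 0x10000u;
    out[0] = (uint16_t)(0xD800u | (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00u | (cp & 0x3FFu));  /* low surrogate */
    return 2;
}
```

With UTF-32 the "ucur" length would simply be the number of characters;
with UTF-16 it would be the number of 16-bit units, which is not the same
thing once surrogate pairs appear.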

>        STRLEN      xpv_ucur;
>        STRLEN      xpv_ulen;
>    };
> (or perhaps xpv would stay as-is, and we'd have an xpvupv like that).
> POK says the PV slot is valid, UPOK says the UPV slot is valid, and you
> can have both valid at once so you don't have to keep converting a given
> string between bytes and characters. You have to keep track of which
> representation is canonical, of course, exactly as with string<->number
> conversions.
> Then you have two forms of (say) 'eq', each of which sv_upgrades its
> arguments to the appropriate type; except that (for compatibility) if
> you specify neither 'bytes' nor 'unicode' it guesses which you wanted. A
> new quote-like qu// would be useful, but it would just be a shortcut for
> do { use unicode; qq// }; similarly, a qb// would be useful to get
> binary strings when under 'unicode'.
>> Here's my position :
>> - to deal with encodings, use Encode.
>> - no encoding-aware strings in core perl. (of course, you can still
>>   use magic, ties, etc. to add behaviour)
> I agree here: while strings that knew what encoding they started out as
> sound like a cool idea, I suspect it would quickly become unmanageable.
>> - the "Unicodeness" of a string would be independent of its SvUTF8 flag.
>>   It will just indicate that <some list of perl built-ins> must apply
>>   Unicode semantics when dealing with it.
> This feels wrong, to me. Perl has always had polymorphic values and
> monomorphic operators; allowing the string to choose which version of
> the operator it gets seems like going the other way. In an ideal world,
> I would advocate a new set of operators: ueq, ult, u., usubstr, and so
> on; since this is obviously impractical, a pragma to choose which 'eq'
> you want seems like the way to go.

I now tend to agree with this.
Moreover, I now tend to agree with Juerd: always apply Unicode
semantics by default, and use alternative ops (or a pragma) to get
latin-1 semantics. That makes the whole point of UPOK strings rather
moot now.

Some way to mark PVs as "binary" and not upgradeable to SvUTF8 would be
handy, though.
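
As a rough illustration of what such a "binary" mark could mean, here is
a toy C sketch. It does not use perl's SV structures or any real perl
API; struct toy_str, the TOY_UTF8 and TOY_BINARY flags and toy_upgrade()
are all hypothetical names invented for this example. The only point is
that the latin-1 to UTF-8 upgrade path checks the flag first and refuses.

```c
#include <stdlib.h>

#define TOY_UTF8   0x1u  /* buffer holds UTF-8-encoded characters */
#define TOY_BINARY 0x2u  /* buffer is raw bytes; never upgrade it */

/* Hypothetical string type; buf must be malloc'ed. */
struct toy_str {
    unsigned char *buf;
    size_t len;
    unsigned flags;
};

/* Upgrade a latin-1 buffer to UTF-8, replacing the buffer.
 * Returns 0 on success (or if already UTF-8), and -1 if the string
 * is marked binary or memory runs out. */
static int toy_upgrade(struct toy_str *s)
{
    if (s->flags & TOY_BINARY)
        return -1;              /* binary data must stay untouched */
    if (s->flags & TOY_UTF8)
        return 0;               /* nothing to do */

    /* Worst case: every byte >= 0x80 becomes two bytes. */
    unsigned char *out = malloc(s->len * 2 + 1);
    if (out == NULL)
        return -1;

    size_t n = 0;
    for (size_t i = 0; i < s->len; i++) {
        unsigned char c = s->buf[i];
        if (c < 0x80) {
            out[n++] = c;
        } else {
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    free(s->buf);
    s->buf = out;
    s->len = n;
    s->flags |= TOY_UTF8;
    return 0;
}
```

Whether a refused upgrade should be an error, a warning, or a silent
no-op is a design question this sketch does not try to answer.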

[snipping a lot of thoughtful stuff]
