
Re: on the almost impossibility to write correct XS modules

Juerd Waalboer
May 20, 2008 07:15
Rafael Garcia-Suarez wrote 2008-05-20 10:55 (+0200):
> 2008/5/20 Marc Lehmann:
> > Unfortunately, perl doesn't really handle it that way. regexes for example
> > treat the same number on the perl level differently depending on how it's
> > encoded internally.
> > And this is a problem.
> You could add uc/lc to the list.

If you're looking for a list, the Unicode::Semantics documentation has
one. It's probably not complete, but it's a starting point.

    * uc, lc, ucfirst, lcfirst, \U, \L, \u, \l
    * \d, \s, \w, \D, \S, \W
    * /.../i, (?i:...)
    * /[[:posix:]]/
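
To make the list above concrete, here is a minimal sketch of the problem
for uc and \w. It assumes a perl with no Unicode-related pragma or locale
in effect; the variable names are only illustrative. The same character is
stored once as a plain byte and once as UTF-8 (via utf8::upgrade), and the
two copies compare equal yet can be treated differently.

```perl
use strict;
use warnings;

# The same one-character string (U+00E4, "a" with diaeresis) twice.
my $native   = "\xE4";     # stored as a single byte, UTF8 flag off
my $upgraded = "\xE4";
utf8::upgrade($upgraded);  # same character, stored as UTF-8, UTF8 flag on

# At the Perl level the two strings compare equal ...
print "eq: ", ($native eq $upgraded ? "yes" : "no"), "\n";

# ... but with no Unicode pragma in effect these operations can disagree.
for my $pair ([native => $native], [upgraded => $upgraded]) {
    my ($label, $str) = @$pair;
    printf "%-8s uc changes it: %s, matches \\w: %s\n",
        $label,
        (uc($str) ne $str ? "yes" : "no"),
        ($str =~ /\w/     ? "yes" : "no");
}
```

Nothing in the source code distinguishes the two strings; only the internal
flag does, which is exactly what makes the behaviour hard to predict.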

> Now, at the perl language level, I think the problem we have is that
> we sometimes want uc, lc or //i to have Unicode semantics, and sometimes
> not. (other operations here ?)

Er, why "sometimes not"?

Why would you uppercase something that's not text?

I suggest that we keep the possibility to uppercase only the ASCII
character range, and call that ASCII::uc(), while the normal uc() is
made Unicode compliant regardless of the PV's state.

Maybe this should even be called Unicode::uc(), and uc() should
"default" to Unicode, with "use ASCII qw(uc);" and "use Unicode qw(uc);"
as ways to override the default.
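
As a rough sketch of what an ASCII-only uppercase could look like: the
function name ascii_uc below is hypothetical (no ASCII:: module is assumed
to exist), and tr/// is just one way to restrict the change to a-z.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed ASCII::uc(): uppercases a-z only
# and leaves every other character untouched, whatever the internal encoding.
sub ascii_uc {
    my ($text) = @_;
    (my $copy = $text) =~ tr/a-z/A-Z/;
    return $copy;
}

my $word = "a\xE4";        # "a" followed by U+00E4
my $out  = ascii_uc($word);
printf "%vX\n", $out;      # code points of the result, dot-separated
```

Because tr/// works on characters rather than on the stored bytes, the
result does not depend on the state of the PV.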

> For those two cases we can:
> * Add a pragma that says "in this block, apply Unicode semantics".

There are three ways of dealing with text data in Perl:

1. The text is unicode (i.e. uc("aä") eq "AÄ")
2. The text is ASCII   (i.e. uc("aä") eq "Aä")
3. It's determined by the UTF8 flag.

It is now widely agreed that 3 is wrong. However, many parts of perl use
option 3 now.

I suggest that there also be a "use feature" called unicode_by_default,
which does no more than load the new pragma to enable unicode semantics.
That way "use v5.12;" would include the pragma, so you cannot forget to
request a particular behaviour.
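
A sketch of how such a lexically scoped pragma behaves. The name
unicode_by_default is only the proposal made here; the example assumes
perl 5.12 or later and uses the unicode_strings feature that later perls
ship, as a stand-in for it.

```perl
use strict;
use warnings;

my $text = "a\xE4";   # native 8-bit string, UTF8 flag off

my ($inside, $outside);
{
    # Stand-in for the pragma proposed in the post; needs perl 5.12+.
    use feature 'unicode_strings';
    $inside = uc $text;   # Unicode rules regardless of the UTF8 flag
}
$outside = uc $text;      # legacy rules: result depends on the UTF8 flag

printf "inside:  %vX\n", $inside;
printf "outside: %vX\n", $outside;
```

The effect ends with the enclosing block, so code outside it keeps the old
behaviour unless it asks for the feature itself.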

>   Additionally, we can add a regexp flag qr//u, that says "this
>   regexp matches with Unicode semantics". (I'm thinking out loud
>   here)

I have suggested /u(nicode), /a(scii) before. These are "needed" in
addition to the pragma, because of qr//: there must be a way to
stringify the lexically selected behavior so it survives the end of the
lexical scope.
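
A sketch of why the flags matter for qr//. It assumes perl 5.14 or later,
where /u and /a exist as regexp modifiers; the variable names are
illustrative.

```perl
use strict;
use warnings;

my ($re_u, $re_a);
{
    # The flags are baked into the compiled regex objects here ...
    $re_u = qr/\w/u;   # Unicode rules
    $re_a = qr/\w/a;   # ASCII-only rules for \w, \d, \s and POSIX classes
}

# ... and still apply out here, because they are part of the stringified form.
print "u: $re_u\n";
print "a: $re_a\n";

my $char = "\xE4";     # U+00E4 as a native 8-bit string
print "matches under /u\n"  if     $char =~ $re_u;
print "no match under /a\n" unless $char =~ $re_a;
```

The compiled pattern can be passed to other scopes or interpolated into a
larger regexp without losing the behaviour that was selected for it.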

> * Drop relying on the SvUTF8 flag to choose whether Unicode semantics
>   should be applied. Big change, not backwards compatible, but IMO
>   needed for sanity.


However, there's also a way to 

> But sometimes we want perl to magically switch between Unicode and
> non-Unicode semantics depending on the data it's handling.

No, we don't want this to happen MAGICALLY. Or at least I really do not
want Perl to do that. This is one place where DWIM heuristics simply
cannot work.

> Does that mean that we need to add a new kind of data to perl,
> "Unicode SV" ?  Will that solve problems ? What problems will this
> create ?

Indeed there could be a way to indicate "I intend this string to be a
byte string". I have a module, called BLOB, in the works that makes
this very easy. I'll try to release it really soon so you can have a

Because of the way BLOB works, it could probably be used by XS and core
code too. BLOB assumes that everything is text until explicitly marked
as binary.

Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker
  Convolution:     ICT solutions and consultancy