
Re: on the almost impossibility to write correct XS modules

From: Juerd Waalboer
Date: May 20, 2008 07:15
Subject: Re: on the almost impossibility to write correct XS modules
Message ID: 20080520141536.GE2842@c4.convolution.nl

Rafael Garcia-Suarez wrote on 2008-05-20 10:55 (+0200):
> 2008/5/20 Marc Lehmann <schmorp@schmorp.de>:
> > Unfortunately, perl doesn't really handle it that way. regexes for example
> > treat the same number on the perl level differently depending on how it's
> > encoded internally.
> > And this is a problem.
> You could add uc/lc to the list.

If you're looking for a list, the Unicode::Semantics documentation has
one. It's probably not complete, but it's a starting point.

    * uc, lc, ucfirst, lcfirst, \U, \L, \u, \l
    * \d, \s, \w, \D, \S, \W
    * /.../i, (?i:...)
    * /[[:posix:]]/
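
A minimal sketch of that flag dependence, using uc and \w from the list
above. It assumes a perl new enough to know the unicode_strings feature
(5.12 or later), which is switched off explicitly here so that the
legacy rules apply; the variable names are only illustrative.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # assumption: force the legacy, flag-dependent rules

my $native   = "\xE4";          # a-umlaut stored as one byte, UTF8 flag off
my $upgraded = "\xE4";
utf8::upgrade($upgraded);       # same character, now stored as UTF-8, flag on

# The two strings are the same string as far as Perl code can tell ...
print "equal at the Perl level\n"  if $native eq $upgraded;

# ... yet case mapping and character classes treat them differently.
print "uc() leaves native alone\n" if uc($native)   eq "\xE4";
print "uc() maps upgraded\n"       if uc($upgraded) eq "\xC4";
print "native is not \\w\n"        if $native   !~ /\w/;
print "upgraded is \\w\n"          if $upgraded =~ /\w/;
```

Both strings compare equal with eq, so nothing at the Perl level tells
the caller which behaviour to expect.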

> Now, at the perl language level, I think the problem we have is that
> we sometimes want uc, lc or //i to have Unicode semantics, and sometimes
> not. (other operations here ?)

Er, why "sometimes not"?

Why would you uppercase something that's not text?

I suggest that we keep the possibility to uppercase only the ASCII
character range, and call that ASCII::uc(), while the normal uc() is
made Unicode compliant regardless of the PV's state.

Maybe this should even be called Unicode::uc(), and uc() should
"default" to Unicode, with "use ASCII qw(uc);" and "use Unicode qw(uc);"
as ways to override the default.
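
To make the two behaviours concrete, here is a rough sketch in plain
Perl. The names ascii_uc and unicode_uc are hypothetical stand-ins for
the proposed ASCII::uc() and Unicode::uc(); neither module exists, and
this is only one way such functions might be written.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ASCII::uc(): touch only a-z, whatever the
# internal encoding of the string is.
sub ascii_uc {
    my ($text) = @_;
    (my $copy = $text) =~ tr/a-z/A-Z/;
    return $copy;
}

# Hypothetical stand-in for Unicode::uc(): upgrade a private copy so
# that uc() applies Unicode rules regardless of the caller's UTF8 flag.
sub unicode_uc {
    my ($text) = @_;
    utf8::upgrade($text);       # changes only our copy, not the caller's string
    return uc $text;
}

my $text = "a\xE4";             # "aä" stored as bytes, UTF8 flag off
printf "ascii_uc:   %vX\n", ascii_uc($text);     # code points in hex
printf "unicode_uc: %vX\n", unicode_uc($text);
```

Neither function looks at the state of the PV to decide what to do,
which is the point of the suggestion.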

> For those two cases we can:
> * Add a pragma that says "in this block, apply Unicode semantics".

There are three ways of dealing with text data in Perl:

1. The text is Unicode (i.e. uc("aä") eq "AÄ")
2. The text is ASCII   (i.e. uc("aä") eq "Aä")
3. It's determined by the UTF8 flag.

It is now widely agreed that option 3 is wrong. However, many parts of
perl currently use it.

I suggest that there also be a "use feature" called unicode_by_default
that does no more than load the new pragma to enable Unicode semantics.
That way "use v5.12;" includes the pragma, so you cannot forget to
request a particular behaviour.
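
For illustration, a lexically scoped switch of this kind might look as
follows. The sketch uses the unicode_strings feature that perl 5.12 and
later provide; treating it as a stand-in for the unicode_by_default
feature proposed here is an assumption, not something this message
specifies.

```perl
use strict;
use warnings;

my $text = "a\xE4";                  # "aä" stored as bytes, UTF8 flag off

my ($with, $without);
{
    use feature 'unicode_strings';   # Unicode rules inside this block only
    $with = uc $text;
}
{
    no feature 'unicode_strings';    # legacy rules: the UTF8 flag decides
    $without = uc $text;
}

# Print the resulting code points in hex so the difference is visible.
printf "with: %vX  without: %vX\n", $with, $without;
```

The same uc() call on the same string gives different results in the
two blocks, and the choice is visible in the source rather than hidden
in the string's internal state.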

>   Additionally, we can add a regexp flag qr//u, that says "this
>   regexp matches with Unicode semantics". (I'm thinking out loud
>   here)

I have suggested /u(nicode), /a(scii) before. These are "needed" in
addition to the pragma, because of qr//: there must be a way to
stringify the lexically selected behavior so it survives the end of the
lexical scope.
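
A sketch of how such flags carry the chosen behaviour along with a
compiled regex. It assumes a perl that accepts /u and /a on patterns
(5.14 or later); whether those flags match this proposal in every
detail is an assumption.

```perl
use strict;
use warnings;

my $unicode_word = qr/\w/u;     # Unicode rules, whatever the UTF8 flag says
my $ascii_word   = qr/\w/a;     # \w restricted to the ASCII range

my $native   = "\xE4";          # a-umlaut, UTF8 flag off
my $upgraded = "\xE4";
utf8::upgrade($upgraded);       # same character, UTF8 flag on

# The flag is part of the stringified regex, so it survives being
# passed around or interpolated outside the scope that created it.
print "stringified: $unicode_word and $ascii_word\n";

for my $string ($native, $upgraded) {
    printf "unicode: %d  ascii: %d\n",
        ($string =~ $unicode_word ? 1 : 0),
        ($string =~ $ascii_word   ? 1 : 0);
}
```

Both strings get the same answer from each regex, so the result depends
on the flag chosen by the author of the pattern and not on how the
string happens to be stored.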

> * Drop relying on the SvUTF8 flag to choose whether Unicode semantics
>   should be applied. Big change, not backwards compatible, but IMO
>   needed for sanity.

Yes!

However, there's also a way to 

> But sometimes we want perl to magically switch between Unicode and
> non-Unicode semantics depending on the data it's handling.

No, we don't want this to happen MAGICALLY. Or at least I really do not
want Perl to do that. This is one place where DWIM heuristics simply
cannot work.

> Does that mean that we need to add a new kind of data to perl,
> "Unicode SV" ?  Will that solve problems ? What problems will this
> create ?

Indeed there could be a way to indicate "I intend this string to be a
byte string". I have a module, called BLOB.pm, in the works that makes
this very easy. I'll try to release it really soon so you can have a
look.

Because of the way BLOB works, it could probably be used by XS and core
code too. BLOB assumes that everything is text until explicitly marked
as binary.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####@juerd.nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales@convolution.nl>
1;


