Re: on the almost impossibility to write correct XS modules

From:
Ben Morrow
Date:
May 20, 2008 03:48
Subject:
Re: on the almost impossibility to write correct XS modules
Message ID:
uvlag5-in4.ln1@osiris.mauzo.dyndns.org

Quoth rgarciasuarez@gmail.com ("Rafael Garcia-Suarez"):
> 
> For those two cases we can:
> * Add a pragma that says "in this block, apply Unicode semantics".
>   Additionally, we can add a regexp flag qr//u, that says "this
>   regexp matches with Unicode semantics". (I'm thinking out loud
>   here) (Also, probably any regexp that uses \p should be considered
>   "in Unicode mode")
> * Drop relying on the SvUTF8 flag to choose whether Unicode semantics
>   should be applied. Big change, not backwards compatible, but IMO
>   needed for sanity.

++

> But sometimes we want perl to magically switch between Unicode and
> non-Unicode semantics depending on the data it's handling. Does that
> mean that we need to add a new kind of data to perl, "Unicode SV" ?
> Will that solve problems ? What problems will this create ?

This seems sane to me. While we're there we can make the new type (QV?
UPV?) a wchar_t* instead of a utf8-encoded char*. That way we get
automatic conversion as needed, with caching. We also get the ability to
declare 'all my 8-bit strings are in $encoding' rather than being fixed
to ISO8859-1. (This must be a *different* option from the one that says
'my source code is in $encoding', though they could default to the same
thing. I don't know what to do about literal strings: reencode them,
probably.)
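
As a rough illustration of the 'autoconversion as needed, with caching'
idea, here is a toy model in Python rather than in perl's C internals.
Everything in it is an assumption made for the sketch: the class name
DualString, the pv()/upv() methods and the per-value 'encoding' attribute
are invented, and a Python str stands in for the wchar_t* buffer.

```python
class DualString:
    """Toy stand-in for an SV with a byte form (SvPV) and/or a character form (SvUPV)."""

    def __init__(self, raw=None, text=None, encoding="iso-8859-1"):
        if raw is None and text is None:
            raise ValueError("need a byte form, a character form, or both")
        self.encoding = encoding  # declared encoding of the 8-bit form
        self._raw = raw           # bytes or None; plays the role of POK
        self._text = text         # str or None; plays the role of UPOK

    @property
    def pok(self):
        return self._raw is not None

    @property
    def upok(self):
        return self._text is not None

    def upv(self):
        """Return the character form, decoding and caching it on first use."""
        if self._text is None:
            self._text = self._raw.decode(self.encoding)
        return self._text

    def pv(self):
        """Return the byte form, encoding and caching it on first use.

        Raises UnicodeEncodeError if a character has no representation
        in the declared encoding.
        """
        if self._raw is None:
            self._raw = self._text.encode(self.encoding)
        return self._raw
```

With this model the same byte 0x80 decodes to a control character when the
declared encoding is ISO8859-1 and to a euro sign when it is cp1252, which
is the point of letting the user declare the encoding of 8-bit strings.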

A lot of care will be needed to get all the cases right. For instance,
what happens when a (POK, UPOK) SV is string-compared with a (POK) SV?
I think the right answer is

    - by default, if any argument of a string operation is UPOK then all
      of them are upgraded to UPOK and the operation occurs on SvUPV; if
      all are !UPOK then they are all upgraded to POK and the operation
      occurs on SvPV. (This assumes all numbers can be represented in
      the current character set :).)
      
      This 'upgrade' may in fact be a 'downgrade' by current SvUTF8
      terminology, from UPOK->POK, in which case any characters that
      can't be encoded elicit a warning. Ideally all of Encode's options
      should be applicable.

      What to do about chr/ord/"\x", especially given that some
      encodings have more than 256 characters, I'm not sure. I suspect
      the current 'assume numbers <256 are byte values and go in SvPV,
      and numbers >255 are Unicode codepoints and go in SvUPV' is a
      decent compromise, *given that users can ask for sane semantics if
      they want them*.

    - under 'use bytes', all string operations upgrade all SVs to POK
      ('upgrade' in the sense above). chr stuffs literal bytes into SvPV.

    - under 'use unicode', all string operations upgrade all SVs to
      UPOK, and chr takes a Unicode codepoint and returns a string that
      is UPOK only. This means that the numbers passed to chr mean
      different things under 'unicode' and 'bytes'. This is a feature :).

    - regexes know which of SvPV and SvUPV they should be matching
      against. I think we need two new flags, /u and /U (or maybe /b),
      with the default being bytes if use-bytes, unicode if use-unicode,
      and guess if neither.

    - 'use locale' can probably be made to work again, if it is only
      applied to SvPV and never to SvUPV. 'use locale' should probably
      imply 'use bytes', and set the current encoding.
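
To make the first three rules concrete, here is a toy model in Python (not
perl internals). The names SV, to_pv, to_upv, operands, str_eq and chr_ are
all invented for the sketch, the mode strings stand in for the pragmas, and
I have assumed ISO8859-1 as the declared 8-bit encoding and 'replace' as
the policy for characters that can't be downgraded.

```python
import warnings

ENCODING = "iso-8859-1"  # assumed declared encoding for 8-bit strings


class SV:
    """Toy scalar: pv is the byte form (POK), upv the character form (UPOK)."""

    def __init__(self, pv=None, upv=None):
        self.pv = pv
        self.upv = upv


def to_upv(sv):
    """'Upgrade' to UPOK: decode the byte form if no character form exists."""
    if sv.upv is None:
        sv.upv = sv.pv.decode(ENCODING)
    return sv.upv


def to_pv(sv):
    """'Upgrade' to POK: encode the character form, warning on lossy cases."""
    if sv.pv is None:
        try:
            sv.pv = sv.upv.encode(ENCODING)
        except UnicodeEncodeError:
            warnings.warn("character not representable in " + ENCODING)
            sv.pv = sv.upv.encode(ENCODING, errors="replace")
    return sv.pv


def operands(args, mode="default"):
    """Choose one representation for every operand of a string operation."""
    any_upok = any(a.upv is not None for a in args)
    if mode == "unicode" or (mode == "default" and any_upok):
        return [to_upv(a) for a in args]
    return [to_pv(a) for a in args]  # 'bytes' mode, or nothing is UPOK


def str_eq(a, b, mode="default"):
    x, y = operands([a, b], mode)
    return x == y


def chr_(n, mode="default"):
    if mode == "bytes":
        return SV(pv=bytes([n]))   # literal byte; ValueError above 255
    if mode == "unicode" or n > 255:
        return SV(upv=chr(n))      # Unicode codepoint, UPOK only
    return SV(pv=bytes([n]))       # default compromise: <256 is a byte
```

So comparing a (POK, UPOK) value with a (POK) value under the default rule
decodes the second operand and compares characters, while chr_(233) yields a
byte by default and a codepoint under the 'unicode' mode.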

This would at least allow users to specify that they understand Unicode
and want consistent semantics, without losing the ability to manipulate
binary data.

Ben

-- 
               We do not stop playing because we grow old; 
                  we grow old because we stop playing.
                            ben@morrow.me.uk


