perl.perl5.porters | Postings from May 2008

Re: on the almost impossibility to write correct XS modules

From: Rafael Garcia-Suarez
Date: May 20, 2008 05:33
Subject: Re: on the almost impossibility to write correct XS modules
Message ID: b77c1dce0805200532h6662b1edwbd18b09546904787@mail.gmail.com

2008/5/20 Ben Morrow <ben@morrow.me.uk>:
>> But sometimes we want perl to magically switch between Unicode and
>> non-Unicode semantics depending on the data it's handling. Does that
>> mean that we need to add a new kind of data to perl, "Unicode SV" ?
>> Will that solve problems ? What problems will this create ?
>
> This seems sane to me. While we're there we can make the new type (QV?
> UPV?) a wchar_t* instead of a utf8-encoded char*. That way we get
> autoconversion as needed, with cacheing. We also get the ability to
> declare 'all my 8-bit strings are in $encoding' rather than being fixed
> to ISO8859-1. (This must be a *different* option from the one that says
> 'my source code is in $encoding', though they could default to the same
> thing. I don't know what to do about literal strings: reencode them,
> probably.)

You're mixing Unicode and encodings, there.

Here's my position :
- to deal with encodings, use Encode.
- no encoding-aware strings in core perl. (of course, you can still
  use magic, ties, etc. to add behaviour)
- the "Unicodeness" of a string would be independent of its SvUTF8 flag.
  It will just indicate that <some list of perl built-ins> must apply
  Unicode semantics when dealing with it.
- the "unicode" pragma (or whatever name is chosen) will be needed to
  say that <same list of perl built-ins> in its scope must apply
  Unicode semantics to Perl strings. (as opposed to newfangled Unicode
  strings)

Currently we have :

    $ bleadperl -wle 'print uc "ß"'
    ß
    $ bleadperl -wle 'use utf8; print uc "ß"'
    SS

That's wrong: the utf8 pragma indicates internal encoding, yet it also
modifies semantics. What I have in mind is: make those two one-liners
output an ß. Under the "unicode" pragma, make them both output SS.

And, assuming we add a new flag on SVs (let's call it UPOK like you did
below) for Unicode strings, and that a new quotelike operator qu// is
added to create them, have "uc qu/ß/" return a UPOK SV containing "SS"
in the PV slot. That PV slot could be SvUTF8 or not; that should not
matter and should not be visible from perl. ("SS" is perfectly
representable in pure ASCII so SvUTF8 isn't needed there.)

On the other hand C<use unicode; uc qq/ß/> would return "SS" without
the UPOK flag set.
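
To make the difference between the two semantics concrete, here is a
minimal C sketch. It is not perl's implementation: both function names
are invented for illustration, and the input is assumed to hold one
Latin-1 code point per byte. The first function applies byte semantics
(only ASCII letters change); the second applies a toy subset of Unicode
case mapping, including the full mapping of U+00DF to "SS".

```c
#include <stddef.h>

/* Hypothetical helper: byte semantics. Only ASCII a-z are mapped;
 * every other byte, including 0xDF, passes through unchanged. */
size_t uc_bytes(const unsigned char *in, size_t len, unsigned char *out)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        out[i] = (c >= 'a' && c <= 'z') ? (unsigned char)(c - 'a' + 'A') : c;
    }
    return len;
}

/* Hypothetical helper: Unicode semantics over Latin-1 code points.
 * `out` needs room for 2 * len bytes, because U+00DF uppercases to "SS".
 * Toy subset: U+00B5 and U+00FF are left alone, since their uppercase
 * forms lie outside Latin-1. */
size_t uc_unicode(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c == 0xDF) {                       /* sharp s: one char becomes two */
            out[n++] = 'S';
            out[n++] = 'S';
        } else if (c >= 'a' && c <= 'z') {     /* ASCII letters */
            out[n++] = (unsigned char)(c - 'a' + 'A');
        } else if (c >= 0xE0 && c <= 0xFE && c != 0xF7) {
            out[n++] = (unsigned char)(c - 0x20);  /* U+00E0..U+00FE except the division sign */
        } else {
            out[n++] = c;
        }
    }
    return n;
}
```

Fed the single byte 0xDF, uc_bytes returns it unchanged with length 1,
while uc_unicode returns the two bytes "SS". Note that the result length
can differ from the input length under Unicode semantics.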

> A lot of care will be needed to get all the cases right. For instance,
> what happens when a (POK, UPOK) SV is string-compared with a (POK) SV?

Indeed, we'll need a matrix there.

> I think the right answer is
>
>    - by default, if any argument of a string operation is UPOK then all
>      of them are upgraded to UPOK and the operation occurs on SvUPV; if
>      all are !UPOK then they are all upgraded to POK and the operation
>      occurs on SvPV. (This assumes all numbers can be represented in
>      the current character set :).)

With the implementation I outlined above, that's an upgrade as per
sv_upgrade.

Do we really want this upgrade to be done transparently ? Like, in
concatenating an SV and a USV ? Remember why we needed
encoding::warnings ? Because we can't know what encoding a
Perl string is in.

We could do it the hard way (also known as the python way) : forbid
any mix between Unicode strings and Perl strings. Force people to
write C<$foo = qu/$foo/> to get Unicode strings. (*regardless* of
any encoding issue or PerlIO layer or locale or pragma.) Make qq/$foo/
warn if $foo is a Unicode string (being thus downgraded to a Perl
string).

We could do it the dwimmy way: apply heuristics when mixing POK SVs and
UPOK SVs, play guessing games about encodings, and end up with complicated
rules that will duplicate the current bugs with the UTF8 flag.

I would prefer the hard way.
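
A rough model of what the hard way means for a string operation such as
concatenation, as a C sketch. Everything here is hypothetical: the
model_sv type, the str_kind enum and concat_kind() are invented for the
illustration and exist nowhere in perl. The only point is that a mix of
kinds is refused instead of being resolved by guessing.

```c
/* Hypothetical model of the proposal: a string is either a plain Perl
 * string or a Unicode string (the UPOK flag discussed in this thread). */
typedef enum { KIND_PERL, KIND_UNICODE } str_kind;

typedef struct {
    str_kind kind;
    const char *pv;
} model_sv;

/* "Hard way": the operation is allowed only between strings of the same
 * kind. Returns 0 and stores the result kind on success, -1 on a mix. */
int concat_kind(const model_sv *a, const model_sv *b, str_kind *result)
{
    if (a->kind != b->kind)
        return -1;      /* caller must convert explicitly first */
    *result = a->kind;
    return 0;
}
```

With this rule, mixing a KIND_PERL and a KIND_UNICODE value fails, and
two KIND_UNICODE values give a KIND_UNICODE result. No encoding is ever
consulted, which is the whole appeal.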

>      This 'upgrade' may in fact be a 'downgrade' by current SvUTF8
>      terminology, from UPOK->POK, in which case any characters that
>      can't be encoded elicit a warning. Ideally all of Encode's options
>      should be applicable.
>
>      What to do about chr/ord/"\x", especially given that some
>      encodings have more than 256 characters, I'm not sure. I suspect
>      the current 'assume numbers <256 are byte values and go in SvPV,
>      and numbers >255 are Unicode codepoints and go in SvUPV' is a
>      decent compromise, *given that users can ask for sane semantics if
>      they want them*.

You're confusing Unicode and encoding again.

>    - under 'use bytes', all string operations upgrade all SVs to POK,
>      'upgrade' as above. chr stuffs literal bytes into SvPV.

chr reduces its argument modulo 256 under bytes, and I'd like to keep that:

    $ bleadperl -Mbytes -le 'print ord chr 258'
    2

To my understanding "use bytes" means "don't look at the SvUTF8 flag".
See in utf8.h :

    #define IN_BYTES (CopHINTS_get(PL_curcop) & HINT_BYTES)
    #define DO_UTF8(sv) (SvUTF8(sv) && !IN_BYTES)
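
The two points above (chr truncating under bytes, and the UTF8 flag
being ignored under bytes) can be modelled in a few lines of standalone
C. This is only a sketch: MODEL_HINT_BYTES, model_sv and both functions
are stand-ins invented here, and the bit value is arbitrary, not the one
in perl's headers.

```c
/* Hypothetical stand-ins for perl internals; the names and the bit
 * value are invented for this model. */
#define MODEL_HINT_BYTES 0x1u

typedef struct {
    int utf8_flag;      /* plays the role of SvUTF8(sv) */
} model_sv;

/* Same shape as DO_UTF8: the flag counts only outside "use bytes". */
int model_do_utf8(const model_sv *sv, unsigned hints)
{
    int in_bytes = (hints & MODEL_HINT_BYTES) != 0;
    return sv->utf8_flag && !in_bytes;
}

/* Models "ord chr n" under bytes: only the low 8 bits survive. */
unsigned model_ord_chr_bytes(unsigned n)
{
    return n & 0xFFu;
}
```

Here model_ord_chr_bytes(258) gives 2, matching the one-liner, and a
flagged model_sv is treated as UTF-8 only when the bytes hint is off.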

>    - under 'use unicode', all string operations upgrade all SVs to
>      UPOK, and chr takes a Unicode codepoint and returns a string that
>      is UPOK only. This means that the numbers passed to chr mean
>      different things under 'unicode' and 'bytes'. This is a feature :).

I still prefer the hard way.

>    - regexes know which of SvPV and SvUPV they should be matching
>      against. I think we need two new flags, /u and /U (or maybe /b),
>      with the default being bytes if use-bytes, unicode if use-unicode,
>      and guess if neither.

Ah yes. What about regexps (now type SVt_REGEXP) with the UPOK flag set?

I think that one flag /u is enough. m//u would be equivalent to
C<use unicode; m//>. It would be forbidden to mix qr// and qr//u.

Also, captures would retain the UPOK flag from the matched string.

>    - 'use locale' can probably be made to work again, if it is only
>      applied to SvPV and never to SvUPV. 'use locale' should probably
>      imply 'use bytes', and set the current encoding.

I'd be happy to lay locale to rest. In peace.

Gosh, did I just come up with a big plan to save Unicode in perl ?

Now, I have a slight feeling that this whole plan might be bullshit
because I've overlooked something obvious. I'll have to think a bit,
read replies, and maybe summarize and post a model proposal cc:ing
all the experts.
