perl.perl5.porters | Postings from May 2008

Re: on the almost impossibility to write correct XS modules

Juerd Waalboer
May 20, 2008 06:42
Jan Dubois wrote 2008-05-18  8:40 (-0700):
> No, "\xff" is guaranteed to have byte semantics for backwards compatibility:

"byte semantics" is a dangerous term, partly because different people
use it for different things. Some people use it to refer to functions
and operators acting on the bytes in the PV's buffer regardless of the
SvUTF8 flag's state, but those functions are generally broken and in
need of repair (as announced in perl5100delta, this would break

By default, "\xff" by itself will indeed create a string that
*internally* is a single byte 0xff.
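To make that concrete, here is a minimal sketch that inspects such a
string. It assumes a plain script with no `use utf8` and no encoding
pragmas in effect; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $str = "\xff";

# Seen from Perl code this is one character with the number 255.
my $length    = length $str;
my $codepoint = ord $str;

# utf8::is_utf8 reports only the *internal* representation. A false value
# means "stored one byte per character", not "this is not a Unicode string".
my $is_utf8_internally = utf8::is_utf8($str) ? 1 : 0;

printf "length=%d ord=%d internally-utf8=%d\n",
    $length, $codepoint, $is_utf8_internally;
```

The flag is worth looking at when debugging, but ordinary code should
not need to care about it.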

A Perl string is a Unicode string. Or actually, a sequence of almost
arbitrary integer values that most operations ought to interpret as
Unicode codepoints. If it contains only characters < 256, it may be
"encoded as latin1" (represented as 8 bit with a straight mapping)
internally, both for efficiency and for backwards compatibility. When
strings are sent or received with system calls, that has to occur in
bytes. If a string only contains characters < 256, it can be used as a
byte string. (Note: I originally believed otherwise and was wrong.)

Still, it can be useful to write your program so that a string that
will be used as a byte string is never upgraded to UTF8 *internally*:
upgrading and then downgrading it again might be a performance issue.
There should be no difference in semantics, regardless of the internal
encoding of the string. It is a bug that there is.
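As a rough illustration of "one string type, two internal
representations", the sketch below upgrades a copy of a string and
compares it with the original. The sample text and variable names are
assumptions made for the example.

```perl
use strict;
use warnings;

my $native   = "caf\xe9";   # stored internally as one byte per character
my $upgraded = $native;
utf8::upgrade($upgraded);   # same characters, now UTF-8 encoded internally

# The internal flag differs between the two copies ...
my $flags_differ =
    ( utf8::is_utf8($upgraded) && !utf8::is_utf8($native) ) ? 1 : 0;

# ... but as Perl strings they are equal and have the same length.
my $same_string = $native eq $upgraded ? 1 : 0;
my $same_length = length($native) == length($upgraded) ? 1 : 0;

# Downgrading works because every character is < 256. The second
# argument asks utf8::downgrade to return false instead of dying.
my $round_trip    = $upgraded;
my $downgraded_ok = utf8::downgrade( $round_trip, 1 ) ? 1 : 0;

print "flags differ: $flags_differ, equal: $same_string, ",
      "same length: $same_length, downgraded: $downgraded_ok\n";
```

Comparison and length already behave the same for both copies; the
remaining bugs are in the operators that still look at the flag.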

I believe that this snippet:

> perluniintro.pod:
> | Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
> | and C<chr(...)> for arguments less than C<0x100> (decimal 256)
> | generate an eight-bit character for backward compatibility with older
> | Perls.  For arguments of C<0x100> or more, Unicode characters are
> | always produced. If you want to force the production of Unicode
> | characters regardless of the numeric value, use C<pack("U", ...)>
> | instead of C<\x..>, C<\x{...}>, or C<chr()>.

is misleading. It suggests that Perl has two kinds of strings
technically, which is not true. There is a single string type with two
*internal* representations. The word *internal* is notably missing in
the quoted part of perluniintro.

Let's change "generate an eight-bit character" to "generate a string
that has an eight-bit encoding internally".

In any case, CHARACTERS DO NOT HAVE BITS. Bytes have 8 bits, characters
just have a number.
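A small sketch of the distinction, using the core Encode module. The
choice of U+263A and of the two encodings is arbitrary: the character
has one number, and how many bytes it takes depends only on the
encoding you pick when you turn it into bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = chr(0x263A);              # WHITE SMILING FACE

my $char_count = length $char;       # one character
my $number     = ord $char;          # its number, 0x263A

# Bytes only appear once an encoding is chosen.
my $utf8_bytes  = encode( 'UTF-8',    $char );
my $utf16_bytes = encode( 'UTF-16BE', $char );

printf "chars=%d number=U+%04X utf8=%d bytes utf16be=%d bytes\n",
    $char_count, $number, length $utf8_bytes, length $utf16_bytes;
```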

> As perluniintro.pod above points out, the only reliable way to do this
> is pack("U", $codepoint).  Or you can use named characters using

If the remaining bugs in Perl (see also Unicode::Semantics) are fixed,
then there is no longer any *need* for forcing the internal encoding to

This said, I think that pack("U", $codepoint) is not a very good idea.
Without digressing into details, I would like to point out that it's
usually better to associate the upgrade with the buggy operator, rather
than the string itself.

So instead of:

    my $char = pack("U", $codepoint);

    ...  # perhaps lots of code here

    my $uc = uc($char);

I would suggest using:

    my $char = chr($codepoint);

    ... # perhaps lots of code here

    utf8::upgrade($char);  # work around bug
    my $uc = uc($char);
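For a concrete run of that workaround, here is a sketch with codepoint
0xE9 (e with acute). It assumes no `use locale` and no `unicode_strings`
feature in scope, and it upgrades a copy only so that both results can
be compared side by side.

```perl
use strict;
use warnings;

my $codepoint = 0xE9;                # LATIN SMALL LETTER E WITH ACUTE
my $char      = chr($codepoint);     # stored as a single byte internally

# The buggy case: on a non-upgraded string, uc() may leave characters
# in the range 128..255 unchanged.
my $uc_native = uc($char);

# The workaround, placed right next to the buggy operator.
my $copy = $char;
utf8::upgrade($copy);
my $uc_upgraded = uc($copy);         # U+00C9, E with acute

printf "without upgrade: U+%04X, with upgrade: U+%04X\n",
    ord $uc_native, ord $uc_upgraded;
```

Keeping the utf8::upgrade call next to uc() documents which operator
needed it, so the line can be deleted once the operator is fixed.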

> [perluniintro]
> | Internally, Perl currently uses either whatever the native eight-bit
> | character set of the platform (for example Latin-1) is

This is simply not true. For strings that are eight-bit internally,
Perl uses either latin1 or EBCDIC, not whatever the platform's native
character set happens to be. Not Windows-1252, for example.

> | defaulting to UTF-8, to encode Unicode strings.

defaulting to UTF-8, WITH A WARNING, for strings that could not be
downgraded, i.e. strings that contain characters > 255.

The warning is there for a reason: it says you're doing it wrong. You're
forcing a byte-incompatible string on a byte operation (system call),
and forgot to encode.
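
A sketch of what that looks like in practice. An in-memory filehandle
stands in for a byte-oriented handle here (that substitution, and the
variable names, are assumptions for the example); printing a character
above 255 to it without encoding should trigger the "Wide character"
warning, and encoding first avoids it.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "smile: " . chr(0x263A);  # contains a character > 255

my @warnings;
my ( $raw, $encoded ) = ( '', '' );

{
    # Collect warnings instead of sending them to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    # Doing it wrong: a character string forced on a byte handle.
    open my $fh, '>', \$raw or die "open: $!";
    print {$fh} $text;
    close $fh;
}

{
    # Doing it right: choose an encoding, then write the bytes.
    open my $fh, '>', \$encoded or die "open: $!";
    print {$fh} encode( 'UTF-8', $text );
    close $fh;
}

print "warnings seen: ", scalar(@warnings), "\n";
print "encoded length in bytes: ", length($encoded), "\n";
```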

Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker
  Convolution:     ICT solutions and consultancy