perl.perl5.porters | Postings from May 2008

Re: on the almost impossibility to write correct XS modules

Juerd Waalboer
May 20, 2008 14:24
Glenn Linderman skribis 2008-05-20 13:44 (-0700):
> >If a string only contains characters < 256, it can be used as a
> >byte string. (Note: I originally believed otherwise and was wrong.)
> I'm glad to see that you have expanded your understanding of strings to 
> realize that they are sequences of integer values.

Just for the record: I've believed this for quite some time now, and I
think my documentation patches are consistent with it. If you find any
inconsistency there, please let me know.

> I'm still a bit concerned by your "almost arbitrary" modifier, mostly 
> because I'm not sure what you mean by that.

Perl does assume that the values are Unicode codepoints in some places,
and sometimes warns if they are deemed invalid by Unicode.

One example of many:

    my $foo = chr 0xffff;


    Unicode character 0xffff is illegal at -e line 1.

Even though ord($foo) properly returns 65535 afterwards.

This is not consistent with the view that a string is made of characters
which are just integers, with no character set logic implied.

By "almost arbitrary" I mean that while it is possible to use these
values, Perl will complain about it.
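
A minimal sketch of the point above: the value survives the round trip through chr and ord regardless of any complaint. Whether a warning is actually emitted may depend on the perl version in use, so the sketch only collects warnings rather than assuming one appears. The variable names @warnings and $roundtrip are illustrative, not from the original post.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so the script behaves the
# same whether or not this perl complains about the code point.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $foo       = chr 0xffff;
my $roundtrip = ord $foo;     # the integer comes back unchanged

printf "ord is %d, warnings collected: %d\n", $roundtrip, scalar @warnings;
```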

See also Chris Hall's insightful posts about this subject.

> >In any case, CHARACTERS DO NOT HAVE BITS. Bytes have 8 bits, characters
> >just have a number.
> Except for the historical, inherited-from-C, concept of an 8-bit char, I 
> could agree with this.

It's a modern computing fact that every byte has 8 bits. It has been
different in the past, yes, but to my knowledge no current computer
system has non-8-bit bytes. I'm not calling that a char, by the way. I
don't know why you think I'm using that concept.

> I _do_ agree that it would be good to develop a set of terminology that 
> can be well-defined, used throughout the documentation as it is updated, 

I have tried that, but it turned out to be impossible to reach consensus
over the terminology.

Specifically, "character" is a good name for what the Perl documentation
tries to communicate. If you want to store arbitrary integers in a
string, that's supported but entirely up to you. The normal Perl way of
doing that would be to use an array. Strings in Perl are mainly used for
text data and binary data. That's a somewhat limited view of this very
useful data type, but it helps to make teaching doable.

> I continue to use "blorf", but it needs a different name, preferably not 
> "character" or "char", because those have too many semantics inherited 
> from other programming languages and concepts.

"hash" also has a rather different meaning in general computing jargon:
an MD5 hash is not at all a key-value structure. Sometimes it is
practical to re-use existing words. But it needs to be done consistently
and there needs to be a huge corpus that actually uses the term in its
new meaning. Both requirements are met for "character" in Perl's

Note that in a distant past, "lists" were called "arrays" in Perl's
documentation, even though they're very different from "arrays" like
@foo. It is possible to change a word, but there must be a very good
reason for it. In my opinion, inconsistency with Perl stuff is a good
reason, but inconsistency with other languages is not.

> chr and ord are inverse operations.

Only for characters, er blorfs, within the supported range.

> Byte strings are a subset of strings that contain only blorfs in the 
> range [0..255].

Note that in general the *operation* determines the kind of string.
Operations involving system communication like print and readdir are
used with binary strings (or explicit encoding through encode(),
encode_utf8(), utf8::encode() or :encoding). Some operations don't care
about how the string will be used, and just work on the charac.. blorfs,
like length() and substr(), whereas others are specifically text
related, and impose textness on the string: uc(), /\w/...

In other words: if you use the string "5" as a number, it IS a number.

If you use the string $foo as a number, it is a number.
If you use the string $foo as binary data, it is binary data [**].
If you use the string $foo as text data, it is text data.

Perl handles this for you.
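
As a rough illustration of the operation deciding how a string is treated, the sketch below uses one string in four ways. The sample string and the variable names are assumptions made for the example; encode_utf8() is the function from the Encode module mentioned above.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# GREEK SMALL LETTER ALPHA followed by the digit 5
my $foo = "\x{3b1}5";

my $count  = length $foo;            # character count, no textness implied
my $number = substr($foo, 1) + 1;    # the "5" used as a number
my $upper  = uc $foo;                # text operation: alpha becomes capital Alpha
my $octets = encode_utf8($foo);      # explicit encoding yields a byte string

# Print the byte string as hex so no wide character reaches STDOUT.
print join(' ', map { sprintf '%02X', ord } split //, $octets), "\n";
```

The same $foo is never converted or flagged by the programmer; each operation simply uses it in its own way.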

[**] Of course, just like "5" being perfectly usable as a number and
"hello" not even resembling one, using a string as binary data only
makes sense if it meets the condition for that: it has only ch^Wblorfs
with ordinal values that are less than 256.
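
The condition in [**] can be checked directly. The helper name usable_as_bytes is hypothetical; it is only a sketch of the "all ordinal values below 256" test and says nothing about how Perl stores the string internally.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character has an ordinal value < 256.
sub usable_as_bytes {
    my ($string) = @_;
    return $string !~ /[^\x00-\xFF]/;
}

my $latin = "caf\x{e9}";       # all values below 256
my $wide  = "caf\x{e9}\x{100}"; # contains 256, so not usable as bytes

# Changing the internal representation does not change the answer.
my $upgraded = $latin;
utf8::upgrade($upgraded);

print usable_as_bytes($latin)    ? "ok\n" : "not ok\n";
print usable_as_bytes($upgraded) ? "ok\n" : "not ok\n";
print usable_as_bytes($wide)     ? "ok\n" : "not ok\n";
```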

But indeed it can be very useful to have a name for strings which are
intended to be used as byte strings later on. (Hm, let's call them

> Is there any argument about the above definitions?  I think they are 
> pretty universally agreed to, at least conceptually.  It seems there are 
> bugs where chr doesn't accept all legal blorfs (attempting to mix in 
> Unicode semantics), and it seems there are cases where chr and ord are 
> not inverse operations in the presence of certain "locales".  I consider 
> these bugs, does anyone disagree?

It's not a property of chr, but a property of Perl, to "not accept" (if
that's the correct phrase) certain characters:

    my $bar = "\x{ffff}";

> The following may be a bit more controversial... but I think they are 
> consistent, and would produce an easy to explain system...

I tend to believe that anything that's controversial will never be easy
to explain. (That's okay, though. Sometimes it's needed in order to fix
bigger issues.)

> So, all prior character set standards will, hereafter, be referred to as 
> "encodings", meaning that they define a subset of Unicode characters, 
> and also a way of representing those characters as bytes or byte sequences.

That's what Perl already does, although it's sometimes hard to convince
Yves that this is actually a /good/ idea. :)

> Encodings fall into several categories:

I don't agree that these categories are a useful distinction. It's very
complex, and only people working on the Encode module suite are served
by the level of detail IMO.

Don't get me wrong. Your list is interesting and educational, just not
of much use to most Perl programmers.

Instead I suggest the following two categories:

1. Single byte encodings: every character is a single byte. By
necessity, only a small subset of Unicode is supported.

2. Multibyte encodings: every character is one or more bytes.
2a. Legacy: Only a subset of Unicode is supported.
2b. Unicode: The whole Unicode set is supported.
2c. Full: A larger range than Unicode is supported.

An encoding may or may not be ASCII-compatible.
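
To make the categories a little more concrete, here is a small sketch using the Encode module. The choice of sample characters and of iso-8859-1, UTF-8 and UTF-16BE as representatives is an assumption for illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $e_acute = "\x{e9}";      # LATIN SMALL LETTER E WITH ACUTE
my $euro    = "\x{20ac}";    # EURO SIGN

# Single byte encoding: one byte per character.
my $latin1_len = length(encode('iso-8859-1', $e_acute));

# Multibyte Unicode encodings: one or more bytes per character.
my $utf8_len_e    = length(encode('UTF-8', $e_acute));
my $utf8_len_euro = length(encode('UTF-8', $euro));
my $utf16_len     = length(encode('UTF-16BE', $euro));

# ASCII compatibility: "A" stays a single byte 0x41 in UTF-8,
# but becomes two bytes in UTF-16BE.
my $utf8_ascii  = encode('UTF-8', 'A') eq 'A';
my $utf16_ascii = encode('UTF-16BE', 'A') eq 'A';

print "latin1: $latin1_len, utf8: $utf8_len_e and $utf8_len_euro, utf16: $utf16_len\n";
```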

> the only [unicode encoding] that has been put to widespread use 
> is UTF-8.

Not true. Windows uses UTF-16 internally, and you can't deny that
Windows is widespread :)

> [It should be noted that Decode has a bug: it presently accepts non-byte 
> strings, and treats them as byte strings.  It should accept either byte 
> or non-byte strings, and produce an error if any of the input blorfs are 
> unknown to the expected encoding (generally, any blorf value > 256 is 
> unknown to most byte-oriented encodings).

Agreed that decode should not accept any string that has a value > 255.
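
The behaviour agreed on here could be approximated today with a thin wrapper. This is only a sketch: strict_decode is a hypothetical name, not part of Encode, and it makes no claim about what Encode::decode itself does with such input.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: refuse input that cannot be a byte string,
# then hand the rest to Encode::decode unchanged.
sub strict_decode {
    my ($encoding, $octets) = @_;
    die "strict_decode: input contains a value > 255\n"
        if $octets =~ /[^\x00-\xFF]/;
    return Encode::decode($encoding, $octets);
}

my $text = strict_decode('UTF-8', "\xc3\xa9");    # bytes for e-acute

my $ok = eval { strict_decode('UTF-8', "\x{100}"); 1 };
print "rejected: $@" unless $ok;
```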

Note your off-by-one error in "> 256". Those are deadly! :)

Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker
  Convolution:     ICT solutions and consultancy
