perl.perl5.porters | Postings from May 2008

Re: on the almost impossibility to write correct XS modules

Glenn Linderman
May 20, 2008 17:16
On approximately 5/20/2008 2:24 PM, came the following characters from 
the keyboard of Juerd Waalboer:
> Glenn Linderman skribis 2008-05-20 13:44 (-0700):
>>> If a string only contains characters < 256, it can be used as a
>>> byte string. (Note: I originally believed otherwise and was wrong.)
>> I'm glad to see that you have expanded your understanding of strings to 
>> realize that they are sequences of integer values.
> Just for the record: I've believed this for quite some time now, and I
> think my documentation patches are consistent with it. If you find any
> inconsistency there, please let me know.

Will do.  But although your patches have promoted the "keep the binary 
and text separate" philosophy, they haven't (so far) stated that it 
can't be done, as far as I've noticed.  I just fear they might head in 
that direction over time.

>> I'm still a bit concerned by your "almost arbitrary" modifier, mostly 
>> because I'm not sure what you mean by that.
> Perl does assume that the values are Unicode codepoints in some places,
> and sometimes warns if they are deemed invalid by Unicode.
> One example of many:
>     my $foo = chr 0xffff;
> Warns:
>     Unicode character 0xffff is illegal at -e line 1.
> Even though ord($foo) properly returns 65535 afterwards.
> This is not consistent with the view that a string is made of characters
> which are just integers, with no character set logic implied.
> By "almost arbitrary" I mean that while it is possible to use these
> values, Perl will complain about it.
> See also Chris Hall's insightful posts about this subject.

Thanks, yes, it was Chris' discussion that made me comment about chr 
having bugs in this regard, but I couldn't quickly find that discussion.

We are on the same page here.
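
A minimal sketch of the round trip discussed above, assuming an ASCII-platform perl with no locale in effect. Whether a warning fires for 0xffff depends on the perl version, so the sketch turns the utf8 warning category off instead of relying on it. The variable names are only illustrative.

```perl
use strict;
use warnings;
no warnings 'utf8';   # version-dependent Unicode-validity warnings are not the point here

# Push a few integers through chr() and read them back with ord().
my @values     = (0x41, 0xFF, 0x263A, 0xFFFF);
my @round_trip = map { ord(chr($_)) } @values;

printf "%#x -> %#x\n", $values[$_], $round_trip[$_] for 0 .. $#values;
```

Each value should come back unchanged, which is the "string of integers" view; the complaint in the thread is only about the warning that accompanies some of them.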

>>> In any case, CHARACTERS DO NOT HAVE BITS. Bytes have 8 bits, characters
>>> just have a number.
>> Except for the historical, inherited-from-C, concept of an 8-bit char, I 
>> could agree with this.
> It's a modern computing fact that every byte has 8 bits. It has been
> different, yes, but to my knowledge no computer system has non-8 bit
> bytes. I'm not calling that a char, by the way. I don't know why you
> think I'm using that concept.

I didn't think you were.  But I was concerned that using the term 
"character" would make other people think that you were.  It may be 
acceptable to use the term "character", although it is only by borrowing 
from C (which has its 8-bit char) that the concept of a character being 
a range-restricted integer is well-known.  So I hate to borrow half of 
the C character concept, when I want to strongly avoid the other half. 
Of course, I know "blorf" is not a good word; it is just a placeholder.
"codepoint" is nice, but it is strongly defined by Unicode, and I'd rather
not disturb that: it could come in handy for distinguishing between binary
blorfs and, um, er, character blorfs when describing things in
documentation.  And describing "binary characters" and "character
characters" would be somewhat confusing, adding to my rejection of the
term "character" :(  "value" is too generic, perhaps?

Of course Perl already has a term for the sequence of binary blorfs: 
v-string.  I guess that is deprecated though. I don't see that it had a 
term for a blorf though, I was hoping...
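
For readers who have not met v-strings, here is a small sketch of what the literal produces. It only illustrates the "sequence of integers" idea and is not a recommendation to use v-strings for binary data.

```perl
use strict;
use warnings;

# A v-string literal builds a string whose elements are the given integers.
my $v = v1.22.333;

# The same string assembled by hand with chr().
my $by_hand = join '', map { chr($_) } 1, 22, 333;

print "identical\n" if $v eq $by_hand;

# The %vd format prints each element's ordinal, joined by ".".
print sprintf("%vd", $v), "\n";
```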

>> I _do_ agree that it would be good to develop a set of terminology that 
>> can be well-defined, used throughout the documentation as it is updated, 
> Have tried that, but it turned out to be impossible to reach consensus
> over the terminology.

If you, Marc, Yves, and I (others welcome), continue to have long, 
involved, Unicode discussions, maybe that will eventually browbeat 
everyone into realizing that there is a need for consensus, so that the 
discussions can be shorter, if for no other reason. :)  Unfortunately, 
this discussion seems to be turning into a consensus-about-what-is-wrong 
festival, although we still have a few diverse proposals for the cure.

> Specifically, "character" is a good name for what the Perl documentation
> tries to communicate. If you want to store arbitrary integers in a
> string, that's supported but entirely up to you. The normal Perl way of
> doing that would be to use an array. Strings in Perl are mainly used for
> text data and binary data. That's a somewhat limited view of this very
> useful data type, but it helps to make teaching doable.

Yep, not every bit of arcana needs to be in the introductory courses, 
but the SvUTF8 bit altering semantics has made it hard to keep it out.
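
A sketch of the kind of flag-dependent behaviour meant here, using uc(). It assumes an ASCII-platform perl with no locale in effect and without the unicode_strings feature that later perls provide; with that feature enabled the two results would be expected to agree.

```perl
use strict;
use warnings;

# Assumption: 'unicode_strings' is not enabled and no locale is in use.
my $native   = "\xE9";       # one element, value 233, stored as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);    # same element, but UTF-8 internal storage (SvUTF8 on)

# As sequences of integers the two strings are equal...
my $same = ($native eq $upgraded) ? 1 : 0;

# ...yet uc() can treat them differently depending on the storage flag.
my $uc_native   = uc $native;
my $uc_upgraded = uc $upgraded;

printf "equal: %d, uc(native): %#x, uc(upgraded): %#x\n",
    $same, ord($uc_native), ord($uc_upgraded);
```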

>> I continue to use "blorf", but it needs a different name, preferably not 
>> "character" or "char", because those have too many semantics inherited 
>> from other programming languages and concepts.
> "hash" also has a rather different meaning in general computing jargon:
> an MD5 hash is not at all a key-value structure. Sometimes it is
> practical to re-use existing words. But it needs to be done consistently
> and there needs to be a huge corpus that actually uses the term in its
> new meaning. Both requirements are met for "character" in Perl's
> documentation.
> Note that in a distant past, "lists" were called "arrays" in Perl's
> documentation, even though they're very different from "arrays" like
> @foo. It is possible to change a word, but there must be a very good
> reason for it. In my opinion, inconsistency with Perl stuff is a good
> reason, but inconsistency with other languages is not.
>> chr and ord are inverse operations.
> Only for characters, er blorfs, within the supported range.

Sure, that's what I meant.

>> Byte strings are a subset of strings that contain only blorfs in the 
>> range [0..255].
> Note that in general the *operation* determines the kind of string.

The operation determines the semantics of the string.  The SvUTF8 flag 
determines the storage format.  Those two should be independent.

The above was supposed to be a storage-format-independent definition of 
byte string.
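
As a sketch of that storage-format-independent definition: is_byte_string() below is a hypothetical helper, not a Perl builtin, and it looks only at the element values, never at the SvUTF8 flag.

```perl
use strict;
use warnings;

# Hypothetical helper: true when every element of the string is in 0..255.
sub is_byte_string {
    my ($s) = @_;
    return $s !~ /[^\x00-\xFF]/ ? 1 : 0;
}

my $native   = "caf\xE9";      # all elements < 256, byte storage
my $upgraded = $native;
utf8::upgrade($upgraded);      # same elements, UTF-8 internal storage
my $wide     = "caf\x{263A}";  # contains an element > 255

for my $pair ([native => $native], [upgraded => $upgraded], [wide => $wide]) {
    my ($name, $s) = @$pair;
    printf "%-8s flag=%d byte_string=%d\n",
        $name, (utf8::is_utf8($s) ? 1 : 0), is_byte_string($s);
}
```

The upgraded copy has the flag set but still qualifies, which is the independence argued for above.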

> Operations involving system communication like print and readdir are
> used with binary strings (or explicit encoding through encode(),
> encode_utf8(), utf8::encode() or :encoding). Some operations don't care
> about how the string will be used, and just work on the charac.. blorfs,
> like length() and substr(), whereas others are specifically text
> related, and impose textness on the string: uc(), /\w/...
> In other words: if you use the string "5" as a number, it IS a number.
> If you use the string $foo as a number, it is a number.
> If you use the string $foo as binary data, it is binary data [**].
> If you use the string $foo as text data, it is text data.
> Perl handles this for you.
> [**] Of course, just like "5" being perfectly usable as a number and
> "hello" not even resembling one, using a string as binary data only
> makes sense if it meets the condition for that: it has only ch^Wblorfs
> with ordinal values that are less than 256.
> But indeed it can be very useful to have a name for strings which are
> intended to be used as byte strings later on. (Hm, let's call them
> blobs!)

Yeah, I agree with all this.  Even to calling them blobs, which would 
have avoided the confusion about what my definition of byte strings 
above meant.
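
A sketch of turning text into a blob by explicit encoding, using the Encode module that ships with Perl. The sample string is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "na\x{EF}ve \x{263A}";      # text: 7 elements, one of them above 255
my $blob = encode('UTF-8', $text);     # blob: every element is below 256

printf "text: %d elements, blob: %d elements\n", length($text), length($blob);

# Decoding the blob gives the text back.
my $again = decode('UTF-8', $blob);
print "round trip ok\n" if $again eq $text;
```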

>> Is there any argument about the above definitions?  I think they are 
>> pretty universally agreed to, at least conceptually.  It seems there are 
>> bugs where chr doesn't accept all legal blorfs (attempting to mix in 
>> Unicode semantics), and it seems there are cases where chr and ord are 
>> not inverse operations in the presence of certain "locales".  I consider 
>> these bugs, does anyone disagree?
> It's not a property of chr, but a property of Perl, to "not accept" (if
> that's the correct phrase) certain characters:
>     my $bar = "\x{ffff}";

That should be permitted, except when validating Unicode-ness per Chris' 

>> The following may be a bit more controversial... but I think they are 
>> consistent, and would produce an easy to explain system...
> I tend to believe that anything that's controversial will never be easy
> to explain. (That's okay, though. Sometimes it's needed in order to fix
> bigger issues.)
>> So, all prior character set standards will, hereafter, be referred to as 
>> "encodings", meaning that they define a subset of Unicode characters, 
>> and also a way of representing those characters as bytes or byte sequences.
> That's what Perl already does, although it's sometimes hard to convince
> Yves that this is actually a /good/ idea. :)

Well, it does, yes.  So that is a good thing :)  Makes fixing the 
documentation easier :)  I was just trying to get the folks who hadn't 
realized that onto the same page!

>> Encodings fall into several categories:
> I don't agree that these categories are a useful distinction. It's very
> complex, and only people working on the Encode module suite are served
> by the level of detail IMO.
> Don't get me wrong. Your list is interesting and educational, just not
> of much use to most Perl programmers.

See, I even agree with this... it was somewhat of a digression.  I was 
just trying to sidestep anyone who didn't think their data could be 
converted to Unicode.

> Instead I suggest the following two categories:
> 1. Single byte encodings: every character is a single byte. By
> necessity, only a small subset of Unicode is supported.
> 2. Multibyte encodings: every character is one or more bytes.
> 2a. Legacy: Only a subset of Unicode is supported.
> 2b. Unicode: The whole Unicode set is supported.
> 2c. Full: A larger range than Unicode is supported.
> An encoding may or may not be ASCII-compatible.

There is only one "ASCII-compatible" encoding: ASCII itself.  Other 
things are Extended ASCII, which is only somewhat compatible with 7-bit 
ASCII, not 8-bit ASCII.  This is a fine point, but I think you can 
accept the term "Extended ASCII" here?
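
To make the single-byte/multibyte split concrete, here is a sketch with the Encode module. The three encodings are just examples picked for illustration. It also shows which of them leave the letter "A" as the single byte 0x41, whatever name one prefers for that property.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $e_acute = "\x{E9}";    # U+00E9, available in ISO-8859-1 and in Unicode

# How many bytes does each encoding need for the same character?
my %length_of;
for my $enc (qw(iso-8859-1 UTF-8 UTF-16BE)) {
    $length_of{$enc} = length(encode($enc, $e_acute));
    printf "%-10s %d byte(s)\n", $enc, $length_of{$enc};
}

# Does "A" come out as the lone byte 0x41?
my %keeps_ascii;
for my $enc (qw(iso-8859-1 UTF-8 UTF-16BE)) {
    $keeps_ascii{$enc} = (encode($enc, "A") eq "A") ? 1 : 0;
}
```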

>> the only [unicode encoding] that has been put to widespread use 
>> is UTF-8.
> Not true. Windows uses UTF-16 internally, and you can't deny that
> Windows is widespread :)

You are right of course, and I did mention UTF-16 and Windows somewhere 
that you snipped.  My error above was not including "[byte-oriented 
Unicode encoding]" in that phrase, which is what I meant.  I think I 
said that somewhere, but not close enough, I guess.

>> [It should be noted that Decode has a bug: it presently accepts non-byte 
>> strings, and treats them as byte strings.  It should accept either byte 
>> or non-byte strings, and produce an error if any of the input blorfs are 
>> unknown to the expected encoding (generally, any blorf value > 256 is 
>> unknown to most byte-oriented encodings).
> Agreed that decode should not accept any string that has a value > 255.
> Note your off-by-one error in "> 256". Those are deadly! :)

Oops!  Thanks for the catch.  I intended 255 there.
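
A sketch of the stricter behaviour agreed on above. strict_decode() is a hypothetical wrapper written for this illustration, not part of Encode, and it makes no claim about what Encode::decode itself does with such input.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: refuse any input that has an element above 255.
sub strict_decode {
    my ($encoding, $octets) = @_;
    die "input contains a value > 255\n" if $octets =~ /[^\x00-\xFF]/;
    return Encode::decode($encoding, $octets);
}

# A real byte string decodes as usual.
my $text = strict_decode('UTF-8', "\xC3\xA9");
printf "decoded to %d element(s)\n", length($text);

# A string that cannot be a byte string is rejected.
my $accepted = eval { strict_decode('UTF-8', "\x{263A}"); 1 } ? 1 : 0;
print "rejected: $@" unless $accepted;
```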

Glenn
--
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking