Re: on the almost impossibility to write correct XS modules

From: Glenn Linderman
Date: May 20, 2008 13:44
Subject: Re: on the almost impossibility to write correct XS modules
Message ID: 48333831.4050405@NevCal.com
On approximately 5/20/2008 6:42 AM, came the following characters from 
the keyboard of Juerd Waalboer:
> Jan Dubois skribis 2008-05-18  8:40 (-0700):
>> No, "\xff" is guaranteed to have byte semantics for backwards compatibility:
> 
> "byte semantics" is a dangerous term, partly because different people
> use it for different things. Some people use it to refer to functions
> and operators acting on the bytes in the PV's buffer regardless of the
> SvUTF8 flag's state, but those functions are generally broken and in
> need of repair (as announced in perl5100delta, this would break
> compatibility).
> 
> By default, "\xff" by itself will indeed create a string that
> *internally* is a single byte 0xff.
> 
> A Perl string is a Unicode string. Or actually, a sequence of almost
> arbitrary integer values that most operations ought to interpret as
> unicode codepoints. If it contains only characters < 256, it may be
> "encoded as latin1" (represented as 8 bit with a straight mapping)
> internally, both for efficiency and for backwards compatibility. When
> strings are sent or received with system calls, that has to occur in
> bytes. If a string only contains characters < 256, it can be used as a
> byte string. (Note: I originally believed otherwise and was wrong.)


Hi Juerd,

I'm glad to see that you have expanded your understanding of strings to 
realize that they are sequences of integer values.  That is especially 
welcome since you have the ability to express yourself clearly and are 
comfortable with the process of submitting documentation patches; the 
documentation, as you have long recognized, is somewhat inconsistent.

Prior to now, I have been somewhat concerned that you would submit 
patches removing the concept of storing arbitrary numbers in strings 
from the documentation; although that concept has limited semantic 
usefulness, basic binary file input/output cannot be achieved in any 
other way.

I'm still a bit concerned by your "almost arbitrary" modifier, mostly 
because I'm not sure what you mean by it.  I would take it to mean 
that there is some upper bound (which seems to be somewhat platform 
dependent [32-bit vs 64-bit platforms]).  Certain operators also 
restrict certain specific values, but except for Encode, I believe such 
restrictions to be bugs.  There was some discussion about this in the 
last few months which clarified that point, suggested some specific 
bug fixes, and proposed some possible extensions for Unicode validation 
features.


> Still it can be useful to write your program in a way that avoids that a
> string that will be used as a byte string, is ever upgraded to UTF8
> *internally*: upgrading and downgrading it again might be a performance
> issue. There should be no difference in semantics, regardless of the
> internal encoding of the string. It is a bug that there is.


I can agree with this.  There are a few exceptional cases where 
manipulating non-byte binary strings can be useful, and can even be 
clearer or more efficient than the alternatives, but they are in the 
minority.  Generally binary strings can be manipulated as byte strings 
very effectively, and there is no need or desire to explicitly or 
implicitly convert their internal format to use the "multibytes" or 
"internal UTF8 flag on" representation.


> I believe that this snippet:
> 
>> perluniintro.pod:
>> | Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
>> | and C<chr(...)> for arguments less than C<0x100> (decimal 256)
>> | generate an eight-bit character for backward compatibility with older
>> | Perls.  For arguments of C<0x100> or more, Unicode characters are
>> | always produced. If you want to force the production of Unicode
>> | characters regardless of the numeric value, use C<pack("U", ...)>
>> | instead of C<\x..>, C<\x{...}>, or C<chr()>.
> 
> is misleading. It suggests that Perl has two kinds of strings
> technically, which is not true. There is a single string type with two
> *internal* representations. The word *internal* is notably missing in
> the quoted part of perluniintro.
> 
> Let's change "generate an eight-bit character" to "generate a string
> that has an eight-bit encoding internally".
> 
> In any case, CHARACTERS DO NOT HAVE BITS. Bytes have 8 bits, characters
> just have a number.


Except for the historical, inherited-from-C, concept of an 8-bit char, I 
could agree with this.

I _do_ agree that it would be good to develop a set of terminology that 
can be well-defined, used throughout the documentation as it is updated, 
and which captures the essence of what you have said just above:  I'll 
rewrite that, temporarily, using "blorf":

In any case, BLORFS DO NOT HAVE BITS. Bytes have 8 bits, blorfs
just have a number.


So I postulate that the following should all be true when some pragma 
puts Perl into "all Unicode" mode, whether it is the "tri-state" pragma 
I suggested, which allows retaining existing semantics for 
compatibility, or some on/off pragma that breaks existing workarounds.

I don't see any value in a second string type that always has Unicode 
semantics... I see value in providing a path to applying Unicode 
semantics all the time to the current string type... unless explicitly, 
lexically, chosen otherwise (lexical choice can be via options or 
parameters to operations, or via lexical pragmata).  Once the semantics 
are divorced from the SvUTF8 flag, the current string type can handle 
things just fine.

I continue to use "blorf", but it needs a different name, preferably not 
"character" or "char", because those have too many semantics inherited 
from other programming languages and concepts.  And strings can contain 
non-character data.



A blorf is a number that is a component of a string.  Each possible 
character can be represented as a blorf.

A string is a linear sequence of blorfs.

Subsequences can be obtained via the substr operation.

Numbers can be converted to blorfs via the chr operation.

The first blorf of a string can be converted to a number via the ord 
operation.

chr and ord are inverse operations.
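
A tiny illustration of the above definitions (plain Perl; the variable 
names are mine and purely illustrative):

    my $string = "caf\x{e9}";            # a string of four blorfs
    my $first  = ord($string);           # 99, the number of the first blorf, 'c'
    my $last   = substr($string, 3, 1);  # a one-blorf subsequence
    print "inverse\n" if chr(ord($last)) eq $last;   # chr and ord round-trip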

Byte strings are a subset of strings that contain only blorfs in the 
range [0..255].  These are handy for binary input/output operations, 
which require understanding and manipulating the exact physical size of 
the data.

Byte strings can also be manipulated by general string operations, and 
all operations which use only byte strings produce only byte strings.

Pack and Unpack can be used to extract data from binary files into more 
easily manipulated forms.  This includes some string manipulations.
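
For example, a minimal sketch of pulling a fixed-format record out of a 
binary file as a byte string (the file name and record layout here are 
invented purely for illustration):

    open my $fh, '<:raw', 'records.dat' or die "open: $!";
    read($fh, my $record, 8) == 8 or die "short read";
    # $record is a byte string: every blorf is in the range [0..255]
    my ($id, $count) = unpack 'N N', $record;   # two 32-bit big-endian integers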


Is there any argument about the above definitions?  I think they are 
pretty universally agreed to, at least conceptually.  It seems there are 
bugs where chr doesn't accept all legal blorfs (attempting to mix in 
Unicode semantics), and it seems there are cases where chr and ord are 
not inverse operations in the presence of certain "locales".  I consider 
these bugs; does anyone disagree?


The following may be a bit more controversial... but I think they are 
consistent, and would produce an easy to explain system... they are 
close to what we have now, but assume that bugs will need to be fixed to 
achieve this goal.


All character set standards prior to Unicode have been defined in terms 
of bytes, and/or interpretations of sequences of bytes.  In Unicode 
terms, all character set standards prior to Unicode are actually 
"encodings of a subset of Unicode".  Since Unicode does have the charter 
to include all characters used in the world today, and also interesting 
historical characters, it makes sense to adopt and use Unicode 
terminology, so I do so (to the best of my ability) below.

So, all prior character set standards will, hereafter, be referred to as 
"encodings", meaning that they define a subset of Unicode characters, 
and also a way of representing those characters as bytes or byte sequences.

Encodings fall into several categories:

1) ASCII, which has 128 characters.  The ASCII subset is the first 128 
Unicode codepoints; its characters are represented in bytes by ignoring 
the high-order bit and using the numeric value of the remaining 7 bits 
of the byte as the numerically corresponding Unicode codepoint.

2) Extended ASCII.  This is a set of encodings, each of which stores a 
single character per byte.  The high-order bit is not ignored: if it is 
zero, the remaining bits are interpreted as an ASCII character; if it is 
set, the byte value specifies some other Unicode codepoint.

2A) ASCII can be considered a special case of Extended ASCII, where each 
of its 128 characters has two possible representations.

2B) There is a distinguished encoding in this set called "Latin1" which 
interprets all its byte values as the numerically corresponding Unicode 
codepoint.

3) Single byte encodings.  This is a set of encodings, each of which 
stores a single character per byte.  This type of encoding does not 
require (nor does it prevent) any numerical correspondence between byte 
values and Unicode codepoints.  There is a one-to-one mapping from most 
standardized single byte encodings to corresponding Unicode codepoints, 
which could be implemented via a look-up table.

3A) ASCII can be considered a special case of single byte encodings, 
where each of its 128 characters has two possible representations.

3B) There is a distinguished encoding in this set called "Latin1" which 
interprets all its byte values as the numerically corresponding Unicode 
codepoint.

3C) Extended ASCII is a subset of single byte encodings.

3D) There is another, historically somewhat widely used, non-ASCII 
encoding called EBCDIC, defined and promoted by IBM.


4) Shifted encodings.  This is a set of encodings, where a few 
distinguished byte values do not represent characters, but rather 
instructions on how to interpret subsequent byte values.  Typically, 
there are several look-up tables of the sort that define a single byte 
encoding, and particular distinguished byte values can shift (or select) 
among the look-up tables for interpreting future byte values.

Many shifted encodings include ASCII as a subset, at least in one of the 
look-up tables, sometimes in all.

4A) DBCS encodings.  This is a set of shifted encodings where the 
distinguished byte values only affect the next following byte, 
thereafter reverting back to the initial, or default, look-up table. 
They have the nice property that they can be traversed randomly or in 
reverse more easily than the general shifted encoding, by looking only 
at a particular byte and the prior and next bytes to determine how to 
decode the character at that position.

5) N-byte encodings.  This is a set of encodings where a fixed number of 
bytes is used to represent each data value.  All the varieties 1-4 above 
could be done using an N-byte encoding, but because of the size of the 
lookup tables involved, this isn't typically done.  N-byte encodings can 
be stored in byte strings whose length is a multiple of N, so a new data 
type isn't necessary, although it could be convenient in some architectures.


I'm unaware of any encodings that do not fit one of the above classes. 
I'd like to hear about any others.
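
As a side note on the Latin1 case (2B/3B above), the straight numeric 
correspondence is easy to see in Perl; a small sketch using the core 
Encode module:

    use Encode qw(decode);
    my $byte  = "\xE9";                       # one byte, value 0xE9
    my $blorf = decode('ISO-8859-1', $byte);  # same number as a codepoint: U+00E9
    print "straight mapping\n" if ord($blorf) == 0xE9;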


Unicode also defines a variety of encodings.  Among byte-oriented 
encodings, UTF-8, UTF-7, UTF-EBCDIC, and FSSUTF have all been defined, 
and maybe others also, but the only one that has been put to widespread 
use is UTF-8.  It contains an ASCII subset, and then variable length 
sequences of bytes in the range [128..255] can be constructed which map, 
via a numeric formula, to a Unicode codepoint.
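
A small sketch of that in Perl, using the core Encode module:

    use Encode qw(encode);
    my $bytes = encode('UTF-8', "\x{20AC}");   # EURO SIGN, codepoint U+20AC
    printf "%v02X\n", $bytes;                  # E2.82.AC: a three-byte sequence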

Unicode also defines some wider encodings: UCS-2 and UCS-4 are 
fixed-width, and UTF-16 uses two bytes per character except for 
codepoints above U+FFFF, which take a pair of two-byte surrogates. 
UTF-16 is used by Windows, so it becomes interesting.


Perl uses a superset of UTF-8 as its internal format when representing 
strings of blorfs that are outside the range of [0..255], and sometimes 
even for strings of blorfs that are inside that range.  This is hidden 
from the user, although they should be aware of the issue when they 
attempt to do input and output operations of any sort: some encoding 
operation may need to be associated with stream files, or explicit 
encoding/decoding may need to be done for binary data access.
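
For example (a sketch; the file names are invented), an encoding can be 
attached to a file handle as an I/O layer, or raw bytes can be requested 
explicitly:

    # bytes are decoded into blorfs as they are read
    open my $text, '<:encoding(UTF-8)', 'notes.txt' or die "open: $!";
    # bytes arrive untouched, as a byte string
    open my $data, '<:raw', 'image.png' or die "open: $!";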

Perl has operations to convert to and from various encodings to strings 
of blorfs representing Unicode codepoints.  Perl has string operations 
which assume Unicode semantics, such as case shifting, case insensitive 
comparisons, and regexp character classes.
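
For instance (an illustrative sketch):

    my $alpha = chr(0x3B1);                         # GREEK SMALL LETTER ALPHA
    print "upper\n"  if uc($alpha) eq chr(0x391);   # Unicode case shifting
    print "letter\n" if $alpha =~ /\p{L}/;          # Unicode regexp character class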

Blorfs may represent other values, even other character sets, but Perl 
has no operations which understand them, other than as a sequence of 
numeric values.

[The rest assumes that a discussion from some months back, regarding 
Encode bugs and features, is resolved appropriately, in light of some 
statement of what is "the one true way" to handle Unicode.]

Encode can be used to convert any string to a UTF-8 format byte string 
-- Encode always produces byte strings -- so any string may be placed in 
a binary file, when the format of that file accommodates it.  Encode has 
options that enable it to ignore or enforce various Unicode semantics, 
which may produce errors rather than converted byte strings.  Encode has 
options to convert to a large number of encodings.  Encode may be 
"lossy", if particular blorfs cannot be represented in the selected 
output encoding.

Decode reverses the Encode operations.
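
A minimal sketch of the round trip with the core Encode module (FB_CROAK 
is one of those enforcing options; see the Encode documentation for the 
full set):

    use Encode qw(encode decode);
    my $octets = encode('UTF-8', "\x{263A}", Encode::FB_CROAK);  # byte string; dies on trouble
    my $again  = decode('UTF-8', $octets, Encode::FB_CROAK);     # back to a string of blorfs
    # encoding into a smaller repertoire can be lossy:
    my $latin1 = encode('ISO-8859-1', "\x{263A}");               # default fallback substitutes '?'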

[It should be noted that Decode has a bug: it presently accepts non-byte 
strings, and treats them as byte strings.  It should accept either byte 
or non-byte strings, and produce an error if any of the input blorfs are 
unknown to the expected encoding (generally, any blorf value > 255 is 
unknown to most byte-oriented encodings).]


>> As perluniintro.pod above points out, the only reliable way to do this
>> is pack("U", $codepoint).  Or you can use named characters using
>> charnames.pm.
> 
> If the remaining bugs in Perl (see also Unicode::Semantics) are fixed,
> then there is no longer any *need* for forcing the internal encoding to
> UTF8.
> 
> This said, I think that pack("U", $codepoint) is not a very good idea.
> Without degressing into details, I would like to point out that it's
> usually better to associate the upgrade with the buggy operator, rather
> than the string itself.
> 
> So instead of:
> 
>     my $char = pack("U", $codepoint);
> 
>     ...  # perhaps lots of code here
> 
>     my $uc = uc($char);
> 
> I would suggest using:
> 
>     my $char = chr($codepoint);
> 
>     ... # perhaps lots of code here
> 
>     utf8::upgrade($char);  # work around bug
>     my $uc = uc($char);
> 
>> [perluniintro]
>> | Internally, Perl currently uses either whatever the native eight-bit
>> | character set of the platform (for example Latin-1) is
> 
> This is simply not true. Perl uses either latin1 or ebcdic for its
> internally eight-bit strings. Not Windows-1252, for example.
> 
>> | defaulting to UTF-8, to encode Unicode strings.
> 
> defaulting to UTF-8, WITH A WARNING, for strings that could not be
> downgraded, i.e. strings that contain characters > 255.
> 
> The warning is there for a reason: it says you're doing it wrong. You're
> forcing a byte-incompatible string on a byte operation (system call),
> and forgot to encode.


Yeah


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


