develooper Front page | perl.perl5.porters | Postings from May 2008

Re: on the almost impossibility to write correct XS modules

Glenn Linderman
May 20, 2008 13:44
On approximately 5/20/2008 6:42 AM, came the following characters from 
the keyboard of Juerd Waalboer:
> Jan Dubois skribis 2008-05-18  8:40 (-0700):
>> No, "\xff" is guaranteed to have byte semantics for backwards compatibility:
> "byte semantics" is a dangerous term, partly because different people
> use it for different things. Some people use it to refer to functions
> and operators acting on the bytes in the PV's buffer regardless of the
> SvUTF8 flag's state, but those functions are generally broken and in
> need of repair (as announced in perl5100delta, this would break
> compatibility).
> By default, "\xff" by itself will indeed create a string that
> *internally* is a single byte 0xff.
> A Perl string is a Unicode string. Or actually, a sequence of almost
> arbitrary integer values that most operations ought to interpret as
> unicode codepoints. If it contains only characters < 256, it may be
> "encoded as latin1" (represented as 8 bit with a straight mapping)
> internally, both for efficiency and for backwards compatibility. When
> strings are sent or received with system calls, that has to occur in
> bytes. If a string only contains characters < 256, it can be used as a
> byte string. (Note: I originally believed otherwise and was wrong.)

Hi Juerd,

I'm glad to see that you have expanded your understanding of strings to 
realize that they are sequences of integer values.  That is especially 
welcome since you have the ability to express yourself clearly, and are 
comfortable with the process of submitting documentation patches; the 
documentation, as you have long recognized, is somewhat inconsistent.

Prior to now, I have been somewhat concerned that you would submit 
patches removing the concept of storing arbitrary numbers in strings 
from the documentation; although that concept has limited semantic 
usefulness, basic binary file input/output cannot be achieved in any 
other way.

I'm still a bit concerned by your "almost arbitrary" modifier, mostly 
because I'm not sure what you mean by that.  I would take it to mean 
that there is some upper bound (which seems to be somewhat platform 
dependent [32-bit vs 64-bit platforms]).  Certain operators also 
restrict certain specific values, but except for Encode, I believe such 
restrictions to be bugs.  There was some discussion about this in the 
last few months, which clarified the situation, suggested some specific 
bug fixes, and proposed some possible extensions for Unicode validation 
features.

> Still it can be useful to write your program in a way that avoids that a
> string that will be used as a byte string, is ever upgraded to UTF8
> *internally*: upgrading and downgrading it again might be a performance
> issue. There should be no difference in semantics, regardless of the
> internal encoding of the string. It is a bug that there is.

I can agree with this.  There are a few exceptional cases where 
manipulating non-byte binary strings can be useful, and can even be 
clearer or more efficient than the alternatives, but they are in the 
minority.  Generally binary strings can be manipulated as byte strings 
very effectively, and there is no need or desire to explicitly or 
implicitly convert their internal format to use the "multibytes" or 
"internal UTF8 flag on" representation.

> I believe that this snippet:
>> perluniintro.pod:
>> | Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
>> | and C<chr(...)> for arguments less than C<0x100> (decimal 256)
>> | generate an eight-bit character for backward compatibility with older
>> | Perls.  For arguments of C<0x100> or more, Unicode characters are
>> | always produced. If you want to force the production of Unicode
>> | characters regardless of the numeric value, use C<pack("U", ...)>
>> | instead of C<\x..>, C<\x{...}>, or C<chr()>.
> is misleading. It suggests that Perl has two kinds of strings
> technically, which is not true. There is a single string type with two
> *internal* representations. The word *internal* is notably missing in
> the quoted part of perluniintro.
> Let's change "generate an eight-bit character" to "generate a string
> that has an eight-bit encoding internally".
> In any case, CHARACTERS DO NOT HAVE BITS. Bytes have 8 bits, characters
> just have a number.

Except for the historical, inherited-from-C, concept of an 8-bit char, I 
could agree with this.

I _do_ agree that it would be good to develop a set of terminology that 
can be well-defined, used throughout the documentation as it is updated, 
and which captures the essence of what you have said just above:  I'll 
rewrite that, temporarily, using "blorf":

In any case, BLORFS DO NOT HAVE BITS. Bytes have 8 bits, blorfs
just have a number.

So I postulate that the following should all be true, when some pragma 
says to put Perl into "all Unicode" mode, whether it is the "tri-state" 
pragma I suggested that allows retaining existing semantics for 
compatibility, or whether it is some on/off pragma that breaks existing 
code.

I don't see any value in a second string type that always has Unicode 
semantics... I see value in providing a path to applying Unicode 
semantics all the time to the current string type... unless explicitly, 
lexically, chosen otherwise (lexical choice can be via options or 
parameters to operations, or via lexical pragmata).  Once the semantics 
are divorced from the SvUTF8 flag, the current string type can handle 
things just fine.

I continue to use "blorf", but it needs a different name, preferably not 
"character" or "char", because those have too many semantics inherited 
from other programming languages and concepts.  And strings can contain 
non-character data.

A blorf is a number that is a component of a string.  Each possible 
character can be represented as a blorf.

A string is a linear sequence of blorfs.

Subsequences can be obtained via the substr operation.

Numbers can be converted to blorfs via the chr operation.

The first blorf of a string can be converted to a number via the ord 
operation.

chr and ord are inverse operations.
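To illustrate the definitions above, using only the standard chr, ord, 
and substr operations:

```perl
use strict;
use warnings;

my $n     = 0x263A;        # any number, including values > 255
my $blorf = chr($n);       # number -> one-blorf string
my $back  = ord($blorf);   # first blorf of a string -> number

print $back == $n ? "inverse ok\n" : "bug\n";
print ord(substr("hello", 1, 1)), "\n";   # 101: the 'e'
```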

Byte strings are a subset of strings that contain only blorfs in the 
range [0..255].  These are handy for binary input/output operations, 
which require understanding and manipulating the exact physical size of 
the data.

Byte strings can also be manipulated by general string operations, and 
all operations which use only byte strings produce only byte strings.
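For example, a byte-string round trip through a binary file (the file 
name 'demo.bin' is just for illustration):

```perl
use strict;
use warnings;

# Blorfs in [0..255], including ones that line-ending layers would mangle.
my $bytes = join '', map { chr($_) } 0, 10, 13, 200, 255;

open my $out, '>', 'demo.bin' or die "open: $!";
binmode $out;                     # raw bytes out, no translation
print {$out} $bytes;
close $out;

open my $in, '<', 'demo.bin' or die "open: $!";
binmode $in;                      # raw bytes in
my $read = do { local $/; <$in> };
close $in;
unlink 'demo.bin';

print $read eq $bytes ? "round trip ok\n" : "corrupted\n";
```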

Pack and Unpack can be used to extract data from binary files into more 
easily manipulated forms.  This includes some string manipulations.
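A minimal pack/unpack sketch, assuming a made-up record layout of a 
16-bit big-endian length followed by that many bytes:

```perl
use strict;
use warnings;

my $record = pack('n a*', 5, 'hello');      # n = big-endian 16-bit
my ($len, $payload) = unpack('n a*', $record);

print "$len $payload\n";                    # 5 hello
print length($record), "\n";                # 7: 2 length bytes + 5 data bytes
```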

Is there any argument about the above definitions?  I think they are 
pretty universally agreed to, at least conceptually.  It seems there are 
bugs where chr doesn't accept all legal blorfs (attempting to mix in 
Unicode semantics), and it seems there are cases where chr and ord are 
not inverse operations in the presence of certain "locales".  I consider 
these bugs, does anyone disagree?

The following may be a bit more controversial... but I think they are 
consistent, and would produce an easy to explain system... they are 
close to what we have now, but assume that bugs will need to be fixed to 
achieve this goal.

All character set standards prior to Unicode have been defined in terms 
of bytes, and/or interpretations of sequences of bytes.  In Unicode 
terms, all character set standards prior to Unicode are actually 
"encodings of a subset of Unicode".  Since Unicode does have the charter 
to include all characters used in the world today, and also interesting 
historical characters, it makes sense to adopt and use Unicode 
terminology, so I do so (to the best of my ability) below.

So, all prior character set standards will, hereafter, be referred to as 
"encodings", meaning that they define a subset of Unicode characters, 
and also a way of representing those characters as bytes or byte sequences.

Encodings fall into several categories:

1) ASCII, which has 128 characters.  The ASCII subset is the first 128 
Unicode codepoints; its characters are represented in bytes by ignoring 
the high-order bit and using the numeric value of the remaining 7 bits 
of the byte as the numerically corresponding Unicode codepoint.

2) Extended ASCII.  This is a set of encodings, each of which stores a 
single character per byte.  The high-order bit is not ignored: if it is 
zero, the remaining bits are interpreted as ASCII characters; if it is 
set, the byte value specifies some other Unicode codepoint.

2A) ASCII can be considered a special case of Extended ASCII, where each 
of its 128 characters has two possible representations.

2B) There is a distinguished encoding in this set called "Latin1" which 
interprets all its byte values as the numerically corresponding Unicode 
codepoints.

3) Single byte encodings.  This is a set of encodings, each of which 
stores a single character per byte.  This type of encoding does not 
require (nor does it prevent) any numerical correspondence between byte 
values and Unicode codepoints.  There is a one-to-one mapping from most 
standardized single byte encodings to corresponding Unicode codepoints, 
which could be implemented via a look-up table.

3A) ASCII can be considered a special case of single byte encodings, 
where each of its 128 characters has two possible representations.

3B) There is a distinguished encoding in this set called "Latin1" which 
interprets all its byte values as the numerically corresponding Unicode 
codepoints.

3C) Extended ASCII is a subset of single byte encodings.

3D) There is another somewhat widely used (historically), non-ASCII 
encoding, called EBCDIC, defined and promoted by IBM.
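The look-up table nature of single byte encodings is easy to demonstrate 
with core Encode: the same byte decodes to different Unicode codepoints 
under different tables.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\xA4";
printf "U+%04X\n", ord(decode('iso-8859-1',  $byte));   # U+00A4 CURRENCY SIGN
printf "U+%04X\n", ord(decode('iso-8859-15', $byte));   # U+20AC EURO SIGN
```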

4) Shifted encodings.  This is a set of encodings, where a few 
distinguished byte values do not represent characters, but rather 
instructions on how to interpret subsequent byte values.  Typically, 
there are several look-up tables of the sort that define a single byte 
encoding, and particular distinguished byte values can shift (or select) 
among the look-up tables for interpreting future byte values.

Many shifted encodings include ASCII as a subset, at least in one of the 
look-up tables, sometimes in all.

4A) DBCS encodings.  This is a set of shifted encodings where the 
distinguished byte values only affect the next following byte, 
thereafter reverting back to the initial, or default, look-up table. 
They have the nice property that they can be traversed randomly or in 
reverse more easily than the general shifted encoding, by only looking 
at a particular byte, and the prior and next bytes, to determine how to 
decode the character at that position.

5) N-byte encodings.  This is a set of encodings where a fixed number of 
bytes is used to represent each data value.  All the varieties 1-4 above 
could be done using an N-byte encoding, but because of the size of the 
lookup tables involved, this isn't typically done.  N-byte encodings can 
be stored in byte strings whose length is a multiple of N, so a new data 
type isn't necessary, although it could be convenient in some architectures.

I'm unaware of any encodings that do not fit one of the above classes. 
I'd like to hear about any others.

Unicode also defines a variety of encodings.  Among byte-oriented 
encodings, UTF-8, UTF-7, UTF-EBCDIC, and FSS-UTF have all been defined, 
maybe others also, but the only one that has been put to widespread use 
is UTF-8.  It contains an ASCII subset, and then variable-length 
sequences of bytes in the range [128..255] can be constructed which map, 
via a numeric formula, to a Unicode codepoint.
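A one-line demonstration with core Encode: one blorf, two bytes.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# U+00E9 is a single blorf, but two bytes in the UTF-8 encoding.
my $bytes = encode_utf8(chr(0xE9));
printf "%d bytes: %vX\n", length($bytes), $bytes;   # 2 bytes: C3.A9
```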

Unicode also defines some N-byte encodings, UTF-16, UCS-2, and UCS-4. 
UTF-16 is used by Windows, so becomes interesting.

Perl uses a superset of UTF-8 as its internal format when representing 
strings of blorfs that are outside the range of [0..255], and sometimes 
even for strings of blorfs that are inside that range.  This is hidden 
from the user, although they should be aware of the issue when they 
attempt to do input and output operations of any sort: some encoding 
operation may need to be associated with stream files, or explicit 
encoding/decoding may need to be done for binary data access.

Perl has operations to convert to and from various encodings to strings 
of blorfs representing Unicode codepoints.  Perl has string operations 
which assume Unicode semantics, such as case shifting, case insensitive 
comparisons, and regexp character classes.
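For instance (utf8::upgrade is used here only to sidestep the 
semantics-follow-the-internal-flag bugs already discussed in this 
thread):

```perl
use strict;
use warnings;

my $s = chr(0xE9);     # LATIN SMALL LETTER E WITH ACUTE
utf8::upgrade($s);     # work around the SvUTF8-dependent semantics bug

print uc($s) eq chr(0xC9) ? "case shift ok\n" : "broken\n";
print $s =~ /\w/ ? "matches \\w\n" : "does not match \\w\n";
```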

Blorfs may represent other values, even other character sets, but Perl 
has no operations which understand them, other than as a sequence of 
numeric values.

[The rest assumes that a discussion from some months back, regarding 
Encode bugs and features, is resolved appropriately, in light of some 
statement of what is "the one true way" to handle Unicode.]

Encode can be used to convert any string to a UTF-8 format byte string 
-- Encode always produces byte strings -- so any string may be placed in 
a binary file, when the format of that file accommodates it.  Encode has 
options that enable it to ignore or enforce various Unicode semantics, 
which may produce errors rather than converted byte strings. 
Encode has options to convert to a large number of encodings.  Encode 
may be "lossy", if particular blorfs cannot be represented in the 
selected output encoding.
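Sketched with core Encode (FB_CROAK is the "enforce" option; the 
default check is the lossy one):

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

# Encode always produces byte strings; UTF-8 can represent every blorf.
my $utf8 = encode('UTF-8', "caf\xe9" . chr(0x263A));

# Latin1 cannot represent U+263A: lossy by default (substitution char)...
my $lossy = encode('iso-8859-1', "caf\xe9" . chr(0x263A));

# ...or an error, if the caller asks for strict checking.
my $smiley = chr(0x263A);
my $strict = eval { encode('iso-8859-1', $smiley, FB_CROAK) };
print defined $strict ? "encoded\n" : "croaked as requested\n";
```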

Decode reverses the Encode operations.

[It should be noted that Decode has a bug: it presently accepts non-byte 
strings, and treats them as byte strings.  It should accept either byte 
or non-byte strings, and produce an error if any of the input blorfs are 
unknown to the expected encoding (generally, any blorf value > 255 is 
unknown to most byte-oriented encodings).]

>> As perluniintro.pod above points out, the only reliable way to do this
>> is pack("U", $codepoint).  Or you can use named characters using
> If the remaining bugs in Perl (see also Unicode::Semantics) are fixed,
> then there is no longer any *need* for forcing the internal encoding to
> UTF8.
> This said, I think that pack("U", $codepoint) is not a very good idea.
> Without degressing into details, I would like to point out that it's
> usually better to associate the upgrade with the buggy operator, rather
> than the string itself.
> So instead of:
>     my $char = pack("U", $codepoint);
>     ...  # perhaps lots of code here
>     my $uc = uc($char);
> I would suggest using:
>     my $char = chr($codepoint);
>     ... # perhaps lots of code here
>     utf8::upgrade($char);  # work around bug
>     my $uc = uc($char);
>> [perluniintro]
>> | Internally, Perl currently uses either whatever the native eight-bit
>> | character set of the platform (for example Latin-1) is
> This is simply not true. Perl uses either latin1 or ebcdic for its
> internally eight-bit strings. Not Windows-1252, for example.
>> | defaulting to UTF-8, to encode Unicode strings.
> defaulting to UTF-8, WITH A WARNING, for strings that could not be
> downgraded, i.e. strings that contain characters > 255.
> The warning is there for a reason: it says you're doing it wrong. You're
> forcing a byte-incompatible string on a byte operation (system call),
> and forgot to encode.


Glenn --
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking