Re: perl, the data, and the tf8 flag
From: Glenn Linderman
March 31, 2007 14:36
Message ID: 460ED420.1000003@NevCal.com
On approximately 3/31/2007 3:03 AM, came the following characters from
the keyboard of Juerd Waalboer:
> Tels skribis 2007-03-31 11:45 (+0000):
>> As you can see, there are four different types of data, but Perl has only
>> one bit flag to distinguish them.
> I'd say it has two types of data, and indeed that one bit.
> With the bit on, it's unicode data that internally is encoded as UTF-8.
> You're not supposed to access the UTF-8 encoded octet buffer. This
> string should never be used with octet operations like vec or unpack "C"
> or "n".
> With the bit off, it's either unicode data that internally is encoded as
> ISO-8859-1, or it is binary data. This string can safely be used for
> octet operations (but of course, that doesn't make sense if the string
> was intended as text, with the exception of some ancient 8bit things
OK, so Tels says 4 and describes 4 types of data. Juerd says two, but
describes 3 types of data. Mark says there is only one type of data
(binary). From all this discussion, I think I agree that data formats
are what you call them, so I will attempt to use yet different
terminology, that isn't derived from C, Perl, or Unicode, as much as possible.
So it seems that everyone agrees that there are two basic types of
string data: one containing only values < 256 stored in bytes and the
other containing values from 0 to 2^30 (or so, maybe it is 2^32 -- what
is the exact limit here?) stored in multi-octet sequences according to
the same rules used by Unicode to transform codepoint values to UTF-8
octet sequences. I'll call these definitions "bytes" and "multi-bytes"
in the rest of this discussion. And I'll call the UTF8-flag the bytes
vs multi-bytes flag. And each of my "bytes" has 8 bits of data that can
be used to represent numbers, and the numbers also have other
interpretations applied to them, such as characters or parts of
characters, etc., but which characters map to which numbers is outside
the scope of this discussion... only encode, decode, and I/O devices care.
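
To make the two representations concrete, here is a minimal sketch, assuming a perl of 5.8.1 or later (where the utf8:: functions are built in). The variable names are illustrative only, and peeking at the buffer with the bytes pragma is meant for debugging, not for production code.

```perl
use strict;
use warnings;

# One character, U+00E9, held in the single-byte ("bytes") form.
my $bytes = "\xE9";

# A copy forced into the UTF-8 ("multi-bytes") form. utf8::upgrade
# changes only the internal representation, not the character values.
my $multi = $bytes;
utf8::upgrade($multi);

my $same_value = ($bytes eq $multi)      ? 1 : 0;
my $flag_bytes = utf8::is_utf8($bytes)   ? 1 : 0;
my $flag_multi = utf8::is_utf8($multi)   ? 1 : 0;

# Size of the internal buffer, peeked at via the bytes pragma.
my $octets_bytes = do { use bytes; length $bytes };
my $octets_multi = do { use bytes; length $multi };

print "equal=$same_value flags=$flag_bytes/$flag_multi ",
      "chars=", length($bytes), "/", length($multi),
      " octets=$octets_bytes/$octets_multi\n";
```

Both strings compare equal and have a length of one character; only the flag and the size of the internal buffer differ.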
As a developer who hopes to be implementing Unicode character support
in a perl application soon (tuits, always tuits), I have the following
questions and comments. Corrections are welcome for any/all of this.
1) What operations can safely be used on bytes stored in a string
without causing implicit upgrades to multi-bytes?
My perception from following all this discussion is that you can do any
operation, as long as all the data involved is bytes data that has never
been upgraded, except for decode, which always assumes a bytes parameter.
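
A small check of that perception, as a sketch only: it assumes the source file has no "use utf8" (so the literals are bytes) and covers just a handful of operations, not all of them.

```perl
use strict;
use warnings;

my $raw = "\x00\x7F\x80\xFF";          # four octets, never upgraded

my @results = (
    $raw . "\xC0\xC1",                 # concatenation
    substr($raw, 1, 2),                # substring
    scalar reverse($raw),              # reversal
    join(",", $raw, "\xAA"),           # join
    sprintf("%s%s", $raw, "\xBB"),     # formatting
);

# Count how many results ended up in the multi-bytes form.
my $upgraded = grep { utf8::is_utf8($_) } @results;
print "results that were upgraded: $upgraded\n";
```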
2) What operations create multi-bytes data?
My perception is that chr() with a parameter whose numeric value is >
255, decode, read from filehandle with a decoding layer.
3) What operations create bytes data?
My perception is read from binmode filehandle, or one with no decoding
layer, chr() with numeric parameter < 256, encode.
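
The answers to 2) and 3) can be checked with a short sketch like the following. It assumes the Encode module that ships with perl 5.8 and later; the labels and variable names are made up for illustration, and filehandle layers are left out to keep it self-contained.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $wide    = chr(0x263A);                  # code point > 255
my $narrow  = chr(0xE9);                    # code point < 256
my $decoded = decode('UTF-8', "\xC3\xA9");  # two octets in, one character out
my $encoded = encode('UTF-8', $wide);       # one character in, three octets out

for my $case (
    [ 'chr(0x263A)', $wide    ],
    [ 'chr(0xE9)',   $narrow  ],
    [ 'decode',      $decoded ],
    [ 'encode',      $encoded ],
) {
    my ($label, $string) = @$case;
    printf "%-12s flag=%d length=%d\n",
        $label, (utf8::is_utf8($string) ? 1 : 0), length $string;
}
```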
4) What operations implicitly upgrade data from binary, assuming that
because of context it must be ISO-8859-1 encoded data?
My perception is _any operation_ that also includes another operand that
is UTF8 already.
Are there any that do so implicitly, without UTF8 data in the other
operand?
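
A sketch of the implicit upgrade in 4), using concatenation as the example operation. The second half shows the trap when the bytes operand was really undecoded UTF-8 rather than ISO-8859-1. Variable names are illustrative.

```perl
use strict;
use warnings;

my $bytes  = "\xE9";        # bytes form, flag off
my $text   = "\x{263A}";    # multi-bytes form, flag on
my $joined = $bytes . $text;

printf "joined:  flag=%d length=%d first=U+%04X\n",
    (utf8::is_utf8($joined) ? 1 : 0), length($joined),
    ord(substr($joined, 0, 1));

# The trap: these two octets are the UTF-8 encoding of U+00E9, but they
# were never decoded, so perl treats them as two ISO-8859-1 characters.
my $utf8_octets = "\xC3\xA9";
my $mangled     = $utf8_octets . $text;

printf "mangled: length=%d first=U+%04X\n",
    length($mangled), ord(substr($mangled, 0, 1));
```

The bytes operand itself is left untouched; the upgrade happens on the result of the concatenation.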
5) It seems that there should be documented lists of operations and core
modules that a) never upgrade b) never downgrade c) always upgrade d)
always downgrade e) may upgrade f) may downgrade and the conditions
under which it may happen. Alternatively, if the upgrade/downgrade
rules are common to most operations and core modules, the rules for
non-listed operations should be documented, with the conditions under
which they apply, and the exception operations and core modules should
be more explicitly documented. Does such documentation exist?
Juerd's forthcoming perlunitut document seems to imply that the rules
are indeed common to all operations, but this discussion seems to
indicate that there might be a few exceptions to that... I'll mention
the following in hopes of stirring up more helpful discussion...
A) pack -- it is not clear to me how this operation could produce
anything except bytes for the packed buffer parameter, regardless of
other parameters supplied.
B) unpack -- it is not clear to me how this operation could successfully
process a multi-bytes buffer parameter, except by first downgrading it,
if it contains no values > 255, since all the operations on it are
defined in terms of unpacking bytes.
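
For A) and B), a hedged sketch: the "U" template is one documented case where pack returns a multi-bytes string, and utf8::downgrade with a true second argument is one way to attempt the "downgrade first" step before unpacking. How unpack itself treats an upgraded buffer has varied between perl versions, so that part is deliberately not shown.

```perl
use strict;
use warnings;

my $packed_n = pack("N", 1);        # four octets, bytes form
my $packed_u = pack("U", 0x263A);   # one character, multi-bytes form

printf "N: flag=%d length=%d\n",
    (utf8::is_utf8($packed_n) ? 1 : 0), length $packed_n;
printf "U: flag=%d length=%d\n",
    (utf8::is_utf8($packed_u) ? 1 : 0), length $packed_u;

# An upgraded string with no value above 255 can be downgraded safely.
my $upgraded = "\xE9";
utf8::upgrade($upgraded);
my $copy   = $upgraded;
my @octets = utf8::downgrade($copy, 1) ? unpack("C*", $copy) : ();

# A string with a value above 255 cannot; with the second argument
# true, utf8::downgrade returns false instead of dying.
my $too_wide = "\x{263A}";
my $ok       = utf8::downgrade($too_wide, 1) ? 1 : 0;

print "octets=@octets downgrade_of_wide_ok=$ok\n";
```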
C) use bytes; -- clearly this impacts lots of other operations.
D) Data::Dumper -- someone made the claim that Data::Dumper simply
ignores the UTF-8 flag, and functions properly. Could someone elucidate
how that happens? If the data is bytes, it produces bytes data in its
output, one presumes (generally '-quoted strings), if the data is
multi-bytes, I'm not sure what it does or should do. I could
experiment, but I haven't yet. Since likely both Mark and Juerd already
know what Data::Dumper does and how it works or fails to work with
multi-bytes data, perhaps the explanation would be good to have on record.
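
The experiment mentioned in D) could look like the sketch below. It assumes the Data::Dumper bundled with perl and default settings apart from Terse; the exact quoting style may differ between versions, so treat the printed form as something to observe rather than rely on.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Terse = 1;            # dump the bare value, no "$VAR1 ="

my $ascii = "abc";
my $wide  = "caf\x{263A}";           # contains a character above 255

my $dumped_ascii = Dumper($ascii);
my $dumped_wide  = Dumper($wide);

# The dumps are plain ASCII text, so printing them is safe.
print $dumped_ascii, $dumped_wide;

# Round trip: evaluating the dump should give back the same characters.
my $back = eval $dumped_wide;
print "round trip ", ($back eq $wide ? "ok" : "FAILED"), "\n";
```

For the wide string the dump uses a double-quoted literal with a \x{...} escape, which is why the round trip works.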
E) decode -- ignores the bytes vs. multi-bytes flag and processes its
input as bytes, producing a multi-bytes result.
F) encode -- ignores the bytes vs. multi-bytes flag, assumes a
multi-bytes input, and produces a bytes result.
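
One way to read "ignores the flag" in E) and F) is that the result depends only on the character values of the input, not on its internal form. A sketch under that reading, using the Encode module; variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Same character value, two internal representations.
my $plain    = "\xE9";
my $upgraded = $plain;
utf8::upgrade($upgraded);

my $enc_plain    = encode('UTF-8', $plain);
my $enc_upgraded = encode('UTF-8', $upgraded);

# Same two octets, two internal representations.
my $octets          = "\xC3\xA9";
my $octets_upgraded = $octets;
utf8::upgrade($octets_upgraded);

my $dec_plain    = decode('UTF-8', $octets);
my $dec_upgraded = decode('UTF-8', $octets_upgraded);

print "encode agrees: ", ($enc_plain eq $enc_upgraded ? "yes" : "no"), "\n";
print "decode agrees: ", ($dec_plain eq $dec_upgraded ? "yes" : "no"), "\n";
```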
G) regular expressions -- lots of reference is made to regular
expressions being broken, or at least different, for multi-byte stuff.
I fail to see why regular expressions are so hard to deal with. Of
course, I haven't implemented a regular expression engine, and so some
of my naive ideas may result in horrible performance, but it seems that
multi-byte regular expression stuff already has horrible performance, so
maybe my ideas aren't any worse, just different. Or maybe they are worse.
Firstly, regular expressions deal in "characters", not bytes, or
multi-byte sequences. However, all characters are, in the end, stored
in bytes, as either one or multi-bytes. So it would seem that the
regexp would have to be compiled twice: once for bytes inputs, and once
for multi-bytes inputs. A simple choice would be made between these
based on the bytes vs multi-bytes flag on the input parameter. For the
bytes technology, all is well understood, at least by Dave and Yves, I'm
sure. For the multi-bytes case, all constant data could be reduced to
the appropriate multi-byte sequence (with an associated character count
for that particular multi-byte sequence) and comparisons proceed
easily. The problem I see is that "." (and other matching
characters) can match one or more bytes (depending on the character
code(s)). So for "." and friends, the multi-byte regexp engine would
need to examine the value of the next byte, and possibly more of them,
until a character boundary is reached. The overall logic of matching
would be similar though, but the counting of characters matched and
bytes consumed, would be different. However, it seems that the current
regexp engine is much more complex than this, worrying about character
values even for matching constant strings...
Anyway, I'm in way over my head talking about the regexp engine... can
someone describe how it is broken, from an input/output perspective, in
simple terms? I understand that the character classes have different
meanings between bytes and multi-bytes parameters -- is there anything
else that would bite me in converting to Unicode?
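
The character-class difference can be seen from the outside with a sketch like this one. It assumes no locale, no "use utf8", and none of the newer features or pattern modifiers that change these rules; variable names are illustrative.

```perl
use strict;
use warnings;

# The same character, U+00E9, in both internal forms.
my $bytes = "\xE9";
my $multi = $bytes;
utf8::upgrade($multi);

my $equal   = ($bytes eq $multi) ? 1 : 0;
my $w_bytes = ($bytes =~ /\w/)   ? 1 : 0;   # bytes form
my $w_multi = ($multi =~ /\w/)   ? 1 : 0;   # multi-bytes form

# "." consumes one character, however many octets store it.
my $dot = ("\x{263A}" =~ /\A.\z/) ? 1 : 0;

print "equal=$equal w_bytes=$w_bytes w_multi=$w_multi dot=$dot\n";
```

Two strings that compare equal give different answers to the same pattern, depending only on the flag.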
Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking