
Re: perl, the data, and the utf8 flag

Glenn Linderman
March 31, 2007 14:36
Re: perl, the data, and the utf8 flag
On approximately 3/31/2007 3:03 AM, came the following characters from 
the keyboard of Juerd Waalboer:
> Tels wrote 2007-03-31 11:45 (+0000):
>> As you can see, there are four different types of data, but Perl has only 
>> one bit flag to distinguish them. 
> I'd say it has two types of data, and indeed that one bit.
> With the bit on, it's unicode data that internally is encoded as UTF-8.
> You're not supposed to access the UTF-8 encoded octet buffer. This
> string should never be used with octet operations like vec or unpack "C"
> or "n".
> With the bit off, it's either unicode data that internally is encoded as
> ISO-8859-1, or it is binary data. This string can safely be used for
> octet operations (but of course, that doesn't make sense if the string
> was intended as text, with the exception of some ancient 8bit things
> like crypt()).

OK, so Tels says 4 and describes 4 types of data.  Juerd says two, but 
describes 3 types of data.  Mark says there is only one type of data 
(binary).  From all this discussion, I think I agree that data formats 
are what you call them, so I will attempt to use yet different 
terminology, one that isn't derived from C, Perl, or Unicode, as much 
as possible.

So it seems that everyone agrees that there are two basic types of 
string data: one containing only values < 256 stored in bytes and the 
other containing values from 0 to 2^30 (or so, maybe it is 2^32 -- what 
is the exact limit here?) stored in multi-octet sequences according to 
the same rules used by Unicode to transform codepoint values to UTF-8 
octet sequences.  I'll call these definitions "bytes" and "multi-bytes" 
in the rest of this discussion.  And I'll call the UTF8-flag the bytes 
vs multi-bytes flag.  And each of my "bytes" has 8 bits of data that can 
be used to represent numbers, and the numbers also have other 
interpretations applied to them, such as characters or parts of 
characters, etc., but which characters map to which numbers is outside 
the scope of this discussion... only encode, decode, and I/O devices care.
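
For concreteness, here is how I'd inspect the flag from Perl code 
(utf8::is_utf8 is the call I know of; corrections welcome if there is 
a better way):

  my $bytes = "caf\xe9";      # all values < 256, stored as bytes
  my $multi = "caf\x{263a}";  # 0x263A > 255, stored as multi-bytes
  print utf8::is_utf8($bytes) ? "flag on\n" : "flag off\n";  # flag off
  print utf8::is_utf8($multi) ? "flag on\n" : "flag off\n";  # flag on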

As a developer who hopes to be implementing Unicode character support 
in a perl application soon (tuits, always tuits), I have the following 
questions and comments.  Corrections are welcome for any/all of this.

1) What operations can safely be used on bytes stored in a string 
without causing implicit upgrades to multi-bytes?

My perception from following all this discussion is that you can do any 
operation, as long as all the data involved is bytes data that has never 
been upgraded, except for decode, which always assumes a bytes parameter.
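
For example (my understanding, untested):

  use Encode qw(decode);

  my $buf = "\x01\x02\xff";                # bytes, never upgraded
  my $n   = unpack "n", $buf;              # octet ops are safe: 0x0102
  my $bits = "\0";
  vec($bits, 3, 1) = 1;                    # also safe on bytes data
  my $text = decode("UTF-8", "\xc3\xa9");  # decode always reads bytes,
                                           # produces multi-bytes "\x{e9}"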

2) What operations create multi-bytes data?

My perception is: chr() with a parameter whose numeric value is > 255; 
decode; reads from a filehandle with a decoding layer.
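
In code, the three cases I have in mind (file name for illustration only):

  use Encode qw(decode);

  my $a = chr(0x263A);                      # > 255, so multi-bytes
  my $b = decode("UTF-8", "\xe2\x98\xba");  # same character, multi-bytes
  open my $fh, "<:encoding(UTF-8)", "in.txt" or die $!;
  my $line = <$fh>;                         # decoded by the layer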

3) What operations create bytes data?

My perception is: reads from a binmode filehandle, or one with no 
decoding layer; chr() with a numeric parameter < 256; encode.
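
Again in code (file name for illustration only):

  use Encode qw(encode);

  my $a = chr(0xE9);                     # < 256, plain bytes
  my $b = encode("UTF-8", "caf\x{e9}");  # bytes result: "caf\xc3\xa9"
  open my $fh, "<:raw", "data.bin" or die $!;
  read $fh, my $buf, 1024;               # raw octets, flag off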

4) What operations implicitly upgrade data from binary, assuming that 
because of context it must be ISO-8859-1 encoded data?

My perception is _any operation_ that also includes another operand that 
is UTF8 already.

Are there any that do so implicitly, without UTF8 data in the other 
operand?
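
For example, my understanding of the concatenation case:

  my $bytes = "caf\xe9";         # flag off; \xe9 is just a byte
  my $multi = "\x{263a}";        # flag on
  my $joined = $bytes . $multi;  # a copy of $bytes is upgraded as if it
                                 # were ISO-8859-1, so \xe9 becomes U+00E9
                                 # in $joined; $bytes itself is unchanged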

5) It seems that there should be documented lists of operations and core 
modules that a) never upgrade b) never downgrade c) always upgrade d) 
always downgrade e) may upgrade f) may downgrade and the conditions 
under which it may happen.  Alternately, if the upgrade/downgrade rules 
are common to most operations and core modules, the common rules should 
be documented, with the conditions under which they apply, and the 
exception operations and core modules should be more explicitly 
documented.  Does such documentation exist?

Juerd's forthcoming perlunitut document seems to imply that the rules 
are indeed common to all operations, but this discussion seems to 
indicate that there might be a few exceptions to that... I'll mention 
the following in hopes of stirring up more helpful discussion...

A) pack -- it is not clear to me how this operation could produce 
anything except bytes for the packed buffer parameter, regardless of 
other parameters supplied.
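
That is, I'd expect something like this, with the "U" template being 
the one exception I know of that is documented to produce characters 
rather than bytes:

  my $packed = pack "NnC", 1, 2, 3;  # plain octets, flag off
  my $chars  = pack "U", 0x263A;     # "U" produces a character string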

B) unpack -- it is not clear to me how this operation could successfully 
process a multi-bytes buffer parameter, except by first downgrading it, 
if it contains no values > 255, since all the operations on it are 
defined in terms of unpacking bytes.
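
That is, something like:

  my $s = "caf\x{e9}";
  utf8::upgrade($s);             # force multi-bytes for the example
  utf8::downgrade($s);           # dies if any value > 255; flag off
  my @octets = unpack "C*", $s;  # now unambiguously 99, 97, 102, 233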

C) use bytes; -- clearly this impacts lots of other operations.
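
For instance, I believe length() is one of the affected operations:

  my $s = chr(0x263A);         # one character
  print length($s), "\n";      # 1
  {
      use bytes;
      print length($s), "\n";  # 3: the length of the internal
                               # multi-byte (UTF-8) representation
  }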

D) Data::Dumper -- someone made the claim that Data::Dumper simply 
ignores the UTF-8 flag, and functions properly.  Could someone elucidate 
how that happens?  If the data is bytes, it produces bytes data in its 
output, one presumes (generally '-quoted strings); if the data is 
multi-bytes, I'm not sure what it does or should do.  I could 
experiment, but I haven't yet.  Since likely both Mark and Juerd already 
know what Data::Dumper does and how it works or fails to work with 
multi-bytes data, perhaps the explanation would be good to have on record.
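
Here is the experiment I have in mind, for anyone with the tuits to 
run it:

  use Data::Dumper;
  $Data::Dumper::Useqq = 1;      # escape non-printables so we can see them

  my $bytes = "caf\xe9";         # flag off
  my $multi = "caf\x{e9}";
  utf8::upgrade($multi);         # same characters, flag on
  print Dumper($bytes, $multi);  # do the two dumps come out the same?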

E) decode -- ignores the bytes vs. multi-bytes flag and processes its 
input as bytes, producing a multi-bytes result.

F) encode -- ignores the bytes vs. multi-bytes flag, assumes a 
multi-bytes input, and produces a bytes result.
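
So a round trip looks like this, if I have it right:

  use Encode qw(encode decode);

  my $octets = "\xc3\xa9";                # UTF-8 encoding of U+00E9
  my $text   = decode("UTF-8", $octets);  # multi-bytes "\x{e9}"
  my $back   = encode("UTF-8", $text);    # bytes again, "\xc3\xa9"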

G) regular expressions -- many references have been made to regular 
expressions being broken, or at least different, for multi-byte stuff.  
I fail to see why regular expressions are so hard to deal with. Of 
course, I haven't implemented a regular expression engine, and so some 
of my naive ideas may result in horrible performance, but it seems that 
multi-byte regular expression stuff already has horrible performance, so 
maybe my ideas aren't any worse, just different.  Or maybe they are worse.

Firstly, regular expressions deal in "characters", not bytes, or 
multi-byte sequences.  However, all characters are, in the end, stored 
in bytes, as either one or multi-bytes.  So it would seem that the 
regexp would have to be compiled twice: once for bytes inputs, and once 
for multi-bytes inputs.  A simple choice would be made between these 
based on the bytes vs multi-bytes flag on the input parameter.  For the 
bytes technology, all is well understood, at least by Dave and Yves, I'm 
sure.  For the multi-bytes case, all constant data could be reduced to 
the appropriate multi-byte sequence (with an associated character count 
for that particular multi-byte sequence) and comparisons proceed 
easily.  The problem I see is that "." (and other matching 
characters) can match one or more bytes (depending on the character 
code(s)).  So for "." and friends, the multi-byte regexp engine would 
need to examine the value of the next byte, and possibly more of them, 
until a character boundary is reached.  The overall logic of matching 
would be similar, but the counting of characters matched and bytes 
consumed would be different.  However, it seems that the current 
regexp engine is much more complex than this, worrying about character 
values even for matching constant strings...

Anyway, I'm in way over my head talking about the regexp engine... can 
someone describe how it is broken, from an input/output perspective, in 
simple terms?  I understand that the character classes have different 
meanings between bytes and multi-bytes parameters -- is there anything 
else that would bite me in converting to Unicode?
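
For instance, here is the character class difference as I understand 
it (untested; this is exactly the sort of thing I expect to bite):

  my $s = "caf\xe9";  # bytes; \xe9 is not a word character here
  print $s =~ /\w$/ ? "match\n" : "no match\n";  # no match, I believe
  utf8::upgrade($s);  # same characters, now multi-bytes
  print $s =~ /\w$/ ? "match\n" : "no match\n";  # match: \w now treats
                                                 # \xe9 as a letter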

Glenn --
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
