develooper Front page | perl.perl5.porters | Postings from March 2007

Re: perl, the data, and the utf8 flag

Glenn Linderman
March 31, 2007 18:45
Re: perl, the data, and the utf8 flag
On approximately 3/31/2007 3:34 PM, came the following characters from 
the keyboard of Juerd Waalboer:
> Glenn Linderman skribis 2007-03-31 14:35 (-0700):
>> Juerd says two, but describes 3 types of data.
> Well observed! :)

:) Thought that would get your attention, if nothing else did!

You sidestepped comment on what the range is for data stored in 
multi-bytes format, though... is it [0:2^30-1] or [0:2^32-1], or what?  
Anyone?  Unicode implements [0:0x10ffff], but what does Perl support?

> I invite you to read perlunitut and perlunifaq. You'll have to look for
> them (e.g. Google) or get them from bleadperl.

I've read perlunifaq several times trying to figure things out.  I read 
perlunitut yesterday, when it came up in this discussion.

I found perlunifaq quite opaque the first several times I read it.  
perlunitut seems easier to follow, but didn't answer all my questions.

>> 1) What operations can safely be used on bytes stored in a string 
>> without causing implicit upgrades to multi-bytes?
> All operations are safe, except:
> 1. operations that add characters greater than 255
> 2. joining text strings with byte strings (because
> the text string may already internally be prepared for handling
> characters greater than 255, and forces the byte string to be prepared
> in a similar way, breaking its binaryness) and byte operations (because
> they cannot handle characters greater than 255).
> If you read carefully, you'll notice that "1" is just a different way of
> having "2", and come to the conclusion that there's only one simple, yet
> important, guideline: keep binary and text separate, and only bridge
> between the two by means of decoding and encoding.

Again, you say two, but describe 3.... :)  Maybe that is a habit of yours?

1. operations that add characters greater than 255
2. joining text strings with byte strings
3. byte operations
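Case 2 is easy to watch happen. Here's a small sketch (my own example) that checks the internal UTF8 flag with utf8::is_utf8 before and after a concatenation; note that the flag describes the representation, not whether the data is "really" text:

```perl
use strict;
use warnings;

my $bytes = "\xDF\xFF";            # byte string: flag off
my $text  = "snowman: \x{2603}";   # contains U+2603, so internally UTF-8: flag on

print utf8::is_utf8($bytes) ? "on\n" : "off\n";   # off

# Case 2: joining a text string with a byte string.
# The result must be able to hold U+2603, so it is internally UTF-8.
my $joined = $text . $bytes;
print utf8::is_utf8($joined) ? "on\n" : "off\n";  # on
```

The *value* contributed by $bytes is unchanged (still the characters 0xDF, 0xFF); only the internal representation of the combined string differs.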

>> My perception from following all this discussion is that you can do any 
>> operation, as long as all the data involved is bytes data that has never 
>> been upgraded, except for decode, which always assumes a bytes parameter.
> Your perception is correct.

Thanks for the confirmation.

>> 2) What operations create multi-bytes data?
> 1. operations that add characters greater than 255
> 2. joining text strings with byte strings or byte operations, if the
> text string is internally prepared for handling characters greater than
> 255 ("is internally encoded as UTF8", "has the UTF8 flag set").
> This is, in essence, the same list as before :)

In essence.

>> 3) What operations create bytes data?
> In general, everything that creates a new string value with only
> characters in the 0..255 range, with the possible exception of
> operations that are designed to create text strings only (like decode).
> Some examples:
>     "\xdf\xff\xa0\x00\xa1"            # binary (but can be used as text)
>     "\x{df}\x{ff}\x{a0}\x{00}\x{a1}"  # binary (but can be used as text)
>     "\x{100}..."                      # text (should not be used as bin)
>     chr 255                           # binary (but can be used as text)
>     chr 256                           # text (should not be used as bin)
>     readline $fh                      # either, depends on io layers
> Note that "use encoding" will turn almost everything into a text-only
> creating thing, and makes using binary data very hard. My advice is to
> avoid the module altogether.
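The chr 255 / chr 256 boundary in that list can be verified directly (a sketch of mine, again using utf8::is_utf8 to inspect the flag):

```perl
use strict;
use warnings;

my $byte_chr = chr 255;   # binary (but can be used as text)
my $text_chr = chr 256;   # text (should not be used as bin)

printf "chr 255 flag: %s\n", utf8::is_utf8($byte_chr) ? "on" : "off";  # off
printf "chr 256 flag: %s\n", utf8::is_utf8($text_chr) ? "on" : "off";  # on
```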
>> 4) What operations implicitly upgrade data from binary, assuming that 
>> because of context it must be ISO-8859-1 encoded data?
> In the core, only concatenating with an internally-UTF8 string. All
> other operations that require UTF8 only upgrade temporarily.
> There are modules that carelessly upgrade strings; this causes no
> problem if your string is a text string and you decode/encode properly
> and keep the string properly separated from byte strings. But it might
> otherwise.
>> My perception is _any operation_ that also includes another operand that 
>> is UTF8 already.
> When the "other operand" becomes part of the target string at some
> point, yes.

OK, good clarification, thanks.

>> 5) It seems that there should be documented lists of operations and core 
>> modules that a) never upgrade b) never downgrade c) always upgrade d) 
>> always downgrade e) may upgrade f) may downgrade and the conditions 
>> under which it may happen.
> Instead of compiling lists, isn't adding this information to the
> existing documentation a better idea?
> Lists like this would suggest that you'd have to learn them by heart in
> order to write safe code, while in fact it's only needed to keep text
> from byte values and byte operations. The latter is a lot easier to
> learn and universally applicable.

Right.  As long as the rules apply everywhere, the only things we need 
to learn are the exceptions.  I think now we just need to discuss the 
exceptions themselves.

>> Juerd's forthcoming perlunitut document seems to imply that the rules 
>> are indeed common to all operations, but this discussion seems to 
>> indicate that there might be a few exceptions to that... 
> It does seem to indicate so, but I've yet to see proof of these other
> supposed implicit upgrades...

OK, so the working assumption is that the rules apply everywhere, except 
for the exceptions below.  Otherwise, it is a bug.

>> A) pack -- it is not clear to me how this operation could produce 
>> anything except bytes for the packed buffer parameter, regardless of 
>> other parameters supplied.
> pack "U" works like chr. I'd strongly advise against using U with other
> letters, because U makes text strings, and the other letters are byte
> operations (so using them together would break the text/byte rule).
> pack "U*", LIST is useful because it is more convenient than writing
> join "", map chr, LIST.

So it would be good to have this advice noted in the documentation for 
pack, eh?  I haven't read the 5.10 pack documentation, only 5.8.8 and 
before, and the Perl 5 Pocket Reference Manual, but no such advice is 
given there.  Actually, the documentation conflicts with what you say 
above (regarding U making text strings).  It says:

 From perlfunc:
> If the pattern begins with a |U|, the resulting string will be treated 
> as UTF-8-encoded Unicode. You can force UTF-8 encoding on in a string 
> with an initial |U0|, and the bytes that follow will be interpreted as 
> Unicode characters. If you don't want this to happen, you can begin 
> your pattern with |C0| (or anything else) to force Perl not to UTF-8 
> encode your string, and then follow this with a |U*| somewhere in your 
> pattern.

Immediately afterwards it contradicts itself by saying what I thought:

> You must yourself do any alignment or padding by inserting for example 
> enough |'x'|es while packing. There is no way to |pack()| and 
> |unpack()| could know where the bytes are going to or coming from. 
> Therefore |pack| (and |unpack|) handle their output and input as flat 
> sequences of bytes.

So it is quite certain that this section of documentation needs to be 
improved, before I can understand what actually will happen.  In the 
presence of such contradictions, your advice to avoid using both U and 
other template characters looks reasonable, even if your justification 
contradicts some of the self-contradictory documentation.

The way I read this is that any pack template that starts with U 
produces multi-bytes, and any pack template that does not start with U 
produces bytes.  So then if unpack is given a "multi-bytes producing 
template" (starts with U) it should expect a multi-bytes value, and if 
handed bytes, should upgrade it.  And if unpack is given a "bytes 
producing template" (starts with anything but U) it should expect a 
bytes value, and if handed multi-bytes, should downgrade it.  Or why not?

I think that would "cure" Mark's oft-repeated "unpack is broken" claim...
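On the perls I've tried, the "template starts with U" half of this reading is checkable directly; a sketch:

```perl
use strict;
use warnings;

my $text_buf = pack "U",  0x100;    # template starts with U: result is "text"
my $byte_buf = pack "C*", 65, 255;  # byte-oriented template: result is bytes

print utf8::is_utf8($text_buf) ? "on\n" : "off\n";   # on
print utf8::is_utf8($byte_buf) ? "on\n" : "off\n";   # off
```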

>> B) unpack -- it is not clear to me how this operation could successfully 
>> process a multi-bytes buffer parameter, except by first downgrading it, 
>> if it contains no values > 255, since all the operations on it are 
>> defined in terms of unpacking bytes.
> Indeed. Even downgrading is questionable, because its operand should
> never have been upgraded in the first place.

I could imagine something like the following being invented as a 
communication protocol...

my $x = $text_string . "\0" . pack( "template", @params );

# send somewhere, that does the following

# limit the split to 2 fields, since the packed data may itself contain "\0"
my ( $retrieve_text_string, $unpack_me ) = split /\0/, $x, 2;

my @retrieve_params = unpack( "template", $unpack_me );

It seems that would violate your recommendation of keeping things 
separate, but one needs to avoid two separate communications, for 
efficiency, eh?  So do you have a recommended practice for this sort of 
thing?

It seems that fixing unpack to downgrade multi-byte strings if the 
template doesn't start with U and to upgrade byte strings if the 
template starts with U would cure this problem, if the parameter is 
upgraded for some reason.
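A runnable sketch of that protocol which keeps the text/bytes rule intact: encode the text part to octets before joining, so the whole wire string is bytes end to end. The "Nn" template and field values here are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text   = "caf\x{E9}";
my @params = (42, 7);

# Sender: encode the text to UTF-8 octets first, so everything joined is bytes.
my $wire = encode("UTF-8", $text) . "\0" . pack("Nn", @params);

# Receiver: limit the split to 2, because pack("Nn", 42, 7) contains "\0" bytes.
my ($text_octets, $packed) = split /\0/, $wire, 2;
my $text_back   = decode("UTF-8", $text_octets);
my @params_back = unpack "Nn", $packed;

print "params: @params_back\n";   # params: 42 7
```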

> Binary strings don't have any encoding, as far as Perl is concerned.
> When it gets a string of which it is certain that it does have an
> encoding, it can't possibly be binary.

Binary has two encodings.  That's how I started out this discussion (but 
that has been snipped by now).  Multiple sequential binary values in the 
range [0:255] can be packed into byte strings (pack "C*"), and multiple 
sequential binary values in the range [0:2^30] can be packed into 
multi-byte strings (pack "U*").  That's two different encodings for 
binary sequences (strings).  And that seems to really be all that Perl 
knows about, with three exceptions: 

1) encode, and the encoding layers, apply a variety of character set 
semantics to transform characters from one type of binary encoding and 
values to another (either bytes to bytes, bytes to multi-bytes, 
multi-bytes to bytes, and even (if Shift JIS or Hangul is supported) 
other stuff stored in bytes format to/from multi-bytes).

2) the regexp engine has the concept of character classes, which are a 
collection of sets of specific binary values that are assumed to have a 
particular character set semantic.

3) the regexp engine has the concept of "case insensitivity" which is a 
large collection of rules about which binary values should be treated 
identically, based on the assumption of the rules of a particular 
character set.

If we stick with binary values in the range [0:255] in byte strings, 
then it could be considered inappropriate and wasteful of resources to 
"upgrade" them, or "downgrade" them, but it works.  Except, um, for 
unpack, sometimes?
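"It works" can be stated precisely: upgrading and downgrading change only the internal representation, never the value, as long as every character is in [0:255]. A sketch:

```perl
use strict;
use warnings;

my $s    = "\x61\xFF";     # two characters: 0x61, 0xFF
my $copy = $s;

utf8::upgrade($s);         # now stored as UTF-8 internally (wasteful, but harmless)
print $s eq $copy ? "same value\n" : "changed\n";   # same value

utf8::downgrade($s);       # back to the one-byte-per-character form
print $s eq $copy ? "same value\n" : "changed\n";   # same value
print length($s), "\n";                             # 2, throughout
```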

> non-U unpacking might as well return undef or random values, when it
> gets a string that has the UTF8 flag set :)

Not sure how it would return undef???  The values are hardly random, 
either, they come out of the buffer it is handed, right?  But if a 
multi-bytes string is passed to a bytes-expecting template, then clearly 
it won't produce the results you expect...

So it needs to be downgraded first... because it expects a bytes string, 
whether handed bytes or multi-bytes.  Unless U is involved, and then it 
expects a multi-bytes string?  Or at least if U is the first template 
character... or something.  This pack/unpack stuff doesn't seem to be 
well documented.  It seems to matter whether U is first or not, and then it 
seems to matter if an implicit upgrade happened or not, and it 
apparently isn't willing to do an implicit downgrade of a multi-bytes 
buffer even if the first template character is not U, even though pack's 
behaviour suggests that such would be appropriate.

> Again, here's the special case for U, that works like "ord", but again
> supports lists in a nicer way. And with unpack too, I think it's wrong
> to mix U with other letters.

Yes, it does seem that avoiding the mixing avoids the problems, but it 
also seems that unpack could behave better.
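Until the documentation settles what unpack should do, a defensive caller can force the buffer back to byte form first. Conveniently, utf8::downgrade dies if the string contains a character above 255, which is exactly the error you'd want in that case. A sketch:

```perl
use strict;
use warnings;

my $buf = pack "C*", 1, 2, 255;
utf8::upgrade($buf);               # simulate an accidental implicit upgrade

utf8::downgrade($buf);             # restore byte form; dies if any char > 255
my @vals = unpack "C*", $buf;
print "@vals\n";                   # 1 2 255
```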

>> C) use bytes; -- clearly this impacts lots of other operations.
> I advise against "use bytes", and "use encodings".

OK.  What do you recommend when needing to store a UTF-8 string into a 
struct?  I have a program that uses a string variable to "simulate" the 
memory of another computer... so it truly wants to stay bytes.  But I 
want to store UTF-8 encoded strings into it.  Seems like "use bytes;" is 
a perfect match for the operations that work on the simulated memory.  
Maybe this would be a place where you would agree to make an exception 
to your above advice?  Or do you have another recommended technique?
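One alternative to "use bytes" for the simulated-memory case: keep the memory a byte string forever, and bridge explicitly with encode/decode at the boundary. The names, offsets, and the UTF-8 choice below are mine, for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $memory = "\0" x 64;                       # simulated memory: always bytes

# Store: encode first, then splice the octets in with substr.
my $octets = encode("UTF-8", "caf\x{E9}");    # 5 octets
substr($memory, 10, length $octets) = $octets;

# Load: pull the octets back out, then decode.
my $string = decode("UTF-8", substr($memory, 10, length $octets));
print $string eq "caf\x{E9}" ? "round-tripped\n" : "corrupted\n";
```

Because encode always returns a byte string, $memory never acquires the UTF8 flag, and the byte-oriented operations on it stay safe without any pragma.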

>> D) Data::Dumper -- someone made the claim that Data::Dumper simply 
>> ignores the UTF-8 flag, and functions properly.  Could someone elucidate 
>> how that happens?  
> It's because the text/byte distinction exists in a programmer's mind,
> not in the string value. There is a latin1/utf8 distinction in the
> string value, internally, but the representation of the codepoints
> doesn't change the value of the codepoints, so effectively, even with a
> different internal encoding, you maintain the same string value.
> If you use $Data::Dumper::Useqq, Dumper uses \ notation for non-ASCII,
> so dumped binary strings even survive encoding layers. (Okay, they
> should be ASCII compatible, but most are).
> Without Useqq, D::D works fine on unicode strings and binary strings,
> but your binary strings might be re-encoded in an inconvenient way.

So you are saying that Data::Dumper treats strings as text, whether they 
are text or binary.  But that if you have binary data, it will be dumped 
as part of the string, and reading/writing to binmode files probably 
works, or reading/writing to nearly any file works with 
$Data::Dumper::Useqq turned on.
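That survivability is easy to check: with Useqq on, Dumper emits pure-ASCII backslash escapes, and eval'ing the dump reproduces the exact string. A sketch:

```perl
use strict;
use warnings;
use Data::Dumper;

local $Data::Dumper::Useqq = 1;

my $binary = "\x00\x01\xFE\xFF";
my $dump   = Dumper($binary);      # ASCII-only, e.g. $VAR1 = "\0\1\376\377";
print $dump;

our $VAR1;                         # the dump assigns to $VAR1
my $copy = eval $dump;
print $copy eq $binary ? "survived\n" : "mangled\n";
```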

>> G) regular expressions -- lots of reference is made to regular 
>> expressions being broken, or at least different, for multi-byte stuff.  
>> I fail to see why regular expressions are so hard to deal with. 
> I guess that it is not particularly hard to deal with the bug, because
> both kinds of semantics are already present. It's dealing with all the
> code out in the wild that depends on the current buggy semantics that is
> hard. To remain backwards compatible, new syntax has to be introduced
> (but may be implied with "use 5.10", for example). See my thread "fixing
> the regex engine wrt unicode".

Well I had read that, but I didn't understand it the first time.  
Rereading it now I think I do.

The problem is that there are two (<- :) *) kinds of data that regexp's 
can operate on:

1) Unicode multi-byte
1) ASCII byte
1) ASCII multi-byte
2) Latin-1 byte
1) Latin-1 multi-byte

* That's 5 kinds of data, actually, but two outcomes: works, and doesn't 
work, I think, where case 1 works and case 2 doesn't?  But that's 
because Latin-1 semantics were never supported before Unicode, because 
people used all sorts of different "code pages", none of which were 
known or understood by Perl, and were hence ignored by Perl, and only 
ASCII semantics were implemented?

So the "unicode regexp" problem is really a "Latin-1 bytes regexp" 
problem?  Yes, your /u feature would seem to cure that, then, if that is 
the only problem.
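The é case makes the problem concrete. On perls where this bug exists (without something like the later "unicode_strings" feature enabled), the same two codepoints match or don't match under /i depending only on the internal representation:

```perl
use strict;
use warnings;

my $e_acute = "\xE9";                       # e-acute as a byte string

my $bytes_match = $e_acute =~ /\xC9/i;      # byte semantics: typically no match
utf8::upgrade($e_acute);                    # same value, UTF-8 representation
my $text_match  = $e_acute =~ /\xC9/i;      # Latin-1/Unicode semantics: matches

print $bytes_match ? "bytes: match\n" : "bytes: no match\n";
print $text_match  ? "text: match\n"  : "text: no match\n";
```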

>> Firstly, regular expressions deal in "characters", not bytes, or 
>> multi-byte sequences.  
> Some people like to use regular expressions on their binary data, and I
> think they should be able to keep doing it. Of course, things like /i or
> the predefined character classes don't make sense there, and those are
> the broken things.

Sure, even though regexps are character oriented things, if the pattern 
matches, regexp it!  That's the same theory under which one can have 
binary data units in the range [0:2^30] stored in multi-bytes, right?

Glenn --
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
