
Re: perl, the data, and the utf8 flag

From: Glenn Linderman
Date: March 31, 2007 20:00
Subject: Re: perl, the data, and the utf8 flag
Message ID: 460F2013.5040209@NevCal.com
On approximately 3/31/2007 6:44 PM, came the following characters from 
the keyboard of Glenn Linderman:
> On approximately 3/31/2007 3:34 PM, came the following characters from 
> the keyboard of Juerd Waalboer:
>> Glenn Linderman skribis 2007-03-31 14:35 (-0700):
>>  
>>> Juerd says two, but describes 3 types of data.
>>>     
>> Well observed! :)
>>   
>
> :) Thought that would get your attention, if nothing else did!
>
> You sidestepped comment on what the range is for data stored in 
> multi-bytes format, though... is it [0:2^30-1] or [0:2^32-1], or 
> what?  Anyone?  Unicode implements [0:0x10ffff], but what does Perl 
> support?
>
>> I invite you to read perlunitut and perlunifaq. You'll have to look for
>> them (e.g. Google) or get them from bleadperl.
>>   
>
> I've read perlunifaq several times trying to figure things out.  I 
> read perlunitut yesterday, when it came up in this discussion.
>
> I found perlunifaq quite opaque the first several times I read it.  
> perlunitut seems easier to follow, but didn't answer all my questions 
> either.
>
>>> 1) What operations can safely be used on bytes stored in a string 
>>> without causing implicit upgrades to multi-bytes?
>>>     
>>
>> All operations are safe, except:
>>
>> 1. operations that add characters greater than 255
>> 2. joining text strings with byte strings (because
>> the text string may already internally be prepared for handling
>> characters greater than 255, and forces the byte string to be prepared
>> in a similar way, breaking its binaryness) and byte operations (because
>> they cannot handle characters greater than 255).
>>
>> If you read carefully, you'll notice that "1" is just a different way of
>> having "2", and come to the conclusion that there's only one simple, yet
>> important, guideline: keep binary and text separate, and only bridge
>> between the two by means of decoding and encoding.
>>   
>
> Again, you say two, but describe 3.... :)  Maybe that is a habit of 
> yours?
>
> 1. operations that add characters greater than 255
> 2. joining text strings with byte strings
> 3. byte operations
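
(To make the "keep them separate, bridge via decoding and encoding" 
guideline concrete, here is a minimal sketch using the standard Encode 
module; the literal values are just examples:)

use Encode qw(decode encode);

my $octets = "caf\xc3\xa9";                # bytes, e.g. read from a socket
my $text   = decode( "UTF-8", $octets );   # bridge: bytes -> text
$text .= " \x{263A}";                      # text operations are now safe
my $out    = encode( "UTF-8", $text );     # bridge back: text -> bytes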
>
>
>>> My perception from following all this discussion is that you can do 
>>> any operation, as long as all the data involved is bytes data that 
>>> has never been upgraded, except for decode, which always assumes a 
>>> bytes parameter.
>>>     
>>
>> Your perception is correct.
>>   
>
> Thanks for the confirmation.
>
>
>>> 2) What operations create multi-bytes data?
>>>     
>>
>> 1. operations that add characters greater than 255
>> 2. joining text strings with byte strings or byte operations, if the
>> text string is internally prepared for handling characters greater than
>> 255 ("is internally encoded as UTF8", "has the UTF8 flag set").
>>
>> This is, in essence, the same list as before :)
>>   
>
> In essence.
>
>
>>> 3) What operations create bytes data?
>>>     
>>
>> In general, everything that creates a new string value with only
>> characters in the 0..255 range, with the possible exception of
>> operations that are designed to create text strings only (like decode).
>>
>> Some examples:
>>
>>     "\xdf\xff\xa0\x00\xa1"            # binary (but can be used as text)
>>     "\x{df}\x{ff}\x{a0}\x{00}\x{a1}"  # binary (but can be used as text)
>>     "\x{100}..."                      # text (should not be used as bin)
>>     chr 255                           # binary (but can be used as text)
>>     chr 256                           # text (should not be used as bin)
>>     readline $fh                      # either, depends on io layers
>>
>> Note that "use encoding" will turn almost everything into a text-only
>> creating thing, and makes using binary data very hard. My advice is to
>> avoid the module altogether.
>>
>>  
>>> 4) What operations implicitly upgrade data from binary, assuming 
>>> that because of context it must be ISO-8859-1 encoded data?
>>>     
>>
>> In the core, only concatenating with an internally-UTF8 string. All
>> other operations that require UTF8 only upgrade temporarily.
>>
>> There are modules that carelessly upgrade strings; this causes no
>> problem if your string is a text string and you decode/encode properly
>> and keep the string properly separated from byte strings. But it might
>> otherwise.
>>
>>  
>>> My perception is _any operation_ that also includes another operand 
>>> that is UTF8 already.
>>>     
>>
>> When the "other operand" becomes part of the target string at some
>> point, yes.
>>   
>
>
> OK, good clarification, thanks.
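
(A minimal sketch of that, watching the flag with utf8::is_utf8; the 
flag is an internal detail, so treat the output as illustrative only:)

my $bytes = "\xfe\xff";                  # byte string, UTF8 flag off
my $text  = "\x{100}";                   # text string, UTF8 flag on
print utf8::is_utf8($bytes) ? 1 : 0;     # 0
my $joined = $text . $bytes;             # the result is internally UTF-8
print utf8::is_utf8($joined) ? 1 : 0;    # 1; $bytes itself is unchanged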
>
>
>>> 5) It seems that there should be documented lists of operations and 
>>> core modules that a) never upgrade b) never downgrade c) always 
>>> upgrade d) always downgrade e) may upgrade f) may downgrade and the 
>>> conditions under which it may happen.
>>>     
>>
>> Instead of compiling lists, isn't adding this information to the
>> existing documentation a better idea?
>>
>> Lists like this would suggest that you'd have to learn them by heart in
>> order to write safe code, while in fact you only need to keep text
>> separate from byte values and byte operations. The latter is a lot easier to
>> learn and universally applicable.
>>   
>
> Right.  As long as the rules apply everywhere, the only things we need 
> to learn are the exceptions.  I think now we just need to discuss the 
> exceptions...
>
>
>>> Juerd's forthcoming perlunitut document seems to imply that the 
>>> rules are indeed common to all operations, but this discussion seems 
>>> to indicate that there might be a few exceptions to that...     
>>
>> It does seem to indicate so, but I've yet to see proof of these other
>> supposed implicit upgrades...
>>   
>
> OK, so the working assumption is that the rules apply everywhere, 
> except for the exceptions below.  Otherwise, it is a bug.
>
>
>>> A) pack -- it is not clear to me how this operation could produce 
>>> anything except bytes for the packed buffer parameter, regardless of 
>>> other parameters supplied.
>>>     
>>
>> pack "U" works like chr. I'd strongly advise against using U with other
>> letters, because U makes text strings, and the other letters are byte
>> operations (so using them together would break the text/byte rule).
>>
>> pack "U*", LIST is useful because it is more convenient than writing
>> join "", map chr, LIST.
>>   
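
(So pack "U*" is a list-friendly chr; a small sketch, with arbitrary 
values:)

my $s1 = pack "U*", 99, 97, 102, 233;              # "café" as a text string
my $s2 = join "", map { chr } 99, 97, 102, 233;    # the same characters
print $s1 eq $s2 ? "same" : "different";   # "same": eq compares characters,
                                           # not internal representations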
>
> So it would be good to have this advice noted in the documentation for 
> pack, eh?  I haven't read the 5.10 pack documentation, only 5.8.8 and 
> before, and the Perl 5 Pocket Reference Manual, but no such advice is 
> given there.  Actually, the documentation conflicts with what you say 
> above (regarding U making text strings).  It says:
>
> From perlfunc:
>> If the pattern begins with a U, the resulting string will be 
>> treated as UTF-8-encoded Unicode. You can force UTF-8 encoding on in 
>> a string with an initial U0, and the bytes that follow will be 
>> interpreted as Unicode characters. If you don't want this to happen, 
>> you can begin your pattern with C0 (or anything else) to force Perl 
>> not to UTF-8 encode your string, and then follow this with a U* 
>> somewhere in your pattern.
>
> Immediately afterwards it contradicts itself by saying what I thought:
>
>> You must yourself do any alignment or padding by inserting for 
>> example enough 'x'es while packing. There is no way to pack() and 
>> unpack() could know where the bytes are going to or coming from. 
>> Therefore pack (and unpack) handle their output and input as flat 
>> sequences of bytes.
>
> So it is quite certain that this section of documentation needs to be 
> improved, before I can understand what actually will happen.  In the 
> presence of such contradictions, your advice to avoid using both U and 
> other template characters looks reasonable, even if your justification 
> contradicts some of the self-contradictory documentation.
>
> The way I read this is that any pack template that starts with U 
> produces multi-bytes, and any pack template that does not start with U 
> produces bytes.  So then if unpack is given a "multi-bytes producing 
> template" (starts with U) it should expect a multi-bytes value, and if 
> handed bytes, should upgrade it.  And if unpack is given a "bytes 
> producing template" (starts with anything but U) it should expect a 
> bytes value, and if handed multi-bytes, should downgrade it.  Or why not?
>
> I think that would "cure" Mark's oft-repeated "unpack is broken" claim...
>
>
>>> B) unpack -- it is not clear to me how this operation could 
>>> successfully process a multi-bytes buffer parameter, except by first 
>>> downgrading it, if it contains no values > 255, since all the 
>>> operations on it are defined in terms of unpacking bytes.
>>>     
>>
>> Indeed. Even downgrading is questionable, because its operand should
>> never have been upgraded in the first place.
>>   
>
>
> I could imagine something like the following being invented as a 
> communication protocol...
>
> $x = $text_string . "\0" . pack( "template", @params);
>
> # send somewhere, that does the following
>
> ( $retrieve_text_string, $unpack_me ) = split( /\0/, $x, 2 );
>
> @retrieve_params = unpack( "template", $unpack_me );
>
>
> It seems that would violate your recommendation of keeping things 
> separate, but one needs to avoid two separate communications, for 
> efficiency, eh?  So do you have a recommended practice for this sort 
> of action?
>
> It seems that fixing unpack to downgrade multi-byte strings if the 
> template doesn't start with U and to upgrade byte strings if the 
> template starts with U would cure this problem, if the parameter is 
> upgraded for some reason.

I read perlunifaq again... so one can "encode_utf8", which presumably 
produces a binary byte string containing UTF-8 octet sequences.  So 
truly, truly you can have UTF-8 data in both bytes and multi-bytes forms!

OK, I guess this could be written as

use Encode qw(encode_utf8 decode_utf8);

$x = encode_utf8( $text_string ) . "\0" . pack( "template", @params );

# send somewhere, that does the following

( $retrieve_text_string, $unpack_me ) = split( /\0/, $x, 2 );  # limit 2:
                                 # the packed data may itself contain "\0"

$retrieve_text_string = decode_utf8( $retrieve_text_string );

@retrieve_params = unpack( "template", $unpack_me );


And doing so would sidestep the implicit upgrade problem.
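
(The NUL separator still assumes the encoded text contains no "\0" 
itself; a length prefix avoids depending on the payload's bytes 
entirely.  A sketch, with an arbitrary "NnN" template standing in for 
the real one:)

use Encode qw(encode_utf8 decode_utf8);

my $text_string = "r\x{e9}sum\x{e9}";
my $t   = encode_utf8( $text_string );
my $msg = pack( "N", length $t ) . $t . pack( "NnN", 1, 2, 3 );

# receiver:
my ($len)  = unpack "N", $msg;
my $text   = decode_utf8( substr( $msg, 4, $len ) );
my @params = unpack "NnN", substr( $msg, 4 + $len );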


>> Binary strings don't have any encoding, as far as Perl is concerned.
>> When it gets a string of which it is certain that it does have an
>> encoding, it can't possibly be binary.  
>
> Binary has two encodings.  That's how I started out this discussion 
> (but that has been snipped by now).  Multiple sequential binary values 
> in the range [0:255] can be packed into byte strings (pack "C*"), and 
> multiple sequential binary values in the range [0:2^30] can be packed 
> into multi-byte strings (pack "U*").  That's two different encodings 
> for binary sequences (strings).  And that seems to really be all that 
> Perl knows about, with three exceptions:
> 1) encode, and the encoding layers, apply a variety of character set 
> semantics to transform characters from one type of binary encoding and 
> values to another (either bytes to bytes, bytes to multi-bytes, 
> multi-bytes to bytes, and even (if Shift JIS or Hangul is supported) 
> other stuff stored in bytes format to/from multi-bytes).
>
> 2) the regexp engine has the concept of character classes, which are a 
> collection of sets of specific binary values that are assumed to have 
> a particular character set semantic.
>
> 3) the regexp engine has the concept of "case insensitivity" which is 
> a large collection of rules about which binary values should be 
> treated identically, based on the assumption of the rules of a 
> particular character set.
>
> If we stick with binary values in the range [0:255] in byte strings, 
> then it could be considered inappropriate and wasteful of resources to 
> "upgrade" them, or "downgrade" them, but it works.  Except, um, for 
> unpack, sometimes?
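
(The "two encodings for the same values" point is easy to see; a sketch, 
remembering that the flag itself is an internal detail:)

my $c = pack "C*", 200, 100;    # byte string: one octet per value
my $u = pack "U*", 200, 100;    # the same two characters, multi-bytes form
print length $c, " ", length $u, "\n";   # "2 2": lengths are in characters
print $c eq $u ? "equal" : "not", "\n";  # "equal": same character values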
>
>> non-U unpacking might as well return undef or random values, when it
>> gets a string that has the UTF8 flag set :)
>>   
>
> Not sure how it would return undef???  The values are hardly random, 
> either, they come out of the buffer it is handed, right?  But if a 
> multi-bytes string is passed to a bytes-expecting template, then 
> clearly it won't produce the results you expect...
>
> So it needs to be downgraded first... because it expects a bytes 
> string, whether handed bytes or multi-bytes.  Unless U is involved, 
> and then it expects a multi-bytes string?  Or at least if U is the 
> first template character... or something.  This pack/unpack stuff 
> doesn't seem to be well documented.  It seems to matter whether U is 
> first or not, and then it seems to matter if an implicit upgrade 
> happened or not, and it apparently isn't willing to do an implicit 
> downgrade of a multi-bytes buffer even if the first template character 
> is not U, even though pack's behaviour suggests that such would be 
> appropriate.
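
(Given that murkiness, the defensive move is to normalize the buffer 
yourself before unpacking, instead of relying on version-specific 
behaviour; a sketch:)

my $buf = "\x41\x42\x43";
utf8::upgrade($buf);              # simulate an accidental upgrade
utf8::downgrade($buf);            # back to bytes; dies if any char > 255
my @vals = unpack "C*", $buf;     # (65, 66, 67), as intended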
>
>
>> Again, here's the special case for U, that works like "ord", but again
>> supports lists in a nicer way. And with unpack too, I think it's wrong
>> to mix U with other letters.
>>   
>
> Yes, it does seem that avoiding the mixing avoids the problems, but it 
> also seems that unpack could behave better.
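
(The "unpack U as a list-friendly ord" reading, as a sketch; using a 
character above 255 so the string is unambiguously a text string:)

my $t   = "caf\x{e9} \x{263A}";        # text string
my @cp  = unpack "U*", $t;             # the code points, like a list "ord"
my @cp2 = map { ord } split //, $t;    # the same list, the long way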
>
>
>>> C) use bytes; -- clearly this impacts lots of other operations.
>>>     
>>
>> I advise against "use bytes", and "use encoding".
>>   
>
> OK.  What do you recommend when needing to store a UTF-8 string into a 
> struct?  I have a program that uses a string variable to "simulate" 
> the memory of another computer... so it truly wants to stay bytes.  
> But I want to store UTF-8 encoded strings into it.  Seems like "use 
> bytes;" is a perfect match for the operations that work on the 
> simulated memory.  Maybe this would be a place where you would agree 
> to make an exception to your above advice?  Or do you have another 
> recommended technique?

And so encode_utf8 is probably the solution to my question just above, 
allowing "length in bytes" to be obtained for a UTF-8 encoded string.  
And pulling it back out, one would decode_utf8 it...
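
(A sketch of the simulated-memory case; the offsets and sizes here are 
made up:)

use Encode qw(encode_utf8 decode_utf8);

my $memory = "\0" x 1024;                  # byte string; stays bytes
my $octets = encode_utf8( "r\x{e9}sum\x{e9}" );
substr( $memory, 16, length $octets ) = $octets;   # store at offset 16
my $back = decode_utf8( substr( $memory, 16, length $octets ) );
print length $octets, "\n";   # 8: byte length, not character length (6)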

>
>
>>> D) Data::Dumper -- someone made the claim that Data::Dumper simply 
>>> ignores the UTF-8 flag, and functions properly.  Could someone 
>>> elucidate how that happens?      
>>
>> It's because the text/byte distinction exists in a programmer's mind,
>> not in the string value. There is a latin1/utf8 distinction in the
>> string value, internally, but the representation of the codepoints
>> doesn't change the value of the codepoints, so effectively, even with a
>> different internal encoding, you maintain the same string value.
>>
>> If you use $Data::Dumper::Useqq, Dumper uses \ notation for non-ASCII,
>> so dumped binary strings even survive encoding layers. (Okay, they
>> should be ASCII compatible, but most are).
>>
>> Without Useqq, D::D works fine on unicode strings and binary strings,
>> but your binary strings might be re-encoded in an inconvenient way.
>>   
>
> So you are saying that Data::Dumper treats strings as text, whether 
> they are text or binary.  But that if you have binary data, it will be 
> dumped as part of the string, and reading/writing to binmode files 
> probably works, or reading/writing to nearly any file works with 
> $Data::Dumper::Useqq turned on.
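
(A quick sketch of the Useqq behaviour; the exact escape style is up to 
Data::Dumper, so the comments are approximate:)

use Data::Dumper;
$Data::Dumper::Useqq = 1;      # double-quoted output with \-escapes
print Dumper( "caf\x{e9}", "\x00\x01\xff" );
# non-ASCII and control characters come out as escapes, so the dump
# itself is plain ASCII and survives most encoding layers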
>
>
>>> G) regular expressions -- lots of reference is made to regular 
>>> expressions being broken, or at least different, for multi-byte 
>>> stuff.  I fail to see why regular expressions are so hard to deal 
>>> with.     
>>
>> I guess that it is not particularly hard to deal with the bug, because
>> both kinds of semantics are already present. It's dealing with all the
>> code out in the wild that depends on the current buggy semantics that is
>> hard. To remain backwards compatible, new syntax has to be introduced
>> (but may be implied with "use 5.10", for example). See my thread "fixing
>> the regex engine wrt unicode".
>>   
>
> Well I had read that, but I didn't understand it the first time.  
> Rereading it now I think I do.
>
> The problem is that there are two (<- :) *) kinds of data that 
> regexp's can operate on:
>
> 1) Unicode multi-byte
> 1) ASCII byte
> 1) ASCII multi-byte
> 2) Latin-1 byte
> 1) Latin-1 multi-byte
>
> * That's 5 kinds of data, actually, but two outcomes: works, and 
> doesn't work, I think, where case 1 works and case 2 doesn't?  But 
> that's because Latin-1 semantics were never supported before Unicode, 
> because people used all sorts of different "code pages", none of which 
> were known or understood by Perl, and were hence ignored by Perl, and 
> only ASCII semantics were implemented?
>
> So the "unicode regexp" problem is really a "Latin-1 bytes regexp" 
> problem?  Yes, your /u feature would seem to cure that, then, if that 
> is the only problem.
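
(The discrepancy is easy to show; the exact behaviour depends on the 
Perl version, so read the comments as "what 5.8-era perls tend to do":)

my $b = "\xe9";                  # e-acute as a byte string
my $t = "\xe9";
utf8::upgrade($t);               # the same character, multi-bytes form
print $b =~ /\w/ ? "w" : "-";    # often "-": byte string, ASCII semantics
print $t =~ /\w/ ? "w" : "-";    # "w": upgraded string, Unicode semantics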
>
>
>>> Firstly, regular expressions deal in "characters", not bytes, or 
>>> multi-byte sequences.      
>>
>> Some people like to use regular expressions on their binary data, and I
>> think they should be able to keep doing it. Of course, things like /i or
>> the predefined character classes don't make sense there, and those are
>> the broken things.
>>   
>
> Sure, even though regexps are character oriented things, if the 
> pattern matches, regexp it!  That's the same theory under which one 
> can have binary data units in the range [0:2^30] stored in 
> multi-bytes, right?
>
>


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

