Re: perl, the data, and the tf8 flag
From: Glenn Linderman
Date: March 31, 2007 20:00
Subject: Re: perl, the data, and the tf8 flag
Message ID: 460F2013.5040209@NevCal.com
On approximately 3/31/2007 6:44 PM, came the following characters from
the keyboard of Glenn Linderman:
> On approximately 3/31/2007 3:34 PM, came the following characters from
> the keyboard of Juerd Waalboer:
>> Glenn Linderman wrote 2007-03-31 14:35 (-0700):
>>
>>> Juerd says two, but describes 3 types of data.
>>>
>> Well observed! :)
>>
>
> :) Thought that would get your attention, if nothing else did!
>
> You sidestepped comment on what the range is for data stored in
> multi-bytes format, though... is it [0:2^30-1] or [0:2^32-1], or
> what? Anyone? Unicode implements [0:0x10ffff], but what does Perl
> support?
>
>> I invite you to read perlunitut and perlunifaq. You'll have to look for
>> them (e.g. Google) or get them from bleadperl.
>>
>
> I've read perlunifaq several times trying to figure things out. I
> read perlunitut yesterday, when it came up in this discussion.
>
> I found perlunifaq quite opaque the first several times I read it.
> perlunitut seems easier to follow, but didn't answer all my questions
> either.
>
>>> 1) What operations can safely be used on bytes stored in a string
>>> without causing implicit upgrades to multi-bytes?
>>>
>>
>> All operations are safe, except:
>>
>> 1. operations that add characters greater than 255
>> 2. joining text strings with byte strings (because
>> the text string may already internally be prepared for handling
>> characters greater than 255, and forces the byte string to be prepared
>> in a similar way, breaking its binaryness) and byte operations (because
>> they cannot handle characters greater than 255).
>>
>> If you read carefully, you'll notice that "1" is just a different way of
>> having "2", and come to the conclusion that there's only one simple, yet
>> important, guideline: keep binary and text separate, and only bridge
>> between the two by means of decoding and encoding.
>>
>
> Again, you say two, but describe 3.... :) Maybe that is a habit of
> yours?
>
> 1. operations that add characters greater than 255
> 2. joining text strings with byte strings
> 3. byte operations
>
>
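To make rule 2 concrete for myself, here's a minimal sketch of the
implicit upgrade, assuming a 5.8+ perl, where utf8::is_utf8 just
reports the internal flag:

my $bytes = "\xDF\xFF";                          # bytes; UTF8 flag off
print utf8::is_utf8($bytes) ? "on\n" : "off\n";  # off
$bytes .= "snowman: \x{2603}";                   # join with a text string
print utf8::is_utf8($bytes) ? "on\n" : "off\n";  # on: implicitly upgraded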
>>> My perception from following all this discussion is that you can do
>>> any operation, as long as all the data involved is bytes data that
>>> has never been upgraded, except for decode, which always assumes a
>>> bytes parameter.
>>>
>>
>> Your perception is correct.
>>
>
> Thanks for the confirmation.
>
>
>>> 2) What operations create multi-bytes data?
>>>
>>
>> 1. operations that add characters greater than 255
>> 2. joining text strings with byte strings or byte operations, if the
>> text string is internally prepared for handling characters greater than
>> 255 ("is internally encoded as UTF8", "has the UTF8 flag set").
>>
>> This is, in essence, the same list as before :)
>>
>
> In essence.
>
>
>>> 3) What operations create bytes data?
>>>
>>
>> In general, everything that creates a new string value with only
>> characters in the 0..255 range, with the possible exception of
>> operations that are designed to create text strings only (like decode).
>>
>> Some examples:
>>
>> "\xdf\xff\xa0\x00\xa1" # binary (but can be used as text)
>> "\x{df}\x{ff}\x{a0}\x{00}\x{a1}" # binary (but can be used as text)
>> "\x{100}..." # text (should not be used as bin)
>> chr 255 # binary (but can be used as text)
>> chr 256 # text (should not be used as bin)
>> readline $fh # either, depends on io layers
>>
>> Note that "use encoding" will turn almost everything into a text-only
>> creating thing, and makes using binary data very hard. My advice is to
>> avoid the module altogether.
>>
>>
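Going back to "readline $fh # either, depends on io layers": the open
mode is what decides. A sketch, with hypothetical file names:

open my $raw_fh,  "<:raw",             "data.bin" or die $!;  # readline gives bytes
open my $text_fh, "<:encoding(UTF-8)", "data.txt" or die $!;  # readline gives text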
>>> 4) What operations implicitly upgrade data from binary, assuming
>>> that because of context it must be ISO-8859-1 encoded data?
>>>
>>
>> In the core, only concatenating with an internally-UTF8 string. All
>> other operations that require UTF8 only upgrade temporarily.
>>
>> There are modules that carelessly upgrade strings; this causes no
>> problem if your string is a text string and you decode/encode properly
>> and keep the string properly separated from byte strings. But it might
>> otherwise.
>>
>>
>>> My perception is _any operation_ that also includes another operand
>>> that is UTF8 already.
>>>
>>
>> When the "other operand" becomes part of the target string at some
>> point, yes.
>>
>
>
> OK, good clarification, thanks.
>
>
>>> 5) It seems that there should be documented lists of operations and
>>> core modules that a) never upgrade b) never downgrade c) always
>>> upgrade d) always downgrade e) may upgrade f) may downgrade and the
>>> conditions under which it may happen.
>>>
>>
>> Instead of compiling lists, isn't adding this information to the
>> existing documentation a better idea?
>>
>> Lists like this would suggest that you'd have to learn them by heart in
>> order to write safe code, while in fact it's only needed to keep text
>> from byte values and byte operations. The latter is a lot easier to
>> learn and universally applicable.
>>
>
> Right. As long as the rules apply everywhere, the only things we need
> to learn are the exceptions. I think now we just need to discuss the
> exceptions...
>
>
>>> Juerd's forthcoming perlunitut document seems to imply that the
>>> rules are indeed common to all operations, but this discussion seems
>>> to indicate that there might be a few exceptions to that...
>>
>> It does seem to indicate so, but I've yet to see proof of these other
>> supposed implicit upgrades...
>>
>
> OK, so the working assumption is that the rules apply everywhere,
> except for the exceptions below. Otherwise, it is a bug.
>
>
>>> A) pack -- it is not clear to me how this operation could produce
>>> anything except bytes for the packed buffer parameter, regardless of
>>> other parameters supplied.
>>>
>>
>> pack "U" works like chr. I'd strongly advise against using U with other
>> letters, because U makes text strings, and the other letters are byte
>> operations (so using them together would break the text/byte rule).
>>
>> pack "U*", LIST is useful because it is more convenient than writing
>> join "", map chr, LIST.
>>
>
> So it would be good to have this advice noted in the documentation for
> pack, eh? I haven't read the 5.10 pack documentation, only 5.8.8 and
> before, and the Perl 5 Pocket Reference Manual, but no such advice is
> given there. Actually, the documentation conflicts with what you say
> above (regarding U making text strings). It says:
>
> From perlfunc:
>> If the pattern begins with a U, the resulting string will be
>> treated as UTF-8-encoded Unicode. You can force UTF-8 encoding on in
>> a string with an initial U0, and the bytes that follow will be
>> interpreted as Unicode characters. If you don't want this to happen,
>> you can begin your pattern with C0 (or anything else) to force Perl
>> not to UTF-8 encode your string, and then follow this with a U*
>> somewhere in your pattern.
>
> Immediately afterwards it contradicts itself by saying what I thought:
>
>> You must yourself do any alignment or padding by inserting for
>> example enough 'x'es while packing. There is no way to pack() and
>> unpack() could know where the bytes are going to or coming from.
>> Therefore pack (and unpack) handle their output and input as flat
>> sequences of bytes.
>
> So it is quite certain that this section of documentation needs to be
> improved, before I can understand what actually will happen. In the
> presence of such contradictions, your advice to avoid using both U and
> other template characters looks reasonable, even if your justification
> contradicts some of the self-contradictory documentation.
>
> The way I read this is that any pack template that starts with U
> produces multi-bytes, and any pack template that does not start with U
> produces bytes. So then if unpack is given a "multi-bytes producing
> template" (starts with U) it should expect a multi-bytes value, and if
> handed bytes, should upgrade it. And if unpack is given a "bytes
> producing template" (starts with anything but U) it should expect a
> bytes value, and if handed multi-bytes, should downgrade it. Or why not?
>
> I think that would "cure" Mark's oft-repeated "unpack is broken" claim...
>
>
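As a sanity check on the pack "U*" / chr equivalence Juerd describes,
this is what I'd expect on a 5.8+ perl (eq compares code points, not
the internal representation):

my @cps = ( 0x41, 0xDF, 0x100, 0x263A );
my $a = pack "U*", @cps;
my $b = join "", map { chr } @cps;
print $a eq $b ? "same\n" : "different\n";   # prints "same"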
>>> B) unpack -- it is not clear to me how this operation could
>>> successfully process a multi-bytes buffer parameter, except by first
>>> downgrading it, if it contains no values > 255, since all the
>>> operations on it are defined in terms of unpacking bytes.
>>>
>>
>> Indeed. Even downgrading is questionable, because its operand should
>> never have been upgraded in the first place.
>>
>
>
> I could imagine something like the following being invented as a
> communication protocol...
>
> $x = $text_string . "\0" . pack( "template", @params);
>
> # send somewhere, that does the following
>
> ( $retrieve_text_string, $unpack_me ) = split( "\0", $x );
>
> @retrieve_params = unpack( "template", $unpack_me );
>
>
> It seems that would violate your recommendation of keeping things
> separate, but one needs to avoid two separate communications, for
> efficiency, eh? So do you have a recommended practice for this sort
> of action?
>
> It seems that fixing unpack to downgrade multi-byte strings if the
> template doesn't start with U and to upgrade byte strings if the
> template starts with U would cure this problem, if the parameter is
> upgraded for some reason.
I read perlunifaq again... so one can "encode_utf8", which presumably
produces a binary byte string containing the UTF-8 style of octet
sequences. So truly, truly you can have UTF-8 data in both bytes and
multi-bytes forms!
OK, I guess this could be written as (loading Encode explicitly, and
giving split a limit of 2 so that NUL bytes inside the packed data
don't also get split on):

use Encode qw(encode_utf8 decode_utf8);
$x = encode_utf8( $text_string ) . "\0" . pack( "template", @params );
# send somewhere, that does the following
( $retrieve_text_string, $unpack_me ) = split( /\0/, $x, 2 );
$retrieve_text_string = decode_utf8( $retrieve_text_string );
@retrieve_params = unpack( "template", $unpack_me );
And doing so would sidestep the implicit upgrade problem.
>> Binary strings don't have any encoding, as far as Perl is concerned.
>> When it gets a string of which it is certain that it does have an
>> encoding, it can't possibly be binary.
>
> Binary has two encodings. That's how I started out this discussion
> (but that has been snipped by now). Multiple sequential binary values
> in the range [0:255] can be packed into byte strings (pack "C*"), and
> multiple sequential binary values in the range [0:2^30] can be packed
> into multi-byte strings (pack "U*"). That's two different encodings
> for binary sequences (strings). And that seems to really be all that
> Perl knows about, with three exceptions:
> 1) encode, and the encoding layers, apply a variety of character set
> semantics to transform characters from one type of binary encoding and
> values to another (bytes to bytes, bytes to multi-bytes,
> multi-bytes to bytes, or even (if Shift JIS or Hangul is supported)
> other stuff stored in bytes format to/from multi-bytes).
>
> 2) the regexp engine has the concept of character classes, which are a
> collection of sets of specific binary values that are assumed to have
> a particular character set semantic.
>
> 3) the regexp engine has the concept of "case insensitivity", which is
> a large collection of rules about which binary values should be
> treated identically, based on the assumption of the rules of a
> particular character set.
>
> If we stick with binary values in the range [0:255] in byte strings,
> then it could be considered inappropriate and wasteful of resources to
> "upgrade" them, or "downgrade" them, but it works. Except, um, for
> unpack, sometimes?
>
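To restate the "two encodings for binary" point in code, here's what I
would expect on a 5.8+ perl:

my $s = pack "C*", 0x41, 0xDF, 0xFF;   # binary values in bytes form
my $t = pack "U*", 0x41, 0xDF, 0xFF;   # same values in multi-bytes form
print $s eq $t ? "equal\n" : "unequal\n";    # equal: same code points
print utf8::is_utf8($s) ? "on\n" : "off\n";  # off
print utf8::is_utf8($t) ? "on\n" : "off\n";  # on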
>> non-U unpacking might as well return undef or random values, when it
>> gets a string that has the UTF8 flag set :)
>>
>
> Not sure how it would return undef??? The values are hardly random,
> either, they come out of the buffer it is handed, right? But if a
> multi-bytes string is passed to a bytes-expecting template, then
> clearly it won't produce the results you expect...
>
> So it needs to be downgraded first... because it expects a bytes
> string, whether handed bytes or multi-bytes. Unless U is involved,
> and then it expects a multi-bytes string? Or at least if U is the
> first template character... or something. This pack/unpack stuff
> doesn't seem to be well documented. It seems to matter whether U is
> first or not, and then it seems to matter if an implicit upgrade
> happened or not, and it apparently isn't willing to do an implicit
> downgrade of a multi-bytes buffer even if the first template character
> is not U, even though pack's behaviour suggests that such would be
> appropriate.
>
>
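Meanwhile the defensive workaround seems to be an explicit downgrade
before unpacking, for buffers that may have been upgraded behind your
back (utf8::downgrade dies if any value is greater than 255, which
would be a real error here anyway):

utf8::downgrade($unpack_me);          # force bytes representation
my @vals = unpack "C*", $unpack_me;   # now a plain byte operation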
>> Again, here's the special case for U, that works like "ord", but again
>> supports lists in a nicer way. And with unpack too, I think it's wrong
>> to mix U with other letters.
>>
>
> Yes, it does seem that avoiding the mixing avoids the problems, but it
> also seems that unpack could behave better.
>
>
>>> C) use bytes; -- clearly this impacts lots of other operations.
>>>
>>
>> I advise against "use bytes" and "use encoding".
>>
>
> OK. What do you recommend when needing to store a UTF-8 string into a
> struct? I have a program that uses a string variable to "simulate"
> the memory of another computer... so it truly wants to stay bytes.
> But I want to store UTF-8 encoded strings into it. Seems like "use
> bytes;" is a perfect match for the operations that work on the
> simulated memory. Maybe this would be a place where you would agree
> to make an exception to your above advice? Or do you have another
> recommended technique?
And so encode_utf8 is probably the solution to my question just above,
allowing "length in bytes" to be obtained for a UTF-8 encoded string.
And pulling it back out, one would decode_utf8 it...
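In code, the pattern I have in mind for the simulated memory would be
something like this ($memory and $addr are just illustrative names):

use Encode qw(encode_utf8 decode_utf8);

my $memory = "\0" x 1024;                # the simulated memory, as bytes
my $addr   = 16;                         # some offset into it
my $text_string = "caf\x{E9} \x{263A}";  # a text string

my $octets = encode_utf8($text_string);  # UTF-8 in bytes form
my $len    = length $octets;             # length in bytes, not characters
substr($memory, $addr, $len) = $octets;  # store into the byte buffer

my $back = decode_utf8( substr($memory, $addr, $len) );  # pull it back out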
>
>
>>> D) Data::Dumper -- someone made the claim that Data::Dumper simply
>>> ignores the UTF-8 flag, and functions properly. Could someone
>>> elucidate how that happens?
>>
>> It's because the text/byte distinction exists in a programmer's mind,
>> not in the string value. There is a latin1/utf8 distinction in the
>> string value, internally, but the representation of the codepoints
>> doesn't change the value of the codepoints, so effectively, even with a
>> different internal encoding, you maintain the same string value.
>>
>> If you use $Data::Dumper::Useqq, Dumper uses \ notation for non-ASCII,
>> so dumped binary strings even survive encoding layers. (Okay, they
>> should be ASCII compatible, but most are).
>>
>> Without Useqq, D::D works fine on unicode strings and binary strings,
>> but your binary strings might be re-encoded in an inconvenient way.
>>
>
> So you are saying that Data::Dumper treats strings as text, whether
> they are text or binary. But that if you have binary data, it will be
> dumped as part of the string, and reading/writing to binmode files
> probably works, or reading/writing to nearly any file works with
> $Data::Dumper::Useqq turned on.
>
>
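A quick illustration of the Useqq behavior, assuming the dump is headed
for an ASCII-safe channel:

use Data::Dumper;
local $Data::Dumper::Useqq = 1;            # backslash-escape non-ASCII
print Dumper( "caf\x{E9}", "\x{263A}" );   # non-ASCII comes out as \ escapes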
>>> G) regular expressions -- lots of reference is made to regular
>>> expressions being broken, or at least different, for multi-byte
>>> stuff. I fail to see why regular expressions are so hard to deal
>>> with.
>>
>> I guess that it is not particularly hard to deal with the bug, because
>> both kinds of semantics are already present. It's dealing with all the
>> code out in the wild that depends on the current buggy semantics that is
>> hard. To remain backwards compatible, new syntax has to be introduced
>> (but may be implied with "use 5.10", for example). See my thread "fixing
>> the regex engine wrt unicode".
>>
>
> Well I had read that, but I didn't understand it the first time.
> Rereading it now I think I do.
>
> The problem is that there are two (<- :) *) kinds of data that
> regexp's can operate on:
>
> 1) Unicode multi-byte
> 1) ASCII byte
> 1) ASCII multi-byte
> 2) Latin-1 byte
> 1) Latin-1 multi-byte
>
> * That's 5 kinds of data, actually, but two outcomes: works, and
> doesn't work, I think, where case 1 works and case 2 doesn't? But
> that's because Latin-1 semantics were never supported before Unicode,
> because people used all sorts of different "code pages", none of which
> were known or understood by Perl, and were hence ignored by Perl, and
> only ASCII semantics were implemented?
>
> So the "unicode regexp" problem is really a "Latin-1 bytes regexp"
> problem? Yes, your /u feature would seem to cure that, then, if that
> is the only problem.
>
>
>>> Firstly, regular expressions deal in "characters", not bytes, or
>>> multi-byte sequences.
>>
>> Some people like to use regular expressions on their binary data, and I
>> think they should be able to keep doing it. Of course, things like /i or
>> the predefined character classes don't make sense there, and those are
>> the broken things.
>>
>
> Sure, even though regexps are character oriented things, if the
> pattern matches, regexp it! That's the same theory under which one
> can have binary data units in the range [0:2^30] stored in
> multi-bytes, right?
>
>
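Agreed, and for example, scanning a byte string for a file signature is
plain byte matching, and ought to keep working so long as /i and the
predefined character classes stay out of it (file name hypothetical):

open my $fh, "<:raw", "image.png" or die $!;
read $fh, my $data, 8;
print "looks like PNG\n" if $data =~ /\A\x89PNG\r\n\x1A\n/;  # raw byte match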
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking