On approximately 4/3/2007 5:54 AM, came the following characters from the keyboard of Juerd Waalboer:
> Glenn Linderman skribis 2007-04-01 16:34 (-0700):
>
>> Aha! OK, this is a way that unpack could successfully operate on a
>> multi-bytes buffer. But I think it is also equivalent to downgrading it
>> (with a warning for values > 255) and then processing it as bytes.
>>
>
> Not if you also have the "U" in the template somewhere, in addition to
> other letters. (Bad idea anyway!)
>

If the U template were adjusted to not pack into a "multi-bytes" buffer, but instead to pack an encoded UTF8 representation of its parameter into a bytes buffer, then all would be well. And if unpack U decoded, byte by byte, from an encoded representation back to an INT, it would retrieve the value, even if the buffer had been upgraded, using the blead-unpack scheme.

>> I think that pack-U should be defined to produce "encoded bytes"
>>
>
> It doesn't do that, though.

Right... So I'd consider that a bug. If changing it is impossible, I'd recommend deprecating U in favor of M, which does as I describe... encodes its parameter to Multi-byte variable-length UTF8... in a bytes buffer.

> It produces encodingless characters, not
> bytes. However, you inspired me to come up with the following:
>
> $byte_string = pack "a*[UTF-8]", $text_string
> $text_string = unpack "a*[UTF-8]", $byte_string
>

So I'm not sure what [UTF-8] means there, but I guess it is a new syntax for modifying the "a" template to do encoding/decoding. That is similar to what I'm suggesting with M. For M, the count parameter would be in characters.

Are you proposing allowing other encodings? I'm only proposing allowing utf8... for anything else, the user can call encode/decode himself, and deal with the resulting byte lengths himself. But since Perl supports two binary encodings, having pack support them seems reasonable.

> Likewise for "A" and "Z", and for arbitrary encodings.
> This would just call Encode::encode (for pack) or Encode::decode
> (for unpack) transparently, before doing the actual packing or
> unpacking.
>

Ah, I see... to make it more explicit: for pack, call encode on the parameter, then do a pack "a" using the result of encode instead of the original parameter. And for unpack, do the unpack, then call decode on the result of the unpack, and return that decoded result.

> The quantifier is a number of bytes, not characters. This means that it
> can be in the middle of a multibyte encoding for a character. When that
> happens, tough luck. We can't help that. (In other words: this really
> only makes a lot of sense for multibyte packing if the quantifier is *)
>

The M template code I propose would take a number of characters as a parameter, and would work for given character lengths in either direction.

If U cannot be changed to be what I describe for M, it should be deprecated.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking