On Sat, Mar 31, 2007 at 03:53:25AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:
> Juerd Waalboer skribis 2007-03-30 21:53 (+0200):
> > Personally, I think that unpack with a byte-specific signature should
> > die, or at least warn, when its operand has the UTF8 flag set.
>
> I've since this post changed my mind, and think it should only warn if

We are making progress, and I would actually be content with that
solution, but it does break "U".

The solution, really, is to treat "C" like an octet in the same way "n"
is treated like two octets. That does not break existing code and is
what many perl programmers find natural.

Since so many people are confused about why the unpack change breaks
code, I will explain it differently:

   my $k = "\x10\x00";
   die unpack "n", $k;

This gives me 4096. "n" is documented to take exactly 16 bits, two
octets. I get 4096 regardless of how perl chooses to represent the
string internally: if perl switched to UCS-4 (something that certainly
won't happen, but which has been brought up before to remind people
that the internal encoding can change), it would still work. The same
is true for "L", which is documented to be exactly 32 bits.

Now, when people want an 8 bit value followed by a 16 bit big endian
value, they used "Cn" in the old times. In fact, they still use that,
as "C" has always been the octet companion to the 16 bit and 32 bit
sSlLnNvV etc.

However, in a weird stroke, somebody decided that "C" no longer gives
you a single octet of your string but, depending on an internal flag
(the internal encoding), either that octet or only part of the
character's internal representation.

Now, what used to be unpack "CCV" in perl 5.005 must be written as
unpack "UUV" in perl 5.8, as "U" has the right semantics for decoding a
single octet out of a binary string.

That's weird, because now code that _doesn't_ want to deal with unicode
at all, but in fact only deals with binary data, must use this unicode
thingy "U", even though the documentation for "C" clearly says it's an
octet, and even says it's an octet as in C, which is exactly what those
people decoding structures or network packets want.

That is the problem.

Now, I don't mind at all if I get a die when trying "C" on a
byte=character that is >255 (i.e. not representable as an octet). Or a
die when attempting that on a two byte=character string with "n". I
personally dislike the warning, because the warning only ever comes up
when there is a bug. It doesn't matter much to me personally, though.

What matters to me is that binary-only code now needs to use "U" where
formerly "C" was meant, to get correct behaviour. This *needs* to be
fixed.

--
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcg@goof.com
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
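
A minimal self-contained sketch of the behaviour complained about above;
the three-octet buffer, the variable names and the output format are only
illustrative and not from the mail. Whether the second line of output
matches the first depends on the unpack semantics of the perl in use,
which is exactly the point:

   #!/usr/bin/perl
   # illustrative buffer: an 8 bit value followed by a 16 bit big
   # endian value, i.e. something you would decode with "Cn".
   use strict;
   use warnings;

   my $buf = "\x01\xff\x00";

   # "n" is documented as exactly 16 bits, so this should print
   # C=1 n=65280 no matter how perl stores the string internally.
   printf "downgraded: C=%d n=%d (utf8 flag: %d)\n",
          unpack("Cn", $buf), (utf8::is_utf8($buf) ? 1 : 0);

   # flip only the internal representation; the string contents
   # (the octets/characters 0x01 0xff 0x00) stay the same.
   utf8::upgrade($buf);

   # on perls where unpack looks at the internal encoding, this line
   # differs from the one above - that is the complaint in the mail.
   printf "upgraded:   C=%d n=%d (utf8 flag: %d)\n",
          unpack("Cn", $buf), (utf8::is_utf8($buf) ? 1 : 0);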