demerphq wrote on 2007-04-02 0:26 (+0200):
> This makes a certain amount of sense if you assume that
> strings can (apparenly) randomly change from octect encoding to utf8
> encoding.

No, it does not happen randomly. It only happens when confronted with
either:

(1). characters above 255. These are NEVER encountered in binary data,
     so this is not a problem.

(2). strings that are internally prepared to handle characters above
     255. They became this way because of (1) or an explicit text-only
     operation. This is only a problem in broken code.

> my $s=pack 'N',12345678;
> $s.=chr(256); # upgrade $s to utf8 by catting on a unicode codepoint

That does not fall under "randomly" but under "characters above 255".

Once you start adding such a character to your string, any binary
operation, such as unpack "N", makes no sense at all.

> So 'N' works with codepoints, not with bytes. Apparently this holds
> true for most of the pack template formats. HOWEVER, it doesnt apply
> to the pattern 'C' (and if i understand his recent posts this is what
> Marc was objecting to recently) which reads bytes.

If we choose to keep this behaviour, indeed the C pattern should change
too. But I think it is suboptimal to keep this behaviour, and suggest
that the previous change be reversed.

> Which to me says that almost any use of 'C' as an unpack template in
> Perl 5.9.x and later will be totally wrong.

In fact, any use of C as an unpack template, on an internally UTF8
encoded string, is always already wrong.

This is fairly irrelevant to the rest of the discussion, though. Just
wanted to point it out.

> My feeling is that Marc's suggestion about making 'C' and alias for
> 'U' and introducing a new template char for what 'C' does currently (O
> for octect maybe) is the right thing to do. (...)

If unpack for non-U template letters uses codepoints, then it would not
make sense to have U.
I see the fact that we DO have U as proof that they, who
implemented this in the past, thought that using codepoints for byte
operations would be wrong.

> To repeat, my feeling is that any use of the 'C' template in Perl
> 5.9.x and later will be totally incorrect and errorprone.

While that may be bad indeed, I believe that the change that has already
been applied is more dangerous.

The change assumes that it makes sense to use unpack on strings with the
UTF8 flag set. While I deny this, let's assume for a moment that it
does.

If it does make sense, there must be people doing it already, either on
purpose or accidentally (I think only the latter). Every single program
that does that will BREAK once they upgrade their perl from current
stable to current blead, because semantics changed.

I feel that changing unpack from operating on bytes to operating on
characters is theoretically unnecessary, theoretically wrong, and will
cause even more problems for people who haven't managed to keep text
data and binary data separate.

By reverting the change, backwards compatibility is guaranteed, and the
big, complex paragraphs that explain the backwards incompatibility can
be dropped from perldelta.

Instead of using codepoints, I suggest a different course:

1. Revert the change, to ensure backwards compatibility (admittedly,
   for broken code).

2. Warn when the template contains both U and byte-specific letters
   (and that's any letter except U).

3. When the template contains byte-specific letters, and the string
   unpack will operate on has the UTF8 flag set, emit a warning
   (always, not just when there are codepoints >255) and operate on the
   internal octets, ignoring that it may be the result of UTF8 encoding
   (see point 1).

(Actually, I think the U template is a mistake.
While unpack "U*" and
pack "U*" are great as list operators like ord and chr respectively,
Unicode data doesn't fit in the functionality of (un)pack at all,
because pack/unpack has always been specifically for bit and byte
packing. It is way too late to remove U now, but perhaps "U*" can be
special-cased, and every other use of U deprecated. Just thinking out
loud, now, by the way.)

(The rest is just nit picking; feel free to ignore.)

> If you were Icelandic youd probably want that funky o with a strike
> through it.

Icelandic uses ö (ouml) instead of ø (oslash). The funky latin1 word
characters for Icelandic are þ (thorn), æ (aelig) and ð (eth). And it
also has non-funky accented characters.

> If you were French youd want all the nice accented vowels and the c
> circumflex and stuff.

C cedilla :)

-- 
kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy  <sales@convolution.nl>

I don't trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
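
The two upgrade cases named in the reply, (1) and (2), can be observed with a short script. This is only a minimal sketch, not part of the original post: the variable names are made up for illustration, and it sticks to the core utf8:: helper functions. It deliberately does not call unpack on the upgraded string, because that result is exactly what differs between perl versions in this discussion.

```perl
use strict;
use warnings;

# Binary data: four octets holding a 32-bit big-endian integer.
my $bin         = pack 'N', 12345678;
my $flag_before = utf8::is_utf8($bin) ? 1 : 0;    # UTF8 flag is off

# Case (1): appending a character above 255 forces the internal upgrade.
my $mixed      = $bin . chr(256);
my $flag_after = utf8::is_utf8($mixed) ? 1 : 0;   # UTF8 flag is now on
my $chars      = length $mixed;                   # 5 characters

# The octets of the UTF-8 encoded form: 0xBC and chr(256) take two each.
my $octets = $mixed;
utf8::encode($octets);
my $octet_count = length $octets;                 # 7 octets

# Case (2): an explicit upgrade with no character above 255 involved.
# The string still compares equal; only the internal storage changed.
my $upgraded = $bin;
utf8::upgrade($upgraded);
my $same          = ($upgraded eq $bin) ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

printf "flag before=%d after=%d chars=%d octets=%d\n",
    $flag_before, $flag_after, $chars, $octet_count;
printf "explicitly upgraded: flag=%d, still eq original=%d\n",
    $flag_upgraded, $same;
```

The character count and the octet count disagree as soon as the string is upgraded, so a byte-specific template letter has to pick one of the two views.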