Marc Lehmann wrote 2007-03-31 0:41 (+0200):
> The reason I wanna know is because I want to know what to tell
> people. Either it is "your code is broken, unpack "C" without downgrade
> is a bug in your code" or "it is a bug in perl, you can work around by
> enabling ->shrink for the time being".

If a downgrade is "needed", it means that your byte string was
accidentally upgraded. This should only happen if you mix it with a text
string. If it happens without mixing it with a text string, that is a bug.
Please report.

So, neither "your code is broken, unpack "C" without downgrade is a bug in
your code" nor "it is a bug in perl". Instead: "your code is broken, don't
mix text strings with byte strings" or "it is a bug in perl that your
string got upgraded in the first place."

> Exactly. But "C" somehow works on UTF-8, while it shouldn't.

Agreed! Things that specifically handle bytes, and bytes only, should DIE
(or at least warn) when used with a string that has the UTF-8 flag on.

This still lets users get away with naively assuming that byte ==
character for latin1 strings, as designed, but at least catches the cases
where you know that the user is doing something stupid.

> It should work on characters, as documented (just like in C, char
> array[]; array[i] is one character, regardless of how many bits a
> character in C has, or how it is encoded).

A C "char" is a byte, not a multibyte character, ever.

Besides that, the "C" in Perl's pack() is documented as a single byte. I
think that "char value" should be either removed from perlfunc, or
explained in more detail. It's NOT OBVIOUS to those who don't know C.

> > * The chr and ord functions work on characters
> >     chr(1).chr(20).chr(300).chr(4000) eq v1.20.300.4000
> > In other words, chr and ord are like pack("U") and unpack("U"), not like
> > pack("C") and unpack("C"). In fact, the latter two are how you now emulate
> > byte-orientated chr and ord if you're too lazy to use bytes.
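
To make the quoted description concrete, here is a minimal sketch of how
chr/ord and pack("U") line up, with pack("C") left for byte values. The
variable names are only illustrative and do not come from the thread.

```perl
use strict;
use warnings;

# chr() builds characters from code points, including those above 255.
my $text = chr(1) . chr(20) . chr(300) . chr(4000);

# A v-string literal is built from the same ordinals.
my $same_as_vstring = $text eq v1.20.300.4000;

# pack("U*") is the list form of chr(); ord() reads each code point back.
my $packed_u = pack("U*", 1, 20, 300, 4000);
my @ordinals = map { ord } split //, $text;

# pack("C") is for a single byte value in the range 0..255.
my $byte = pack("C", 65);

print "chr matches v-string\n" if $same_as_vstring;
print "pack U matches chr\n"   if $packed_u eq $text;
print "ordinals: @ordinals\n";
print "byte value: ", unpack("C", $byte), "\n";
```

Only numbers are printed here, so no output layer is needed for the
characters above 255.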
> So due to that documentation insanity it is now suggested that all code that
> used "C" before must use "U" now to get the same effect as in earlier perl
> versions?

The earlier Perl versions didn't support character values greater than
255, and if you never have those characters, C still works perfectly.

But yes, if you're dealing with characters and want your program to be
able to handle those fancy new >255 characters, you should change that C
to a U.

> Besides, perl 5.8 does not follow that description:
>    perl -e '$x = "\xc3\xbc"; die unpack "U*", $x'
> This gives me 195188, two characters, although it is a single UTF-8
> character, so why does it wrongly give me two? $x certainly is utf-8-encoded
> (try Encode::encode_utf8 chr 252, it results in the above string).

You asked for the codepoints U+00C3 and U+00BC, and got them. It's a UTF-8
encoded byte string, alright, but "U" is for Unicode, not UTF-8.

> Ok, so I will tell people to replace "C" by "U" in their code then.

If they do Unicode text strings, that's indeed very good advice. But you
still want C for byte strings, simply because some protocols or formats
expect a byte value. :)

> Right, while the documentation on unpack "U" disagrees with it, as it talks
> about UTF-8.

That would be a bug, but I can't find it in my copy (5.8.8). It only says
"Encodes to UTF-8 internally" for pack(), which as far as I can tell, is
true.
-- 
kind regards,

juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
convolution:     ict solutions and consultancy <sales@convolution.nl>

I do not trust voting computers. See <http://www.wijvertrouwenstemcomputersniet.nl/>.
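
For reference, the difference between the UTF-8 encoded byte string from
the one-liner and the text string it encodes can be sketched as follows.
This assumes the core Encode module and uses illustrative variable names;
it deliberately avoids unpack "U" so that it does not depend on a
particular Perl version's pack/unpack behaviour.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# The byte string from the one-liner: the UTF-8 encoding of U+00FC.
my $bytes = "\xc3\xbc";

# Seen as a byte string it has two elements, 0xC3 and 0xBC.
my @byte_values = map { ord } split //, $bytes;

# Decoding turns the two bytes into a text string of one character.
my $text        = decode('UTF-8', $bytes);
my @code_points = map { ord } split //, $text;

print "bytes: @byte_values\n";
print "code points: @code_points\n";

# Encoding the text string again gives back the original bytes.
print "round trip ok\n" if encode_utf8($text) eq $bytes;
```

The two byte values are the 195 and 188 that ran together as "195188" in
the quoted output, and the single decoded code point is the 252 that was
passed to Encode::encode_utf8.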