On 1 May 2013 16:46, Nicholas Clark <nick@ccl4.org> wrote:
> On Wed, May 01, 2013 at 04:32:07PM +0200, demerphq wrote:
>> It used to be nice and safe to do this:
>>
>> print unpack("H*", $_),"\n"; # lets see what the string looks like in the raw.
>>
>>
>> This is no longer an effective debugging technique. It will NOT tell
>> you what your string looks like. It takes a "daddy knows best"
>> attitude and tries to do the right thing depending on whether the data
>> is utf8 or the data is not. Which means that this:
>>
>> perl -le'unpack "H*", "\x{DF}\x{100}"'
>>
>> Produces completely different results depending on which Perl you are
>> on. On older perls it produces a relatively useful:
>>
>> c39fc480
>
> Add U0:
>
> $ ./perl -le'print unpack "U0H*", "\x{DF}\x{100}"'
> c39fc480
>
> $ perl5.8.9 -le'print unpack "U0H*", "\x{DF}\x{100}"'
> c39fc480

Ah thanks. That fixes the hex part.

Trying it on the "v/a" part produces a corrupted utf8-on string:

$ perl -MDevel::Peek -wle'my $a= "a" x 129; utf8::upgrade($a); print( my $msg= pack("U0v/a*", $a)); Dump($msg);' | hexdump -C
Wide character in print at -e line 1.
SV = PV(0x5f75150) at 0x5f8b3f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x5f9b940 "\201\0aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"\0Malformed UTF-8 character (unexpected continuation byte 0x81, with no preceding start byte) in subroutine entry at -e line 1.
 [UTF8 "\x{0}\x{0}aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"]
  CUR = 131
  LEN = 136
00000000  81 00 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |..aaaaaaaaaaaaaa|
00000010  61 61 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |aaaaaaaaaaaaaaaa|
*
00000080  61 61 61 0a                                       |aaa.|
00000084

Trying it with "U0v/U0a" silences the warning, but produces incorrect
(and arguably broken) output:

$ perl -MDevel::Peek -wle'my $a= "a" x 129; utf8::upgrade($a); print( my $msg= pack("U0v/U0a*", $a)); Dump($msg);' | hexdump -C
SV = PV(0xd45f1d0) at 0xd475410
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0xd485960 "\0\0aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"\0 [UTF8 "\x{0}\x{0}aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"]
  CUR = 131
  LEN = 136
00000000  00 00 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |..aaaaaaaaaaaaaa|
00000010  61 61 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |aaaaaaaaaaaaaaaa|
*
00000080  61 61 61 0a                                       |aaa.|
00000084

I don't understand why the string is still utf8-on.

I also don't understand why this new behavior wasn't added by a
regression proof "opt-in" mechanism, instead of with the current
"opt-out" behavior (assuming said behavior wasn't buggy, which it is).

> (apparently today I am supposed to be observing the public holiday. Whether
> I want to or not)

Well thanks for replying! Enjoy your holiday!

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"