
How on earth did we manage to break pack() so badly?

From:
demerphq
Date:
May 1, 2013 14:32
Subject:
How on earth did we manage to break pack() so badly?
Message ID:
CANgJU+UU-pLJ7HmtQ0Cq96kySg+FeNRb+_v7oqKRyVrQ+u696Q@mail.gmail.com
It used to be nice and safe to do this:

print unpack("H*", $_),"\n"; # let's see what the string looks like in the raw.


This is no longer an effective debugging technique. It will NOT tell
you what your string looks like. It takes a "daddy knows best"
attitude and tries to do the right thing depending on whether the data
is utf8 or the data is not. Which means that this:

perl -le'print unpack "H*", "\x{DF}\x{100}"'

Produces completely different results depending on which Perl you are
on. On older perls it produces a relatively useful:

c39fc480

which, as we all know, is the hex output of the raw UTF-8 form of the
string. On newer perls it produces the completely useless:

df00

Which is not correct regardless of how you look at it. The older
behavior was at least correct in some regard.
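The two outputs can be put side by side on a newer perl. This is only a sketch: the variable names are made up for illustration, and it uses utf8::encode on a copy to get the octets, which is one way to see the raw form and not necessarily what older perls did internally.

```perl
use strict;
use warnings;

my $str = "\x{DF}\x{100}";

# Character-level view: what newer perls report. Code points above 0xFF
# are wrapped to a single byte, which is why U+0100 shows up as "00".
my $char_view = do {
    no warnings;    # silence the "wrapped in unpack" warning
    unpack("H*", $str);
};

# Octet-level view: encode a copy to UTF-8 octets first, then unpack.
my $octets = $str;
utf8::encode($octets);
my $octet_view = unpack("H*", $octets);

print "characters: $char_view\n";
print "octets:     $octet_view\n";
```

On a newer perl the first line should match the "df00" above and the second the "c39fc480" that older perls printed.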

I remember some of the discussion about pack doing the wrong
thing when strings are accidentally upgraded, and I had the impression
that we were only going to change a few minor aspects. Instead it seems we
have changed so much that pack is now a) heavily broken in terms of
regression failures, and b) relatively useless for various purposes where
it is heavily used.

Consider another example:

pack "v/a", $string;

This should produce a string with a short int length, followed by the
appropriate number of bytes. However, in modern perls, if the string has
the utf8 flag on, it does not:

$ perl -MDevel::Peek -wle'my $a= "a" x 129; utf8::upgrade($a); print(
my $msg= pack("v/a", $a)); Dump($msg);' | hexdump -C
SV = PV(0x778e150) at 0x77a4398
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x77b4840
"\302\201\0aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"\0
[UTF8 "\x{81}\x{0}aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"]
  CUR = 132
  LEN = 136
00000000  81 00 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |..aaaaaaaaaaaaaa|
00000010  61 61 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |aaaaaaaaaaaaaaaa|
*
00000080  61 61 61 0a                                       |aaa.|
00000084

There are two important things to note here. First, the "v" part of
the string has been silently upgraded, completely breaking it as a
short int. Any external code designed to inter-operate with a program
using this structure will be broken.

The second point is that debugging this stuff is hard, as Perl "hides"
some of the problem by being "clever" about filehandle discipline:
when we print the code point 0x81, which is internally represented in
utf8 as "\302\201", perl's output layer downgrades it, without warning,
back to the single byte 0x81.
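The hiding effect can be seen without hexdump by comparing the internal octet count with what actually reaches a filehandle. A small sketch, assuming an in-memory filehandle with no encoding layer behaves like STDOUT here; the variable names are illustrative only:

```perl
use strict;
use warnings;

# The two-character "v" prefix for length 129, upgraded to the internal
# UTF-8 representation (as pack leaves it in the example above).
my $prefix = "\x81\x00";
utf8::upgrade($prefix);

# Print it to an in-memory filehandle that has no encoding layer.
my $written = '';
open(my $fh, '>', \$written) or die "cannot open in-memory handle: $!";
print {$fh} $prefix;
close($fh);

require bytes;
my $internal_len = bytes::length($prefix);    # octets stored internally
my $written_len  = length($written);          # octets that reached the handle

printf "internal: %d octets, written: %d octets (%s)\n",
    $internal_len, $written_len, unpack("H*", $written);
```

The internal form holds "\302\201\0" while the handle receives only 81 00, which is why the hexdump above looks fine even though Dump shows an upgraded string.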

Anyway, the bottom line is that there appears to be NO way to get pack
to operate on the binary representation of a string. Given that the routine
is partly intended to make it easier to interoperate with things like
C, I consider this a really serious regression.
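For what it is worth, one partial workaround is to turn the string into explicit octets before it ever reaches pack, so that the result cannot come back with the utf8 flag on. The sketch below assumes UTF-8 octets are what the receiving side expects; it does not change pack itself, and the variable names are illustrative:

```perl
use strict;
use warnings;

# Same input as the Devel::Peek example: 129 "a"s with the utf8 flag on.
my $payload = "a" x 129;
utf8::upgrade($payload);

# Encode a copy to UTF-8 octets. This turns the utf8 flag off, so pack
# sees plain bytes and "v/" counts octets.
my $octets = $payload;
utf8::encode($octets);

my $msg = pack("v/a", $octets);

printf "utf8 flag: %s, length: %d, prefix: %s\n",
    (utf8::is_utf8($msg) ? "on" : "off"), length($msg), unpack("H4", $msg);
```

This still means touching every call site by hand, so it is a stopgap, not a substitute for pack doing the expected thing.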

I cannot express how unhappy I am to find out about these changes. The
lack of analytic depth behind these changes is staggering (the
implications for things like v/a should have been immediately obvious).
I cannot believe that we let the "there is no such thing as binary
data" mob paint us into such a ridiculous position.

So let's assume I want the old behavior of pack. How can I get it? My
current understanding is that there is no way to get it at all. The
best I could come up with is something like this:

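use Encode;

# Return the hex of the raw UTF-8 octets for utf8-flagged strings,
# and the hex of the bytes as-is otherwise.
sub string_as_hex {
  my $str= shift;
  if (utf8::is_utf8($str)) {
     # encode (not decode): characters -> UTF-8 octets
     return unpack "H*", Encode::encode_utf8($str);
  } else {
     return unpack "H*", $str;
  }
}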

Which seems to be a pretty poor solution to me. Considering the "there
is no such thing as binary data" mob is always banging on about
"representation shouldn't matter, strings are strings" it seems pretty
crappy to require us to inspect the utf8 flag on pretty much any pack
operation that operates on strings. Seems like in attempting to fix
one set of perceived problems we just shifted the problem elsewhere,
and IMO made it worse.

Anyway, I want pack to be able to pack an arbitrary string without a)
ending up with the utf8 flag on the packed string, b) corrupting
binary data structures like "v/a*", or c) producing output that is not
correct. How do I get it? Do I start adding new patterns to pack? Do I
start reverting the patches responsible for this insane behavior for
5.20?

cheers
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
