perl.perl5.porters | Postings from May 2013

Re: How on earth did we manage to break pack() so badly?

Dave Mitchell
May 1, 2013 15:23
On Wed, May 01, 2013 at 04:32:07PM +0200, demerphq wrote:
> It used to be nice and safe to do this:
> print unpack("H*", $_),"\n"; # lets see what the string looks like in the raw.
> This is no longer an effective debugging technique. It will NOT tell
> you what your string looks like. It takes a "daddy knows best"
> attitude and tries to do the right thing depending on whether the data
> is utf8 or the data is not. Which means that this:
> perl -le'print unpack "H*", "\x{DF}\x{100}"'
> Produces completely different results depending on which Perl you are
> on. On older perls it produces a relatively useful:
> c39fc480

But that's just leaking the internal implementation details.

> which as we all know is the hex output of the raw UTF8 form of the
> string. On newer perls it produces the completely useless:
> df00

It's not particularly useful, but it is consistent. It's reading two
characters, and displaying their values modulo 256 (since H is supposed
to issue exactly two hex digits per character).

If you want the old behaviour, but in a safe way:

    utf8::encode(my $s = "\x{DF}\x{100}");
    print unpack "H*", $s;

Really, the unpack interface was never designed to handle chars > 255.
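
To put the two views side by side, here is a minimal sketch. The variable
names are illustrative only, and the wrap warning is silenced just to keep
the output clean:

```perl
use strict;
use warnings;

my $s = "\x{DF}\x{100}";    # two characters, one of them > 255

# Character view: H emits two hex digits per character, so 0x100 wraps
# to 00. The wrap triggers a warning, silenced here for the demo.
my $wrapped = do { no warnings; unpack "H*", $s };

# Byte view: encode a copy to UTF-8 first, then unpack the bytes.
utf8::encode(my $bytes = $s);
my $encoded = unpack "H*", $bytes;

print "$wrapped\n";    # df00
print "$encoded\n";    # c39fc480
```

The original $s is left untouched; only the copy is encoded.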

> I remember some of the discussion relating to pack doing the wrong
> thing when strings are accidentally upgraded, but I had the impression
> that we were only going to change a few minor aspects, but it seems we
> have changed so much that now pack is a) heavily broken in terms of
> regression failures, b) relatively useless for various purposes where
> it is heavily used.
> Consider another example:
> pack "v/a", $string;
> This should produce a string with a short int length, followed by the
> appropriate number of bytes. However in modern perls, if the string is
> utf8 enabled it does not:
> $ perl -MDevel::Peek -wle'my $a= "a" x 129; utf8::upgrade($a); print(
> my $msg= pack("v/a", $a)); Dump($msg);' | hexdump -C
> SV = PV(0x778e150) at 0x77a4398
>   REFCNT = 1
>   PV = 0x77b4840
> "\302\201\0aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"\0
> [UTF8 "\x{81}\x{0}aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"]
>   CUR = 132
>   LEN = 136
> 00000000  81 00 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |..aaaaaaaaaaaaaa|
> 00000010  61 61 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |aaaaaaaaaaaaaaaa|
> *
> 00000080  61 61 61 0a                                       |aaa.|
> 00000084
> There are two important things to note here, first, the "v" part of
> the string has been silently upgraded, completely breaking it as a
> shortint. Any external code designed to inter-operate with a program
> using this structure will be broken.

That looks like a bug.

> The second point is that debugging this stuff is hard, as Perl "hides"
> some of the problem by being "clever" about filehandle discipline:
> when we print the code point 81 which is internally represented in
> utf8 as "\302\201" perl's output layer downgrades it, without warning,
> back to the correct 81.

If you want perl to output utf8, tell it that STDOUT supports this, e.g.
with perl -CO.
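
The same effect can be seen inside a script by choosing the layer on the
handle. The sketch below is only an illustration: it uses in-memory
filehandles (my choice, so the bytes can be inspected without hexdump) and
assumes the :utf8 layer behaves there as it would on STDOUT under -CO:

```perl
use strict;
use warnings;

my $s = "\x{81}";
utf8::upgrade($s);    # stored internally as "\302\201", still one character

# No utf8 layer: the character is written as the single byte 0x81.
open my $raw, '>', \my $raw_out or die "open failed: $!";
print {$raw} $s;
close $raw;

# With a :utf8 layer: the UTF-8 encoding of the character is written.
open my $u8, '>:utf8', \my $u8_out or die "open failed: $!";
print {$u8} $s;
close $u8;

print unpack("H*", $raw_out), "\n";    # 81
print unpack("H*", $u8_out),  "\n";    # c281
```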

> Anyway, the bottom line is that there appears to be NO way to get pack
> to operate on the binary representation of a string.

Yes there is, just make sure you're feeding it a bunch of characters with
ords < 256, by using utf8::encode/decode where appropriate.
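
For a length-prefixed field like v/a*, that could look like the sketch
below. The payload and variable names are made up for illustration, and it
assumes the receiving side wants UTF-8 bytes on the wire:

```perl
use strict;
use warnings;

my $payload = "caf\x{e9} \x{263A}";    # contains a character > 255

# Encode a copy to UTF-8 so pack only ever sees ords < 256 and the
# length prefix counts bytes, not characters.
utf8::encode(my $bytes = $payload);
my $msg = pack "v/a*", $bytes;

# Reverse the process: pull the bytes back out, then decode them.
my ($got) = unpack "v/a*", $msg;
utf8::decode($got);

printf "%d bytes, prefix %s\n", length($msg), unpack("H4", $msg);
# 11 bytes, prefix 0900
```

Because every input to pack is a plain byte string, the result does not
carry the utf8 flag and the short int prefix comes out as two bytes.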

> I cannot express how unhappy I am to find out about these changes. The
> lack of analytic depth behind these changes is staggering (the
> implication on things like v/a should have been immediately obvious).
> I cannot believe that we let the "there is no such thing as binary
> data" mob paint us into such a ridiculous position.

I think the issue can be summed up as:

* un/pack were designed in a world where ord($chr) was always < 256,
  and there was always a 1:1 mapping between chars and their byte storage;
* utf8 and unicode broke this assumption;
* the semantics of a lot of template actions are/were poorly defined for
  chars > 255, and a lot of their behaviours were broken, or broke
* some of these behaviours have now been fixed, and others still need
  fixing;
* some of those fixes clash with your mental model of how pack
  should work.

> So lets assume I want the old behavior of pack. How can I get it? My
> current understanding is that there is no way to get it at all

See my two-line example above.

> Which seems to be a pretty poor solution to me. Considering the "there
> is no such thing as binary data" mob is always banging on about
> "representation shouldn't matter, strings are strings" it seems pretty
> crappy to require us to inspect the utf8 flag on pretty much any pack
> operation that operates on strings.

As I have shown, you don't need to inspect the flag.  In perl now, a
string is just a list of ordinal numbers, where sometimes those numbers
are > 255. If you try to do packing and unpacking on such non-byte numbers,
you're going to be in a world of pain. Either avoid such strings, or use
utf8::decode/encode or pack "U" as appropriate.
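
One way to test for such strings without looking at the flag is to check
the ordinals themselves. A rough sketch follows; fits_in_bytes is a
hypothetical helper name, not anything perl provides:

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character has an ordinal < 256,
# regardless of how the string happens to be stored internally.
sub fits_in_bytes {
    my ($s) = @_;
    return $s !~ /[^\x00-\xFF]/;
}

my $plain    = "a" x 3;
my $upgraded = $plain;
utf8::upgrade($upgraded);    # changes the storage, not the characters
my $wide     = "\x{DF}\x{100}";

print fits_in_bytes($plain)    ? "ok\n" : "needs encoding\n";    # ok
print fits_in_bytes($upgraded) ? "ok\n" : "needs encoding\n";    # ok
print fits_in_bytes($wide)     ? "ok\n" : "needs encoding\n";    # needs encoding
```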

> Seems like in attempting to fix
> one set of perceived problems we just shifted the problem elsewhere,
> and IMO made it worse.

I think I disagree with you, but I could potentially be convinced with
further examples.

> Anyway, I want pack to be able to pack an arbitrary string without
> a) ending up with a utf8 on packed string, b) without it corrupting
> binary data structures like "v/a*", c) where the output is not
> correct. How do I get it? Do I start adding new patterns to pack?

I don't see any such need. Modulo bug fixing (such as v/a), I think perl
does everything you need.

> Do I
> start reverting the patches responsible for this insane behavior for
> 5.20?

No ;-)

A major Starfleet emergency breaks out near the Enterprise, but
fortunately some other ships in the area are able to deal with it to
everyone's satisfaction.
    -- Things That Never Happen in "Star Trek" #13
