perl.perl5.porters | Postings from May 2013

Re: How on earth did we manage to break pack() so badly?

May 1, 2013 16:44
On 1 May 2013 17:22, Dave Mitchell wrote:
> On Wed, May 01, 2013 at 04:32:07PM +0200, demerphq wrote:
>> It used to be nice and safe to do this:
>> print unpack("H*", $_),"\n"; # let's see what the string looks like in the raw.
>> This is no longer an effective debugging technique. It will NOT tell
>> you what your string looks like. It takes a "daddy knows best"
>> attitude and tries to do the right thing depending on whether the data
>> is utf8 or the data is not. Which means that this:
>> perl -le'print unpack "H*", "\x{DF}\x{100}"'
>> Produces completely different results depending on which Perl you are
>> on. On older perls it produces a relatively useful:
>> c39fc480
> But that's just leaking the internal implementation details.

Had this feature just been released I might agree with you. That it
did this for the entire 5.8.x line makes me think this is a weak
argument.

IOW: so what? (but said politely.)

>> which, as we all know, is the hex output of the raw UTF8 form of the
>> string. On newer perls it produces the completely useless:
>> df00
> It's not particularly useful, but it is consistent. It's reading two
> characters, and displaying their values modulo 256 (since H is supposed
> to issue exactly two hex digits per character).
> If you want the old behaviour, but in a safe way:
>     utf8::encode(my $s = "\x{DF}\x{100}");
>     print unpack "H*", $s;
> Really, the unpack interface was never designed to handle chars > 255.

I feel like this is at cross purposes to my point. What you show is
how I get a utf8 representation of an arbitrary string. What happens
if I don't care? I just want the raw buffer from perl. How do I do that
from Perl without diddling with things I shouldn't?
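
For readers following along, here is a small sketch of the two results being argued over. It assumes perl 5.10 or later, where unpack works on characters rather than on the internal buffer. The variable names are illustrative only, and the `no warnings` is there because newer perls warn when a character above 255 is wrapped.

```perl
use strict;
use warnings;

my $str = "\x{DF}\x{100}";

# Character semantics: "H*" emits two hex digits per character,
# taking each character's value modulo 256 (0x100 wraps to 0x00).
my $by_char = do {
    no warnings;    # silences the "wrapped in unpack" warning
    unpack "H*", $str;
};

# Dave's suggestion: encode a copy to UTF-8 bytes, then dump those.
utf8::encode(my $copy = $str);
my $by_byte = unpack "H*", $copy;

print "$by_char\n";    # df00
print "$by_byte\n";    # c39fc480
```

The second form always shows the UTF-8 encoding of the characters, whatever the internal representation happened to be. That is not the same thing as "show me the buffer perl is actually holding", which is the distinction drawn in the reply above.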

>> I remember some of the discussion relating to pack doing the wrong
>> thing when strings are accidentally upgraded, but I had the impression
>> that we were only going to change a few minor aspects, but it seems we
>> have changed so much that now pack is a) heavily broken in terms of
>> regression failures, b) relatively useless for various purposes where
>> it is heavily used.
>> Consider another example:
>> pack "v/a", $string;
>> This should produce a string with a short int length, followed by the
>> appropriate number of bytes. However in modern perls, if the string is
>> utf8 enabled it does not:
>> $ perl -MDevel::Peek -wle'my $a= "a" x 129; utf8::upgrade($a); print(
>> my $msg= pack("v/a", $a)); Dump($msg);' | hexdump -C
>> SV = PV(0x778e150) at 0x77a4398
>>   REFCNT = 1
>>   PV = 0x77b4840
>> "\302\201\0aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"\0
>> [UTF8 "\x{81}\x{0}aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"]
>>   CUR = 132
>>   LEN = 136
>> 00000000  81 00 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |..aaaaaaaaaaaaaa|
>> 00000010  61 61 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |aaaaaaaaaaaaaaaa|
>> *
>> 00000080  61 61 61 0a                                       |aaa.|
>> 00000084
>> There are two important things to note here, first, the "v" part of
>> the string has been silently upgraded, completely breaking it as a
>> shortint. Any external code designed to inter-operate with a program
>> using this structure will be broken.
> That looks like a bug.

To me it looks like a fundamental clash between the "we shouldn't care
about string representation" world view (which I think is BS anyway)
and the requirement that pack produce a buffer that contains a
sequence which is of a valid length.

The problem seems to me that we cannot in the general case
simultaneously have the buffer contain a valid short and also be utf8
on and not contain corrupted utf8 sequences.
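
A minimal sketch of that clash, reusing the 129-character string from the quoted one-liner. The variable names are illustrative, and whether the packed result carries the UTF8 flag depends on the perl version, so the script prints the flag rather than assuming it.

```perl
use strict;
use warnings;

my $plain = "a" x 129;          # stored as plain bytes
my $wide  = $plain;
utf8::upgrade($wide);           # same characters, UTF-8 internally

my $from_plain = pack "v/a", $plain;
my $from_wide  = pack "v/a", $wide;

# As character strings the two results compare equal ...
print "equal: ", ($from_plain eq $from_wide ? "yes" : "no"), "\n";

# ... but if the second carries the UTF8 flag, the 0x81 length byte
# is held internally as the two bytes C2 81.
print "flag:  ", (utf8::is_utf8($from_wide) ? "on" : "off"), "\n";

# Downgrading a copy recovers the layout a C reader expects.
my $bytes = $from_wide;
utf8::downgrade($bytes);
print unpack("H4", $bytes), "\n";   # 8100
print length($bytes), "\n";         # 131
```

At the character level the short is intact. The breakage only appears when something reads the internal buffer directly, or writes it out through a path that does not downgrade.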

My understanding of pack()'s original intention was to allow one to
construct C structures in Perl. As such, introducing a change which
basically means one cannot safely put a string into a C structure
without the possibility of the entire structure being corrupted
doesn't make sense.

>> The second point is that debugging this stuff is hard, as Perl "hides"
>> some of the problem by being "clever" about filehandle discipline:
>> when we print the code point 81, which is internally represented in
>> utf8 as "\302\201", perl's output layer downgrades it, without warning,
>> back to the correct 81.
> If you want perl to output utf8, tell it that STDOUT supports this, e.g.
> with perl -CO.

I don't. I want it to output the same thing it has in its buffer. I am
fine if it warns if I ask to output utf8. However if it silently
downgrades output just because it can then things are really bad. One
line it might downgrade, the next it won't. What kind of soup does one
get then?
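
The "one line downgraded, the next not" effect can be sketched with an in-memory filehandle that has no encoding layer. The helper name `bytes_written` is made up for this illustration.

```perl
use strict;
use warnings;

# Write a string to an in-memory handle with no encoding layer and
# return the bytes that arrived, as hex.
sub bytes_written {
    my ($str) = @_;
    my $buf = '';
    open my $fh, '>', \$buf or die "open: $!";
    {
        no warnings 'utf8';    # "Wide character in print"
        print {$fh} $str;
    }
    close $fh;
    return unpack "H*", $buf;
}

my $low = "\x{81}";
utf8::upgrade($low);           # internally C2 81

print bytes_written($low), "\n";          # 81   (downgraded on output)
print bytes_written("\x{100}"), "\n";     # c480 (written as UTF-8)
```

Both strings are UTF-8 internally, yet the first reaches the handle as a single byte and the second as its UTF-8 encoding.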

>> Anyway, the bottom line is that there appears to be NO way to get pack
>> to operate on the binary representation of a string.
> Yes there is, just make sure you're feeding it a bunch of characters with
> ords < 256, by using utf8::encode/decode where appropriate.

But this forces me to know about the utf8 flag every time I encode a
string. Currently you cannot mix "A", "a" or "Z" with any other pack
pattern without manually checking whether the string you are encoding is
utf8 or not, and then manually downgrading or decoding it. That
doesn't seem like a step forward at all.
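
As a rough sketch of what the "encode first" advice looks like in practice, here is a pair of hypothetical helpers (`pack_counted` and `unpack_counted` are invented names, not part of any module) that always put UTF-8 bytes behind the length prefix.

```perl
use strict;
use warnings;

# Hypothetical helper: pack a length-prefixed string whose length
# counts UTF-8 bytes, whatever the input's internal representation.
sub pack_counted {
    my ($str) = @_;
    utf8::encode(my $bytes = $str);    # copy, then encode to UTF-8 bytes
    return pack "v/a*", $bytes;
}

# Inverse: read the length-prefixed bytes and decode them again.
sub unpack_counted {
    my ($packed) = @_;
    my ($bytes) = unpack "v/a*", $packed;
    utf8::decode($bytes);
    return $bytes;
}

my $packed = pack_counted("\x{100}abc");
print unpack("H*", $packed), "\n";     # 0500c480616263

my $ok = unpack_counted($packed) eq "\x{100}abc";
print "round trip: ", ($ok ? "ok" : "mismatch"), "\n";
```

This only works if every string is treated as text to be sent as UTF-8. A string that already holds arbitrary binary data would have its high bytes re-encoded, so the caller still has to know which kind of string is being packed, which is the objection raised above.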

>> I cannot express how unhappy I am to find out about these changes. The
>> lack of analytic depth behind these changes is staggering (the
>> implication on things like v/a should have been immediately obvious).
>> I cannot believe that we let the "there is no such thing as binary
>> data" mob paint us into such a ridiculous position.
> I think the issue can be summed up as:
> * un/pack were designed in a world where ord($chr) was always < 256,
>   and there was always a 1:1 mapping between chars and their byte storage;
> * utf8 and unicode broke this assumption;
> * the semantics of a lot of template actions are/were poorly defined for
>   chars > 255, and a lot of their behaviours were broken, or broke
>   encapsulation;

I remember some of these, but I also remember that there was a lot of
debate about what was broken and which of multiple reasonable options
could be chosen. I think we chose badly.

> * some of these behaviours have now been fixed, and others still need
>   fixing.
> * Some of those fixes have clashes with your mental model of how pack
>   should work.

Much, much, much more importantly, they were regressions in terms of
behavior. Take "H*" for instance. That could have been solved by
providing a way to make H* act like it does now, instead of
providing a modifier which makes it behave like it used to.

>> So let's assume I want the old behavior of pack. How can I get it? My
>> current understanding is that there is no way to get it at all
> See my two-line example above.

But it doesn't do what I want.

>> Which seems to be a pretty poor solution to me. Considering the "there
>> is no such thing as binary data" mob is always banging on about
>> "representation shouldn't matter, strings are strings" it seems pretty
>> crappy to require us to inspect the utf8 flag on pretty much any pack
>> operation that operates on strings.
> As I have shown, you don't need to inspect the flag.  In perl now, a
> string is just a list of ordinal numbers, where sometimes those numbers
> are > 255. If you try to do packing and unpacking on such non-byte numbers,
> you're going to be in a world of pain. Either avoid such strings, or use
> utf8::decode/encode or pack "U" as appropriate.

Well, either you check the flag, or you require people to force a
particular type of string. What happens if you just want to be able to
safely round trip data along with other packed data? If I force it to
utf8 then I can't round trip safely. Ditto for downgrading.

>> Seems like in attempting to fix
>> one set of perceived problems we just shifted the problem elsewhere,
>> and IMO made it worse.
> I think I disagree with you, but I could potentially be convinced with
> further examples.
>> Anyway, I want pack to be able to pack an arbitrary string a) without
>> ending up with a utf8-on packed string, b) without it corrupting
>> binary data structures like "v/a*", and c) without the output being
>> incorrect. How do I get it? Do I start adding new patterns to pack?
> I don't see any such need. Modulo bug fixing (such as v/a), I think perl
> does everything you need.

See above.

>> Do I
>> start reverting the patches responsible for this insane behavior for
>> 5.20?
> No ;-)



perl -Mre=debug -e "/just|another|perl|hacker/"
