develooper Front page | perl.perl5.porters | Postings from May 2013

Re: How on earth did we manage to break pack() so badly?

Aristotle Pagaltzis
May 1, 2013 16:01
* demerphq <> [2013-05-01 16:35]:
> It used to be nice and safe to do this:
> print unpack("H*", $_),"\n"; # lets see what the string looks like in the raw.

unpack is not Devel::Peek.

> On older perls it produces a relatively useful:
> c39fc480
> which as we all know is the hex output of the raw UTF8 form of the
> string. On newer perls it produces the completely useless:
> df00
> Which is not correct regardless of how you look at it. The older
> behavior was at least correct in some regard.

To me the new output looks half correct, the old one utterly broken.
This isn’t a step backward so much as a step sideways.

> There are two important things to note here, first, the "v" part of
> the string has been silently upgraded, completely breaking it as a
> shortint. Any external code designed to inter-operate with a program
> using this structure will be broken.

You cannot use the string buffer of a scalar without looking at its UTF8
flag! If the scalar has utf8=on then you must decode the string buffer
to get its meaning back out of it. If you blindly give someone else
a pointer to a string buffer without ever checking the UTF8 flag on it,
then YOUR code is broken.

> The second point is that debugging this stuff is hard, as Perl "hides"
> some of the problem by being "clever" about filehandle discipline:

It’s not being clever-in-quotation-marks but correct-as-in-correct.

> when we print the code point 81 which is internally represented in
> utf8 as "\302\201" perl's output layer downgrades it, without warning,
> back to the correct 81.

Exactly as it should, because there’s a chr 0x81 in there, and it should
print a chr 0x81.
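
To make that concrete, here is a minimal sketch (not from the original exchange) that prints the same one-character string twice, once stored plainly and once upgraded, to in-memory filehandles with no encoding layer. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# Two copies of the same one-character string, chr 0x81.
my $plain    = "\x81";
my $upgraded = "\x81";
utf8::upgrade($upgraded);   # buffer is now "\xC2\x81" with the UTF8 flag on

# Print each copy to its own in-memory filehandle (no encoding layer).
my ($out_plain, $out_upgraded) = ('', '');
open my $fh1, '>', \$out_plain    or die $!;
open my $fh2, '>', \$out_upgraded or die $!;
print {$fh1} $plain;
print {$fh2} $upgraded;
close $fh1;
close $fh2;

# In Perl land the two strings are equal, and the same byte is written.
print $plain eq $upgraded ? "equal\n" : "differ\n";
print unpack("H*", $out_plain), " ", unpack("H*", $out_upgraded), "\n";
```

This prints "equal" and then "81 81": the internal representation differs, the string and the bytes written do not.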

> Anyway, the bottom line is that there appears to be NO way to get pack
> to operate on the binary representation of a string.

ENCODE THE STRING. That is the correct way! Assuming the string you were
talking about is text. If you have text, then you do not have bytes. You
have characters. To go from characters to bytes you need to encode. Just
that! Nothing more! And nothing less – and never less: you need to do
this always.

And if you had bytes in the first place? Then to hell with the binary
representation. The binary representation does not matter! Not if you
are in Perl land.

And if it *does* matter to you, because you are dealing with scalars at
the XS level? Then you MUST look at the UTF8 flag and downgrade-if-necessary!
(As I said in the beginning of the mail.)

If you are actually looking at the string buffer then you need to look
at the UTF8 flag. And if you aren’t looking at the string buffer, then
you *must not* look at the UTF8 flag.

You must look at both, or at neither; never just the one or the other.

> Given the routine is partly intended to make it easier to interoperate
> with things like C I consider this a really serious regression.

Are you trying to write XS or are you just doing I/O?

> I cannot express how unhappy I am to find out about these changes. The
> lack of analytic depth behind these changes is staggering (the
> implication on things like v/a should have been immediately obvious).

With all due respect, the lack of analytic depth here might be somewhere
else. :-)

> I cannot believe that we let the "there is no such thing as binary
> data" mob paint us into such a ridiculous position.

Maybe it is your own understanding that has painted “us” (for smaller
values of “us” than you thought) into a corner?

> So lets assume I want the old behavior of pack. How can I get it? My
> current understanding is that there is no way to get it at all. The
> best I could come up with is something like this:
> use Encode;
> sub string_as_hex {
>   my $str= shift;
>   if (utf8::is_utf8($str)) {
>      return unpack "H*", Encode::decode_utf8($str);
>   } else {
>      return unpack "H*", $str;
>   }
> }

This will not work. You are trying to shoehorn both byte strings and
character strings into $str (which is where your confusion is actually
located), then you are turning to the UTF8 flag to help undo your confusion
by asking it something it cannot tell you. The result is an unfixably
broken routine.

You MUST know whether your string contains characters or bytes, and if
you need bytes, then if (and ONLY if) and when (and ALWAYS when) the
string contains *characters* you MUST encode it. (No matter whether it
has the UTF8 flag set or not!) Then, after that step, you have a byte
string, and you can treat its contents as a string of bytes. (No matter
whether it has the UTF8 flag set or not!)

All you *can* do is this:

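    use Encode;

    # For strings of characters: encode first, then hex-dump the bytes.
    sub character_string_as_hex {
        my ($str, $encoding) = @_;
        return unpack "H*", Encode::encode($encoding//'UTF-8', $str);
    }

    # For strings of bytes: hex-dump as is.
    sub byte_string_as_hex { unpack "H*", shift }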

And then you must keep track of which of your variables are to contain
characters and which are to contain bytes, and use the appropriate function.
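
A usage sketch of that bookkeeping, for illustration only. It repeats the two helpers so that it runs on its own; the sample values and variable names are assumptions, with the text string chosen to match the c39fc480 example from earlier in the thread.

```perl
use strict;
use warnings;
use Encode ();

sub character_string_as_hex {
    my ($str, $encoding) = @_;
    return unpack "H*", Encode::encode($encoding // 'UTF-8', $str);
}

sub byte_string_as_hex { return unpack "H*", shift }

# The programmer, not the UTF8 flag, knows which is which.
my $text  = "\x{df}\x{100}";   # characters U+00DF and U+0100
my $bytes = "\x01\x00\xff";    # three raw bytes

my $text_hex  = character_string_as_hex($text);              # UTF-8 by default
my $utf16_hex = character_string_as_hex($text, 'UTF-16BE');  # explicit encoding
my $bytes_hex = byte_string_as_hex($bytes);

# Upgrading the byte string changes its storage, not its contents.
my $stored_upgraded = $bytes;
utf8::upgrade($stored_upgraded);
my $upgraded_hex = byte_string_as_hex($stored_upgraded);

print "$text_hex $utf16_hex $bytes_hex $upgraded_hex\n";
```

On a perl where unpack works on characters (5.10 and later) this prints "c39fc480 00df0100 0100ff 0100ff". The last two agree because the byte string was never inspected for its flag.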

> Which seems to be a pretty poor solution to me. Considering the "there
> is no such thing as binary data" mob is always banging on about
> "representation shouldn't matter, strings are strings" it seems pretty
> crappy to require us to inspect the utf8 flag on pretty much any pack
> operation that operates on strings.

How do you come up with the impression that the same “mob” that has been
forever banging on the “never EVER look at the UTF8 flag (in Perl land)”
is in fact somehow *requiring* you to look at it – and constantly at that?

> Seems like in attempting to fix one set of perceived problems we just
> shifted the problem elsewhere, and IMO made it worse.

No. We just have an incomplete solution, because Perl only has a single
scalar type, which must house both byte and character strings. So we can
do nothing more for the programmer than a) 100% consistency in assigning
no semantics to the UTF8 flag and b) telling them that it’s their job to
decide for every single function in their program what kind of string it
expects, and then only passing that kind of string to it.

You are complaining about what has been done to achieve (a) here.

> Anyway, I want pack to be able to pack an arbitrary string

I think the answer is “you can’t”. The reason is what I just said – just
looking at a string does not tell you whether it contains characters or bytes.
Not even by checking the UTF8 flag, because that is a red herring.

> without a) ending up with a utf8 on packed string

Downgrade byte strings and encode text strings *before* packing.
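
Roughly like this, as a sketch under the assumption that you know $text holds characters and $blob holds bytes (both names and values are invented for the example):

```perl
use strict;
use warnings;
use Encode ();

my $text = "\x{df}\x{100}";   # a character string
my $blob = "\x81\x00";        # a byte string ...
utf8::upgrade($blob);         # ... that happens to be stored upgraded

# Text: encode to bytes. Bytes: downgrade (dies if a char > 255 is present).
my $payload = Encode::encode('UTF-8', $text);
utf8::downgrade($blob);

# Two length-prefixed fields: 16-bit little-endian count, then the bytes.
my $packed = pack "v/a* v/a*", $payload, $blob;

print unpack("H*", $packed), "\n";
print utf8::is_utf8($packed) ? "flag on\n" : "flag off\n";
```

This prints "0400c39fc48002008100" and "flag off": the length prefixes count bytes, and the packed result is a plain byte string. The is_utf8 call is only there to show the outcome; the code does not branch on it.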

> b) without it corrupting binary data structures like "v/a*",

It doesn’t.

> c) where the output is not correct.

Gotta do something about that. Maybe do what `print` and co do and treat
ord > 255 chars as UTF8 byte sequences with a “Wide character” warning.
Which will give *still* other output than what you complained about, and
not really useful output at that, but at least it warns, which is what
matters, since what it really should do is throw an exception – except
that nothing else does that. The warning can be fatalised, so, OK.
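
For reference, this is a sketch of what `print` does today with a wide character on a handle that has no encoding layer, and of how the warning can be made fatal. It only illustrates existing `print` behaviour, not any change to pack; the variable names are made up.

```perl
use strict;
use warnings;

my $wide = "\x{100}";   # one character with ord > 255

# 1. Default: print warns and writes the UTF-8 byte sequence.
my @warnings;
my $out = '';
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open my $fh, '>', \$out or die $!;
    print {$fh} $wide;
    close $fh;
}
my $out_hex = unpack "H*", $out;

# 2. Fatalised: the same print now throws instead of warning.
my $died = '';
{
    use warnings FATAL => 'utf8';
    my $sink = '';
    open my $fh, '>', \$sink or die $!;
    eval { print {$fh} $wide; 1 } or $died = $@;
}

print "bytes written: $out_hex\n";
print "warning: $warnings[0]";
print "fatal:   $died";
```

The first print writes c480 and emits "Wide character in print"; under `use warnings FATAL => 'utf8'` the same message arrives as an exception.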

> How do I get it? Do I start adding new patterns to pack? Do I start
> reverting the patches responsible for this insane behavior for 5.20?

I’ll tell you how: join the mob! You would be welcome here… and it’s
good to be a gangster. :-)

Aristotle Pagaltzis // <>
