develooper Front page | perl.perl5.porters | Postings from May 2013

Re: How on earth did we manage to break pack() so badly?

From:
Rafael Garcia-Suarez
Date:
May 2, 2013 08:35
Subject:
Re: How on earth did we manage to break pack() so badly?
Message ID:
CAMoYMM-GOL6VW_feT6efPUztWn-szRW5hqORwyWRo6M1ck9sFw@mail.gmail.com
On 1 May 2013 16:32, demerphq <demerphq@gmail.com> wrote:
> Consider another example:
>
> pack "v/a", $string;
>
> This should produce a string with a short int length, followed by the
> appropriate number of bytes. However in modern perls, if the string is
> utf8 enabled it does not:
[...]

Some more analysis of this bug (with perl 5.14.1):

~§ perl -wE 'say unpack "U0H*", pack "v/a","foo"'
0300666f6f
~§ perl -wE 'say unpack "U0H*", pack "v/a","foo\x{100}"'
0400666f6fc480

The leading 0400 in the second example is obviously wrong, since the
packed int is followed by five bytes, not four: the length counts the
four characters of the string, not the five bytes of its UTF-8
encoding. (Packing with "C0v/a" yields the same result, character mode
being the default.)

Moreover, that packed string has the UTF8 flag on, which makes no sense to me:
~§ perl -MDevel::Peek -wE 'Dump pack "v/a","foo\x{100}"'
SV = PV(0x7fa084004270) at 0x7fa084029de8
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK,UTF8)
  PV = 0x7fa083c050a0 "\4\0foo\304\200"\0 [UTF8 "\x{4}\x{0}foo\x{100}"]
  CUR = 7
  LEN = 16

Let's see what pack does when told to operate in byte mode:

~§ perl -wE 'say unpack "U0H*", pack "U0v/a","foo\x{100}"'
Character(s) in 'a' format wrapped in pack at -e line 1.
0400666f6f00

Here, the packed int 4 is correctly followed by 4 bytes, but the last
character has been wrapped to its low 8 bits (\x{100} became 00), as
the warning indicates.

However the packed string *still* has the UTF8 flag on. This is very
wrong since it's possible to generate invalid UTF8 with it:

~§ perl -MDevel::Peek -wE 'Dump pack "U0v/a","foo\x{1f0}"'
Character(s) in 'a' format wrapped in pack at -e line 1.
SV = PV(0x7ff4a1804270) at 0x7ff4a1829de8
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK,UTF8)
  PV = 0x7ff4a14050a0 "\4\0foo\360"\0Malformed UTF-8 character
(unexpected non-continuation byte 0x00, immediately after start byte
0xf0) in subroutine entry at -e line 1.
 [UTF8 "\x{4}\x{0}foo\x{0}"]
  CUR = 6
  LEN = 16

What should be done, in my opinion:
- The output of pack should never have the UTF8 flag on; that is just
not the purpose of pack.
- C0<length>/<format> should be fixed so that the packed length
correctly reflects the length of the following data.
