
unpack - a CPAN statistic

From:
Marc Lehmann
Date:
March 31, 2007 13:31
Subject:
unpack - a CPAN statistic
Message ID:
20070331203036.GA26480@schmorp.de
Hi!

Today, I unpacked all 28216 distributions on my CPAN mirror (everything in
modules that ended in .tar.gz) and did a number of greps.

The word "unpack" was found 68114 times. Of those, 37436 matched the
following suboptimal regex:

   unpack\s*\(?\s*["']([^'"]+)["']

(this did not match such gems as unpack H64, ... (without the quotes)
and the small but nonzero set of unpack calls with a variable as format
string).
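
To show what that regex does and does not catch, here is a minimal
sketch that runs it over four sample lines. The first two calls are
quoted from CPAN later in this message; the variable names in the last
two ($digest, $format) are made up for illustration.

```perl
use strict;
use warnings;

# The regex from the message, unchanged.
my $re = qr/unpack\s*\(?\s*["']([^'"]+)["']/;

my @lines = (
    'unpack("CnnnnnCC", $header);',     # quoted format, with parentheses
    'unpack "Cx n C C C C", $data;',    # quoted format, no parentheses
    'unpack H64, $digest;',             # bareword format: not matched
    'unpack $format, $data;',           # format in a variable: not matched
);

my @formats;
for my $line (@lines) {
    if ($line =~ $re) {
        push @formats, $1;              # $1 is the format string
        print "matched:  $1\n";
    } else {
        print "no match: $line\n";
    }
}
```

Only the quoted-literal forms end up in @formats, which is why the
counts here undercount the real number of unpack calls.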

Of those, 11127 contained at least one "C", and 379 contained a "U"
(which was surprisingly many).

Of the 11127 unpacks that used "C", 1444 also used one of nNsSiIlLvVqQ.
As the combination of C and one of those formats gives different results
between 5.8.8 and 5.10 and 5.8.9-to-be, and as the interaction between
those is badly documented at best and is not useful for anything, that
means potentially 1444 unpack uses on CPAN are broken. They are definitely
broken when they come in contact with a UTF-X encoded scalar.
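
A minimal sketch of that situation, assuming perl 5.10.0 or later as
eventually released: the same three-octet header is unpacked with "Cn"
once as a plain byte string and once after utf8::upgrade. On such perls
unpack works on the characters of the string, so both calls should
return the same list; on 5.8.8 the second call instead sees the
internal encoding. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Three octets: 0xE9, followed by 0x0002 in big-endian order.
my $header = "\xE9\x00\x02";

# Same characters, but stored UTF-X encoded internally.
my $upgraded = $header;
utf8::upgrade($upgraded);

my @plain = unpack "Cn", $header;
my @wide  = unpack "Cn", $upgraded;

print "plain:    @plain\n";
print "upgraded: @wide\n";
print "the two strings compare equal\n" if $header eq $upgraded;
```

The two scalars are equal at the perl level, so a decoder that gives
different answers for them is peeking at internals.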

Here are a few very typical examples:

   ./p5-Palm-1.007/lib/Datebook.pm:                        unpack "Cx n C C C C", $data;
   ./RPM-Header-PurePerl-1.0.2/lib/RPM/Header/PurePerl.pm:    ) = unpack("a4CCssA66ssA16", $buff);
   ./Archie.pm:             unpack("CnnnnnCC", $header);
   ./POE-Filter-PPPHDLC-0.01/ex/poeppp.pl:      = unpack "x[$pat_prefix] CCn", $packet;
   ./IO-Mux-0.04/lib/IO/Mux/Packet.pm:     ($mb, $len, $me) = unpack("CLC", $len) ;
   ./LEGO-RCX-1.00/RCX.pm:      push @out, ( unpack "Cv", $data );
   ./PDF-API2-0.57/lib/PDF/API2/Resource/Font/Postscript.pm:    ) = unpack("vVa60vvvvvvvCCCvCvvCvvCCCCvVVVV vVVVVVVV",$buf); # PFM Header + Ext
   ./DBIx-FullTextSearch-0.73/test_data/Index.modul:                       = unpack 'Ca3A16vvccccvvVVVa3', $header;
   ./Compress-Zlib-2.004/lib/Compress/Zlib.pm:        unpack ('CCCCVCC', $$string);
   ./Bio-Das-1.00/Das/Request.pm:      = unpack("nccVcc",substr($cd,0,10));
   ./Device-Modem-1.46/lib/Device/Modem/UsRobotics.pm:        my @msg = unpack('CCCCCCCCA20CSCS', $header);
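
The classification behind those counts can be sketched roughly as
follows. This is not the actual set of greps that was run, only an
illustration of the three buckets, applied to a handful of format
strings quoted in this message.

```perl
use strict;
use warnings;

# Format strings quoted from the CPAN examples in this message.
my @formats = ('Cx n C C C C', 'CnnnnnCC', 'CLC', 'Cv', 'C*', 'C4', 'U*');

my %count = (c_mixed => 0, c_only => 0, u => 0);
for my $format (@formats) {
    my $has_c   = $format =~ /C/;
    my $has_int = $format =~ /[nNsSiIlLvVqQ]/;

    $count{u}++       if $format =~ /U/;
    $count{c_mixed}++ if $has_c && $has_int;    # the "1444" bucket
    $count{c_only}++  if $has_c && !$has_int;   # "C" without other integers
}

printf "%-8s %d\n", $_, $count{$_} for sort keys %count;
```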

The risk for the majority of those modules is low, as in most cases, data
isn't passed _in_ to them, but often modules accept data from a caller
to decode it, in which case those calls are wrong in practise (all of
those calls to unpack are broken in theory, though, as they all decode
file or packet headers of some sort, where you do expect octet-semantics,
not internal-peek-semantics).

I also looked through the ~9.6k format specifiers that used "C" (and not
"U") without any of the other integer decoders.

Again, here are a few typical examples:

   ./Gtk2-1.144/examples/color_snooper.pl:         unpack "C*",
   ./MD4/test.pl:    pack 'v*', unpack 'C*', $_[0];
   ./MP3-ID3v1Tag-1.11/lib/MP3/ID3v1Tag.pm:    unpack('a3a30a30a30a4a30C1', $buffer);
   ./NetServer-Generic-1.03/Generic.pm:    my ($peeraddr) = join(".", unpack("C4", $new_sock->peeraddr()));
   ./Net-DNS-0.45/contrib/loc2earth.fcgi:          join(".", reverse (unpack("CCCC",pack("N",$ipnum & $mask))))
   ./IO-Compress-Base-2.003/t/compress/CompTestUtils.pm:        my @array = unpack('C*', $data);
   ./Image-Pngslimmer-0.1/lib/Image/Pngslimmer.pm:                 $origbyte = unpack("C", substr($unfiltereddata, 1 + ($count * $totalwidth)  + $count_width + ...
   ./String-CRC-Cksum-0.03/Cksum.pm:            my $c = unpack 'C', substr $_[0], $i, 1;
   ./Net-SMPP-1.01/SMPP.pm:     $pdu->{schedule_delivery_time}) = unpack 'CCZ*', substr($pdu->{data}, $len);
   ./Convert-Binary-C-0.63/lib/Convert/Binary/C.pm:      buf => [ unpack 'c*', $s ],
   ./MP3-Tag-0.90/examples/extractID3v2.pl:    foreach (unpack("x6C4", $header)) {
   ./DBD-mysql-2.1017/lib/Mysql/Statement.pm:    $x =~ s/([\001-\037\177])/sprintf("\\%03o",unpack("C",$1))/eg;

I looked at most of those above as well as an additional 20 files, and
found that all uses I could find were broken, relying on "C" as if it had
"U" semantics (without high characters).

As for the 379 unpacks with "U", I found these typical examples:

   ./libintl-perl-1.14/lib/Locale/RecodeData/UTF_8.pm:     $_[1] = [ unpack "U*", $_[1] ];
   ./XML-XPathScript-0.16/t/04unicode.t:   return pack("C*",grep {$_<255} (unpack("U*",$orig)));
   ./PDF-API2-0.57/lib/PDF/API2/Resource/UniFont.pm:        foreach my $u (unpack('U*',$text))
   ./Spreadsheet-WriteExcel-2.15/doc/WriteExcel.html:    $new_str = pack 'C*', unpack 'U*', $utf8_str;
   ./RDFStore-0.31/samples/utf-test.pl:print "---->'",join('',unpack("U*",$a)),"'\n";

In all those cases, "U" was used correctly, with the old "C" semantics of
giving me one character, to get at the characters, with most such uses
having high-characters in mind.
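
For illustration, this is the kind of use meant here: "U*" returns one
number per character, including high characters, which is the same
list ord gives when applied character by character. The sample string
is made up.

```perl
use strict;
use warnings;

# One high character (U+263A), one Latin-1 character, one ASCII character.
my $text = "\x{263A}\xE9a";

my @codepoints = unpack "U*", $text;

# The same list, obtained one character at a time with ord.
my @via_ord = map { ord } split //, $text;

printf "U+%04X\n", $_ for @codepoints;
print "unpack and ord agree\n" if "@codepoints" eq "@via_ord";
```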

There were quite a few "gems", such as:

   ./Apache-LoggedAuthDBI-0.12/DBI.pm:    my @b_chars = (utf8::is_utf8($b)) ? unpack("U*", $b) : unpack("C*", $b);
   ./Encode-Arabic-1.09/index.pl:            (join " ", map { sprintf "&amp;#%d;", $_ } unpack "U*", Encode::is_utf8($_) ? $_ : decode 'utf8', $_)
   ./DBI-1.51/DBI.pm:    my @b_chars = (utf8::is_utf8($b)) ? unpack("U*", $b) : unpack("C*", $b);
   ./HTML-Template-HTX-0.07/HTX.pm:        $value = pack("U0C*", unpack("U*", $value)) unless($_[0]->{_utf8});

(quite a tribute to how perl does the wrong thing)

And somebody who actually understood that "C" is now called "U":

   ./Net-Frame-1.00/lib/Net/Frame/ARP.pm:         $self->SUPER::unpack('UUnH12a16H12a16 a*', $tail)

My summary findings are:

- 1444 unpack uses on CPAN are currently broken when confronted with binary data encoded as UTF-X, but
  the actual risk of breakage is low for most modules.
- about 9500 uses of "C" might or might not rely on the current behaviour of "C": I could
  find none that do, but found a lot of uses where "C", too, relies on the one-octet-from-string
  behaviour that current perls do not provide, and which are therefore potentially broken.

- this makes it likely that about 10k uses of unpack on CPAN are broken w.r.t. current perl
  unpack semantics
- no examples of programs relying on the new semantics have ever been found or shown.

- _1_ distribution correctly uses "U" to decode binary data.

Now, to get constructive, here is the plan to fix all those CPAN modules
for good without breaking anything known:

   1. make unpack's "C" an alias for unpack's "U": it decodes one character from the string and returns it.
      optionally make it decode an octet from the string, warning or even croaking if the character at that
      position is not an octet (not in the range 0..255).
      both of these work, as existing code does not expect to decode high characters with "C", only octets.
   2. leave "U" as it is, either as an alias for "C", or as the version of "C" that doesn't
      warn/croak when it encounters a high character.
   3. do not downgrade.
   4. do not change anything else (n, v etc.).

(that plan entails only a single change if you look closely, the change to
make "C" decode an octet from a string, perl-level semantics).
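
The croaking variant of item 1 can be sketched in plain perl as
follows. decode_octet is a hypothetical helper written for this
illustration, not an existing function and not a patch to unpack; it
only shows the perl-level semantics proposed for "C".

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: return the character at $pos as a number,
# croaking if it is not an octet (not in the range 0..255).
sub decode_octet {
    my ($string, $pos) = @_;
    croak "position $pos is outside the string" if $pos >= length $string;
    my $value = ord substr $string, $pos, 1;
    croak "character $value at position $pos is not an octet" if $value > 255;
    return $value;
}

# Two octets followed by one high character.
my $data = "\x01\xFF\x{100}";

my @ok   = map { decode_octet($data, $_) } 0, 1;
my $died = !eval { decode_octet($data, 2); 1 };

print "octets: @ok\n";
print "high character rejected\n" if $died;
```

Because it goes through substr and ord, the result does not depend on
how the scalar happens to be encoded internally.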

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcg@goof.com
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
