[perl #113088] Data::Dumper::Useqq('utf8') broken [PATCH]

Jim Avera
May 25, 2012 18:07
Data::Dumper contains support for encoding non-ASCII characters
as themselves, not \x{...} escapes.  This is controlled by setting
Useqq() to one of the special values 'iso8859', 'utf8', or '8bit'.

The code is commented as "not supported...SUBJECT TO CHANGE".  Fair 
enough. But it's currently completely broken, and I think the fix is 
simple (patch below).

Early in sub qquote() there is the following:

   my $bytes; { use bytes; $bytes = length }
   s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;

This replaces all wide characters and the upper half of the
single-octet range with \x{...} escapes before the encoding-support
code is reached.  Therefore, the encoding support can't do anything
useful; all the "interesting" characters have already been escaped.

I suspect those lines were added to speed up dumping of huge binary
blobs which are not really printable strings.  However, it seems wrong to
test for #chars != #bytes, because binary data _should_ be passed
as byte strings, that is, with Perl's internal utf8 flag off.
In that case #chars == #bytes and the optimization would not happen anyway.
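
The length comparison can be checked outside Data::Dumper.  This is only a
minimal sketch: the helper name char_and_byte_len and the sample string are
made up for illustration and do not appear in Dumper.pm.  It compares the
two lengths for a character string and for its UTF-8 encoded octets:

```perl
use strict; use warnings;
use Encode ();

# Return (character count, internal byte count) for a string.
# Hypothetical helper, not part of Data::Dumper.
sub char_and_byte_len {
    my ($s) = @_;
    my $nbytes; { use bytes; $nbytes = length $s }
    return (length($s), $nbytes);
}

my $chars  = "caf\x{e9} \x{263a}";              # character string, utf8 flag on
my $octets = Encode::encode('UTF-8', $chars);   # byte string, utf8 flag off

for my $s ($chars, $octets) {
    my ($nchars, $nbytes) = char_and_byte_len($s);
    printf "is_utf8=%d chars=%d bytes=%d bytes>chars=%s\n",
        (utf8::is_utf8($s) ? 1 : 0), $nchars, $nbytes,
        ($nbytes > $nchars ? "yes" : "no");
}
```

This should print "is_utf8=1 chars=6 bytes=9 bytes>chars=yes" for the
character string and "is_utf8=0 chars=9 bytes=9 bytes>chars=no" for the
octets, so the existing test fires only for flagged strings that contain
non-ASCII characters.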

So I propose fixing this by changing the above code to

   s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge
     unless utf8::is_utf8($_);

This will make the "fast exit" occur for
   . character strings which contain only ASCII characters
   . binary strings with no values below \x20 (space)

Strings with non-ASCII characters (or bytes < \x20, if a binary string)
will fall through to the encoding-support code.
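
To make the proposed condition concrete, here is a rough standalone sketch
of just that first step.  The sub name pre_escape is hypothetical and it is
not the real qquote(); it only applies the substitution under the proposed
"unless utf8::is_utf8" guard:

```perl
use strict; use warnings;
use Encode ();

# Hypothetical stand-in for the first step of qquote() under the proposal:
# escape non-ASCII values only when the string is NOT flagged as utf8.
sub pre_escape {
    my ($s) = @_;
    $s =~ s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge
        unless utf8::is_utf8($s);
    return $s;
}

my $chars  = "smiley \x{263a}";                 # character string, flag on
my $octets = Encode::encode('UTF-8', $chars);   # byte string, flag off

print pre_escape($octets), "\n";   # prints: smiley \x{e2}\x{98}\x{ba}
print pre_escape($chars) eq $chars ? "chars untouched\n" : "chars changed\n";
```

The byte string comes back fully escaped, while the character string is
passed through unchanged, which is what lets the later encoding-support
code see the wide character.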

Here is a patch, followed by a test program (sorry about line-wraps; 
Thunderbird is not nice...):

--- Data/	2012-05-25 19:06:41.775175838 +0000
+++ Data/	2012-05-26 00:43:04.632097843 +0000
@@ -680,8 +680,9 @@
  sub qquote {
    local($_) = shift;
-  my $bytes; { use bytes; $bytes = length }
-  s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
+  { use utf8;
+    s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge unless utf8::is_utf8($_);
+  }
    return qq("$_") unless
      /[^ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/;  # fast exit

use strict; use warnings;
use utf8;
use Encode ();
use Data::Dumper;
binmode STDOUT, ':encoding(utf-8)';

my $chars = "Hello world \N{U+263A} \x{7F}\n";  # smiley DEL
my $octets = Encode::encode('utf-8', $chars);

utf8::is_utf8($chars) && print "chars is_utf8\n";
utf8::is_utf8($octets) && print "octets is_utf8\n";

print "length(chars)=", length($chars), "\n";
print "length(octets)=", length($octets), "\n";

print Data::Dumper->new([$octets],['*octets'])->Useqq(1)->Dump;
print Data::Dumper->new([$chars],['*chars'])->Useqq(1)->Dump;
print Data::Dumper->new([$octets],['*octets'])->Useqq('utf8')->Dump;
print Data::Dumper->new([$chars],['*chars'])->Useqq('utf8')->Dump;
exit 0;

# ---OUTPUT---
# chars is_utf8
# length(chars)=16
# length(octets)=18
# $octets = "Hello world \x{e2}\x{98}\x{ba} \177\n";
# $chars = "Hello world \x{263a} \177\n";
# $octets = "Hello world \x{e2}\x{98}\x{ba} \177\n";
# $chars = "Hello world ☺ \177\n"; # need a utf-8 terminal to see this

