develooper Front page | perl.perl5.porters | Postings from May 2012

[perl #113088] Data::Dumper::Useqq('utf8') broken [PATCH]

Thread Next
Jim Avera
May 25, 2012 18:07
[perl #113088] Data::Dumper::Useqq('utf8') broken [PATCH]
Message ID:
# New Ticket Created by  Jim Avera 
# Please include the string:  [perl #113088]
# in the subject line of all future correspondence about this issue. 
# <URL: >

This is a bug report for perl from,
generated with the help of perlbug 1.39 running under perl 5.12.4.

Data::Dumper contains support for encoding non-ASCII characters
as themselves, not \x{...} escapes.  This is controlled by setting
Useqq() to one of the special values 'iso8859', 'utf8', or '8bit'.

The code is commented as "not supported...SUBJECT TO CHANGE".  Fair 
enough. But it's currently completely broken, and I think the fix is 
simple (patch below).

Early in sub qquote() there is the following:

   my $bytes; { use bytes; $bytes = length }
   s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;

This removes all wide characters and the upper half of the
single-octet range before reaching the encoding-support code.
Therefore, the encoding-support can't do anything useful; all the
"interesting" characters have already been converted to \x{...} escapes.

I suspect those lines were added to speed up dumping of huge binary 
blobs which are not really printable strings.  However it seems wrong to
test for #chars != #bytes, because binary data _should_ be passed
as byte strings, that is, with Perl's internal utf8 flag off.
In that case #chars===#bytes and the optimization would not happen anyway.

So I'd like to propose to fix this by changing the above code to

     unless utf8::is_utf8($_);

This will make the "fast exit" occur for
   . character strings which contain only ASCII characters
   . binary strings with no values below \x20 (space)

Strings with non-ASCII characters (or bytes < \x20, if a binary string)
will fall through to the encoding-support code.

Here is a patch, followed by a test program (sorry about line-wraps; 
Thunderbird is not nice...):

--- Data/	2012-05-25 19:06:41.775175838 +0000
+++ Data/	2012-05-26 00:43:04.632097843 +0000
@@ -680,8 +680,9 @@
  sub qquote {
    local($_) = shift;
-  my $bytes; { use bytes; $bytes = length }
-  s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
+  { use utf8;
+    s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge unless 
+  }
    return qq("$_") unless
      /[^ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/;  # fast exit

use strict; use warnings;
use utf8;
use Encode ();
use Data::Dumper;
binmode STDOUT, 'encoding(utf-8)';

my $chars = "Hello world \N{U+263A} \x{7F}\n";  # smiley DEL
my $octets = Encode::encode('utf-8', $chars);

utf8::is_utf8($chars) && print "chars is_utf8\n";
utf8::is_utf8($octets) && print "octets is_utf8\n";

print "length(chars)=", length($chars), "\n";
print "length(octets)=", length($octets), "\n";

print Data::Dumper->new([$octets],['*octets'])->Useqq(1)->Dump;
print Data::Dumper->new([$chars],['*chars'])->Useqq(1)->Dump;
print Data::Dumper->new([$octets],['*octets'])->Useqq('utf8')->Dump;
print Data::Dumper->new([$chars],['*chars'])->Useqq('utf8')->Dump;
exit 0;

# ---OUTPUT---
# chars is_utf8
# length(chars)=16
# length(octets)=18
# $octets = "Hello world \x{e2}\x{98}\x{ba} \177\n";
# $chars = "Hello world \x{263a} \177\n";
# $octets = "Hello world \x{e2}\x{98}\x{ba} \177\n";
# $chars = "Hello world ☺ \177\n"; # need a utf-8 terminal to see this

Site configuration information for perl 5.12.4:

Configured by Debian Project at Tue Sep  6 08:08:24 UTC 2011.

Summary of my perl5 (revision 5 version 12 subversion 4) configuration:

     osname=linux, osvers=2.6.24-28-server, 
     uname='linux allspice 2.6.24-28-server #1 smp wed aug 18 21:17:51 
utc 2010 x86_64 x86_64 x86_64 gnulinux '
     config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN 
-Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr 
-Dprivlib=/usr/share/perl/5.12 -Darchlib=/usr/lib/perl/5.12 
-Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 
-Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local 
-Dsitearch=/usr/local/lib/perl/5.12.4 -Dman1dir=/usr/share/man/man1 
-Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 
-Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 
-Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm 
-Uusesfio -Uusenm -Ui_libutil -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -des'
     hint=recommended, useposix=true, d_sigaction=define
     useithreads=define, usemultiplicity=define
     useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
     use64bitint=define, use64bitall=define, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
     cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN 
-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include 
     optimize='-O2 -g',
     cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing 
-pipe -fstack-protector -I/usr/local/include'
     ccversion='', gccversion='4.6.1', gccosandvers=''
     intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
     ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', 
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
     libpth=/usr/local/lib /lib/x86_64-linux-gnu /lib/../lib 
/usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib /usr/lib
     libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
     perllibs=-ldl -lm -lpthread -lc -lcrypt
     libc=, so=so, useshrplib=true,
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
     cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib 

Locally applied patches:

@INC for perl 5.12.4:

Environment for perl 5.12.4:
     LANGUAGE (unset)
     LOGDIR (unset)
     PERL_BADLANG (unset)

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About