develooper Front page | perl.perl5.porters | Postings from May 2012

[perl #113088] Data::Dumper::Useqq('utf8') broken [PATCH]

Thread Next
From:
Jim Avera
Date:
May 25, 2012 18:07
Subject:
[perl #113088] Data::Dumper::Useqq('utf8') broken [PATCH]
Message ID:
rt-3.6.HEAD-7788-1337994418-1618.113088-75-0@perl.org
# New Ticket Created by  Jim Avera 
# Please include the string:  [perl #113088]
# in the subject line of all future correspondence about this issue. 
# <URL: https://rt.perl.org:443/rt3/Ticket/Display.html?id=113088 >


This is a bug report for perl from james_avera@yahoo.com,
generated with the help of perlbug 1.39 running under perl 5.12.4.

-----------------------------------------------------------------
Data::Dumper contains support for encoding non-ASCII characters
as themselves, not \x{...} escapes.  This is controlled by setting
Useqq() to one of the special values 'iso8859', 'utf8', or '8bit'.

The code is commented as "not supported...SUBJECT TO CHANGE".  Fair 
enough. But it's currently completely broken, and I think the fix is 
simple (patch below).

Early in sub qquote() there is the following:

   my $bytes; { use bytes; $bytes = length }
   s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;

This removes all wide characters and the upper half of the
single-octet range before reaching the encoding-support code.
Therefore, the encoding-support can't do anything useful; all the
"interesting" characters have already been converted to \x{...} escapes.

I suspect those lines were added to speed up dumping of huge binary 
blobs which are not really printable strings.  However it seems wrong to
test for #chars != #bytes, because binary data _should_ be passed
as byte strings, that is, with Perl's internal utf8 flag off.
In that case #chars===#bytes and the optimization would not happen anyway.

So I'd like to propose to fix this by changing the above code to

   s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge
     unless utf8::is_utf8($_);

This will make the "fast exit" occur for
   . character strings which contain only ASCII characters
   . binary strings with no values below \x20 (space)

Strings with non-ASCII characters (or bytes < \x20, if a binary string)
will fall through to the encoding-support code.

Here is a patch, followed by a test program (sorry about line-wraps; 
Thunderbird is not nice...):

--- Data/Dumper.pm.ORIG	2012-05-25 19:06:41.775175838 +0000
+++ Data/Dumper.pm	2012-05-26 00:43:04.632097843 +0000
@@ -680,8 +680,9 @@
  sub qquote {
    local($_) = shift;
    s/([\\\"\@\$])/\\$1/g;
-  my $bytes; { use bytes; $bytes = length }
-  s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
+  { use utf8;
+    s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge unless 
utf8::is_utf8($_);
+  }
    return qq("$_") unless
      /[^ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/;  # fast exit


#!/usr/bin/perl
use strict; use warnings;
use utf8;
use Encode ();
use Data::Dumper;
binmode STDOUT, 'encoding(utf-8)';

my $chars = "Hello world \N{U+263A} \x{7F}\n";  # smiley DEL
my $octets = Encode::encode('utf-8', $chars);

utf8::is_utf8($chars) && print "chars is_utf8\n";
utf8::is_utf8($octets) && print "octets is_utf8\n";

print "length(chars)=", length($chars), "\n";
print "length(octets)=", length($octets), "\n";

print Data::Dumper->new([$octets],['*octets'])->Useqq(1)->Dump;
print Data::Dumper->new([$chars],['*chars'])->Useqq(1)->Dump;
print Data::Dumper->new([$octets],['*octets'])->Useqq('utf8')->Dump;
print Data::Dumper->new([$chars],['*chars'])->Useqq('utf8')->Dump;
exit 0;

# ---OUTPUT---
# chars is_utf8
# length(chars)=16
# length(octets)=18
# $octets = "Hello world \x{e2}\x{98}\x{ba} \177\n";
# $chars = "Hello world \x{263a} \177\n";
# $octets = "Hello world \x{e2}\x{98}\x{ba} \177\n";
# $chars = "Hello world ☺ \177\n"; # need a utf-8 terminal to see this

-----------------------------------------------------------------
---
Flags:
     category=library
     severity=low
     module=Data::Dumper
---
Site configuration information for perl 5.12.4:

Configured by Debian Project at Tue Sep  6 08:08:24 UTC 2011.

Summary of my perl5 (revision 5 version 12 subversion 4) configuration:

   Platform:
     osname=linux, osvers=2.6.24-28-server, 
archname=x86_64-linux-gnu-thread-multi
     uname='linux allspice 2.6.24-28-server #1 smp wed aug 18 21:17:51 
utc 2010 x86_64 x86_64 x86_64 gnulinux '
     config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN 
-Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr 
-Dprivlib=/usr/share/perl/5.12 -Darchlib=/usr/lib/perl/5.12 
-Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 
-Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local 
-Dsitelib=/usr/local/share/perl/5.12.4 
-Dsitearch=/usr/local/lib/perl/5.12.4 -Dman1dir=/usr/share/man/man1 
-Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 
-Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 
-Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm 
-Uusesfio -Uusenm -Ui_libutil -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib 
-Dlibperl=libperl.so.5.12.4 -des'
     hint=recommended, useposix=true, d_sigaction=define
     useithreads=define, usemultiplicity=define
     useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
     use64bitint=define, use64bitall=define, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN 
-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include 
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
     optimize='-O2 -g',
     cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing 
-pipe -fstack-protector -I/usr/local/include'
     ccversion='', gccversion='4.6.1', gccosandvers=''
     intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
     ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', 
lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
     libpth=/usr/local/lib /lib/x86_64-linux-gnu /lib/../lib 
/usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib /usr/lib
     libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
     perllibs=-ldl -lm -lpthread -lc -lcrypt
     libc=, so=so, useshrplib=true, libperl=libperl.so.5.12.4
     gnulibc_version='2.13'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
     cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib 
-fstack-protector'

Locally applied patches:


---
@INC for perl 5.12.4:
     /home/jima/local/share/perl/5.12.4
     /home/jima/local/share/perl
     /home/jima/lib/perl
     /etc/perl
     /usr/local/lib/perl/5.12.4
     /usr/local/share/perl/5.12.4
     /usr/lib/perl5
     /usr/share/perl5
     /usr/lib/perl/5.12
     /usr/share/perl/5.12
     /usr/local/lib/site_perl
     .

---
Environment for perl 5.12.4:
     HOME=/home/jima
     LANG=en_US.UTF-8
     LANGUAGE (unset)
     LD_LIBRARY_PATH=/home/jima/local/lib
     LOGDIR (unset)
 
PATH=/home/jima/bin:/home/jima/local/bin:/home/jima/jima_tools/x86_64/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/bin/X11:/usr/local/bin:/opt/openoffice.org3/program:/usr/local/games:/usr/games:.
     PERL5LIB=/home/jima/local/share/perl:/home/jima/lib/perl
     PERL_BADLANG (unset)
     SHELL=/bin/bash


Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About