
[perl #113088] Data::Dumper::Useqq('utf8') broken [PATCH]

From: Tony Cook via RT
Date: July 24, 2013 04:36
Subject: [perl #113088] Data::Dumper::Useqq('utf8') broken [PATCH]
Message ID: rt-3.6.HEAD-2552-1374640574-462.113088-15-0@perl.org

On Fri May 25 18:06:58 2012, jimav wrote:
> Data::Dumper contains support for encoding non-ASCII characters
> as themselves, not \x{...} escapes.  This is controlled by setting
> Useqq() to one of the special values 'iso8859', 'utf8', or '8bit'.
> 
> The code is commented as "not supported...SUBJECT TO CHANGE".  Fair
> enough. But it's currently completely broken, and I think the fix is
> simple (patch below).
> 
> Early in sub qquote() there is the following:
> 
>    my $bytes; { use bytes; $bytes = length }
>    s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
> 
> This removes all wide characters and the upper half of the
> single-octet range before reaching the encoding-support code.
> Therefore, the encoding-support can't do anything useful; all the
> "interesting" characters have already been converted to \x{...}
> escapes.
> 
> I suspect those lines were added to speed up dumping of huge binary
> blobs which are not really printable strings.  However it seems wrong
> to test for #chars != #bytes, because binary data _should_ be passed
> as byte strings, that is, with Perl's internal utf8 flag off.
> In that case #chars===#bytes and the optimization would not happen
> anyway.
> 
> So I'd like to propose to fix this by changing the above code to
> 
>    s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge
>      unless utf8::is_utf8($_);
> 
> This will make the "fast exit" occur for
>    . character strings which contain only ASCII characters
>    . binary strings with no values below \x20 (space)
> 
> Strings with non-ASCII characters (or bytes < \x20, if a binary
> string) will fall through to the encoding-support code.
> 
> Here is a patch, followed by a test program (sorry about line-wraps;
> Thunderbird is not nice...):
> 
> --- Data/Dumper.pm.ORIG	2012-05-25 19:06:41.775175838 +0000
> +++ Data/Dumper.pm	2012-05-26 00:43:04.632097843 +0000
> @@ -680,8 +680,9 @@
>   sub qquote {
>     local($_) = shift;
>     s/([\\\"\@\$])/\\$1/g;
> -  my $bytes; { use bytes; $bytes = length }
> -  s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
> +  { use utf8;
> +    s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge unless utf8::is_utf8($_);
> +  }
>     return qq("$_") unless
>       /[^ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/;  # fast exit
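
To make the two guard conditions in the quoted report concrete, here is a
minimal sketch that evaluates both of them on three sample strings. The
helper name guard_report and the sample strings are made up for
illustration; only "use bytes", length, and utf8::is_utf8 come from the
quoted code.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Data::Dumper): reports whether each
# guard would let the \x{...} escaping substitution run for a string.
sub guard_report {
    my ($s) = @_;
    my $chars = length $s;                       # length in characters
    my $bytes = do { use bytes; length $s };     # length of internal bytes
    return {
        chars     => $chars,
        bytes     => $bytes,
        old_guard => ($bytes > $chars)  ? 1 : 0, # current: "if $bytes > length"
        new_guard => utf8::is_utf8($s)  ? 0 : 1, # proposed: "unless utf8::is_utf8($_)"
    };
}

my %samples = (
    ascii  => "abc",          # ASCII only
    latin1 => "caf\xe9",      # byte string, utf8 flag off
    wide   => "\x{263a}",     # wide character, utf8 flag on
);

for my $name (sort keys %samples) {
    my $r = guard_report($samples{$name});
    printf "%-7s chars=%d bytes=%d old_guard=%d new_guard=%d\n",
        $name, @{$r}{qw(chars bytes old_guard new_guard)};
}
```

For the wide-character string only the current guard fires, and for the
byte strings only the proposed one does, which is the behavior change the
report describes.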

I'm inclined to reject this as is.

I think a patch that turned the "--BEHAVIOR SUBJECT TO CHANGE--" code
into a documented, tested feature would be useful.

It isn't clear to me exactly what the differences are meant to be
between Useqq='8bit' and Useqq='utf8' (there doesn't seem to be any
code difference).

Of course, with the recent XS Useqq implementation [perl #74798], there
will need to be either changes to the XS code to support this feature, or
a check to fall back to the pure-Perl implementation.
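
As a rough illustration of what running on the pure-Perl path looks like
from the caller's side, the sketch below forces it with the documented
$Data::Dumper::Useperl setting. This is not the internal check described
above, just a way to exercise qquote() directly; the helper name
dump_pure_perl is made up for the example.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: dump one value using the pure-Perl implementation
# with double-quoted output, without changing the global settings.
sub dump_pure_perl {
    my ($value) = @_;
    local $Data::Dumper::Useperl = 1;   # bypass the XS implementation
    local $Data::Dumper::Useqq   = 1;   # plain true value, not 'utf8'/'8bit'
    local $Data::Dumper::Terse   = 1;   # omit the "$VAR1 = " prefix
    return Dumper($value);
}

# With a plain true Useqq, the wide character should come out as a
# \x{263a} escape rather than as itself.
print dump_pure_perl("smile \x{263a}");
```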

Tony


---
via perlbug:  queue: perl5 status: open
https://rt.perl.org:443/rt3/Ticket/Display.html?id=113088
