
Re: [perl #113088] Data::Dumper::Useqq('utf8') broken [PATCH]

From:
demerphq
Date:
June 7, 2012 15:45
Subject:
Re: [perl #113088] Data::Dumper::Useqq('utf8') broken [PATCH]
Message ID:
CANgJU+XufNarbMa+6WDamN9FAuTP1e+RaJmuA9hZC4hj_nwz_A@mail.gmail.com
On 26 May 2012 03:06, Jim Avera <perlbug-followup@perl.org> wrote:
> # New Ticket Created by  Jim Avera
> # Please include the string:  [perl #113088]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt.perl.org:443/rt3/Ticket/Display.html?id=113088 >
>
>
> This is a bug report for perl from james_avera@yahoo.com,
> generated with the help of perlbug 1.39 running under perl 5.12.4.
>
> -----------------------------------------------------------------
> Data::Dumper contains support for encoding non-ASCII characters
> as themselves, not \x{...} escapes.  This is controlled by setting
> Useqq() to one of the special values 'iso8859', 'utf8', or '8bit'.
>
> The code is commented as "not supported...SUBJECT TO CHANGE".  Fair
> enough. But it's currently completely broken, and I think the fix is
> simple (patch below).
>
> Early in sub qquote() there is the following:
>
>   my $bytes; { use bytes; $bytes = length }
>   s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
>
> This removes all wide characters and the upper half of the
> single-octet range before reaching the encoding-support code.
> Therefore, the encoding-support can't do anything useful; all the
> "interesting" characters have already been converted to \x{...} escapes.
>
> I suspect those lines were added to speed up dumping of huge binary
> blobs which are not really printable strings.  However it seems wrong to
> test for #chars != #bytes, because binary data _should_ be passed
> as byte strings, that is, with Perl's internal utf8 flag off.
> In that case #chars == #bytes and the optimization would not happen anyway.

They are only converted if the string is utf8.

I think this is an attempt to preserve Unicode semantics on the string
after serialization. I believe the idea is that \x{..} produces a
Unicode codepoint, although whether it actually does in all perls is
another matter. Data::Undump, however, *will* respect this.
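
To make the "only converted if the string is utf8" point concrete, here is a
minimal sketch of the guard quoted above. The helper name guard_fires() is
made up for illustration and is not part of Data::Dumper; it only reproduces
the "$bytes > length" test on a few sample strings.

```perl
use strict;
use warnings;

# Hypothetical helper (not in Data::Dumper): true when the internal byte
# length of $str exceeds its character length, i.e. when the
# "if $bytes > length" guard in qquote() would let the escaping run.
sub guard_fires {
    my ($str) = @_;
    my $bytes;
    { use bytes; $bytes = length($str); }   # length in bytes of the internal form
    return $bytes > length($str) ? 1 : 0;   # compare with length in characters
}

my $upgraded = "\xe9";
utf8::upgrade($upgraded);   # same character, now stored as UTF-8 internally

print "ascii:    ", guard_fires("plain"),    "\n";  # 0: one byte per character
print "binary:   ", guard_fires("\xe9\xff"), "\n";  # 0: byte string, utf8 flag off
print "wide:     ", guard_fires("\x{263A}"), "\n";  # 1: three bytes, one character
print "upgraded: ", guard_fires($upgraded),  "\n";  # 1: two bytes, one character
```

So a byte string holding high-bit octets passes through untouched, while the
same character in an upgraded string is escaped.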

> So I'd like to propose to fix this by changing the above code to
>
>   s/([^\x00-\x7f])/'\x{'.sprintf("%x",ord($1)).'}'/ge
>     unless utf8::is_utf8($_);

No, I really don't think this is a good idea.

> This will make the "fast exit" occur for
>   . character strings which contain only ASCII characters
>   . binary strings with no values below \x20 (space)
>
> Strings with non-ASCII characters (or bytes < \x20, if a binary string)
> will fall through to the encoding-support code.

I don't think this is the right fix. IMO the right fix is to use a
different routine than qquote() to handle strings for alternate
encodings.

Try using $Useqq and then doing something like

local *qquote = sub { ... };

before you call Data::Dumper. Not sure if you can override key quoting
as easily.
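
A minimal sketch of that override, under some assumptions: my_qquote() and
dump_with_custom_quoting() are made-up names, the escaping rules are only an
example, and $Data::Dumper::Useperl is set on the assumption that the
pure-Perl dumper (the code path that calls qquote()) has to be forced on
Data::Dumper versions whose XS code handles Useqq itself. Only a plain scalar
value is dumped here; hash keys are not exercised.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical replacement quoter: escapes \ " $ @ and control characters,
# and leaves every other character, including non-ASCII ones, as itself.
sub my_qquote {
    my ($str) = @_;
    $str =~ s/([\\"\$\@])/\\$1/g;                               # metacharacters
    $str =~ s/([\x00-\x1f\x7f])/sprintf('\\x{%x}', ord $1)/ge;  # control chars
    return qq("$str");
}

sub dump_with_custom_quoting {
    my ($data) = @_;
    no warnings 'redefine';
    local *Data::Dumper::qquote  = \&my_qquote;  # restored when the sub returns
    local $Data::Dumper::Useqq   = 1;
    local $Data::Dumper::Useperl = 1;   # assumption: force the pure-Perl path
    local $Data::Dumper::Terse   = 1;
    local $Data::Dumper::Indent  = 0;
    return Dumper($data);
}

binmode STDOUT, ':encoding(UTF-8)';
print dump_with_custom_quoting("caf\x{e9} \x{263A}"), "\n";
```

Because the glob is localized, the stock qquote() is back in place as soon as
dump_with_custom_quoting() returns.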

cheers
Yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
