
[perl #72534] $PerlIO::encoding::fallback corrupts UTF-8 output

From: James E Keenan via RT
Date: May 26, 2013 16:57
Subject: [perl #72534] $PerlIO::encoding::fallback corrupts UTF-8 output
Message ID: rt-3.6.HEAD-2650-1369587456-796.72534-15-0@perl.org

On Thu Feb 04 06:26:52 2010, loic.etienne@tech.swisssign.com wrote:
> To: perlbug@perl.org
> Subject: $PerlIO::encoding::fallback corrupts UTF-8 output
> Reply-To: loic.etienne@tech.swisssign.com
> Message-Id: <5.10.0_4674_1265291836@dev0003.int.swisssign.net>
> 
> This is a bug report for perl from loic.etienne@tech.swisssign.com,
> generated with the help of perlbug 1.36 running under perl 5.10.0.
> 
> 
> -----------------------------------------------------------------
> Setting
>     $PerlIO::encoding::fallback = 0x0400;
> before
>     binmode(STDOUT, ':encoding(UTF-8)');
> may corrupt the UTF-8 output of print STDOUT
> when a multi-byte UTF-8 character straddles two output buffers.
> Each part of the split multi-byte character is written as XML
> entities, although the byte sequence itself is valid UTF-8.
> 
> Example: &#xc3;&#x84; instead of the corresponding bytes.
> 
> IMHO, the encoding fallback should apply only to input, not to
> output, since perl itself generates the bytes to be written.
> A corrupted UTF-8 sequence can only occur if perl's internal string
> handling is buggy (very unlikely).
> 
> Code to reproduce the bug (assuming that the output buffer size is
> 1024 bytes):
> 
> use strict;
> use warnings;
> 
> use PerlIO::encoding;
> 
> #
> # 00C4 Ä LATIN CAPITAL LETTER A WITH DIAERESIS
> # 2-byte UTF-8 sequence 0xC3 0x84
> #
> my $two_bytes_in_utf8 = chr(0xC4);
> 
> #
> # The following $string is constructed in such a way that
> # the last UTF-8 character of $string overlaps the output buffer
> # boundary:
> # ... <0xC3 0x84> 0xC3 |buffer boundary| 0x84 <0xC3 0x84> ...
> #     <utf8-char> oops                   oops <utf8-char>
> #
> # Note that $string itself is internally represented in ISO-8859-1
> # but converted to UTF-8 by the output layer :encoding(UTF-8)
> #
> # The output buffer is assumed to consist of 1024 bytes, thus 'x 512'.
> # Use a value higher than 'x 512' on systems with a bigger output
> # buffer size.
> #
> my $string = 'a' . $two_bytes_in_utf8 x 512;
> 
> $PerlIO::encoding::fallback = 0x0400; # xml entities
> binmode(STDOUT, ':encoding(UTF-8)');
> 
> # wrong output (&#xc3;&#x84; at the end)
> print STDOUT "$string\n";
> 
> # correct output
> syswrite(STDOUT, "$string\n");
> 
> # Note that if
> #     binmode(STDOUT, ':encoding(UTF-8)');
> # occurred before
> #     $PerlIO::encoding::fallback = 0x0400;
> # then the output of both print and syswrite would be correct.
> 
> 
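
The byte arithmetic behind the report can be checked without PerlIO at
all. What follows is a minimal sketch, not part of the original report.
It assumes the 1024-byte buffer size that the reporter mentions
($buffer_size, $first and $rest are names introduced here for
illustration) and uses Encode::encode to produce the UTF-8 bytes of the
test string, then looks at where a 1024-byte boundary would fall.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption taken from the report: the output buffer holds 1024 bytes.
my $buffer_size = 1024;

# Same test string as in the report: one ASCII byte followed by
# 512 copies of U+00C4, each of which is 0xC3 0x84 in UTF-8.
my $string = 'a' . chr(0xC4) x 512;
my $bytes  = encode('UTF-8', $string);

# Split the encoded bytes where the assumed buffer boundary falls.
my $first = substr($bytes, 0, $buffer_size);
my $rest  = substr($bytes, $buffer_size);

printf "total bytes:                %d\n", length($bytes);
printf "last byte of first chunk:   0x%02X\n", ord(substr($first, -1));
printf "first byte of second chunk: 0x%02X\n", ord(substr($rest, 0, 1));
```

The first chunk ends on the lead byte 0xC3 and the second chunk starts
on the continuation byte 0x84, so neither chunk is valid UTF-8 on its
own. If the real buffer is larger, change $buffer_size and keep the
repeat count at half of it so that the last character still straddles
the boundary.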

This problem persists in Perl 5.18.0.  Can someone familiar with PerlIO,
etc. take a crack at this?

Thank you very much.
Jim Keenan


---
via perlbug:  queue: perl5 status: new
https://rt.perl.org:443/rt3/Ticket/Display.html?id=72534
