Encode::Encoding and "Malformed UTF-8 character" warnings
From:
Davies, Alex
Date:
February 26, 2007 10:13
Subject:
Encode::Encoding and "Malformed UTF-8 character" warnings
Message ID:
A69AA663CE9BBC44AE1DA72483DE15DE01D35110@HQ-MAIL3.ptcnet.ptc.com
Hi All,
I've come across a problem in writing encoding layers
derived from Encode::Encoding and think this either
needs fixing or documenting in Encode::Encoding.
Basically, when passing Unicode data through to an encode(),
it is easy to get "Malformed UTF-8" warnings from
multibyte UTF-8 characters being split in the middle of a 1024
byte buffer. Now I was able to fix this in the code I was
writing by setting $| = 1 on the channel to force full lines
to be passed through to the encode() calls. This solved my
problem (as all my lines were <1024 bytes). I then reread
the docs and noticed the needs_lines() setting, which would
also have fixed the case where my code was breaking.
However, if you *do* have a line whose byte length is >1024,
then it still cuts the string off at that point even if
needs_lines is set as 1 - thus risking splitting a
multibyte character. This also means that an encoding
with C<sub needs_lines {1}> is susceptible to
(1) not getting complete lines, and (2) not getting complete
strings (the final bytes of a multibyte character could be
missing, or the first few bytes are the remaining bytes of
the previously chopped multibyte character).
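The arithmetic behind the split can be checked in isolation, without any
encoding layer involved. The following is only a sketch: it assumes the
buffer cuts at exactly 1024 octets, and it works on a copy of the data
holding raw UTF-8 octets (via utf8::encode) rather than on the character
string that encode() actually receives.

```perl
use strict;
use warnings;

# Same data as the demonstration script: 51 two-byte characters
# followed by 1001 three-byte characters (in UTF-8).
my $chars = join '', map { chr } 130 .. 180, 21000 .. 22000;

# Work on a copy holding the UTF-8 octets rather than characters.
my $bytes = $chars;
utf8::encode($bytes);

# Cut where a fixed 1024-byte buffer would cut (an assumption
# about the buffer size, taken from the behaviour described above).
my $chunk = substr($bytes, 0, 1024);

# utf8::decode() returns false (and leaves its argument alone)
# when the octets are not well-formed UTF-8.
my $copy = $chunk;
my $well_formed = utf8::decode($copy) ? 1 : 0;

# 51 * 2 = 102 bytes, leaving 922 bytes for three-byte characters:
# 307 whole characters (921 bytes) plus 1 stray lead byte.
my $stray = (1024 - 51 * 2) % 3;

printf "total bytes: %d\n", length $bytes;
printf "first chunk well-formed: %s\n", $well_formed ? 'yes' : 'no';
printf "stray bytes at end of first chunk: %d\n", $stray;
```

The 51 + 307 = 358 whole characters in that first chunk agree with the
length=358 that the script reports for its first encode() call.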
Here's some code to demonstrate the "Malformed UTF-8" warning:
# %<
package Encode::Ident;
use warnings;
use strict;
use base qw(Encode::Encoding);
__PACKAGE__->Define('ident');
sub needs_lines { 1 };
sub encode {
my ($obj, $str, $chk) = @_;
my $result = $str;
my $byte_len = bytes::length $str;
# XXX calling length() below may trigger "Malformed UTF-8":
my $str_len = length $str;
print STDERR "bytes::length=$byte_len \tlength=$str_len\n";
$_[1] = '' if $chk; # this is what in-place edit means
return $result;
}
sub decode { die };
##
package main;
use warnings;
use strict;
my $tmp_file = 'deme.tmp';
#END { unlink $tmp_file };
open FOUT, ">", $tmp_file or die "open: $!";
binmode FOUT, ':encoding(Ident)' or die "binmode: $!";
if (0) { # mimic needs_lines
select( (select(FOUT), $| = 1)[0] );
}
# A long (>1024 bytes) string of multibyte UTF-8 characters:
my $long_uni = join '', map chr, 130..180, 21000..22000;
print STDERR "Data length=", length($long_uni), "\n";
print STDERR "Data bytes::length=", bytes::length($long_uni), "\n\n";
# Pass the string through to Encode::Ident::encode()
print FOUT $long_uni, "\n";
close FOUT;
# >%
This gives me:
Data length=1052
Data bytes::length=3105
Malformed UTF-8 character (unexpected end of string) at
D:\src\dev\exe\encoding_bug.pl line 15.
bytes::length=1024 length=358
Malformed UTF-8 character (unexpected end of string) at
D:\src\dev\exe\encoding_bug.pl line 15.
bytes::length=1024 length=342
bytes::length=1024 length=342
bytes::length=34 length=12
-
It would be preferable if C<encode()> could guarantee
that it is passed a well-formed string; but if this is not
feasible then this should be mentioned in the docs
for Encode::Encoding with perhaps a suggested method
(e.g. Encode::CN::HZ::encode) of rebuilding the malformed
parts into a (well-formed UTF-8) string.
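One generic way to do such rebuilding is to hold back an incomplete
trailing sequence and prepend it to the next chunk. The sketch below is
not taken from Encode::CN::HZ or any other Encode module. The helper
name split_incomplete_tail is hypothetical, and the code assumes it is
handed raw UTF-8 octets in 1024-byte pieces, so it would need adapting
before use inside a real encode() method, which receives a character
string.

```perl
use strict;
use warnings;

# Matches a UTF-8 lead byte at the very end of an octet string that is
# followed by fewer continuation bytes than it announces.
my $INCOMPLETE_TAIL = qr{
    (?: [\xC2-\xDF]                     # 2-byte lead, continuation missing
      | [\xE0-\xEF] [\x80-\xBF]?        # 3-byte lead, 0 or 1 of 2 present
      | [\xF0-\xF4] [\x80-\xBF]{0,2}    # 4-byte lead, 0 to 2 of 3 present
    ) \z
}x;

# Hypothetical helper: returns (complete part, held-back tail).
sub split_incomplete_tail {
    my ($bytes) = @_;
    my $tail = ($bytes =~ s/($INCOMPLETE_TAIL)//) ? $1 : '';
    return ($bytes, $tail);
}

# Same test data as the demonstration script, as UTF-8 octets.
my $chars = join '', map { chr } 130 .. 180, 21000 .. 22000;
my $bytes = $chars;
utf8::encode($bytes);

my ($carry, $rebuilt) = ('', '');
for (my $pos = 0; $pos < length $bytes; $pos += 1024) {
    # Prepend whatever was held back from the previous chunk.
    my $chunk = $carry . substr($bytes, $pos, 1024);
    ($chunk, $carry) = split_incomplete_tail($chunk);

    # Each trimmed chunk should now decode cleanly on its own.
    utf8::decode($chunk) or die "chunk at offset $pos is still malformed";
    $rebuilt .= $chunk;
}
die "dangling bytes at end of input" if length $carry;

print $rebuilt eq $chars ? "round trip ok\n" : "mismatch\n";
```

The pattern only inspects the last few octets, so it does not validate
the rest of the chunk; that is left to utf8::decode here.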
Many thanks,
alex.