
Encode::Encoding and "Malformed UTF-8 character" warnings

From: Davies, Alex
Date: February 26, 2007 10:13
Subject: Encode::Encoding and "Malformed UTF-8 character" warnings
Message ID: A69AA663CE9BBC44AE1DA72483DE15DE01D35110@HQ-MAIL3.ptcnet.ptc.com

Hi All,

I've come across a problem while writing encoding layers
derived from Encode::Encoding, and I think it needs either
fixing or documenting in Encode::Encoding.

Basically, when passing Unicode data through to an encode(),
it is easy to get "Malformed UTF-8" warnings because a
multibyte UTF-8 character gets split across the boundary of a
1024-byte buffer. I was able to fix this in the code I was
writing by setting $| = 1 on the filehandle, which forces full
lines to be passed through to the encode() calls. This solved
my problem (as all my lines were <1024 bytes). I then reread
the docs and noticed the needs_lines() setting, which would
also have fixed the case where my code was breaking.

However, if you *do* have a line whose byte length is >1024,
then the string is still cut off at that point even if
needs_lines returns 1, which risks splitting a multibyte
character. This also means that an encoding
with C<sub needs_lines {1}> is susceptible to
(1) not getting complete lines, and (2) not getting complete
strings (the final bytes of a multibyte character could be
missing, or the first few bytes could be the remaining bytes
of the previously chopped multibyte character).
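
To make the split concrete, here is a minimal sketch that is independent
of the PerlIO layer. It assumes a 1024-byte buffer (the size observed
above) and simply cuts the UTF-8 octets of the test data at that offset;
the variable names are only illustrative. Decoding the chunk with
Encode's FB_QUIET check stops at the incomplete sequence and leaves the
unprocessed octets in the source variable, so the leftover count shows
how much of a character was cut off.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_QUIET);

# Same data as the demo script: 51 two-byte and 1001 three-byte characters.
my $chars = join '', map { chr } 130 .. 180, 21000 .. 22000;
my $bytes = encode('UTF-8', $chars);    # UTF-8 octets of the test data

# Assumption: chunks are cut at 1024 bytes, as observed in this report.
my $chunk_size = 1024;
my $chunk      = substr($bytes, 0, $chunk_size);

# With FB_QUIET, decode() returns what it could process and overwrites
# $chunk with the unprocessed remainder (here, the cut-off character).
my $decoded = decode('UTF-8', $chunk, FB_QUIET);

printf "decoded %d characters, %d byte(s) left over\n",
    length($decoded), length($chunk);
```

Any non-zero leftover means the cut fell inside a multibyte character,
which is the situation encode() appears to be handed here.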


Here's some code to demonstrate the "Malformed UTF-8" warning:

# %<
package Encode::Ident;
use warnings;
use strict;
use base qw(Encode::Encoding);

__PACKAGE__->Define('ident');

sub needs_lines { 1 };
 
sub encode {
	my ($obj, $str, $chk) = @_;
	my $result = $str;
	my $byte_len = bytes::length $str;
	# XXX calling length() below may trigger "Malformed UTF-8":
	my $str_len = length $str;
	print STDERR "bytes::length=$byte_len  \tlength=$str_len\n";
	$_[1] = '' if $chk; # this is what in-place edit means
	return $result;
}

sub decode { die };

##

package main;

use warnings;
use strict;
use bytes ();  # load bytes.pm explicitly (no import) so bytes::length() is defined

my $tmp_file = 'deme.tmp';

#END { unlink $tmp_file };

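# Parenthesise the calls: without parentheses, "or die" binds to the
# last argument and the error check never fires.
open(FOUT, ">", $tmp_file) or die "open: $!";

binmode(FOUT, ':encoding(Ident)') or die "binmode: $!";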

if (0) { # mimic needs_lines
	select( (select(FOUT), $| = 1)[0] );
}

# A long (>1024 bytes) string of multibyte UTF-8 characters:
my $long_uni = join '', map chr, 130..180, 21000..22000;

print STDERR "Data length=", length($long_uni), "\n";
print STDERR "Data bytes::length=", bytes::length($long_uni), "\n\n";

# Pass the string through to Encode::Ident::encode()
print FOUT $long_uni, "\n";

close FOUT;
# >%

This gives me:

Data length=1052
Data bytes::length=3105

Malformed UTF-8 character (unexpected end of string) at
D:\src\dev\exe\encoding_bug.pl line 15.
bytes::length=1024  	length=358
Malformed UTF-8 character (unexpected end of string) at
D:\src\dev\exe\encoding_bug.pl line 15.
bytes::length=1024  	length=342
bytes::length=1024  	length=342
bytes::length=34  	length=12

-

It would be preferable if C<encode()> could be guaranteed
to receive a well-formed string; but if this is not
feasible then it should be mentioned in the docs
for Encode::Encoding, perhaps with a suggested method
(e.g. Encode::CN::HZ::encode) of rebuilding the malformed
parts into a (well-formed UTF-8) string.
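
As a rough illustration of what such rebuilding could look like, here is
a sketch that works on raw UTF-8 octets. It is not taken from
Encode::CN::HZ, and the names split_complete_utf8 and $pending are
hypothetical. The idea is to hold back an incomplete trailing sequence
and prepend it to the next chunk. Inside a real encode() one would first
need to get at the octets of the incoming string (for example with
Encode::_utf8_off), which is an assumption about how a layer might do it
and is not shown here.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK);

# Hypothetical helper: split a byte string into (complete, incomplete_tail),
# where the tail is a UTF-8 sequence that was cut off at the end, if any.
sub split_complete_utf8 {
    my ($octets) = @_;
    my $len = length $octets;
    for my $back (1 .. 3) {                  # a lead byte 4 back is complete
        last if $back > $len;
        my $byte = ord substr($octets, -$back, 1);
        next if ($byte & 0xC0) == 0x80;      # continuation byte, keep looking
        last if $byte < 0x80;                # ASCII, nothing is cut off
        my $need = $byte >= 0xF0 ? 4 : $byte >= 0xE0 ? 3 : 2;
        return (substr($octets, 0, $len - $back), substr($octets, -$back))
            if $need > $back;                # sequence is incomplete
        last;
    }
    return ($octets, '');
}

# Simulate 1024-byte chunks (assumed buffer size) of the demo data.
my $text   = join '', map { chr } 130 .. 180, 21000 .. 22000;
my $octets = encode('UTF-8', $text);

my ($pending, $rebuilt) = ('', '');
for (my $pos = 0; $pos < length $octets; $pos += 1024) {
    my $chunk = $pending . substr($octets, $pos, 1024);
    (my $complete, $pending) = split_complete_utf8($chunk);
    $rebuilt .= decode('UTF-8', $complete, FB_CROAK);   # dies if malformed
}

print "round trip ", ($rebuilt eq $text ? "ok" : "FAILED"), "\n";
```

The helper only inspects the last three octets, so the cost per chunk is
constant; anything still left in $pending at end of input would be a
genuinely truncated character.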

Many thanks,
alex.


