
Re: Problems with Perl Asian encodings?

From:
Ciaran Hamilton
Date:
May 17, 2007 02:29
Subject:
Re: Problems with Perl Asian encodings?
Message ID:
464C1FD1.2090303@tnauk.org.uk
Hi,

Samuel L. Bayer wrote:
> So the outcome was that there's a mode in GNU recode which will drop 
> these illegal first bytes. So the question is: is the same thing 
> possible in Perl Encode? The documentation for some of the FB_ variables 
> is tempting, but pretty opaque.

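Yes, the way to do it is with Encode::FB_QUIET. With that check value, 
decode() stops at the first byte it can't handle, returns what it has 
decoded so far, and leaves the unprocessed remainder in the source 
variable. So if $text holds the bytes you want to decode into a Perl 
(UTF-8) string, this should do the trick: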

-----
use Encode;

my $textcopy = $text;   # decode() below consumes $text, so keep the original here
my $encoding = "gb2312";

my $decoded = decode($encoding, $text, Encode::FB_QUIET);

while ($text ne "") {   # this loops while we've still got bad characters to deal with.
   ### my $badbyte = substr($text, 0, 1);   # $badbyte now contains the invalid byte.
   ### my $hex = sprintf("%02X", ord($badbyte));   # always two hex digits
   ### print STDERR "Invalid character \\x$hex in input - dropping.\n";
   $text = substr($text, 1);   # skip over the bad character
   $decoded .= decode($encoding, $text, Encode::FB_QUIET);
}

binmode(STDOUT, ":encoding(UTF-8)");   # avoids "Wide character in print" warnings
print "Output: $decoded\n";
-----

The code as given silently drops every bad byte and prints no 
warnings; if you want a warning for each dropped byte, uncomment the 
lines marked with ###. It depends what you want your code to do. :D
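
If you need this in more than one place, the same loop could be wrapped 
in a subroutine along these lines. This is only a rough sketch: the name 
decode_dropping_bad_bytes and the sample input are made up for 
illustration and are not part of Encode. It returns the decoded string 
plus a list of the bytes it dropped, so the caller can decide whether to 
warn about them.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper (not part of Encode): decode $octets from $encoding,
# dropping every byte that decode() cannot handle.
sub decode_dropping_bad_bytes {
    my ($encoding, $octets) = @_;   # $octets is a copy; the caller's data is untouched
    my $decoded = '';
    my @dropped;

    while (length $octets) {
        # With FB_QUIET, decode() stops at the first bad byte and leaves
        # the unprocessed remainder in $octets.
        $decoded .= decode($encoding, $octets, Encode::FB_QUIET);
        last unless length $octets;

        # Whatever is left starts with the bad byte: remove it and remember it.
        push @dropped, substr($octets, 0, 1, '');
    }
    return ($decoded, \@dropped);
}

# Made-up sample input: ASCII text with one stray \xFF byte in the middle.
my ($text, $dropped) = decode_dropping_bad_bytes('gb2312', "abc\xFFdef");
printf STDERR "Invalid character \\x%02X in input - dropping.\n", ord $_
    for @$dropped;
```

Returning the dropped bytes instead of printing inside the loop keeps the 
warning policy in the caller's hands.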

Hope this helps!

  - Ciaran.
