
Re: Problems with Perl Asian encodings?

From: Samuel L. Bayer
Date: May 14, 2007 09:16
Subject: Re: Problems with Perl Asian encodings?
Message ID: 464889C8.7060706@mitre.org

Samuel L. Bayer wrote:

> Has anyone else done such a comparison of GNU recode and Perl Encode? 
> I'd very much prefer to move to Perl, not simply for efficiency but 
> because, unlike GNU recode, it appears to be actively maintained; 
> however, the error rate is just too high, especially considering that 
> the GNU recode output looks clean, and our users have not complained 
> about it.

Hi again all -

Last week, I sent out a query about Asian encodings and Perl Encode vs. 
GNU recode. Martin Thurn graciously helped me debug this problem, and I 
can now summarize as follows, quoting Martin:

"  In the sample data you sent, in the original GB2312, right after the
word "diode", there is an octal \244 and octal \112.  Octal \244 =
decimal 164 which is not a legal first-byte in GB2312.
   Recode apparently dropped the \244 and left the \112 as-is, a capital
J.
   Encode apparently converted the \244 to a default UTF-8 "unknown
character" and left the \112 as-is, a capital J."

So the outcome is that GNU recode has a mode which drops these illegal 
first bytes, while Perl Encode by default substitutes a replacement 
character for them. So the question is: can Perl Encode be made to drop 
them as well? The documentation for some of the FB_ fallback constants 
(the CHECK argument to decode) is tempting, but pretty opaque.
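
For concreteness, here is a minimal sketch of the behavior being asked 
for, assuming that FB_QUIET works as documented: decode() returns 
whatever it managed to convert and leaves the unprocessed remainder in 
the source variable. The helper name decode_dropping_bad_bytes is 
hypothetical, not part of Encode, and "euc-cn" is assumed to be the 
Encode name that corresponds to the GB2312 data. The loop discards one 
byte each time decoding stalls, which should mimic what recode appears 
to do.

```perl
use strict;
use warnings;
use Encode qw(decode FB_QUIET);

# Hypothetical helper (not part of Encode): decode $octets from $encoding,
# silently dropping any byte that Encode reports as malformed.
sub decode_dropping_bad_bytes {
    my ($encoding, $octets) = @_;   # $octets is a private copy we may modify
    my $out = '';
    while (length $octets) {
        # With FB_QUIET, decode() stops at the first malformed sequence,
        # returns the text converted so far, and leaves the rest in $octets.
        $out .= decode($encoding, $octets, FB_QUIET);
        # Anything still left starts with an offending byte: drop it, retry.
        substr($octets, 0, 1, '') if length $octets;
    }
    return $out;
}

# Sample input modeled on the case above: "diode", then octal 244 and 112.
my $text = decode_dropping_bad_bytes('euc-cn', "diode\244\112");
```

Because every pass either finishes the input or removes at least one 
byte, the loop always terminates. Whether silently dropping bytes is 
actually the right policy for the data is a separate question.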

Again, I'm using Perl 5.8.7, with the version of Encode that ships with 
that distribution.

Thanks so much in advance -

Sam Bayer
The MITRE Corporation
sam@mitre.org

