perl.i18n | Postings from May 2007

Problems with Perl Asian encodings?

From: Samuel L. Bayer
Date: May 10, 2007 12:31
Subject: Problems with Perl Asian encodings?
Message ID: 46437126.8030505@mitre.org

All -

I'm having a problem that perhaps someone here can cast some light on. 
For a very long time, a project I work on has been using GNU recode 3.6 
to transcode a wide range of encodings into UTF-8, including some of the 
more common Korean, Japanese and Chinese encodings (e.g., SJIS, gb2312, 
EUC-KR). For efficiency reasons, we've been looking at moving to the 
Perl Encode module, which we already use for windows-1252 because of 
corruption issues with Arabic with GNU recode.

I compared a batch of about 120,000 documents for which I had both the 
original and the output of GNU recode, and discovered a (relatively 
small) number of differences, roughly 4,000 documents. In almost all of 
those cases, the Perl Encode output appears to be inferior. The vast 
majority of the differing documents are gb2312; however, many of the 
other Asian encodings have sporadic problems. When I examine the 
documents for differences, I typically find that Perl Encode has 
introduced some stray "unknown" characters at various points in the 
document, while the GNU recode version is clean.
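
A comparison of this kind could be sketched roughly as follows. This is 
only an illustration in Python, not the Perl Encode or GNU recode 
tooling used in the project. It assumes the stray "unknown" characters 
are U+FFFD (the Unicode replacement character), and the names 
count_unknown and compare_outputs, along with the sample data, are 
hypothetical.

```python
REPLACEMENT = "\ufffd"  # assumption: "unknown" characters show up as U+FFFD


def count_unknown(utf8_bytes: bytes) -> int:
    """Count U+FFFD in a UTF-8 document.

    Invalid UTF-8 is decoded with errors="replace", so malformed bytes
    are counted as unknown characters too.
    """
    return utf8_bytes.decode("utf-8", errors="replace").count(REPLACEMENT)


def compare_outputs(reference: dict, candidate: dict) -> list:
    """Report documents whose two transcoded outputs differ.

    Both arguments map a document id to UTF-8 bytes. Each report entry is
    (doc_id, unknowns_in_reference, unknowns_in_candidate).
    """
    report = []
    for doc_id in sorted(reference.keys() & candidate.keys()):
        if reference[doc_id] != candidate[doc_id]:
            report.append(
                (
                    doc_id,
                    count_unknown(reference[doc_id]),
                    count_unknown(candidate[doc_id]),
                )
            )
    return report


# Made-up sample data: document "a" differs, document "b" is identical.
reference = {"a": "\u4e2d\u6587".encode("utf-8"), "b": b"ok"}
candidate = {"a": "\u4e2d\ufffd".encode("utf-8"), "b": b"ok"}
report = compare_outputs(reference, candidate)
for doc_id, ref_bad, cand_bad in report:
    print(doc_id, ref_bad, cand_bad)
```

Sorting such a report by the difference between the two counts would 
bring the documents where one transcoder did worst to the top.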

Has anyone else done such a comparison of GNU recode and Perl Encode? 
I'd very much prefer to move to Perl, not simply for efficiency but 
because, unlike GNU recode, it appears to be actively maintained; 
however, the error rate is just too high, especially considering that 
the GNU recode output looks clean, and our users have not complained 
about it.

Any comments or advice would be welcome. I'm using Perl 5.8.7 (I know, 
it's not the latest version, but it's part of a very stable 
configuration that the project doesn't want to vary).

Thanks in advance -
Sam Bayer
The MITRE Corporation
sam@mitre.org

P.S. My familiarity with encoding issues is not extensive, and one thing 
that occurred to me was that there may be an encoding name conflict 
between GNU recode and Perl Encode that was leading to the differences 
I was seeing. However, in the first two cases I examined, no encoding 
known to Perl Encode for the given languages (Chinese and Japanese) 
yielded the same (clean) output as GNU recode.
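
The check described here, decoding the same bytes under each candidate 
encoding name and seeing which gives the cleanest result, might look 
something like the sketch below. It uses Python's built-in codecs as a 
stand-in, so the encoding names and their exact coverage are Python's 
and may not match those of Perl Encode or GNU recode. The helper 
rank_candidates and the one-character sample are hypothetical.

```python
REPLACEMENT = "\ufffd"


def rank_candidates(raw: bytes, candidates: list) -> list:
    """Decode raw under each candidate codec and count U+FFFD.

    Returns (unknown_count, codec_name) pairs, cleanest first.
    """
    results = []
    for name in candidates:
        text = raw.decode(name, errors="replace")
        results.append((text.count(REPLACEMENT), name))
    return sorted(results)


# Illustrative sample only: U+9555 is encodable in GBK but is not part of
# the GB2312 character set, so a strict GB2312 decoder cannot map it.
sample = "\u9555".encode("gbk")
ranking = rank_candidates(sample, ["gb2312", "gbk", "gb18030"])
for unknown_count, name in ranking:
    print(name, unknown_count)
```

A low count only shows that a decoder accepted the bytes, not that it 
chose the right characters, so the output still needs to be compared 
against a known-good version.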
