develooper Front page | perl.unicode | Postings from October 2010

Detecting malformed characters in files opened with '<:encoding(something)'

October 3, 2010 16:29
Dear List,

Various places in the Perl docs say, with good and sufficient reason, that a UTF-8 file should be opened for reading with '<:encoding(utf8)' rather than '<:utf8'.

The thing is, nowhere can I find documented what happens when a malformed character is encountered, or how to affect this. The perluni* documentation (intro, tut, code, and faq) deals only with Encode::decode(), where the CHECK argument is exposed. The 'perldoc -f open', 'perldoc -f readline', and 'perldoc perlop' documentation is, to my reading at least, equally silent on the handling of malformed characters. The last two say that operating system errors from reads show up in $!, but this isn't really an operating system error, and $! seems _not_ to be set on decode errors.
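For reference, the Encode::decode() case mentioned above can be sketched as follows. This is a minimal illustration rather than anything taken from the documentation verbatim; the sample bytes and variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample data: "caf" followed by a lone 0xE9 byte
# (Latin-1 e-acute), which is not a valid UTF-8 sequence.
my $bytes = "caf\xE9";

# Without a CHECK argument, decode() substitutes U+FFFD
# (REPLACEMENT CHARACTER) for the malformed byte.
my $lenient = decode('UTF-8', $bytes);

# With FB_CROAK, decode() dies on malformed input, so it is wrapped in eval.
# LEAVE_SRC asks decode() not to modify $bytes in place.
my $strict = eval {
    decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $error = $@;

printf "lenient result: %vX\n", $lenient;
print  "strict decode failed: $error" if !defined $strict;
```

The point of the contrast is that decode() lets the caller choose between substitution and an exception, which is exactly the control that appears to be missing at the I/O layer.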

My reading in this mailing list's archives uncovered PerlIO::encoding. But the default $PerlIO::encoding::fallback _ought_ to give a warning when a malformed character is encountered, and I cannot get it to do so.

I have experimented in several versions of Perl with the requisite Unicode support (5.8.8, 5.8.9, 5.10.1, 5.12.0, 5.12.1, and 5.12.2) using the attached script. All treat the malformed character as end-of-file, and none returns any sort of error that I can find, except for 5.10.1, which sets $! to 'Bad file descriptor' somewhere along the way.
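The script itself does not appear in this text. A hypothetical sketch of this kind of experiment, assuming a temporary file with one stray 0xE9 byte in the middle line, might look like the following. The helper name read_lines and the report fields are invented for illustration, and what the script prints is expected to vary between Perl versions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write raw bytes: a clean line, a line with a stray 0xE9 byte
# (invalid as UTF-8), and another clean line.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "first line\n", "caf\xE9 line\n", "third line\n";
close $tmp_fh or die "close failed: $!";

# Hypothetical helper: read every line through the given layer and
# report what came back, whether eof() is true, and what $! holds.
sub read_lines {
    my ($file, $layer) = @_;
    open my $in, "<$layer", $file or die "open failed: $!";
    my @lines;
    local $! = 0;    # clear any stale error before reading
    while (defined(my $line = <$in>)) {
        push @lines, $line;
    }
    my $report = {
        lines => \@lines,
        eof   => (eof($in) ? 1 : 0),
        errno => "$!",
    };
    close $in;
    return $report;
}

my $report = read_lines($path, ':encoding(UTF-8)');
printf "lines read: %d, eof: %d, \$!: '%s'\n",
    scalar @{ $report->{lines} }, $report->{eof}, $report->{errno};
```

If fewer than three lines come back while eof is reported and $! is empty, the reader has no way to tell a decode failure from a genuine end of file, which is the situation described above.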

So my questions are: when reading a file opened with '<:encoding(something)',

* Is the behavior on encountering a malformed character documented anywhere?

* If so, where?

* Is there a way to alter this behavior (say, by replacing the malformed data with a replacement character a la decode())?

* Is there any way for the Perl script that is doing the reading to find out why it failed to get any more data?
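One possible workaround, offered as a sketch rather than an answer to the questions about the ':encoding' layer itself, is to read raw bytes and call Encode::decode() on each line with an explicit CHECK value. The helper name read_decoded_lines is hypothetical. Splitting on newline before decoding is assumed to be safe for UTF-8, because the newline byte never occurs inside a multi-byte sequence; that assumption does not hold for encodings such as UTF-16.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: read a file as raw bytes and decode each line
# ourselves, so the caller picks the malformed-data policy via $check
# (for example Encode::FB_DEFAULT() or Encode::FB_CROAK()).
sub read_decoded_lines {
    my ($file, $check) = @_;
    open my $in, '<:raw', $file or die "open failed: $!";
    my @lines;
    while (defined(my $bytes = <$in>)) {
        # decode() may modify $bytes in place when $check is set;
        # $bytes is a fresh copy on every iteration, so that is harmless.
        push @lines, decode('UTF-8', $bytes, $check);
    }
    close $in or die "close failed: $!";
    return \@lines;
}
```

Called with Encode::FB_DEFAULT(), malformed bytes become U+FFFD and reading continues. Called with Encode::FB_CROAK() inside an eval, the first malformed byte raises an exception, and the message in $@ tells the script why it stopped getting data.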

Thank you very much for your time and attention,

Tom Wyant (mailing address to the contrary notwithstanding)