perl.perl5.porters | Postings from August 2011

Re: BOMs as noncharacters

Leon Timmermans
August 18, 2011 08:50
On Thu, Aug 18, 2011 at 5:35 PM, Johan Vromans wrote:
> We came a long way, from ASCII via 'Extended' ASCII to Unicode. In the
> Unicode world, one can no longer process a text file without knowing
> what the encoding is. (Actually, this was true for Extended ASCII as
> well.) A BOM helps identify some of the possible encodings. However, our
> current IO systems are still equipped for byte operations only. Okay, we
> can specify an encoding using a PerlIO layer, but that's only part of
> the job. What we need is an augmented IO system that can handle BOMs.

The word «some» is exactly why this is not a particularly good idea.
Not because you can't recognize UTF-8 this way, but because you can't
differentiate legacy character sets. The absence of a BOM won't tell
you if it's latin1 or KOI8-R or anything else. *If you have to make
such assumptions, you're screwed anyway*.
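
To make the point concrete, here is a minimal sketch of BOM sniffing on raw bytes. The helper name sniff_bom is made up for illustration; it is not part of PerlIO or Encode. It can name the Unicode encodings that carry a BOM, but when no BOM is present it has nothing to say, which is exactly the latin1 versus KOI8-R problem.

```perl
use strict;
use warnings;

# Known BOM byte sequences, longest first, so that UTF-32LE (FF FE 00 00)
# is not mistaken for UTF-16LE (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# sniff_bom is a hypothetical helper, not an existing library function.
# It returns the encoding name implied by a leading BOM, or undef
# (in scalar context) when the data does not start with one.
sub sniff_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        return $name if substr($bytes, 0, length $bom) eq $bom;
    }
    return;    # no BOM: could be latin1, KOI8-R, BOM-less UTF-8, ...
}
```

Even the positive answers are guesses: UTF-16LE text whose first character after the BOM is U+0000 starts with the same four bytes as a UTF-32LE BOM, and a latin1 file may legitimately begin with the bytes EF BB BF.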

>  use open IN => ':encoding(auto)', OUT => ':encoding(UTF-16LE+BOM)';

The former is currently not implementable in any sane way on PerlIO.
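
The output half, by contrast, can be approximated with layers that already exist. The sketch below assumes you simply want UTF-16LE output that starts with a BOM. The helper name write_utf16le_with_bom is made up for illustration, and as far as I know the "+BOM" suffix in the quoted line is not an encoding name that Encode recognizes.

```perl
use strict;
use warnings;

# write_utf16le_with_bom is a hypothetical helper name.
# It writes $text to $path as UTF-16LE, preceded by a BOM.
sub write_utf16le_with_bom {
    my ($path, $text) = @_;

    # :raw first, so no CRLF translation is applied to the encoded bytes.
    open my $out, '>:raw:encoding(UTF-16LE)', $path
        or die "Cannot open $path: $!";

    # U+FEFF passed through the UTF-16LE layer becomes the bytes FF FE.
    print {$out} "\x{FEFF}", $text;

    close $out or die "Cannot close $path: $!";
    return;
}
```

The BOM here is just the first character written, so it is the caller's job to emit it exactly once per file.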

