
Re: BOMs as noncharacters

From: Leon Timmermans
Date: August 18, 2011 08:50
Subject: Re: BOMs as noncharacters
Message ID: CAHhgV8hD5UmG3LLtfFSnmXsym9L8L4LnPdHWOJeaG4D5Fna_Tg@mail.gmail.com

On Thu, Aug 18, 2011 at 5:35 PM, Johan Vromans <jvromans@squirrel.nl> wrote:
> We came a long way, from ASCII via 'Extended' ASCII to Unicode. In the
> Unicode world, one can no longer process a text file without knowing
> what the encoding is. (Actually, this was true for Extended ASCII as
> well.) A BOM helps identify some of the possible encodings. However, our
> current IO systems are still equipped for byte operations only. Okay, we
> can specify an encoding using a PerlIO layer, but that's only part of
> the job. What we need is an augmented IO system that can handle BOMs.

The word «some» is exactly why this is not a particularly good idea.
Not because you can't recognize UTF-8 this way, but because you can't
differentiate legacy character sets. The absence of a BOM won't tell
you if it's latin1 or KOI8-R or anything else. *If you have to make
such assumptions, you're screwed anyway*.
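
To make the point about «some» concrete, here is a minimal sketch of BOM sniffing in Perl. The helper name sniff_bom and the signature table are illustrative assumptions, not part of PerlIO or any module mentioned in this thread. It recognises only the Unicode encodings that have a BOM; when no BOM is present it returns nothing, which is precisely the case where latin1, KOI8-R, and BOM-less UTF-8 remain indistinguishable.

```perl
use strict;
use warnings;

# BOM signatures, longest first, so that UTF-32LE (FF FE 00 00)
# is tested before UTF-16LE (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: takes the first bytes of a file and returns
# (encoding name, BOM length in bytes), or an empty list if no BOM matches.
sub sniff_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($sig, $name) = @$entry;
        return ($name, length $sig)
            if substr($bytes, 0, length $sig) eq $sig;
    }
    return;    # no BOM: the encoding is still unknown
}
```

Even a positive match is a guess: FF FE 00 00 is reported as UTF-32LE here, but it is also a valid UTF-16LE BOM followed by U+0000.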

>  use open IN => ':encoding(auto)', OUT => ':encoding(UTF-16LE+BOM)';

The former is currently not implementable in any sane way on PerlIO.
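
The output half, by contrast, can be approximated by hand today. The following is a sketch, assuming the standard :encoding(UTF-16LE) layer (which does not emit a BOM itself) and an in-memory filehandle used only to keep the example self-contained. The 'UTF-16LE+BOM' name in the quoted line is a proposal, not an existing encoding.

```perl
use strict;
use warnings;

my $buf = '';

# :raw avoids any CRLF translation; :encoding(UTF-16LE) encodes each
# character as little-endian UTF-16 but does not write a BOM on its own.
open my $out, '>:raw:encoding(UTF-16LE)', \$buf or die "open: $!";

print {$out} "\x{FEFF}";    # write the BOM explicitly as the first character
print {$out} "hi\n";

close $out or die "close: $!";

# $buf now starts with the bytes FF FE, followed by the UTF-16LE text.
```

Replacing \$buf with a file name writes the same bytes to disk.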

Leon


