Re: BOMs as noncharacters

Johan Vromans
August 18, 2011 07:35
Tom Christiansen writes:

> They are certainly discouraged in UTF-8 streams, where they not only 
> serve no purpose but also interfere with catenating streams together
> in a chain:
>     cat file1.utf8 file2.utf8 file3.utf8 > all.utf8
> *only* works correctly when those files have no out-of-band metadata
> BOMs at their fronts, with the possible exception of the first.

Yes, and --unfortunately-- no.

Simply 'open and read a file' is not possible without knowing what the
content (or rather, the encoding) of the file is. The same applies to
cat. Cat concatenates series of bytes. Nothing more. Nothing less.

I cannot do

  cat /usr/bin/vi /bin/rm > nifty.program

and expect nifty.program to edit a file and then remove it. cat is dumb,
by design. This makes it possible to do things like:

  cat unpacker code > installer.program
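
To make the "series of bytes" point concrete, here is a minimal Perl
sketch. The two "files" are made-up byte strings, each starting with the
UTF-8 encoding of U+FEFF (EF BB BF). Plain byte concatenation, which is
all cat does, leaves the second BOM sitting in the middle of the result.

```perl
use strict;
use warnings;

# Two hypothetical "files", held as raw byte strings. Each starts with the
# UTF-8 encoding of U+FEFF, as a BOM-writing editor would produce.
my $bom   = "\xEF\xBB\xBF";
my $file1 = $bom . "first\n";
my $file2 = $bom . "second\n";

# What cat does: append bytes to bytes, without looking at them.
my $all = $file1 . $file2;

# Record the byte offset of every BOM sequence in the result.
my @offsets;
my $pos = 0;
while ((my $i = index($all, $bom, $pos)) >= 0) {
    push @offsets, $i;
    $pos = $i + length $bom;
}

print "BOMs at byte offsets: @offsets\n";   # prints: BOMs at byte offsets: 0 9
```

The BOM at offset 9 is exactly what cat was asked to produce. Whether it
is harmless or not depends on whoever interprets those bytes later.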

> This is the same glaring flaw that occurs when Microsoft people
> create a malformed text file that doesn't end in a newline.
>     cat file1.txt file2.txt file3.txt > all.txt
> If the first three files hold 10 lines apiece, then the final file
> *must* hold 30 lines.

Same here. Lines do not exist in byte streams. If you want to
concatenate lines, use a line concatenating tool.

(To be clear: I do think that writing TEXT files without a final newline
is stupid.)
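
Such a line concatenating tool is not hard to sketch. The following is
only an illustration: the name cat_lines is made up, it assumes "\n" line
endings, and it does no decoding at all, so it is only safe for
ASCII-compatible encodings such as UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the lines of each named file to $out, supplying
# a final newline where a file lacks one. Works on raw bytes and assumes
# "\n" line endings.
sub cat_lines {
    my ($out, @files) = @_;
    for my $file (@files) {
        open my $in, '<:raw', $file or die "Cannot open $file: $!";
        while (my $line = <$in>) {
            $line .= "\n" unless $line =~ /\n\z/;   # repair a missing final newline
            print {$out} $line;
        }
        close $in;
    }
    return;
}

# Usage sketch: perl catlines.pl file1.txt file2.txt file3.txt > all.txt
if (@ARGV) {
    binmode STDOUT;
    cat_lines(\*STDOUT, @ARGV);
}
```

With this, three files of 10 lines apiece yield 30 lines of output, even
if some of them were written without a final newline.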

> However, if either or both of the first two files have been
> negligently shorted their final newline, this is completely screwed up, and
> you accidentally create a single line in the output where there had been
> two of them in the input, and your output's line count no longer
> corresponds to that of your input.

Moreover, if the first file was terminated with some EOF marker then
many tools would not read beyond this marker no matter how many lines
there were to follow.

> That's why you should [...] never put a BOM at the start of (nor
> anywhere in) a UTF-8 file.  Sloppy Microsoft people tend to be guilty of
> both sins and often simultaneously, thereby needlessly making all of our
> lives more difficult.  Just say no.

I say no -- to this reasoning. 

We have come a long way, from ASCII via 'Extended' ASCII to Unicode. In the
Unicode world, one can no longer process a text file without knowing
what the encoding is. (Actually, this was true for Extended ASCII as
well.) A BOM helps identify some of the possible encodings. However, our
current IO systems are still equipped for byte operations only. Okay, we
can specify an encoding using a PerlIO layer, but that's only part of
the job. What we need is an augmented IO system that can handle BOMs.

  use open IN => ':encoding(auto)', OUT => ':encoding(UTF-16LE+BOM)';
  print <>;

This will happily concatenate, correctly, files with BOMs.
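
Until an IO system like that exists, something similar can be
approximated by hand with the core Encode module. The sketch below is an
illustration only: decode_with_bom and cat_with_boms are made-up names,
and falling back to UTF-8 when no BOM is present is an assumption, not
something a BOM can tell you.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# BOM byte sequences mapped to Encode names. UTF-32LE must be tested before
# UTF-16LE, because FF FE 00 00 also begins with FF FE.
my @BOMS = (
    [ "\xFF\xFE\x00\x00" => 'UTF-32LE' ],
    [ "\x00\x00\xFE\xFF" => 'UTF-32BE' ],
    [ "\xEF\xBB\xBF"     => 'UTF-8'    ],
    [ "\xFF\xFE"         => 'UTF-16LE' ],
    [ "\xFE\xFF"         => 'UTF-16BE' ],
);

# Hypothetical helper: decode one file's raw bytes to characters, using the
# BOM if present and falling back to UTF-8 (an assumption) otherwise.
sub decode_with_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($bom, $encoding) = @$entry;
        if (index($bytes, $bom) == 0) {
            my $rest = substr($bytes, length $bom);   # drop the BOM itself
            return decode($encoding, $rest);
        }
    }
    return decode('UTF-8', $bytes);
}

# Hypothetical helper: join the decoded texts and emit UTF-16LE with a
# single BOM at the front.
sub cat_with_boms {
    my (@byte_strings) = @_;
    my $text = join '', map { decode_with_bom($_) } @byte_strings;
    return "\xFF\xFE" . encode('UTF-16LE', $text);
}

# Usage sketch: perl bomcat.pl file1 file2 file3 > all.utf16
if (@ARGV) {
    my @contents;
    for my $file (@ARGV) {
        open my $in, '<:raw', $file or die "Cannot open $file: $!";
        local $/;                                     # slurp the whole file
        my $data = <$in>;
        push @contents, defined $data ? $data : '';
    }
    binmode STDOUT;
    print cat_with_boms(@contents);
}
```

Note that BOM sniffing is itself ambiguous: a UTF-16LE file whose first
character after the BOM is U+0000 looks like UTF-32LE to this code.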

-- Johan
