encodings in Pod::Simple
From: Sean M. Burke
July 4, 2003 19:08
Message ID: email@example.com
(I'm CCing the perl-i18n list on this, as this will interest some on the
list; and I may need to pick the brains of a few people on the list.)
So as I'm mulling over the implementation and tests for =encoding in
Pod::Simple, I'm starting to settle on some hopefully sane assumptions that
Pod::Simple can make, which I'd like to run past you all for comment:
* A Pod file is in one encoding. You can't have a file that's half UTF8
and half Shift-JIS.
* You can use only one =encoding directive per file. The exception is that
"redundant =encoding" commands (i.e., ones that simply redeclare the
encoding that we've already declared) are ignored. So if you had
two "=encoding iso-8859-6" commands in a file, the second one would be
silently forgiven, and ignored. But if you have a "=encoding iso-8859-6"
and later a "=encoding shiftjis", this makes the file invalid (and the Pod
processor can probably do something drastic like abort parsing the file).
* If a Pod file is in UTF16, it /must/ flag this by having a BOM at the
beginning of the file. There can be a redundant "=encoding utf16" command,
but it will be ignored. No other =encoding directives are permitted in a
UTF16 file. (In short, a BOM counts sort of like an =encoding directive,
and so it uses up your allowance of one non-redundant =encoding per file.)
* Similarly, if a Pod file is in UTF8, it /can/ signal this with a UTF8
BOM, and/or a "=encoding utf8" directive. But it's forbidden to have a
UTF8 BOM and to then have an "=encoding" line other than "=encoding utf8".
[end of proposed encoding assumptions]
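To make the proposed rules concrete, here is a minimal sketch in Perl of how a checker might apply them to the raw bytes of a Pod file. It is not Pod::Simple's implementation: the names declared_encoding and normalize_name are hypothetical, the name comparison only folds case, hyphens and underscores (so aliases such as "latin1" and "iso-8859-1" would wrongly count as a conflict), and it assumes the only BOMs of interest are the UTF8 and UTF16 ones.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: fold case and drop hyphens/underscores so that
# "UTF-8" and "utf8" compare equal. Real alias resolution is out of scope.
sub normalize_name {
    my ($name) = @_;
    (my $n = lc $name) =~ s/[-_]//g;
    return $n;
}

# Hypothetical checker: returns (encoding, undef) when the file is consistent,
# (undef, undef) when nothing declares an encoding, and (undef, message)
# when two declarations disagree.
sub declared_encoding {
    my ($bytes) = @_;
    my $declared;
    my $text = $bytes;

    if ($text =~ /\A(?:\xFE\xFF|\xFF\xFE)/) {
        # A UTF16 BOM counts as the file's one non-redundant declaration.
        $declared = 'utf16';
        $text = decode('UTF-16', $text);    # uses and removes the BOM
    }
    elsif ($text =~ s/\A\xEF\xBB\xBF//) {
        # A UTF8 BOM likewise fixes the encoding.
        $declared = 'utf8';
    }

    # Every later =encoding line must agree with what is already declared.
    while ($text =~ /^=encoding[ \t]+(\S+)/mg) {
        my $name = normalize_name($1);
        if (!defined $declared) {
            $declared = $name;
        }
        elsif ($declared ne $name) {
            return (undef, "conflicting encodings: $declared vs $name");
        }
    }
    return ($declared, undef);
}

my ($enc, $err) = declared_encoding("=encoding iso-8859-6\n\n=encoding shiftjis\n");
print defined $err ? "invalid: $err\n" : "encoding: $enc\n";
```

In this sketch a conflict is reported to the caller rather than aborting the parse; whether the processor should do something more drastic is left open, as in the proposal above.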
The reason I'm making UTF16 special, above, is that it's the only
really-double-byte-character-set I know of -- i.e., a character encoding
where ALL characters are expressed as two (or more) bytes. I know
there are some Asian encodings where non-USASCII characters are encoded as
multiple bytes, but the letter A is still expressed as a single byte value,
a decimal-65 byte.
Or are there some Asian encodings I've forgotten about? Specifically, I'm
wondering whether UTF16 is the only attention-worthy encoding where the
nine characters "=encoding" take up 18 bytes to express, instead of being
just the 9 bytes 61 101 110 99 111 100 105 110 103 (i.e., join ' ', map
ord($_), split '', '=encoding').
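The byte counts in question can be checked with the core Encode module. This is only an illustrative sketch: it uses UTF-16BE so that no BOM is added to the count, and it uses Encode's "shiftjis" as one example of an ASCII-compatible multibyte encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $directive = '=encoding';

# One byte per character in US-ASCII and in ASCII-compatible encodings.
my @ascii_bytes = unpack 'C*', encode('ascii', $directive);

# Two bytes per character in UTF16 (big-endian here, without a BOM).
my @utf16_bytes = unpack 'C*', encode('UTF-16BE', $directive);

# Shift-JIS still encodes the letter A as the single byte 65.
my @sjis_bytes = unpack 'C*', encode('shiftjis', 'A');

print scalar(@ascii_bytes), " bytes: @ascii_bytes\n";
print scalar(@utf16_bytes), " bytes: @utf16_bytes\n";
print scalar(@sjis_bytes),  " byte for 'A' in Shift-JIS: @sjis_bytes\n";
```

The practical consequence is that a byte-level search for the nine ASCII bytes of "=encoding" finds nothing in a UTF16 file, since a zero byte sits next to each of them.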
Sean M. Burke http://search.cpan.org/~sburke/