perl.i18n | Postings from July 2003

encodings in Pod::Simple

Sean M. Burke
July 4, 2003 19:08

(I'm CCing the perl-i18n list on this, as this will interest some on the 
list; and I may need to pick the brains of a few people on the list.)

So as I'm mulling over the implementation and tests for =encoding in
Pod::Simple, I'm starting to settle on some hopefully sane assumptions that
Pod::Simple can make, which I'd like to run past you all for comment:

* A Pod file is in one encoding.  You can't have a file that's half UTF8 
and half Shift-JIS.

* You can use only one =encoding directive per file.  The exception is that
"redundant =encoding" commands (i.e., ones that simply redeclare the
encoding that we've already declared) are ignored.  So if you had
two "=encoding iso-8859-6" commands in a file, the second one would be
silently forgiven and ignored.  But if you have an "=encoding iso-8859-6"
and later an "=encoding shiftjis", this makes the file invalid (and the Pod
processor can probably do something drastic like abort parsing the file).

* If a Pod file is in UTF16, it /must/ flag this by having a BOM at the 
beginning of the file.  There can be a redundant "=encoding utf16" command, 
but it will be ignored.  No other =encoding directives are permitted in a
UTF16 file.  (In short, a BOM counts sort of like an =encoding directive, 
and so it uses up your allowance of one non-redundant =encoding per file.)

* Similarly, if a Pod file is in UTF8, it /can/ signal this with a UTF8 
BOM, and/or a "=encoding utf8" directive.  But it's forbidden to have a 
UTF8 BOM and to then have an "=encoding" line other than "=encoding utf8".

[end of proposed encoding assumptions]
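To make the rules above concrete, here is a minimal sketch of how a checker might apply them to the raw bytes of a file.  This is not Pod::Simple code: check_pod_encoding and normalize are hypothetical names, the treatment of "utf-8" and "utf8" as the same name is my own assumption, and returning an error string stands in for whatever "something drastic" the processor actually does.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Pod::Simple): returns (encoding, error).
# Exactly one of the two is defined, unless the file declares nothing at all.
sub check_pod_encoding {
    my ($bytes) = @_;
    my $declared;          # the one encoding this file has committed to
    my $text = $bytes;

    # A BOM counts like an =encoding directive.
    if ($bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/) {
        $declared = 'utf16';
        $text = decode('UTF-16', $bytes);   # decoder consumes the BOM
    }
    elsif ($text =~ s/\A\xEF\xBB\xBF//) {
        $declared = 'utf8';
    }

    # Every later directive must agree with what is already declared.
    while ($text =~ /^=encoding[ \t]+(\S+)/mg) {
        my $name = normalize($1);
        if (!defined $declared) {
            $declared = $name;              # first declaration wins
        }
        elsif ($declared ne $name) {
            return (undef, "conflicting =encoding: $declared vs $name");
        }
        # otherwise it is a redundant redeclaration, silently ignored
    }
    return ($declared, undef);
}

# Assumption: names compare case-insensitively, and utf-8/utf8 are the same.
sub normalize {
    my ($name) = @_;
    $name = lc $name;
    $name =~ s/^utf-?(8|16)$/utf$1/;
    return $name;
}
```

Called on a file containing two "=encoding iso-8859-6" lines, this returns the encoding name and no error; called on a file with "=encoding iso-8859-6" followed by "=encoding shiftjis", it returns an error instead.  Note that it does not try to detect a UTF16 file that lacks a BOM, which under the rules above is simply not allowed.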

The reason I'm making UTF16 special, above, is that it's the only
really-double-byte character set I know of -- i.e., a character encoding
where ALL characters are expressed as two (or more) bytes.  I know
there are some Asian encodings where non-USASCII characters are encoded as
multiple bytes, but the letter A is still expressed as a single byte value,
a decimal-65 byte.

Or are there some Asian encodings I've forgotten about?  Specifically, I'm 
wondering whether UTF16 is the only attention-worthy encoding where the 
nine characters "=encoding" take up 18 bytes to express, instead of being 
just the 9 bytes 61 101 110 99 111 100 105 110 103 (i.e., join ' ', map
ord($_), split '', '=encoding').
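One way to check the byte counts for any particular encoding is to ask the Encode module directly.  The sketch below is only an illustration: byte_values is a hypothetical helper, the list of encodings is just a sample I picked, and it assumes the Japanese encodings that ship with Encode (Encode::JP) are installed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the byte values of $string once encoded as $encoding.
# $encoding is any name the Encode module recognizes.
sub byte_values {
    my ($encoding, $string) = @_;
    return map { ord } split //, encode($encoding, $string);
}

# Sample encodings only; add whichever ones you are curious about.
for my $encoding (qw(ascii utf8 shiftjis euc-jp UTF-16BE UTF-16LE)) {
    my @bytes = byte_values($encoding, '=encoding');
    printf "%-9s %2d bytes: %s\n", $encoding, scalar @bytes, join ' ', @bytes;
}
```

For the sample above, only the two UTF-16 variants report 18 bytes (each character's byte value paired with a zero byte); the rest report the same 9 bytes listed earlier.  It says nothing about encodings outside the sample, which is exactly the open question.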

Sean M. Burke
