perl.pod-people
From: David E. Wheeler
January 6, 2015 05:59
Message ID: 412A27EC-6EE5-4A06-8CBA-5128E7CE3741@justatheory.com
* Since Perl recognizes a Unicode Byte Order Mark at the start of files
as signaling that the file is Unicode encoded as in UTF-16 (whether
big-endian or little-endian) or UTF-8, Pod parsers should do the same.
Otherwise, the character encoding should be understood as being UTF-8
if the first highbit byte sequence in the file seems valid as a UTF-8
sequence, or otherwise as Latin-1.
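To make the quoted rule concrete, here is a rough sketch of the detection order it describes, written in Python for illustration only. It is not the code any Pod parser actually uses; the function name `guess_pod_encoding` and its `fallback` parameter are made up for this example, and it only looks at the first run of high-bit bytes, as the rule suggests.

```python
import codecs
import re


def guess_pod_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Guess an encoding for raw Pod bytes, following the quoted rule.

    Hypothetical helper for illustration; `fallback` is the encoding
    assumed when the first high-bit sequence is not valid UTF-8.
    """
    # Step 1: a byte order mark settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # Python codec that also strips the BOM
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"  # Python codec that reads the BOM for endianness

    # Step 2: find the first run of bytes with the high bit set.
    match = re.search(rb"[\x80-\xff]+", data)
    if match is None:
        # Pure ASCII decodes identically under every candidate encoding.
        return fallback

    # Step 3: if that run is valid UTF-8, assume UTF-8; otherwise fall back.
    try:
        match.group().decode("utf-8")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

Under this sketch, the proposal below amounts to changing the fallback: calling `guess_pod_encoding(data, fallback="cp1252")` instead of relying on the `"latin-1"` default.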
I suggest we switch from Latin-1 to CP1252. The reasons are:
* CP1252 is effectively a superset of Latin-1.
* Characters that are valid in CP1252 but not in Latin-1 sometimes appear in Pod, typically curly quotes, em dashes, or similar characters pasted from Word. The usual suspects are listed in this table:
* If parsers assumed CP1252 instead of Latin-1, such characters would be decoded properly when parsing Pod, and so would come out right in the resulting output. Text that really is Latin-1 should be unaffected.
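The difference between the two encodings can be checked directly. The sketch below uses Python's built-in `latin-1` and `cp1252` codecs, which may differ in small ways from Perl's Encode tables; the helper name `differing_bytes` and the sample bytes are invented for this example.

```python
def differing_bytes() -> list[int]:
    """Return the byte values that decode differently as Latin-1 and CP1252."""
    diffs = []
    for value in range(0x100):
        raw = bytes([value])
        as_latin1 = raw.decode("latin-1")
        # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
        # undefined, so use errors="replace" to avoid an exception there.
        as_cp1252 = raw.decode("cp1252", errors="replace")
        if as_latin1 != as_cp1252:
            diffs.append(value)
    return diffs


# Sample bytes as they might arrive from a Word paste: curly quotes (0x93,
# 0x94), an em dash (0x97), and an ordinary Latin-1 letter (0xE9).
sample = b"\x93Pod\x94 \x97 caf\xe9"
as_latin1 = sample.decode("latin-1")  # quotes and dash become C1 controls
as_cp1252 = sample.decode("cp1252")   # quotes and dash decode as intended
```

The two codecs disagree only in the 0x80 to 0x9F range, which Latin-1 assigns to C1 control characters. One caveat with Python's codec specifically: a strict `cp1252` decode raises an error on the five undefined bytes, so a parser that switches fallbacks would need to decide how to treat them.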
So I think we would get better output for documents that include Windows-specific characters, without side effects; a little more stuff would simply be output properly. I’ve discussed this with Sean Burke over the last couple of years, and IIRC he said he probably should have assumed CP1252 instead of Latin-1 when he wrote it. It’s coming up again now because Karl Williamson has recently been improving the EBCDIC support, which lives in the same bit of code (it’s all about encodings, you know?), so this would be a natural time and place to do it.
But not if there are flaws in the plan. Should we make this change? It seems like a win overall to me, but I miss details all the time, so let me know your thoughts.