Re: Re-gaining 'O's
From: Russ Allbery
August 26, 2008 05:56
Message ID: firstname.lastname@example.org
"H.Merijn Brand" <email@example.com> writes:
> We're now running in circles.
> I'm not arguing that that is not (or is) the correct solution, but I
> want the CORE test suite to PASS for above combination. Currently all
> smokes are spitting 'F' for FAIL, where in my explanation it should be
> Can you make the CORE tests PASS without altering your POV on the
> module itself?
...I'm sorry, I had completely misunderstood your original message.
Please ignore all my previous contributions to this thread; I was
Okay, I can now reproduce this. What's happening is that the output from
the script is being double-converted, so already Unicode output is being
converted to Unicode again. For some reason, Perl doesn't think that the
string that Pod::Man is writing is already in Unicode and is therefore
assuming that it's in a legacy character set and reconverting to Unicode
on output to the file handle.
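
A minimal sketch of that double conversion, assuming an in-memory file handle with a UTF-8 output layer stands in for the real output file (the helper name hex_written is made up for this illustration and is not part of Pod::Man or Pod::Simple):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Write $string through a UTF-8 output layer and return the bytes
# that ended up in the "file" as a hex dump.
sub hex_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buffer;
}

# "é" as raw UTF-8 bytes.  Perl cannot tell these are already encoded,
# so the layer treats each byte as a Latin-1 character and encodes it again.
my $bytes = "\xc3\xa9";

# The same text decoded into a one-character string (U+00E9).
my $chars = decode('UTF-8', $bytes);

print "bytes through layer: ", hex_written($bytes), "\n";   # c3 83 c2 a9
print "chars through layer: ", hex_written($chars), "\n";   # c3 a9
```

The first line of output shows the corrupted, double-encoded form; the second shows what a properly decoded character string produces.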
I think this is a Pod::Simple bug, since it's responsible for handling the
character set of the input and handing UTF-8 to processors. I think it's
doing that but not tagging the strings properly so that Perl knows that
they're UTF-8. Looking at the Pod::Simple source, it apparently pays no
attention to PERL_UNICODE and only realizes that its input is UTF-8 if
there's a BOM in the file. The following patch fixes the test case:
@@ -91,6 +91,8 @@ __DATA__
Beyoncé! Beyoncé! Beyoncé!!
by telling Pod::Simple explicitly that the input is UTF-8. perlpod does
sort of imply that this is required:

    This command is used for declaring the encoding of a document.
    Most users won’t need this; but if your encoding isn’t US-ASCII or
    Latin-1, then put a "=encoding encodingname" command early in the
    document so that pod formatters will know how to decode the
    document.

so this is a valid patch to use. perlpodspec goes on at greater length:

    This command, which should occur early in the document (at least
    before any non-US-ASCII data!), declares that this document is
    encoded in the encoding encodingname, which must be an encoding
    name that Encoding recognizes. (Encoding’s list of supported
    encodings, in Encode::Supported, is useful here.) If the Pod
    parser cannot decode the declared encoding, it should emit a
    warning and may abort parsing the document altogether.

    A document having more than one "=encoding" line should be
    considered an error. Pod processors may silently tolerate this if
    the not-first "=encoding" lines are just duplicates of the first
    one (e.g., if there’s a "=use utf8" line, and later on another
    "=use utf8" line). But Pod processors should complain if there are
    contradictory "=encoding" lines in the same document (e.g., if
    there is a "=encoding utf8" early in the document and "=encoding
    big5" later). Pod processors that recognize BOMs may also complain
    if they see an "=encoding" line that contradicts the BOM (e.g., if
    a document with a UTF-16LE BOM has an "=encoding shiftjis" line).

I think it's debatable whether this is the correct behavior for
Pod::Simple; it seems to me that if PERL_UNICODE is set and we're in a
UTF-8 locale, Pod::Simple should assume all input is Unicode, since that's
kind of what that setting says. But I will include the test case patch
anyway in the next release of Pod::Man since given the current
specification it's required for Unicode input to be recognized properly.
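
For illustration, a POD document that declares its encoding the way the quoted documentation describes might look like the fragment below. This is a hypothetical document, not the actual test data; the only point is that the "=encoding" line comes before any non-ASCII text.

```pod
=encoding utf8

=head1 NAME

example - hypothetical document containing non-ASCII text

=head1 DESCRIPTION

Beyoncé!  Beyoncé!  Beyoncé!!

=cut
```
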
I'm very sorry for my fairly useless previous responses when I didn't
understand what you were asking.
Russ Allbery (firstname.lastname@example.org) <http://www.eyrie.org/~eagle/>