perl.perl5.porters | Postings from August 2008

Re: Re-gaining 'O's

Russ Allbery
August 26, 2008 05:56
"H.Merijn Brand" writes:

> We're now running in circles.
> I'm not arguing that that is not (or is) the correct solution, but I
> want the CORE test suite to PASS for above combination. Currently all
> smokes are spitting 'F' for FAIL, where in my explanation it should be
> Can you make the CORE tests PASS without altering your POV on the
> module itself?

...I'm sorry, I had completely misunderstood your original message.
Please ignore all my previous contributions to this thread; I was
hopelessly confused.

Okay, I can now reproduce this.  What's happening is that the output from
the script is being double-converted, so already Unicode output is being
converted to Unicode again.  For some reason, Perl doesn't think that the
string that Pod::Man is writing is already in Unicode and is therefore
assuming that it's in a legacy character set and reconverting to Unicode
on output to the file handle.
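
The double conversion can be reproduced outside Pod::Man.  This is only a
minimal sketch using the core Encode module, not the code path Pod::Man or
Pod::Simple actually takes: it encodes the same text once from raw octets
that Perl was never told are UTF-8, and once from a properly decoded
character string.  The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Beyoncé" as raw UTF-8 octets; Perl has not been told these are UTF-8.
my $bytes = "Beyonc\xC3\xA9";

# Encoding the undecoded octets treats each byte as a Latin-1 character,
# so the two bytes of "é" become four bytes: the double conversion.
my $double = encode('UTF-8', $bytes);

# Decoding first gives Perl a 7-character string; encoding that string
# yields the original octets again.
my $chars  = decode('UTF-8', $bytes);
my $single = encode('UTF-8', $chars);

printf "undecoded: %d bytes (%s)\n", length($double), unpack('H*', $double);
printf "decoded:   %d bytes (%s)\n", length($single), unpack('H*', $single);
```

The first line reports 10 bytes ending in c383c2a9, the second 8 bytes
ending in c3a9, which is the difference between the mangled and the correct
output.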

I think this is a Pod::Simple bug, since it's responsible for handling the
character set of the input and handing UTF-8 to the processors.  I think
it's doing that but not tagging the strings properly so that Perl knows that
they're UTF-8.  Looking at the Pod::Simple source, it apparently pays no
attention to PERL_UNICODE and only realizes that its input is UTF-8 if
there's a BOM in the file.  The following patch fixes the test case:

--- a/t/man-options.t
+++ b/t/man-options.t
@@ -91,6 +91,8 @@ __DATA__
 utf8 1
+=encoding utf-8
 =head1 BEYONCÉ
 Beyoncé!  Beyoncé!  Beyoncé!!

by telling Pod::Simple explicitly that the input is UTF-8.  perlpod does
sort of imply that this is required:

   "=encoding encodingname"
       This command is used for declaring the encoding of a document.
       Most users won’t need this; but if your encoding isn’t US-ASCII or
       Latin-1, then put a "=encoding encodingname" command early in the
       document so that pod formatters will know how to decode the

so this is a valid patch to use.  perlpodspec goes on at greater length:

   "=encoding encodingname"
       This command, which should occur early in the document (at least
       before any non-US-ASCII data!), declares that this document is
       encoded in the encoding encodingname, which must be an encoding
       name that Encoding recognizes.  (Encoding’s list of supported
       encodings, in Encode::Supported, is useful here.)  If the Pod
       parser cannot decode the declared encoding, it should emit a
       warning and may abort parsing the document altogether.

       A document having more than one "=encoding" line should be
       considered an error.  Pod processors may silently tolerate this if
       the not-first "=encoding" lines are just duplicates of the first
       one (e.g., if there’s a "=use utf8" line, and later on another
       "=use utf8" line).  But Pod processors should complain if there are
       contradictory "=encoding" lines in the same document (e.g., if
       there is a "=encoding utf8" early in the document and "=encoding
       big5" later).  Pod processors that recognize BOMs may also complain
       if they see an "=encoding" line that contradicts the BOM (e.g., if
       a document with a UTF-16LE BOM has an "=encoding shiftjis" line).

I think it's debatable whether this is the correct behavior for
Pod::Simple; it seems to me that if PERL_UNICODE is set and we're in a
UTF-8 locale, Pod::Simple should assume all input is Unicode, since that's
kind of what that setting says.  But I will include the test case patch
anyway in the next release of Pod::Man since given the current
specification it's required for Unicode input to be recognized properly.
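
As a rough illustration of the rule described above (a BOM or an explicit
=encoding command is what marks the input as UTF-8), here is a sketch of
that decision.  The helper declared_pod_encoding is hypothetical and is not
Pod::Simple's actual code; it checks only for a UTF-8 BOM and ignores
PERL_UNICODE and the locale, just as the current behavior appears to.

```perl
use strict;
use warnings;

# Hypothetical helper, not Pod::Simple's actual code: return the encoding
# a POD source declares, or undef when it declares nothing.
sub declared_pod_encoding {
    my ($source) = @_;

    # A UTF-8 byte order mark at the very start marks the file as UTF-8.
    return 'UTF-8' if $source =~ /\A\xEF\xBB\xBF/;

    # Otherwise use the first "=encoding NAME" command, if there is one.
    return $1 if $source =~ /^=encoding[ \t]+(\S+)/m;

    # No declaration: per the perlpod text quoted above, the input is then
    # taken to be US-ASCII or Latin-1.
    return;
}

# The test data without and with the line added by the patch.
my $without = "=head1 BEYONC\xC3\x89\n\nBeyonc\xC3\xA9!\n";
my $with    = "=encoding utf-8\n\n" . $without;

for my $pod ($without, $with) {
    my $enc = declared_pod_encoding($pod);
    print defined($enc) ? "declared: $enc\n" : "no declaration\n";
}
```

Only the second document, the one carrying the added =encoding line, comes
back with a declared encoding.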

I'm very sorry for my fairly useless previous responses when I didn't
understand what you were asking.

Russ Allbery
