perl.perl5.porters | Postings from August 2008

Re: Re-gaining 'O's

Russ Allbery
August 26, 2008 06:47
"H.Merijn Brand" writes:

> Hmm, not in my case: It makes that combination PASS, but breaks the
> other tests, so I'll wait till the rest of the patch.

If PERL_UNICODE *isn't* set, when the test script reads from __DATA__ and
writes that out to the file, I think the data gets re-encoded in some
strange way rather than being written out as the same bytes that were read
in.  So the data on disk is some bizarre broken re-encoding of what's in
__DATA__.

This patch (replacing the previous patch) makes the test succeed, but I
find the results completely nonsensical.

--- a/t/man-options.t
+++ b/t/man-options.t
@@ -38,6 +38,11 @@ while (<DATA>) {
         $options{$option} = $value;
     open (TMP, '> tmp.pod') or die "Cannot create tmp.pod: $!\n";
+    eval {
+        if (${^OPEN} and (split '\0', ${^OPEN})[1] =~ /utf8/) {
+            print TMP "=encoding utf-8\n\n";
+        }
+    };
     while (<DATA>) {
         last if $_ eq "###\n";
         print TMP $_;

This tries to add =encoding utf-8 iff the PerlIO layer seems to indicate
that Perl is going to write out UTF-8.  However, when run with
PERL_UNICODE set and in a UTF-8 locale, both the output POD file and the
output from Pod::Man are double-encoded, and Perl thinks the test succeeds
because it thinks the data from <DATA> is also double-encoded.  (!?!)
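
For illustration, the ${^OPEN} check in the patch above can be pulled out
into a small function.  This is only a sketch, not code from the test
suite: the name default_output_is_utf8 is hypothetical, and it relies only
on what perlvar documents, namely that ${^OPEN} holds the default input and
output layers separated by a NUL byte.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of t/man-options.t): given a value in the
# format of ${^OPEN}, report whether the default *output* layers mention
# utf8.  The input layers come first, then a NUL byte, then the output
# layers.
sub default_output_is_utf8 {
    my ($open) = @_;
    return 0 unless defined $open && length $open;

    # A limit of 2 keeps the output field even when it is empty.
    my (undef, $out) = split /\0/, $open, 2;
    return (defined $out && $out =~ /utf8/) ? 1 : 0;
}

# ${^OPEN} may be undefined when no default layers are in effect.
print default_output_is_utf8(${^OPEN})
    ? "default output layers include utf8\n"
    : "no utf8 default on output\n";
```

Taking the value as an argument rather than reading ${^OPEN} inside the
function keeps the parsing separate from whatever the interpreter happens
to have set, which makes it easier to check by hand.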

I must admit that I find Perl's Unicode support completely baffling.  I've
read the documentation multiple times, and while I can now predict when it
won't work, I still have no idea how to get it to do what I want.  It's
kind of frustrating; I don't know if I'm just dim, or if I'm missing some
magic bit of documentation, or if the support is just really confusing.

I'm afraid I have absolutely no idea how to fix this.  I tried to add
encodes and decodes, and the results just got more and more confusing.  I
read through perlunitut and perlunifaq again, and I'm now even more
confused than I was when I started.

Are the lines I read from <DATA> supposed to require decoding, or is that
done for me?  How do I know whether data that I read from a file has
already been decoded by a layer added by PERL_UNICODE and when I have to
do this myself?  How do I know when I'm supposed to encode a string before
printing it to a file handle and when there's some layer in place that
will do this for me?  Am I even supposed to know?
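
One way to at least inspect a handle, sketched here as an assumption about
what would help rather than as an established answer, is the built-in
PerlIO::get_layers function.  The helper name handle_converts_characters
is hypothetical.  It treats a "utf8" or "encoding(...)" entry in the layer
list as a sign that the handle already translates between characters and
bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the handle already has a layer that
# converts between characters and bytes, in which case calling encode or
# decode by hand as well would convert the data twice.
sub handle_converts_characters {
    my ($fh, $direction) = @_;
    my @args = (defined $direction && $direction eq 'output')
        ? (output => 1)
        : ();
    my @layers = PerlIO::get_layers($fh, @args);
    return (grep { $_ eq 'utf8' || /^encoding\(/ } @layers) ? 1 : 0;
}

# In-memory files stand in for real files so the sketch is self-contained.
open(my $raw, '>:raw', \my $raw_buf)
    or die "Cannot open in-memory file: $!\n";
open(my $utf8, '>:encoding(UTF-8)', \my $utf8_buf)
    or die "Cannot open in-memory file: $!\n";

printf "raw handle converts:  %d\n", handle_converts_characters($raw,  'output');
printf "utf8 handle converts: %d\n", handle_converts_characters($utf8, 'output');
```

This only reports what is on the handle at the moment of the call; it says
nothing about whether a string has already been decoded elsewhere.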

The following patch tries to follow perlunitut and do encode and decode
whenever talking to the outside world, but if you run the script with this
patch without PERL_UNICODE set, the output from Pod::Man via Pod::Simple
is in ISO-8859-1 instead of UTF-8.  Maybe this indicates that the
problem is actually a bug in Pod::Simple and *it's* not correctly encoding
the output?  (Pod::Man never writes out anything itself.)

But this still feels wrong to me.  Everywhere I do file IO in a module, am
I really required to add these encode and decode calls or add IO layers,
which aren't backward compatible with older versions of Perl?  Surely that
can't be right.

--- a/t/man-options.t
+++ b/t/man-options.t
@@ -23,6 +23,7 @@ END {
     print "not ok 1\n" unless $loaded;
+use Encode qw(decode encode);
 use Pod::Man;
 $loaded = 1;
@@ -30,17 +31,21 @@ print "ok 1\n";
 my $n = 2;
 while (<DATA>) {
+    $_ = decode ('utf-8', $_);
     my %options;
     next until $_ eq "###\n";
     while (<DATA>) {
+        $_ = decode ('utf-8', $_);
         last if $_ eq "###\n";
         my ($option, $value) = split;
         $options{$option} = $value;
     open (TMP, '> tmp.pod') or die "Cannot create tmp.pod: $!\n";
+    print TMP "=encoding utf-8\n\n";
     while (<DATA>) {
+        $_ = decode ('utf-8', $_);
         last if $_ eq "###\n";
-        print TMP $_;
+        print TMP encode ('utf-8', $_);
     close TMP;
     my $parser = Pod::Man->new (%options) or die "Cannot create parser\n";
@@ -50,13 +55,14 @@ while (<DATA>) {
     my $accents = 0;
     open (TMP, 'out.tmp') or die "Cannot open out.tmp: $!\n";
     while (<TMP>) {
+        $_ = decode ('utf-8', $_);
         $accents = 1 if /Accent mark definitions/;
         last if /^\.nh/;
     my $output;
         local $/;
-        $output = <TMP>;
+        $output = decode ('utf-8', <TMP>);
     close TMP;
     unlink ('tmp.pod', 'out.tmp');
@@ -70,6 +76,7 @@ while (<DATA>) {
     my $expected = '';
     while (<DATA>) {
+        $_ = decode ('utf-8', $_);
         last if $_ eq "###\n";
         $expected .= $_;
@@ -77,6 +84,8 @@ while (<DATA>) {
         print "ok $n\n";
     } else {
         print "not ok $n\n";
+        $expected = encode ('utf-8', $expected);
+        $output = encode ('utf-8', $output);
         print "Expected\n========\n$expected\nOutput\n======\n$output\n";

If I additionally set an output encoding before passing the file handle
into Pod::Man, the test passes again, but the files on disk are
double-encoded if PERL_UNICODE is set.  (!?)  It looks like I can't use
encode, decode, or a file handle encoding if PERL_UNICODE is set or I'll
get double-encoding?
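
A minimal sketch of how that double-encoding can arise, assuming that
PERL_UNICODE has put a UTF-8 layer on the handle (perlrun documents the
i, o, and D flags as setting default layers): bytes that were already
passed through encode get encoded a second time by the layer.  The helper
bytes_written is hypothetical and uses in-memory files in place of tmp.pod.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: write $string to an in-memory file opened with
# $mode and return the raw bytes that ended up in it.
sub bytes_written {
    my ($mode, $string) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer)
        or die "Cannot open in-memory file: $!\n";
    print {$fh} $string;
    close($fh) or die "Cannot close in-memory file: $!\n";
    return $buffer;
}

my $chars = "caf\x{e9}";               # 4 characters, the last is e-acute
my $bytes = encode('UTF-8', $chars);   # 5 bytes of UTF-8

# Encoded by hand, written to a handle with no encoding layer.
my $once = bytes_written('>:raw', $bytes);

# Encoded by hand *and* by the layer: the two UTF-8 bytes of the e-acute
# are each treated as a character and encoded again.
my $twice = bytes_written('>:encoding(UTF-8)', $bytes);

# Left as characters and encoded by the layer alone.
my $layer_only = bytes_written('>:encoding(UTF-8)', $chars);

for my $result ($once, $twice, $layer_only) {
    print join(' ', map { sprintf '%02x', ord } split //, $result), "\n";
}
```

Under that assumption, encoding by hand or relying on the layer each
produce correct output alone, and only the combination goes wrong.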

Russ Allbery
