develooper Front page | perl.perl5.porters | Postings from August 2008

Re: Re-gaining 'O's

From: Russ Allbery
Date: August 27, 2008 09:10
Subject: Re: Re-gaining 'O's
Message ID: 87tzd73c7o.fsf@windlord.stanford.edu

(I should mention that I'm not subscribed to perl5-porters, so I'll only
see messages on this thread that are cc'd to me.)

Juerd Waalboer <juerd@convolution.nl> writes:
> Russ Allbery wrote 2008-08-26  0:27 (-0700):

>> But this still feels wrong to me; everywhere I do file IO in a module,
>> I'm required to add these encode and decode statements all over the
>> place or add IO layers, which aren't backward compatible with older
>> versions of Perl?  Surely that can't be right.
>
> Yes, that's right.
>
> Numbers always needed some conversion before they could fit in binary
> form. That's what pack and unpack are for, or on another level
> conversion to a string of (ASCII) digits.

I think I could count on one hand the number of times I've used pack and
unpack in nearly 15 years as a Perl programmer, so I must say I don't see
that as particularly comparable in terms of its impact on everyone who
writes Perl.

> Text strings need the same treatment if your encoding is not a "single
> byte = single character" encoding. Continuous unpacking and packing...
> I mean decoding and encoding of course.

Hm.  At first glance, that implementation choice looks like it's made Perl
much more difficult to program in and has broken backward compatibility,
but I'm not reading perl5-porters and haven't been around for the
discussion.  You all have done an excellent job maintaining Perl and I'm
sure there were very solid reasons for doing it this way.  I must admit,
though, that I'm unlikely to go back and sprinkle such code through every
module and script I maintain.

But that aside, most of the stuff I maintain doesn't claim to deal
specifically with UTF-8 and podlators does, so I would like to fix it here
at least, whatever dance Perl asks module authors to do.

> Looks sane, but it'd be easier with a layer. binmode DATA,
> ":encoding(utf-8)" and you don't need all those decodes anymore.

I'm not sure how I managed to miss this particular combination last night,
but the following patch appears to fix the problem.  Rationale: set all
file handles read or written by the test script that deal specifically
with UTF-8 to use a utf-8 encoding layer, except for the one that will be
used for Pod::Simple output.  For that one, let Pod::Simple do its own
internal encoding handling.  I can convince myself that those are
reasonable semantics.  I've checked the temporary output files, and they
now all appear to be encoded properly with no weird double-encoding.
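
As an illustration of the double-encoding concern above, here is a minimal
sketch (not part of the patch) that writes the same character string twice:
once through an :encoding(utf-8) layer alone, and once encoded by hand and
then sent through the layer as well.  The variable names are made up for
the example, and the in-memory file handles assume Perl 5.8 or later.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string containing U+00E9 (e with acute accent).
my $text = "caf\x{e9}\n";

# Case 1: let the I/O layer do the encoding, once.
my $layered = '';
open(my $out, '>', \$layered) or die "Cannot open in-memory file: $!\n";
binmode($out, ':encoding(utf-8)');
print {$out} $text;
close $out;

# Case 2: encode by hand AND use the layer.  The layer sees the two UTF-8
# bytes of U+00E9 as two separate characters and encodes each of them again.
my $doubled = '';
open($out, '>', \$doubled) or die "Cannot open in-memory file: $!\n";
binmode($out, ':encoding(utf-8)');
print {$out} encode('UTF-8', $text);
close $out;

printf "layer only:      %d bytes\n", length $layered;   # 6
printf "encode + layer:  %d bytes\n", length $doubled;   # 8
```

The extra two bytes in the second case are the double-encoding: the single
accented character ends up as four bytes instead of two.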

The binmodes are wrapped in eval because I think they may die with
versions of Perl prior to 5.8, which podlators still claims to support.
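
A minimal sketch of that guarded-binmode idea follows.  The helper name
set_utf8_layer is hypothetical, and the sketch itself needs Perl 5.8 or
later because it uses an in-memory file handle.  Note that eval BLOCK only
traps run-time failures; whether an older Perl would fail at run time or
earlier is an assumption this example does not test.

```perl
use strict;
use warnings;

# Hypothetical helper: try to push a UTF-8 encoding layer onto a handle
# and report whether that worked, instead of dying if it did not.
sub set_utf8_layer {
    my ($fh) = @_;
    my $ok = eval { binmode($fh, ':encoding(utf-8)') };
    return $ok ? 1 : 0;
}

my $buffer = '';
open(my $fh, '>', \$buffer) or die "Cannot open in-memory file: $!\n";
my $has_layer = set_utf8_layer($fh);
print {$fh} "na\x{ef}ve\n";    # U+00EF, i with diaeresis
close $fh;

print $has_layer ? "layer applied\n" : "layer not available\n";
printf "%d bytes written\n", length $buffer;
```

If the layer cannot be applied, the script carries on with a plain byte
handle, which matches the intent of not breaking the test on old Perls.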

This also fixed another bug that had been puzzling me, and I'm now less
confused about why Pod::Simple appeared to be embedding a non-UTF-8
character: it isn't.  It's assuming character semantics rather than byte
semantics, which I should have already known.
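
To make the character-versus-byte distinction concrete, here is a small
sketch, offered as one reading of the problem rather than as a statement
of what Pod::Simple does internally.  It treats "\xA0" as a single
character (U+00A0, no-break space) and compares encoding it once with
first expanding it by hand to "\xC2\xA0" and then encoding.  The helper
hex_bytes is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# One character: U+00A0 NO-BREAK SPACE.
my $nbsp = "\x{A0}";

# Character semantics: encode once on output.
my $correct = encode('UTF-8', $nbsp);

# Byte semantics applied to a character string: expand to "\xC2\xA0" by
# hand first.  The string now holds two characters (U+00C2 and U+00A0),
# and the later encoding step encodes both of them.
(my $patched = $nbsp) =~ s/\xA0/\xC2\xA0/g;
my $doubled = encode('UTF-8', $patched);

print "encoded once:        ", hex_bytes($correct), "\n";   # C2 A0
print "expanded, then enc.: ", hex_bytes($doubled), "\n";   # C3 82 C2 A0
```

Under this reading, a manual substitution of that kind becomes harmful as
soon as the output handle does its own encoding.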

Could someone else look over this patch and both check that it also fixes
the problem for you and tell me if it looks sane?  I'm fairly sure of the
Pod::Man patch, which just reverses my own complete misunderstanding.

diff --git a/lib/Pod/Man.pm b/lib/Pod/Man.pm
index 38c4e3d..203ef4a 100644
--- a/lib/Pod/Man.pm
+++ b/lib/Pod/Man.pm
@@ -362,13 +362,6 @@ sub format_text {
         $text =~ s/([^\x00-\x7F])/$ESCAPES{ord ($1)} || "X"/eg;
     }
 
-    # For Unicode output, unconditionally remap ISO 8859-1 non-breaking spaces
-    # to the correct code point.  This is really a bug in Pod::Simple to be
-    # embedding ISO 8859-1 characters in the output stream that we see.
-    if ($$self{utf8} && ASCII) {
-        $text =~ s/\xA0/\xC2\xA0/g;
-    }
-
     # Ensure that *roff doesn't convert literal quotes to UTF-8 single quotes,
     # but don't mess up our accept escapes.
     if ($literal) {
diff --git a/t/man-options.t b/t/man-options.t
index f00e7d1..48ceae0 100755
--- a/t/man-options.t
+++ b/t/man-options.t
@@ -29,6 +29,7 @@ $loaded = 1;
 print "ok 1\n";
 
 my $n = 2;
+eval { binmode (\*DATA, ':encoding(utf-8)') };
 while (<DATA>) {
     my %options;
     next until $_ eq "###\n";
@@ -38,6 +39,8 @@ while (<DATA>) {
         $options{$option} = $value;
     }
     open (TMP, '> tmp.pod') or die "Cannot create tmp.pod: $!\n";
+    eval { binmode (\*TMP, ':encoding(utf-8)') };
+    print TMP "=encoding utf-8\n\n";
     while (<DATA>) {
         last if $_ eq "###\n";
         print TMP $_;
@@ -45,10 +48,12 @@ while (<DATA>) {
     close TMP;
     my $parser = Pod::Man->new (%options) or die "Cannot create parser\n";
     open (OUT, '> out.tmp') or die "Cannot create out.tmp: $!\n";
+    eval { binmode (\*OUT, ':encoding(utf-8)') };
     $parser->parse_from_file ('tmp.pod', \*OUT);
     close OUT;
     my $accents = 0;
     open (TMP, 'out.tmp') or die "Cannot open out.tmp: $!\n";
+    eval { binmode (\*TMP, ':encoding(utf-8)') };
     while (<TMP>) {
         $accents = 1 if /Accent mark definitions/;
         last if /^\.nh/;
@@ -123,7 +128,7 @@ This is S<non-breaking output>.
 ###
 .SH "S<> output with UTF\-8"
 .IX Header "S<> output with UTF-8"
-This is non\-breakingĀ output.
+This is non-breakingĀ output.
 ###
 
 ###

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>
