perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)

Craig A. Berry
March 25, 2014 17:44
On Tue, Mar 25, 2014 at 8:41 AM, Eric Brine wrote:
> On Tue, Mar 25, 2014 at 8:45 AM, Craig A. Berry wrote:
>> On Tue, Mar 25, 2014 at 12:14 AM, Eric Brine wrote:
>> > On Mon, Mar 24, 2014 at 12:08 PM, Zefram wrote:
>> >>
>> >> Ricardo Signes wrote:
>> >> >3.  "Make perlbug Unicode-aware" broke perlbug on Win32
>> >> >
>> >>
>> >> I think there's a bug in the Unicode-awareness patch.
>> >
>> >
>> > And I have a third. It's printing decoded text to a binary handle.
>> >
>> > open(REP, '>:raw', $filename)
>> >     or die "Unable to create report file '$filename': $!\n";
>> >
>> > open(F, "<:$input_encoding", $file)
>> >     or die "Unable to read report file from '$file': $!\n";
>> > while (<F>) {
>> >     print REP $_
>> > }
>> Maybe it's obvious to everyone else, but I'd appreciate it if you
>> could explain to me what the bug is.
> When I say the output is a "binary handle", I mean one that only accepts
> bytes. But the input handle is providing strings of Unicode code points.
> This mismatch is a bug. There's a decoding layer too many, or there's a
> missing encoding layer.
> In practice:
> * You could input "\x{00E9}" (é), which would be output as "\xE9".
> * You could input "\x{20AC}" (EURO), which would be output as "\xE2\x82\xAC"
> and a warning.
> * You could input "\x{00E9}\x{20AC}" (é EURO), which would be output as
> "\xC3\xA9\xE2\x82\xAC" and a warning.

Why would they warn when written to a :raw handle?
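
For reference, the three cases quoted above can be reproduced with a small stand-alone script. This is only a sketch, not code from perlbug: `raw_print` is a hypothetical helper, and an in-memory handle stands in for the report file. It counts warnings through `$SIG{__WARN__}` and dumps the bytes that reach the `:raw` handle.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory :raw handle, then return
# the bytes written (as hex) and the number of warnings raised on the way.
sub raw_print {
    my ($text) = @_;
    utf8::upgrade($text);    # mimic a string that came through a decoding layer
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buf = '';
    open(my $out, '>:raw', \$buf) or die "Cannot open in-memory handle: $!";
    print {$out} $text;
    close($out);
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $buf;
    return ($hex, scalar @warnings);
}

for my $text ("\x{00E9}", "\x{20AC}", "\x{00E9}\x{20AC}") {
    my ($hex, $warned) = raw_print($text);
    printf "%-16s warnings: %d\n", $hex, $warned;
}
```

On a typical perl this should show `E9` with no warning, then `E2 82 AC` and `C3 A9 E2 82 AC` with one "Wide character in print" warning each. That warning is in the default-on `utf8` category and is raised by `print` whenever a character above 0xFF goes to a handle without an encoding layer.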

> What encoding is expected?
> * If you're expecting UTF-8, you have invalid UTF-8 in some cases and
> warnings in others.
> * If you're expecting iso-8859-1, you have improperly encoded characters in
> some cases, and you are warned about it.
> * If you're expecting another encoding, you have improperly encoded
> characters in some cases, and you are sometimes warned about it.

I did quite a bit of testing with both the euro character and egrave
in the message body (entered with vim in UTF-8) and the current
implementation passes them through fine and produces no warnings.  If
it's as broken by design as you imply, I wonder why it works.

>> The primary goal was to
>> construct a mail message with multiple attachments having potentially
>> multiple encodings, each potentially different from the encoding of
>> the message text.
> And you want to keep them as is? Then you want a binary input handle, and
> you want to set the appropriate charset header if you don't already.

That's exactly what it does. Everything is read in raw and written out
raw *except* the message body, which may have been generated by any
random editor on any platform.  For that I attempt to infer the locale
encoding and read into a handle decoding that specific encoding
(which may not be working right per Zefram though I haven't confirmed
that yet).
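
A rough sketch of what "infer the locale encoding" can look like, for readers who have not seen the patch. This is not the perlbug code: `guess_locale_encoding` is a hypothetical helper, and the fallback to UTF-8 is an assumption made for this illustration. It asks `I18N::Langinfo` for the locale's codeset and only accepts a name that `Encode` recognises.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: guess the encoding of text typed under the current
# locale, falling back to UTF-8 (an assumption) when nothing usable is found.
sub guess_locale_encoding {
    my $codeset = eval {
        require POSIX;
        require I18N::Langinfo;
        POSIX::setlocale(POSIX::LC_CTYPE(), '');    # adopt the user's locale
        I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
    };
    my $enc = (defined $codeset && length $codeset)
        ? Encode::find_encoding($codeset)
        : undef;
    return $enc ? $enc->name : 'UTF-8';
}

my $input_encoding = guess_locale_encoding();
print "Would read the message body with <:encoding($input_encoding)\n";
```

The result depends entirely on the environment the script runs in, which is why a fallback is needed at all: `langinfo` is not available on every platform, and a locale may report a codeset name that `Encode` does not know.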

Since I'm decoding on input, the data at that point have been
converted to Perl's internal encoding, which as far as I know is UTF-8
(or a lax variant of it) when it needs to be.  Then I write it out raw
and say in the MIME header that it's UTF-8.  Yes, I'm intentionally
losing the encoding on the output handle.  I pretty much have to since
that handle may get data in multiple unknown encodings.
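
For comparison, here is a minimal sketch of the alternative raised in the quoted text: keep the output handle raw, but encode the decoded body explicitly before printing it. It is an illustration under stated assumptions, not what perlbug does. The ISO-8859-1 input and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the body arrived as ISO-8859-1 bytes, so "é" is the byte E9.
my $body_bytes = "caf\xE9\n";
my $body_chars = decode('ISO-8859-1', $body_bytes);    # string of code points

# Encode explicitly, so the bytes written always match a "charset=UTF-8"
# declaration no matter which characters the body happens to contain.
my $utf8_bytes = encode('UTF-8', $body_chars);

my $report = '';
open(my $rep, '>:raw', \$report) or die "Cannot open in-memory handle: $!";
print {$rep} $utf8_bytes;    # raw attachments could still be printed as-is
close($rep);

print unpack('H*', $report), "\n";    # hex dump of what was written
```

Because only the body is passed through `encode`, the handle itself carries no layer and can still receive attachment data in other encodings untouched.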

I also don't specify character encodings in the MIME headers on the
attachment(s) since there is no way to know what they are.  But that's
not a problem since the main use case is patches created by
git-format-patch and git seems happy as long as we don't mangle
anything in transport.

There's some more rationale in the commit message:


Thanks for the input.  I'm glad this is (at last) getting the review it needs.
