perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)
From:
Craig A. Berry
Date:
March 25, 2014 17:44
Subject:
perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)
Message ID:
CA+vYcVwmRn4piftEsCrvELLcxfw6J0zLTTN1L5-4aAm_pJ0d2w@mail.gmail.com
On Tue, Mar 25, 2014 at 8:41 AM, Eric Brine <ikegami@adaelis.com> wrote:
> On Tue, Mar 25, 2014 at 8:45 AM, Craig A. Berry <craig.a.berry@gmail.com>
> wrote:
>>
>> On Tue, Mar 25, 2014 at 12:14 AM, Eric Brine <ikegami@adaelis.com> wrote:
>> > On Mon, Mar 24, 2014 at 12:08 PM, Zefram <zefram@fysh.org> wrote:
>> >>
>> >> Ricardo Signes wrote:
>> >> >3. "Make perlbug Unicode-aware" broke perlbug on Win32
>> >> > https://rt.perl.org/Ticket/Display.html?id=121277
>> >>
>> >> I think there's a bug in the Unicode-awareness patch.
>> >
>> >
>> > And I have a third. It's printing decoded text to a binary handle.
>> >
>> > open(REP, '>:raw', $filename)
>> >     or die "Unable to create report file '$filename': $!\n";
>> >
>> > open(F, "<:$input_encoding", $file)
>> >     or die "Unable to read report file from '$file': $!\n";
>> > while (<F>) {
>> >     print REP $_
>> > }
>>
>> Maybe it's obvious to everyone else, but I'd appreciate it if you
>> could explain to me what the bug is.
>
>
> When I say the output is a "binary handle", I mean one that only accepts
> bytes. But the input handle is providing strings of Unicode code points.
> This mismatch is a bug. There's a decoding layer too many, or there's a
> missing encoding layer.
>
> In practice:
>
> * You could input "\x{00E9}" (é), which would be output as "\xE9".
>
> * You could input "\x{20AC}" (EURO), which would be output as "\xE2\x82\xAC"
> and a warning.
>
> * You could input "\x{00E9}\x{20AC}" (é EURO), which would be output as
> "\xC3\xA9\xE2\x82\xAC" and a warning.
Why would they warn when written to a :raw handle?
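For anyone who wants to check the three quoted cases directly, here is a minimal sketch. It is not code from perlbug: the helper name write_raw and the use of an in-memory handle in place of the report file are assumptions made for illustration. It prints each string to a handle opened with :raw and reports the bytes that arrive and how many warnings were raised.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of perlbug): print a decoded string to an
# in-memory handle opened with :raw, then return the bytes that were
# written and the number of warnings raised while printing.
sub write_raw {
    my ($text) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    my $buffer = '';
    open(my $out, '>:raw', \$buffer)
        or die "Cannot open in-memory handle: $!\n";
    print {$out} $text;
    close($out);

    return ($buffer, scalar @warnings);
}

# The three inputs from the quoted list.
for my $text ("\x{00E9}", "\x{20AC}", "\x{00E9}\x{20AC}") {
    my ($bytes, $warned) = write_raw($text);
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    printf "%-16s warnings: %d\n", $hex, $warned;
}
```

The warning in question is "Wide character in print". It is only visible when the utf8 warning category is enabled in the scope that does the printing, which may be part of why it is easy to miss in testing.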
> What encoding is expected?
>
> * If you're expecting UTF-8, you have invalid UTF-8 in some cases and
> warnings in others.
>
> * If you're expecting iso-8859-1, you have improperly encoded characters in
> some cases, and you are warned about it.
>
> * If you're expecting another encoding, you have improperly encoded
> characters in some cases, and you are sometimes warned about it.
I did quite a bit of testing with both the euro character and egrave
in the message body (entered with vim in UTF-8) and the current
implementation passes them through fine and produces no warnings. If
it's as broken by design as you imply, I wonder why it works.
>> The primary goal was to
>> construct a mail message with multiple attachments having potentially
>> multiple encodings, each potentially different from the encoding of
>> the message text.
>
>
> And you want to keep them as is? Then you want a binary input handle, and
> you want to set the appropriate charset header if you don't already.
That's exactly what it does. Everything is read in raw and written out
raw *except* the message body, which may have been generated by any
random editor on any platform. For that I attempt to infer the locale
encoding and read into a handle decoding that specific encoding
(which may not be working right per Zefram though I haven't confirmed
that yet).
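As a rough illustration of what "infer the locale encoding" can look like, here is a sketch. It is not the code in perlbug. The function name guess_input_encoding and the UTF-8 fallback are assumptions, and I18N::Langinfo is loaded inside an eval because it is not available on every platform.

```perl
use strict;
use warnings;
use Encode ();
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: ask the locale for its codeset and map it to an
# encoding name that Encode recognises. Falls back to UTF-8 (an assumed
# default) when the codeset is unavailable or unknown to Encode.
sub guess_input_encoding {
    # Adopt the user's locale settings from the environment.
    setlocale(LC_CTYPE, '');

    my $codeset = eval {
        require I18N::Langinfo;
        I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
    };

    my $enc = (defined $codeset && length $codeset)
        ? Encode::find_encoding($codeset)
        : undef;

    return $enc ? $enc->name : 'UTF-8';
}

print guess_input_encoding(), "\n";
```

The returned name can then be used in an :encoding(...) layer when opening the message body for reading.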
Since I'm decoding on input, the data at that point have been
converted to Perl's internal encoding, which as far as I know is UTF-8
(or a lax variant of it) when it needs to be. Then I write it out raw
and say in the MIME header that it's UTF-8. Yes, I'm intentionally
losing the encoding on the output handle. I pretty much have to since
that handle may get data in multiple unknown encodings.
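For comparison, here is a sketch of the alternative that keeps the output handle raw but makes the UTF-8 step explicit, so the result does not depend on how Perl happens to store the string internally. This is an illustration only, not what perlbug does: the function name copy_body_as_utf8 and its arguments are assumptions.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: read a message body stored in $input_encoding and
# write it as UTF-8 bytes to $out, an already opened byte-oriented handle.
# Other parts of the message can still be written to $out untouched.
sub copy_body_as_utf8 {
    my ($in_path, $input_encoding, $out) = @_;

    # Decode on input: each line read is a string of code points.
    open(my $in, "<:encoding($input_encoding)", $in_path)
        or die "Unable to read report file from '$in_path': $!\n";

    while (my $line = <$in>) {
        # Encode explicitly, so only bytes ever reach the raw handle.
        print {$out} encode('UTF-8', $line);
    }
    close($in);
    return;
}
```

Because every string is encoded before it is printed, the raw handle never sees a character above 0xFF, and a body containing only Latin-1 characters such as "\x{00E9}" still comes out as valid UTF-8.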
I also don't specify character encodings in the MIME headers on the
attachment(s) since there is no way to know what they are. But that's
not a problem since the main use case is patches created by
git-format-patch and git seems happy as long as we don't mangle
anything in transport.
There's some more rationale in the commit message:
<http://perl5.git.perl.org/perl.git/commitdiff/092c3affc299403d8cc5278d27c9961bca81efd6>
Thanks for the input. I'm glad this is (at last) getting the review it needs.