perl.perl5.porters | Postings from March 2014

Re: perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)

From: Leon Timmermans
Date: March 25, 2014 19:06
Subject: Re: perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)
Message ID: CAHhgV8ia2s-tk4RELzdw8fCiF9xJ4_Akq5vyn5wCBAQF3taf+w@mail.gmail.com

On Tue, Mar 25, 2014 at 6:44 PM, Craig A. Berry <craig.a.berry@gmail.com> wrote:

> > When I say the output is a "binary handle", I mean one that only accepts
> > bytes. But the input handle is providing strings of Unicode code points.
> > This mismatch is a bug. There's a decoding layer too many, or there's a
> > missing encoding layer.
> >
> > In practice:
> >
> > * You could input "\x{00E9}" (é), which would be output as "\xE9".
> >
> > * You could input "\x{20AC}" (EURO), which would be output as
> > "\xE2\x82\xAC" and a warning.
> >
> > * You could input "\x{00E9}\x{20AC}" (é EURO), which would be output as
> > "\xC3\xA9\xE2\x82\xAC" and a warning.
>
> Why would they warn when written to a :raw handle?
>

Because the output is now suddenly in mixed encodings, which tends to be
unparsable because the reader can't know how to decode it.
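
The three cases quoted above can be reproduced with a small sketch. It
assumes an in-memory handle opened with :raw as a stand-in for the real
output handle; write_raw is a hypothetical helper name, not anything in
perlbug.

```perl
use strict;
use warnings;

# Hypothetical helper: print a character string to a byte-only (:raw)
# in-memory handle, returning the bytes written and the number of
# warnings raised while printing.
sub write_raw {
    my ($string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    my $buffer = '';
    open my $out, '>:raw', \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$out} $string;
    close $out;

    return ($buffer, scalar @warnings);
}

for my $input ("\x{00E9}", "\x{20AC}", "\x{00E9}\x{20AC}") {
    my ($bytes, $warned) = write_raw($input);
    # Show the bytes as hex so the encoding that was used is visible.
    printf "%-12s -> %-16s warnings: %d\n",
        join(' ', map { sprintf 'U+%04X', ord } split //, $input),
        join(' ', map { sprintf '%02X', ord } split //, $bytes),
        $warned;
}
```

The first string comes out as a single Latin-1 byte, the other two come
out as UTF-8 with a "Wide character" warning, so the same code point
(U+00E9) is written two different ways depending on what else is in the
string.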


> >> The primary goal was to
> >> construct a mail message with multiple attachments having potentially
> >> multiple encodings, each potentially different from the encoding of
> >> the message text.
> >
> >
> > And you want to keep them as is? Then you want a binary input handle, and
> > you want to set the appropriate charset header if you don't already.
>
> That's exactly what it does. Everything is read in raw and written out
> raw *except* the message body, which may have been generated by any
> random editor on any platform.  For that I attempt to infer the locale
> encoding and read into a handle decoding that specific encoding
> (which may not be working right per Zefram though I haven't confirmed
> that yet).
>
> Since I'm decoding on input, the data at that point have been
> converted to Perl's internal encoding, which as far as I know is UTF-8
> (or a lax variant of it) when it needs to be.  Then I write it out raw
> and say in the MIME header that it's UTF-8.  Yes, I'm intentionally
> losing the encoding on the output handle.  I pretty much have to since
> that handle may get data in multiple unknown encodings.
>

Decoding on input but not encoding on output will give unpredictable output:
the bytes written depend on whether the string happens to contain any code
points above 0xFF.
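
One way to keep the shared output handle raw and still get predictable
bytes is to encode the body explicitly before printing it. This is only
a sketch: body_as_utf8_bytes is a hypothetical name, and the guessed
encoding is assumed to come from whatever locale detection is already
in place.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode the body from the guessed encoding, then
# encode it as UTF-8 so the bytes always match a charset=UTF-8 header.
sub body_as_utf8_bytes {
    my ($raw_body, $guessed_encoding) = @_;
    my $chars = decode($guessed_encoding, $raw_body);
    return encode('UTF-8', $chars);
}

# Assumed input: "cafe" with an e-acute, saved as ISO-8859-1 by some editor.
my $body = body_as_utf8_bytes("caf\xE9", 'ISO-8859-1');

my $message = '';
open my $out, '>:raw', \$message
    or die "Cannot open in-memory handle: $!";
print {$out} $body;    # already bytes, so no layer is needed and no warning
close $out;

print join(' ', map { sprintf '%02X', ord } split //, $message), "\n";
```

Printing the decoded string directly would have written the single
byte E9 here, which is not valid UTF-8 even though the header says it
is. Attachments can still go through the same handle untouched.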

> I also don't specify character encodings in the MIME headers on the
> attachment(s) since there is no way to know what they are.  But that's
> not a problem since the main use case is patches created by
> git-format-patch and git seems happy as long as we don't mangle
> anything in transport.
>

Sounds reasonable to me.

Leon
