develooper Front page | perl.perl5.porters | Postings from March 2014

Re: perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)

From: Eric Brine
Date: March 25, 2014 19:07
Subject: Re: perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)
Message ID: CALJW-qGebfUbV11ZYbD93hRcaKjRTnw6rWodqRONwYKcx=8w6w@mail.gmail.com

On Tue, Mar 25, 2014 at 1:44 PM, Craig A. Berry <craig.a.berry@gmail.com> wrote:

> > * You could input "\x{00E9}\x{20AC}" (é EURO), which would be output as
> > "\xC3\xA9\xE2\x82\xAC" and a warning.
>
> Why would they warn when written to a :raw handle?
>

Illegal input. 20AC is outside the range 0-255, so you get "Wide
character in print".
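
A minimal sketch of that behaviour, assuming an in-memory handle behaves like a file opened with :raw (raw_print is a hypothetical helper name, not anything from perlbug):

```perl
use strict;
use warnings;

# Print a string to an in-memory :raw handle and return the bytes that
# were written, followed by any warnings raised along the way.
sub raw_print {
    my ($string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buf = '';
    open(my $fh, '>:raw', \$buf) or die "Can't open in-memory handle: $!";
    print {$fh} $string;
    close($fh);
    return ($buf, @warnings);
}

# U+00E9 fits in a byte, but U+20AC does not.
my ($bytes, @warned) = raw_print("\x{00E9}\x{20AC}");
print join(' ', map { sprintf '%02x', ord } split //, $bytes), "\n";
print "warning: $_" for @warned;
```

With only characters in the range 0-255, such as "\x{00E9}" alone, the same helper should return the single byte e9 and no warning.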


>
> > What encoding is expected?
> >
> > * If you're expecting UTF-8, you have invalid UTF-8 in some cases and
> > warnings in others.
> >
> > * If you're expecting iso-8859-1, you have improperly encoded characters
> > in some cases, and you are warned about it.
> >
> > * If you're expecting another encoding, you have improperly encoded
> > characters in some cases, and you are sometimes warned about it.
>
> I did quite a bit of testing with both the euro character and egrave
> in the message body (entered with vim in UTF-8) and the current
> implementation passes them through fine and produces no warnings.  If
> it's as broken by design as you imply, I wonder why it works.
>

Encoding errors often cancel each other out.
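
One way this can happen, as a sketch only (this is an assumed scenario, not necessarily what occurred in the test described above; round_trip_mislabelled is a hypothetical name): UTF-8 bytes are decoded with the wrong encoding, then printed without any encoding, and the original bytes come out unchanged.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode UTF-8 bytes as if they were iso-8859-1 (first error), then print
# the result to a :raw handle without encoding it (second error).
sub round_trip_mislabelled {
    my ($utf8_bytes) = @_;
    my $chars = decode('iso-8859-1', $utf8_bytes);  # every byte becomes one code point < 256
    my $out = '';
    open(my $fh, '>:raw', \$out) or die "Can't open in-memory handle: $!";
    print {$fh} $chars;                             # code points < 256 are written as single bytes
    close($fh);
    return $out;
}

my $input  = "\xC3\xA9\xE2\x82\xAC";                # UTF-8 for U+00E9 U+20AC
my $output = round_trip_mislabelled($input);
print $output eq $input ? "bytes unchanged\n" : "bytes changed\n";
```

The output looks right only because the two mistakes undo each other; anything that inspects the string in between (length, regex matching, case folding) sees five characters instead of two.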

> >> The primary goal was to
> >> construct a mail message with multiple attachments having potentially
> >> multiple encodings, each potentially different from the encoding of
> >> the message text.
> >
> >
> > And you want to keep them as is? Then you want a binary input handle, and
> > you want to set the appropriate charset header if you don't already.
>
> That's exactly what it does. Everything is read in raw and written out
> raw *except* the message body, which may have been generated by any
> random editor on any platform.


The handle we're talking about is opened as follows:

open(F, "<:$input_encoding", $file)

If the variable name isn't misleading, that's not a raw file handle, so I
guess we aren't talking about attachments.

> Since I'm decoding on input, the data at that point have been
> converted to Perl's internal encoding, which as far as I know is UTF-8
> (or a lax variant of it) when it needs to be.


You're thinking of it all wrong. Decoding isn't an internal change;
it is the process of changing a string of bytes with
encoding-specific meanings into a string of Unicode code points. After
decoding, a string may contain chr(0x00E9) for é, and chr(0x20AC) for the
Euro symbol.
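
As an illustration (code_points is a hypothetical helper, and the two byte strings are assumed sample inputs), decoding the same text from two different encodings yields the same code points:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a byte string and list the resulting code points.
sub code_points {
    my ($encoding, $bytes) = @_;
    my $chars = decode($encoding, $bytes);
    return map { sprintf 'U+%04X', ord } split //, $chars;
}

# "é" followed by the Euro sign, as UTF-8 bytes and as cp1252 bytes.
print join(' ', code_points('UTF-8',  "\xC3\xA9\xE2\x82\xAC")), "\n";  # U+00E9 U+20AC
print join(' ', code_points('cp1252', "\xE9\x80")), "\n";              # U+00E9 U+20AC
```

Once decoded, the string carries no trace of which encoding the bytes originally used.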

The only way you see that internal encoding in Perl code is if there's a
bug involved. It's not relevant here.


> Then I write it out raw and say in the MIME header that it's UTF-8.


You're printing code points, not their UTF-8 encoding. Do you think that
printing to a raw handle prints the internal representation of a string?
Because that's not true.

$ perl -MEncode -e'
    binmode STDOUT, ":raw";
    my $input_encoding = "cp1252";
    my $cp1252_text = "\xC9ric";
    print decode($input_encoding, $cp1252_text);
  ' | od -t x1
0000000 c9 72 69 63
0000004

You say you're outputting UTF-8, but that's not valid UTF-8!
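
For comparison, a sketch of what producing real UTF-8 could look like, assuming the text is encoded explicitly before it reaches the raw handle (to_utf8_bytes is a hypothetical name, not something perlbug defines):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode from the input encoding, then explicitly encode to UTF-8, so the
# result is a string of bytes that is safe to print to a :raw handle.
sub to_utf8_bytes {
    my ($input_encoding, $bytes) = @_;
    return encode('UTF-8', decode($input_encoding, $bytes));
}

my $utf8 = to_utf8_bytes('cp1252', "\xC9ric");
print join(' ', map { sprintf '%02x', ord } split //, $utf8), "\n";  # c3 89 72 69 63
```

An alternative with the same effect would be to put an :encoding(UTF-8) layer on the output handle and print the decoded string directly.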
