perl.perl5.porters | Postings from March 2014

Re: perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)

From: Aristotle Pagaltzis
Date: March 26, 2014 09:59
Subject: Re: perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)
Message ID: 20140326095950.GA87730@plasmasturm.org

* Craig A. Berry <craig.a.berry@gmail.com> [2014-03-25 18:45]:
> If it's as broken by design as you imply, I wonder why it works.

It seems to work because if you try to output a string containing "\x{100}"
or beyond (which is impossible to represent in a single byte), perl will
try to keep trucking instead of just throwing an error, and in the absence
of anything better to do, it outputs the UTF-8 representation of the string.

I.e. "\x80\x{20AC}" will be output as 2 + 3 bytes.

But if your string does not contain such characters, then regardless of
whether its UTF8 flag is on or off, any "\x80" thru "\xFF" characters in it
will always be output as single bytes.

I.e. "\x80\xFF" will be output as 1 + 1 bytes.

Which are not going to be valid UTF-8.
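
To make the two cases concrete, here is a minimal sketch that prints each
string to a handle with no encoding layer and counts the bytes that come
out. The in-memory handle and the helper names `bytes_written` and
`hex_dump` are my own choices for illustration, and the "Wide character in
print" warning that perl would normally emit is silenced on purpose.

```perl
use strict;
use warnings;

# Print a string to an in-memory handle that has no encoding layer
# and return the bytes that were actually written.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:raw', \$buffer
        or die "cannot open in-memory handle: $!";
    {
        no warnings 'utf8';    # silence "Wide character in print" for the demo
        print {$fh} $string;
    }
    close $fh;
    return $buffer;
}

# Render a byte string as space-separated hex pairs.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $wide   = bytes_written("\x80\x{20AC}");   # contains a character above 0xFF
my $narrow = bytes_written("\x80\xFF");       # all characters fit in one byte

my $flagged = "\x80\xFF";
utf8::upgrade($flagged);                      # same characters, UTF8 flag on
my $upgraded = bytes_written($flagged);

printf "%d bytes: %s\n", length $wide,     hex_dump($wide);
printf "%d bytes: %s\n", length $narrow,   hex_dump($narrow);
printf "%d bytes: %s\n", length $upgraded, hex_dump($upgraded);
```

The first line should show five bytes (C2 80 E2 82 AC). The other two
should show two bytes each (80 FF), whatever the state of the UTF8 flag.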

There is a class of broken Perl programs which, under this arrangement,
happen to be broken in just the right way that the breakage is rarely or
possibly even never apparent. (But if you think your code falls in the
“never” category, you are almost certainly mistaken.)

> Everything is read in raw and written out raw *except* the message
> body, which may have been generated by any random editor on any
> platform. For that I attempt to infer the locale encoding and read
> into a handle decoding that specific encoding (which may not be
> working right per Zefram though I haven't confirmed that yet).

That’s fine.

> Since I'm decoding on input, the data at that point have been
> converted to Perl's internal encoding, which as far as I know is UTF-8
> (or a lax variant of it) when it needs to be. Then I write it out raw
> and say in the MIME header that it's UTF-8. Yes, I'm intentionally
> losing the encoding on the output handle. I pretty much have to since
> that handle may get data in multiple unknown encodings.

Using a raw handle is fine.

Printing decoded strings to it is not.

You are missing the encoding step. You don’t need to put an encoding
layer on the handle to do that; you can encode explicitly instead, so
long as you make sure every decoded string gets encoded at some stage
before output.

This will turn a string that contains "\x{20AC}" into one that contains
the 3-character sequence "\xE2\x82\xAC" representing the corresponding
octets, which are what you want to output.
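
A minimal sketch of that explicit encoding step, assuming Encode.pm's
`encode` and a raw handle. The sample string and the in-memory handle
stand in for the real message body and output handle, which I have not
seen.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical decoded message body: 13 characters, two of them non-ASCII.
my $decoded = "caf\x{E9} costs 3\x{20AC}";

# Encode explicitly; the result is a string of octets, one per character.
my $octets = encode('UTF-8', $decoded);

# A raw handle (here an in-memory one) now receives exactly those octets,
# and perl has no reason to warn about wide characters.
my $out = '';
open my $fh, '>:raw', \$out or die "cannot open in-memory handle: $!";
print {$fh} $octets;
close $fh;

printf "decoded: %d characters\n", length $decoded;
printf "written: %d bytes\n",      length $out;
```

This should report 13 characters and 16 bytes: the "\x{E9}" is written as
two bytes and the "\x{20AC}" as three, so the output is valid UTF-8 in
both cases.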

These are different strings, even if the bytes in the PV buffer are the
same: length() will return different values, what . matches in a regexp
will differ, etc. Meanwhile, how perl represents the single character
U+20AC internally should be no concern of yours: your code should not
start behaving differently if perl were to switch its internal
representation to UTF-16 (yeah, right) or UTF-32/NFG (maybe someday?),
say.
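
The difference between the two strings can be seen from Perl code without
looking at the internals at all. A small sketch, with variable names of my
own choosing:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars  = "\x{20AC}";                 # decoded: one character
my $octets = encode('UTF-8', $chars);    # encoded: three characters (octets)

# Count how many times . matches in each string.
my $chars_dots  = () = $chars  =~ /./gs;
my $octets_dots = () = $octets =~ /./gs;

printf "length: %d vs %d\n", length $chars, length $octets;
printf "dots:   %d vs %d\n", $chars_dots, $octets_dots;
printf "equal:  %s\n", $chars eq $octets ? 'yes' : 'no';
```

Both comparisons should come out as 1 vs 3, and the strings should not
compare equal.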

I don’t know if you do anything with the strings after they are decoded.
If not, you can just transcode them from the locale encoding straight to
the UTF-8 encoding using Encode.pm’s `from_to` instead of decoding and
only later encoding them. That may or may not be easier.
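
A sketch of the `from_to` route. ISO-8859-15 is only a stand-in here for
whatever locale encoding gets inferred, and the one-byte input is made up
for the example. Note that `from_to` works on octets and converts the
string in place.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical raw input: the euro sign as a single ISO-8859-15 byte.
my $data = "\xA4";

# Transcode in place from the (assumed) locale encoding to UTF-8.
# Returns the length of the new octet string, or undef on failure.
my $new_length = from_to($data, 'iso-8859-15', 'UTF-8');
defined $new_length or die "transcoding failed";

printf "%d bytes: %s\n", $new_length,
    join ' ', map { sprintf '%02X', ord } split //, $data;
```

Afterwards `$data` should hold the three octets E2 82 AC, ready to be
printed to a raw handle.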

> I also don't specify character encodings in the MIME headers on the
> attachment(s) since there is no way to know what they are. But that's
> not a problem since the main use case is patches created by
> git-format-patch and git seems happy as long as we don't mangle
> anything in transport.

That’s fine.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
