develooper Front page | perl.perl5.porters | Postings from March 2014

Re: perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)

From: Craig A. Berry
Date: March 29, 2014 22:43
Subject: Re: perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)
Message ID: CA+vYcVxhw0ba0+68Yke7hkZ6mOxqVfYFcHunesxOZRpO7of_9A@mail.gmail.com

On Wed, Mar 26, 2014 at 4:59 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:

> Using a raw handle is fine.
>
> Printing decoded strings to it is not.
>
> You are missing the encoding step.

Thanks for the explanation.  I'm probably beyond hope, but I
appreciate the effort :-).

Of course no such thing as a "decoded string" actually exists.  Every
decoding is another encoding.  Simply by virtue of being represented
in computer memory, strings have to be encoded in some way.  Even if
each character is represented by a 21-bit integer containing the value
of its Unicode code point, that is a form of encoding.  Various docs
say Perl's internal encoding is UTF-8, but also say not to depend on
that.  I believed the former but not the latter.  My bad.
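
To make the "missing encoding step" from the quoted text concrete, here
is a minimal sketch.  It assumes the Encode module is available and uses
UTF-8 purely as an example encoding; the variable names are made up for
illustration and are not taken from perlbug.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a UTF-8 file: "e acute" is 0xC3 0xA9.
my $bytes = "caf\xC3\xA9";

# Decoding turns the byte sequence into a string of characters.
my $chars = decode('UTF-8', $bytes);

# Encoding turns the characters back into bytes, which is the form
# that is safe to print to a handle opened with the :raw layer.
my $out = encode('UTF-8', $chars);

printf "bytes in: %d, characters: %d, bytes out: %d\n",
    length($bytes), length($chars), length($out);
```

The point of the sketch is only that the character string and the byte
string have different lengths, so one of them has to be chosen
deliberately before printing.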

Moving on to what to do with perlbug for 5.20.  The main reason to
specify layers on all the handles in perlbug was to ensure that
patches attached with the new -p option come through the wash ok even
if they have multiple encodings in them.  Using the :raw layer on both
input and output seems to accomplish that and I think this part is a
keeper.  It's probably a misnomer to call it "unicode awareness"; it
might be more proper to say we're making perlbug encoding-agnostic.
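
As a rough illustration of what "encoding-agnostic" means here, the
following sketch copies a file through :raw handles on both sides.  The
helper name copy_raw and the sample data are hypothetical and do not
come from perlbug itself; the only claim is that bytes that go in come
out unchanged, even when two encodings and a CRLF are mixed together.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: copy a file byte for byte, with no layer
# translating encodings or line endings on either side.
sub copy_raw {
    my ($src, $dst) = @_;
    open my $in,  '<:raw', $src or die "Can't read $src: $!";
    open my $out, '>:raw', $dst or die "Can't write $dst: $!";
    local $/ = \65536;    # read fixed-size chunks rather than lines
    while (my $chunk = <$in>) {
        print {$out} $chunk or die "Can't write to $dst: $!";
    }
    close $out or die "Can't close $dst: $!";
    close $in;
    return;
}

# A "patch" that mixes Latin-1 (0xE9) and UTF-8 (0xC3 0xA9) bytes
# and ends with a CRLF.
my $mixed = "caf\xE9 vs caf\xC3\xA9\r\n";

my ($fh, $src) = tempfile(UNLINK => 1);
binmode $fh, ':raw';
print {$fh} $mixed;
close $fh or die "Can't close $src: $!";

my (undef, $dst) = tempfile(UNLINK => 1);
copy_raw($src, $dst);

open my $check, '<:raw', $dst or die "Can't read $dst: $!";
my $copied = do { local $/; <$check> };
close $check;

print $copied eq $mixed ? "bytes preserved\n" : "bytes changed\n";
```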

Somewhat as an afterthought, it seemed like it might be nice if we
could handle more than ASCII in the message body as well. We could
spell people's names correctly, and pasted-in code samples and output
from code samples might actually look as intended. Somehow I got it
into my head that in the case of a prepared report supplied with the
-f option (or by having the filename typed in response to a prompt) we
could not be encoding-agnostic and would have to know the input
encoding and convert it into a specified output encoding.  I now think
this whole idea was a mistake (even aside from my implementation
mistakes) and we should scrap it, at least for now.

Guessing the input encoding is the tricky part.  I was attempting to
use encoding::_get_locale_encoding().  Aside from being a private
method of a deprecated pragma, it depends on the locale being set up
properly and on whatever program created the file having observed
the locale setting.  As I understand it, pretty much no program on
Windows will do that.  On any platform, there is no reason to assume
the report file was created on the same system as the one running
perlbug.  And if the file was created in a text editor, any number of
editor defaults and/or user preferences could cause it to be in some
encoding other than what the locale specifies.
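
A small sketch of why guessing is hard, assuming the Encode module is
available: the same bytes decode without complaint under several
encodings, and nothing in the file says which reading was intended.
The sample word and the three candidate encodings are arbitrary
choices for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "naive" with a diaeresis on the i, as a UTF-8 editor would save it.
my $bytes = "na\xC3\xAFve";

# All three calls succeed; only one matches what the author meant.
my $as_utf8   = decode('UTF-8',      $bytes);
my $as_latin1 = decode('ISO-8859-1', $bytes);
my $as_cp1252 = decode('cp1252',     $bytes);

printf "%-10s %d characters\n", 'UTF-8',      length $as_utf8;
printf "%-10s %d characters\n", 'ISO-8859-1', length $as_latin1;
printf "%-10s %d characters\n", 'cp1252',     length $as_cp1252;
```

A locale setting would pick one of these readings, but it cannot tell
us which one the file's author actually used.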

So I think we should stop pretending that we can reasonably guess the
encoding and instead focus on passing things through without mangling
them. I have pushed the branch craigb/perlbug_encoding_fixup which
takes a stab at this and also a rather blind swing at the CRLF
expectations for die-hard users of Notepad.  I have not tested this
branch at all and must urgently return to several other neglected
obligations so I'm not sure when I'll be able to.  But it's the best I
can offer at the moment as an alternative path forward.

P.S.   If someone wants to write a robust general-purpose encoding
detector and include it in perlbug, please go ahead, but be sure to
make it degrade nicely under miniperl when the Encode module is not
available.
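
For what "degrade nicely" might look like, here is a minimal sketch of
the fallback pattern only.  It is emphatically not a robust detector:
the helper name guess_encoding is hypothetical, and the check merely
asks whether the bytes are pure ASCII or valid UTF-8.  The part that
matters is that Encode is loaded inside an eval, so the code still
runs when the module is missing.

```perl
use strict;
use warnings;

# Try to load Encode at run time; if that fails (as it may under
# miniperl), carry on without it instead of dying.
my $have_encode = eval { require Encode; 1 } ? 1 : 0;

# Hypothetical helper: return a best guess, or undef when no guess
# can be made (including when Encode is unavailable).
sub guess_encoding {
    my ($bytes) = @_;
    return unless $have_encode;
    return 'ascii' unless $bytes =~ /[^\x00-\x7F]/;

    # decode() with a CHECK argument may modify its input, so work
    # on a copy.
    my $copy = $bytes;
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 'UTF-8' : undef;
}

my $guess = guess_encoding("caf\xC3\xA9");
print defined $guess ? "guess: $guess\n" : "no guess, passing bytes through\n";
```

When no guess is available, the caller would simply fall back to
passing the bytes through untouched.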
