perl.perl5.porters | Postings from April 2014

Re: perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)

From: Craig A. Berry
Date: April 2, 2014 20:28
Subject: Re: perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)
Message ID: CA+vYcVzFwaMWScmV3Xg=5H_BPdpR7tLY3PbA917MAZo3U+C_2Q@mail.gmail.com

On Sun, Mar 30, 2014 at 10:33 PM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> But decoding doesn't merely transcode the string to UTF-8 -
>
> It *also* flips a flag bit.
>
> That flag bit is responsible for drastically altering the way operators
> perceive the contents of the PV buffer: it makes them treat multibyte
> sequences as a single entity. Perl-land stops seeing the byte sequence
> stored in the PV buffer, and instead sees a sequence of codepoints,
> regardless of how many PV buffer bytes any one codepoint corresponds to.
>
> Note that decoding could conceivably also transcode to Latin-1 *without*
> flipping the bit in the case of an input string limited to the Latin-1
> charset. Then operations on the decoded string would yield the same
> results as they do when it is transcoded to UTF-8 *with* the flag bit
> set: it would be a sequence of codepoints, even though in this case,
> each codepoint always corresponds to one PV buffer byte.
>
> Now, any single codepoint may or may not be representable in a single
> byte. So if you need bytes, you have to ask Perl for a representation in
> terms of bytes, which is the encode step. What may happen here if the
> decoded string was stored as UTF-8 internally is that the encoded string
> will have the same bytes in its PV buffer.
>
> But, crucially, *its flag bit will be off*.
>
> This makes the encode step *not* a no-op - string operations will treat
> multibyte sequences in its PV buffer as multiple separate characters.
>
> So there's your difference:
>
> Encoding/decoding do not only transcode - they also preselect the code
> paths that will be taken when operating on the string, picking between
> two semantics. The act of decoding/encoding is your request to switch
> between those.
>
> Bottom line, it's a question of layers and their interfaces.
>
> Hope this helps.

It does.  If my ignorance and embarrassment ever lessen sufficiently
to tackle it, I would want to see any mention of the internal encoding
in the docs emphasize that it is *dynamic* and may give you encoded
form octets or code points depending on a lot of things.
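
For illustration, the decode/encode distinction described in the quoted text can be observed in a small script. This is only a sketch, assuming a full perl with Encode available (not miniperl). The sample string is made up, and utf8::is_utf8 is used purely to peek at the internal flag, which ordinary code should not rely on.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Three octets: the UTF-8 encoding of U+20AC EURO SIGN (sample data).
my $octets  = "\xE2\x82\xAC";
my $chars   = decode('UTF-8', $octets);   # one code point, flag on
my $encoded = encode('UTF-8', $chars);    # same bytes as $octets, flag off

# Report what Perl-land sees (length) and the internal flag for each string.
for my $pair ([octets => $octets], [chars => $chars], [encoded => $encoded]) {
    my ($name, $str) = @$pair;
    printf "%-8s length=%d flag=%s\n",
        $name, length($str), utf8::is_utf8($str) ? 'on' : 'off';
}
```

The decoded string has length 1 while both byte strings have length 3, even though the PV buffer of the decoded string may well hold the same three bytes.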

Back to perlbug.

>> Guessing the input encoding is the tricky part. I was attempting to
>> use encoding::_get_locale_encoding(). Aside from being a private
>> method of a deprecated pragma, it depends on the locale being set up
>> properly and whatever program that created the file having observed
>> the locale setting. As I understand it, pretty much no program on
>> Windows will do that. On any platform, there is no reason to assume
>> the report file was created on the same system as the one running
>> perlbug. And if the file was created in a text editor, any number of
>> editor defaults and/or user preferences could cause it to be in some
>> encoding other than what the locale specifies.
>>
>> So I think we should stop pretending that we can reasonably guess the
>> encoding and instead focus on passing things through without mangling
>> them.
>
> I'm afraid you cannot sidestep the problem in this way: the text/plain
> MIME type defaults to the US-ASCII charset unless otherwise specified,
> so if perlbug is to declare its main message as readable, it *will* be
> doing something about the encoding of the message, even if by omission.

Eek, you're right.  I wonder if there's even any point in saying the
content-transfer-encoding is 8bit if we're not specifying a charset.
We didn't specify any MIME headers before I added the attachment
capability, but there were also ASCII assumptions all the way through
perlbug.
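
As a rough sketch of what declaring the charset explicitly might look like, the snippet below builds the headers for a text/plain message body. The helper name text_message_headers is hypothetical and is not how perlbug assembles its message; only the header field names and the charset parameter come from MIME itself.

```perl
use strict;
use warnings;

# Hypothetical helper: headers for a text/plain body with an explicit charset,
# so that readers do not fall back to the US-ASCII default.
sub text_message_headers {
    my ($charset) = @_;
    return (
        'MIME-Version: 1.0',
        qq{Content-Type: text/plain; charset="$charset"},
        'Content-Transfer-Encoding: 8bit',
    );
}

print "$_\n" for text_message_headers('UTF-8');
```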

> Now encoding::_get_locale_encoding is not a good idea, I agree. But is
> there a reason I'm missing that Encode::Locale wouldn't be either? That
> seems like the answer, no?

It certainly looks like a new and improved version of the same basic
idea.  It's not in the core, so I don't see how perlbug could depend
on it.  I also think it sidesteps the question of whether detecting
the locale encoding is any use at all in guessing what any random
editor or user will actually do.

> Note that no matter what, it's imperative to allow the user to override
> perlbug's choice of input encoding, in case its guess is wrong. So it
> will need a switch for this purpose.

I like this.  It means someone with one of those ancient and awful
versions of Visual Studio that silently converts everything to UCS-2
would have a way to rescue a prepared report, as well as solving
various lesser problems.
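
A minimal sketch of such an override switch follows. The option name --encoding and the helper input_encoding_from_args are assumptions made for illustration; perlbug's real option handling may look quite different.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: pick the input encoding from command-line arguments,
# defaulting to UTF-8 when the user does not say otherwise.
sub input_encoding_from_args {
    my @args = @_;
    my $encoding = 'UTF-8';
    GetOptionsFromArray(\@args, 'encoding=s' => \$encoding)
        or die "usage: [--encoding NAME]\n";
    return $encoding;
}

print input_encoding_from_args('--encoding', 'UCS-2LE'), "\n";
print input_encoding_from_args(), "\n";
```

Someone rescuing a report that an editor silently saved as UCS-2 would pass the matching encoding name, and everyone else would get the default.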

>> P.S. If someone wants to write a robust general-purpose encoding
>> detector and include it in perlbug, please go ahead, but be sure to
>> make it degrade nicely under miniperl when the Encode module is not
>> available.
>
> Ouch.
>
> Is I18N::Langinfo available then at least?
>
> Well, nowadays it's not a terrible idea to just expect UTF-8, and leave
> it to the user to say otherwise if that's wrong. This might possibly
> even be done always, i.e. skipping Encode::Locale entirely even where
> available.

I'm leaning in this direction.  Assume the report (either prepared or
from the template we supply) is UTF-8 unless the user specifies a
different encoding.  If they do specify an encoding, it has to be one
supported by the Encode module and they have to be running a Perl that
can load that module (i.e., not miniperl).

If the real input encoding is cp-1252 and they don't tell us that, I
don't think they'll be any worse off than they are now, and it would
only matter if they were using 8-bit characters.
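
To make the cp1252 case concrete, here is a small sketch of what a wrong UTF-8 guess does, assuming Encode is available and its default CHECK behaviour is used. The sample text is invented.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "café au lait" as a cp1252 editor would write it (sample data).
my $cp1252_bytes = "caf\xE9 au lait";

# Wrong guess: \xE9 is not valid UTF-8 here, so Encode's default CHECK
# substitutes U+FFFD REPLACEMENT CHARACTER for it.
my $guessed = decode('UTF-8', $cp1252_bytes);

# What the user would get by naming the real encoding.
my $correct = decode('cp1252', $cp1252_bytes);

# Pure ASCII input is unaffected by the guess.
my $ascii = decode('UTF-8', 'plain ASCII');

printf "guessed: U+%04X at offset 3\n", ord(substr($guessed, 3, 1));
printf "correct: U+%04X at offset 3\n", ord(substr($correct, 3, 1));
```

Only the 8-bit character is damaged by the wrong guess. The ASCII portion of the report comes through intact either way.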

We'll write out the message body using the :utf8 layer and say in the
MIME headers that it's UTF-8.

That's the best design I can come up with at the moment, but I'm open
to suggestion.
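
The design above could be sketched roughly as follows, assuming a perl that can load Encode. The helper name transcode_report and its arguments are hypothetical and do not come from perlbug itself.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: read a report as octets, decode it from the stated
# encoding (UTF-8 unless told otherwise), and write it back out as UTF-8.
sub transcode_report {
    my ($in_path, $out_path, $encoding) = @_;
    $encoding = 'UTF-8' unless defined $encoding;   # the proposed default

    open my $in, '<:raw', $in_path or die "Can't read $in_path: $!";
    my $octets = do { local $/; <$in> };            # slurp the whole file
    close $in;

    my $text = decode($encoding, $octets);

    # Write the body with the :utf8 layer, matching the charset that would
    # be declared in the MIME headers.
    open my $out, '>:utf8', $out_path or die "Can't write $out_path: $!";
    print {$out} $text;
    close $out or die "Can't close $out_path: $!";
    return $text;
}
```

A call such as transcode_report('report.txt', 'body.txt', 'cp1252') would cover the user-specified case, and omitting the third argument gives the UTF-8 default. The file names here are placeholders.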

> But, err, how is *any* transcoding supposed to be done under miniperl if
> it lacks Encode?

It isn't.  By degrading gracefully I just meant perlbug shouldn't blow
up but should provide basic bug reporting capability even if doing so
means defaulting to ASCII or UTF-8.  If the build fails halfway on a
server with no end-user tools configured, perlbug should still
function well enough with miniperl to get some sort of message
through.
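
One way to get this kind of graceful degradation is to probe for Encode at run time instead of loading it unconditionally. The sketch below assumes a hypothetical helper named decode_or_passthrough; it is not taken from perlbug.

```perl
use strict;
use warnings;

# True only when Encode can actually be loaded (it cannot under miniperl).
my $have_encode = eval { require Encode; 1 } ? 1 : 0;

# Hypothetical helper: decode when possible, otherwise hand the octets back
# untouched so a basic report can still be sent.
sub decode_or_passthrough {
    my ($octets, $encoding) = @_;
    return $octets unless $have_encode;
    return Encode::decode($encoding || 'UTF-8', $octets);
}

print $have_encode ? "Encode available\n" : "falling back to raw octets\n";
```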
