Re: perlbug and encoding (Re: Perl 5.20.0 Blockers, 2014-03-24)
From: Aristotle Pagaltzis
March 31, 2014 03:33
Message ID: 20140331033305.GA46096@plasmasturm.org
* Craig A. Berry <email@example.com> [2014-03-29 23:45]:
> On Wed, Mar 26, 2014 at 4:59 AM, Aristotle Pagaltzis <firstname.lastname@example.org> wrote:
> > Using a raw handle is fine.
> > Printing decoded strings to it is not.
> > You are missing the encoding step.
> Thanks for the explanation. I'm probably beyond hope, but I appreciate
> the effort :-).
> Of course no such thing as a "decoded string" actually exists. Every
> decoding is another encoding. Simply by virtue of being represented in
> computer memory, strings have to be encoded in some way. Even if each
> character is represented by a 21-bit integer containing the value of
> its Unicode code point, that is a form of encoding. Various docs say
> Perl's internal encoding is UTF-8, but also say not to depend on that.
> I believed the former but not the latter. My bad.
Well, to pick this nit a bit further: that’s true. But decoding doesn’t
merely transcode the string to UTF-8 – it *also* flips a flag bit.
That flag bit is responsible for drastically altering the way operators
perceive the contents of the PV buffer: it makes them treat multibyte
sequences as a single entity. Perl-land stops seeing the byte sequence
stored in the PV buffer, and instead sees a sequence of codepoints,
regardless of how many PV buffer bytes any one codepoint corresponds to.
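
A minimal sketch of that effect, assuming only the core Encode module and the built-in utf8::is_utf8 function (the variable names are made up for illustration). The same two-byte sequence is seen as two characters before decoding and as one codepoint afterwards:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two bytes: the UTF-8 encoding of U+00E9 (e with acute accent).
my $bytes = "\xC3\xA9";

# Decoding yields a character string; its flag bit is on.
my $chars = decode('UTF-8', $bytes);

# Perl-land sees two separate characters in $bytes, one in $chars.
printf "bytes: length=%d flag=%s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length=%d flag=%s ord=U+%04X\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off', ord($chars);
```

utf8::is_utf8 is used here purely to peek at the flag; ordinary code should not need to look at it.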
Note that decoding could conceivably also transcode to Latin-1 *without*
flipping the bit in the case of an input string limited to the Latin-1
charset. Then operations on the decoded string would yield the same
results as they do when it is transcoded to UTF-8 *with* the flag bit
set: it would be a sequence of codepoints, even though in this case,
each codepoint always corresponds to one PV buffer byte.
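
Perl's decode does not actually work this way, but the equivalence can be sketched with the built-in utf8::upgrade, which changes only the internal storage of a string. This is an illustration under that assumption, limited to length, ord and eq:

```perl
use strict;
use warnings;

# "café" stored one byte per character, flag bit off.
my $latin1 = "caf\xE9";

# The same characters, stored internally as UTF-8 with the flag bit on.
my $upgraded = $latin1;
utf8::upgrade($upgraded);

# The storage differs, but both are the same sequence of codepoints.
printf "flag: %s vs %s\n",
    utf8::is_utf8($latin1)   ? 'on' : 'off',
    utf8::is_utf8($upgraded) ? 'on' : 'off';
printf "length: %d vs %d\n", length($latin1), length($upgraded);
printf "equal: %s\n", $latin1 eq $upgraded ? 'yes' : 'no';
```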
Now, any single codepoint may or may not be representable in a single
byte. So if you need bytes, you have to ask Perl for a representation in
terms of bytes, which is the encode step. What may happen here if the
decoded string was stored as UTF-8 internally is that the encoded string
will have the same bytes in its PV buffer.
But, crucially, *its flag bit will be off*.
This makes the encode step *not* a no-op – string operations will treat
multibyte sequences in its PV buffer as multiple separate characters.
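
As a sketch of the encode step, again assuming only the core Encode module: the encoded string has the flag bit off and is seen as two characters, and it is that string which belongs on a raw handle. An in-memory handle stands in for a real file here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One codepoint, U+00E9, as a decoded character string.
my $chars = decode('UTF-8', "\xC3\xA9");

# Ask Perl for a representation in terms of bytes.
my $octets = encode('UTF-8', $chars);

printf "chars:  length=%d\n", length($chars);
printf "octets: length=%d flag=%s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';

# Print the encoded string, not the decoded one, to a raw handle.
open my $fh, '>:raw', \my $out or die "cannot open in-memory handle: $!";
print {$fh} $octets;
close $fh;
printf "written: %d bytes\n", length($out);
```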
So there’s your difference:
Encoding/decoding do not only transcode – they also preselect the code
paths that will be taken when operating on the string, picking between
two semantics. The act of decoding/encoding is your request to switch
between them.
Bottom line, it’s a question of layers and their interfaces.
Hope this helps.
It’s lamentably easy to get confused because there are two layers (Perl
and perl, if you will) that both use the exact same concepts: both deal
with sequences of (small) integers and both use the same de-/encoding
algorithms. “Of course”, on some level – because as you said: a decoded
string does not *actually* exist. But semantically, they are distinct
layers, and the use of the same representation on one layer has nothing
to do with its use at the other layer.
> Moving on to what to do with perlbug for 5.20. The main reason to
> specify layers on all the handles in perlbug was to ensure that
> patches attached with the new -p option come through the wash ok even
> if they have multiple encodings in them. Using the :raw layer on both
> input and output seems to accomplish that and I think this part is a
> keeper. It's probably a misnomer to call it "unicode awareness"; it
> might be more proper to say we're making perlbug encoding-agnostic.
> Somewhat as an afterthought, it seemed like it might be nice if we
> could handle more than ASCII in the message body as well. We could
> spell people's names correctly, and pasted-in code samples and output
> from code samples might actually look as intended. Somehow I got it
> into my head that in the case of a prepared report supplied with the
> -f option (or by having the filename typed in response to a prompt) we
> could not be encoding-agnostic and would have to know the input
> encoding and convert it into a specified output encoding. I now think
> this whole idea was a mistake (even aside from my implementation
> mistakes) and we should scrap it, at least for now.
The idea does stem from the right impulse.
> Guessing the input encoding is the tricky part. I was attempting to
> use encoding::_get_locale_encoding(). Aside from being a private
> method of a deprecated pragma, it depends on the locale being set up
> properly and whatever program that created the file having observed
> the locale setting. As I understand it, pretty much no program on
> Windows will do that. On any platform, there is no reason to assume
> the report file was created on the same system as the one running
> perlbug. And if the file was created in a text editor, any number of
> editor defaults and/or user preferences could cause it to be in some
> encoding other than what the locale specifies.
> So I think we should stop pretending that we can reasonably guess the
> encoding and instead focus on passing things through without mangling
I’m afraid you cannot sidestep the problem in this way: the text/plain
MIME type defaults to the US-ASCII charset unless otherwise specified,
so if perlbug is to declare its main message as readable, it *will* be
doing something about the encoding of the message, even if by omission.
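
To make that concrete, here is a rough sketch of declaring the charset explicitly. mime_text_part is a hypothetical helper, not anything in perlbug, and the header set is reduced to the minimum; it only shows that the declared charset and the encode step have to agree.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a decoded message body and label it
# with the same charset, so the two cannot drift apart.
sub mime_text_part {
    my ($text, $charset) = @_;
    my $body = encode($charset, $text);
    return join "\n",
        "MIME-Version: 1.0",
        "Content-Type: text/plain; charset=$charset",
        "Content-Transfer-Encoding: 8bit",
        "",
        $body;
}

# A name that does not fit in US-ASCII.
my $part = mime_text_part("Reported by Fran\x{E7}ois\n", 'UTF-8');
```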
Now encoding::_get_locale_encoding is not a good idea, I agree. But is
there a reason I’m missing that Encode::Locale wouldn’t be either? That
seems like the answer, no?
Note that no matter what, it’s imperative to allow the user to override
perlbug’s choice of input encoding, in case its guess is wrong. So it
will need a switch for this purpose.
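
One possible shape for this, as a sketch only: the --encoding switch and the pick_input_encoding helper are hypothetical names, and since Encode::Locale is a CPAN module rather than a core one, the sketch falls back to UTF-8 when it cannot be loaded.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: decide which encoding to assume for the report.
# An explicit --encoding from the user always wins over the guess.
sub pick_input_encoding {
    my (@args) = @_;
    my $override;
    GetOptionsFromArray(\@args, 'encoding=s' => \$override)
        or die "bad options\n";
    return $override if defined $override;

    # Guess from the locale if Encode::Locale happens to be installed.
    if (eval { require Encode::Locale; 1 }) {
        no warnings 'once';
        my $guess = $Encode::Locale::ENCODING_LOCALE;
        return $guess if defined $guess;
    }

    # Assumed last resort for this sketch.
    return 'UTF-8';
}

print pick_input_encoding('--encoding', 'ISO-8859-1'), "\n";
print pick_input_encoding(), "\n";
```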
> P.S. If someone wants to write a robust general-purpose encoding
> detector and include it in perlbug, please go ahead, but be sure to
> make it degrade nicely under miniperl when the Encode module is not
Is I18N::Langinfo available then at least?
Well, nowadays it’s not a terrible idea to just expect UTF-8, and leave
it to the user to say otherwise if that’s wrong. This might possibly
even be done always, i.e. skipping Encode::Locale entirely even where
But, err, how is *any* transcoding supposed to be done under miniperl if
it lacks Encode?
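
For the "just expect UTF-8" case specifically, a sketch that avoids Encode altogether might use the built-in utf8::decode, which decodes in place and returns false on malformed input. try_utf8 is a hypothetical name, utf8::decode applies Perl's lax notion of UTF-8 rather than the strict one, and whether this is good enough under miniperl is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: return the decoded string if the input bytes
# are well-formed UTF-8, or undef so the caller can ask the user.
sub try_utf8 {
    my ($octets) = @_;
    my $copy = $octets;    # utf8::decode works in place
    return utf8::decode($copy) ? $copy : undef;
}

my $good = try_utf8("caf\xC3\xA9");    # well-formed UTF-8
my $bad  = try_utf8("caf\xE9");        # Latin-1 bytes, not UTF-8

printf "good: %s\n", defined $good ? length($good) . " characters" : "rejected";
printf "bad:  %s\n", defined $bad  ? length($bad)  . " characters" : "rejected";
```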
Aristotle Pagaltzis // <http://plasmasturm.org/>