Re: source encoding
October 25, 2017 15:47
Message ID: 20171025154650.GW6716@fysh.org
> Perl has nagged them into making a change,
> then later nagged them into undoing that change. I could see that
> putting some users off Perl; to the uninitiated it could look like a
> backwards-incompatible change followed by P5P backtracking on that
> with a second backwards-incompatible change -- all over something as
> insignificant as a comment.
Good point. It would be reasonable to apply a different rule for
comments, since their meaning won't change: the intermediate state
(where non-ASCII without "use utf8" is generally fatal) can permit legal
UTF-8 in comments, as if comments are always in the scope of "use utf8".
We'd only deprecate non-ASCII in comments where it isn't legal UTF-8.
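As a rough illustration of the comment rule just described, here is a minimal sketch in Perl. The helper names is_legal_utf8 and classify_comment are hypothetical, and the sketch assumes the comment text has already been extracted as a byte string; it says nothing about how perl's tokenizer would actually implement the check.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $bytes is well-formed (strict) UTF-8.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps it
# from modifying $bytes.
sub is_legal_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Hypothetical classifier for the bytes of one comment under the
# intermediate-state rule: ASCII is fine, legal UTF-8 is permitted,
# any other non-ASCII content would draw the deprecation warning.
sub classify_comment {
    my ($bytes) = @_;
    return 'ascii'     unless $bytes =~ /[^\x00-\x7f]/;
    return 'permitted' if is_legal_utf8($bytes);
    return 'deprecated';
}

print classify_comment("# caf\xc3\xa9"), "\n";   # UTF-8 encoded e-acute
print classify_comment("# caf\xe9"), "\n";       # Latin-1 encoded e-acute
```

The two test comments are written with \x escapes so that the example itself stays pure ASCII.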
>Another not-quite-so-benign case is users who are currently misusing
>UTF-8 in a two-wrongs-make-a-right way.
>When changing the source encoding to UTF-8, would it make sense to
>change Perl's default I/O encodings to match?
No. This is about the representation of programs, not about their
behaviour. Your user in this case needs to either actually fix the
program or at least perform a correct conversion of the buggy program
to ASCII or UTF-8 source. The deprecation message could clue them in
to the two-wrongs situation by advising them on \x sequences to change
the Latin-1 to, which would make the mojibake explicit. Or it could
specifically call out mojibake where the Latin-1 looks like it.
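To make the \x advice concrete, here is a minimal sketch of what such a conversion and a mojibake heuristic could look like. Both function names are hypothetical, and the heuristic (a UTF-8 lead byte followed by continuation bytes) is an assumption for illustration, not a proposal for the actual diagnostic.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite every non-ASCII byte as a \x escape.
# The result is only meaningful inside a double-quoted string literal.
sub latin1_to_x_escapes {
    my ($bytes) = @_;
    (my $ascii = $bytes) =~ s/([\x80-\xff])/sprintf('\\x%02x', ord $1)/ge;
    return $ascii;
}

# Hypothetical heuristic: Latin-1 text that contains what looks like a
# UTF-8 multi-byte sequence is probably UTF-8 being misread as Latin-1.
sub looks_like_mojibake {
    my ($bytes) = @_;
    return $bytes =~ /[\xc2-\xf4][\x80-\xbf]+/ ? 1 : 0;
}

my $literal = "caf\xc3\xa9";                 # UTF-8 bytes, read as Latin-1
print latin1_to_x_escapes($literal), "\n";   # shows the two bytes explicitly
print "possible mojibake\n" if looks_like_mojibake($literal);
```

Seeing two escapes where one character was intended is what makes the two-wrongs situation visible to the user.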
Changing default I/O encoding would be a separate change, and very
disruptive. Perl has always been able to handle binary I/O without
trouble, and to make the one-character string "\xff" print out as two
bytes rather than one would break huge swathes of existing correct code.
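The one-byte versus two-byte difference can be seen with in-memory file handles. This is only a sketch: bytes_written is a hypothetical helper, and the explicit ":raw" layer stands in here for a traditional binary handle with no declared encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string through a handle opened with the
# given layer and return the bytes that actually reached the "file".
sub bytes_written {
    my ($string, $layer) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open: $!";
    print {$fh} $string;
    close $fh or die "close: $!";
    return $buffer;
}

my $raw  = bytes_written("\xff", ':raw');
my $utf8 = bytes_written("\xff", ':encoding(UTF-8)');
printf "raw: %d byte(s), UTF-8 layer: %d byte(s)\n",
    length $raw, length $utf8;
# prints "raw: 1 byte(s), UTF-8 layer: 2 byte(s)"
```

Code that writes binary data relies on the first behaviour, which is why a silent switch to the second would break it.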
If you want Latin-1 to no longer be the default encoding, the way to
go about it is not to change to a different default, but to abolish
the default and demand that encoding be explicitly stated for all file
handles, which doesn't seem palatable. Or, slightly more palatably,
you could `merely' make it fatal to send non-ASCII characters through
a file handle whose encoding hasn't been declared, perhaps as an
intermediate state toward a new default. But this poses big problems
for Perl version portability of programs that handle non-ASCII data,
much bigger than the problems that arise in changing source encoding.
>Neither of these are helpful to a user who was happy with the way their
>non-Ascii characters were working until they were told to declare that
>they are using UTF-8,
Ah, here's your error. The deprecation message should absolutely not
advise people to declare that they are using UTF-8, because *they're not*.
It needs to advise that the program be converted to a supported encoding,
either to ASCII using \x escapes or to UTF-8. Only after converting the
source to UTF-8 is adding the "use utf8" declaration correct. This is
something we'd have to be clear about in documentation: conversion to
UTF-8 is not just a matter of adding the pragma.
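A minimal sketch of the conversion step, assuming the existing source really is Latin-1 and has been read in as raw bytes. The function name latin1_source_to_utf8 is hypothetical; the Encode calls are the standard ones.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode Latin-1 source bytes as UTF-8 bytes.
# Only after this step would adding "use utf8;" to the file be correct.
sub latin1_source_to_utf8 {
    my ($latin1_bytes) = @_;
    my $chars = decode('ISO-8859-1', $latin1_bytes);   # bytes -> characters
    return encode('UTF-8', $chars);                    # characters -> bytes
}

my $old_source = "my \$s = 'caf\xe9';\n";              # one byte for e-acute
my $new_source = latin1_source_to_utf8($old_source);   # two bytes for e-acute
printf "%d bytes before, %d bytes after\n",
    length $old_source, length $new_source;
```

Adding the pragma without this re-encoding leaves the lone \xe9 byte in place, which is not legal UTF-8.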
>(Aside: Note that the ???????? in the UTF-encoded filename appears correctly
>in the warning above, presumably because Perl doesn't actually know that
>it's a ????????. If Perl is presuming a source file is UTF-8, should it
>interpret that source's filename differently?)
No, the handling of non-ASCII filenames is another aspect of program
behaviour, not an aspect of source encoding. It's a thing that we
definitely need to make some changes about, but it's separate from this,
and a big topic in its own right. We also need to be careful to avoid
churn in filename semantics: when we eventually change them we need to
be pretty sure that we're changing them to the right permanent semantics.
And, aside, what I've quoted above from you with all the question marks is
how your message appeared for me. My locale is ASCII; my terminal does
Latin-1 but not general Unicode; and my MUA quite sensibly (but not all
that helpfully) substitutes for the annoying out-of-locale characters.
There is in general a problem with using non-ASCII characters when
discussing encoding issues. I'm not so concerned about the message not
rendering immediately for me; that's only an inconvenience. The real
problem is that when you're talking about a byte sequence (such as the
content of a source file or a program's output) this adds a layer of
encoding ambiguity to the discussion, which just obscures your actual point.
Using literal non-ASCII characters is basically OK if you're actually
talking about the *character* sequence, such as in "the source looks
like this in the user's editor" or "the output looks like this on the
user's terminal". But too often people conflate the byte sequence with
the character sequence, assuming some encoding that they don't specify.
(No, the encoding of your email message doesn't tell us what encoding
you're imagining of the user.) You'd think people would notice this
problem when the subject they're discussing is precisely the issue
of misconfigured encodings breaking a user's expectations for the
correspondence between bytes and characters, but apparently not.
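One way to keep a byte sequence unambiguous in discussion is to quote it in hex, in pure ASCII. A minimal sketch, with hex_bytes as a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex,
# so it survives any mail encoding or terminal unchanged.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $char = "\x{e9}";   # the character U+00E9, written in ASCII source
print "Latin-1: ", hex_bytes(encode('ISO-8859-1', $char)), "\n";   # e9
print "UTF-8:   ", hex_bytes(encode('UTF-8', $char)), "\n";        # c3 a9
```

The same character yields two different byte sequences, and the hex form says which one is meant without relying on the reader's locale.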
>Separately, if Perl starts treating source code as UTF-8, Pod should do
>likewise, making =encoding utf-8 a no-op.
Ah, I hadn't thought of that. Technically, changing the rules of POD
is a distinct issue from program source encoding. With "=encoding"
being more workable in practice than any encoding declaration we have
in Perl code, we might not want to go to exactly the UTF-8-only state
that we're taking Perl code to. Do we want to deprecate the use of
other encodings? If we want to continue to allow multiple encodings,
we might not even want to change the default to UTF-8. We *would*
presumably want to remove the Latin-1 default, to avoid Perl code and
POD having clashing defaults. In any case, to change the default we'd
have to follow a deprecation process parallel to the Perl code one:
deprecate non-ASCII bytes in the absence of "=encoding", and have that
fatal for a while (effectively a default encoding of ASCII), before it
would be acceptable to change the default to UTF-8.
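The first step of that POD deprecation could be checked along the following lines. This is only a sketch under stated assumptions: pod_needs_encoding_declaration is a hypothetical name, the input is taken to be the raw bytes of the POD, and real POD tools parse "=encoding" more carefully than a single regex does.

```perl
use strict;
use warnings;

# Hypothetical check: true if the POD contains non-ASCII bytes but no
# "=encoding" command, the case that would be deprecated first.
sub pod_needs_encoding_declaration {
    my ($pod_bytes) = @_;
    return 0 if $pod_bytes =~ /^=encoding\s+\S/m;
    return $pod_bytes =~ /[^\x00-\x7f]/ ? 1 : 0;
}

my $undeclared = "=head1 NAME\n\ncaf\xe9\n";
my $declared   = "=encoding ISO-8859-1\n\n" . $undeclared;
print "undeclared: ", pod_needs_encoding_declaration($undeclared), "\n";
print "declared:   ", pod_needs_encoding_declaration($declared), "\n";
```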
It *is* feasible to change the Perl source encoding rules without changing
the POD rules. We'd want to advise that documentation converted from
Latin-1 to UTF-8 acquire an "=encoding UTF-8". POD is another context
in which non-ASCII bytes can appear in a Perl program, similar to
comments though not quite the same. We would presumably not deprecate
legal UTF-8 in POD, regardless of "use utf8", possibly depending on
"=encoding" though that's not really necessary. perl doesn't need to
concern itself with enforcing correct encoding of the POD source; that
can be left to the POD tools. Its only necessary concern is that the
POD source, as with all code file contents, should be legal UTF-8.