perl.perl5.porters
Re: source encoding
October 25, 2017 11:51
Message ID: firstname.lastname@example.org
> It's probably viable now to insist on all Unicode users using UTF-8,
That's reasonable. And I really like the idea of getting to a state
where UTF-8 is just accepted. Thanks for thinking this through and
> To get there, first we have to deprecate all source encoding that's
> incompatible with it. That is, we deprecate the presence of non-ASCII
> bytes anywhere in a source file other than in the scope of "use utf8".
> This includes not only non-ASCII bytes in string literals, but also in
> comments, and anywhere else they manage to get.
That sounds painful, and particularly unfortunate for somebody who
currently has an undeclared UTF-8 comment in their code. If I understand
your plan correctly, they would go through the following sequence:
1 The source has UTF-8 characters in comments, and no utf8 pragma.
  Perl would misinterpret these bytes, but in practice that doesn't
  cause any harm in a comment, and so they display fine in a
  UTF-8-presuming text editor.
2 Those comments start to cause warnings. To placate the warning it's
  necessary to add the utf8 pragma.
3 Perl starts interpreting files as UTF-8. The presence of the utf8
  pragma causes a warning. To placate the warning it's necessary to
  remove the utf8 pragma.
4 The source has UTF-8 characters in comments, and no utf8 pragma.
From the user's point of view, the behaviour seems identical to that
at stage 1: they have fancy characters in their comments, which
display correctly in their editor, and which don't have any effect
on how the program runs. Perl has nagged them into making a change,
then later nagged them into undoing that change. I could see that
putting some users off Perl; to the uninitiated it could look like a
backwards-incompatible change followed by P5P backtracking on that
with a second backwards-incompatible change — all over something as
insignificant as a comment.
Is there a way of getting to UTF-8 everywhere without imposing this cost
on ‘benign’ UTF-8 characters in comments?
Another not-quite-so-benign case is users who are currently misusing
UTF-8 in a two-wrongs-make-a-right way. Consider this program (with 3
non-ASCII UTF-8 characters in the literal strings):
say "Zoë’s room";
Currently on a UTF-8 terminal that will _appear_ to work as the
programmer presumably intended; adding use utf8 makes it ‘worse’.
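To make the two-wrongs-make-a-right effect concrete at the byte level, here is a minimal sketch. It builds the string from escapes rather than relying on the file's own encoding, and it simulates the default output behaviour with Encode; `hex_bytes` is a hypothetical helper of mine, not anything from core.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# The UTF-8 bytes of "Zoë" as they sit in the source file (ë is C3 AB).
my $source_bytes = "Zo\xC3\xAB";

# No utf8 pragma: each source byte becomes one character (4 in total).
my $without_pragma = $source_bytes;

# Under "use utf8": the bytes are decoded into 3 characters.
my $with_pragma = decode('UTF-8', $source_bytes);

# A handle with no encoding layer writes characters below 0x100 as
# single bytes; encode('ISO-8859-1', ...) simulates that here.
my $out_without = encode('ISO-8859-1', $without_pragma);
my $out_with    = encode('ISO-8859-1', $with_pragma);

printf "no pragma: %d chars, bytes %s\n",
    length($without_pragma), hex_bytes($out_without);
printf "use utf8:  %d chars, bytes %s\n",
    length($with_pragma), hex_bytes($out_with);
```

Without the pragma the original bytes C3 AB pass straight through, which a UTF-8 terminal happens to render as ë. With the pragma the lone byte EB is written, which is not valid UTF-8 by itself.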
When changing the source encoding to UTF-8, would it make sense to
change Perl's default I/O encodings to match?
In the same way that Latin-1 is no longer a good default for source
encoding, it's no longer a good default for output — especially not for
a user who has specified a UTF-8 locale in their environment. Having
sources be UTF-8 but stdout be Latin-1 by default is an odd combination
that's unintuitive, awkward to explain to newcomers, and very unlikely
to be what most people would want.
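For comparison, this is roughly what a user has to ask for explicitly today. The sketch below writes through an :encoding(UTF-8) layer on an in-memory handle so the bytes can be inspected. I'm assuming a changed default would behave like this explicit layer, which is only my guess at what such a change would look like; `hex_bytes` is again a hypothetical helper.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# The decoded string: \x{EB} is ë, \x{2019} is the curly apostrophe.
my $text = "Zo\x{EB}\x{2019}s room";

# Collect any warnings instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# In-memory handle with an explicit UTF-8 encoding layer.
my $captured = '';
open my $fh, '>:encoding(UTF-8)', \$captured
    or die "cannot open in-memory handle: $!";
say {$fh} $text;
close $fh or die "cannot close in-memory handle: $!";

printf "bytes written: %s\n", hex_bytes($captured);
printf "warnings: %d\n", scalar @warnings;
```

For real output the equivalent opt-ins are `binmode(STDOUT, ':encoding(UTF-8)')` or `use open qw(:std :encoding(UTF-8))`. With the layer in place every character is written as UTF-8 and no warning is raised.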
Otherwise, somebody seeing the new warning could change their code to:
say "Zoë’s room!";
and discover that in a UTF-8 terminal that yields:
Wide character in say at ./Zoë line 4.
As I'm sure most readers of this list are aware, Perl encodes the first
string as Latin-1 on output, sending a byte that isn't UTF-8 to the
terminal. The second string contains a character which doesn't exist in
Latin-1, so Perl emits it as UTF-8, but with a warning.
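The behaviour just described can be reproduced without a terminal. This is a minimal sketch that writes two decoded strings to an in-memory handle with no encoding layer and captures the warning. The strings ("Zoë" and "Zoë’s room") are my assumption about what the two say statements contained, and `hex_bytes` is a hypothetical helper.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Strings as Perl sees them under "use utf8", built from escapes:
# \x{EB} is ë (exists in Latin-1), \x{2019} is ’ (does not).
my $first  = "Zo\x{EB}";
my $second = "Zo\x{EB}\x{2019}s room";

# Collect warnings instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# In-memory handle with no encoding layer, like a default STDOUT.
my $captured = '';
open my $fh, '>', \$captured or die "cannot open in-memory handle: $!";
say {$fh} $first;
say {$fh} $second;
close $fh or die "cannot close in-memory handle: $!";

printf "bytes written: %s\n", hex_bytes($captured);
print  "warning: $_" for @warnings;
```

The first line comes out as 5A 6F EB, with ë as a single Latin-1 byte. The second comes out as UTF-8 (C3 AB for ë, E2 80 99 for the apostrophe) along with one "Wide character in say" warning. So the same character is written two different ways within one program's output.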
Neither of these is helpful to a user who was happy with the way their
non-ASCII characters were working until they were told to declare that
they are using UTF-8, at which point it apparently made things worse,
causing them to become more frustrated with Perl, or with P5P for
breaking things that (they believed) worked.
(Aside: Note that the ‘ë’ in the UTF-8-encoded filename appears correctly
in the warning above, presumably because Perl doesn't actually know that
it's a ‘ë’. If Perl is presuming a source file is UTF-8, should it
interpret that source's filename differently?)
Separately, if Perl starts treating source code as UTF-8, Pod should do
likewise, making =encoding utf-8 a no-op.