develooper Front page | perl.perl5.porters | Postings from October 2017

source encoding

October 24, 2017 23:46
Since we got that query a couple of days ago about Unicode operators, I
thought a bit about source encoding, and I see a way to make a meaningful
improvement.  Herewith follows a sketch.

The essential problem to solve is how to declare the encoding of a
source file.  Historically there was no need to declare an encoding,
and so we have inherited no mechanism for it from the pre-Unicode days.
So where to insert a declaration?  In theory we want to know the encoding
when opening a file, but the invoker of a code file not only has no
existing way to declare encoding, but also shouldn't be burdened with
knowing about the encoding, that being an internal detail of the code
being invoked.  But inside the text of the code file is also the wrong
place: in principle that's read too late, and the mechanisms we've tried
with pragmata yield the wrong scope.  Switching encoding on lexical
boundaries could in theory be made to work, but that's not how files
are written, and it sits uneasily with having a buffer of supposedly
decoded text waiting to be parsed.

So the only viable approach that's not incredibly difficult is to
remove the need for an encoding declaration, by making the encoding
the same for all code files.  That way there's no question of how to
handle an alleged Perl code file.  The fixed encoding would, of course,
be UTF-8.  It's probably viable now to insist on all Unicode users (that
is, programmers who want to use non-ASCII characters in their source)
using UTF-8, in a way that it wasn't a decade ago.  Inevitably some will
decry the change, claiming that it shows that we no longer care about
backcompat, et cetera.  To go this route we'd have to decide that the
goal of coherent Unicode source was worth these complaints.

The place we want to end up in, then, is that all code files are
interpreted as UTF-8.  If a file contains a byte sequence that is not
well-formed UTF-8, that causes an exception at compile time.  Furthermore,
if a file contains a codepoint that is not valid for interchange, that's
probably also a compile-time error.  The SV constituting the parse buffer
is always in decoded form, probably normally containing bytes from the
file and having the SvUTF8 flag on.  Maybe always in that form, with an
API rule against downgrading it.  Where non-ASCII characters are seen in
identifier context, they are consistently treated as potentially part of
the identifier, as happens currently under "use utf8".  String literals
are in downgraded form if their source consists only of ASCII characters,
but are probably upgraded if the source has any non-ASCII characters even
if they could be downgraded, all as happens currently under "use utf8".
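A minimal sketch of the current "use utf8" behaviour described above; the identifier and string are illustrative, but the semantics (non-ASCII identifier characters accepted, literals decoded to characters) are as the pragma documents them:

```perl
use utf8;       # declare that this source file's bytes are UTF-8
use strict;
use warnings;

# Under "use utf8", non-ASCII characters in identifier context are
# treated as part of the identifier:
my $prénom = "café";

# The literal was decoded, so length() counts characters, not bytes:
print length($prénom), "\n";    # prints 4 (five UTF-8 bytes, four characters)
```

Without the pragma, the same literal would be taken as raw octets and length() would report 5.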

To get there, first we have to deprecate all source encoding that's
incompatible with it.  That is, we deprecate the presence of non-ASCII
bytes anywhere in a source file other than in the scope of "use utf8".
This includes not only non-ASCII bytes in string literals, but also
in comments, and anywhere else they manage to get.  We also deprecate
the presence of anything other than well-formed UTF-8 (and possibly
other than codepoints valid for interchange), even under "use utf8".
Non-ASCII bytes are only permitted where they're well-formed UTF-8 in
the scope of "use utf8".  Even at the end of the deprecation cycle we
cannot immediately change the meaning of non-ASCII bytes: despite the
warning the cycle provides, silently reinterpreting them would produce
insidious bugs.  Instead, we must make these deprecated things fatal,
and leave them fatal for one or two major versions.
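A hypothetical forward-compatibility check, not an existing core feature: it verifies that a blob of source octets is well-formed UTF-8, as the proposal would eventually require of every code file.  Encode is a core module and its decode() with FB_CROAK throws on malformed input:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Return true if the octets decode cleanly as UTF-8, false otherwise.
sub source_is_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, FB_CROAK); 1 } ? 1 : 0;
}

# "café" encoded as UTF-8 passes; a stray Latin-1 0xE9 byte does not:
print source_is_utf8("caf\xc3\xa9") ? "ok\n" : "not ok\n";   # ok
print source_is_utf8("caf\xe9")     ? "ok\n" : "not ok\n";   # not ok
```

A tool of this shape could let authors audit their files during the deprecation period, before the malformed-UTF-8 case becomes fatal.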

After all that is done, we can finally actually change the interpretation
of non-ASCII bytes.  We change to the whole code file being interpreted
as UTF-8, so now we accept well-formed UTF-8 regardless of the "use
utf8" pragma.  Characters in identifiers and string literals follow the
rules described above, which were formerly the "use utf8" rules.

At this point, "use utf8" should be effectively a no-op.  I'm not sure
whether it actually would be; I haven't checked for remnants of the old
semantic effects.  If there are any, we should in parallel work to remove
them.  When "use utf8" no longer has any real effect, we should make it
an actual no-op, such that the pragma doesn't even set the $^H bit any
more, and we should deprecate it, to avoid confusing people with a no-op
that looks like it does something.  The deprecation notice should advise
programmers seeking portability across Perl versions to make their "use
utf8" pragmata conditional, something like "use if $] < 5.036, 'utf8'".

Is there an appetite for this kind of change?
