perl.perl5.porters | Postings from October 2017

Re: Unicode operators

October 25, 2017 01:42
Tony Cook wrote:
>C<use utf8;> does just that,

"use utf8" is where some of the problem lies.  The behaviour of UTF-8
under "use utf8" is OK, but it's *different* from the behaviour of the
same characters encoded in Latin-1 without "use utf8".  Consider a program
attempting to use an identifier that includes an "e" with an acute accent:

$ perl5.26.1 -lwe $'use utf8; $a\xc3\xa9 = 3; print $a\xc3\xa9'
$ perl5.26.1 -lwe $'$a\xe9 = 3; print $a\xe9'
Unrecognized character \xE9; marked by <-- HERE after $a<-- HERE near column 3 at -e line 1.

Same character sequence (after the pragma), different legality.  What we
see here is, at the character level, *two different languages*, one
permitting non-ASCII letters to appear in identifiers, the other not.
Then consider the same character appearing in a string literal:

$ perl5.26.1 -MDevel::Peek -lwe $'use utf8; Dump "\xc3\xa9"' 2>&1 | grep P
SV = PV(0xaceda0) at 0xaeb0a0
  PV = 0xaf4070 "\303\251"\0 [UTF8 "\x{e9}"]
$ perl5.26.1 -MDevel::Peek -lwe $'Dump "\xe9"' 2>&1 | grep P
SV = PV(0x2560d20) at 0x257cff8
  PV = 0x2586000 "\351"\0

The difference is less marked this time: the string literals are both
legal and the strings compare eq, but one is upgraded (stored internally
as UTF-8) and the other is downgraded (stored as one byte per character).
This difference does still matter for some language
purposes, even without any explicit checking of the string encoding,
for example if the string is used as a filename.  Same character sequence
in the source code, different program behaviour.
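
To make the upgraded/downgraded distinction concrete without depending
on a filesystem, here is a minimal sketch.  The variable names are mine,
and it uses utf8::upgrade() to stand in for what "use utf8" does to the
literal; bytes::length() is used only to peek at the size of the internal
buffer, which is what leaks out when such a string is handed to the OS.

```perl
use strict;
use warnings;
use bytes ();    # load without enabling the pragma; we only want bytes::length()

# Two copies of the one-character string U+00E9.
my $downgraded = "\xe9";       # no "use utf8": stored as the single byte 0xE9
my $upgraded   = "\xe9";
utf8::upgrade($upgraded);      # same character, now stored as 0xC3 0xA9

printf "eq: %s\n", ($downgraded eq $upgraded ? "yes" : "no");      # eq: yes
printf "is_utf8: %d vs %d\n",
    (utf8::is_utf8($downgraded) ? 1 : 0),
    (utf8::is_utf8($upgraded)   ? 1 : 0);                          # 0 vs 1
printf "internal bytes: %d vs %d\n",
    bytes::length($downgraded), bytes::length($upgraded);          # 1 vs 2
```

The two strings are equal as character strings, yet anything that
consumes the internal bytes sees two different things.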

These differences need to be resolved if we're to make any serious
claim that the Perl language accommodates non-ASCII characters in code.
Once brought into alignment, there's a maintenance burden in keeping both
versions of the parsing code equivalent.  The burden is evidently great,
since historically we haven't managed to get them equivalent even once,
in all the time that we've had the "use utf8" pragma.

There are other problems with "use utf8" arising from the poor fit between
the issue of text encoding and the mechanism of lexically-scoped pragmata.
Text encoding isn't lexically scoped by nature: it's a whole-file deal.
Furthermore, switching encoding during parsing, on statement boundaries,
sits poorly with having a buffer of text to be parsed, which gets filled
on line boundaries.  At the time we read bytes from the file to add them
to the buffer, it's impossible to know what encoding they're supposedly
using, so it's impossible to have a buffer of characters.  This is
problematic, and conflicts with the way we handle strings everywhere else.
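
As a rough illustration of why the buffer can't hold characters until the
encoding is known, the sketch below decodes the same three bytes both
ways using the core Encode module.  It is not a model of the actual lexer
buffer; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as they might arrive from the source file: "a", then 0xC3 0xA9.
my $raw = "a\xc3\xa9";

# The same bytes under the two candidate encodings.
my $as_utf8   = decode('UTF-8',      $raw);   # "a", U+00E9
my $as_latin1 = decode('ISO-8859-1', $raw);   # "a", U+00C3, U+00A9

printf "bytes:      %d\n", length $raw;                   # 3
printf "as UTF-8:   %d characters\n", length $as_utf8;    # 2
printf "as Latin-1: %d characters\n", length $as_latin1;  # 3
```

Until something has said which of the two decodings applies, there is
no single answer to what characters the buffer contains.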

In a separate message I've proposed a Gordian-Knot solution to this.
If instead we try to fix up the current system, then it's feasible to
resolve the parsing differences, though one group or the other will
be surprised by how the interpretation of their filenames has changed.
But the issue with declaring the encoding too late is just horrible, and
passes on the burden of making two parser code paths equivalent to all
parser plugin code.  That's an intractable design fault, as things stand.

