RFC: Processing Unicode non-characters and code points beyond Unicode's
From: karl williamson
Date: November 21, 2010 22:01
Subject: RFC: Processing Unicode non-characters and code points beyond Unicode's
Message ID: 4CEA071D.5070008@khwilliamson.com
We've gone around on this before. Here's hopefully the final round.
First, the problem. Perl doesn't handle the Unicode non-character code
points very well at all. These are 66 code points that Unicode
guarantees will never be assigned to actual characters. An application
can therefore use them freely internally, knowing they will never
conflict with a real character. They were envisioned mainly as
sentinels, for example appending one as the final element of a list so
that a loop can terminate on finding it. These characters, however, are
illegal for interchange with an unsuspecting application, for reasons
that should be obvious given a little thought.
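For concreteness, the 66 non-characters are U+FDD0..U+FDEF plus the
last two code points of each of the 17 planes (U+FFFE and U+FFFF,
U+1FFFE and U+1FFFF, ..., U+10FFFE and U+10FFFF). Here is a rough
sketch of the sentinel idea in Perl (process() and the list contents
are made up purely for illustration):

    # sketch only: use a non-character as an end-of-list sentinel
    my $SENTINEL = chr(0xFFFF);       # guaranteed never to be a real character
    my @queue    = ('alpha', 'beta', $SENTINEL);
    for my $item (@queue) {
        last if $item eq $SENTINEL;   # loop terminates on finding the sentinel
        process($item);               # process() is just a placeholder
    }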
In our last go-round on this issue, David Golden concluded, and I
concur, that Perl need only check for these upon I/O, since that is the
only way to interchange with another application, and that it shouldn't
check anywhere else, to allow free use within an application.
I need to add, though, that an application may not be a single process,
but a cooperating group of them. Therefore there does need to be a way
to input and output these characters.
Perl's problems are several-fold. First, on input, it knows about only
one of the 66 characters, and it croaks on that one; it is impossible
to turn that croaking off. Second, the default is to croak on any
internal use of these characters. It is possible to turn that off, but
not without simultaneously turning off real errors. Third, and not too
big a deal, on output Perl splits these into two groups of 32 and 34
characters, a distinction that is not in the Standard and hence is
confusing to someone who knows Unicode. Again, croaking is the default,
and it is possible to turn it off, but not without also turning off
things that are always errors.
The way you turn off the croaking is to turn off utf8 warnings.
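To make that coupling concrete, today an application that wants to use
these code points ends up doing something like the following (a sketch,
with $fh standing in for any output handle):

    {
        no warnings 'utf8';       # silences the croak/warning, but also hides
                                  # genuinely malformed UTF-8 at the same time
        print $fh chr(0xFFFF);    # the non-character now passes without complaint
    }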
After a lot of thought interspersed with my other activities over the
last months, I have found a pretty easy fix (not much code change) that
is far simpler than I had envisioned. I believe we established in the
earlier go-round that the current model is so far from the correct
behavior that it is acceptable to break the API somewhat.
What I propose now is to take David's suggestion and test on output that
a utf8 string doesn't contain one of these characters. This would go in
the same area of code as the "Wide character in print" message currently
lives. I'm leaning toward not making that a fatal error.
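In Perl-level terms, the test would be roughly equivalent to the sketch
below; the real check would live in C next to the wide-character test,
and I'm using the Unicode Noncharacter_Code_Point property here only to
stand in for the list of 66 ($string and $fh are stand-ins as well):

    use warnings;
    # illustration only: warn, rather than die, on output of a non-character
    if ($string =~ /\p{Noncharacter_Code_Point}/) {
        warnings::warnif('utf8', "Unicode non-character in output");
    }
    print $fh $string;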
The API would change so that the existing flags for these things would
be renamed and have their sense inverted: when cleared, the code points
are accepted; when set, they are rejected. This has the effect that the
default is to accept them. The two output flags that make the false
distinction would be combined into one.
Encode.xs would then need a short patch to forbid these characters,
since that is how input of utf8 strings is done. The :utf8 layer, which
doesn't do error checking, wouldn't change. I believe these are the only
two ways to get utf8 input; please correct me if I'm wrong.
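For reference, the two input paths I have in mind look like this at the
Perl level ($octets and the file name are just for illustration):

    use Encode;
    # path 1: Encode, which does strict checking; the new rejection would go here
    my $chars = decode('UTF-8', $octets, Encode::FB_CROAK);

    # path 2: the :utf8 layer, which does no validity checking and wouldn't change
    open my $fh, '<:utf8', 'data.txt' or die "Can't open data.txt: $!";
    my $line = <$fh>;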
Existing code that relied on the flags would have to change. Because the
names change as their sense is inverted, such code would fail to
compile, so the maintainer would know it has to be dealt with.
Besides Encode, there is one other place shipped with the core that
would have to change: Unicode::Normalize would require a short patch
as well.
I propose to treat the beyond-Unicode code points the same way as the
non-character ones. That is, they are freely usable internally without
warnings, but I/O would require flags to prevent warning/croaking.
Essentially, these are the same thing to Unicode: illegal for
interchange, but Perl has decided to accept them internally.
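By "beyond Unicode" I mean code points above U+10FFFF, the last code
point Unicode defines. Perl can already hold these internally; a small
illustration ($out is a stand-in for an output handle):

    my $big = chr(0x110000);    # one past U+10FFFF, perfectly usable in memory
    printf "%vd\n", $big;       # prints 1114112
    print $out $big;            # output is where the proposed check would apply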
Zefram, however, wrote on this topic thus:
"No. It's not OK for a warning to be fatal. The situation should either
be a fatal error (regardless of warning flags) or a non-fatal warning
(controlled by warning flags). A warning would make a lot more sense,
because Perl is generally happy to process codepoints in ways that
Unicode does not permit."
I'm afraid I don't grok this. It sounds like he wants to replace the
entire current mechanism for dealing with malformed (or merely suspect)
utf8: the way it works now is that with warnings turned off, processing
continues, but turning them on causes the warnings to become fatal
errors instead. That's a bigger issue, and one that hasn't really been
vetted here before, as far as I know. We have this existing mechanism;
I was trying to fit as much as possible into it without disturbing it
much. Is that the right approach, or what?