perl.perl5.porters | Postings from November 2010

RFC: Processing Unicode non-characters and code points beyond Unicode's

From:
karl williamson
Date:
November 21, 2010 22:01
Subject:
RFC: Processing Unicode non-characters and code points beyond Unicode's
Message ID:
4CEA071D.5070008@khwilliamson.com
We've gone around on this before.  Here's hopefully the final round.

First, the problem.  Perl doesn't handle the Unicode non-character code 
points very well at all.  These are 66 code points in Unicode that are 
guaranteed to never be assigned to be actual characters.  Therefore, an 
application can use them freely internally, knowing they won't ever 
conflict with an actual character.  The uses for them were envisioned 
mainly as sentinel characters, for example adding one of them as the 
final character in a list, so that a loop could terminate on finding it.

These characters, however, are illegal for interchange with an 
unsuspecting application, for reasons that should be obvious given a 
little thought.

In our last go round on this issue, David Golden concluded, and I 
concur, that Perl need only check for these upon I/O, as that is the 
only way to interchange with another application.  And it shouldn't 
check except for I/O, to allow free use within an application.

I need to add, though, that an application may not be a single process, 
but a cooperating group.  Therefore there does need to be a way to input 
and output these characters.

Perl's problems are several-fold.  First, on input, it only knows about one 
of the 66 characters, and it croaks on that one.  It is impossible to 
turn that croaking off.  Second, the default is to croak on any use of 
these internally.  It is possible to turn that off, but not without 
simultaneously turning off real errors.  And, third, but not too big a 
deal, on output, it splits these into two groups of 32 and 34 
characters, a distinction that is not in the Standard, and hence is 
confusing to someone who knows Unicode.  Again, croaking is the default, 
and it is possible to turn it off, but not without turning off things 
that are always errors, as well.

The way you turn off the croaking is to turn off utf8 warnings.

After a lot of thought interspersed with my other activities over the 
last months, I have found a pretty easy fix (not much code change) that 
is far simpler than I had envisioned.  I believe we established in the 
earlier go-round that the current model is so far from the correct 
behavior that it is acceptable to break the API somewhat.

What I propose now is to take David's suggestion and test on output that 
a utf8 string doesn't contain one of these characters.  This would be in 
the same area of code as the "Wide character in print" message currently 
is.  I'm leaning toward not making that a fatal error.

The API would change so that the existing flags governing these behaviors 
would be renamed and have their sense inverted: when cleared, a flag would 
mean to accept these code points; when set, to reject them.  The effect is 
that the default becomes acceptance.  The two output flags that make the 
false 32/34 distinction would be combined into one.

Encode.xs would then have a short patch to forbid these characters, as 
that is the way input of utf8 strings is done.  The :utf8 layer which 
doesn't do error checking wouldn't change.  I believe these are the only 
two ways to get utf8 input.  Please correct me if I'm wrong.

Existing code that relied on the flags would have to change.  Because the 
names change as we invert their sense, such code would fail to compile, so 
its maintainer would know it has to be dealt with.  Besides Encode, there 
is one other place shipped with the core that would have to change: 
Normalize would require a short patch as well.

I propose to treat the beyond Unicode code points the same way as the 
non-character ones.  That is, they are freely usable without warnings 
internally, but I/O would require flags to prevent warning/croaking. 
Essentially, these are the same thing to Unicode.  They are illegal for 
interchange, but Perl has decided to accept them internally.

Zefram, however, wrote on this topic thus:
"No.  It's not OK for a warning to be fatal.  The situation should either
be a fatal error (regardless of warning flags) or a non-fatal warning
(controlled by warning flags).  A warning would make a lot more sense,
because Perl is generally happy to process codepoints in ways that
Unicode does not permit."

I'm afraid I don't grok this.  It sounds like he wants to replace the 
entire current mechanism for dealing with malformed (or merely suspect) 
utf8, as the way it works now is that by turning warnings off, 
processing continues, but turning them on causes the warnings to be 
fatal errors instead.  That's a bigger issue.  And, one that hasn't 
really been vetted here before, as far as I know.  We have this existing 
mechanism.  I was trying to fit as much as possible into it without 
disturbing it much.  Is that the right approach, or what?

