perl.perl5.porters | Postings from June 2010

RFC: Unicode non-characters vs. non Unicode characters

From: karl williamson
Date: June 10, 2010 12:25
Subject: RFC: Unicode non-characters vs. non Unicode characters
Message ID: 4C113BBD.8070302@khwilliamson.com

A Unicode non-character is one of 66 ordinals (code points) reserved by 
Unicode to never be a character.

A non Unicode character is a code point above 0x10FFFF; Unicode says 
it will never, ever use those.  They are, and always will be, outside 
the Unicode standard.  For more clarity, I will call them non-Unicode 
code points.
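
To make the two classes concrete, here is a minimal sketch in Python (not Perl's implementation). The function names are hypothetical and chosen only for illustration. It relies on the fact that the 66 non-characters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```python
MAX_UNICODE = 0x10FFFF  # highest code point the Unicode standard defines


def is_noncharacter(cp: int) -> bool:
    """True for the 66 code points Unicode reserves as non-characters."""
    if 0xFDD0 <= cp <= 0xFDEF:  # contiguous block of 32
        return True
    # Last two code points of each of the 17 planes (U+nFFFE, U+nFFFF): 34 more.
    return 0 <= cp <= MAX_UNICODE and (cp & 0xFFFE) == 0xFFFE


def is_non_unicode(cp: int) -> bool:
    """True for code points above the Unicode range."""
    return cp > MAX_UNICODE


def classify(cp: int) -> str:
    """Hypothetical helper: name the class a code point falls into."""
    if is_noncharacter(cp):
        return "non-character"
    if is_non_unicode(cp):
        return "non-Unicode code point"
    return "other"


if __name__ == "__main__":
    for cp in (0x41, 0xFDD0, 0xFFFF, 0x10FFFE, 0x110000):
        print(f"0x{cp:X}: {classify(cp)}")
```

Counting `is_noncharacter` over the whole range 0..0x10FFFF gives 32 + 34 = 66, matching the figure above.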

A non-character is illegal for interchange, but one is free to use it 
internally in an application.  Note that an application can be any 
number of cooperating processes, so these code points are usable in 
I/O between them.

Unicode doesn't like anyone using a non-Unicode code point, but Perl 
accepts them.

The problem, it seems to me, is that Perl treats these two classes 
of code point differently, and I am trying to reconcile that behavior. 
It seems to me that they should have rough parity.

When one is converting from code point to utf8, there is parity.  Use of 
either of these will raise a warning, but there is a flag for each that 
turns off the corresponding warning.

The difference comes when trying to go from utf8 to ordinal.  The 
non-Unicode code points are accepted unconditionally, without any 
warnings ever.  The non-character code points are treated as malformed 
utf8, and unless the flag is set to allow them, will cause Perl to throw 
up its hands.

This just seems wrong to me.  Neither is more malformed than the other. 
Neither should be used for interchange with unsuspecting applications, 
but both should be usable within a set of cooperating applications.  Yet 
Perl treats worst the ones that Unicode likes best.  I suspect that 
this behavior stems from early Unicode documentation which called the 
non-characters "illegal characters", but that is not how they are 
defined now.  (Their provenance lies in potential big-endian vs. 
little-endian confusion, and in trying to make it a little easier to 
process Unicode in 16-bit word chunks.)
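
One way to picture the "rough parity" argued for here is a single check that treats both classes the same way: warn by default, with one opt-out flag per class. The sketch below is in Python and is purely illustrative. `check_code_point` and its flag names are hypothetical and do not correspond to any Perl API.

```python
import warnings

MAX_UNICODE = 0x10FFFF


def _is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of each plane.
    return (0xFDD0 <= cp <= 0xFDEF) or (
        0 <= cp <= MAX_UNICODE and (cp & 0xFFFE) == 0xFFFE
    )


def check_code_point(cp: int, allow_nonchar: bool = False,
                     allow_non_unicode: bool = False) -> int:
    """Hypothetical symmetric policy: each class warns unless its flag is set.

    Neither class is rejected as malformed; the code point is always returned.
    """
    if _is_noncharacter(cp) and not allow_nonchar:
        warnings.warn(f"non-character U+{cp:04X} is not for open interchange",
                      stacklevel=2)
    elif cp > MAX_UNICODE and not allow_non_unicode:
        warnings.warn(f"code point 0x{cp:X} is above the Unicode range",
                      stacklevel=2)
    return cp
```

A set of cooperating applications would pass the relevant flag and see no warning; everything else gets the same warning in either direction of conversion.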

I'm curious whether anyone has some ideas on this.


