perl.perl5.porters | Postings from December 2010

Re: RFC: Summary of proposed handling of surrogates, non-characters,etc for 5.14. Note some backward incompatibility

From: karl williamson
Date: December 20, 2010 17:06
Subject: Re: RFC: Summary of proposed handling of surrogates, non-characters,etc for 5.14. Note some backward incompatibility
Message ID: 4D0FFD99.6080108@khwilliamson.com

Eric Brine wrote:
> On Mon, Dec 20, 2010 at 4:27 AM, demerphq <demerphq@gmail.com> wrote:
> 
>>> I *am* very opposed to surrogate codepoints behaving differently from
>>> non-surrogate codepoints under the allow-any-UV-codepoint paradigm.
>> Why shouldn't perl warn when it tries to lc() a string containing a
>> surrogate pair instead of the correctly decoded true codepoint the
>> surrogate pair represents?
>>
> 
> So you suggest we warn for code points that *can not* be encoded in UTF-16,
> but remain silent for code points that *must not* be encoded in UTF-16 (e.g.
> 0xFFFE)? If anything, that sounds backwards. Why warn for what already fails
> safe?

It is legal for 0xFFFE to be encoded in UTF-16.  It is illegal to 
interchange it to an unsuspecting application.  To rephrase slightly, a 
set of co-operating processes may exchange streams containing the UTF-16 
for 0xFFFE and all the other non-characters.  That would be called 
intrachange.  Interchange of these is illegal; intrachange is legal.

That means that an application accepting arbitrary UTF-16 input needs to 
protect itself against strings containing 0xFFFE, the byte-swapped form 
of the byte order mark 0xFEFF, which could mislead it into thinking the 
stream has a different endianness than it does.  That is, at least 
theoretically, a security attack.
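
As a rough illustration of the endianness confusion (in Python rather than Perl, and not code from this thread), the sketch below shows that U+FFFE serialized big-endian is byte-for-byte the little-endian byte order mark. The `sniff_byte_order` helper is hypothetical; it stands in for any reader that trusts the first two bytes of a stream.

```python
import codecs

# U+FFFE serialized big-endian is byte-for-byte the little-endian BOM.
assert "\ufffe".encode("utf-16-be") == codecs.BOM_UTF16_LE == b"\xff\xfe"


def sniff_byte_order(data: bytes) -> str:
    """Hypothetical BOM sniffer: trusts the first two bytes blindly."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    return "unknown"


# A big-endian stream with no BOM that starts with U+FFFE, then "A".
stream = "\ufffeA".encode("utf-16-be")

order = sniff_byte_order(stream)          # the sniffer is fooled
misread = stream[2:].decode("utf-16-le")  # "A" gets read byte-swapped
intended = stream.decode("utf-16-be")     # what the sender meant
```

Here `order` comes out as "little", so the rest of the stream is decoded with the wrong byte order and every code unit after the first is misinterpreted.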

An attack could come through UTF-8 as well.  That is why the default 
:utf8 (and utf16) input layer needs to exclude non-characters as well as 
malformed sequences.  But there needs to be a way to turn that off so 
that cooperating processes can exchange non-characters.
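
A minimal sketch of that "strict by default, with an opt-out" behavior, again in Python for illustration only. It does not show how Perl's layers are implemented; `decode_utf8`, `NoncharacterError`, and the `allow_noncharacters` flag are names invented for this example.

```python
class NoncharacterError(ValueError):
    """Raised when decoded input contains a Unicode noncharacter."""


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def decode_utf8(data: bytes, allow_noncharacters: bool = False) -> str:
    # Malformed sequences always fail: strict decoding raises
    # UnicodeDecodeError.
    text = data.decode("utf-8")
    # Noncharacters fail by default; cooperating processes can opt out.
    if not allow_noncharacters:
        for index, ch in enumerate(text):
            if is_noncharacter(ord(ch)):
                raise NoncharacterError(
                    f"noncharacter U+{ord(ch):04X} at index {index}"
                )
    return text
```

An untrusted reader would call `decode_utf8(data)`, while processes that have agreed to exchange non-characters among themselves would pass `allow_noncharacters=True`.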

> 
> I don't see any reason to treat some non-characters differently than other
> non-characters either.

I agree.

Here's a portion of Section 16.7 of the Unicode Standard:

"Applications are free to use any of these noncharacter code points 
internally but should never attempt to exchange them. If a noncharacter 
is received in open interchange, an application is not required to 
interpret it in any way. It is good practice, however, to recognize it 
as a noncharacter and to take appropriate action, such as replacing it 
with U+FFFD replacement character, to indicate the problem in the text. 
It is not recommended to simply delete noncharacter code points from 
such text, because of the potential security issues caused by deleting 
uninterpreted characters. (See conformance clause C7 in Section 3.2, 
Conformance Requirements, and Unicode Technical Report #36, “Unicode 
Security Considerations.”)

"In effect, noncharacters can be thought of as application-internal 
private-use code points.  Unlike the private-use characters discussed in 
Section 16.5, Private-Use Characters, which are assigned characters and 
which are intended for use in open interchange, subject to 
interpretation by private agreement, noncharacters are permanently 
reserved (unassigned) and have no interpretation whatsoever outside of 
their possible application-internal private uses."
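
To make the "replace, don't delete" advice in the quoted passage concrete, here is a small Python sketch (illustrative only; the function names are made up). It assumes a scenario of the kind Unicode Technical Report #36 warns about, where a filter has already inspected the text before the noncharacter is removed.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def replace_noncharacters(text: str) -> str:
    """Recommended: mark the problem spot with U+FFFD."""
    return "".join(
        REPLACEMENT if is_noncharacter(ord(ch)) else ch for ch in text
    )


def delete_noncharacters(text: str) -> str:
    """Discouraged: silently joins the text on either side."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))


suspicious = "<scr\ufffeipt>"
replaced = replace_noncharacters(suspicious)  # still not a script tag
deleted = delete_noncharacters(suspicious)    # now spells "<script>"
```

Deletion turns text that an earlier check would have passed as harmless into something that check was meant to block; replacement keeps the two halves apart.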

An example of their valid use would be a distributed system that takes 
arbitrary Unicode as input.  The system would make sure that no inputs 
have these characters in them, but it could then use these characters in 
streams, intermixed with the input, communicating between the various 
processes that comprise it.
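
The kind of system described above could be sketched as follows (Python, purely illustrative). The choice of U+FDD0 as the separator and the helper names are assumptions made for this example, not anything proposed in the thread.

```python
FIELD_SEP = "\ufdd0"  # arbitrary pick from U+FDD0..U+FDEF (assumption)


def is_noncharacter(cp: int) -> bool:
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def accept_input(text: str) -> str:
    """Boundary check: refuse outside text that contains noncharacters."""
    if any(is_noncharacter(ord(ch)) for ch in text):
        raise ValueError("noncharacter in external input")
    return text


def pack(fields: list[str]) -> str:
    """Internal framing: safe because no vetted field can contain FIELD_SEP."""
    return FIELD_SEP.join(accept_input(field) for field in fields)


def unpack(message: str) -> list[str]:
    """Inverse of pack(), used by the cooperating process on the other end."""
    return message.split(FIELD_SEP)
```

Because the inputs are vetted at the boundary, the fields may contain commas, newlines, or any other ordinary delimiter without needing an escaping scheme.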

