Eric Brine wrote:
> On Mon, Dec 20, 2010 at 4:27 AM, demerphq <demerphq@gmail.com> wrote:
>
>>> I *am* very opposed to surrogate codepoints behaving differently from
>>> non-surrogate codepoints under the allow-any-UV-codepoint paradigm.
>> Why shouldn't perl warn when it tries to lc() a string containing a
>> surrogate pair instead of the correctly decoded true codepoint the
>> surrogate pair represents?
>>
>
> So you suggest we warn for code points that *can not* be encoded in UTF-16,
> but remain silent for code points that *must not* be encoded in UTF-16 (e.g.
> 0xFFFE)? If anything, that sounds backwards. Why warn for what already fails
> safe?

It is legal for 0xFFFE to be encoded in UTF-16. It is illegal to
interchange it to an unsuspecting application.

To rephrase slightly, a set of co-operating processes may exchange
streams containing the UTF-16 for 0xFFFE and all the other
non-characters. That would be called intrachange. Interchange of these
is illegal; intrachange is legal.

That means that an application that accepts arbitrary strings of UTF-16
input needs to protect itself from strings containing 0xFFFE, which
could mislead it into thinking the stream has a different endianness
than it does. That is, theoretically at least, a security attack.
Similarly, an attack could come through UTF-8.

That is why the default :utf8 (and utf16) input layer needs to exclude
non-characters, as well as malformed sequences. But there needs to be a
way to turn that off so that cooperating processes can exchange them.

>
> I don't see any reason to treat some non-characters differently than other
> non-characters either.

I agree. Here's a portion of Section 16.7 of the Standard:

"Applications are free to use any of these noncharacter code points
internally but should never attempt to exchange them. If a noncharacter
is received in open interchange, an application is not required to
interpret it in any way.
It is good practice, however, to recognize it as a noncharacter and to
take appropriate action, such as replacing it with U+FFFD replacement
character, to indicate the problem in the text. It is not recommended to
simply delete noncharacter code points from such text, because of the
potential security issues caused by deleting uninterpreted characters.
(See conformance clause C7 in Section 3.2, Conformance Requirements, and
Unicode Technical Report #36, “Unicode Security Considerations.”)

"In effect, noncharacters can be thought of as application-internal
private-use code points. Unlike the private-use characters discussed in
Section 16.5, Private-Use Characters, which are assigned characters and
which are intended for use in open interchange, subject to
interpretation by private agreement, noncharacters are permanently
reserved (unassigned) and have no interpretation whatsoever outside of
their possible application-internal private uses."

An example of their valid use would be a distributed system that takes
arbitrary Unicode as input. The system would make sure that no inputs
have these characters in them, but it could then use these characters in
streams, intermixed with the input, communicating between the various
processes that comprise it.
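
To make the policy concrete, here is a minimal sketch. It is written in
Python purely for illustration; it is not Perl, and it is not how any
existing PerlIO layer behaves. The names is_noncharacter and sanitize,
and the allow_noncharacters flag, are hypothetical. The sketch assumes
the input has already been decoded to code points; it replaces
noncharacters with U+FFFD rather than deleting them, as the quoted
passage recommends, and it offers an opt-out for cooperating processes.
The last lines show why 0xFFFE in particular can mislead a reader about
endianness.

```python
import codecs

REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes
    # (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def sanitize(text: str, allow_noncharacters: bool = False) -> str:
    """Hypothetical input filter for text arriving from open interchange.

    By default every noncharacter is replaced with U+FFFD (not deleted).
    Cooperating processes that use noncharacters internally can pass
    allow_noncharacters=True to leave the text untouched.
    """
    if allow_noncharacters:
        return text
    return "".join(
        REPLACEMENT if is_noncharacter(ord(ch)) else ch for ch in text
    )


# U+FFFE serialized as big-endian UTF-16 is the same two bytes as the
# little-endian byte order mark, which is what could fool a reader that
# sniffs endianness from the start of the stream.
assert "\uFFFE".encode("utf-16-be") == codecs.BOM_UTF16_LE

print(sanitize("a\uFFFEb") == "a\uFFFDb")                 # open interchange
print(sanitize("a\uFFFEb", allow_noncharacters=True) == "a\uFFFEb")  # intrachange
```

Making the strict behaviour the default and the permissive one an
explicit opt-in mirrors the argument above: an unsuspecting application
is protected without doing anything, while a distributed system that
uses noncharacters as internal markers has to ask for them.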