2009/10/5 John Gardiner Myers <jgmyers@proofpoint.com>:
> karl williamson wrote:
>>
>> 3) code points above the legal Unicode maximum 10FFFF (which they have
>> recently reaffirmed will NEVER be exceeded (in 5.2, just released, 22%
>> of the available code points are assigned, up from 21% in 5.1)).
>>
>> 4) surrogate code points
>>
>> Case 3) could be construed as non-characters, but are somewhat
>> different because they aren't legal Unicode code points.  The message
>> could be reworded slightly to include them, as the principle is the same,
>> namely that these can successfully be used internally in an application, but
>> shouldn't be used for interchange with an unsuspecting application.  But
>> actually, I would prefer adding a new message for these, as there could be
>> less restriction on them, as there isn't the possibility of confusion with
>> BOM or other things.
>>
>> Case 4) has a separate message "UTF-16 surrogate 0x%04".  I think that
>> these actually could also be used internally in an application like the
>> others.  But this would definitely be an extension of Unicode, and require
>> some more work, and so I don't advocate it.
>
> Cases (3) and (4), when encoded in UTF-8, result in ill-formed code unit
> sequences (See definitions D92 and D93 in the Unicode Standard, version
> 5.2).  Generating such ill-formed code unit sequences violates conformance
> requirement C9 of the Unicode Standard.  Interpreting such ill-formed code
> unit sequences as characters violates conformance requirement C10 of the
> Unicode Standard.
>
> So Perl's existing behavior of merely warning on such "code points" does not
> conform to the Unicode Standard.

I think the language lawyers worked around this by saying that perl
internally does "utf8", which is basically "UTF-8" with some rules
relaxed and a larger range of "code points".
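To make the "utf8" versus "UTF-8" distinction concrete, here is a minimal
sketch, assuming a perl that ships the core Encode module. The helper name
accepts() is made up for illustration; encode() and FB_CROAK are Encode's
own. It asks each encoding name whether it will encode a plain letter, a
surrogate, and a code point just past 10FFFF.

```perl
use strict;
use warnings;
no warnings 'utf8';    # silence perl's own surrogate / non-Unicode warnings
use Encode qw(encode FB_CROAK);

# Hypothetical helper: returns 1 if the named encoding will encode the
# given code point, 0 if encode() croaks on it.
sub accepts {
    my ($encoding, $code_point) = @_;
    # Work on a fresh scalar, since encode() may modify its input
    # when a CHECK argument is supplied.
    my $string = chr($code_point);
    my $ok = eval { encode($encoding, $string, FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

# A plain letter, a surrogate, and the first code point past 10FFFF.
for my $cp (0x41, 0xD800, 0x110000) {
    printf "U+%04X  utf8: %-8s  UTF-8: %s\n", $cp,
        accepts('utf8',  $cp) ? 'ok' : 'rejected',
        accepts('UTF-8', $cp) ? 'ok' : 'rejected';
}
```

The lax "utf8" name is expected to let all three through, while the strict
"UTF-8" name should refuse the surrogate and the above-maximum code point.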
IIUC certain parts of the system insist on operating on strict UTF-8
only. To the best of my knowledge, for example, the conversion layer in
Encode.pm provides user-selectable levels of error trapping for such
sequences.

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"