Re: [perl #100058] Perl leaves broken UTF-8 in SVs whose UTF8 is set

Karl Williamson
September 27, 2011 16:11
On 09/26/2011 02:27 PM, Tom Christiansen wrote:
>> I think it was agreed some time ago that that is a bug.  The utf8 layer
>> should at least check for well-formedness (meaning that it produces a
>> valid perl scalar), even if it does not check for strict UTF-8 (disallow
>> certain codepoints) (the latter being a matter of controversy).
> I do have some mail from Mark Davis explaining why a UTF-8 decoder must
> allow everything in the range U+0000 through U+10FFFF *except* for
> surrogates.  Our "nonchar" warnings apparently shouldn't be there.
> --tom

This issue keeps coming back up, when I think we have long ago resolved 
how to fix it.  Here is my view of how the API should work, and I 
thought that it followed the consensus view.  This follows what I think 
Zefram and David Golden proposed more than a year ago.

The default utf8 layer should prohibit malformed utf8, surrogates, 
non-character code points and above-Unicode code points.

There should be an alternate layer, called something like utf8-lax, 
which allows all three, but not malformed utf8.  There should be three 
other layers, with names like no-surrogates, no-nonchars, and 
only-unicode which disallow exactly one class, as indicated by their 
names.  It should then be possible to combine these orthogonally to 
allow any combination of the three problematic input types.
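
As a rough illustration of the orthogonal combination described above, the 
three problematic classes can be modelled as independent policy bits.  This 
is a sketch in Python, not Perl's layer API: the names Disallow, STRICT, LAX, 
classify and first_violation are hypothetical, and it works on code points 
that have already been decoded, so malformed utf8 is outside its scope.

```python
from enum import Flag

MAX_UNICODE = 0x10FFFF


class Disallow(Flag):
    """Hypothetical policy bits, one per problematic class."""
    SURROGATES = 1      # U+D800..U+DFFF
    NONCHARS = 2        # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each plane
    ABOVE_UNICODE = 4   # anything past U+10FFFF


# Stand-ins for the proposed default layer and the "utf8-lax" layer.
STRICT = Disallow.SURROGATES | Disallow.NONCHARS | Disallow.ABOVE_UNICODE
LAX = Disallow(0)


def classify(cp):
    """Return the problematic class of code point cp, or None if it has none."""
    if cp > MAX_UNICODE:
        return Disallow.ABOVE_UNICODE
    if 0xD800 <= cp <= 0xDFFF:
        return Disallow.SURROGATES
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return Disallow.NONCHARS
    return None


def first_violation(code_points, disallow=STRICT):
    """Return (index, code point, class) for the first rejected code point, or None."""
    for i, cp in enumerate(code_points):
        kind = classify(cp)
        if kind is not None and kind in disallow:
            return (i, cp, kind)
    return None
```

Under these assumptions, a "no-surrogates" layer corresponds to passing 
Disallow.SURROGATES alone, and any other combination is the bitwise OR of the 
classes to be rejected, for example Disallow.SURROGATES | Disallow.ABOVE_UNICODE.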

My understanding is that the original reason for not doing the input 
checks was performance.  Security is a far more important issue now, and 
Nicholas has demonstrated code that does the parsing with a minimal 
performance hit.

I have been waiting for that code to be complete, and then planned to 
implement the other layers, unless someone else wanted to.

Having now read Mark's email, I don't think that contradicts anything 
said above.  It should be possible for a utf8 decoder to allow 
non-characters, but it should be possible for such a decoder to disallow 
them as well, and that should be what you get by default.  Only by 
taking extra action should you be able to specify that you want atypical 
code points allowed.
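
The "strict by default, lax only on request" behaviour can be sketched as 
follows.  This is Python standing in for the idea, not a description of any 
Perl interface: decode_utf8 and its allow_nonchars parameter are hypothetical 
names.  It relies on Python's built-in strict UTF-8 codec, which rejects 
malformed sequences and encoded surrogates but accepts non-characters.

```python
def decode_utf8(data, allow_nonchars=False):
    """Decode bytes as UTF-8; non-characters are rejected unless opted in."""
    # Python's strict codec raises UnicodeDecodeError on malformed sequences
    # and on encoded surrogates; this check is never skipped.
    text = data.decode("utf-8", errors="strict")
    if not allow_nonchars:
        for i, ch in enumerate(text):
            cp = ord(ch)
            # U+FDD0..U+FDEF, plus the last two code points of every plane.
            if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
                raise ValueError("non-character U+%04X at index %d" % (cp, i))
    return text
```

A caller who wants non-characters passed through has to ask for it with 
decode_utf8(data, allow_nonchars=True); malformed input fails either way.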
