
[perl #100058] Perl leaves broken UTF-8 in SVs whose UTF8 is set

From: Father Chrysostomos via RT
Date: October 23, 2011 14:05
Subject: [perl #100058] Perl leaves broken UTF-8 in SVs whose UTF8 is set
Message ID: rt-3.6.HEAD-31297-1319403904-1069.100058-15-0@perl.org

On Tue Sep 27 16:11:29 2011, public@khwilliamson.com wrote:
> On 09/26/2011 02:27 PM, Tom Christiansen wrote:
> >> I think it was agreed some time ago that that is a bug.  The utf8 layer
> >> should at least check for well-formedness (meaning that it produces a
> >> valid perl scalar), even if it does not check for strict UTF-8
> >> (disallow certain codepoints) (the latter being a matter of controversy).
> >
> > I do have some mail from Mark Davis explaining why a UTF-8 decoder must
> > allow everything in the range U+0000 through U+10FFFF *except* for
> > surrogates.  Our "nonchar" warnings apparently shouldn't be there.
> >
> > --tom
> >
> 
> This issue keeps coming back up, when I think we have long ago resolved 
> how to fix it.  Here is my view of how the API should work, and I 
> thought that it followed the consensus view.  This follows what I think 
> Zefram and David Golden proposed more than a year ago.
> 
> The default utf8 layer should prohibit malformed utf8,

Yes, of course.

> surrogates, 
> non-character code points and above-Unicode code points.

That might be going too far.

> 
> There should be an alternate layer, called something like utf8-lax, 
> which allows all three, but not malformed utf8.  There should be three 
> other layers, with names like no-surrogates, no-nonchars, and 
> only-unicode which disallow exactly one class, as indicated by their 
> names.  It should then be possible to combine these to orthogonally 
> allow any combination of the three problematic input types.
> 
> My understanding is that the original reason for not doing the input 
> checks was performance.  Security is a far more important issue now,

Indeed, but the only example given where non-characters were a security
issue involved three pieces of buggy software interacting, including a
‘security’ layer that wasn’t.

(Have I already said this?  I have a backlog of messages I wanted to
reply to, so I may be repeating myself.)
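
For readers following the thread, the three classes of code point in the quoted proposal can be shown with a small sketch. It is only an illustration, written in Python over plain integers, not Perl's implementation and not an existing PerlIO layer. The names Reject, STRICT, LAX, violations and accepts are all made up here. It assumes the input has already been decoded, so it says nothing about malformed UTF-8, which everyone in the thread agrees should be rejected in any case.

```python
from enum import Flag

MAX_UNICODE = 0x10FFFF  # last code point Unicode defines


class Reject(Flag):
    """Hypothetical policy bits, one per class of code point in the proposal."""
    SURROGATES = 1      # would correspond to a "no-surrogates" layer
    NONCHARS = 2        # would correspond to a "no-nonchars" layer
    ABOVE_UNICODE = 4   # would correspond to an "only-unicode" layer


# Assumed mapping: the proposed default rejects all three classes,
# while "utf8-lax" rejects none of them.
STRICT = Reject.SURROGATES | Reject.NONCHARS | Reject.ABOVE_UNICODE
LAX = Reject(0)


def is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF


def is_nonchar(cp):
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    if cp > MAX_UNICODE:
        return False
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_above_unicode(cp):
    return cp > MAX_UNICODE


def violations(cp):
    """Return the set of classes that this code point falls into."""
    found = Reject(0)
    if is_surrogate(cp):
        found |= Reject.SURROGATES
    if is_nonchar(cp):
        found |= Reject.NONCHARS
    if is_above_unicode(cp):
        found |= Reject.ABOVE_UNICODE
    return found


def accepts(policy, code_points):
    """True if no code point falls into a class that the policy rejects."""
    return all(not (violations(cp) & policy) for cp in code_points)
```

Because each class has its own bit, any of the eight combinations can be requested, e.g. accepts(Reject.SURROGATES, cps) allows non-characters and above-Unicode code points but refuses surrogates.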



