
[perl #101384] perldiag does not adequately describe how to avoid malformed UTF8 scalars

From: Karl Williamson via RT
Date: March 22, 2014 20:41
Subject: [perl #101384] perldiag does not adequately describe how to avoid malformed UTF8 scalars
Message ID: rt-4.0.18-6832-1395520885-1866.101384-15-0@perl.org

On Sat Sep 07 19:25:18 2013, jkeenan wrote:
> On Thu Sep 20 18:57:28 2012, jkeenan wrote:
> > On Fri Oct 14 16:06:47 2011, tom christiansen wrote:
> > > Also, this entry from perldiag lies:
> > > 
> > >    Malformed UTF-8 character (%s)
> > >        (S utf8) (F) Perl detected a string that didn't comply with UTF-8
> > >        encoding rules, even though it had the UTF8 flag on.
> > > 
> > >        One possible cause is that you set the UTF8 flag yourself for
> > >        data that you thought to be in UTF-8 but it wasn't (it was for
> > >        example legacy 8-bit data). To guard against this, you can use
> > >        Encode::decode_utf8.
> > > 
> > >        If you use the ":encoding(UTF-8)" PerlIO layer for input, invalid
> > >        byte sequences are handled gracefully, but if you use ":utf8",
> > >        the flag is set without validating the data, possibly resulting
> > >        in this error message.
> > > 
> > >        See also "Handling Malformed Data" in Encode.
> > 
> > The above is from 'perldiag'.
> > 
> > The below is Tom's comment:
> > 
> > > 
> > > That's because using ":encoding(UTF-8)" instead of ":utf8" makes 
> > > absolutely no difference.  The output and behavior are identical.
> > > Therefore it does *not*do*you*any*good*, and perldiag is in error.
> > > 
> > > Karl, isn't there something about this being some sort of security 
> > > problem?  Or is it ok because the code point seems to be construed
> > > as U+0000?
> > > 
> > 
> > Can anyone comment on these issues?
> > 
> > Thank you very much.
> > Jim Keenan
> 
> I pose the question again, a year later:  Can anyone comment on these
> issues?  Is there a real problem here at all?
> 
> Thank you very much.
> Jim Keenan

I finally looked at this, and it appears to me that the ticket is in error.  I did this:

perl -E'say qq[\300\261];' > foo

and then ran the attached test_encoding.pl program on it.  The results are:
=============================
utf8 "\xC0" does not map to Unicode at test.pl line 3.
utf8 "\xB1" does not map to Unicode at test.pl line 3.
5C \
78 x
43 C
30 0
5C \
78 x
42 B
31 1
0A 
=================================
What actually got placed into the variable $x is not malformed UTF-8, but a well-formed string that spells out, as literal text, the bytes that were attempted to be input: "\xC0\xB1".  That is apparently what perldiag means by the word "gracefully".

If I change the program (attached as test_utf8.pl) so that the input layer is :utf8 instead, the output is:
=======================
utf8 "\xC0" does not map to Unicode at test.pl line 3, <$fh> line 1.
C0 ��
B1 

0A 
======================
which contains malformed data.

So :encoding(UTF-8) refuses to input malformed UTF-8, but :utf8 allows it.  They don't behave identically.  I think this ticket should be rejected, pending an explanation from Tom.
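
For readers without the attachments, here is a minimal sketch of the comparison.  It is not the attached test_encoding.pl or test_utf8.pl: read_with_layer is a hypothetical helper, and an in-memory filehandle stands in for the file foo.  It uses utf8::valid to ask whether the scalar that was read holds well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from the attached scripts): read the
# given bytes back through a PerlIO layer, using an in-memory
# filehandle in place of the file "foo".
sub read_with_layer {
    my ($bytes, $layer) = @_;
    open(my $fh, "<$layer", \$bytes) or die "open failed: $!";
    local $/;                       # slurp the whole input in one read
    my $x = <$fh>;
    close $fh;
    return $x;
}

# The same bytes that  perl -E'say qq[\300\261];' > foo  writes.
my $bytes = "\300\261\n";

my $strict = read_with_layer($bytes, ':encoding(UTF-8)');
my $lax    = read_with_layer($bytes, ':utf8');

# utf8::valid() is false only for a UTF8-flagged scalar whose
# internal bytes are not well-formed.
printf "%-18s valid=%d\n", ':encoding(UTF-8)', utf8::valid($strict) ? 1 : 0;
printf "%-18s valid=%d\n", ':utf8',            utf8::valid($lax)    ? 1 : 0;

# Dump the :encoding(UTF-8) result one character per line, in the
# same style as the first listing above.
printf "%02X %s\n", ord($_), $_ for split //, $strict;
```

Both reads are expected to warn on STDERR.  Only the :utf8 result should be reported as not valid, which is the difference described above.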



-- 
Karl Williamson

---
via perlbug:  queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=101384
