On Sat Sep 07 19:25:18 2013, jkeenan wrote:
> On Thu Sep 20 18:57:28 2012, jkeenan wrote:
> > On Fri Oct 14 16:06:47 2011, tom christiansen wrote:
> > > Also, this entry from perldiag lies:
> > >
> > >   Malformed UTF-8 character (%s)
> > >     (S utf8) (F) Perl detected a string that didn't comply with UTF-8
> > >     encoding rules, even though it had the UTF8 flag on.
> > >
> > >     One possible cause is that you set the UTF8 flag yourself for
> > >     data that you thought to be in UTF-8 but it wasn't (it was for
> > >     example legacy 8-bit data). To guard against this, you can use
> > >     Encode::decode_utf8.
> > >
> > >     If you use the ":encoding(UTF-8)" PerlIO layer for input, invalid
> > >     byte sequences are handled gracefully, but if you use ":utf8",
> > >     the flag is set without validating the data, possibly resulting
> > >     in this error message.
> > >
> > >     See also "Handling Malformed Data" in Encode.
> >
> > The above is from 'perldiag'.
> >
> > The below is Tom's comment:
> >
> > >
> > > That's because using ":encoding(UTF-8)" instead of ":utf8" makes
> > > absolutely no difference. The output and behavior are identical.
> > > Therefore it does *not*do*you*any*good*, and perldiag is in error.
> > >
> > > Karl, isn't there something about this being some sort of security
> > > problem? Or is it ok because the code point seems to be construed
> > > as U+0000?
> > >
> >
> > Can anyone comment on these issues?
> >
> > Thank you very much.
> > Jim Keenan
>
> I pose the question again, a year later: Can anyone comment on these
> issues? Is there a real problem here at all?
>
> Thank you very much.
> Jim Keenan

I finally looked at this, and it appears to me that the ticket is in error.

I did this:

    perl -E'say qq[\300\261];' > foo

and then ran the attached test_encoding.pl program on it. The results are:

=============================
utf8 "\xC0" does not map to Unicode at test.pl line 3.
utf8 "\xB1" does not map to Unicode at test.pl line 3.
5C \
78 x
43 C
30 0
5C \
78 x
42 B
31 1
0A
=================================

What actually got placed into the variable $x is not malformed utf8, but a
string consisting of a representation of what was attempted to be input:
"\xC0\xB1". That is apparently what perldiag means by the word "gracefully".

If I change the program (attached as test_utf8.pl) so that the input
discipline is :utf8 instead, the output is

=======================
utf8 "\xC0" does not map to Unicode at test.pl line 3, <$fh> line 1.
C0 �� B1 0A
======================

which contains malformed data.

So :encoding(UTF-8) refuses to input malformed UTF-8, but :utf8 allows it.
They don't behave identically.

I think this ticket should be rejected, pending an explanation from Tom.

--
Karl Williamson

---
via perlbug:  queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=101384
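
The attached test_encoding.pl and test_utf8.pl are not reproduced in this message. For anyone who wants to repeat the comparison, here is a minimal sketch of a comparable check. It is not the attached program: the helper name hex_of_first_line and the use of a temporary file in place of "foo" are assumptions made for illustration, and it prints only the hex ordinals, not the characters.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the raw bytes C0 B1 followed by a newline, the same content
# that the one-liner in the message puts into "foo".
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xC0\xB1\n";
close $out or die "close: $!";

# Hypothetical helper: read the first line back through the given
# PerlIO input layer and return each character's ordinal in hex.
sub hex_of_first_line {
    my ($layer) = @_;
    open my $fh, "<$layer", $path or die "open: $!";
    my $x = <$fh>;
    close $fh;
    return join ' ', map { sprintf '%02X', ord } split //, $x;
}

# With :encoding(UTF-8) the invalid bytes are replaced by the literal
# text \xC0\xB1 (with warnings on STDERR); the run reported in the
# message gave 5C 78 43 30 5C 78 42 31 0A.
print hex_of_first_line(':encoding(UTF-8)'), "\n";
```

Calling hex_of_first_line(':utf8') on the same file gives the other half of the comparison. What it prints, and which warnings appear, may differ between Perl versions, so no expected output is given for that case.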