[perl #63446] utf8 fatal warning

From: Father Chrysostomos via RT
Date: November 28, 2010 13:16
Subject: [perl #63446] utf8 fatal warning
Message ID: rt-3.6.HEAD-13564-1290978986-1559.63446-15-0@perl.org

On Sun Mar 21 19:08:59 2010, khw wrote:
> On Tue Feb 24 13:27:21 2009, zefram@fysh.org wrote:
> > $ perl -le 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
> > no
> > $ perl -lwe 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
> > Malformed UTF-8 character (fatal) at -e line 1.
> > $
> > 
> > Turning warnings on makes the regexp operation die, whereas with warnings
> > off it produced the correct answer.  If 'no warnings "utf8"' is in scope
> > at the regexp op, then the error does not occur.
> 
> I'm not sure what to do about this ticket.  The basics of it anyway are
> behaving as designed, which is that non-characters and surrogates
> generate errors unless warnings are turned off, but then things should
> work.  

It may be working as designed, but it was not designed very well.

> The message in 5.12 for U+FFFF has been clarified that this
> character is illegal for interchange.  This should be extended in a
> later release to the other 65 noncharacters.
> 
> Surrogates, on the other hand, should never appear in well-formed utf8,
> and there are security considerations for doing so that I don't fully
> understand but can see why.  

The regular expression engine is not a security layer. It should not
pretend to be one. If I want to implement a security layer using regular
expressions, then this bug (yes, I do consider it a bug) will get in the
way.

Furthermore, Perl’s strings are not just Unicode. Unicode strings are
merely a subset of the strings that Perl supports.

Regular expressions are for looking at strings. So the regular expression
engine should not warn or die based on the contents of the string, as long
as it is a valid Perl string.
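
One way to check how a particular perl behaves here is to run the match
from the original report while collecting warnings and trapping a fatal
error. This is only a sketch: the variable names are illustrative, and the
result depends on the perl version, so no particular output is claimed.

```perl
use strict;
use warnings;

# Build the string with utf8 warnings off, as in the original report.
my $str = do { no warnings 'utf8'; "\x{d800}" };

my @warnings;
my $matched = eval {
    # Collect warnings instead of printing them.
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    # The match itself runs with warnings enabled.
    $str =~ /\A[\x{123}]/ ? 1 : 0;
};
# eval returns undef if the match died (e.g. with a fatal utf8 error).
my $died = defined $matched ? 0 : 1;

printf "matched: %s, died: %s, warnings: %d\n",
    ($died ? "n/a" : $matched ? "yes" : "no"),
    ($died ? "yes" : "no"),
    scalar @warnings;
```

Under the behaviour argued for in this message, the match would report
"no" without dying and without any warnings.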

perl already warns for "\x{d800}" and chr 0xd800. So if such a string is
passed to a regular expression, we get multiple warnings for the same
character.

I use Perl’s strings for storing 16-bit binary data. The result is that
not only the code creating such strings, but any code looking at the
strings, has to turn off utf8 warnings. So I can’t use any CPAN modules
such as Data::Dump::Streamer.
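
As a rough illustration of that use case, the sketch below stores one
16-bit value per character and reads the values back. The helper names
words_to_string and string_to_words are hypothetical and are not taken
from any module; whether the chr calls warn without the "no warnings"
line depends on the perl version.

```perl
use strict;
use warnings;

# Hypothetical helper: store each 16-bit value as one character.
sub words_to_string {
    my @words = @_;
    # Code points such as 0xD800 (a surrogate) or 0xFFFF (a noncharacter)
    # may trigger utf8 warnings on some perls, so the category is
    # disabled in this scope only.
    no warnings 'utf8';
    return join '', map { chr($_) } @words;
}

# Hypothetical helper: read the values back without using the regex engine.
sub string_to_words {
    my ($str) = @_;
    return map { ord(substr($str, $_, 1)) } 0 .. length($str) - 1;
}

my @in  = (0x0041, 0xD800, 0xFFFF, 0x0000);
my $str = words_to_string(@in);
my @out = string_to_words($str);

printf "%d characters, round trip %s\n",
    length $str, ("@in" eq "@out" ? "ok" : "FAILED");
```

The creating code can silence the warnings locally like this. The
complaint is that code which merely inspects such a string with a regular
expression currently has to do the same.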

I propose we stop the regular expression engine from rejecting or
warning about these characters altogether. The only checking should be
for code that creates such characters or for I/O layers.

There are three patches attached that fix a few cases. There will be
more to come.
