On Sun Mar 21 19:08:59 2010, khw wrote:
> On Tue Feb 24 13:27:21 2009, zefram@fysh.org wrote:
> > $ perl -le 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
> > no
> > $ perl -lwe 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
> > Malformed UTF-8 character (fatal) at -e line 1.
> > $
> >
> > Turning warnings on makes the regexp operation die, whereas with
> > warnings off it produced the correct answer. If 'no warnings "utf8"'
> > is in scope at the regexp op, then the error does not occur.
>
> I'm not sure what to do about this ticket. The basics of it anyway are
> behaving as designed, which is that non-characters and surrogates
> generate errors unless warnings are turned off, but then things should
> work.

It may be working as designed, but it was not designed very well.

> The message in 5.12 for U+FFFF has been clarified that this
> character is illegal for interchange. This should be extended in a
> later release to the other 65 noncharacters.
>
> Surrogates, on the other hand, should never appear in well-formed utf8,
> and there are security considerations for doing so that I don't fully
> understand but can see why.

The regular expression engine is not a security layer. It should not pretend to be one. If I want to implement a security layer using regular expressions, then this bug (yes, I do consider it a bug) will get in the way.

Furthermore, Perl’s strings are not just Unicode. Unicode strings are merely a subset of the strings that Perl supports. Regular expressions are for looking at strings, so the regular expression engine should not warn or die based on the contents of a string, as long as it is a valid Perl string.

perl already warns for "\x{d800}" and chr 0xd800. So if such a string is passed to a regular expression, we get multiple warnings for the same character.

I use Perl’s strings for storing 16-bit binary data.
The result is that not only the code creating such strings, but also any code looking at them, has to turn off utf8 warnings. So I can’t use CPAN modules such as Data::Dump::Streamer.

I propose we stop the regular expression engine from rejecting or warning about these characters altogether. The only checking should be for code that creates such characters or for I/O layers.

There are three patches attached that fix a few cases. There will be more to come.
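
To make the burden concrete, here is a minimal sketch of the workaround described above, assuming the behaviour reported in the ticket (the match only dies when utf8 warnings are enabled at the regexp op). The helper name starts_with_u0123 is hypothetical and exists only for illustration. The point is that 'no warnings "utf8"' has to appear twice: once where the string is created, and again in the code that merely looks at it.

```perl
use strict;
use warnings;

# Hypothetical helper: reports whether $str begins with U+0123.
sub starts_with_u0123 {
    my ($str) = @_;

    # Per the ticket, the match is reported to die only when utf8
    # warnings are in scope at the regexp op, so disable them here.
    no warnings 'utf8';
    return $str =~ /\A[\x{123}]/ ? "yes" : "no";
}

# The code creating the string needs the category disabled as well.
my $surrogate = do { no warnings 'utf8'; "\x{d800}" };

print starts_with_u0123($surrogate), "\n";      # prints "no"
print starts_with_u0123("\x{123}abc"), "\n";    # prints "yes"
```

A module that I do not control cannot be given that second 'no warnings', because the pragma is lexically scoped. That is why the check belongs in string creation and I/O rather than in the regexp engine.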