On Fri Jun 22 14:31:51 2007, jgmyers wrote:
>
> This is a bug report for perl from jgmyers@pong.us.proofpoint.com,
> generated with the help of perlbug 1.35 running under perl v5.8.8.
>
>
> -----------------------------------------------------------------
> [Please enter your report here]
>
> This bug is similar to bug #38722. utf8::valid() and utf8::decode()
> incorrectly consider illegal characters and surrogates as being valid.
> A script that depends on using these functions to validate untrusted
> input will then have the resulting invalid unicode strings throw
> warnings out of Perl_uvuni_to_utf8_flags in later processing.
>
> The following patch tightens up the validity checks to exclude such
> illegal and ill-formed characters. Applying it causes a couple of
> perl's harness tests to fail as those tests incorrectly expect to be
> able to process surrogates and illegal characters.
>
> This also brings up the separate issue that the "chr" function should
> probably throw a warning when asked to create a character that
> Perl_uvuni_to_utf8_flags would warn about.
>
> --- perl-5.8.8-attrib/utf8.h 2006-06-26 15:34:05.000000000 -0700
> +++ perl-5.8.8-utf8valid/utf8.h 2007-06-22 14:18:26.000000000 -0700
> @@ -276,15 +276,13 @@
>       (p)[2] >= 0x80 && (p)[2] <= 0xBF)
>  #define IS_UTF8_CHAR_3c(p) \
>      ((p)[0] == 0xED && \
> -     (p)[1] >= 0x80 && (p)[1] <= 0xBF && \
> -     (p)[2] >= 0x80 && (p)[2] <= 0xBF)
> -/* In IS_UTF8_CHAR_3c(p) one could use
> - * (p)[1] >= 0x80 && (p)[1] <= 0x9F
> - * if one wanted to exclude surrogates. */
> +     (p)[1] >= 0x80 && (p)[1] <= 0x9F)
>  #define IS_UTF8_CHAR_3d(p) \
>      ((p)[0] >= 0xEE && (p)[0] <= 0xEF && \
>       (p)[1] >= 0x80 && (p)[1] <= 0xBF && \
> -     (p)[2] >= 0x80 && (p)[2] <= 0xBF)
> +     (p)[2] >= 0x80 && (p)[2] <= 0xBF && \
> +     ((p)[0] != 0xEF || (((p)[1] != 0xBF || (p)[2] <= 0xBD) && \
> +     ((p)[1] != 0xB7 || (p)[2] <= 0x8F || (p)[2] >= 0xB0))))
>  #define IS_UTF8_CHAR_4a(p) \
>      ((p)[0] == 0xF0 && \
>       (p)[1] >= 0x90 && (p)[1] <= 0xBF && \
> @@ -315,9 +313,9 @@
>       IS_UTF8_CHAR_3c(p) || \
>       IS_UTF8_CHAR_3d(p))
>  #define IS_UTF8_CHAR_4(p) \
> -    (IS_UTF8_CHAR_4a(p) || \
> -     IS_UTF8_CHAR_4b(p) || \
> -     IS_UTF8_CHAR_4c(p))
> +    ((IS_UTF8_CHAR_4a(p) || \
> +     IS_UTF8_CHAR_4b(p) || \
> +     IS_UTF8_CHAR_4c(p)) && ((p)[2] != 0xBF || (p)[3] <= 0xBD || ((p)[1] & 0xf) != 0xf))
>
>  /* IS_UTF8_CHAR(p) is strictly speaking wrong (not UTF-8) because it
>   * (1) allows UTF-8 encoded UTF-16 surrogates
>
>
> [Please do not change anything below this line]
> -----------------------------------------------------------------
> ---
> Flags:
>     category=core
>     severity=medium
> ---
> Site configuration information for perl v5.8.8:
>
> Configured by jgmyers at Tue Feb 13 10:14:49 PST 2007.

Discussion in this RT petered out five years ago. Is there anyone
familiar with UTF-8 issues who could review the discussion and
recommend an action?

Thank you very much.
Jim Keenan

---
via perlbug: queue: perl5 status: open
https://rt.perl.org:443/rt3/Ticket/Display.html?id=43294
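
For anyone reviewing the ticket, here is a minimal C sketch of the kind of strict check the report is asking for: one UTF-8 sequence is accepted only if it is well-formed and does not encode a surrogate (U+D800..U+DFFF) or a noncharacter (U+FDD0..U+FDEF, or any code point ending in FFFE or FFFF). This is not code from perl's utf8.h and not the submitted patch; the function name strict_utf8_char_len and its (pointer, length) interface are hypothetical, chosen only for illustration. It decodes the code point first and then tests ranges, rather than comparing bytes as the IS_UTF8_CHAR_* macros do.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration, not perl API).
 * Returns the length in bytes (1..4) of the sequence starting at p,
 * or 0 if it is ill-formed, a surrogate, or a noncharacter.
 * len is the number of bytes available at p. */
static size_t strict_utf8_char_len(const unsigned char *p, size_t len)
{
    uint32_t cp;
    size_t need, i;

    if (len == 0)
        return 0;
    if (p[0] < 0x80)
        return 1;                       /* ASCII */

    /* Lead byte decides the sequence length; 0xC0, 0xC1 and
     * 0xF5..0xFF can never start a well-formed sequence. */
    if (p[0] >= 0xC2 && p[0] <= 0xDF)      { need = 2; cp = p[0] & 0x1F; }
    else if (p[0] >= 0xE0 && p[0] <= 0xEF) { need = 3; cp = p[0] & 0x0F; }
    else if (p[0] >= 0xF0 && p[0] <= 0xF4) { need = 4; cp = p[0] & 0x07; }
    else
        return 0;

    if (len < need)
        return 0;                       /* truncated sequence */

    /* Every continuation byte must look like 10xxxxxx. */
    for (i = 1; i < need; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }

    /* Reject overlong forms and values above U+10FFFF. */
    if (need == 3 && cp < 0x800)
        return 0;
    if (need == 4 && (cp < 0x10000 || cp > 0x10FFFF))
        return 0;

    /* Reject UTF-16 surrogates. */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;

    /* Reject noncharacters: U+FDD0..U+FDEF and U+nFFFE / U+nFFFF. */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return 0;
    if ((cp & 0xFFFE) == 0xFFFE)
        return 0;

    return need;
}
```

With this sketch, the bytes ED A0 80 (U+D800) and EF BF BF (U+FFFF) are rejected, while E2 82 AC (U+20AC) is accepted with length 3. Whether perl should reject noncharacters at this level, or only warn, is exactly the question left open in the ticket.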