[perl #43294] utf8::valid considers illegal characters to be valid

From: James E Keenan via RT
Date: September 18, 2013 23:32
Subject: [perl #43294] utf8::valid considers illegal characters to be valid
Message ID: rt-3.6.HEAD-1873-1379547128-807.43294-15-0@perl.org

On Fri Jun 22 14:31:51 2007, jgmyers wrote:
> 
> This is a bug report for perl from jgmyers@pong.us.proofpoint.com,
> generated with the help of perlbug 1.35 running under perl v5.8.8.
> 
> 
> -----------------------------------------------------------------
> [Please enter your report here]
> 
> This bug is similar to bug #38722.  utf8::valid() and utf8::decode()
> incorrectly consider illegal characters and surrogates as being valid.
> A script that relies on these functions to validate untrusted input
> can therefore pass along invalid Unicode strings, which later trigger
> warnings from Perl_uvuni_to_utf8_flags.
> 
> The following patch tightens up the validity checks to exclude such
> illegal and ill-formed characters.  Applying it causes a couple of
> perl's harness tests to fail as those tests incorrectly expect to be
> able to process surrogates and illegal characters.
> 
> This also brings up the separate issue that the "chr" function should
> probably throw a warning when asked to create a character that
> Perl_uvuni_to_utf8_flags would warn about.
> 
> --- perl-5.8.8-attrib/utf8.h    2006-06-26 15:34:05.000000000 -0700
> +++ perl-5.8.8-utf8valid/utf8.h 2007-06-22 14:18:26.000000000 -0700
> @@ -276,15 +276,13 @@
>          (p)[2] >= 0x80 && (p)[2] <= 0xBF)
>  #define IS_UTF8_CHAR_3c(p)     \
>         ((p)[0] == 0xED && \
> -        (p)[1] >= 0x80 && (p)[1] <= 0xBF && \
> -        (p)[2] >= 0x80 && (p)[2] <= 0xBF)
> -/* In IS_UTF8_CHAR_3c(p) one could use
> - * (p)[1] >= 0x80 && (p)[1] <= 0x9F
> - * if one wanted to exclude surrogates. */
> +        (p)[1] >= 0x80 && (p)[1] <= 0x9F)
>  #define IS_UTF8_CHAR_3d(p)     \
>         ((p)[0] >= 0xEE && (p)[0] <= 0xEF && \
>          (p)[1] >= 0x80 && (p)[1] <= 0xBF && \
> -        (p)[2] >= 0x80 && (p)[2] <= 0xBF)
> +        (p)[2] >= 0x80 && (p)[2] <= 0xBF && \
> +        ((p)[0] != 0xEF || (((p)[1] != 0xBF || (p)[2] <= 0xBD) && \
> +                            ((p)[1] != 0xB7 || (p)[2] <= 0x8F || (p)[2] >= 0xB0))))
>  #define IS_UTF8_CHAR_4a(p)     \
>         ((p)[0] == 0xF0 && \
>          (p)[1] >= 0x90 && (p)[1] <= 0xBF && \
> @@ -315,9 +313,9 @@
>          IS_UTF8_CHAR_3c(p) || \
>          IS_UTF8_CHAR_3d(p))
>  #define IS_UTF8_CHAR_4(p)      \
> -       (IS_UTF8_CHAR_4a(p) || \
> -        IS_UTF8_CHAR_4b(p) || \
> -        IS_UTF8_CHAR_4c(p))
> +       ((IS_UTF8_CHAR_4a(p) || \
> +         IS_UTF8_CHAR_4b(p) || \
> +         IS_UTF8_CHAR_4c(p)) && ((p)[2] != 0xBF || (p)[3] <= 0xBD || ((p)[1] & 0xf) != 0xf))
> 
>  /* IS_UTF8_CHAR(p) is strictly speaking wrong (not UTF-8) because it
>   * (1) allows UTF-8 encoded UTF-16 surrogates
> 
> 
> [Please do not change anything below this line]
> -----------------------------------------------------------------
> ---
> Flags:
>     category=core
>     severity=medium
> ---
> Site configuration information for perl v5.8.8:
> 
> Configured by jgmyers at Tue Feb 13 10:14:49 PST 2007.
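
For anyone reviewing the quoted patch, the sketch below restates its added byte-range tests as standalone C functions and decodes the affected sequences, so the excluded code points are easy to see. It is an illustration only, not code from perl: the helper names (is_surrogate, is_noncharacter, decode3, decode4, rejected_by_3d_patch, rejected_by_4_patch) are hypothetical, and the decoders assume the input already has a well-formed lead byte and continuation bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not in perl): true for UTF-16 surrogates. */
static bool is_surrogate(uint32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Hypothetical helper: true for Unicode noncharacters, i.e.
 * U+FDD0..U+FDEF and the last two code points of every plane. */
static bool is_noncharacter(uint32_t cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Decode a 3-byte sequence; assumes a valid 0xE0..0xEF lead byte. */
static uint32_t decode3(const unsigned char *p) {
    assert((p[0] & 0xF0) == 0xE0);
    return ((uint32_t)(p[0] & 0x0F) << 12) |
           ((uint32_t)(p[1] & 0x3F) << 6) |
            (uint32_t)(p[2] & 0x3F);
}

/* Decode a 4-byte sequence; assumes a valid 0xF0..0xF7 lead byte. */
static uint32_t decode4(const unsigned char *p) {
    assert((p[0] & 0xF8) == 0xF0);
    return ((uint32_t)(p[0] & 0x07) << 18) |
           ((uint32_t)(p[1] & 0x3F) << 12) |
           ((uint32_t)(p[2] & 0x3F) << 6) |
            (uint32_t)(p[3] & 0x3F);
}

/* Negation of the condition the patch adds to IS_UTF8_CHAR_3d:
 * EF BF BE..BF (U+FFFE, U+FFFF) and EF B7 90..AF (U+FDD0..U+FDEF). */
static bool rejected_by_3d_patch(const unsigned char *p) {
    return p[0] == 0xEF &&
           ((p[1] == 0xBF && p[2] > 0xBD) ||
            (p[1] == 0xB7 && p[2] >= 0x90 && p[2] <= 0xAF));
}

/* Negation of the condition the patch adds to IS_UTF8_CHAR_4:
 * sequences whose code point ends in FFFE or FFFF. */
static bool rejected_by_4_patch(const unsigned char *p) {
    return p[2] == 0xBF && p[3] > 0xBD && (p[1] & 0x0F) == 0x0F;
}
```

For example, EF B7 90 decodes to U+FDD0 and is rejected, while EF BF BD (U+FFFD) still passes. The surrogate range is handled separately in the patch by limiting the second byte after 0xED to 0x80..0x9F, which caps that branch at U+D7FF.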


Discussion in this RT petered out five years ago.  Is there anyone
familiar with UTF-8 issues who could review the discussion and recommend
an action?

Thank you very much.
Jim Keenan


---
via perlbug:  queue: perl5 status: open
https://rt.perl.org:443/rt3/Ticket/Display.html?id=43294
