
[perl #51710] utf8::valid rejects characters in \x14_FFFF - \x1F_FFFF

From: Steve Peters via RT
Date: March 31, 2008 07:56
Subject: [perl #51710] utf8::valid rejects characters in \x14_FFFF - \x1F_FFFF
Message ID: rt-3.6.HEAD-717-1206975351-1453.51710-15-0@perl.org

On Thu Mar 13 14:18:31 2008, chris_hall wrote:
> 
> This is a bug report for perl from chris.hall@highwayman.com,
> generated with the help of perlbug 1.35 running under perl v5.8.8.
> 
> 
> -----------------------------------------------------------------
> 
> It appears that utf8::valid() disagrees with Encode::encode('utf8', ...)
> for characters \x{14_0000} - \x{1F_FFFF}.
> 
> I suggest utf8::valid() is broken.
> 
> The following:
> 
>    use strict ;
> 
>    use Encode qw(FB_QUIET LEAVE_SRC) ;
> 
>    printf "Perl v%vd & Encode %s\n", $^V, $Encode::VERSION ;
> 
>    # Test characters: 0x0000_FFFF, 0x0001_0000, 0x0001_FFFF, 0x0002_0000,
>    #                  0x0002_FFFF, ...., 0x7FFF_FFFF.
> 
>    my $c = 0xFFFF ;
>    while ($c <= 0x7FFF_FFFF) {
>      my $s = chr($c) ;
> 
>      my $v = utf8::valid($s) ? 1 : 0 ;
>      my $o = Encode::encode('utf8', $s, FB_QUIET() | LEAVE_SRC()) ;
> 
>      my $r = $o ? 1 : 0 ;
> 
>      if ($v != $r) {
>        printf "0x%04X_%04X: utf8::valid=%d but Encode::encode=%d  ",
>                                      ($c >> 16), $c & 0xFFFF, $v, $r ;
>        Encode::_utf8_off($s) ;
>        print map { sprintf '\x%02X', ord($_) } split(//, $s) ;
>        print "\n" ;
>      } ;
> 
>      if ($c & 0xFFFF) { $c += 1 ; } else { $c += 0xFFFF ; } ;
>    } ;
> 
> Produces:
> 
>    Perl v5.8.8 & Encode 2.23
>    0x0014_0000: utf8::valid=0 but Encode::encode=1  \xF5\x80\x80\x80
>    0x0014_FFFF: utf8::valid=0 but Encode::encode=1  \xF5\x8F\xBF\xBF
>    0x0015_0000: utf8::valid=0 but Encode::encode=1  \xF5\x90\x80\x80
>    0x0015_FFFF: utf8::valid=0 but Encode::encode=1  \xF5\x9F\xBF\xBF
>    0x0016_0000: utf8::valid=0 but Encode::encode=1  \xF5\xA0\x80\x80
>    0x0016_FFFF: utf8::valid=0 but Encode::encode=1  \xF5\xAF\xBF\xBF
>    0x0017_0000: utf8::valid=0 but Encode::encode=1  \xF5\xB0\x80\x80
>    0x0017_FFFF: utf8::valid=0 but Encode::encode=1  \xF5\xBF\xBF\xBF
>    0x0018_0000: utf8::valid=0 but Encode::encode=1  \xF6\x80\x80\x80
>    0x0018_FFFF: utf8::valid=0 but Encode::encode=1  \xF6\x8F\xBF\xBF
>    0x0019_0000: utf8::valid=0 but Encode::encode=1  \xF6\x90\x80\x80
>    0x0019_FFFF: utf8::valid=0 but Encode::encode=1  \xF6\x9F\xBF\xBF
>    0x001A_0000: utf8::valid=0 but Encode::encode=1  \xF6\xA0\x80\x80
>    0x001A_FFFF: utf8::valid=0 but Encode::encode=1  \xF6\xAF\xBF\xBF
>    0x001B_0000: utf8::valid=0 but Encode::encode=1  \xF6\xB0\x80\x80
>    0x001B_FFFF: utf8::valid=0 but Encode::encode=1  \xF6\xBF\xBF\xBF
>    0x001C_0000: utf8::valid=0 but Encode::encode=1  \xF7\x80\x80\x80
>    0x001C_FFFF: utf8::valid=0 but Encode::encode=1  \xF7\x8F\xBF\xBF
>    0x001D_0000: utf8::valid=0 but Encode::encode=1  \xF7\x90\x80\x80
>    0x001D_FFFF: utf8::valid=0 but Encode::encode=1  \xF7\x9F\xBF\xBF
>    0x001E_0000: utf8::valid=0 but Encode::encode=1  \xF7\xA0\x80\x80
>    0x001E_FFFF: utf8::valid=0 but Encode::encode=1  \xF7\xAF\xBF\xBF
>    0x001F_0000: utf8::valid=0 but Encode::encode=1  \xF7\xB0\x80\x80
>    0x001F_FFFF: utf8::valid=0 but Encode::encode=1  \xF7\xBF\xBF\xBF
> 
> And the same for: Perl v5.10.0 & Encode 2.23
> 
> Chris
> 

I'll check to see if the patch included in RT #43294 fixes both problems.

Thanks for the report.

Steve
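
For reference, the byte sequences in the quoted output follow from the standard four-byte UTF-8 bit pattern (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). The sketch below reproduces a few of them by hand. The helper name four_byte_form is hypothetical and is not part of the report or of any Perl module; it only illustrates the arithmetic. Code points above 0x10_FFFF, the Unicode upper limit, need lead bytes \xF5 - \xF7, which may be why a stricter check rejects them, though nothing in this thread confirms the cause.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not from the report):
# encode a code point in 0x1_0000 .. 0x1F_FFFF using the four-byte UTF-8
# pattern and return the bytes as literal '\xNN' text, like the report does.
sub four_byte_form {
    my ($cp) = @_;
    die sprintf("0x%X is outside the four-byte range\n", $cp)
        if $cp < 0x1_0000 || $cp > 0x1F_FFFF;

    my @bytes = (
        0xF0 | ($cp >> 18),             # lead byte: top 3 bits
        0x80 | (($cp >> 12) & 0x3F),    # continuation: next 6 bits
        0x80 | (($cp >> 6)  & 0x3F),    # continuation: next 6 bits
        0x80 | ($cp & 0x3F),            # continuation: low 6 bits
    );
    return join '', map { sprintf '\x%02X', $_ } @bytes;
}

# Last Unicode code point, then the first and last code points in the report.
for my $cp (0x10_FFFF, 0x14_0000, 0x1F_FFFF) {
    printf "0x%04X_%04X: %s\n", $cp >> 16, $cp & 0xFFFF, four_byte_form($cp);
}
```

This prints \xF4\x8F\xBF\xBF for 0x0010_FFFF, \xF5\x80\x80\x80 for 0x0014_0000 and \xF7\xBF\xBF\xBF for 0x001F_FFFF; the last two match the first and last lines of the quoted output.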
