develooper Front page | perl.perl5.porters | Postings from December 2004

real UTF-8 vs. utf8n_to_uvuni()

From:
Dan Kogai
Date:
December 4, 2004 18:59
Subject:
real UTF-8 vs. utf8n_to_uvuni()
Message ID:
9966DB32-4669-11D9-BBC7-000A95DBB50A@dan.co.jp
On Dec 05, 2004, at 10:56, Dan Kogai wrote:
> Thanks, applied in my repository.  New tests and documentation fix in 
> progress.  When I am done w/ that, I will release Encode-2.0901 on my 
> web (not CPAN yet).  When cross-checks by porters are done I will 
> release Encode-2.10.
>
> Dan the Encode Maintainer

While writing the test suite, I found that some of the strict checks 
are missing.

Surrogate -- OK
% perl -Mblib -MEncode -le '$a="\x{d801}"; print encode("UTF-8", $a, 1)'
"\x{d801}" does not map to utf8 at 
/gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

U+FFFF -- OK
% perl -Mblib -MEncode -le '$a="\x{ffff}"; print encode("UTF-8", $a, 1)'
"\x{ffff}" does not map to utf8 at 
/gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

Chars above U+10FFFF -- NOT OK
% perl -Mblib -MEncode -le '$a="\x{11ffff}"; print encode("UTF-8", $a, 1)'
????

Since Gisle's patch makes use of utf8n_to_uvuni(), this seems to be a 
problem in the perl core.  So I checked utf8.c, which defines it.  
It does not appear to make use of PERL_UNICODE_MAX.

The patch against utf8.c at the end of this message fixes that.
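
To illustrate the kind of upper-bound check that is missing, here is a minimal standalone sketch in C.  It is not the code in utf8.c: the names SKETCH_UNICODE_MAX and sketch_decode_utf8 are hypothetical, the constant is only assumed to match PERL_UNICODE_MAX (0x10FFFF), and the sketch handles sequences of up to 4 bytes and does not reject surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: same value as perl's PERL_UNICODE_MAX; the name is hypothetical. */
#define SKETCH_UNICODE_MAX 0x10FFFFUL

/*
 * Decode one UTF-8 sequence of up to 4 bytes from s (len bytes available).
 * Returns 0 and stores the code point in *out on success.
 * Returns -1 if the sequence is truncated, malformed, overlong,
 * or decodes to a value above SKETCH_UNICODE_MAX.
 */
int sketch_decode_utf8(const unsigned char *s, size_t len, uint32_t *out)
{
    uint32_t uv, min;
    size_t need, i;

    if (len == 0)
        return -1;

    if (s[0] < 0x80) {                      /* single byte, ASCII */
        *out = s[0];
        return 0;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx: 2-byte sequence */
        uv = s[0] & 0x1F; need = 1; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx: 3-byte sequence */
        uv = s[0] & 0x0F; need = 2; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx: 4-byte sequence */
        uv = s[0] & 0x07; need = 3; min = 0x10000;
    } else {
        return -1;                          /* stray continuation or longer lead byte */
    }

    if (len < need + 1)
        return -1;                          /* truncated input */

    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;                      /* not a continuation byte */
        uv = (uv << 6) | (s[i] & 0x3F);     /* accumulate 6 payload bits */
    }

    if (uv < min)
        return -1;                          /* overlong encoding */
    if (uv > SKETCH_UNICODE_MAX)
        return -1;                          /* above U+10FFFF: the check discussed here */

    *out = uv;
    return 0;
}
```

With this sketch, the bytes F4 8F BF BF decode to U+10FFFF, while F4 9F BF BF (which would be 0x11FFFF) are rejected.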

 > ~/danperl/bin/perl5.8.6 -Mblib -MEncode -le '$a="\x{11FFFF}"; print encode("UTF-8", $a, 1)'
"\x{00f4}" does not map to utf8 at 
/gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

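As you can see, the warning is still odd: it reports "\x{00f4}", which 
is the lead byte of the encoded sequence, rather than the code point 
itself.  The same happens in any case that triggers UTF8_WARN_LONG, 
as follows: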

 > perl -Mblib -MEncode -le '$a="\x{7fff_ffff}"; print encode("UTF-8", $a, 1)'
??????
 > perl -Mblib -MEncode -le '$a="\x{8000_0000}"; print encode("UTF-8", $a, 1)'
"\x{00fe}" does not map to utf8 at 
/gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.
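
One plausible reading of the "\x{00f4}" in the earlier message is that it is the first byte of the encoded sequence rather than the code point.  The following sketch in C computes that lead byte under the original 1 to 6 byte UTF-8 layout (code points up to 0x7FFFFFFF).  The function name is hypothetical and this is not how Encode or utf8.c build the message; it only shows the arithmetic.

```c
#include <stdint.h>

/*
 * Hypothetical helper: lead byte of the UTF-8 style encoding of cp,
 * using the original 1..6 byte layout.
 * Returns -1 for code points above 0x7FFFFFFF, which this layout cannot hold.
 */
int sketch_utf8_lead_byte(uint32_t cp)
{
    if (cp < 0x80)         return (int)cp;                  /* 0xxxxxxx */
    if (cp < 0x800)        return (int)(0xC0 | (cp >> 6));  /* 110xxxxx */
    if (cp < 0x10000)      return (int)(0xE0 | (cp >> 12)); /* 1110xxxx */
    if (cp < 0x200000)     return (int)(0xF0 | (cp >> 18)); /* 11110xxx */
    if (cp < 0x4000000)    return (int)(0xF8 | (cp >> 24)); /* 111110xx */
    if (cp < 0x80000000UL) return (int)(0xFC | (cp >> 30)); /* 1111110x */
    return -1;                                              /* needs a longer form */
}
```

For 0x11FFFF this gives 0xF4, matching the "\x{00f4}" in the message.  0x7FFFFFFF still fits in 6 bytes, consistent with the six "?" characters printed above, while 0x80000000 does not fit this layout; the "\x{00fe}" would be consistent with perl using a longer form with lead byte 0xFE, though that is an inference from the output, not something the sketch models.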

I tracked this down and found that the warning is generated by Encode, 
so Gisle and I can fix it.

Dan the Encode Maintainer

--- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
+++ perl-5.8.x.dan/utf8.c       Sun Dec  5 11:38:52 2004
@@ -429,6 +429,13 @@
         }
         else
             uv = UTF8_ACCUMULATE(uv, *s);
+       /* Checks if ord() > 0x10FFFF -- dankogai */
+       if (uv > PERL_UNICODE_MAX){
+           if (!(flags & UTF8_ALLOW_LONG)) {
+               warning = UTF8_WARN_LONG;
+               goto malformed;
+           }
+       }
         if (!(uv > ouv)) {
             /* These cannot be allowed. */
             if (uv == ouv) {

