
[perl #51936] Inconsistent handling of characters with value > 0x7FFF_FFFF and other issues

From:
Chris Hall
Date:
March 20, 2008 09:01
Subject:
[perl #51936] Inconsistent handling of characters with value > 0x7FFF_FFFF and other issues
Message ID:
rt-3.6.HEAD-25460-1206022411-1363.51936-75-0@perl.org
# New Ticket Created by  Chris Hall 
# Please include the string:  [perl #51936]
# in the subject line of all future correspondence about this issue. 
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=51936 >


This is a bug report for perl from chris.hall@highwayman.com,
generated with the help of perlbug 1.36 running under perl 5.10.0.


-----------------------------------------------------------------
[Please enter your report here]

Amongst the issues:

   * Character values > 0x7FFF_FFFF are not consistently handled.

     IMO: the handling is so broken that it would be much better
          to draw the line at 0x7FFF_FFFF.

   * chr and pack respond differently to large and out of range
     values.

   * pack can generate strings that unpack will not process.

   * warnings about 'illegal' non-characters are arguably spurious.
     Certainly there are many cases which are more illegal where
     no warnings are issued.

     Treating 0xFFFF_FFFF as a non-character is interesting.

   * IMO: chr(-1) is complete nonsense and should return undef, not
          "a character I cannot handle" == U+FFFD.

Perl strings containing characters >0x7FFF_FFFF use a non-standard
extension to UTF-8.  Strictly speaking, UTF-8 stops at U+10FFFF.
However, sequences up to 0x7FFF_FFFF are well defined.

Bits of Perl are happier with these non-standard sequences than
others.
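
To make the encoding concrete, here is a minimal sketch of a hand-rolled encoder. It is purely arithmetic and never builds a Perl character string, so it sidesteps every warning discussed in this report. The names ext_utf8_bytes, continuation_bytes and hex_bytes are invented for illustration, and the 7-byte form (lead byte 0xFE plus six continuation bytes) is an assumption inferred from the "byte 0xfe" warnings in the output shown later in this report. It only covers values up to 36 bits.

```perl
use strict;
use warnings;

# Continuation bytes: the low 6*$n bits of $cp, six bits per byte.
sub continuation_bytes {
    my ($cp, $n) = @_;
    my @bytes;
    for my $i (reverse 0 .. $n - 1) {
        push @bytes, 0x80 | (($cp >> (6 * $i)) & 0x3F);
    }
    return @bytes;
}

# Encode one code point as a list of byte values.
sub ext_utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;

    # [ largest value, lead-byte marker, number of continuation bytes ]
    my @forms = (
        [ 0x7FF,       0xC0, 1 ],
        [ 0xFFFF,      0xE0, 2 ],
        [ 0x1F_FFFF,   0xF0, 3 ],
        [ 0x3FF_FFFF,  0xF8, 4 ],
        [ 0x7FFF_FFFF, 0xFC, 5 ],   # last form that is well defined
    );
    for my $form (@forms) {
        my ($max, $lead, $n) = @$form;
        return ($lead | ($cp >> (6 * $n)), continuation_bytes($cp, $n))
            if $cp <= $max;
    }

    # Assumed non-standard extension: 0xFE, then six continuation bytes.
    return (0xFE, continuation_bytes($cp, 6));
}

sub hex_bytes { return join ' ', map { sprintf '%02X', $_ } @_ }

for my $cp (0xE0, 0x7FFF_FFFF, 0x8000_0000, 0xFFFF_FFFD) {
    printf "0x%-8X => %s\n", $cp, hex_bytes(ext_utf8_bytes($cp));
}
```

For 0xFFFF_FFFD this gives FE 83 BF BF BF BF BD, and for 0x8000_0000 it gives FE 82 80 80 80 80 80, which are the byte values named in the unpack warnings.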

Consider:

    1: use strict ;
    2: use warnings ;
    3:
    4: warn "__Runtime__" ;
    5:
    6: my $q = chr(0x7FFF_FFFF).chr(0xE0).chr(0x8000_0000).chr(0xFFFF_FFFD) ;
    7: my $v = utf8::valid($q) ? 'Valid' : 'Invalid' ;
    8: my $l = length($q) ;
    9: my $r = $q.$q ;
   10: $q =~ s/\x{E0}/ / ;
   11: $q =~ s/\x{7FFF_FFFF}/Hello/ ;
   12: $q =~ s/\x{8000_0000}/World/ ;
   13: $q =~ s/\x{FFFF_FFFD}/ !/ ;
   14: print "$v($l): '$q'\n" ;
   15:
   16: $r = substr($r, 3, 4) ;
   17: print "\$r=", hx(sc($r)), "\n" ;
   18: my @w = unpack('U*', $r) ;
   19: print "\@w=", hx(@w), "\n" ;
   20:
   21: $r = pack('U*', sc($r), 0x1_1234_5678) ;
   22: print "\$r=", hx(sc($r)), "\n" ;
   23: @w = unpack('U*', $r) ;
   24: print "\@w=", hx(@w), "\n" ;
   25:
   26: sub sc { map ord, split(//, $_[0]) ; } ;
   27: sub hx { map sprintf('\\x{%X}', $_), @_ ; } ;

which generates:

    A: Unicode character 0x7fffffff is illegal at tb.pl line 11.
    B: Malformed UTF-8 character (byte 0xfe) at tb.pl line 12.
    C: Malformed UTF-8 character (byte 0xfe) at tb.pl line 13.
    D: Integer overflow in hexadecimal number at tb.pl line 21.
    E: Hexadecimal number > 0xffffffff non-portable at tb.pl line 21.
   --: __Runtime__ at tb.pl line 4.
    a: Unicode character 0x7fffffff is illegal at tb.pl line 6.
    b: Invalid(4): 'Hello World !'
    c: $r=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}
    d: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 18.
    e: Malformed UTF-8 character (unexpected continuation byte 0x83, with no
     : preceding start byte) in unpack at tb.pl line 18.

    ... repeated for 0xbf, 0xbf, 0xbf, 0xbf, 0xbd

    f: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 18.
    g: Malformed UTF-8 character (unexpected continuation byte 0x82, with no
     : preceding start byte) in unpack at tb.pl line 18.

    ... repeated for 0x80, 0x80, 0x80, 0x80, 0x80

    h: @w=\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{7FFFFFFF}\x{E0}
     : \x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}
    i: Unicode character 0x7fffffff is illegal at tb.pl line 21.
    j: Unicode character 0xffffffff is illegal at tb.pl line 21.
    k: $r=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}\x{FFFFFFFF}
    l: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 23.
    m: Malformed UTF-8 character (unexpected continuation byte 0x83, with no
     : preceding start byte) in unpack at tb.pl line 23.

    ... repeated for 0xbf, 0xbf, 0xbf, 0xbf, 0xbd

    n: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 23.
    o: Malformed UTF-8 character (unexpected continuation byte 0x82, with no
     : preceding start byte) in unpack at tb.pl line 23.

    ... repeated for 0x80, 0x80, 0x80, 0x80, 0x80

    p: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 23.
    q: Malformed UTF-8 character (unexpected continuation byte 0x83, with no
     : preceding start byte) in unpack at tb.pl line 23.

    ... repeated for 0xbf, 0xbf, 0xbf, 0xbf, 0xbf

    r: @w=\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{7FFFFFFF}\x{E0}
     : \x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}

NOTES:

  1. chr(n) is happy with characters > 0x7FFF_FFFF

     BUT: note the runtime warning about 0x7FFF_FFFF itself -- output line a.

     Unicode defines characters U+xxFFFE and U+xxFFFF as non-characters, for
     all xx from 0x00 to 0x10 -- the (current) Unicode range -- along with
     U+FDD0..U+FDEF.

     These characters are NOT illegal.  Unicode states:

      "Noncharacter code points are reserved for internal use, such as
       for sentinel values. They should never be interchanged. They do,
       however, have well-formed representations in Unicode encoding
       forms and survive conversions between encoding forms. This allows
       sentinel values to be preserved internally across Unicode
       encoding forms, even though they are not designed to be used in
       open interchange."

     Characters > 0x10_FFFF are not known to Unicode.

     IMO, chr(n) should not be issuing warnings about non-characters at all.

     IMO, to project non-characters beyond the Unicode range is doubly
     perverse.

     FURTHER: although characters > 0x10_FFFF are beyond Unicode, and
     characters > 0x7FFF_FFFF are beyond UTF-8, chr(n) is only warning
     about actual and invented non-characters (and surrogates).

  2. Similarly "\x{8000_0000}" and "\x{7FFF_FFFF}" -- output line A.

  3. HOWEVER: utf8::valid() considers a string containing characters
    which are > 0x7FFF_FFFF to be *invalid* -- see code lines 7 & 14 and
    output line b.

    IMO allowing for characters > 0x7FFF_FFFF in the first place is a mistake.

    But having allowed them, why flag the string as invalid ?

  4. However: length() is happy, and issues no warning.

     Either length() is accepting the non-standard encoding, or some other
     mechanism means that it's not scanning the string.

  5. Lines 12 & 13 generate warnings about malformed UTF-8, at compile time.

     However, the run-time copes with these super-large characters.

  6. substr is happy with the super-large characters -- line 16.

  7. split is happy with the super-large characters -- line 26.

  8. ord is happy with the super-large characters -- line 26.

  9. unpack 'U' throws up all over super-large characters !

     See lines 18 & 23, and output d-h and l-r.

     unpack has no idea about the non-standard encoding of characters
     greater than 0x7FFF_FFFF, and unpacks each 'invalid' byte as
     0x00.

10. pack 'U' complains about character values in much the same way as
     chr does -- output i & j.

     However, pack and chr are by no means consistent with each other,
     see below.

11. pack 'U' is generating stuff that unpack 'U' cannot cope with !

     See lines 21-24 and output k-r
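
The distinctions drawn in note 1 can be written down as a small classifier. This is only a sketch of the categories as Unicode and UTF-8 define them, not of what any Perl version actually does. The name classify is invented for illustration, and the input is assumed to be a non-negative integer.

```perl
use strict;
use warnings;

# Classify a non-negative integer by the ranges discussed in note 1.
sub classify {
    my ($cp) = @_;
    return 'beyond UTF-8'   if $cp > 0x7FFF_FFFF;
    return 'beyond Unicode' if $cp > 0x10_FFFF;
    return 'surrogate'      if $cp >= 0xD800 && $cp <= 0xDFFF;
    return 'non-character'  if ($cp & 0xFFFE) == 0xFFFE         # U+xxFFFE, U+xxFFFF
                            || ($cp >= 0xFDD0 && $cp <= 0xFDEF);
    return 'ordinary';
}

printf "0x%-8X %s\n", $_, classify($_)
    for 0xE0, 0xD800, 0xFFFF, 0x10_FFFF, 0x7FFF_FFFF, 0xFFFF_FFFF;
```

Under this reading 0x7FFF_FFFF and 0xFFFF_FFFF are simply outside Unicode; neither is a non-character, which is the point of the complaint about the "is illegal" warnings.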

___________________________________________________________

Looking further at chr and pack:

    1: use strict ;
    2: use warnings ;
    3:
    4: warn "__Runtime__" ;
    5:
    6: my $q = chr(0xD800).chr(0xFFFF).chr(0x7FFF_FFFF) ;
    7: my $v = utf8::valid($q) ? 'Valid' : 'Invalid' ;
    8: print "\$q = ", hx(sc($q)), " -- $v\n" ;
    9:
   10: my @t = (0x1_2345_6789, -1, -10, 0xD800) ;
   11: my $r = join '', map(chr, @t) ;
   12: print "\$r=", hx(sc($r)), "\n" ;
   13:
   14: my $s = pack('U*', @t) ;
   15: print "\$s=", hx(sc($s)), "\n" ;
   16:
   17: sub sc { map ord, split(//, $_[0]) ; } ;
   18: sub hx { map sprintf('\\x{%X}', $_), @_ ; } ;

On a 64-bit v5.8.8:

    A: UTF-16 surrogate 0xd800 at tb2.pl line 6.
    B: Unicode character 0xffff is illegal at tb2.pl line 6.
    C: Unicode character 0x7fffffff is illegal at tb2.pl line 6.
    D: Hexadecimal number > 0xffffffff non-portable at tb2.pl line 10.
    -- __Runtime__ at tb2.pl line 4.
    a: $q = \x{D800}\x{FFFF}\x{7FFFFFFF} -- Valid
    b: Unicode character 0xffffffffffffffff is illegal at tb2.pl line 11.
    c: UTF-16 surrogate 0xd800 at tb2.pl line 11.
    d: $r=\x{123456789}\x{FFFFFFFFFFFFFFFF}\x{FFFFFFFFFFFFFFF6}\x{D800}
    e: Unicode character 0xffffffff is illegal at tb2.pl line 14.
    f: UTF-16 surrogate 0xd800 at tb2.pl line 14.
    g: $s=\x{23456789}\x{FFFFFFFF}\x{FFFFFFF6}\x{D800}

   * chr(-1) generates a warning, not because it's complete rubbish,
     but because 0xffffffffffffffff is a non-character !!!

     chr(-3) doesn't merit a warning.

   * note that surrogates and non-characters are OK as far as utf8::valid
     is concerned -- no warnings, even.

   * pack is masking stuff to 32 bit unsigned !!

   * both chr and pack are throwing warnings about surrogates
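
The values in output line g are what a plain 32-bit mask would produce. A minimal sketch of that arithmetic follows. It shows the masking only and makes no claim about how pack is implemented. The name low32 is invented for illustration, and the 0x1_2345_6789 case is built at run time and skipped on a perl without 64-bit integers.

```perl
use strict;
use warnings;
use Config;

# Keep only the low 32 bits.  Perl's & treats a negative integer operand
# as unsigned, so -1 is all ones before the mask is applied.
sub low32 { return $_[0] & 0xFFFF_FFFF }

printf "%d -> 0x%X\n", $_, low32($_) for -1, -10, 0xD800;

# Only meaningful where integers are at least 64 bits wide.
if ($Config{ivsize} >= 8) {
    my $big = (1 << 32) + 0x2345_6789;      # 0x1_2345_6789
    printf "0x%X -> 0x%X\n", $big, low32($big);
}
```

This maps -1 to 0xFFFFFFFF, -10 to 0xFFFFFFF6 and 0x1_2345_6789 to 0x23456789, matching $s in output line g.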

On a 32-bit v5.10.0:

    A: Integer overflow in hexadecimal number at tb2.pl line 10.
    B: Hexadecimal number > 0xffffffff non-portable at tb2.pl line 10.
    -- __Runtime__ at tb2.pl line 4.
    a: UTF-16 surrogate 0xd800 at tb2.pl line 6.
    b: Unicode character 0xffff is illegal at tb2.pl line 6.
    c: Unicode character 0x7fffffff is illegal at tb2.pl line 6.
    d: $q = \x{D800}\x{FFFF}\x{7FFFFFFF} -- Valid
    e: Unicode character 0xffffffff is illegal at tb2.pl line 11.
    f: UTF-16 surrogate 0xd800 at tb2.pl line 11.
    g: $r=\x{FFFFFFFF}\x{FFFD}\x{FFFD}\x{D800}
    h: Unicode character 0xffffffff is illegal at tb2.pl line 14.
    i: Unicode character 0xffffffff is illegal at tb2.pl line 14.
    j: UTF-16 surrogate 0xd800 at tb2.pl line 14.
    k: $s=\x{FFFFFFFF}\x{FFFFFFFF}\x{FFFFFFF6}\x{D800}

   * chr is mapping -ve values to U+FFFD -- without warning.

     This is as per documentation.

     However, character 0xFFFF_FFFF merits a warning, but does NOT
     get translated to U+FFFD !!

     IMO: this is a dog's dinner.  I think:

       - non-characters and surrogates should not trouble chr
         (any more than they trouble utf8::valid)

       - values that are invalid should generate undef, not U+FFFD
         replacement characters:

          a) cannot distinguish chr(0xFFFD) and chr(-10)

          b) U+FFFD is a replacement for a character that we don't
             know -- it's not a replacement for something that
             just isn't a character in the first place !

             [-1 is a banana.  U+FFFD is an orange, which we may
              substitute for another form of orange.]

       - limiting characters to 0x7FFF_FFFF is no great loss, and
         avoids a ton of portability and non-standard-ness issues.

   * pack 'U' is NOT mapping -ve values to U+FFFD !!
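
The behaviour argued for above can be sketched as a wrapper around chr. This is a hypothetical helper, not anything Perl provides: the name strict_chr is invented, and the upper limit of 0x7FFF_FFFF is the one proposed in this report.

```perl
use strict;
use warnings;

# Hypothetical wrapper: undef for values that are not characters at all,
# otherwise whatever the built-in chr returns.
sub strict_chr {
    my ($n) = @_;
    return undef if !defined $n || $n < 0 || $n > 0x7FFF_FFFF;
    no warnings 'utf8';     # surrogates and non-characters pass silently
    return chr $n;
}

for my $n (0xE0, 0xFFFD, -1, -10, 0x8000_0000) {
    my $c = strict_chr($n);
    printf "%s -> %s\n", $n, defined $c ? sprintf('\\x{%X}', ord $c) : 'undef';
}
```

With this, chr(0xFFFD) and an invalid value such as -10 are no longer indistinguishable: the first yields a character, the second yields undef.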


[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
     category=core
     severity=medium
---
Site configuration information for perl 5.10.0:

Configured by SYSTEM at Thu Jan 10 11:00:30 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
   Platform:
     osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread
     uname=''
     config_args='undef'
     hint=recommended, useposix=true, d_sigaction=undef
     useithreads=define, usemultiplicity=define
     useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
     use64bitint=undef, use64bitall=undef, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -DPERL_MSVCRT_READFIX',
     optimize='-MD -Zi -DNDEBUG -O1',
     cppflags='-DWIN32'
     ccversion='12.00.8804', gccversion='', gccosandvers=''
     intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
     d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
     ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf  -libpath:"C:\Program Files\Perl\lib\CORE"  -machine:x86'
     libpth=\lib
     libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib msvcrt.lib
     perllibs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib msvcrt.lib
     libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl510.lib
     gnulibc_version=''
   Dynamic Linking:
     dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
     cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf  -libpath:"C:\Program Files\Perl\lib\CORE"  -machine:x86'

Locally applied patches:
     ACTIVEPERL_LOCAL_PATCHES_ENTRY
     32809 Load 'loadable object' with non-default file extension
     32728 64-bit fix for Time::Local

---
@INC for perl 5.10.0:
     d:\gmch_root\gmch perl lib
     d:\gmch_root\gmch perl lib\windows
     C:/Program Files/Perl/site/lib
     C:/Program Files/Perl/lib
     .

---
Environment for perl 5.10.0:
     HOME (unset)
     LANG (unset)
     LANGUAGE (unset)
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=C:\Program Files\Perl\site\bin;C:\Program Files\Perl\bin;C:\PROGRAM FILES\_BATCH;C:\PROGRAM FILES\_BIN;C:\PROGRAM FILES\ARM\BIN\WIN_32-PENTIUM;C:\PROGRAM FILES\PERL\BIN\;C:\WINDOWS\SYSTEM32;C:\WINDOWS;C:\WINDOWS\SYSTEM32\WBEM;C:\PROGRAM FILES\ATI TECHNOLOGIES\ATI CONTROL PANEL;C:\PROGRAM FILES\MICROSOFT SQL SERVER\80\TOOLS\BINN\;C:\PROGRAM FILES\ARM\UTILITIES\FLEXLM\10.8.0\12\WIN_32-PENTIUM;C:\PROGRAM FILES\ARM\RVCT\PROGRAMS\3.0\441\EVAL2-SC\WIN_32-PENTIUM;C:\PROGRAM FILES\ARM\RVD\CORE\3.0\675\EVAL2-SC\WIN_32-PENTIUM\BIN;C:\PROGRAM FILES\SUPPORT TOOLS\;C:\Program Files\QuickTime\QTSystem\
     PERLLIB=d:\gmch_root\gmch perl lib;d:\gmch_root\gmch perl lib\windows
     PERL_BADLANG (unset)
     SHELL (unset)
-- 
Chris Hall               highwayman.com
