
[perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale

From: karl williamson
Date: August 21, 2008 00:29
Subject: [perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale
Message ID: rt-3.6.HEAD-29762-1219271308-905.58182-75-0@perl.org
# New Ticket Created by  karl williamson 
# Please include the string:  [perl #58182]
# in the subject line of all future correspondence about this issue. 
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=58182 >


This is a bug report for perl from corporate@khwilliamson.com,
generated with the help of perlbug 1.36 running under perl 5.10.0.


-----------------------------------------------------------------
Characters in the range U+0080 through U+00FF behave inconsistently
depending on whether or not they are part of a string which also
includes a character above that range, and in some cases they behave
incorrectly even when part of such a string.  The problems I will
concentrate on in this report are those involving case.

I presume that they do work properly when a locale is set, but I haven't
tested that.

print uc("\x{e0}"), "\n"; # (a with grave accent)

yields the character unchanged instead of a capital A with grave accent
(U+00C0).  This is true whether or not the character is part of a string
which includes a character not storable in a single byte.  Similarly

print "\x{e0}" =~ /\x{c0}/i, "\n";

will print an empty line, as the match fails.

The same behavior occurs for all characters in this range that are
marked in the Unicode standard as lower case and have single letter
upper case equivalents.


The inconsistent behavior occurs mostly when upper case letters are
mapped to lower case.

print lcfirst("\x{c0}aaaaa"), "\n";

doesn't change the first character.  But

print lcfirst("\x{c0}aaaaa\x{101}"), "\n";

does change it.  There is something seriously wrong when a character
separated by an arbitrarily large distance from another one can affect
what case the latter is considered to be. Similarly,

print "\x{c0}aaaaaa" =~ /^\x{e0}/i, "\n";

will show the match failing, but

print "\x{c0}aaaaaa\x{101}" =~ /^\x{e0}/i, "\n";

will show the match succeeding.  Again, a character perhaps hundreds of
positions further along in a string can affect whether the first
character in said string matches its lower case equivalent when case is
ignored.

The same behavior occurs for all characters in this range that are
marked in the Unicode standard as upper case and have lower case
equivalents, as well as U+00DF which is lower case and has an upper case
equivalent of the string 'SS'.
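
One way to survey the whole range, rather than testing characters one at a time, is to compare lc() on a byte-stored character with lc() on an upgraded copy of it.  The following is only a sketch: the helper names code_points and lc_upgraded are made up for illustration, and the list it prints will depend on the perl version and on any pragmas in effect, so no expected output is given.  Results are shown as code points so that they don't depend on how a terminal renders bytes above 0x7F.

```perl
use strict;
use warnings;

# Format a string as a list of code points, so the result does not depend
# on how the terminal renders bytes above 0x7F.  (Hypothetical helper.)
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

# Return lc() of a copy of $str that has been upgraded to perl's internal
# wide representation.  The caller's string is untouched.  (Hypothetical
# helper.)
sub lc_upgraded {
    my ($str) = @_;
    utf8::upgrade($str);
    return lc $str;
}

# Collect the code points in U+0080..U+00FF whose lower-casing differs
# between the byte-stored and the upgraded form.
my @differs;
for my $cp (0x80 .. 0xFF) {
    my $native = chr $cp;    # stored as a single byte
    push @differs, $cp if lc($native) ne lc_upgraded($native);
}

printf "%d code point(s) differ: %s\n", scalar @differs,
    join ' ', map { sprintf 'U+%04X', $_ } @differs;

# Example of the display helper on one of the strings from this report.
print code_points(lcfirst("\x{c0}aaaaa\x{101}")), "\n";
```

An empty list would mean the two representations lower-case identically on that perl; a non-empty list names the characters affected.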

Also, the byte character classes inconsistently match characters in this
range, again depending on whether or not the character is part of a
larger string that contains a character greater than the range.  So, for
example, for a non-breaking space,

print "\xa0" =~ /^\s/, "\n";

will show that the match returns false but

print "\xa0\x{101}" =~ /^\s/, "\n";

will show that the match returns true.  But this behavior is sort-of
documented, and there is a work-around, which is to use the '\p{}'
classes instead.  Note that calling them byte character classes is
wrong; they really are 7-bit classes.
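
A minimal sketch of that work-around, assuming the Unicode property \p{Space} is an acceptable stand-in for \s in the code at hand (the two are not guaranteed to match identical sets on every perl):

```perl
use strict;
use warnings;

my $nbsp = "\xa0";    # U+00A0 NO-BREAK SPACE, stored as a single byte

# \p{Space} is a Unicode property, so the same test is applied whether or
# not the string also contains a character above U+00FF.
my $alone     = $nbsp             =~ /^\p{Space}/ ? 1 : 0;
my $with_wide = "$nbsp\x{101}"    =~ /^\p{Space}/ ? 1 : 0;

print "alone: $alone, with wide char: $with_wide\n";
```

The point of the comparison is that both matches should agree, unlike the two \s matches above.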

From reading the documentation, I presume that the inconsistent behavior
is a result of the decision to have perl not switch to wide-character
mode in storing its strings unless necessary.  I like that decision for
efficiency reasons.  But what has happened is that the code points in
the range 128 - 255 have been orphaned when they aren't part of strings
that force the switch.  Again, I presume, but haven't tested, that using
a locale causes them to work properly for that locale, but in the
absence of a locale they should be treated as Unicode code points (or
equivalently for characters in this range, as iso-8859-1).  Storing as
wide-characters is supposed to be transparent to users, but this bug
belies that and yields very inconsistent and unexpected behavior.
(This doesn't explain the lower to upper case translation bug, which is
wrong even in wide-character mode.)
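
The storage difference presumed above can be observed directly with utf8::is_utf8, which reports the internal representation flag rather than anything about the content.  This is only an illustrative sketch using the two strings from the lcfirst example; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $narrow = "\x{c0}aaaaa";           # every code point fits in one byte
my $wide   = "\x{c0}aaaaa\x{101}";    # U+0101 forces the wide representation

# utf8::is_utf8 reports the internal storage flag, not the content.
printf "narrow: %s\n", utf8::is_utf8($narrow) ? 'wide' : 'bytes';
printf "wide:   %s\n", utf8::is_utf8($wide)   ? 'wide' : 'bytes';

# Upgrading a copy changes only its representation; the two strings still
# compare equal character for character.
my $upgraded = $narrow;
utf8::upgrade($upgraded);
printf "upgraded copy: %s, equal to original: %s\n",
    utf8::is_utf8($upgraded) ? 'wide' : 'bytes',
    $upgraded eq $narrow     ? 'yes'  : 'no';
```

Since the upgraded copy and the original are equal as strings, any difference in how uc, lc or a regex treats them comes from the representation alone.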

I am frankly astonished that this bug exists, as I have come to expect
perl to "Do the Right Thing" over the course of many years of using it.
While searching, I did find one report of something similar, but it
apparently was misunderstood and went nowhere, and it wasn't in the perl
bug database.

-----------------------------------------------------------------
---
Flags:
     category=core
     severity=high
---
Site configuration information for perl 5.10.0:

Configured by ActiveState at Wed May 14 05:06:16 PDT 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
   Platform:
     osname=linux, osvers=2.4.21-297-default, archname=i686-linux-thread-multi
     uname='linux gila 2.4.21-297-default #1 sat jul 23 07:47:39 utc 2005 i686 i686 i386 gnulinux '
     config_args='-ders -Dcc=gcc -Dusethreads -Duseithreads 
-Ud_sigsetjmp -Uinstallusrbinperl -Ulocincpth= -Uloclibpth= 
-Accflags=-DUSE_SITECUSTOMIZE -Duselargefiles 
-Accflags=-DPRIVLIB_LAST_IN_INC -Dprefix=/opt/ActivePerl-5.10 
-Dprivlib=/opt/ActivePerl-5.10/lib -Darchlib=/opt/ActivePerl-5.10/lib 
-Dsiteprefix=/opt/ActivePerl-5.10/site 
-Dsitelib=/opt/ActivePerl-5.10/site/lib 
-Dsitearch=/opt/ActivePerl-5.10/site/lib -Dsed=/bin/sed -Duseshrplib 
-Dcf_by=ActiveState -Dcf_email=support@ActiveState.com'
     hint=recommended, useposix=true, d_sigaction=define
     useithreads=define, usemultiplicity=define
     useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
     use64bitint=undef, use64bitall=undef, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS 
-DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -fno-strict-aliasing -pipe 
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
     optimize='-O2',
     cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS 
-DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -fno-strict-aliasing -pipe'
     ccversion='', gccversion='3.3.1 (SuSE Linux)', gccosandvers=''
     intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
     ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
     alignbytes=4, prototype=define
   Linker and Libraries:
     ld='gcc', ldflags =''
     libpth=/lib /usr/lib /usr/local/lib
     libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
     perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
     libc=, so=so, useshrplib=true, libperl=libperl.so
     gnulibc_version='2.3.2'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E 
-Wl,-rpath,/opt/ActivePerl-5.10/lib/CORE'
     cccdlflags='-fPIC', lddlflags='-shared -O2'

Locally applied patches:
     ACTIVEPERL_LOCAL_PATCHES_ENTRY
     33741 avoids segfaults invoking S_raise_signal() (on Linux)
     33763 Win32 process ids can have more than 16 bits
     32809 Load 'loadable object' with non-default file extension
     32728 64-bit fix for Time::Local

---
@INC for perl 5.10.0:
     /opt/ActivePerl-5.10/site/lib
     /opt/ActivePerl-5.10/lib
     .

---
Environment for perl 5.10.0:
     HOME=/home/khw
     LANG=en_US.UTF-8
     LANGUAGE (unset)
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/opt/ActivePerl-5.10/bin:/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/home/khw/cxoffice/bin
     PERL_BADLANG (unset)
     SHELL=/bin/ksh

