[perl #58430] Unicode::UCD::casefold() does not work as documented, nor prob as intended

From: karl williamson
Date: August 29, 2008 05:28
Subject: [perl #58430] Unicode::UCD::casefold() does not work as documented, nor prob as intended
Message ID: rt-3.6.HEAD-29762-1219946564-598.58430-75-0@perl.org
# New Ticket Created by  karl williamson 
# Please include the string:  [perl #58430]
# in the subject line of all future correspondence about this issue. 
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=58430 >



This is a bug report for perl from corporate@khwilliamson.com,
generated with the help of perlbug 1.35 running under perl v5.8.8.


-----------------------------------------------------------------
The documentation claims that the casefold function returns an 'I' for
the special case of the dotless lowercase i mapping, but there is no
code in the function that does that.  casefold() uses the file
CaseFolding.txt from Unicode.  The function looks for an 'I' in the
appropriate column of that file, but the file uses a 'T', not an 'I',
for this purpose, so it will never be found.
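
To make the mismatch concrete, here is a minimal sketch.  It does not
use the code in Unicode::UCD; it just splits a few lines in the
CaseFolding.txt format (code; status; mapping; # name) and counts the
status letters.  The four sample lines are the entries for U+0049 and
U+0130 as I understand them to appear in the file; the variable names
are mine.

```perl
use strict;
use warnings;

# A few lines in the format of Unicode's CaseFolding.txt:
#   <code>; <status>; <mapping>; # <name>
my @sample = (
    '0049; C; 0069; # LATIN CAPITAL LETTER I',
    '0049; T; 0131; # LATIN CAPITAL LETTER I',
    '0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE',
    '0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE',
);

# Count how often each status letter occurs.
my %seen_status;
for my $line (@sample) {
    my ($code, $status, $mapping) = split /\s*;\s*/, $line;
    $seen_status{$status}++;
}

my @statuses = sort keys %seen_status;
print "status codes present: @statuses\n";
print "entries with status I: ", ($seen_status{I} || 0), "\n";
```

A test for status 'I' can therefore never succeed on these entries; the
dotless-i lines carry 'T'.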

But it is a good thing that this bug exists; otherwise the function
would generally return the wrong thing for the folding of an upper case
'I'.  The problem is that the file contains multiple entries for several
characters.  This is nowhere indicated in the function's documentation,
and I'm not sure that the programmer realized it.  It is not clear to me
what the proper behavior should be, except that the current behavior
isn't correct.  Perhaps it should return hashes like casespec() does, to
allow the caller to choose which folding to do, or take a second
parameter to indicate which type to return.
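
As a rough illustration of those two options, the sketch below combines
them in one helper.  Everything in it is hypothetical: foldings_for()
and %FOLDINGS are names I made up, not part of Unicode::UCD, and the
table holds only two hand-copied code points.  Called with one
argument it returns a hash of all foldings keyed by status letter;
called with a second argument it returns just that type.

```perl
use strict;
use warnings;

# Hypothetical table: every folding for a code point, keyed by the
# status letter used in CaseFolding.txt.
my %FOLDINGS = (
    '0049' => { C => '0069',      T => '0131' },
    '0130' => { F => '0069 0307', T => '0069' },
);

# Hypothetical helper, not part of Unicode::UCD.  With no second
# argument it returns a copy of all foldings so the caller can choose;
# with a status letter it returns only that mapping (undef if none).
sub foldings_for {
    my ($code, $status) = @_;
    my $entry = $FOLDINGS{$code} or return;
    return defined $status ? $entry->{$status} : +{ %$entry };
}

my $all    = foldings_for('0049');
my $turkic = foldings_for('0049', 'T');
my $common = foldings_for('0049', 'C');

print join(', ', map { "$_=$all->{$_}" } sort keys %$all), "\n";
print "T: $turkic, C: $common\n";
```

Either form leaves the choice between the common and the Turkish
folding to the caller instead of to the order of lines in the file.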

For example, a capital I has two entries: the first for the normal case,
where 'I' maps to 'i', and the second for where it maps to a dotless i.
The function populates a hash, and whichever entry comes last in the
file overwrites any earlier hash value.  Thus if the function were
written to look for the T (instead of the non-existent I), the very
special case of Turkish would override the more likely case in any of a
number of other languages.

There are a number of other cases in the file where there are different
mappings for the same character, and the function will always use just
one mapping, the last one found in the file.
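
The overwriting is easy to reproduce outside the module.  This is a
sketch under the assumption that the loader accepts every status
letter, including 'T'; it is not the actual casefold() code, and
%last_wins and %by_status are illustrative names.  The first hash has
one slot per code point, as described above; the second keeps one slot
per status.

```perl
use strict;
use warnings;

# Two entries for the same code point, in CaseFolding.txt format.
my @sample = (
    '0049; C; 0069; # LATIN CAPITAL LETTER I',
    '0049; T; 0131; # LATIN CAPITAL LETTER I',
);

my %last_wins;    # one slot per code point: later entries overwrite
my %by_status;    # one slot per (code point, status): nothing is lost

for my $line (@sample) {
    my ($code, $status, $mapping) = split /\s*;\s*/, $line;
    $last_wins{$code} = { status => $status, mapping => $mapping };
    $by_status{$code}{$status} = $mapping;
}

my $kept = $last_wins{'0049'};
print "single slot keeps: $kept->{status} -> $kept->{mapping}\n";
for my $status (sort keys %{ $by_status{'0049'} }) {
    print "per-status keeps:  $status -> $by_status{'0049'}{$status}\n";
}
```

With a single slot, only the Turkish entry (0131) survives for U+0049;
keyed by status, both the C and the T mappings remain available.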

This is contrary to what the documentation implies, and I doubt that it
is an adequate interface to the database.
-----------------------------------------------------------------
---
Flags:
     category=library
     severity=medium
---
Site configuration information for perl v5.8.8:

Configured by Debian Project at Tue Nov 27 10:56:10 GMT 2007.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
   Platform:
     osname=linux, osvers=2.6.15.7, archname=i486-linux-gnu-thread-multi
     uname='linux palmer 2.6.15.7 #1 smp thu sep 7 19:42:20 utc 2006 
i686 gnulinux '
     config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN 
-Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr 
-Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 
-Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 
-Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local 
-Dsitelib=/usr/local/share/perl/5.8.8 
-Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 
-Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 
-Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl 
-Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio 
-Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
     hint=recommended, useposix=true, d_sigaction=define
     usethreads=define use5005threads=undef useithreads=define 
usemultiplicity=define
     useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
     use64bitint=undef use64bitall=undef uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS 
-DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include 
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
     optimize='-O2',
     cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN 
-fno-strict-aliasing -pipe -I/usr/local/include'
     ccversion='', gccversion='4.2.3 20071123 (prerelease) (Ubuntu 
4.2.2-3ubuntu4)', gccosandvers=''
     intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
     ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', 
lseeksize=8
     alignbytes=4, prototype=define
   Linker and Libraries:
     ld='cc', ldflags =' -L/usr/local/lib'
     libpth=/usr/local/lib /lib /usr/lib
     libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
     perllibs=-ldl -lm -lpthread -lc -lcrypt
     libc=/lib/libc-2.6.1.so, so=so, useshrplib=true, 
libperl=libperl.so.5.8.8
     gnulibc_version='2.6.1'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
     cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:


---
@INC for perl v5.8.8:
     /etc/perl
     /usr/local/lib/perl/5.8.8
     /usr/local/share/perl/5.8.8
     /usr/lib/perl5
     /usr/share/perl5
     /usr/lib/perl/5.8
     /usr/share/perl/5.8
     /usr/local/lib/site_perl
     .

---
Environment for perl v5.8.8:
     HOME=/home/khw
     LANG=en_US.UTF-8
     LANGUAGE (unset)
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/home/khw/cxoffice/bin
     PERL_BADLANG (unset)
     SHELL=/bin/ksh



