
[perl #32971] UTF8 characters that exists in latin1 breaks regexps

From:
Mikael Wahlberg
Date:
December 9, 2004 01:20
Subject:
[perl #32971] UTF8 characters that exists in latin1 breaks regexps
Message ID:
rt-3.0.11-32971-102462.11.6650980910689@perl.org
# New Ticket Created by  Mikael Wahlberg 
# Please include the string:  [perl #32971]
# in the subject line of all future correspondence about this issue. 
# <URL: http://rt.perl.org:80/rt3/Ticket/Display.html?id=32971 >


This is a bug report for perl from mikael@ardendo.se,
generated with the help of perlbug 1.35 running under perl v5.8.4.


-----------------------------------------------------------------
[Please enter your report here]

With the 'use encoding utf8' pragma in effect, any regexp containing a
UTF-8 encoded character that also exists in Latin1 breaks the regexp engine.
Some examples:

mikael@crydee:~$ perl -e 'use encoding utf8; $a="söt"; $b=qr($a); $a=~ 
s/$b/BOLL/gi; print $a."\n";'
BOLLöt
mikael@crydee:~$ perl -e 'use encoding utf8; $a="söt€"; $b=qr($a); $a=~ 
s/$b/BOLL/gi; print $a."\n";'
BOLL
mikael@crydee:~$ perl -e 'use encoding utf8; $a="s€t"; $b=qr($a); $a=~ 
s/$b/BOLL/gi; print $a."\n";'
BOLL
mikael@crydee:~$ perl -e 'use encoding utf8; $a="s¤t"; $b=qr($a); $a=~ 
s/$b/BOLL/gi; print $a."\n";'
s¤t


As you can see, the substitution only works correctly when the string
contains a € sign (which is not in Latin1, and is represented by multiple
bytes in UTF-X). With only 'ö' it breaks halfway, and with '¤' it doesn't
match at all.

Also, with 'use locale' added, perl actually segfaults on '¤', but behaves
the same as before with 'ö' and '€'. See below.

mikael@crydee:~$ perl -e 'use encoding utf8; use locale; $a="s¤t"; $b=qr($a); 
$a=~ s/$b/BOLL/gi; print $a."\n";'
Segmentation fault
mikael@crydee:~$ perl -e 'use encoding utf8; use locale; $a="söt"; $b=qr($a); 
$a=~ s/$b/BOLL/gi; print $a."\n";'
BOLLöt
mikael@crydee:~$ perl -e 'use encoding utf8; use locale; $a="söt€"; 
$b=qr($a); $a=~ s/$b/BOLL/gi; print $a."\n";'
BOLL
mikael@crydee:~$

Without 'use encoding utf8' it seems to work, but dropping the pragma breaks
everything else in our application, so that is not a viable workaround.


[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
     category=core
     severity=high
---
Site configuration information for perl v5.8.4:

Configured by Debian Project at Sat Nov  6 18:41:03 UTC 2004.

Summary of my perl5 (revision 5 version 8 subversion 4) configuration:
   Platform:
     osname=linux, osvers=2.6.10-rc1-bk1, archname=i386-linux-thread-multi
     uname='linux cyberhq 2.6.10-rc1-bk1 #1 smp sat oct 23 12:56:07 pdt 2004 
i686 gnulinux '
     config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN 
-Dcccdlflags=-fPIC -Darchname=i386-linux -Dprefix=/usr 
-Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr 
-Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 
-Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.4 
-Dsitearch=/usr/local/lib/perl/5.8.4 -Dman1dir=/usr/share/man/man1 
-Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 
-Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl 
-Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib 
-Dlibperl=libperl.so.5.8.4 -Dd_dosuid -des'
     hint=recommended, useposix=true, d_sigaction=define
     usethreads=define use5005threads=undef useithreads=define 
usemultiplicity=define
     useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
     use64bitint=undef use64bitall=undef uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS 
-DDEBIAN -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE 
-D_FILE_OFFSET_BITS=64',
     optimize='-O2',
     cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN 
-fno-strict-aliasing -I/usr/local/include'
     ccversion='', gccversion='3.3.5 (Debian 1:3.3.5-2)', gccosandvers=''
     intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
     ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', 
lseeksize=8
     alignbytes=4, prototype=define
   Linker and Libraries:
     ld='cc', ldflags =' -L/usr/local/lib'
     libpth=/usr/local/lib /lib /usr/lib
     libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
     perllibs=-ldl -lm -lpthread -lc -lcrypt
     libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libperl.so.5.8.4
     gnulibc_version='2.3.2'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
     cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:


---
@INC for perl v5.8.4:
     /etc/perl
     /usr/local/lib/perl/5.8.4
     /usr/local/share/perl/5.8.4
     /usr/lib/perl5
     /usr/share/perl5
     /usr/lib/perl/5.8
     /usr/share/perl/5.8
     /usr/local/lib/site_perl
     /usr/local/lib/perl/5.8.3
     /usr/local/share/perl/5.8.3
     .

---
Environment for perl v5.8.4:
     HOME=/home/mikael
     LANG=en_US.UTF-8
     LANGUAGE (unset)
     LC_ALL=en_US.UTF-8
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
     PERL_BADLANG (unset)
     SHELL=/bin/sh



