[perl #24541] substr and utf8 and use bytes

From: William R Ward
Date: November 21, 2003 21:08
Subject: [perl #24541] substr and utf8 and use bytes
Message ID: rt-24541-67750.9.59761179084659@rt.perl.org

# New Ticket Created by  William R Ward 
# Please include the string:  [perl #24541]
# in the subject line of all future correspondence about this issue. 
# <URL: http://rt.perl.org/rt2/Ticket/Display.html?id=24541 >



This is a bug report for perl from william.ward@oracle.com,
generated with the help of perlbug 1.34 running under perl v5.8.1.


-----------------------------------------------------------------
[Please enter your report here]

We need to take a string containing utf8-encoded multibyte
characters and, treating the string as bytes, extract a
substring of N bytes from it.

This is what "use bytes" was meant for, and it works great on Perl
5.6.1.  But in Perl 5.8.1 it corrupts the multi-byte characters in the
process of extracting them.  The following test script illustrates the
issue well.  Run it under 5.6.1 and 5.8.1 and check the difference.

#!perl -w

use utf8;

$omega="\x{2126}";
$str=$omega.'1234567890';

open(FILE, ">","utf8-substr.out") || die "open - $!\n";
binmode(FILE, ":utf8") if $] > 5.008;
print FILE "original str=$str";
print FILE "\n\n";

$str1=substr($str,0,10);
print FILE "before use bytes: str=$str, substr=$str1\n\n";

$str1b = unpack("a10", $str);
print FILE "before use bytes, unpack: str=$str, substr=$str1b\n\n";

($str1c) = ($str =~ /(..........)/);
print FILE "before use bytes, regex: str=$str, substr=$str1c\n\n";

@chars = split "", $str;
$str1d = join("", @chars[0..9]);
print FILE "before use bytes, split/join: str=$str, substr=$str1d\n\n";

{
    use bytes;
    $str2=substr($str,0,10);
    print FILE "after use bytes: str=$str, substr=$str2\n\n";

    $str2b = unpack("a10", $str);
    print FILE "after use bytes, unpack: str=$str, substr=$str2b\n\n";

    ($str2c) = ($str =~ /(..........)/);
    print FILE "after use bytes, regex: str=$str, substr=$str2c\n\n";

    @chars = split "", $str;
    $str2d = join("", @chars[0..9]);
    print FILE "after use bytes, split/join: str=$str, substr=$str2d\n\n";
}

close(FILE);
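
For comparison, here is a minimal sketch of one way to get a byte-wise prefix that does not depend on "use bytes" at all: encode the string to UTF-8 octets explicitly, then call substr on the octets. It assumes the Encode module is available (it ships with 5.8 but not with 5.6.1), and the helper name first_utf8_bytes is hypothetical, not something from the original script. Note that cutting at an arbitrary byte offset can still split a multibyte character in half.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper (not part of the original report): return the first
# $n bytes of the UTF-8 encoding of $text as a byte string.
sub first_utf8_bytes {
    my ($text, $n) = @_;
    my $octets = encode_utf8($text);   # character string -> UTF-8 octets
    return substr($octets, 0, $n);     # substr now counts octets
}

my $str   = "\x{2126}" . '1234567890'; # OHM SIGN followed by ten digits
my $bytes = first_utf8_bytes($str, 10);

printf "chars in original: %d\n", length($str);
printf "bytes in prefix:   %d\n", length($bytes);
printf "prefix as hex:     %s\n", unpack("H*", $bytes);
```

With the string from the test script, the prefix is the three bytes of the omega (e2 84 a6) followed by the digits 1 through 7.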



In case it helps, here is the "perl -V" for our 5.6.1 instance (the
5.8.1 "perl -V" is included below by perlbug itself):

Summary of my perl5 (revision 5.0 version 6 subversion 1) configuration:
  Platform:
    osname=linux, osvers=2.4.9-12custom, archname=i686-linux
    uname='linux ap607wgs 2.4.9-12custom #1 smp wed feb 6 13:55:56 pst 2002 i686 unknown '
    config_args='-de -Dprefix=/arudev -Dmake=/usr/bin/make -Dbin=/arudev/bin/ -Uinstallusrbinperl -Dstartperl=#!/arudev/bin/perl -Dscriptdir=/nfs/group/arudev/arch/share/bin -Dsitebin=/arudev/bin'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing'
    ccversion='', gccversion='2.96 20000731 (Red Hat Linux 7.1 2.96-85)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldl -lm -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lc -lcrypt -lutil
    libc=/lib/libc-2.2.2.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'


Characteristics of this binary (from libperl):
  Compile-time options: USE_LARGE_FILES
  Built under linux
  Compiled at Sep 20 2002 15:20:12
  @INC:
    /arudev/lib/perl5/5.6.1/i686-linux
    /arudev/lib/perl5/5.6.1
    /arudev/lib/perl5/site_perl/5.6.1/i686-linux
    /arudev/lib/perl5/site_perl/5.6.1
    /arudev/lib/perl5/site_perl
    .


[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=high
---
Site configuration information for perl v5.8.1:

Configured by srdas at Sun Nov  9 23:25:48 PST 2003.

Summary of my perl5 (revision 5.0 version 8 subversion 1) configuration:
  Platform:
    osname=linux, osvers=2.4.9-e.18smp, archname=i686-linux-thread-multi
    uname='linux ap630wgs 2.4.9-e.18smp #1 smp fri apr 11 18:24:51 edt 2003 i686 unknown '
    config_args='-de -Dprefix=/arudev -Dmake=/usr/bin/make -Dbin=/arudev/bin/ -Uinstallusrbinperl -Dusethreads -Dstartperl=#!/arudev/bin/perl -Dinc_version_list=none -Dscriptdir=/nfs/group/arudev/arch/share/bin -Dsitebin=/arudev/bin'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='2.96 20000731 (Red Hat Linux 7.2 2.96-108.1)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.2.4.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.2.4'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
   

---
@INC for perl v5.8.1:
    /arudev/lib/perl5/5.8.1/i686-linux-thread-multi
    /arudev/lib/perl5/5.8.1
    /arudev/lib/perl5/site_perl/5.8.1/i686-linux-thread-multi
    /arudev/lib/perl5/site_perl/5.8.1
    /arudev/lib/perl5/site_perl
    .

---
Environment for perl v5.8.1:
    HOME=/home/wward
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/openwin/lib:/oracle/8.1.6/lib
    LOGDIR (unset)
    
    PATH=/home/wward/bin:/arudev/bin:/arudev/tools/bin:/usr/local/bin:/bin:/sbin:/usr/bin:/usr/sbin:/local/bin:/home/wward/bin:/arudev/bin:/arudev/tools/bin:/usr/local/bin:/bin:/sbin:/usr/bin:/usr/sbin:/local/bin:/usr/bin/X11:.:/home/wward/bin:/oracle/8.1.6/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/bin/X11:/usr/local/bin:/etc:/appldev/bin:/usr/atria/bin:/arudev/bin:/arudev/tools/bin:/local/bin
    PERL_BADLANG (unset)
    SHELL=/usr/local/bin/tcsh





