
[perl #24888] chomp ignores utf8

From: Nicholas Clark
Date: January 12, 2004 20:11
Subject: [perl #24888] chomp ignores utf8
Message ID: rt-3.0.8-24888-69959.12.2317179856103@perl.org
# New Ticket Created by  Nicholas Clark 
# Please include the string:  [perl #24888]
# in the subject line of all future correspondence about this issue. 
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=24888 >



This is a bug report for perl from nick@ccl4.org,
generated with the help of perlbug 1.34 running under perl v5.8.3.


-----------------------------------------------------------------
[Please enter your report here]

While working my way down doop.c, I discovered that chomp completely ignores
the utf8 flags on both the chomped string and $/.

With the following patch to t/op/chop.t there are many test failures.
I'm not sure of the most efficient way to patch Perl_do_chomp to cure them.
My guess is to use the existing byte comparison code when the utf8 flags are
the same on both the target and $/, and to do a conversion otherwise, but I'm
not going to look further until after 5.8.3 is released.
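
As a rough illustration of the approach guessed at above (plain byte
comparison when the utf8 flags agree, conversion otherwise), here is a minimal
standalone C sketch. It is not the Perl_do_chomp code and uses none of the perl
internals: str_view, latin1_to_utf8, utf8_to_latin1 and chomp_len are
hypothetical names invented for this example, the 64-byte scratch buffer is an
arbitrary limit, and an empty separator is simply treated as "no match" rather
than as paragraph mode.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a Perl scalar: raw bytes plus a UTF-8 flag. */
typedef struct {
    const unsigned char *bytes;
    size_t len;
    int is_utf8;  /* nonzero: bytes are UTF-8; zero: one byte per character */
} str_view;

/* Upgrade one-byte-per-character data to UTF-8.
 * Returns the output length, or 0 if the output buffer is too small. */
static size_t latin1_to_utf8(const unsigned char *in, size_t len,
                             unsigned char *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            if (n + 1 > cap) return 0;
            out[n++] = in[i];
        } else {
            if (n + 2 > cap) return 0;
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return n;
}

/* Downgrade UTF-8 to one byte per character.
 * Returns the output length, or 0 if any character is above U+00FF,
 * the input is malformed, or the output buffer is too small. */
static size_t utf8_to_latin1(const unsigned char *in, size_t len,
                             unsigned char *out, size_t cap)
{
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned char c;
        if (in[i] < 0x80) {
            c = in[i++];
        } else if ((in[i] == 0xC2 || in[i] == 0xC3) && i + 1 < len
                   && (in[i + 1] & 0xC0) == 0x80) {
            c = (unsigned char)(((in[i] & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            return 0;  /* cannot be represented in a byte string */
        }
        if (n >= cap) return 0;
        out[n++] = c;
    }
    return n;
}

/* Returns how many bytes to remove from the end of target (0 = no chomp). */
size_t chomp_len(str_view target, str_view sep)
{
    unsigned char buf[64];  /* arbitrary scratch size for this sketch */
    const unsigned char *s = sep.bytes;
    size_t slen = sep.len;

    if (slen == 0) return 0;
    if (target.is_utf8 && !sep.is_utf8) {
        /* Flags differ: bring the separator into the target's encoding. */
        slen = latin1_to_utf8(sep.bytes, sep.len, buf, sizeof buf);
        s = buf;
    } else if (!target.is_utf8 && sep.is_utf8) {
        slen = utf8_to_latin1(sep.bytes, sep.len, buf, sizeof buf);
        s = buf;
    }
    /* Flags now agree (or conversion failed): plain byte comparison. */
    if (slen == 0 || slen > target.len) return 0;
    return memcmp(target.bytes + target.len - slen, s, slen) == 0 ? slen : 0;
}
```

In this sketch a UTF-8 flagged target holding the bytes 4E C2 A3 with a
byte-string separator A3 yields 2, because the separator is upgraded to C2 A3
before the comparison. The same target form with a byte-string separator
D4 90 yields 0, because the upgraded separator (C3 94 C2 90) is longer than
the target. Converting only the separator keeps the common same-flag case on
the plain memcmp path.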

ok 52 - start=78 end=78
ok 53 - start=78 end=163
not ok 54 - start=78 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
#      got 'NÂ'
# expected 'N£'
ok 55 - start=78 end=163 ($/ as bytes)
ok 56 - start=78 end=164
not ok 57 - start=78 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
#      got 'N'
# expected 'N¤'
not ok 58 - start=78 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
#      got 'N'
# expected 'N¤'
ok 59 - start=78 end=1296
not ok 60 - start=78 end=1296 (end as bytes)
# Failed at t/op/chop.t line 203
#      got 'N'
# expected 'NÔ
not ok 61 - start=78 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
#      got 'N'
Wide character in print at ./test.pl line 38.
# expected 'NÔ
ok 62 - start=163 end=78
ok 63 - start=163 end=163
not ok 64 - start=163 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
#      got '£Â'
# expected '£Â£'
ok 65 - start=163 end=163 ($/ as bytes)
ok 66 - start=163 end=164
not ok 67 - start=163 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
#      got '£'
# expected '£Â¤'
not ok 68 - start=163 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
#      got '£'
# expected '£¤'
ok 69 - start=163 end=1296
not ok 70 - start=163 end=1296 (end as bytes)
# Failed at t/op/chop.t line 203
#      got '£'
# expected '£Ô
not ok 71 - start=163 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
#      got '£'
Wide character in print at ./test.pl line 38.
# expected '£Ô
ok 72 - start=164 end=78
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 73 - start=164 end=163
# Failed at t/op/chop.t line 193
Wide character in print at ./test.pl line 38.
#      got '¤Â'
# expected '¤'
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 74 - start=164 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
Wide character in print at ./test.pl line 38.
#      got '¤ÂÂ'
# expected '¤Â£'
not ok 75 - start=164 end=163 ($/ as bytes)
# Failed at t/op/chop.t line 209
#      got '¤'
# expected '¤£'
ok 76 - start=164 end=164
not ok 77 - start=164 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
#      got '¤Â'
# expected '¤Â¤'
not ok 78 - start=164 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
#      got '¤'
# expected '¤¤'
ok 79 - start=164 end=1296
ok 80 - start=164 end=1296 (end as bytes)
not ok 81 - start=164 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
#      got '¤'
Wide character in print at ./test.pl line 38.
# expected '¤Ô
ok 82 - start=1296 end=78
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 83 - start=1296 end=163
# Failed at t/op/chop.t line 193
Wide character in print at ./test.pl line 38.
#      got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 84 - start=1296 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
Wide character in print at ./test.pl line 38.
#      got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
not ok 85 - start=1296 end=163 ($/ as bytes)
# Failed at t/op/chop.t line 209
Wide character in print at ./test.pl line 38.
#      got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
ok 86 - start=1296 end=164
not ok 87 - start=1296 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
Wide character in print at ./test.pl line 38.
#      got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
not ok 88 - start=1296 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
Wide character in print at ./test.pl line 38.
#      got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
ok 89 - start=1296 end=1296
ok 90 - start=1296 end=1296 (end as bytes)
not ok 91 - start=1296 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
Wide character in print at ./test.pl line 38.
#      got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô

This is not a new utf8 bug.

--- t/op/chop.t.orig	Mon Nov  4 06:34:41 2002
+++ t/op/chop.t	Mon Jan 12 20:56:02 2004
@@ -6,7 +6,7 @@ BEGIN {
     require './test.pl';
 }
 
-plan tests => 51;
+plan tests => 91;
 
 $_ = 'abc';
 $c = do foo();
@@ -183,3 +183,29 @@ ok($@ =~ /Can\'t modify.*chop.*in.*assig
 eval 'chomp($x, $y) = (1, 2);';
 ok($@ =~ /Can\'t modify.*chom?p.*in.*assignment/);
 
+my @chars = ("N", "\xa3", substr ("\xa4\x{100}", 0, 1), chr 1296);
+foreach my $start (@chars) {
+  foreach my $end (@chars) {
+    local $/ = $end;
+    my $message = "start=" . ord ($start) . " end=" . ord $end;
+    my $string = $start . $end;
+    chomp $string;
+    is ($string, $start, $message);
+
+    my $end_utf8 = $end;
+    utf8::encode ($end_utf8);
+    next if $end_utf8 eq $end;
+
+    # $end ne $end_utf8, so these should not chomp.
+    $string = $start . $end_utf8;
+    my $chomped = $string;
+    chomp $chomped;
+    is ($chomped, $string, "$message (end as bytes)");
+
+    $/ = $end_utf8;
+    $string = $start . $end;
+    $chomped = $string;
+    chomp $chomped;
+    is ($chomped, $string, "$message (\$/ as bytes)");
+  }
+}

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=medium
---
Site configuration information for perl v5.8.3:

Configured by nick at Fri Jan  9 10:31:25 GMT 2004.

Summary of my perl5 (revision 5.0 version 8 subversion 3) configuration:
  Platform:
    osname=linux, osvers=2.4.19-rmk4, archname=armv4l-linux
    uname='linux bagpuss.unfortu.net 2.4.19-rmk4 #3 fri oct 25 21:57:55 bst 2002 armv4l unknown '
    config_args='-Dusedevel=y -Dcc=ccache gcc-3.0 -Dld=gcc -Ubincompat5005 -Uinstallusrbinperl -Dcf_email=nick@ccl4.org -Dperladmin=nick@ccl4.org -Dinc_version_list=  -Dinc_version_list_init=0 -Doptimize=-O1 -Dusethreads=n -Dprefix=/usr/local/perl5.8.3/ -Dinstallman1dir=none -Dinstallman3dir=none -de'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='ccache gcc-3.0', ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O1',
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='3.0.4', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=8
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.2.5.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    MAINT22085

---
@INC for perl v5.8.3:
    lib
    /usr/local/perl5.8.3/lib/5.8.3/armv4l-linux
    /usr/local/perl5.8.3/lib/5.8.3
    /usr/local/perl5.8.3/lib/site_perl/5.8.3/armv4l-linux
    /usr/local/perl5.8.3/lib/site_perl/5.8.3
    /usr/local/perl5.8.3/lib/site_perl
    .

---
Environment for perl v5.8.3:
    HOME=/home/nick
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=en_GB.ISO-8859-1
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/nick/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash


