[perl #79960] Setting $/ to read fixed records can corrupt valid UTF-8 input

From: Brian Fraser via RT
Date: September 29, 2011 01:50
Subject: [perl #79960] Setting $/ to read fixed records can corrupt valid UTF-8 input
Message ID: rt-3.6.HEAD-31297-1317227005-964.79960-15-0@perl.org

On Mon Nov 29 08:05:05 2010, nicholas wrote:
> 
> This is a bug report for perl from nick@ccl4.org,
> generated with the help of perlbug 1.39 running under perl 5.13.7.
> 
> 
> -----------------------------------------------------------------
> [Please describe your issue here]
> 
> It's possible to get the perl interpreter to have corrupt internal state
> on a valid UTF-8 input stream, by setting $/ to cause fixed-length reads.
> 
> [Command-line -C7 sets UTF-8 on STD{IN,OUT,ERR}, and $/ = \4096 sets
> reads to a fixed size of 4096]
> 
> $ ./perl -C7 -e 'print "\x{20AC}" x 1366' | ./perl -C7 -e '$/ = \4096; $_ = <>; printf "%s\n", length $_'
> Malformed UTF-8 character (unexpected end of string) in length at -e line 1, <> chunk 1.
> 1365
> 
> Note that unlike other concerns with the utf8 layer not trapping
> *in*valid input, this bug is for *valid* input.
> 
> Clearer to see is:
> 
> $ ./perl -C7 -e 'print "\x{20AC}"' | ./perl -C7 -e '$/ = \2; $_ = <>; printf "%s\n", length $_'
> Malformed UTF-8 character (unexpected end of string) in length at -e line 1, <> chunk 1.
> 0
> 
> The input is truncated at 2 octets:
> 
> $ ./perl -C7 -e 'print "\x{20AC}"' | ./perl -C7 -Ilib -MDevel::Peek -e '$/ = \2; $_ = <>; Dump $_'
> SV = PV(0xa1e090) at 0xa40f50
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0xa3b3e0 "\342\202"\0 [UTF8 "\x{2080}"]
>   CUR = 2
>   LEN = 80
> 
> 
> The dump should look like this:
> 
> $ ./perl -C7 -Ilib -MDevel::Peek -e 'Dump "\x{20AC}"'
> SV = PV(0xa1e2a0) at 0xa33098
>   REFCNT = 1
>   FLAGS = (POK,READONLY,pPOK,UTF8)
>   PV = 0xa3aca0 "\342\202\254"\0 [UTF8 "\x{20ac}"]
>   CUR = 3
>   LEN = 16
> 
> 
> Curiously there also seems to be a range checking error in the dump
> code, as a truncated pound sign causes a lot more grief:
> 
> $ ./perl -C7 -e 'print "\x{A3}"' | ./perl -Ilib -MDevel::Peek -C7 -we '$/ = \1; $_ = <>; Dump $_'
> utf8 "\xC2" does not map to Unicode at -e line 1, <> chunk 1.
> SV = PV(0xa1e090) at 0xa40f50
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0xa3b3e0 "\302"\0Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xc2) in subroutine entry at -e line 1, <> chunk 1.
>  [UTF8 "\x{0}"]
>   CUR = 1
>   LEN = 80
> 
> 
> The relevant code for this problem is in S_sv_gets_read_record().
> [I refactored it out of Perl_sv_gets() earlier today]
> 
> 
> It's not immediately obvious to me what the correct solution is.
> 
> On the one hand, the user asked for a fixed record length, and on VMS
> we use a record based file API, so we could try to honour that either by
> 
> a: refusing to read on UTF-8 file handles. (make it croak)
> b: throwing an error if the read results in a truncated UTF-8 sequence
>    (make it croak *some* of the time)
> 
> Or we could try to do what read and sysread do, and treat the length
> parameter as characters, so that on a UTF-8 flagged handle we loop until
> we read in sufficient characters. But that blows the idea of "record
> based" completely on a UTF-8 handle.
> 
> Nicholas Clark
> 
> [Please do not change anything below this line]
> -----------------------------------------------------------------
> ---
> Flags:
>     category=core
>     severity=low
> ---
> Site configuration information for perl 5.13.7:
> 
> Configured by nick at Mon Nov 29 15:00:15 GMT 2010.
> 
> Summary of my perl5 (revision 5 version 13 subversion 7) configuration:
>   Derived from: 0f93bb20132f1d122993dac5d6e249240a28646e
>   Platform:
>     osname=linux, osvers=2.6.35.4, archname=x86_64-linux
>     uname='linux eris 2.6.35.4 #4 smp tue sep 21 09:54:22 bst 2010 x86_64 gnulinux '
>     config_args='-Dusedevel=y -Dcc=ccache gcc -Dld=gcc -Ubincompat5005 -Uinstallusrbinperl -Dcf_email=nick@ccl4.org -Dperladmin=nick@ccl4.org -Dinc_version_list= -Dinc_version_list_init=0 -Doptimize=-g -Uusethreads -Uuselongdouble -Uuse64bitall -Uusemymalloc -Duseperlio -Dprefix=~/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20 -Uusevendorprefix -Uvendorprefix=~/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20 -Dinstallman1dir=none -Dinstallman3dir=none -Uuserelocatableinc -Umad -Accccflags-DPURIFY -de'
>     hint=recommended, useposix=true, d_sigaction=define
>     useithreads=undef, usemultiplicity=undef
>     useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
>     use64bitint=define, use64bitall=undef, uselongdouble=undef
>     usemymalloc=n, bincompat5005=undef
>   Compiler:
>     cc='ccache gcc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
>     optimize='-g',
>     cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
>     ccversion='', gccversion='4.3.2', gccosandvers=''
>     intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
>     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
>     ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
>     alignbytes=8, prototype=define
>   Linker and Libraries:
>     ld='gcc', ldflags =' -fstack-protector -L/usr/local/lib'
>     libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
>     libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lc
>     perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
>     libc=/lib/libc-2.7.so, so=so, useshrplib=false, libperl=libperl.a
>     gnulibc_version='2.7'
>   Dynamic Linking:
>     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
>     cccdlflags='-fPIC', lddlflags='-shared -g -L/usr/local/lib -fstack-protector'
> 
> Locally applied patches:
> 
> 
> ---
> @INC for perl 5.13.7:
>     lib
>     /home/nick/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20/lib/perl5/site_perl/5.13.7/x86_64-linux
>     /home/nick/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20/lib/perl5/site_perl/5.13.7
>     /home/nick/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20/lib/perl5/5.13.7/x86_64-linux
>     /home/nick/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20/lib/perl5/5.13.7
>     .
> 
> ---
> Environment for perl 5.13.7:
>     HOME=/home/nick
>     LANG (unset)
>     LANGUAGE (unset)
>     LD_LIBRARY_PATH (unset)
>     LOGDIR (unset)
>     PATH=/home/nick/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/local/sbin:/sbin:/usr/sbin
>     PERL_BADLANG (unset)
>     SHELL=/bin/bash

I'd say make it croak, and maybe add a "consider using sysread() or
binmode() instead"-like entry in perldiag. I guess it could lead to a
bit of action at a distance, but getting broken UTF-8 is basically
never right.
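
For what it's worth, the binmode() route might look something like the
sketch below: read the fixed-size records as raw octets, then decode
them separately, carrying any incomplete trailing sequence over to the
next record. read_utf8_records is a made-up name, and the sample data
and in-memory handle are only there for illustration; this is not a
proposed patch.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: read fixed-size *octet* records from a raw handle
# and decode them incrementally. Returns the decoded chunks and any
# octets that could not be decoded by the time the input ran out.
sub read_utf8_records {
    my ($fh, $record_size) = @_;
    local $/ = \$record_size;    # fixed-length records, counted in octets
    my $pending = '';            # undecoded octets carried between records
    my @chunks;
    while (defined(my $record = <$fh>)) {
        $pending .= $record;
        # FB_QUIET returns what decoded cleanly and leaves the
        # unprocessed octets (e.g. a split character) in $pending.
        push @chunks, decode('UTF-8', $pending, Encode::FB_QUIET);
    }
    return (\@chunks, $pending);
}

# Three EURO SIGNs are 9 octets, so 2-octet records split every character.
my $octets = encode('UTF-8', "\x{20AC}" x 3);
open my $fh, '<:raw', \$octets or die "open: $!";
my ($chunks, $leftover) = read_utf8_records($fh, 2);
my $text = join '', @$chunks;
printf "%d characters, %d leftover octets\n", length $text, length $leftover;
```

Note that Encode::FB_QUIET leaves anything it could not decode in
$pending, so genuinely malformed input would pile up there as well; a
real version would want to tell that apart from a character that is
merely split across records.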


