perl.perl5.porters | Postings from November 2010
[perl #79960] Setting $/ to read fixed records can corrupt valid UTF-8 input
From: Nicholas Clark
Date: November 29, 2010 10:17
Subject: [perl #79960] Setting $/ to read fixed records can corrupt valid UTF-8 input
Message ID: rt-3.6.HEAD-13564-1291046706-644.79960-75-0@perl.org
# New Ticket Created by Nicholas Clark
# Please include the string: [perl #79960]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=79960 >
This is a bug report for perl from nick@ccl4.org,
generated with the help of perlbug 1.39 running under perl 5.13.7.
-----------------------------------------------------------------
[Please describe your issue here]
It's possible to get the perl interpreter to have corrupt internal state on
a valid UTF-8 input stream, by setting $/ to cause fixed-length reads.
[Command-line -C7 sets UTF-8 on STD{IN,OUT,ERR}, and $/ = \4096 sets reads to
a fixed size of 4096]
$ ./perl -C7 -e 'print "\x{20AC}" x 1366' | ./perl -C7 -e '$/ = \4096; $_ = <>; printf "%s\n", length $_'
Malformed UTF-8 character (unexpected end of string) in length at -e line 1, <> chunk 1.
1365
Note that unlike other concerns with the utf8 layer not trapping *in*valid
input, this bug is for *valid* input.
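The 1365 in the first example follows from the octet arithmetic: U+20AC encodes to 3 octets, so 1366 of them make 4098 octets, and a 4096-octet record ends one octet into the last character. A minimal sketch of that calculation (split_point is a name made up for this illustration; it assumes every character in the stream has the same encoded length):

```perl
use strict;
use warnings;

# Hypothetical helper: given a record length in octets and a fixed
# per-character encoded length, return the number of whole characters
# that fit and the number of octets left over from the next character.
sub split_point {
    my ($record_len, $char_len) = @_;
    my $whole   = int($record_len / $char_len);
    my $partial = $record_len % $char_len;
    return ($whole, $partial);
}

# Measure the encoded length of U+20AC rather than hard-coding it.
my $octets = "\x{20AC}";
utf8::encode($octets);
my $char_len = length $octets;

my ($whole, $partial) = split_point(4096, $char_len);
printf "%d octets per character: %d whole characters, %d stray octet(s)\n",
    $char_len, $whole, $partial;
```

The stray octet is the start byte of the 1366th character, which is what length then trips over.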
This is easier to see with a single character:
$ ./perl -C7 -e 'print "\x{20AC}"' | ./perl -C7 -e '$/ = \2; $_ = <>; printf "%s\n", length $_'
Malformed UTF-8 character (unexpected end of string) in length at -e line 1, <> chunk 1.
0
The input is truncated at 2 octets:
$ ./perl -C7 -e 'print "\x{20AC}"' | ./perl -C7 -Ilib -MDevel::Peek -e '$/ = \2; $_ = <>; Dump $_'
SV = PV(0xa1e090) at 0xa40f50
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa3b3e0 "\342\202"\0 [UTF8 "\x{2080}"]
CUR = 2
LEN = 80
The dump should look like this:
$ ./perl -C7 -Ilib -MDevel::Peek -e 'Dump "\x{20AC}"'
SV = PV(0xa1e2a0) at 0xa33098
REFCNT = 1
FLAGS = (POK,READONLY,pPOK,UTF8)
PV = 0xa3aca0 "\342\202\254"\0 [UTF8 "\x{20ac}"]
CUR = 3
LEN = 16
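The same truncation can be reproduced without a pipe by cutting the encoded octets by hand. This is only a sketch of what the record read leaves in the buffer, not the code path in sv_gets; is_complete_utf8 is a name invented here, and it relies on utf8::decode returning false for octets that are not well-formed UTF-8:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $octets is a complete, well-formed
# UTF-8 octet string, false otherwise.
sub is_complete_utf8 {
    my ($octets) = @_;
    my $copy = $octets;              # utf8::decode works in place
    return utf8::decode($copy) ? 1 : 0;
}

my $euro = "\x{20AC}";
utf8::encode($euro);                 # now the 3 octets E2 82 AC

# Try every prefix length, as $/ = \1, \2 and \3 would cut it.
for my $n (1 .. 3) {
    my $record = substr($euro, 0, $n);
    printf "%d octet(s): %s -> %s\n",
        $n,
        join(' ', map { sprintf '%02X', ord } split //, $record),
        is_complete_utf8($record) ? 'complete' : 'truncated';
}
```

Only the full 3-octet prefix is reported as complete; the 2-octet prefix is the "\342\202" seen in the first dump.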
Curiously there also seems to be a range checking error in the dump code, as a
truncated pound sign causes a lot more grief:
$ ./perl -C7 -e 'print "\x{A3}"' | ./perl -Ilib -MDevel::Peek -C7 -we '$/ = \1; $_ = <>; Dump $_'
utf8 "\xC2" does not map to Unicode at -e line 1, <> chunk 1.
SV = PV(0xa1e090) at 0xa40f50
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa3b3e0 "\302"\0Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xc2) in subroutine entry at -e line 1, <> chunk 1.
[UTF8 "\x{0}"]
CUR = 1
LEN = 80
The relevant code for this problem is in S_sv_gets_read_record().
[I refactored it out of Perl_sv_gets() earlier today]
It's not immediately obvious to me what the correct solution is.
On the one hand, the user asked for a fixed record length, and on VMS we use
a record based file API, so we could try to honour that either by
a: refusing to read on UTF-8 file handles. (make it croak)
b: throwing an error if the read results in a truncated UTF-8 sequence
(make it croak *some* of the time)
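For option b, the check only needs to look at the last few octets of the record. A rough Perl-level sketch of such a check follows; the real fix would live in C in S_sv_gets_read_record(), the name incomplete_tail_length is invented here, and it assumes standard UTF-8 with sequences of at most 4 octets and an otherwise well-formed buffer:

```perl
use strict;
use warnings;

# Hypothetical helper: return the number of octets at the end of $octets
# that start a UTF-8 sequence whose remaining octets are missing.
# Returns 0 when the buffer ends on a character boundary.
sub incomplete_tail_length {
    my ($octets) = @_;
    my $len      = length $octets;
    my $max_back = $len < 4 ? $len : 4;

    for my $back (1 .. $max_back) {
        my $byte = ord substr($octets, $len - $back, 1);
        next if ($byte & 0xC0) == 0x80;    # continuation octet, keep looking

        # Non-continuation octet: work out the sequence length it announces.
        my $need = $byte < 0x80           ? 1
                 : ($byte & 0xE0) == 0xC0 ? 2
                 : ($byte & 0xF0) == 0xE0 ? 3
                 : ($byte & 0xF8) == 0xF0 ? 4
                 :                          1;   # invalid start byte, out of scope
        return $need > $back ? $back : 0;
    }
    return 0;    # empty buffer, or nothing but continuation octets in the tail
}

my $record = "\xE2\x82";                   # the 2-octet record from the report
printf "%d dangling octet(s)\n", incomplete_tail_length($record);
```

A croak would then fire only when the result is non-zero, which is the "some of the time" behaviour described in b.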
Or we could try to do what read and sysread do, and treat the length parameter
as characters, so that on a UTF-8 flagged handle we loop until we read in
sufficient characters. But that blows the idea of "record based" completely
on a UTF-8 handle.
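For comparison, this is the existing read behaviour that the second approach would mimic: on a handle with a UTF-8 layer the length argument counts characters. A small sketch using an in-memory handle (read_chars is just an illustrative wrapper, not a proposed API):

```perl
use strict;
use warnings;

# Hypothetical wrapper: read up to $count characters from $fh using
# read(), which counts characters on a handle with a UTF-8 layer.
sub read_chars {
    my ($fh, $count) = @_;
    my $got = read($fh, my $buf, $count);
    die "read failed: $!" unless defined $got;
    return $buf;
}

my $octets = "\x{20AC}" x 3;
utf8::encode($octets);                        # 9 octets of UTF-8

open my $fh, '<:encoding(UTF-8)', \$octets    # in-memory handle
    or die "cannot open in-memory handle: $!";

my $record = read_chars($fh, 2);
printf "asked for 2, got %d characters\n", length $record;
```

Here a request for 2 consumes 6 octets of input, which is exactly why the notion of a fixed-size record stops making sense on such a handle.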
Nicholas Clark
[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
category=core
severity=low
---
Site configuration information for perl 5.13.7:
Configured by nick at Mon Nov 29 15:00:15 GMT 2010.
Summary of my perl5 (revision 5 version 13 subversion 7) configuration:
Derived from: 0f93bb20132f1d122993dac5d6e249240a28646e
Platform:
osname=linux, osvers=2.6.35.4, archname=x86_64-linux
uname='linux eris 2.6.35.4 #4 smp tue sep 21 09:54:22 bst 2010 x86_64 gnulinux '
config_args='-Dusedevel=y -Dcc=ccache gcc -Dld=gcc -Ubincompat5005 -Uinstallusrbinperl -Dcf_email=nick@ccl4.org -Dperladmin=nick@ccl4.org -Dinc_version_list= -Dinc_version_list_init=0 -Doptimize=-g -Uusethreads -Uuselongdouble -Uuse64bitall -Uusemymalloc -Duseperlio -Dprefix=~/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20 -Uusevendorprefix -Uvendorprefix=~/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20 -Dinstallman1dir=none -Dinstallman3dir=none -Uuserelocatableinc -Umad -Accccflags-DPURIFY -de'
hint=recommended, useposix=true, d_sigaction=define
useithreads=undef, usemultiplicity=undef
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='ccache gcc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-g',
cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
ccversion='', gccversion='4.3.2', gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='gcc', ldflags =' -fstack-protector -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lc
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
libc=/lib/libc-2.7.so, so=so, useshrplib=false, libperl=libperl.a
gnulibc_version='2.7'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -g -L/usr/local/lib -fstack-protector'
Locally applied patches:
---
@INC for perl 5.13.7:
lib
/home/nick/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20/lib/perl5/site_perl/5.13.7/x86_64-linux
/home/nick/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20/lib/perl5/site_perl/5.13.7
/home/nick/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20/lib/perl5/5.13.7/x86_64-linux
/home/nick/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20/lib/perl5/5.13.7
.
---
Environment for perl 5.13.7:
HOME=/home/nick
LANG (unset)
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/home/nick/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/local/sbin:/sbin:/usr/sbin
PERL_BADLANG (unset)
SHELL=/bin/bash