[perl #79960] Setting $/ to read fixed records can corrupt valid UTF-8 input

Nicholas Clark
November 29, 2010 10:17
It's possible to leave the perl interpreter with corrupt internal state on
a valid UTF-8 input stream, by setting $/ to cause fixed-length reads.

[Command-line -C7 sets UTF-8 on STD{IN,OUT,ERR}, and $/ = \4096 sets reads to
a fixed size of 4096]

$ ./perl -C7 -e 'print "\x{20AC}" x 1366' | ./perl -C7 -e '$/ = \4096; $_ = <>; printf "%s\n", length $_'
Malformed UTF-8 character (unexpected end of string) in length at -e line 1, <> chunk 1.

Note that unlike other concerns with the utf8 layer not trapping *in*valid
input, this bug is for *valid* input.

A smaller case makes the problem easier to see:

$ ./perl -C7 -e 'print "\x{20AC}"' | ./perl -C7 -e '$/ = \2; $_ = <>; printf "%s\n", length $_'
Malformed UTF-8 character (unexpected end of string) in length at -e line 1, <> chunk 1.

The input is truncated at 2 octets:

$ ./perl -C7 -e 'print "\x{20AC}"' | ./perl -C7 -Ilib -MDevel::Peek -e '$/ = \2; $_ = <>; Dump $_'
SV = PV(0xa1e090) at 0xa40f50
  REFCNT = 1
  PV = 0xa3b3e0 "\342\202"\0 [UTF8 "\x{2080}"]
  CUR = 2
  LEN = 80

The dump should look like this:

$ ./perl -C7 -Ilib -MDevel::Peek -e 'Dump "\x{20AC}"'
SV = PV(0xa1e2a0) at 0xa33098
  REFCNT = 1
  PV = 0xa3aca0 "\342\202\254"\0 [UTF8 "\x{20ac}"]
  CUR = 3
  LEN = 16
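
The octet arithmetic behind both failures can be checked without involving
the I/O layer. The following is a minimal sketch (not taken from the perl
test suite) that builds the same 1366-character stream as octets, cuts it at
4096, and asks utf8::decode whether the result is still well-formed. The
variable names are illustrative only.

```perl
use strict;
use warnings;

# U+20AC (EURO SIGN) converted in place to its UTF-8 octets.
my $euro = "\x{20AC}";
utf8::encode($euro);                  # now the 3 octets E2 82 AC

# 1366 copies make 4098 octets. 1365 whole characters take 4095 octets,
# so a 4096-octet record ends one octet into the 1366th character.
my $stream = $euro x 1366;
my $record = substr($stream, 0, 4096);

# utf8::decode returns false when its argument is not well-formed UTF-8.
my $copy = $record;
my $ok   = utf8::decode($copy);

printf "octets per character: %d\n", length $euro;
printf "stream length:        %d\n", length $stream;
printf "record is valid:      %s\n", $ok ? "yes" : "no";
```

The same check on the first 2 octets of a single "\x{20AC}" fails for the
same reason: the record boundary falls inside a 3-octet sequence.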

Curiously, there also seems to be a range-checking error in the dump code, as a
pound sign (2 octets in UTF-8) truncated to 1 octet causes a lot more grief:

$ ./perl -C7 -e 'print "\x{A3}"' | ./perl -Ilib -MDevel::Peek -C7 -we '$/ = \1; $_ = <>; Dump $_'
utf8 "\xC2" does not map to Unicode at -e line 1, <> chunk 1.
SV = PV(0xa1e090) at 0xa40f50
  REFCNT = 1
  PV = 0xa3b3e0 "\302"\0Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xc2) in subroutine entry at -e line 1, <> chunk 1.
 [UTF8 "\x{0}"]
  CUR = 1
  LEN = 80

The relevant code for this problem is in S_sv_gets_read_record().
[I refactored it out of Perl_sv_gets() earlier today]

It's not immediately obvious to me what the correct solution is.

On the one hand, the user asked for a fixed record length, and on VMS we use
a record based file API, so we could try to honour that either by

a: refusing to read on UTF-8 file handles. (make it croak)
b: throwing an error if the read results in a truncated UTF-8 sequence
   (make it croak *some* of the time)
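
To make option b concrete, here is a rough sketch in Perl of the check it
implies: look at the last few octets of the record and decide whether they
are the start of a sequence that was cut short. incomplete_tail_length is a
hypothetical helper, not an existing perl function, and the real check would
live in C inside S_sv_gets_read_record(). It assumes its argument holds
octets and only inspects the tail; it does not validate the whole buffer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns how many octets at the end of $octets belong
# to a UTF-8 sequence that has been cut short, or 0 if the buffer does not
# end inside a sequence.
sub incomplete_tail_length {
    my ($octets) = @_;
    my $len = length $octets;
    my $max = $len < 3 ? $len : 3;    # a cut-off sequence has at most 3 octets

    for my $back (1 .. $max) {
        my $byte = ord substr($octets, $len - $back, 1);
        next if ($byte & 0xC0) == 0x80;    # continuation byte, keep looking

        # Found a start byte (or ASCII): how long should the sequence be?
        my $need = $byte >= 0xF0 ? 4
                 : $byte >= 0xE0 ? 3
                 : $byte >= 0xC0 ? 2
                 :                 1;
        return $need > $back ? $back : 0;
    }
    return 0;    # three trailing continuation bytes: not a truncation
}

printf "%d\n", incomplete_tail_length("\xE2\x82");        # truncated euro
printf "%d\n", incomplete_tail_length("\xE2\x82\xAC");    # complete euro
printf "%d\n", incomplete_tail_length("\xC2");            # truncated pound
```

With a check like this, option b amounts to croaking whenever the result is
non-zero, which is why it would only fail some of the time.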

Or we could try to do what read and sysread do, and treat the length parameter
as characters, so that on a UTF-8 flagged handle we loop until we read in
sufficient characters. But that blows the idea of "record based" completely
on a UTF-8 handle.
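
For comparison, this small sketch shows the character-counting behaviour of
read on a handle with a UTF-8 layer, next to an octet-counting fixed-length
record on a raw handle. It uses in-memory filehandles so that it does not
depend on -C7 or a pipe; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $octets = "\xE2\x82\xAC" x 3;      # three EURO SIGNs, 9 octets

# read() on a handle with a UTF-8 layer counts characters, so asking for 2
# consumes 6 octets and yields 2 whole characters.
open my $chars, '<:encoding(UTF-8)', \$octets or die "open: $!";
my $buf;
my $got = read($chars, $buf, 2);
close $chars;

# A fixed-length record on a raw handle counts octets, so asking for 2
# yields "\xE2\x82", which is not a complete character.
open my $raw, '<:raw', \$octets or die "open: $!";
my $record = do { local $/ = \2; <$raw> };
close $raw;

printf "read returned %d, buffer holds %d characters\n", $got, length $buf;
printf "raw record holds %d octets\n", length $record;
```

Applying the first behaviour to $/ would keep the data valid, at the cost
that the number of octets consumed per record is no longer fixed.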

Nicholas Clark

