
Re: [perl #79960] Setting $/ to read fixed records can corrupt valid UTF-8 input

From: Eric Brine
Date: February 23, 2012 21:56
Subject: Re: [perl #79960] Setting $/ to read fixed records can corrupt valid UTF-8 input
Message ID: CALJW-qFXGNes-03i9snLhOS_6jCw6ZPditcWObbwkcxx=VwPLg@mail.gmail.com

On Tue, Feb 21, 2012 at 11:33 PM, David Nicol <davidnicol@gmail.com> wrote:

> On Mon, Oct 24, 2011 at 1:54 PM, Eric Brine <ikegami@adaelis.com> wrote:
> > ok, so we have:
> >
> > a: Refusing to read on UTF-8 file handles by croaking.
> > b: Treat the record length as bytes. Croak on truncated UTF-8 results.
> > c: Treat the record length as characters like read and sysread.
> >
> >
> > - Only (c) can handle both records measured in bytes and records
> > measured in chars.
> > - Only (c) is consistent with read and sysread.
>
> Why would anyone possibly want fixed-length records in chars? Because
> they're packing a Twitter archive? If this use case really exists,
> there should be a way to turn it on instead -- maybe by setting $/ to
> a reference to an array of integers which will be interpreted as field
> lengths in characters and cycled through.
>
> > - (a) and (c) are more self-consistent than (b): One either deals with
> > bytes or chars, not both at the same time.
> >
> > But:
> >
> > - Only (b) is backwards compatible with existing behaviour (although the
> > behaviour isn't exactly documented).
> >
> > - Eric
>
> d: b, but there is a way to turn off the croaking, and when it has
> been turned off, the invalid segments get downgraded to bytes. ("no
> strict utf8" perhaps?)


If I read that correctly, (d) would mean that

no strict "utf8"; # or whatever
open(my $fh, '<:encoding(UTF-16be)', \"\x26\x60\x26\x60\x26\x60");
sysread($fh, my $buf, $i);

does

when ($i==1) { $buf = "\xE2"; }
when ($i==2) { $buf = "\xE2\x99"; }
when ($i==3) { $buf = "\x{2660}"; }
when ($i==4) { $buf = "\x{2660}\xE2"; }
when ($i==5) { $buf = "\xE2\x99\xA0\xE2\x99"; }
when ($i==6) { $buf = "\x{2660}\x{2660}"; }

I don't see how returning bytes that don't even exist in the file is of any
use.
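
To make that concrete, here is a rough sketch (not part of any proposal) that checks which bytes are really in the file from the example above. It assumes the core Encode module; the variable names are only illustrative. The six bytes are three UTF-16BE code units for U+2660, and the E2 99 A0 bytes shown in the hypothetical results are the UTF-8 encoding of that character, none of which occur in the file itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The six bytes used in the example: three UTF-16BE code units.
my $file_bytes = "\x26\x60\x26\x60\x26\x60";

# Decoding yields three characters, each U+2660.
my $chars = decode('UTF-16BE', $file_bytes);

# UTF-8 encoding of a single U+2660 character.
my $first = substr($chars, 0, 1);
my $utf8  = encode('UTF-8', $first);

printf "characters: %d\n", length $chars;
printf "UTF-8 bytes of U+2660: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8;

# Collect the UTF-8 bytes that never occur in the original file content.
my @missing = grep { index($file_bytes, $_) < 0 } split //, $utf8;
printf "UTF-8 bytes absent from the file: %d of %d\n",
    scalar @missing, length $utf8;
```

The file only ever contains the bytes 26 and 60, so any result holding E2, 99 or A0 exposes the decoded string's UTF-8 form rather than anything read from disk.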

- Eric
