On Mon, Oct 24, 2011 at 1:54 PM, Eric Brine <ikegami@adaelis.com> wrote:
> ok, so we have:
>
> a: Refusing to read on UTF-8 file handles by croaking.
> b: Treat the record length as bytes. Croak on truncated UTF-8 results.
> c: Treat the record length as characters, like read and sysread.
>
> - Only (c) can handle both records measured in bytes and records
>   measured in chars.
> - Only (c) is consistent with read and sysread.

Why would anyone possibly want fixed-length records in chars? Because
they're packing a Twitter archive? If this use case really exists, there
should be a way to turn it on instead -- maybe by setting $/ to a
reference to an array of integers, which would be interpreted as field
lengths in characters and cycled through.

> - (a) and (c) are more self-consistent than (b): one either deals with
>   bytes or chars, not both at the same time.
>
> But:
>
> - Only (b) is backwards compatible with existing behaviour (although
>   the behaviour isn't exactly documented).
>
> - Eric

d: b, but with a way to turn off the croaking; when it has been turned
off, the invalid segments get downgraded to bytes. ("no strict utf8",
perhaps?) The mechanism for making that adjustment is named in the croak
message.

Would (d) support the forward-looking case of migrating a working legacy
system that handles packed records to a new environment where the input
streams are all chars instead, or to a future perl where -C7 is the
default, with a minimum of maintenance? Would it be better in that
situation to require byte mode on the file handle in question? If so,
that's

e: a, plus also croak earlier, at compile time if possible, by doing flow
analysis: whenever the parser notices that $/ is going to get set to a
reference... hmm, that isn't practical.
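For concreteness, here is a small self-contained sketch (my own demo, not
any of the proposed implementations) of the situation options (a)-(e) are
wrestling with: with $/ set to a reference to an integer, readline returns
fixed-length records counted in *bytes*, so on a raw handle a record
boundary can fall in the middle of a multi-byte UTF-8 character.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two copies of "cafe" with e-acute: 8 characters, but 10 bytes in UTF-8,
# because U+00E9 encodes as the two bytes C3 A9.
my $data = encode('UTF-8', "caf\x{e9}caf\x{e9}");

open my $fh, '<', \$data or die $!;   # raw (byte-oriented) handle
local $/ = \4;                        # fixed-length records of 4 bytes
my @records;
push @records, $_ while <$fh>;
close $fh;

# The first record ends between the C3 and A9 bytes of the first e-acute,
# i.e. mid-character -- exactly the truncation that (b) would croak on.
printf "%d records: %s\n", scalar @records,
    join ', ', map { sprintf '%vx', $_ } @records;
# e.g. 3 records: 63.61.66.c3, a9.63.61.66, c3.a9
```

Under (c) the same $/ = \4 on a character stream would instead yield two
4-character records; the whole disagreement above is which of those two
readings the record length should get on a UTF-8 handle.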