perl.perl5.porters
Re: pack and ASCII
January 13, 2012 05:30
Message ID: 4F1031FF.email@example.com
On 01/12/12 11:40, Eric Brine wrote:
> On Thu, Jan 12, 2012 at 8:58 AM, John P. Linderman (jpl)
> <firstname.lastname@example.org> wrote:
> I am *not* proposing that the behavior of "A" be changed. Too
> much code would break. However, the list of "surprises" that
> might happen when ASCII text is replaced with general unicode text
> should be mentioned.
> Yes, I realise you are only asking for a documentation change (or a
> new letter for new behaviour). Others are advocating changes to the
> existing behaviour, though. My comments are directed at them. I'm
> sorry you got caught in the middle.
> 2) I have many applications that write records of fixed length
> (measured in octets). Files of such records can easily be
> searched with binary search, and it is trivial to read the Nth
> record. If this is a fringe requirement, there's not a lot left
> to say. But I suspect I am not alone in finding this a convenient
> I fully agree you should be able to do this.
> 6) The C<<$reclen = length(pack($format))>> metaphor is just a
> lower limit on record lengths.
> Only if you both forgot to encode your text and peek at Perl
> internals. (C<print> does the latter, but will warn when it does so.)
> 7) C<<print $fh $pack-output>> may grouse about wide characters (I
> regard this as a feature, but it can nevertheless be a surprise).
> Excellent, so Perl did report the error to you. Add encode() before
> pack(), and you're good to go.
> - Eric
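For readers following points 6 and 7, here is a minimal sketch of the
difference between packing characters and packing encoded octets. The
sample string and the "A12" field width are made up for illustration;
nothing here comes from the applications discussed in the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative text: 6 characters, but 9 octets once encoded as UTF-8
# (U+00EF takes 2 octets, U+2603 takes 3).
my $text = "na\x{EF}ve\x{2603}";

# pack on the character string: "A12" pads to 12 *characters*.
my $packed_chars = pack("A12", $text);
# Encoding afterwards (roughly what a UTF-8 output layer would do)
# yields more than 12 octets.
my $octets_after = encode("UTF-8", $packed_chars);

# encode() before pack(): "A12" now pads to 12 *octets*.
my $packed_octets = pack("A12", encode("UTF-8", $text));

printf "characters packed: %d\n", length($packed_chars);   # 12
printf "octets if encoded: %d\n", length($octets_after);   # 15
printf "octets packed:     %d\n", length($packed_octets);  # 12
```

With the first form, length(pack(...)) is only a lower bound on the
record length in octets; with the second, the record length is fixed,
at the cost of an explicit encode() call.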
To quote perldoc Encode, which, in turn, is quoting "Programming Perl":

  Goal #2: Old byte-oriented programs should magically start working on
  the new character-oriented data when appropriate.
Some of the "magic" is gone if it is necessary to explicitly encode
before packing and decode after unpacking. It's too late to have "A20"
do what I meant, but we can make it relatively painless if there is a
"pack pragma" (or something similar) that turns on that behavior without
having to (otherwise) modify programs. "That behavior", just to be
clear, means interpreting the number following "A" (or "a") as the
number of octets that will be stored, with "pack" utf-encoding the data
prior to padding, and "unpack" utf-decoding the octets after stripping
off the padding. (What to do if it is necessary to truncate the encoded
octets needs thought. Truncate at the previous character boundary and
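The behavior described above could be approximated today with a pair of
helper functions. This is only a sketch: pack_A_octets and
unpack_A_octets are hypothetical names, not an existing or proposed
Perl interface, and truncating at the previous character boundary is
just one of the possible policies.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: like pack("A$width", ...), but $width counts octets
# of the UTF-8 encoding rather than characters.
sub pack_A_octets {
    my ($width, $text) = @_;
    my $octets = encode("UTF-8", $text);
    # If the encoded form is too long, drop whole characters from the end
    # so that no multi-octet sequence is ever split.
    while (length($octets) > $width) {
        chop $text;
        $octets = encode("UTF-8", $text);
    }
    return pack("A$width", $octets);    # space-pad to exactly $width octets
}

# Hypothetical inverse: strip the padding, then decode the octets.
sub unpack_A_octets {
    my ($width, $record) = @_;
    my ($octets) = unpack("A$width", $record);  # "A" strips trailing spaces/NULs
    return decode("UTF-8", $octets);
}

# 9 encoded octets do not fit in 8, so the final 3-octet character is dropped.
my $record = pack_A_octets(8, "na\x{EF}ve\x{2603}");
my $text   = unpack_A_octets(8, $record);
printf "record is %d octets, text is %d characters\n",
    length($record), length($text);
```

As with plain "A", trailing spaces in the original text do not survive
the round trip.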
Although it is now irrelevant, I'm having trouble thinking of where the
current behavior is useful. Why would one want to pad to a specified
character (not octet) length? -- jpl