Re: pack and ASCII
January 12, 2012 05:58
Message ID: 4F0EE708.email@example.com
I realize I have been less than crystal clear about "my problem", and
that has led to confusion. So I'll try to clarify.
1) My concern is about the properties of pack and unpack, less so with
interfaces to C extensions (over which I have greater control).
2) I have many applications that write records of fixed length (measured
in octets). Files of such records can easily be searched with binary
search, and it is trivial to read the Nth record. If this is a fringe
requirement, there's not a lot left to say. But I suspect I am not
alone in finding this a convenient format.
3) It is often useful to process these files with familiar unix tools.
For this reason, "A" format is preferable to "a" format, because tools
are almost always capable of dealing with blanks, but may choke on
"\0"s. To the extent that I want to use unix tools, I'd also like to
avoid octets like "\n", which have special meaning to many unix tools.
4) If the text associated with "A" (or "a") is entirely ASCII (or 8-bit
characters with the high bit set), then I find no "surprises". However,
Unicode text need not have that property, and I then find the following
"surprises":
5) A number following the "A" (or "a") need not be the number of octets
in what pack produces, or unpack expects.
This makes 2) pretty much impossible to do (using just pack/unpack).
6) The C<<$reclen = length(pack($format))>> idiom gives only a lower
bound on the record length in octets.
7) C<<print $fh $pack-output>> may grouse about wide characters (I
regard this as a feature, but it can nevertheless be a surprise).
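
To make 5) through 7) concrete, here is a minimal sketch. It assumes Perl 5.10 or later (where pack works on characters rather than octets) and assumes the records would be written to disk as UTF-8; the three Greek letters are just an arbitrary example of non-ASCII text.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $format = "A5";

# The usual idiom: derive the record length from the format alone.
my $reclen = length(pack($format));              # 5

# ASCII text: 5 characters, and 5 octets once written out.
my $ascii = pack($format, "abc");

# Three Greek letters (U+03B1..U+03B3), each 2 octets in UTF-8,
# padded by "A5" with 2 spaces to make 5 *characters*.
my $wide   = pack($format, "\x{3b1}\x{3b2}\x{3b3}");
my $chars  = length($wide);                      # 5 characters
my $octets = length(encode("UTF-8", $wide));     # 8 octets (3*2 + 2)

# Surprise 7: printing the wide string to a handle that has no
# encoding layer triggers a "Wide character in print" warning.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buf = "";
    open(my $fh, ">", \$buf) or die "open: $!";
    print {$fh} $wide;
    close($fh);
}

printf "reclen=%d chars=%d octets=%d\n", $reclen, $chars, $octets;
print "print warned: $_" for @warnings;
```

So the count after "A" fixes the number of characters, while the number of octets that end up in the file depends on the text.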
I am *not* proposing that the behavior of "A" be changed. Too much code
would break. However, the list of "surprises" that might happen when
ASCII text is replaced with general unicode text should be mentioned.
And, to the extent that 2) should be easy to do, it would be nice to
have a modifier or new keyletter where 5) and 6) behave as they do for
ASCII, treating the count as the number of octets produced. If the
result were utf8, it shouldn't be too hard to have pack and unpack deal
with arbitrary unicode, and utf8 avoids most of the troublesome octets
alluded to in 3) (not a big surprise, since it was designed by Ken
Thompson, a fan of unix tools). print should never grouse, although
pack and unpack might well warn about truncations, particularly those
that truncate a multi-octet utf8 character.
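
Until something like that exists, the behavior can be approximated in user code. The following is only a sketch under stated assumptions, not a proposed implementation of the modifier: pack_octets, unpack_octets, nth_record and $RECLEN are hypothetical names invented for this example, the core Encode module does the UTF-8 conversion, and an in-memory file stands in for a real one.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $RECLEN = 8;    # record length in octets (arbitrary for this sketch)

# Hypothetical helper: like "A$n", but $n counts octets of UTF-8 output.
sub pack_octets {
    my ($n, $text) = @_;
    my $bytes = encode("UTF-8", $text);
    if (length($bytes) > $n) {
        warn "pack_octets: truncating to $n octets\n";
        my $cut = $n;
        # If the first dropped octet is a continuation octet (10xxxxxx),
        # the cut falls inside a character; back up to its lead octet.
        $cut-- while $cut > 0
            && (ord(substr($bytes, $cut, 1)) & 0xC0) == 0x80;
        $bytes = substr($bytes, 0, $cut);
    }
    return $bytes . (" " x ($n - length($bytes)));   # blank-pad like "A"
}

# Hypothetical inverse: decode UTF-8 and strip the blank padding.
sub unpack_octets {
    my ($bytes) = @_;
    my $text = decode("UTF-8", $bytes);
    $text =~ s/ +\z//;
    return $text;
}

# Fetch record number $n (counting from 0) with a single seek, as in 2).
sub nth_record {
    my ($fh, $n) = @_;
    seek($fh, $n * $RECLEN, 0) or die "seek: $!";
    my $got = read($fh, my $rec, $RECLEN);
    die "short read\n" unless defined $got && $got == $RECLEN;
    return unpack_octets($rec);
}

# Write three records; the output is plain octets, so print does not grouse.
my $file = "";
open(my $out, ">", \$file) or die "open: $!";
print {$out} pack_octets($RECLEN, $_)
    for "abc", "\x{3b1}\x{3b2}\x{3b3}", "na\x{ef}ve";
close($out);

open(my $in, "<", \$file) or die "open: $!";
my $second = nth_record($in, 1);    # the three Greek letters
```

Every record is exactly $RECLEN octets, so the file length is a multiple of the record length and binary search works again. Truncation backs up to a character boundary and warns rather than splitting a multi-octet character.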
Apologies for all the confusion and wasted bandwidth. -- jpl