perl.perl5.porters | Postings from January 2012

Re: pack and ASCII

January 12, 2012 05:58
I realize I have been less than crystal clear about "my problem", and 
that has led to confusion.  So I'll try to clarify.

1) My concern is about the properties of pack and unpack, less so with 
interfaces to C extensions (over which I have greater control).

2) I have many applications that write records of fixed length (measured 
in octets).  Files of such records can easily be searched with binary 
search, and it is trivial to read the Nth record.  If this is a fringe 
requirement, there's not a lot left to say.  But I suspect I am not 
alone in finding this a convenient format.

3) It is often useful to process these files with familiar unix tools.  
For this reason, "A" format is preferable to "a" format, because tools 
are almost always capable of dealing with blanks, but may choke on 
"\0"s.  To the extent that I want to use unix tools, I'd also like to 
avoid octets like "\n", which have special meaning to many unix tools.

4) If the text associated with "A" (or "a") is entirely ASCII (or 8 bits 
with the high bit on), then I find no "surprises".  However, unicode 
text need not have that property, and I then find the following "surprises":

5) A number following the "A" (or "a") need not be the number of octets 
in what pack produces, or unpack expects.
This makes 2) pretty much undoable (using just pack/unpack).

6) The C<<$reclen = length(pack($format))>> idiom yields only a lower 
bound on the record length in octets.

7) C<<print $fh $pack-output>> may grouse about wide characters (I 
regard this as a feature, but it can nevertheless be a surprise).
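
A minimal sketch of surprises 5) through 7), assuming the records end up 
on disk as utf8.  The one-field format, the sample strings and the 
in-memory file are made up for illustration; they are not taken from any 
real application.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $format = 'A5';                       # hypothetical one-field record
my $reclen = length pack($format, '');   # the usual record-length idiom

# ASCII text: characters and octets agree, so the record is $reclen octets.
my $ascii = pack $format, 'cafe';

# Unicode text: pack still counts 5 characters, but they need 8 octets
# once encoded as utf8 (e-acute takes 2, the smiley takes 3).
my $wide   = pack $format, "caf\x{E9}\x{263A}";
my $chars  = length $wide;
my $octets = length encode('UTF-8', $wide);

# Printing the wide record to a handle with no encoding layer warns.
my @warnings;
my $file = '';
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open my $fh, '>', \$file or die "open: $!";
    print {$fh} $wide;
    close $fh or die "close: $!";
}

printf "reclen=%d chars=%d octets=%d\n", $reclen, $chars, $octets;
print "warning: $_" for @warnings;
```

Here $reclen and $chars agree with the count in the format, while 
$octets does not, which is why seeking to N * $reclen stops working as 
soon as one record contains non-ASCII text.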

I am *not* proposing that the behavior of "A" be changed.  Too much code 
would break.  However, the list of "surprises" that might happen when 
ASCII text is replaced with general unicode text should be mentioned.  
And, to the extent that 2) should be easy to do, it would be nice to 
have a modifier or new keyletter where 5) and 6) behave as they do for 
ASCII, treating the count as the number of octets produced.  If the 
result were utf8, it shouldn't be too hard to have pack and unpack deal 
with arbitrary unicode, and utf8 avoids most of the troublesome octets 
alluded to in 3) (not a big surprise, since it was designed by Ken 
Thompson, a fan of unix tools).  print should never grouse, although 
pack and unpack might well warn about truncations, particularly those 
that truncate a multi-octet utf8 character.
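
Until something like that exists, the requested behavior can be 
approximated in plain Perl.  The following is only a sketch: 
pack_utf8_field and unpack_utf8_field are hypothetical helper names, not 
part of pack or of any module, and the choice to drop whole characters 
(with a warning) when a field overflows is an assumption about what the 
truncation policy should be.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: utf8-encode $text and fit it into exactly $width
# octets, space-padded, dropping whole characters if it is too long.
sub pack_utf8_field {
    my ($text, $width) = @_;
    my $bytes = encode('UTF-8', $text);
    warn "field truncated to $width octets\n" if length($bytes) > $width;
    while (length($bytes) > $width) {
        chop $text;                        # remove one character, not one octet
        $bytes = encode('UTF-8', $text);
    }
    return pack "A$width", $bytes;         # $bytes holds octets, so A counts octets
}

# Hypothetical inverse: strip the ASCII space padding, then decode.
sub unpack_utf8_field {
    my ($field) = @_;
    (my $bytes = $field) =~ s/ +\z//;
    return decode('UTF-8', $bytes);
}

# Every record is exactly $width octets, so record N starts at N * $width.
my $width = 8;
my $file  = join '', map { pack_utf8_field($_, $width) }
    'alpha', "caf\x{E9}", "\x{263A}\x{263A}\x{263A}";

my $third = unpack_utf8_field(substr($file, 2 * $width, $width));
```

The third sample string needs 9 octets, so it triggers the truncation 
warning and loses its last character rather than being cut mid-character.  
On a real file the same N * $width offset would be handed to seek.  As 
with "A", text that legitimately ends in blanks loses them on the way 
back.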

Apologies for all the confusion and wasted bandwidth.  -- jpl
