
[PATCH] Tie-File

From: Tels
Date: March 1, 2003 11:21
Subject: [PATCH] Tie-File
Message ID: perl.perl5.porters-72604@nntp.perl.org

Moin,

Some weeks ago I did some work on Tie-File, but apart from some initial
response from mjd, I never got an answer to the patches I sent in. I'll try
again here, since I would like to see these fixes applied. And probably
somebody else has comments and/or/xor finds this useful.

Tie-File maps records (which by default are lines) to a Perl array, which
makes it much easier to access record/line-based files. Its manpage states
that:

>The file is I<not> loaded into memory, so this will work even for
>gigantic files.
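
For reference, basic usage looks like this (the file name here is made up):

    use Tie::File;

    tie my @lines, 'Tie::File', 'data.txt' or die "tie failed: $!";
    print $lines[0], "\n";      # first record (line) of the file
    $lines[42] = "replaced";    # rewrite the 43rd record in place, on disk
    untie @lines;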

In practice, however, that claim holds only for some definitions of
"gigantic". The main reasons are speed and memory usage, and in both cases
Perl itself basically gets in the way of the good intentions.

Here is why: if you have very few, but long, records, it works quite well.
However, one common case is text files, whose records are mostly shorter
than 80 bytes (chars). My case was even more degenerate: I wanted to access
dictionary files, which typically have _lots_ of records, each of them very
small, around 8 bytes on average. In these cases Tie::File is rather
sub-optimal.

Since Tie::File carries a heavy per-record overhead in both speed and memory
usage, it was not very useful for me even on a 10 megabyte file. And 10 MB
doesn't count as gigantic in my book :)

Here are the basic defects and how my patch (tries to) fix(es) them:
 
* Speed: Whenever you access the size of the array, Tie::File goes through
  the entire file and counts all records. It does so, however, in a very
  inefficient, iterative manner, with lots of bells, whistles, side orders
  and subroutine calls.
  My patch turns that into a tight loop. It will not show any speed
  advantage if you have, for instance, only 2 records of 2 MB each, but if
  you have more than 1000 records, you _will_ see speedups. For a 1.3
  million record file the time for FETCHSIZE drops from 1m20s to under 8
  seconds on an 800 MHz PIII mobile.
* Speed2: I also streamlined some of the hot code, to give speedups when
  accessing records, filling the cache etc. Although the adddemo.pl script
  from mjd mainly tests write access to the file, it shows some speedups
  after this patch, too.
* Memory: Each record has an offset, which is stored in memory. For some
  reason, tell() seems to return a string value (or float?). Wrapping it in
  int() brings the memory overhead down from 25 to 21 bytes per record on
  my 32-bit Perl (a small measurement sketch follows this list). That means
  "gigantic" got a bit more gigantic, since you can now tie a bigger file
  without running out of memory as fast.
* Memory 2: Each cache entry carries a hefty 310-byte overhead. I couldn't
  change this for now, but it needs at least documenting. The reason this
  is problematic is that the cache only tracks the size of the entries in
  it, not the overhead per entry nor the number of entries. A lot of small
  records will fit into the default 2 MB cache, letting you exhaust system
  memory with the overhead alone: 2 MB worth of 8-byte records is roughly
  260,000 entries, and at 310 bytes each that is about 80 MB of overhead.
  (In my case, the 2 MB cache carried nearly 90 MB of overhead with it...
  and that's for a mere 10 MB dictionary file!)
* Small stuff:
  * A routine to access the offset of a record. (Tie-File knows this, so it
    could also tell us...) This will be very useful for one application of
    mine. It still needs docs and tests - however, I never got an answer on
    whether to include it or not.
  * INSTALLDIRS => perl
  * some small cleanups and doc stuff
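
To see what int() buys, here is a quick measurement with Devel::Size (the
file name is made up, and the exact byte counts depend on your perl build):

    use Devel::Size qw(size);

    open my $fh, '<', 'words.txt' or die "open: $!";
    my $line   = <$fh>;           # read one record so tell() returns a real offset
    my $plain  = tell $fh;        # may come back with string/float slots attached
    my $as_int = int(tell $fh);   # forced to a plain integer scalar
    printf "plain: %d bytes, int: %d bytes\n", size(\$plain), size(\$as_int);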
 
For the memory issue we could use pluggable modules, written in XS, that
use a much tighter memory allocation. I would be willing to work on this,
e.g. at least write one for the record offsets (4 bytes (or maybe 8 on 64
bit) of overhead per record should be enough, instead of 21) and make the
changes to Tie::File that allow it to use a third-party module for the
offset storage and/or the cache entries.
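
To sketch the idea in pure Perl first (the package name and its methods are
hypothetical, not part of the patch; a real version would be XS):

    package PackedOffsets;
    # One 32-bit offset per record, packed into a single string; a 64-bit
    # build would use pack 'Q' and an element size of 8 instead.
    sub new   { bless { buf => '' }, shift }
    sub add   { my ($self, $off) = @_; $self->{buf} .= pack 'N', $off }
    sub get   { my ($self, $i) = @_; unpack 'N', substr($self->{buf}, $i * 4, 4) }
    sub count { length($_[0]{buf}) / 4 }

Each stored offset then costs 4 bytes plus a small constant for the string
itself, instead of a complete Perl scalar per record.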

I didn't add testcases, mainly because:

* The testsuite is already very extensive, and a lot of thought seems to
  have gone into it - mjd knows way more about the Tie-File innards than I
  do.
* None of my changes triggered a testcase failure (which means it either
  works or is undertested, but see above),
  * except for inlining all of the stuff into FETCHSIZE - and this now
    passes all tests, so it probably does the Right Thing[tm](r).
    The reason I did this inlining becomes clear when you read the comment:

+  # _read_record fixes the last record, so we read all records manually
+  # and then fix the last one, too. The code below does exactly the same as
+  # the next three lines, except that it is about 5 times faster:
+  #while ( defined $self->_read_record()) {
+  #  # int() saves us memory here
+  #  push @OFF, int(tell $fh);
+  #}

    Which is then followed by the fast, but big and ugly version #:~)=
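
The shape of the fast version is essentially a bare read loop; roughly like
this (a sketch, not the literal patch code - the internal field names are
assumptions):

    # assumes the filehandle lives in $self->{fh} and the record separator
    # in $self->{recsep}; offsets are collected in @OFF as above
    my $fh = $self->{fh};
    local $/ = $self->{recsep};
    while (<$fh>) {
        push @OFF, int(tell $fh);   # no per-record method calls
    }
    # ...then fix up the last record, as _read_record() would have done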

There is at least one place where we could improve on some (corner?) cases:

   return 1 if exists $self->{deferred}{$n};
-  $self->_fill_offsets_to($n);  # I think this is unnecessary
+  # $self->_fill_offsets_to($n);  # I think this is unnecessary
+  # Tels 2003-02-07: Yes, it is, FETCHSIZE will fill all the offsets
   $n < $self->FETCHSIZE;

Imagine you have read the first 10 records and then want to know whether
the 20th record exists. The old code would first fill the offsets up to the
20th, then fill _all_ the remaining offsets (indirectly via FETCHSIZE). The
new code skips the _fill_offsets_to(), which is slightly better. However,
if your file has some gazillion records, FETCHSIZE still costs a lot of
time and memory. It might be better to use _fill_offsets_to($n) and then
determine the return value of the exists without using FETCHSIZE at all.
We could also return true right away when we already have the offset for
the $n'th record, without _fill_offsets_to($n) or FETCHSIZE, which would
make it faster in quite a few cases. However, I am not sure whether this
is really important.
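
A sketch of that variant (not part of the patch, and it assumes the known
offsets live one-per-record in $self->{offsets}):

    sub EXISTS {
        my ($self, $n) = @_;
        return 1 if exists $self->{deferred}{$n};
        # if we already know record $n's offset, it certainly exists
        return 1 if $n < $#{$self->{offsets}};
        $self->_fill_offsets_to($n);        # read no further than record $n
        return $n < $#{$self->{offsets}};   # offset known => record exists
    }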

Anyway, attached is a gz file containing:

* Tie-File-0.94.tar.gz - in all its glory
* a patch to bring 0.93 to 0.94 - adapting this for the core is left as an
  exercise for the reader :)
* some profiling and benchmark outputs

Please have fun,

Tels




-- 
 http://www.notcpa.org/ You still can run any code on your CPU. How long?
 Signed on Sat Mar  1 20:20:01 2003 with http://bloodgate.com/tels.asc

 perl -MDev::Bollocks -le'print Dev::Bollocks->rand()'
 authoritatively facilitate revolutionary e-business

