From: Risanecek
Date: April 13, 2010 08:32
Subject: A new attempt/idea to lower perl memory requirements (significantly?)
Message ID: z2ndda06d3e1004130832x94e49f90sb7d76686317b007b@mail.gmail.com

Hi everybody,

Perl's memory requirements are extraordinary. I use that euphemism
because words like "insane" could stop important people from reading
any further. ;-)

perldebguts has a nice characterization:

"Perl is a profligate wastrel when it comes to memory use. There is a
saying that to estimate memory usage of Perl, assume a reasonable
algorithm for memory allocation, multiply that estimate by 10, and
while you still may miss the mark, at least you won't be quite so
astonished."

And the factor of 10 probably holds for 32-bit architectures only:

"Assume that an integer cannot take less than 20 bytes of memory, a
float cannot take less than 24 bytes, a string cannot take less than
32 bytes (all these examples assume 32-bit architectures, the result
are quite a bit worse on 64-bit architectures)."

...
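
For what it's worth, those per-value numbers are easy to check on any
given build. Here is a rough sketch using Devel::Size from CPAN (the
exact byte counts will of course vary with perl version and
architecture):

  use Devel::Size qw(size);

  my $int    = 42;
  my $float  = 3.14;
  my $string = "a not very long string";

  # size() reports the bytes used by the scalar itself
  printf "integer: %d bytes\n", size(\$int);
  printf "float:   %d bytes\n", size(\$float);
  printf "string:  %d bytes\n", size(\$string);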

Let me describe my situation:

We use Perl in our company for a core product, a natural language
processing/understanding software suite. There are multiple, huge
lexica that have to be loaded into memory. Tying them to a DB is not
an option because of the latency penalty this involves (yes, even on
SSD RAID arrays). We have been battling with Perl's memory
requirements since 2002, and fortunately both our programming efforts
and the increasing capacity of available hardware have allowed us to
cope with them.
Unfortunately, during the last few months our data mining/acquisition
team made excellent progress, and because of that great progress we
got burned elsewhere. Until a few months ago we ran the software suite
on 32-bit VMs hosted on "big iron"; these VMs were running at their
memory limit, somewhere around 3.8 GB per VM. The lexica have grown,
and new languages were added, so there is no way to keep them on one
32-bit machine anymore.

Of course the suite runs on 64-bit too, and migration is not a
problem, but then it takes nearly double the space it took on the
32-bit VMs. Sure, we can solve that problem, because getting a 64-bit
machine with 96 or even 192 GB of RAM is not prohibitive anymore. For
R&D purposes even a 512 GB box is acceptable, but such machines do get
prohibitive when postulated as installation requirements at the
customer's site.

My time with the C programming language on the PC architecture is a
thing of the past, and I do not know the perl implementation in enough
detail to immediately suggest "turn that knob and push that button",
but I have been mulling over some ideas on how to lower Perl's memory
requirements for a long time now, and I would like to present them
here:

Our data are organized in a structure that could be described as a
hash(1) of hashes(2) of lists of hashes of ... The biggest hashes are
the hashes(2) (Devel::Size reports something like 700 MB per hash on
32-bit), because each contains the lexicon for a given language.
Imagine some 50 to 500 thousand key/value pairs where the value itself
is a nontrivial data structure.
Despite this size, no single hash(2) will be bigger than 4 GB on a
32-bit architecture for the foreseeable future(TM). The "problem" is
that we have started to support the full ISO 639-3 language set and
now have more than 500 languages. OK, most lexica are pretty tiny, but
getting all of them onto a single 64-bit machine would need more than
64 GB of RAM.
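
To make the shape of the data concrete, here is a toy sketch of the
layout (the keys and field names are made up for illustration; the
real lexica are of course far larger), together with the Devel::Size
call that produces numbers like the 700 MB above:

  use Devel::Size qw(total_size);

  # hash(1): language code => hash(2): headword => list of entry hashes
  my %languages = (
      deu => {
          'haus' => [ { pos => 'noun', senses => [ 'house' ] } ],
          'haut' => [ { pos => 'noun', senses => [ 'skin'  ] } ],
      },
      eng => {
          'house' => [ { pos => 'noun', senses => [ 'building', 'dynasty' ] } ],
      },
  );

  # total_size() walks the whole structure; this is the call that
  # reports ~700 MB for a real hash(2).
  printf "hash(2) for deu: %d bytes\n", total_size( $languages{deu} );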

Now, there may be no such thing as a free lunch, but we could very
well live with some constraints on these hashes if they promised a
lower memory footprint:

Idea 1: Wouldn't it be possible to mark a hash(2) upon creation, for
example with attributes, in a way that asserts certain limits and thus
allows the interpreter to optimize the internal data structure? More
specifically: if I were to write something like

my %hash :32bit = ();

then this could be interpreted as a promise that the hash will never
grow over a 32-bit limit, and the interpreter - albeit on a 64-bit
architecture - would use just 32-bit pointers for this hash. The idea
comes from old experience programming microcontrollers in assembler:
eons ago, Motorola 68HC11 assemblers would automatically use 8-bit
(relative) addressing if the target of a branch was within a range of
+/- 127 bytes or so. Having 256 bytes of EEPROM was common in those
days. ;-)
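
Current Perl already has the attribute machinery to accept such a
declaration; it just has no way to act on it. As a strawman, here is a
sketch of how the promise could at least be captured today with a
MODIFY_HASH_ATTRIBUTES handler. Since attribute names apparently have
to be plain identifiers (they cannot start with a digit), I spell the
attribute :ptr32 here; %hash_promises is a hypothetical registry, and
of course nothing in this sketch actually shrinks anything:

  use strict;
  use warnings;

  my %hash_promises;    # hypothetical registry: hash address => promises

  # The handler has to live in the package in which the my declaration
  # occurs (plain main here). Returning the empty list means "all
  # attributes recognized"; anything returned is reported as an error.
  sub MODIFY_HASH_ATTRIBUTES {
      my ($package, $ref, @attrs) = @_;
      my @unknown;
      for my $attr (@attrs) {
          if ($attr eq 'ptr32') {
              $hash_promises{ 0 + $ref }{$attr} = 1;  # remember the promise
          }
          else {
              push @unknown, $attr;
          }
      }
      return @unknown;
  }

  # The declaration I would like the interpreter to exploit one day:
  my %lexicon :ptr32;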


Idea 2: Although the values of hash(2) are quite complex, in a
computational sense they are quite tame: they are acyclic. Could the
assertion that a data structure (hash, list) does not contain any
cyclic references also be used to lower the memory footprint?

my %hash :32bit :nocycle = ();
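
For what it's worth, that promise is cheap to verify at load time. A
naive depth-first check along these lines (is_acyclic is a made-up
name, not an existing API) is enough to convince oneself that a
structure really is cycle-free:

  use Scalar::Util qw(refaddr reftype);

  # Return true if the structure reachable from $node contains no
  # cycles. %$on_path holds only the refs on the current descent path,
  # so shared (diamond-shaped) but acyclic substructures are not
  # flagged as cycles.
  sub is_acyclic {
      my ($node, $on_path) = @_;
      $on_path ||= {};
      return 1 unless ref $node;

      my $addr = refaddr($node);
      return 0 if $on_path->{$addr}++;

      my $type = reftype($node);
      my @children = $type eq 'HASH'  ? values %$node
                   : $type eq 'ARRAY' ? @$node
                   :                    ();

      for my $child (@children) {
          return 0 unless is_acyclic($child, $on_path);
      }

      delete $on_path->{$addr};
      return 1;
  }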

Idea 3: In a similar fashion, there are more restrictions that apply
to our data:
   * The "atoms" stored deep down below are always strings. They will
never ever ever ever... be treated as integers or floats.
   * These "atoms" (let's call them leaves) will also never ever...
contain references to any data structures, because the cross-references
are symbolic anyway (see the sketch after this list).
   * If I understand
http://cpansearch.perl.org/src/RURBAN/illguts-0.21/index.html
correctly, the refcount is 32 bits wide independent of the
architecture? I'm pretty sure there is no single piece of data in our
software that would need a refcount bigger than 65536; maybe even with
8-bit refcounts everything would still work.
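
Just to illustrate what these restrictions mean in practice, here is a
small sketch: a walker that dies if any leaf turns out to be a
reference (walk_leaves is a made-up name), plus the core B incantation
for peeking at an SV's refcount:

  use B ();

  # Walk a hash-of-hashes/lists structure; die if any leaf is a
  # reference instead of a plain string.
  sub walk_leaves {
      my ($node) = @_;
      if    (ref $node eq 'HASH')  { walk_leaves($_) for values %$node }
      elsif (ref $node eq 'ARRAY') { walk_leaves($_) for @$node }
      elsif (ref $node)            { die "unexpected reference: " . ref $node }
      # otherwise: a plain string leaf, which is all we ever store
  }

  # Refcount of an individual SV, for the curious:
  my $leaf = "house";
  printf "refcount of \$leaf: %d\n", B::svref_2object(\$leaf)->REFCNT;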

You get the idea... :-) It's a trade-off between restrictions on Perl
data structure features and their memory requirements.

Implementing things like these would have a very good chance of being
supported by our company (financially, as well as with periodic "good
luck" emails).

Please comment. Feasibility? 5.14 anyone?


Richard
