develooper Front page | perl.perl5.porters | Postings from August 2021

Re: Robin Hood Hashing for the perl core

Thread Previous | Thread Next
From:
hv
Date:
August 3, 2021 16:35
Subject:
Re: Robin Hood Hashing for the perl core
Message ID:
202108031554.173FsuP11755@crypt.org
Nicholas Clark <nick@ccl4.org> wrote:
:I'd like to suggest that we switch the core's hashtable implementation from
:the current array of linked-lists (which is not friendly with CPU caches) to
:"open addressing" hashing (which is). More specifically, to "Robin Hood
:Hashing", and most specifically to the implementation that I wrote for
:MoarVM (and Perl 5) - "A Better Hash".

How separable is it, would it be possible to add the new hash algorithm
separately from the switch from linked lists? The latter sounds like it
necessarily needs a big-bang approach, but we have prior experience in
supporting alternate hash algorithms.

:The downsides are
:
:1) more CPU work in maintaining that flat array in order

Can you quantify that, or describe in more detail what is involved?
Some of my maths code involves hashes with order of 10^8 keys, typically
"seen" hashes (that let me do X if I haven't seen this key before).
Typically for those the only interaction with the hash is a single
line like:
  do_X($obj) unless $seen{$obj->serialize}++;

In particular, it sounds like ABH is optimized for cases where searches
are much more common than inserts/deletes, so I think "seen" hashes
in workloads where duplication is rare would be close to worst-case.

Having read through the comments in MoarVM, I find some of it confusing,
but it sounds like the trade-off could be tweaked by controlling the
number of extra hash bits stored. Is that something we could consider
making controllable on a per-hash basis?

:I think we can patch them all. Several are just truth tests - I propose
:a new macro HvIS_EMPTY(hv) to replace them, which we also add to ppport.h

I agree with leonerd that we should do this one right away.

:Others just iterate over all the hash buckets - I propose we add
:hv_foreach() to core and ppport.h to handle these cleanly.

Does that cover all remaining uses of HvARRAY that you found? Ie could
we provide HvIS_EMPTY and hv_foreach and then deprecate HvARRAY?

I'm nervous that there are enough uses of the various affected structures
on CPAN that it seems likely there'll be yet more in the DarkPAN; I think
we should aim to mitigate as much as possible in advance with a combination
of new interfaces and deprecations.

Additionally there is a fair amount of discussion in the notes about
randomization to avoid algorithmic complexity attacks, but I see none
about the converse - turning off the randomization to allow reproducible
behaviour, eg for debugging. I think that's also important, and also
merits some commentary.

Hugo

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About