
Re: Robin Hood Hashing for the perl core

From: Nicholas Clark
Date: September 10, 2021 08:50
Subject: Re: Robin Hood Hashing for the perl core
Message ID: YTscYvKdmwcyTtJ1@etla.org
On Wed, Aug 04, 2021 at 02:45:30PM +0100, Dave Mitchell wrote:

> A good start would be:
> 
> $ ./perl -Ilib Porting/bench.pl --tests='/^expr::hash/' -j 8 -w /tmp/results -v perlbinary1 perlbinary2 ...
> 
> which runs cachegrind on various empty and full loops containing the
> snippets of code matched in t/perf/benchmarks and writes the results (with
> startup and loop overhead factored out) into the file /tmp/results.
> The -j 8 says run 8 benchmarks in parallel. The -v says show what's going on.

The new hashing takes more CPU cycles for the hash tests, and about the
same number everywhere else.

The tests are all for hashes with 1 or 2 keys? If so, this isn't at all
surprising, as I know that the approach trades instructions for memory
locality, but also that hashes with few keys are on average slightly larger.

> :The downsides are
> :
> :1) more CPU work in maintaining that flat array in order
> 
> Can you quantify that, or describe in more detail what is involved?

Quantifying that is hard, as every benchmark will produce something
different.

But for the existing hash, insertion and deletion are operations on a linked
list, so are O(1), and are just pointer updates.

(A linked list with potentially no cache locality, so O(1) isn't surefire
cheap)
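
To illustrate the shape of it, here is a minimal sketch in Perl - the real
implementation is C structs, so the field names here are just assumptions
for illustration:

sub chain_delete {
    my ($buckets, $idx, $key) = @_;
    my $link = \$buckets->[$idx];        # ref to the link we may rewrite
    while (my $entry = $$link) {
        if ($entry->{key} eq $key) {
            $$link = $entry->{next};     # unlink: a single pointer update
            return $entry;
        }
        $link = \$entry->{next};         # walk the chain
    }
    return undef;                        # key wasn't present
}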


For any Robin Hood hash, insertion or deletion can involve moving other keys
around, so I guess that's internally O(m) where `m` is how many keys you
need to move. So that's a memmove of a block rather than pointer updates,
*but* there's good cache locality, so it might be close to free.

("free" for the first order approximation/assumption that cache misses
dominate CPU throughput)
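
To make that concrete, here is a minimal Robin Hood insert sketched in
Perl - the slot layout ([key, value, probe distance] in a power-of-two
array) is an assumption for illustration, not what the branch does in C:

sub rh_insert {
    my ($slots, $hash, $key, $value) = @_;
    my $mask  = @$slots - 1;             # slot count is a power of two
    my $idx   = $hash & $mask;           # the entry's ideal slot
    my $entry = [$key, $value, 0];       # [key, value, probe distance]
    # Assumes the table is never full - a real implementation grows first.
    while (defined(my $resident = $slots->[$idx])) {
        if ($resident->[2] < $entry->[2]) {
            # The resident is closer to its ideal slot than we are: take
            # its slot, and carry it onwards to re-place it. This shuffling
            # of neighbours is what becomes a memmove() in C, and is the
            # source of the extra data writes.
            ($slots->[$idx], $entry) = ($entry, $resident);
        }
        $idx = ($idx + 1) & $mask;       # linear probe to the next slot
        ++$entry->[2];
    }
    $slots->[$idx] = $entry;
}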

> Some of my maths code involves hashes with order of 10^8 keys, typically
> "seen" hashes (that let me do X if I haven't seen this key before).
> Typically for those the only interaction with the hash is a single
> line like:
>   do_X($obj) unless $seen{$obj->serialize}++;
> 
> In particular, it sounds like ABH is optimized for cases where searches
> are much more common than inserts/deletes, so I think "seen" hashes
> in workloads where duplication is rare would be close to worst-case.

Yes. I think your reasoning is spot on.

So, after some experimenting, here is a benchmark tuned to flatter it:

#!perl -w
use strict;

my $size = shift;
$size ||= 1e5;

my %hash;

if ($size < 0) {
    # A negative argument means "pre-grow the hash": assigning to keys()
    # pre-allocates the buckets, so the loops below don't time hash splits.
    $size = -$size;
    keys %hash = $size;
}

# Avoid a lot of scope entries/exits so that we're mostly timing what we care
# about

# One pass of inserts...
++$hash{$_ * 3}
    for 1 .. $size;

# ...then three passes of lookups, with a different key spacing so that most
# of them miss
$hash{$_ * 5} || 0
    for 1 .. $size;

$hash{$_ * 5} || 0
    for 1 .. $size;

$hash{$_ * 5} || 0
    for 1 .. $size;

__END__


Running that with -1e7 (10 million keys, hash pre-grown):
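
The output below is cachegrind's summary, from an invocation along these
lines (the script name is assumed here):

$ valgrind --tool=cachegrind ./perl -Ilib hashbench.pl -1e7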

blead at 75595dd289

==8755== I   refs:      36,259,338,443
==8755== I1  misses:            34,395
==8755== LLi misses:             7,111
==8755== I1  miss rate:           0.00%
==8755== LLi miss rate:           0.00%
==8755==
==8755== D   refs:      16,767,897,454  (11,109,054,413 rd   + 5,658,843,041 wr)
==8755== D1  misses:       215,912,461  (   190,745,837 rd   +    25,166,624 wr)
==8755== LLd misses:       206,452,293  (   183,566,167 rd   +    22,886,126 wr)
==8755== D1  miss rate:            1.3% (           1.7%     +           0.4%  )
==8755== LLd miss rate:            1.2% (           1.7%     +           0.4%  )
==8755==
==8755== LL refs:          215,946,856  (   190,780,232 rd   +    25,166,624 wr)
==8755== LL misses:        206,459,404  (   183,573,278 rd   +    22,886,126 wr)
==8755== LL miss rate:             0.4% (           0.4%     +           0.4%  )


ABH with a load factor of 0.675 (effectively what the branch now has as HEAD):

==10595== I   refs:      41,940,009,601
==10595== I1  misses:            33,078
==10595== LLi misses:             6,927
==10595== I1  miss rate:           0.00%
==10595== LLi miss rate:           0.00%
==10595==
==10595== D   refs:      18,636,752,812  (12,272,777,479 rd   + 6,363,975,333 wr)
==10595== D1  misses:       222,023,137  (   182,141,483 rd   +    39,881,654 wr)
==10595== LLd misses:       181,489,351  (   145,139,062 rd   +    36,350,289 wr)
==10595== D1  miss rate:            1.2% (           1.5%     +           0.6%  )
==10595== LLd miss rate:            1.0% (           1.2%     +           0.6%  )
==10595==
==10595== LL refs:          222,056,215  (   182,174,561 rd   +    39,881,654 wr)
==10595== LL misses:        181,496,278  (   145,145,989 rd   +    36,350,289 wr)
==10595== LL miss rate:             0.3% (           0.3%     +           0.6%  )


Lots more CPU instructions. More CPU cache references, but far fewer
last-level read misses.

Deliberate choices that favour it here:

1) pre-grow the hash
2) more reads than writes
3) lexical hash, so we're also timing the hash being freed


Note that there's quite a lot more data writing - that's all the memmove()
calls.


> Having read through the comments in MoarVM, I find some of it confusing,
> but it sounds like the trade-off could be tweaked by controlling the
> number of extra hash bits stored. Is that something we could consider
> making controllable on a per-hash basis?

Yes. Things that can be tuned:

1) load factor
2) extra hash bits stored (if any)
3) strategy for "do I need to grow" - try to avoid it by reducing the extra
   hash bits, or just bite the bullet and allocate more memory

and

1) load factor could change as a function of size (see the sketch below)
2) as could the extra hash bits
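
For instance, something like this hypothetical hook (the thresholds are
invented, purely to show the shape of "load factor as a function of size"):

sub max_load_factor {
    my ($key_count) = @_;
    return 0.75 if $key_count < 1024;   # small hash: favour density
    return 0.675;                       # large hash: favour shorter probes
}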


In playing with that benchmark, I found that a fixed load factor of 0.675
seemed to be better than either 0.75 (what I started with) or 0.5.

Also, 5 extra hash bits *seemed* to be optimal. 6, 4, and 0 all produced
more cache misses.

(0 would mean that the extra hash bits aren't used, hence the code that
supports them could be deleted. That would mean fewer CPU instructions
generally, but the testing suggested that doing this hurts the data cache
more)
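
To make the "extra hash bits" idea concrete, here is a sketch in Perl. The
byte layout is an assumption for illustration - see the MoarVM comments for
the real scheme:

# One metadata byte per slot: probe distance in the high bits, 5 more bits
# of the key's hash in the low bits. On lookup, a one-byte compare rejects
# almost every wrong slot without following the pointer to the full key.
# (With 5 extra bits only 3 bits remain for the probe distance, which is
# why growing the hash and reducing the extra bits trade off against each
# other.)
my $EXTRA_BITS = 5;

sub metadata_byte {
    my ($hash, $probe_distance) = @_;
    my $extra = $hash & ((1 << $EXTRA_BITS) - 1);   # low 5 bits of the hash
    return ($probe_distance << $EXTRA_BITS) | $extra;
}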


Nicholas Clark


