
Re: Robin Hood Hashing for the perl core

From: demerphq
Date: August 6, 2021 09:27
Subject: Re: Robin Hood Hashing for the perl core
Message ID: CANgJU+URvqaoj7HhPDsGR8e_a8Dr8RgAN=kn5sW+Z_wGeVxb6g@mail.gmail.com
On Tue, 3 Aug 2021 at 16:28, Nicholas Clark <nick@ccl4.org> wrote:

> Sorry that this message is about 4 months late for April Fool's Day.
>
> So, this doesn't really fit the "elevator pitch" shape because I don't
> think
> that proposal would really fit the RFC shape either.
>
>
> I'd like to suggest that we switch the core's hashtable implementation from
> the current array of linked-lists (which is not friendly with CPU caches)
> to
> "open addressing" hashing (which is). More specifically, to "Robin Hood
> Hashing", and most specifically to the implementation that I wrote for
> MoarVM (and Perl 5) - "A Better Hash".
>

Wow. I had looked into this once, and got discouraged.

I am super happy to hear that you did this and I am in favour.

Nicholas, if there is anything I can do to help, please reach out to me at my
work email.
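
(For anyone on the list who hasn't met the name before: the "Robin Hood" rule
is roughly the sketch below. This is just my toy illustration of the general
technique, not Nicholas's ABH code; fixed-size table, no resize handling,
duplicate keys ignored, and it assumes the table never fills up.)

    #include <stddef.h>
    #include <stdint.h>

    #define NSLOTS 16   /* power of two, fixed for the sketch */

    struct slot {
        uint64_t hash;
        void    *value;
        int      used;
    };

    /* Robin Hood insertion: every entry has a "probe distance" (how far it
     * sits from its ideal bucket). While probing, if the incoming entry is
     * further from home than the resident entry, they swap - rob the rich -
     * so no entry ever gets arbitrarily unlucky. */
    static void rh_insert(struct slot table[NSLOTS], uint64_t hash, void *value)
    {
        struct slot incoming = { hash, value, 1 };
        size_t idx = (size_t)(hash & (NSLOTS - 1));
        size_t dist = 0;                       /* incoming's probe distance */

        for (;;) {
            if (!table[idx].used) {            /* empty slot: take it */
                table[idx] = incoming;
                return;
            }
            size_t home = (size_t)(table[idx].hash & (NSLOTS - 1));
            size_t resident_dist = (idx - home) & (NSLOTS - 1);
            if (resident_dist < dist) {        /* incoming is poorer: swap */
                struct slot tmp = table[idx];
                table[idx] = incoming;
                incoming = tmp;
                dist = resident_dist;
            }
            idx = (idx + 1) & (NSLOTS - 1);    /* linear probe */
            ++dist;
        }
    }

Keeping probe lengths short and evenly distributed is what makes the flat
array layout pay off on lookups.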


> Why...
>
> For the simple explanation, you need to know about "load factor".
>

When you say this I assume you mean "to understand why open hashing is
preferable you need to know about load factor". That stray pesky comma...
:-)


>
> The top level of all hashtable implementations is an array of something
> (pointers or structure) - an array that gets resized "as needed", but when
> that happens is a trade off.
>
> To make things work, you can't use all the entries in this array. Some have
> to be empty. The proportion that are full is your load factor. It can't go
> above 100%. If your load factor gets too high, you resize the hash.
> (Typically you double its size. As there are the same number of keys, your
> load factor halves)
>
> Assume a typical load factor of 50%. This means that half your top level
> array elements are empty. Imagine that each used element is next to an
> empty
> element - it makes these diagrams easier:
>

Just for the record: in modern perls we hash split only on collision and when
the load factor is about 66%; prior to that change we used to split when the
load factor reached 100%. (100% is considered very high.)
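
(Schematically, with made-up names rather than the real hv.c macros, the
decision is something like this:)

    #include <stddef.h>

    /* Illustrative only: hypothetical helpers, not the core implementation. */
    struct toy_hash {
        size_t capacity;   /* number of top-level slots (a power of two) */
        size_t used;       /* number of keys currently stored            */
    };

    /* Split (double) once the table is roughly two-thirds full:
     * used / capacity > 2/3, kept in integer arithmetic. */
    static int toy_should_split(const struct toy_hash *h)
    {
        return 3 * h->used > 2 * h->capacity;
    }

    static void toy_maybe_grow(struct toy_hash *h)
    {
        if (toy_should_split(h))
            h->capacity *= 2;   /* same number of keys, so the load factor halves */
    }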


>
> The current hash implementation is an array of pointers to linked lists of
> HE structs, which in turn point to the key and value:
>
>   |      |
>   +------+    +------+      +------+
>   | HE * | -> | Key  | ---> | HEK  |
>   +------+    +-    -+      +------+
>   | NULL |    | Val  | -+
>   +------+    +-    -+  |   +------+
>   |      |    | NULL |  +-> | SV   |
>               +------+      +------+
>
>                 ^^^^ points to next entry in the list if needed.
>
> The proposed implementation "open addressing" eliminates the linked lists
> by
> storing HE structs in a flat array:
>
>   |      |
>   +------+      +------+
>   | Key  | ---> | HEK  |
>   +-    -+      +------+
>   | Val  | -+
>   +------+  |   +------+
>   | NULL |  +-> | SV   |
>   +-    -+      +------+
>   | NULL |
>   +------+
>   |      |
>
> This
>
> 1) reduces the overhead by (on average) 1 pointer per hash entry,
> 2) eliminates one pointer dereference (and potential CPU cache miss)
> 3) keeps all the hash close in memory (likely more CPU cache hits)
>
> The downsides are
>
> 1) more CPU work in maintaining that flat array in order
> 2) slightly more memory allocation for hashes under about 50 entries
>    (but usually that memory is allocated but not needed)
>

One additional downside of open addressing is that it does not allow
constructing a hash with a load factor above 100%. I don't think we care, but
it's an important point. With chained hashing you can have a load factor
higher than 100% if you want, and I believe that is what happens when the
existing bucket array gets oversized. I don't think it matters; in practice
such a hash would be insanely large in memory.

I also believe that the penalty of a poor hash function is higher, but I
don't know that I can prove it. (Eg, a poor hash function that produces too
many collisions causes more damage to open addressing than to chained
addressing.) But with SipHash as the standard hash function this isn't a
problem.
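
To make the layout difference concrete, here is a rough sketch of the two
shapes (simplified toy structs, not the real HE/HEK definitions):

    #include <stddef.h>

    /* Chained (current): a top-level array of pointers; entries are
     * allocated separately and linked together on collision, so a lookup
     * chases at least one extra pointer. */
    struct toy_entry_chained {
        struct toy_entry_chained *next;   /* next entry in this bucket's list */
        void *key;                        /* -> shared key structure (HEK-ish) */
        void *value;                      /* -> value SV                       */
    };
    struct toy_hash_chained {
        struct toy_entry_chained **buckets;   /* array of list heads, many NULL */
        size_t nbuckets;
    };

    /* Open addressing (proposed): the entries themselves live in one flat
     * array; an empty slot is simply a slot whose key pointer is NULL. One
     * pointer less per entry, one dereference less per lookup, and better
     * cache locality. */
    struct toy_entry_flat {
        void *key;
        void *value;
    };
    struct toy_hash_flat {
        struct toy_entry_flat *slots;     /* one contiguous allocation */
        size_t nslots;
    };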

> I propose that we use the hash that MoarVM switched to last year.
> It's my independent C implementation of the design of Martin Ankerl's hash
> at https://github.com/martinus/robin-hood-hashing
>
> I named mine "A Better Hash"
>
> There are a lot of comments explaining how it works at
> https://github.com/MoarVM/MoarVM/blob/master/src/core/str_hash_table.h
>
> The MoarVM version has 64 bit hash values, but 32 bit sizes for hashtables
> (so at most 4G entries)
>
> Perl 5 currently has 32 bit, 32 bit, and a 31 bit length limit for keys.
>
> I propose that while we're disrupting everything else, we switch to U64
> hashes, SSize_t entries and SSize_t keys.
>
> Because this all sounds crazy and impossible and you'd never believe me, I
> have a working proof of concept at
>
> https://github.com/nwc10/perl5/tree/smoke-me/nicholas/ABH-POC
>
> It builds DBI, Moose, Moo, JSON::XS, YAML::XS just fine.
>
> As in, all dependencies, all tests pass, vanilla, no changes.
>
> (it's definitely a Proof Of Concept. The code is a mess of 3 styles.
> Some dump.c functionality and a few regression tests are disabled.
> keys %hash = $value; is not implemented yet. ithreads cloning doesn't
> clone the hash iterator. The commit messages could be better.
> Apart from that it works. At least on Linux, AIX and HP/UX)
>
> Internally the core is ABH with 64 bit hash values and 64 bit sizes in the
> C
> APIs, but it presents a legacy 32 bit interface to XS, and XS works.
>
> It seems that you only need 32/64 bit compatibility wrappers around 9 core
> API functions.
>
> I *did* need to patch Cpanel::JSON::XS and Sereal to remove
> -Werror-no-declaration-after-statement from their C compiler flags.
>

Why? Doesn't this just make it unportable to AIX and old MS builds?


> Until "this week" that flag was an excellent idea, to avoid accidentally
> writing non-C89 code. However, I want to use C99 in my inline wrappers, and
> hence I've just discovered one of our future C99 pain points....
>
>
> Sereal needed t/020_sort_keys.t temporarily disabled, and these changes:
>

FWIW, I would be happy to take patches to Sereal to ensure that it compiles
properly with the old and new hash. And I would love to try it out and help
to the extent I can.


>
> +++ srl_encoder.c       2021-08-02 21:39:24.461838518 +0000
> @@ -1203,6 +1203,23 @@
>  {
>      HE *he;
>      const int do_share_keys = HvSHAREKEYS((SV *)src);
> +
> +#ifdef HvABH
> +    Perl_ABH_Table *table= HvABH(src);
> +    Perl_ABH_Iterator iterator= Perl_ABH_first(table);
> +    while (!Perl_ABH_at_end(table, iterator)) {
> +        HE *he= (HE *) Perl_ABH_current(table, iterator);
> +        SV *v= HeVAL(he);
> +        if (v != &PL_sv_placeholder) {
> +            srl_dump_hk(aTHX_ enc, he, do_share_keys);
> +            CALL_SRL_DUMP_SV(enc, v);
> +            if (--n == 0) {
> +                break;
> +            }
> +        }
> +        iterator= Perl_ABH_next(table, iterator);
> +    }
> +#else
>      HE **he_ptr= HvARRAY(src);
>      HE **he_end= he_ptr + HvMAX(src) + 1;
>
> @@ -1219,6 +1236,7 @@
>              }
>          }
>      } while ( he_ptr < he_end );
> +#endif
>  }
>
>  SRL_STATIC_INLINE void
> @@ -1508,7 +1526,12 @@
>          if ( share_keys && SRL_ENC_HAVE_OPTION(enc,
> SRL_F_SHARED_HASHKEYS) /* only enter branch if shared hk's enabled */
>  #if PERL_VERSION >= 10
>               && (!DO_SHARED_HASH_ENTRY_REFCOUNT_CHECK
> -                || src->he_valu.hent_refcount > 1)
> +#ifdef HvABH
> +                || src->hent_hek->hek_refcount > 1
> +#else
> +                || src->he_valu.hent_refcount > 1
> +#endif
> +                 )
>  #endif
>              )
>          {
>
>
>
> I think that we can fake everything in the existing (macro soup) API
> *except*
>
> 1) HvARRAY
>
> 2) The type of PL_strtab
>

I am not sure I get this one. Can you explain more?


>
>
> HvARRAY can't be replaced or emulated. I think that 14 distros use it:
>
> https://metacpan.org/release/Devel-Arena
> https://metacpan.org/release/Devel-MAT-Dumper
> https://metacpan.org/release/Devel-SizeMe
> https://metacpan.org/release/Glib
> https://metacpan.org/release/Hash-Spy
> https://metacpan.org/release/Internals-DumpArenas
> https://metacpan.org/release/JSPL
> https://metacpan.org/release/JavaBin
> https://metacpan.org/release/Sereal-Encoder
> https://metacpan.org/release/Sub-Disable
> https://metacpan.org/release/autovivification
> https://metacpan.org/release/mod_perl
> https://metacpan.org/release/namespace-clean-xs
> https://metacpan.org/release/perlec
>
> grep.metacpan.org can't filter out results from ppport.h (or embed.fnc)
> so I might have missed some
> (this is a feature of grep.cpan.me that I miss)
>
> I think we can patch them all. Several are just truth tests - I propose
> a new macro HvIS_EMPTY(hv) to replace them, which we also add to ppport.h
>

Makes sense.
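
Presumably something along these lines (purely illustrative; the real
definition is whatever ends up in hv.h and ppport.h):

    /* Hypothetical sketch of the proposed truth-test macro, so XS code can
     * stop peeking at HvARRAY just to ask "does this hash have any keys?". */
    #ifndef HvIS_EMPTY
    #  define HvIS_EMPTY(hv)  (HvUSEDKEYS(hv) == 0)
    #endif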


> Others just iterate over all the hash buckets - I propose we add
> hv_foreach() to core and ppport.h to handle these cleanly.
>

OK, the above comment about HvARRAY is explained now, I guess.
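
I'd guess the shape is a callback API roughly like this (signature invented
for illustration; the real one is whatever lands in core and ppport.h):

    #include "EXTERN.h"
    #include "perl.h"    /* HV, HE, pTHX_ */

    /* Hypothetical sketch: walk every entry of hv, handing each HE* to the
     * callback; a non-zero return from the callback stops the walk early.
     * Keeping HE* as the unit of currency is what lets existing XS code
     * keep working unchanged. */
    typedef int (*hv_foreach_cb)(pTHX_ HE *entry, void *arg);

    int hv_foreach(pTHX_ HV *hv, hv_foreach_cb cb, void *arg);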


>
> PL_strtab is mentioned in
>
> https://metacpan.org/release/Devel-Arena
> https://metacpan.org/release/Devel-GC-Helper
> https://metacpan.org/release/Devel-MAT-Dumper
>
> that I can find, and as it turns out at least in xsh/threads.h, which a
> routine search can't find.
>

Its moral existence is pretty important to Sereal, even if we don't mention
it directly. I still don't get the issue with it.


>
> The "XS exposure" to these two is much smaller than you might imagine.
> As long as the iterator APIs keep working and returning HE*s and HEK*s,
> most code is happy.
>
>
> Why am I suggesting "A Better Hash" and not $other...
>
> 1) C implementation exists
> 2) It's portable
> 3) It's suitably licensed
> 4) It supports hash randomisation
> 5) You can delete during hash iteration without having "problems"
>

It can? Interesting. I need to read up on that; I didn't think that was true.


>
>
> Before suggesting other hashes, please could you check that they meet all 5
> of the above criteria. I'm suggesting here something that is known to work
> and can be polished and delivered within a month or two part time. It
> builds
> on approximately every CPU and OS known to humanity (including sparc64 and
> ia64, so any endianness, and the alignment is 64 bit clean. *Not* tested on
> VMS, zOS, the CPU that zOS runs on, hppa or RiscV. Yet.)
>
> We can in turn replace this with something even better once code for that
> exists. But until then, don't let the perfect be the enemy of the good.
>

Agreed. An open hashing algorithm that is plumbed into Perl already is worth
5 that aren't.


>
> Randomisation and Iteration
>
> Randomisation is implemented by each hash having a private random 64 bit
> "salt", and getting a new salt for each hashtable size doubling. Instead of
> perturbing the iteration order, the insert position is perturbed.
>

If you happen to have a link to this on GitHub I'd be grateful.
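
Until then, my mental model of "perturb the insert position, not the
iteration order" is roughly the following (illustrative only, not the actual
ABH mixing code):

    #include <stddef.h>
    #include <stdint.h>

    /* Each table carries its own 64-bit salt and draws a fresh one at every
     * size doubling, so the slot an entry lands in is a function of
     * (salt, hash), not of the hash value alone. */
    static uint64_t toy_mix(uint64_t salt, uint64_t hash)
    {
        uint64_t x = salt ^ hash;
        x ^= x >> 33;                        /* any decent 64-bit finaliser */
        x *= UINT64_C(0xff51afd7ed558ccd);   /* will do for the sketch      */
        x ^= x >> 33;
        return x;
    }

    static size_t toy_slot(uint64_t salt, uint64_t hash, size_t nslots)
    {
        return (size_t)(toy_mix(salt, hash) & (nslots - 1));   /* power of two */
    }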


>
> For the current Linked List Hashes, the exploitable weakness was that on
> hash "grow" linked lists chains are *split* (not completely re-ordered),
> meaning that exactly 1 bit of the *key*'s hash value determines the split,
> and hence *that* bit is exposed by the iteration order. Without all the
> current mitigations to perturb iteration order it is demonstrably possible
> to infer that value of the bit in enough keys to work out the interpreter's
> global hash seed.
>
> For ABH, all 64 bits of salt (and all 32 or 64 bits of the key's hash
> value)
> are mixed to get the insert position. On hash resize a completely new salt
> is used, and hence totally new order is generated. I believe this means
> that
> anyone attempting to attack needs to figure out 96 (ideally 128) bits of
> hidden state simultaneously, not just the 1 bit for the Linked List splits.
> Hence I believe that it's secure. I'd appreciate someone competent proving
> me wrong on this.
>

Sounds good. Also, since the old one was only "a bit more secure than doing
nothing", I don't think you need to hit the target of "demonstrably secure";
you just need to be more secure than what I did, and I think that is pretty
easy to do. If you can provide a code link to the relevant logic it would be
helpful.
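
For anyone who hasn't stared at hsplit(): the weakness being described is
roughly this (schematic, not the real hv.c code):

    #include <stddef.h>
    #include <stdint.h>

    /* In a power-of-two chained hash, doubling the table sends each entry
     * either to its old bucket or to old_bucket + old_size, and the choice
     * is exactly one bit of the key's hash value. Watching which keys move
     * across a split therefore leaks that bit, key by key. */
    static size_t bucket_before_split(uint64_t hash, size_t old_size)
    {
        return (size_t)(hash & (old_size - 1));
    }

    static size_t bucket_after_split(uint64_t hash, size_t old_size)
    {
        /* equals bucket_before_split() when the hash bit at position
         * log2(old_size) is 0, and bucket_before_split() + old_size when
         * that bit is 1 */
        return (size_t)(hash & (2 * old_size - 1));
    }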


>
> This means that internally hash iterators are just an integer array index.
> This avoids all sorts of complexity with dangling pointers, allocation,
> etc.
>
> Iterators actually start at the highest index and count down to 0. This has
> three benefits
>
> 0) The "end of hash" iterator value is the same for hashes of any size
> 1) It's cheap to test a CPU register against a constant.
> 2) It's the reverse direction from internal movements caused by insertion
>    and deletion
>
> That last one is key - it means that it's possible to delete the hash entry
> at the current iterator position without perturbing the iteration.
>
> Perl documents this as supported behaviour. We need to support it.
>

Yep. If we can also insert safely during iteration then that would be very
cool.
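
In other words the iterator is conceptually nothing more than a descending
array index (a sketch, not the ABH API):

    #include <stddef.h>

    struct toy_slot { void *key; void *value; };

    /* Iterate from the highest slot down to 0, so "done" is one comparison
     * against a constant regardless of table size, and deleting the entry
     * under the cursor is safe: deletions only pull entries down from
     * higher indices, which the walk has already passed. */
    static void toy_walk(struct toy_slot *slots, size_t nslots)
    {
        for (size_t i = nslots; i-- > 0; ) {   /* visits nslots-1, ..., 0 */
            if (slots[i].key == NULL)
                continue;                      /* empty slot */
            /* visit slots[i]; deleting it here does not perturb the
             * remaining (lower-index) part of the walk */
        }
    }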


>
>
> Performance
>
> "The marvel is not that the bear dances well, but that the bear dances at
> all."
>
>
> It's actually a bit better than that. It's not noticeably slower on
> anything. I'm not sure what makes a good benchmark. Code I've tried isn't
> bottlenecked on hashing. I've tried using cachegrind - we're using more CPU
> in the hash code, and more data references, but fewer last level misses.
>
> But some of this could be down to tuning - open addressing hashes have to do
> more work for insert and delete, moving entries around to maintain their
> invariants. Robin Hood invariants are slightly more work on insert and
> delete, but intended to reduce work on lookups. And the approach here of
> storing more hash bits in the metadata means more ordering constraints (so
> more moving), but fewer lookups outside of the metadata bytes on reading.
>
> As I don't know what to benchmark, I've not tried tuning anything.
> (To be fair, I'm exhausted after the big push to get this working at all.
> It's why I've been quiet for a few weeks)
>

Test it with a broken hash function, or a precomputed set of keys that
collide in the right way. If it doesn't degrade terribly then IMO you are good.
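
Even something as blunt as this would do as test scaffolding (for stress
testing only, obviously):

    #include <stddef.h>
    #include <stdint.h>

    /* Deliberately terrible hash function: every key collides, forcing
     * worst-case probe behaviour so you can see how badly insert/lookup
     * degrade under open addressing. */
    static uint64_t worst_case_hash(const char *key, size_t len)
    {
        (void)key;
        (void)len;
        return 0;
    }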


>
> The load factor might be too high. There's actually quite a lot that can
> be tuned:
>
> * load factor (even make it dynamic as a function of hash size)
>

If speed is the objective, AFAIU the underlying math of the birthday paradox
says what this should be: IIRC basically something in the 60% to ~66% range.
After that the collision probability starts to approach 1 relatively quickly.
As far as I know there is no difference on this for open hashing versus
chaining; the underlying costs are related to the probability of collision,
which is purely determined by how many elements are filled. If anything, open
addressing benefits more from a lower load factor than chained addressing, I
would think.

> * thresholds for resizing
> * thresholds for how many hash bits in the metadata
> * smaller sizes (official "4", "2" or even "1", and with a load factor of
> 1.0)
>
> but without meaningful benchmarks (which I don't have) it's not going to
> answer any questions.
>

I need to read up on the implementation to say anything about that part.
Looking forward to it. :-)


> So for now I need to take a few days' break from this, but everyone else is
> welcome to play with the code, and figure out how to benchmark it.
> (And break the randomisation)
>

This is super awesome. I will try to find some time to look into it more
generally, and I will definitely find some time to look at specific parts;
if you ask me to do something to help I'll put it on my priority list. Thanks
a lot; this is one of the more exciting changes in some time from my POV.

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
