develooper Front page | perl.perl5.porters | Postings from September 2021

Re: Robin Hood Hashing for the perl core

Thread Previous | Thread Next
From:
demerphq
Date:
September 23, 2021 13:09
Subject:
Re: Robin Hood Hashing for the perl core
Message ID:
CANgJU+VfKL9-=fYZBPs2xJVZXM_NaoNyrA6rUVXqBUVNtrqXBA@mail.gmail.com
On Fri, 10 Sept 2021 at 13:48, Nicholas Clark <nick@ccl4.org> wrote:

> 2) security
>

>    the way we split the linked list hash table is a function of 1 bit of
> the
>    hash value. If we didn't put effort into perturbing the iteration order
> to
>    conceal the bucket order, an attacker who can control the hash table's
>    keys can infer that single bit, work backwards from there to expose the
>    internal random state, and then with that pre-compute keys that collide
>
  ...


>    hence, is the approach of salting the insertion order no less secure
>    than what we have currently?
>

When bucket perturbing was introduced we were using a weak hash function.
I added the perturbing to make attacks on the seed harder.  Now that we use
Siphash I think bucket perturbing can be ditched as an
unnecessary complication.


> But, this *might* be worth doing for us, if there are lots of hashes with
> 1 to 4 keys. (version objects seem to be 4 keys, MRO lookup tables are 1
> key, at worst 2) - "official" size 4 buckets with a load factor of 1.0,
> meaning actually allocating 7 buckets and  storing up to 4 keys.
>

Doesn't sound like a good idea to me. But you know best. :-)


>
> > I also believe that the penalty of a poor hash function is higher, but I
> > dont know that I can prove it. (Eg, a poor hash function that produces
> too
> > many collisiions causes more damage to open addressing than to
> > chain-addressed) but with SipHash as the standard hash function this
> isnt a
> > problem.
>
> I get this impression too. The folks blogging about this stress that "you
> need a good hash function" but never go into much more detail than that.
>
>
Yeah, so don't worry about it. Siphash is a good hash function.  The only
problem is that it is not 32 bit. Do we care anymore?


>
> > > I *did* need to patch Cpanel::JSON::XS and Sereal to remove
> > > -Werror-no-declaration-after-statement from their C compiler flags.
> > >
> >
> > Why? Doesn't this just make it unprotable to AIX and old MS builds?
>
> No, not AIX :-)
>
> https://www.nntp.perl.org/group/perl.perl5.porters/2021/06/msg260331.html
>
>     The PSC proposes that we no longer support VC11.0 and earlier.


> ...
>
> VC12.0 is capable of *targeting* XP and Windows Server 2003
> (build host requirement is Windows 7/Windows Server 2012)
>

We only introduced the flag so we would stop breaking our windows builds.
IF that isnt a problem anymore then yay.


>
> > 2) The type of PL_strtab
> > >
> >
> > I am not sure I get this one, can you explain more.
>
> Currently PL_strtab *is* (almost) a regular SV of type SVt_PVHV.
> It's marked as having "unshared" hash keys, but you can (in theory) do that
> to any other hash.
>

Yes, that is on my todo list to play with. For certain types of hashes IMO
the PL_strtab is a massive pessimization. We have to do a double store and
a lot of complex crap and we end up having one hidden hash that scales to
the size of the set of unique keys in all hashes, which can chew up a lot
of  memory in the background and can play havok with preload loading of
data.

Eg, consider I want to build 10 hashes with 10 million keys each, im
actually building 11 hashes, one with 100M keys, and ten with 10M. AND if I
build of them prefork, and then build a totally different hash with similar
keys, I start COW'ing the hidden hash.

I have dreamed that we would/could change how hashes work, and only use
PL_strtab when they are blessed. So when a hash starts of it would be
unshared. When it gets blessed it gets converted into a shared hash.  Eg,
maybe introduce a "no shared_keys;" pragma, or something like that.

The idea of PL_strtab as far as I know  was to dedupe keys used in objects
where there is a high level of repetition of the keys. When you are build
large hashes of arbitrary strings, or aggregations in memory with lots of
breakouts that are unlikely to overlap, or are building hashes prefork, the
value is really questionable.


> The only crazy thing about it is that the "value" is a reference count
> rather than a pointer.
>
> All the current hash code assumes a fixed HE * structure, hence a fixed
> size.
>
> Whereas the ABH internals were written to be flexible and handle hashes of
> "any size you like, as long as they start with a pointer to the key"
>
> This means that the shared string table entries can be just a pointer to
> a key - ie sizeof(void *)
>
> With the reference count moved *into* the HEK.
>
> 1) This makes some of the other code simpler - sharing a HEK is cheaper
> 2) The open addressing array wastes less space for each empty "slot"
>
>
> but this means that PL_strtab can no longer be passed into regular HV*
> functions to get stats from it.
>

That probably only breaks some (inconsequential) code I wrote. AFAIK almost
nobody knows about PL_strtab.  I wouldnt worry about it.


>
> > If speed is the objective AFAUI the underlying math for the birthday
> > paradox says what this should be, IIRC basically something in the  %60 to
> > ~%66.  After that the collision probability  starts to approach 1
> > relatively quickly.  As far as I know there is no difference on this for
> > open hashing versus chain, the underlying costs are related to the
> > probability of collision which  is purely determined by how many elements
> > are filled, if anything open addressing benefits more from a lower load
> > factor than chain addressing I would think.
>
> Interesting, 60%-66%. I'd read this mail then forgotten the details when
> exploring - of 0.5, 0.625 and 0.75, 0.625 is the fastest


> (I typed 0.675 in my other messages. That was a typo. My fault. It was
> always 0.625 - 5/8)
>
>
> I'd read that *generally* open addressing benefits from lower load factors,
> for most collision strategies, it's better to finish soon.
>
> (Of those that I can remember, linear probing has to keep going until it
> finds an empty slot, so higher load factors mean that slot will be further,
> and both insert and lookup pay the penalty of that long probe distance.
> I think that all the others are similar - you can't know that you have a
> hash miss until you reach an empty slot, or hit your max probe distance)
>
> Whereas Robin Hood hashing's probe distance invariants mean that a lookup
> can stop *before* it finds an empty slot. Folks blogging about their
> exploits in writing hash tables seemed to suggest that it could support
> a higher load factor.
>
> But 0.625 seemed good.
>

I am pretty sure that you always want a lower load factor with open
addressing than you do with chain addressing, and that for any given load
factor open addressing will be worse in terms of collisions (which is fine,
open addressing is fast because of memory localization and more compact
memory utilization) so probably you should use a lower load factor.

Consider the probability of collision when you insert a new unique key into
a hash that contains 2 keys, both of which collide. With chain addressing
the chance of collision will be 1/k as we will have 2 items in one bucket,
with open addressing it will be 2/k as we will have two buckets occupied.
We know that with only very few entries in the hash table we will start
seeing collisions,  so we can assume that most of the time a chain
addressing scheme will use less buckets than an open addressing scheme.

So I would argue you want to tune open addressing to a lower load factor
than chain addressing to get the best insert performance from it.

I am guessing that %50 for open addressing would give the same profile as
2/3rds. (I was going to say you should do the math, but then i did it
myself):

The number of used buckets in a chain addressed hash with 'd' buckets that
contains 'n' items is expected to be: d - (d * ((d-1)/d)**n). (when d=365
this is the same as saying how many distinct birthdays should  be with 'n'
people in the room). The probability of a collsion is then (d - (d *
((d-1)/d)**n))/d.

Plugging some numbers into that, it turns out that when n is about 2/3rds
of d you get a collision rate of ~50%: (Which explains why 2/3rds is
favoured).

perl -le'for my $bits (3..24) { my $d= 1<<$bits; my $n= 2*$d/3; my $dm1od=
($d-1)/$d; my $r= ($d - ($d * ($dm1od**$n))); printf "n=%11d  d=%11d
chain:%.4f\n", $n, $d, $r/$d;}'
n=          5  d=          8 chain:0.5094
n=         10  d=         16 chain:0.4976
n=         21  d=         32 chain:0.4920
n=         42  d=         64 chain:0.4893
n=         85  d=        128 chain:0.4879
n=        170  d=        256 chain:0.4873
n=        341  d=        512 chain:0.4869
n=        682  d=       1024 chain:0.4868
n=       1365  d=       2048 chain:0.4867
n=       2730  d=       4096 chain:0.4866
n=       5461  d=       8192 chain:0.4866
n=      10922  d=      16384 chain:0.4866
n=      21845  d=      32768 chain:0.4866
n=      43690  d=      65536 chain:0.4866
n=      87381  d=     131072 chain:0.4866
n=     174762  d=     262144 chain:0.4866
n=     349525  d=     524288 chain:0.4866
n=     699050  d=    1048576 chain:0.4866
n=    1398101  d=    2097152 chain:0.4866
n=    2796202  d=    4194304 chain:0.4866
n=    5592405  d=    8388608 chain:0.4866
n=   11184810  d=   16777216 chain:0.4866

Thus to have the same collisiion properties you want to set the max load
factor for the open addressed scheme to 50%. Note that the approximation of
2/3rd I chose produces a rough collision probability of 45%.

Cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About