
Re: non-shared hash keys and large hashes (was Re: Robin Hood Hashing for the perl core)

From: demerphq
Date: October 19, 2021 13:18
Subject: Re: non-shared hash keys and large hashes (was Re: Robin Hood Hashing for the perl core)
Message ID: CANgJU+Xpfd=gB7VZAsxL2ANhc_bc_ZjVNoCZLpOeqB1cpO44Yg@mail.gmail.com
On Tue, 19 Oct 2021 at 14:28, Nicholas Clark <nick@ccl4.org> wrote:

> On Thu, Sep 23, 2021 at 03:07:14PM +0200, demerphq wrote:
> > On Fri, 10 Sept 2021 at 13:48, Nicholas Clark <nick@ccl4.org> wrote:
>
> > > But, this *might* be worth doing for us, if there are lots of
> > > hashes with 1 to 4 keys. (version objects seem to be 4 keys, MRO
> > > lookup tables are 1 key, at worst 2) - "official" size 4 buckets
> > > with a load factor of 1.0, meaning actually allocating 7 buckets
> > > and storing up to 4 keys.
> >
> > Doesn't sound like a good idea to me. But you know best. :-)
>
> No, I don't know best. In that, for some things I have a good feel for
> how far the code can be bent in a new direction (without it breaking),
> but I don't have a great feel for which directions it is actually
> important to bend in. Or how to measure success.
>

I just meant that relying on an artifact of over-allocating the bucket
array sounds fragile to me, but you know the algorithm best.


>
> > > VC12.0 is capable of *targeting* XP and Windows Server 2003
> > > (build host requirement is Windows 7/Windows Server 2012)
> > >
> >
> > We only introduced the flag so we would stop breaking our Windows
> > builds. If that isn't a problem anymore, then yay.
>
> I hope the explanation in the commit makes sense in
> https://github.com/Sereal/Sereal/pull/265
>
>
> Although *that* commit was the first that I wrote, and maybe the text in
> our perldelta.pod is clearer:
>
> =item *
>
> The Perl C source code now uses some C99 features, which we have
> verified are supported by all compilers we target. This means that
> Perl's headers now contain some code that is legal in C99 but not C89.
>
> This may cause problems for some XS modules that unconditionally add
> C<-Werror=declaration-after-statement> to their C compiler flags if
> compiling with gcc or clang. Earlier versions of Perl support long
> obsolete compilers that are strict in rejecting certain C99 features,
> particularly mixed declarations and code, and hence it makes sense for
> XS module authors to audit that their code does not violate this.
> However, doing this is now only possible on these earlier versions of
> Perl, hence these modules need to be changed to only add this flag for
> C<<$] < 5.035005>>.
>
> =cut
>

Ack.
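
FWIW, for module authors the fix presumably ends up looking something
like this in an ExtUtils::MakeMaker-based Makefile.PL (a sketch; the
module name and the exact flag plumbing here are made up):

use ExtUtils::MakeMaker;
use Config;

# Only ask for the C89 diagnostic on perls whose headers are still
# C89-clean; newer perls' headers use C99 and would fail to compile.
my @extra_cflags;
push @extra_cflags, '-Werror=declaration-after-statement'
    if $] < 5.035005 && $Config{gccversion};    # gcc and clang set this

WriteMakefile(
    NAME    => 'My::XS::Module',                # hypothetical
    CCFLAGS => join(' ', $Config{ccflags}, @extra_cflags),
);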


>
> > Yes, that is on my todo list to play with. For certain types of
> > hashes, IMO the PL_strtab is a massive pessimization. We have to do
> > a double store and a lot of complex crap, and we end up having one
> > hidden hash that scales to the size of the set of unique keys across
> > all hashes, which can chew up a lot of memory in the background and
> > can play havoc with preloading of data.
> >
> > Eg, consider I want to build 10 hashes with 10 million keys each:
> > I'm actually building 11 hashes, one with 100M keys and ten with
> > 10M. AND if I build all of them prefork, and then build a totally
> > different hash with similar keys, I start COW'ing the hidden hash.
> >
> > I have dreamed that we would/could change how hashes work, and only
> > use PL_strtab when they are blessed. So when a hash starts off it
> > would be unshared; when it gets blessed it gets converted into a
> > shared hash. Eg, maybe introduce a "no shared_keys;" pragma, or
> > something like that.
> >
> > The idea of PL_strtab, as far as I know, was to dedupe keys used in
> > objects, where there is a high level of repetition of the keys. When
> > you are building large hashes of arbitrary strings, or aggregations
> > in memory with lots of breakouts that are unlikely to overlap, or
> > are building hashes prefork, the value is really questionable.
>
> Sort of like this?
>
> https://github.com/nwc10/perl5/tree/larger-hashes-without-shared-keys
>
> It turns off shared keys once hashes get (roughly) larger than 42 keys.
> (Except for objects and symbol tables)
>

Oooh. Neat.
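
BTW, one easy way to see which hashes end up shared is Devel::Peek: the
HV's FLAGS line includes SHAREKEYS when the hash is using PL_strtab. A
quick sketch:

use Devel::Peek qw(Dump);

my %small = (a => 1);
my %big;
$big{"key$_"}++ for 1 .. 100;

# With your branch I'd expect FLAGS = (SHAREKEYS,...) on %small but
# not on %big once it has crossed the ~42 key threshold.
Dump(\%small);
Dump(\%big);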


>
> As I wrote above:
>
>   No, I don't know best. In that, for some things I have a good feel
>   for how far the code can be bent in a new direction (without it
>   breaking), but I don't have a great feel for which directions it is
>   actually important to bend in. Or how to measure success.
>
> in that, it all ought to be faster for the use cases described, but I
> fail to reliably benchmark it.
>

So I would do something along the following lines:

my %hash;
my $str = "aaaaaaaaaa";
$hash{$str++}++ for 1 .. (1 << 23);

my @pids;
for (1 .. 4) {
    my $pid = fork // die "couldn't fork: $!";
    if (!$pid) {
        # Child: touching every key bumps refcounts on the shared
        # string table entries, which forces COW faults.
        my %copy;
        $copy{$_}++ for keys %hash;
        exit 0;
    }
    else {
        push @pids, $pid;
    }
}
waitpid($_, 0) for @pids;

You should see a very different memory effect in the children. With our
current code the entirety of the mother's PL_strtab will be COWed into
the child processes as its refcounts get updated; with your patch I
would expect it not to be. So the net system usage should be something
like 5 units (1+4) instead of 9 units (1+4*2). Whether you notice it in
actual performance I imagine would depend on how much work you have to
do to produce each new key.
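
To actually measure that, each child could report how much of its
address space has been privately dirtied (ie un-COWed) before and after
the copy loop. A Linux-only sketch (smaps_rollup needs a reasonably
recent kernel):

# Returns kB of this process's address space that is private and dirty,
# ie pages that are no longer shared with the parent.
sub private_dirty_kb {
    open my $fh, '<', "/proc/$$/smaps_rollup"
        or die "can't open smaps_rollup: $!";
    while (<$fh>) {
        return $1 if /^Private_Dirty:\s+(\d+)\s+kB/;
    }
    return 0;
}

With blead the delta across the copy loop should be roughly the size of
PL_strtab; with your patch it shouldn't be.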

You could also put some diagnostics in to see how big PL_strtab gets; I
expect that with your patch it doesn't get that large. Add some
diagnostics so that when perl destroys PL_strtab you find out how big
its table was (not how many items it contained, but rather the maximum
size the table reached).

Also, you need to test the effect of adding and removing the same keys
in many hashes, so you can see the effect of having to bump the
refcounts in PL_strtab versus not. Your benchmark only touched one
hash. Try writing to many hashes with at least some overlapping keys.
Something like this:

my @hashes;
$hashes[rand 100]{int rand 1000}++ for 1 .. 1_000_000;

I'd expect that to show up in a benchmark.
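
Eg, run something like this under both binaries and compare the
reported rates (a sketch using the core Benchmark module; the iteration
counts are arbitrary):

use Benchmark qw(timethese);

timethese(20, {
    # Spread writes across 100 hashes with overlapping keys, so that
    # (with shared keys) the PL_strtab refcount bumps are on the hot path.
    overlapping_writes => sub {
        my @hashes;
        $hashes[rand 100]{int rand 1000}++ for 1 .. 100_000;
    },
});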


>
>
> Taking my previous (not great) benchmark that is roughly a "seen" hash:
>
> $ cat ~/test/big-hash-r3-my.pl
> #!perl -w
> use strict;
>
> my $size = shift;
> $size ||= 1e5;
>
> my %hash;
>
> if ($size < 0) {
>     $size = -$size;
>     keys %hash = $size;
> }
>
> # Avoid a lot of scope entries/exits so that we're mostly timing what
> # we care about
>
> ++$hash{$_ * 3}
>     for 1 .. $size;
>
>
None of the following is helpful for this: the overheads are purely on
write. Also, you need to try modifying many different hashes with a
medium number of keys. Say 100 hashes with 10k keys each, then look at
the size of PL_strtab.
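
Ie, something along these lines (a sketch; the key strings deliberately
overlap across all 100 hashes):

my @hashes;
for my $h (0 .. 99) {
    # Identical key set in every hash: with shared keys PL_strtab ends
    # up holding 10k entries, each with refcount 100; unshared, none.
    $hashes[$h]{"key$_"}++ for 1 .. 10_000;
}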


> $hash{$_ * 5} || 0
>     for 1 .. $size;
>
> $hash{$_ * 5} || 0
>     for 1 .. $size;
>
> $hash{$_ * 5} || 0
>     for 1 .. $size;
>
> __END__
>
>
> For blead, one run:
>
> $ valgrind --tool=cachegrind --branch-sim=yes /home/nick/Sandpit/snap-v5.35.4-150-geb058c0793/bin/perl5.35.5 ~/test/big-hash-r3-my.pl 1e7
> ==2083422== Cachegrind, a cache and branch-prediction profiler
> ==2083422== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
> ==2083422== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
> ==2083422== Command: /home/nick/Sandpit/snap-v5.35.4-150-geb058c0793/bin/perl5.35.5 /home/nick/test/big-hash-r3-my.pl 1e7
> ==2083422==
> --2083422-- warning: L3 cache found, using its data for the LL simulation.
> ==2083422== brk segment overflow in thread #1: can't grow to 0x4838000
> ==2083422== (see section Limitations in user manual)
> ==2083422== NOTE: further instances of this message will not be shown
> ==2083422==
> ==2083422== I   refs:      37,648,207,135
> ==2083422== I1  misses:            31,488
> ==2083422== LLi misses:             6,735
> ==2083422== I1  miss rate:           0.00%
> ==2083422== LLi miss rate:           0.00%
> ==2083422==
> ==2083422== D   refs:      17,888,205,650  (12,036,875,789 rd + 5,851,329,861 wr)
> ==2083422== D1  misses:       245,352,635  (   220,149,381 rd +    25,203,254 wr)
> ==2083422== LLd misses:       232,768,464  (   209,874,979 rd +    22,893,485 wr)
> ==2083422== D1  miss rate:            1.4% (           1.8%   +          0.4%   )
> ==2083422== LLd miss rate:            1.3% (           1.7%   +          0.4%   )
> ==2083422==
> ==2083422== LL refs:          245,384,123  (   220,180,869 rd +    25,203,254 wr)
> ==2083422== LL misses:        232,775,199  (   209,881,714 rd +    22,893,485 wr)
> ==2083422== LL miss rate:             0.4% (           0.4%   +          0.4%   )
> ==2083422==
> ==2083422== Branches:       5,764,880,483  ( 5,252,453,290 cond +  512,427,193 ind)
> ==2083422== Mispredicts:      426,430,089  (   106,405,461 cond +  320,024,628 ind)
> ==2083422== Mispred rate:             7.4% (           2.0%     +        62.5%   )
>
>
> for that branch (almost - before rebasing):
>
> $ valgrind --tool=cachegrind --branch-sim=yes /home/nick/Sandpit/snap-v5.35.4-159-g82b0176e75/bin/perl5.35.5 ~/test/big-hash-r3-my.pl 1e7
> ==2082396== Cachegrind, a cache and branch-prediction profiler
> ==2082396== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
> ==2082396== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
> ==2082396== Command: /home/nick/Sandpit/snap-v5.35.4-159-g82b0176e75/bin/perl5.35.5 /home/nick/test/big-hash-r3-my.pl 1e7
> ==2082396==
> --2082396-- warning: L3 cache found, using its data for the LL simulation.
> ==2082396== brk segment overflow in thread #1: can't grow to 0x4832000
> ==2082396== (see section Limitations in user manual)
> ==2082396== NOTE: further instances of this message will not be shown
> ==2082396==
> ==2082396== I   refs:      36,103,067,458
> ==2082396== I1  misses:            31,762
> ==2082396== LLi misses:             6,739
> ==2082396== I1  miss rate:           0.00%
> ==2082396== LLi miss rate:           0.00%
> ==2082396==
> ==2082396== D   refs:      17,049,471,109  (11,534,685,391 rd + 5,514,785,718 wr)
> ==2082396== D1  misses:       205,496,802  (   188,346,327 rd +    17,150,475 wr)
> ==2082396== LLd misses:       194,991,458  (   179,112,354 rd +    15,879,104 wr)
> ==2082396== D1  miss rate:            1.2% (           1.6%   +          0.3%   )
> ==2082396== LLd miss rate:            1.1% (           1.6%   +          0.3%   )
> ==2082396==
> ==2082396== LL refs:          205,528,564  (   188,378,089 rd +    17,150,475 wr)
> ==2082396== LL misses:        194,998,197  (   179,119,093 rd +    15,879,104 wr)
> ==2082396== LL miss rate:             0.4% (           0.4%   +          0.3%   )
> ==2082396==
> ==2082396== Branches:       5,594,219,347  ( 5,081,890,292 cond +  512,329,055 ind)
> ==2082396== Mispredicts:      409,023,997  (    89,004,563 cond +  320,019,434 ind)
> ==2082396== Mispred rate:             7.3% (           1.8%     +        62.5%   )
>
>
>
> so it looks sort-of better, but
>
> 1) not fantastically
> 2) I know that something (hash randomisation) moves those numbers about
>
>
> So, assuming that you have some test cases that more easily benchmark
> things like "what memory doesn't stay shared when you fork?" it would
> be useful if you could try benchmarking that branch (vs blead) to see
> if it helps the use cases that you (or your $work) are interested in.
>

It's not so much a benchmarking issue, more a memory management issue,
although I bet it will show up. I'll do what I can to test your branch.


>
>
> I'm really not doing very well at benchmarking whether changes are useful.
>

If the change is effectively neutral, is the underlying code change an
improvement in quality, or theoretically better? If so, then it's
probably good to keep.

cheers,
Yves
