develooper Front page | perl.perl5.porters | Postings from November 2016

Re: [perl #130084] [PATCH] Increase hash colision ratio and reducehsplit calls

Thread Previous | Thread Next
From:
demerphq
Date:
November 13, 2016 22:07
Subject:
Re: [perl #130084] [PATCH] Increase hash colision ratio and reducehsplit calls
Message ID:
CANgJU+U2Dr6kGrYVUeDUHre7_+1AJ5nUBwR9ELwdzZ2dfPnvzw@mail.gmail.com
On 13 November 2016 at 17:29, Atoomic via RT <perlbug-followup@perl.org> wrote:
> Note that increasing the collision ratio, reduce the amount of memory used for the hash array
> but also slightly decrease performance (access, realocation, ...)

Yeah, there are a number of trade offs here.

I think its helpful to be able to visualize what is going on.

We currently use a maximum load factor of 1, or put otherwise we
expect there to be on average 1 key per bucket or less, and we double
the size of the array whenever we exceed this ratio.


This is what this looks like for 1023 keys going into 1024 buckets:

$ ./perl -Ilib -MHash::Util=bucket_stats_formatted -e'my %hash;
$hash{$_}++ for 1 .. ((1<<10)-1); print
bucket_stats_formatted(\%hash);'
Keys: 1023 Buckets: 640/1024 Quality-Score: 1.02 (Good)
Utilized Buckets: 62.50% Optimal: 99.90% Keys In Collision: 37.44%
Chain Length - mean: 1.60 stddev: 0.85
Buckets           1024
[0000000000000000000000001111111111111111111111122222222222233334*]
Len   0  37.50%    384 [########################]
Len   1  36.04%    369 [#######################]
Len   2  18.55%    190 [############]
Len   3   5.66%     58 [####]
Len   4   1.66%     17 [#]
Len   5   0.39%      4 []
Len   6   0.20%      2 []
Keys              1023
[111111111111111111111111111111111111111122222222222222222333334*]
Pos   1  62.56%    640 [########################################]
Pos   2  26.49%    271 [#################]
Pos   3   7.92%     81 [#####]
Pos   4   2.25%     23 [#]
Pos   5   0.59%      6 []
Pos   6   0.20%      2 []

As we can see, we have about 40% of our buckets unused, 36% of the
used buckets have 1 items in the chain, the longest chain length is 6,
and for a read-hit/update, 62% of our data will be found without
traversing the HE chain at all, and 88% with 1 extra HE traverse.

Then we add a key, triggering a split:

$ ./perl -Ilib -MHash::Util=bucket_stats_formatted -e'my %hash;
$hash{$_}++ for 1 .. ((1<<10)-0); print
bucket_stats_formatted(\%hash);'
Keys: 1024 Buckets: 803/2048 Quality-Score: 1.01 (Good)
Utilized Buckets: 39.21% Optimal: 50.00% Keys In Collision: 21.58%
Chain Length - mean: 1.28 stddev: 0.56
Buckets           2048
[0000000000000000000000000000000000000001111111111111111111222223*]
Len   0  60.79%   1245 [#######################################]
Len   1  30.27%    620 [###################]
Len   2   7.37%    151 [#####]
Len   3   1.32%     27 [#]
Len   4   0.20%      4 []
Len   5   0.05%      1 []
Keys              1024
[111111111111111111111111111111111111111111111111112222222222233*]
Pos   1  78.42%    803 [##################################################]
Pos   2  17.87%    183 [###########]
Pos   3   3.12%     32 [##]
Pos   4   0.49%      5 []
Pos   5   0.10%      1 []

Now we end up with 60% of the hash buckets unused, and only modestly
improve the distribution of keys. For instance where previously 88% of
read-hits would occur with at most 1 HE traverse or less, we now have
95%. A 7% improvement for a large cost in unused space.

Compare with a maximum load factor of 4:

At 1023 keys, right before key split:

$ ./perl -Ilib -MHash::Util=bucket_stats_formatted -e'my %hash;
$hash{$_}++ for 1 .. ((1<<10)-1); print
bucket_stats_formatted(\%hash);'
Keys: 1023 Buckets: 252/256 Quality-Score: 1.03 (Good)
Utilized Buckets: 98.44% Optimal: 399.61% Keys In Collision: 75.37%
Chain Length - mean: 4.06 stddev: 2.10
Buckets            256
[011111122222222233333333333334444444444444555555566666666777789*]
Len   0   1.56%      4 [#]
Len   1   8.98%     23 [######]
Len   2  13.67%     35 [#########]
Len   3  20.70%     53 [#############]
Len   4  20.70%     53 [#############]
Len   5  11.33%     29 [#######]
Len   6  11.72%     30 [########]
Len   7   6.64%     17 [####]
Len   8   1.95%      5 [#]
Len   9   1.17%      3 [#]
Len  10   0.00%      0 []
Len  11   0.39%      1 []
Len  12   0.78%      2 []
Len  13   0.39%      1 []
Keys              1023
[1111111111111111222222222222223333333333334444444445555556666778*]
Pos   1  24.63%    252 [################]
Pos   2  22.39%    229 [##############]
Pos   3  18.96%    194 [############]
Pos   4  13.78%    141 [#########]
Pos   5   8.60%     88 [######]
Pos   6   5.77%     59 [####]
Pos   7   2.83%     29 [##]
Pos   8   1.17%     12 [#]
Pos   9   0.68%      7 []
Pos  10   0.39%      4 []
Pos  11   0.39%      4 []
Pos  12   0.29%      3 []
Pos  13   0.10%      1 []

Almost no wasted space, and a pretty good distribution of keys. Just
under 50% of the keys are found with at most 1 HE transition, etc.


And after a key insert, hsplit:

$ ./perl -Ilib -MHash::Util=bucket_stats_formatted -e'my %hash;
$hash{$_}++ for 1 .. ((1<<10)-0); print
bucket_stats_formatted(\%hash);'
Keys: 1024 Buckets: 449/512 Quality-Score: 1.01 (Good)
Utilized Buckets: 87.70% Optimal: 200.00% Keys In Collision: 56.15%
Chain Length - mean: 2.28 stddev: 1.30
Buckets            512
[000000001111111111111111111222222222222222233333333333444444555*]
Len   0  12.30%     63 [########]
Len   1  30.27%    155 [###################]
Len   2  25.59%    131 [################]
Len   3  16.99%     87 [###########]
Len   4   8.98%     46 [######]
Len   5   4.49%     23 [###]
Len   6   0.59%      3 []
Len   7   0.59%      3 []
Len   8   0.20%      1 []
Keys              1024
[111111111111111111111111111122222222222222222233333333334444455*]
Pos   1  43.85%    449 [############################]
Pos   2  28.71%    294 [##################]
Pos   3  15.92%    163 [##########]
Pos   4   7.42%     76 [#####]
Pos   5   2.93%     30 [##]
Pos   6   0.68%      7 []
Pos   7   0.39%      4 []
Pos   8   0.10%      1 []

Quite a big improvement in distribution and chain length, and still
only modest wastage of the hash array. 71% of data within 1 hop, and a
max chain length of 8.

Now compare to a load factor of 8:

$ ./perl -Ilib -MHash::Util=bucket_stats_formatted -e'my %hash;
$hash{$_}++ for 1 .. ((1<<10)-1); print
bucket_stats_formatted(\%hash);'
Keys: 1023 Buckets: 128/128 Quality-Score: 0.99 (Good)
Utilized Buckets: 100.00% Optimal: 799.22% Keys In Collision: 87.49%
Chain Length - mean: 7.99 stddev: 2.72
Buckets            128 [*]
Len   0   0.00%      0 []
Len   1   0.78%      1 []
Len   2   0.78%      1 []
Len   3   2.34%      3 [##]
Len   4   6.25%      8 [####]
Len   5   8.59%     11 [######]
Len   6  13.28%     17 [########]
Len   7  10.94%     14 [#######]
Len   8  13.28%     17 [########]
Len   9  16.41%     21 [##########]
Len  10   7.81%     10 [#####]
Len  11   8.59%     11 [######]
Len  12   5.47%      7 [####]
Len  13   3.12%      4 [##]
Len  14   2.34%      3 [##]
Keys              1023
[1111111122222222333333334444444455555556666666777778888899991010111112*]
Pos   1  12.51%    128 [########]
Pos   2  12.41%    127 [########]
Pos   3  12.32%    126 [########]
Pos   4  12.02%    123 [########]
Pos   5  11.24%    115 [#######]
Pos   6  10.17%    104 [#######]
Pos   7   8.50%     87 [#####]
Pos   8   7.14%     73 [#####]
Pos   9   5.47%     56 [####]
Pos  10   3.42%     35 [##]
Pos  11   2.44%     25 [##]
Pos  12   1.37%     14 [#]
Pos  13   0.68%      7 []
Pos  14   0.29%      3 []

As we can see, the hash is now filling up, relatively little, 25% or
so, is reachable within 1 HE transition.

And after an hsplit:

$ ./perl -Ilib -MHash::Util=bucket_stats_formatted -e'my %hash;
$hash{$_}++ for 1 .. ((1<<10)-0); print
bucket_stats_formatted(\%hash);'
Keys: 1024 Buckets: 251/256 Quality-Score: 1.01 (Good)
Utilized Buckets: 98.05% Optimal: 400.00% Keys In Collision: 75.49%
Chain Length - mean: 4.08 stddev: 2.02
Buckets            256
[011111222222222233333333333334444444444455555555566666667777889*]
Len   0   1.95%      5 [#]
Len   1   7.42%     19 [#####]
Len   2  15.62%     40 [##########]
Len   3  20.31%     52 [#############]
Len   4  17.19%     44 [###########]
Len   5  14.45%     37 [#########]
Len   6  10.94%     28 [#######]
Len   7   7.03%     18 [####]
Len   8   3.12%      8 [##]
Len   9   1.56%      4 [#]
Len  10   0.00%      0 []
Len  11   0.00%      0 []
Len  12   0.00%      0 []
Len  13   0.00%      0 []
Len  14   0.39%      1 []
Keys              1024
[1111111111111111222222222222223333333333334444444445555556666778*]
Pos   1  24.51%    251 [################]
Pos   2  22.66%    232 [##############]
Pos   3  18.75%    192 [############]
Pos   4  13.67%    140 [#########]
Pos   5   9.38%     96 [######]
Pos   6   5.76%     59 [####]
Pos   7   3.03%     31 [##]
Pos   8   1.27%     13 [#]
Pos   9   0.49%      5 []
Pos  10   0.10%      1 []
Pos  11   0.10%      1 []
Pos  12   0.10%      1 []
Pos  13   0.10%      1 []
Pos  14   0.10%      1 []

We double the number of items we can reach with 1 HE transition,
accounting for 50% or so, but the hash is still pretty full.

IMO what all this shows is that we probably want a max load factor of about 4.

I would not change the rate at which we grow the hash, at least not
with current code, and maybe not even at all. The reason being I think
we don't want to go below a certain load factor at all. When we double
the hash array, we essentially half our load factor, and if we agree
that we get the right tradeoffs at a load between 2 and 4, then we
should not grow faster than doubling, or we would drop below a load
factor of 2 when we did so.

Yves

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About