[perl #130084] [PATCH] Increase hash collision ratio and reduce hsplit calls
From: Atoomic via RT
Date: November 14, 2016 00:58
Subject: [perl #130084] [PATCH] Increase hash collision ratio and reduce hsplit calls
Message ID: rt-4.0.24-9586-1479085128-469.130084-15-0@perl.org
Thanks for your time & feedback; the graphical visualization is definitely a plus.
Indeed it's all a matter of tradeoff/tuning to find the best compromise.
I would also not consider a load factor higher than 4.
My main concern is that if we slow down hashes by reducing the memory usage, that would be a regression from my point of view...
And from the naive tests I ran earlier, it seems that the load factor has a significant impact on run time.
I was also considering an alternate solution: change the grow factor only at the beginning, for small hashes.
That would avoid wasting memory once the hash gets big... It would probably address your concern about wasting memory?
Maybe this should be an option that could be set at compilation time? (That would not settle the question of whether we need/want to change the default value.)
Any opinion on this? Of course the definition of "small/big" needs to be defined; I was thinking of something like 1024.
I'm also fine with leaving it as it is, but I have the feeling that the current default setup can be optimized.
Thanks again
nicolas
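To make the "bigger grow factor only while the hash is small" idea concrete, here is a toy, standalone Perl model (not the hv.c implementation) that counts resizes when inserting keys under a split-when-keys-exceed-buckets-times-max-load rule. The 8-bucket starting size matches perl's usual default, and the 1024-bucket cutoff is just the hypothetical "small/big" threshold mentioned above.

    #!/usr/bin/perl
    # Toy model only, not the hv.c implementation: split whenever
    # keys > buckets * max_load, growing by $small_grow while the array
    # is below $small_limit buckets and by 2x afterwards.
    use strict;
    use warnings;

    sub count_splits {
        my ($keys, $max_load, $small_grow, $small_limit) = @_;
        my ($buckets, $splits) = (8, 0);    # perl hashes usually start with 8 buckets
        for my $n (1 .. $keys) {
            if ($n > $buckets * $max_load) {
                $buckets *= ($buckets < $small_limit) ? $small_grow : 2;
                $splits++;
            }
        }
        return ($splits, $buckets);
    }

    for my $keys (1_000, 100_000) {
        my ($s1, $b1) = count_splits($keys, 1, 2, 0);     # current style: always double
        my ($s2, $b2) = count_splits($keys, 1, 4, 1024);  # 4x growth while "small"
        printf "%6d keys: double-only %2d splits / %6d buckets; " .
               "4x-while-small %2d splits / %6d buckets\n",
               $keys, $s1, $b1, $s2, $b2;
    }

With those numbers, a hash that grows to 100,000 keys ends up with the same array size under both policies but needs a few splits less with the 4x start; a hash that stops at 1,000 keys also saves splits, but can end up with one extra doubling of the array (2048 instead of 1024 buckets), which is the memory/speed trade-off being discussed.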
On Sun, 13 Nov 2016 14:08:00 -0800, demerphq wrote:
> On 13 November 2016 at 17:29, Atoomic via RT <perlbug-
> followup@perl.org> wrote:
> > Note that increasing the collision ratio reduces the amount of memory
> > used for the hash array,
> > but also slightly decreases performance (access, reallocation, ...).
>
> Yeah, there are a number of trade offs here.
>
> I think it's helpful to be able to visualize what is going on.
>
> We currently use a maximum load factor of 1, or put otherwise we
> expect there to be on average 1 key per bucket or less, and we double
> the size of the array whenever we exceed this ratio.
>
>
> This is what this looks like for 1023 keys going into 1024 buckets:
>
> $ ./perl -Ilib -MHash::Util=bucket_stats_formatted -e'my %hash;
> $hash{$_}++ for 1 .. ((1<<10)-1); print
> bucket_stats_formatted(\%hash);'
> Keys: 1023 Buckets: 640/1024 Quality-Score: 1.02 (Good)
> Utilized Buckets: 62.50% Optimal: 99.90% Keys In Collision: 37.44%
> Chain Length - mean: 1.60 stddev: 0.85
> Buckets 1024
> [0000000000000000000000001111111111111111111111122222222222233334*]
> Len 0 37.50% 384 [########################]
> Len 1 36.04% 369 [#######################]
> Len 2 18.55% 190 [############]
> Len 3 5.66% 58 [####]
> Len 4 1.66% 17 [#]
> Len 5 0.39% 4 []
> Len 6 0.20% 2 []
> Keys 1023
> [111111111111111111111111111111111111111122222222222222222333334*]
> Pos 1 62.56% 640 [########################################]
> Pos 2 26.49% 271 [#################]
> Pos 3 7.92% 81 [#####]
> Pos 4 2.25% 23 [#]
> Pos 5 0.59% 6 []
> Pos 6 0.20% 2 []
>
> As we can see, we have about 40% of our buckets unused, 36% of the
> buckets hold exactly 1 item, the longest chain length is 6, and for a
> read-hit/update, 62% of our data will be found without traversing the
> HE chain at all, and about 89% with at most 1 extra HE traverse.
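For what it's worth, those figures are close to what the standard uniform-hashing approximation predicts: with load factor L, bucket chain lengths are roughly Poisson(L), so about exp(-1), or 36.8%, of buckets should be empty and 63.2% of keys should be found without any chain walk, versus the 37.50% and 62.56% reported above. A small standalone Perl sketch of that model (independent of Hash::Util, purely illustrative):

    #!/usr/bin/perl
    # Uniform-hashing approximation: with load factor L (keys per bucket),
    # a bucket's chain length is roughly Poisson(L), i.e.
    # P(len == k) = exp(-L) * L**k / k!.
    # Back-of-the-envelope model only, for comparison with the
    # bucket_stats_formatted output quoted in this thread.
    use strict;
    use warnings;

    sub fact { my ($n) = @_; my $f = 1; $f *= $_ for 2 .. $n; return $f }

    for my $load (1, 4, 8) {
        my $empty = exp(-$load);            # expected share of empty buckets
        my $pos1  = (1 - $empty) / $load;   # share of keys that are first in their bucket
        printf "load %d: %5.2f%% empty buckets, %5.2f%% of keys need no chain walk\n",
            $load, 100 * $empty, 100 * $pos1;
        for my $k (0 .. 4) {
            printf "  P(len == %d) = %5.2f%%\n",
                $k, 100 * exp(-$load) * $load**$k / fact($k);
        }
    }

The same model puts the empty-bucket share near 1.8% at load factor 4 and 0.03% at load factor 8, which lines up reasonably well with the outputs further down.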
>
> Then we add a key, triggering a split:
>
> $ ./perl -Ilib -MHash::Util=bucket_stats_formatted -e'my %hash;
> $hash{$_}++ for 1 .. ((1<<10)-0); print
> bucket_stats_formatted(\%hash);'
> Keys: 1024 Buckets: 803/2048 Quality-Score: 1.01 (Good)
> Utilized Buckets: 39.21% Optimal: 50.00% Keys In Collision: 21.58%
> Chain Length - mean: 1.28 stddev: 0.56
> Buckets 2048
> [0000000000000000000000000000000000000001111111111111111111222223*]
> Len 0 60.79% 1245 [#######################################]
> Len 1 30.27% 620 [###################]
> Len 2 7.37% 151 [#####]
> Len 3 1.32% 27 [#]
> Len 4 0.20% 4 []
> Len 5 0.05% 1 []
> Keys 1024
> [111111111111111111111111111111111111111111111111112222222222233*]
> Pos 1 78.42% 803 [##################################################]
> Pos 2 17.87% 183 [###########]
> Pos 3 3.12% 32 [##]
> Pos 4 0.49% 5 []
> Pos 5 0.10% 1 []
>
> Now we end up with 60% of the hash buckets unused, and only modestly
> improve the distribution of keys. For instance, where previously 88% of
> read-hits would occur with at most 1 HE traverse, we now have
> 95%. A 7% improvement for a large cost in unused space.
>
> Compare with a maximum load factor of 4:
>
> At 1023 keys, right before key split:
>
> $ ./perl -Ilib -MHash::Util=bucket_stats_formatted -e'my %hash;
> $hash{$_}++ for 1 .. ((1<<10)-1); print
> bucket_stats_formatted(\%hash);'
> Keys: 1023 Buckets: 252/256 Quality-Score: 1.03 (Good)
> Utilized Buckets: 98.44% Optimal: 399.61% Keys In Collision: 75.37%
> Chain Length - mean: 4.06 stddev: 2.10
> Buckets 256
> [011111122222222233333333333334444444444444555555566666666777789*]
> Len 0 1.56% 4 [#]
> Len 1 8.98% 23 [######]
> Len 2 13.67% 35 [#########]
> Len 3 20.70% 53 [#############]
> Len 4 20.70% 53 [#############]
> Len 5 11.33% 29 [#######]
> Len 6 11.72% 30 [########]
> Len 7 6.64% 17 [####]
> Len 8 1.95% 5 [#]
> Len 9 1.17% 3 [#]
> Len 10 0.00% 0 []
> Len 11 0.39% 1 []
> Len 12 0.78% 2 []
> Len 13 0.39% 1 []
> Keys 1023
> [1111111111111111222222222222223333333333334444444445555556666778*]
> Pos 1 24.63% 252 [################]
> Pos 2 22.39% 229 [##############]
> Pos 3 18.96% 194 [############]
> Pos 4 13.78% 141 [#########]
> Pos 5 8.60% 88 [######]
> Pos 6 5.77% 59 [####]
> Pos 7 2.83% 29 [##]
> Pos 8 1.17% 12 [#]
> Pos 9 0.68% 7 []
> Pos 10 0.39% 4 []
> Pos 11 0.39% 4 []
> Pos 12 0.29% 3 []
> Pos 13 0.10% 1 []
>
> Almost no wasted space, and a pretty good distribution of keys. Just
> under 50% of the keys are found with at most 1 HE transition, etc.
>
>
> And after a key insert, hsplit:
>
> $ ./perl -Ilib -MHash::Util=bucket_stats_formatted -e'my %hash;
> $hash{$_}++ for 1 .. ((1<<10)-0); print
> bucket_stats_formatted(\%hash);'
> Keys: 1024 Buckets: 449/512 Quality-Score: 1.01 (Good)
> Utilized Buckets: 87.70% Optimal: 200.00% Keys In Collision: 56.15%
> Chain Length - mean: 2.28 stddev: 1.30
> Buckets 512
> [000000001111111111111111111222222222222222233333333333444444555*]
> Len 0 12.30% 63 [########]
> Len 1 30.27% 155 [###################]
> Len 2 25.59% 131 [################]
> Len 3 16.99% 87 [###########]
> Len 4 8.98% 46 [######]
> Len 5 4.49% 23 [###]
> Len 6 0.59% 3 []
> Len 7 0.59% 3 []
> Len 8 0.20% 1 []
> Keys 1024
> [111111111111111111111111111122222222222222222233333333334444455*]
> Pos 1 43.85% 449 [############################]
> Pos 2 28.71% 294 [##################]
> Pos 3 15.92% 163 [##########]
> Pos 4 7.42% 76 [#####]
> Pos 5 2.93% 30 [##]
> Pos 6 0.68% 7 []
> Pos 7 0.39% 4 []
> Pos 8 0.10% 1 []
>
> Quite a big improvement in distribution and chain length, and still
> only modest wastage of the hash array. 71% of data within 1 hop, and a
> max chain length of 8.
>
> Now compare to a load factor of 8:
>
> $ ./perl -Ilib -MHash::Util=bucket_stats_formatted -e'my %hash;
> $hash{$_}++ for 1 .. ((1<<10)-1); print
> bucket_stats_formatted(\%hash);'
> Keys: 1023 Buckets: 128/128 Quality-Score: 0.99 (Good)
> Utilized Buckets: 100.00% Optimal: 799.22% Keys In Collision: 87.49%
> Chain Length - mean: 7.99 stddev: 2.72
> Buckets 128 [*]
> Len 0 0.00% 0 []
> Len 1 0.78% 1 []
> Len 2 0.78% 1 []
> Len 3 2.34% 3 [##]
> Len 4 6.25% 8 [####]
> Len 5 8.59% 11 [######]
> Len 6 13.28% 17 [########]
> Len 7 10.94% 14 [#######]
> Len 8 13.28% 17 [########]
> Len 9 16.41% 21 [##########]
> Len 10 7.81% 10 [#####]
> Len 11 8.59% 11 [######]
> Len 12 5.47% 7 [####]
> Len 13 3.12% 4 [##]
> Len 14 2.34% 3 [##]
> Keys 1023
> [1111111122222222333333334444444455555556666666777778888899991010111112*]
> Pos 1 12.51% 128 [########]
> Pos 2 12.41% 127 [########]
> Pos 3 12.32% 126 [########]
> Pos 4 12.02% 123 [########]
> Pos 5 11.24% 115 [#######]
> Pos 6 10.17% 104 [#######]
> Pos 7 8.50% 87 [#####]
> Pos 8 7.14% 73 [#####]
> Pos 9 5.47% 56 [####]
> Pos 10 3.42% 35 [##]
> Pos 11 2.44% 25 [##]
> Pos 12 1.37% 14 [#]
> Pos 13 0.68% 7 []
> Pos 14 0.29% 3 []
>
> As we can see, the hash is now completely full, and relatively little
> of the data, 25% or so, is reachable within 1 HE transition.
>
> And after an hsplit:
>
> $ ./perl -Ilib -MHash::Util=bucket_stats_formatted -e'my %hash;
> $hash{$_}++ for 1 .. ((1<<10)-0); print
> bucket_stats_formatted(\%hash);'
> Keys: 1024 Buckets: 251/256 Quality-Score: 1.01 (Good)
> Utilized Buckets: 98.05% Optimal: 400.00% Keys In Collision: 75.49%
> Chain Length - mean: 4.08 stddev: 2.02
> Buckets 256
> [011111222222222233333333333334444444444455555555566666667777889*]
> Len 0 1.95% 5 [#]
> Len 1 7.42% 19 [#####]
> Len 2 15.62% 40 [##########]
> Len 3 20.31% 52 [#############]
> Len 4 17.19% 44 [###########]
> Len 5 14.45% 37 [#########]
> Len 6 10.94% 28 [#######]
> Len 7 7.03% 18 [####]
> Len 8 3.12% 8 [##]
> Len 9 1.56% 4 [#]
> Len 10 0.00% 0 []
> Len 11 0.00% 0 []
> Len 12 0.00% 0 []
> Len 13 0.00% 0 []
> Len 14 0.39% 1 []
> Keys 1024
> [1111111111111111222222222222223333333333334444444445555556666778*]
> Pos 1 24.51% 251 [################]
> Pos 2 22.66% 232 [##############]
> Pos 3 18.75% 192 [############]
> Pos 4 13.67% 140 [#########]
> Pos 5 9.38% 96 [######]
> Pos 6 5.76% 59 [####]
> Pos 7 3.03% 31 [##]
> Pos 8 1.27% 13 [#]
> Pos 9 0.49% 5 []
> Pos 10 0.10% 1 []
> Pos 11 0.10% 1 []
> Pos 12 0.10% 1 []
> Pos 13 0.10% 1 []
> Pos 14 0.10% 1 []
>
> We double the number of items we can reach within 1 HE transition,
> accounting for 50% or so, but the hash is still pretty full.
>
> IMO what all this shows is that we probably want a max load factor of
> about 4.
>
> I would not change the rate at which we grow the hash, at least not
> with current code, and maybe not even at all. The reason being that I
> think we don't want to go below a certain load factor. When we double
> the hash array, we essentially halve our load factor, and if we agree
> that we get the right tradeoffs at a load between 2 and 4, then we
> should not grow faster than doubling, or we would drop below a load
> factor of 2 when we did so.
>
> Yves
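To spell out the arithmetic behind that last point: growing by a factor of g exactly when the load reaches the maximum L leaves the hash at load L/g, so with a maximum load factor of 4, doubling keeps a growing hash between load 2 and 4, while growing 4x in one step would drop it straight back to load 1. A trivial illustration (not hv.c code):

    #!/usr/bin/perl
    # Load factor right after a split = max load factor / grow factor.
    # Illustrative arithmetic only.
    use strict;
    use warnings;

    my $max_load = 4;
    for my $grow (2, 4, 8) {
        printf "max load %d, grow %dx -> load right after a split: %g\n",
            $max_load, $grow, $max_load / $grow;
    }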
---
via perlbug: queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=130084