develooper Front page | perl.perl5.porters | Postings from October 2003

Re: 5.8.2-RC1 and mp2

Thread Previous | Thread Next
Nicholas Clark
October 30, 2003 11:25
Re: 5.8.2-RC1 and mp2
Message ID:
On Thu, Oct 30, 2003 at 08:29:07AM -0600, Scott A Crosby wrote:
> Hello. I've been monitoring this thread and I thought all this stuff
> about rehashing was only for the 5.8.x series and only to preserve
> binary compatibility. Thus, you'd revert back to 5.8.1 style
> randomized hashes for 5.9.0?

I think that we should, but the decision is for the blead pumpking

> Also, in [1]
>      (ignoring mod perl) I'm not convinced that it kicks in too way
>      early.  MJD calculates that the average linked list size on an HV
>      should be 1.6 entries. (ie there usually shouldn't be a linear
>      search) It's kicking in at 4, *after* a hash split failed to
>      partition the longest list below 4. So it's saving linear
>      searches of 4 elements.
> You must check for under-attack on each hash insert. Only checking
> when the hash table doubles means opens up the attack I described in
> the thread 'Perl 5.8.1, plan C, and algorithmic complexity attacks.'

Sorry, not clear. It is. The logic is

For a regular hash:

Any "store" that is a change to an existing key returns immediately (as before)
Any store that is the first entry in a bucket returns.

Any other store (by implication) lengthens an existing chain. At this point:
a) If a chain becomes longer than the threshold, a split is triggered.
b) Also the old test (IIRC number of entries > number of buckets) also
   triggers a split.

The description above applies to what happens in a split. Therefore a
rehash can be triggered immediately by any insert that lengthens an existing

> Also, in a private reply on that thread, MJD said:
>     If the hash has N buckets and K keys, then the probability that
>     any given bucket contains exactly i keys is
>           (N choose i) (N-1)^(K-i) / (N^K)
>     I posted this on p5p last week sometime.
>     This formula implies that in a sufficiently large hash, when K=N,
>     the average nonempty bucket contains 1.6 keys.  'Sufficiently
>     large' in the context of Perl is 'always'.
>     I was not able to calculate the closed form for the variance.
>     However, numeric calculation of the cases that will come up in
>     practice (say, up to N = 2^30) suggest that 5 * log_2 (n) is
>     extremely conservative.  It looks to me like it's more like
>            5 + log(N)/2
>     and even that is conservative.
> Going by his advice, we should rehash in 5.8.2 when the chain length
> is > 5+log(N)/2. This is assuming that there are $N$ buckets and at
> most $N$ keys inserted. (IE, a load of 1.) 

OK. Patches welcome from anyone who wishes to implement that.
If whoever can figure out somewhere in the HV to store that number then
we can cache it, and only recalculate on a split. How much of a speed
hit is a log on a modern CPU?

Nicholas Clark

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About