develooper Front page | perl.perl5.porters | Postings from March 2017

smoke-me/new_hashes: New Hash Functions and Tunable hash functionsupport

Thread Next
March 23, 2017 11:24
smoke-me/new_hashes: New Hash Functions and Tunable hash functionsupport
Message ID:
Today I have pushed  a branch smoke-me/new_hashes which adds some new
hashing options to perl.

The intent is provide a set of hash functions which offer a range of
trade offs in terms of security and performance, while staying
"reasonable" in all categories as much as possible.

The patch adds support for three new hash functions to perl core. They are:


At the same time it removes any further support for Jenkins OAAT as it
really is no longer suitable to a modern era.

The addition of SBOX is special in that it is wired in differently
from our other hashes so that it can, and must, be used in conjunction
with one of the other choices. SBOX is a cryptographic grade hash
function that depends on generating a large random lookup table at
startup, proportional to the size of the string it can hash. Hashing
proceeds by looking up a unique 32 bit value per character/position in
this table, such that the resulting hash performs very few ops, and
has ideal properties (assuming a quality RNG seeds the table).

The nice thing about this is that it means a person building a perl
can choose a balance between memory footprint and startup time, and
security and speed of their hashes. So for instance, someone concerned
with security, AND speed, but not startup time, might choose to SBOX
fairly long strings.

Zaphod32 is designed to be a fast and secure-enough for general
purpose use in our context, and is intended for 32 bit builds.

StadtX is designed to be a fast and secure-enough for general purpose
use hash function for our context, and is intended for 64 bit builds.

The idea of offering Zaphod32 or StadtX is to provide an option for
people who do not feel they need the security of the SipHash functions
and instead want better performance when hashing longer strings.

For shorter strings I strongly recommend using SBOX. The startup cost
is *relatively* speaking high, as it needs to populate a largish
buffer, however that relative is compared to the setup costs for most
other hash functions, not the total start up time. Absolutely speaking
building a single table takes nano-seconds using a 96 bit or 128 bit
XORSHIFT RNG. And "largish buffer" means 1k per byte we can hash.

Another component of this patch series is separating out state seeding
from hashing, which will make the existing SipHash code a teeny bit
faster, alongside the other goodies offered here.

Yet another component is modestly tweaking the rule for splitting a
hash, in that I changed the maximum load factor to 0.667, and we now
only split when we collide on insert AND we exceed the maximum load
factor when we do so. This means we will never do a split when we
insert into an empty bucket. Previously the maximum load factor was 1,
which leads to excess collisions, and the split would occur exactly
when we reached that load factor, even when not strictly required.

Note that inserting into an empty bucket in a hash increases the
efficiency of the hash, and will be a key that does not need to move
were we to rehash, so even though it increases our load factor, it
does not increase our collision rate, and it does not change the
hashes worst case performance, but it does improve its best case.

Another addition I have included is the ability to precompute the hash
keys for 0 and 1 byte keys. I can imagine there is someone out there
who is obsessed about that case and that case only, and does not want
to use SBOX32.

Why and how did I chose these hash functions?

I spent a lot of time hacking on and improving the 'smhasher' test
suite. I also spent a lot of time running genetic-algorithm type code
for generating mix functions and hash function finalizers. I also
spent a lot of time reviewing existing hash functions and how they
work. In the process I came to the conclusion that these are the hash
functions that in my opinion are best of class for the largest number
of use cases. This means that I excluded hash functions on certain
criteria, for instance seed size, state size, performance, and
portability. All the new hash functions are portable, and should work
ok on big or little endian, etc. It is possible the hash functions may
need to be optimised for platforms where aligned access is required,
and since it is my code we can change them however we want.

StadtX and Zaphod32 are among the hash functions that come out of this
process. Both are designed specifically with perls use case in mind,
where we have a single seeding moment, and then use that seeded state
for all following hashing.  They take advantage of this and use the
seeding process to pre-avalanche their seeds and things like that,
with the intent of allowing the rest of the hash function to be leaner
because of it.

StadtX is very fast for its class of construction. It uses four 64 bit
state elements, and at top speed hashes in 32 byte chunks. As far as I
can tell it has reasonable avalanching and distribution properties,
but it has a very light round function when hashing long strings so is
unlikely to be very secure in the cryptographic sense, although it
should be multicollision free, and with a 128 bit seed/state
brute-force robust for use at least for a few years, or when quantum
computing becomes a reality.

Zaphod32 is fast for a 32 bit hash, and it uses 3x32 bit state
elements, it is intended for people who do not have 64 integers,
although there is nothing stopping a user building it on a 64 bit

Both hash functions have been somewhat tuned for short strings should
someone not want to use SBOX at all.

What I like about this is that people now have a range of options, all
of which pass fairly severe testing, which operate at different levels
of security, run-time performance, start-up performance. In particular
tuning the SBOX maximum key length for a given application and
environment provides a lot of flexibility for really getting the best
performance/security trade off that user might want. Those who are
paranoid can use SipHash24, those who are paranoid but want a bit more
performance can use SipHash13, those want more performance can choose
from Zaphod32, and StadtX, and tune how long a string they want to

I would really appreciate it if people could try this code out on
different architectures, etc, especially "weird ones" (IOW, ones that
dont behave like x86) where alignment or endianness could be a

I would also really appreciate it if people could resist the urge to
respond to this mail with artificial single-point benchmarks. If you
have an application that does a lot of hashing, please do go ahead and
time its performance before and after, and then let us see the raw
timing data, etc, but please dont hack together some bogus benchmark
that  hashes 20 keys and then uses that to decide which is faster.
Also note that code that does heavy amount of forking or process
starts is not going to particularly useful either. Timing how long it
takes to build is not useful. I have already done so, and the natural
variance is higher than what we would expect from these changes, IOW,
you benchmark is going to be noise.

I have done benchmarks using Porting/, which unfortunately in
an over-hasty git clean -dfx I managed to wipe, so I need to recreate
them, but suffice it to say that on my system, even single byte keys
see a modest 1% performance improvment, and long strings can see up to
50% improvement. I *will* post these benchmarks once I recreate them.

This patch also includes some other modest changes, in particular
removing support for USE_HASH_SEED_EXPLICIT which as far as I can tell
has been complete broken for a long time.

Some related materials:

perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About