Hash Function FAQ

I hear a lot of questions and complaints about hash functions in general. Here they are, linked to their standard answers.

I'm hashing n things into n buckets, and 1/3 of my buckets are empty, and some buckets have five things in them! What's wrong?
What's the best way to structure a hash table?
Is this hash designed for character data or numeric data?
I've got a scheme where I need to rehash values. How do I guarantee that the two hash functions are independent?
What makes a good hash function?
Is the hash I'm using OK?
Is it OK if I just hash the first few bytes and the last few bytes of this terribly long key?
I want to use a hash to produce unique identifiers.

Look, I'm hashing n keys into n buckets, and 1/3 of my buckets are empty, and some buckets have five things in them! What's wrong?

Nothing's wrong. That's how hashes are supposed to work.

If you knew what keys you were hashing before you designed the hash function, you could make a hash function that puts n things in n buckets with no collisions. See perfect hashing.

Otherwise, there are a gazillion more possible keys than buckets, and the best any hash function can do is map an equal number of those gazillion keys to each bucket. The number of collisions you get is expected to follow the chi² distribution.

Here's how to compute that:

Look up: b = total number of buckets
Look up: k = total number of keys
Compute p = k/b is the expected number of keys per bucket
Look up: b_i buckets have i keys
chi² = sum (over all i) (b_i((i-p)²)/b)

The distribution is expected to have a result close to b, that is, within 3sqrt(b) of b. Chi² measures are usually reported in units of standard deviations. That is, if the formula above gives b+c*sqrt(b), they report c, and c is expected to be between -3 and 3.

Another rule of thumb, somewhat simpler: if #keys = #buckets/x, then about #buckets/(x+.5) buckets should get filled with at least one key. See corollary buckets.

Another rules (this one exact): if you place #keys into #buckets randomly, expect (#keys choose 2)/#buckets collisions. Proof by induction: it's true for 0 or 1 key (no collisions). If it's true for #keys, add another key. This adds #keys new pairs. For every placement of the previous keys, for every new pair, there is 1 out of #buckets placements that will cause a collision for that key, so #keys/#buckets new collisions for that new key over all new pairs, so a total of (#keys+1 choose 2)/#buckets collisions for #keys+1 keys.

What's the best way to structure a hash table?

Don't ask me. I'm an expert on hash functions, not on the ways to use them. I do advocate hash tables whose size is a power of two if that makes your code simpler and you know you'll use a decent hash.

The best structure for a hash table varies with how you plan to use it. Wikipedia is a good place to start. If your requirements are more interesting than normal, I suggest you search for "cache lines", "consistent hashing", "lock free hash table", "cuckoo hashing", "bit array", and "bloom filters".

Is this hash designed for character data or numeric data?

Decent hash functions are good for all data period. If your hash does well on character data but bad on numeric data, that indicates the hash function is much weaker than it ought to be. See my paper on funnels in hash functions.

I've got a scheme where I need to rehash values. How do I guarantee that the two hash functions are independent?

If two decent hash functions are different, they are almost certainly independent too. (There are (2³²)^2¹²⁸ ways to hash 16 bytes to 4 bytes.) However, if both hashes are flawed, it's not unlikely that they share the same flaw.

What makes a good hash function?

Two things. First, the function should permute its internal state. That is, for every key, for every final internal state there should be exactly one initial internal state that can produce it. Second, the function should have no funnels. That is, for every set of input bits, changing those input bits in the key should cause at least that many output bits to change half the time.

A hash which permutes its internal state has every piece of the key affect the result equally. A hash with no funnels will cause outputs to look more different than inputs.

Is the hash I'm using OK?

Here are some hash functions for hash table lookup. "Is bad" means probably OK for up to 256 buckets, but causes excess collisions beyond that.

for (h=0, i=0; i<len; ++i) h += key[i]; is bad.
for (h=0, i=0; i<len; ++i) h ^= key[i]; is bad.
for (h=0, i=0; i<len; ++i) h *= key[i]; is hilariously bad.
for (h=0, i=0; i<len; ++i) h = (h<<5)^(h>>27)^key[i]; is OK for one-word ASCII text, otherwise it's bad.
for (h=0, i=0; i<len; ++i) h = (h<<4)^(h>>28)^key[i]; is OK for one-word ASCII text, otherwise it's bad.
for (h=0, i=0; i<len; ++i) { h += key[i]; h += (h<<10); h ^= (h>>6); }
h += (h<<3); h ^= (h>>11); h += (h<<15);
is good.
for (h=0, i=0; i<len; ++i) h = tab[(h^key[i])&0xff]; may be good.
for (h=0, i=0; i<len; ++i) h = (h>>8)^tab[(key[i]+h)&0xff]; may be good.
Anything with a modulo prime at the end is probably bad; a decent hash function could use a power of two and wouldn't need to rely on the modulo prime to further mix anything.
Universal hash functions are usually good.
CRC hashes are usually good.
lookup3.c and checksum.c are some good hashes.
All cryptographic hashes (MD4, MD5, SHA, Snefru, RIPE-MD) are good.

Is it OK if I just hash the first few bytes and the last few bytes of this terribly long key?

No. If two keys differ only in the middle bytes, they will have the same hash value. (That's a type of funnel.)

I want to use a hash to produce unique identifiers.

If you are hashing 2ⁿ keys and you want a chance of collision of at most 1 in 2^m, then you need 2(n+m) bits in your hash value. There have been about 2⁸⁰ machine cycles executed in the history of the human race so far. Figure out how many bits you need. You should use a cryptographic hash.

Orders of magnitude: how big is 2ⁿ
Ye Olde Catalogue of Boy Scout Skits
ISAAC, a good random number generator
Table of Contents