Hash Function FAQ

I hear a lot of questions and complaints about hash functions in general. Here they are, linked to their standard answers.

  1. I'm hashing n things into n buckets, and 1/3 of my buckets are empty, and some buckets have five things in them! What's wrong?
  2. Is this hash designed for character data or numeric data?
  3. I've got a scheme where I need to rehash values. How do I guarantee that the two hash functions are independent?
  4. What makes a good hash function?
  5. Is the hash I'm using OK?
  6. Is it OK if I just hash the first few bytes and the last few bytes of this terribly long key?
  7. I want to use a hash to produce unique identifiers.

Look, I'm hashing n keys into n buckets, and 1/3 of my buckets are empty, and some buckets have five things in them! What's wrong?

Nothing's wrong. That's how hashes are supposed to work.

If you knew what keys you were hashing before you designed the hash function, you could make a hash function that puts n things in n buckets with no collisions. See perfect hashing.

Otherwise, there are a gazillion more possible keys than buckets, and the best any hash function can do is map an equal number of those gazillion keys to each bucket. The number of collisions you get is expected to follow the chi2 distribution.

Here's how to compute that:

  1. Look up: b = total number of buckets
  2. Look up: k = total number of keys
  3. Compute p = k/b is the expected number of keys per bucket
  4. Look up: bi buckets have i keys
  5. chi2 = sum (over all i) (bi((i-p)2)/p)
The distribution is expected to have a result close to b, that is, within 3sqrt(b) of b. Chi2 measures are usually reported in units of standard deviations. That is, if the formula above gives b+c*sqrt(b), they report c, and c is expected to be between -3 and 3.

Another rule of thumb, somewhat simpler: if #keys = #buckets/x, then about #buckets/(x+.5) buckets should get filled with at least one key. See corollary buckets.

Is this hash designed for character data or numeric data?

Decent hash functions are good for all data period. If your hash does well on character data but bad on numeric data, that indicates the hash function is much weaker than it ought to be. See my paper on funnels in hash functions.

I've got a scheme where I need to rehash values. How do I guarantee that the two hash functions are independent?

If two decent hash functions are different, they are almost certainly independent too. (There are (232)2128 ways to hash 16 bytes to 4 bytes.) However, if both hashes are flawed, it's not unlikely that they share the same flaw.

An interesting option is to produce a hash value that is twice as big as what you need, then use the first half as the first hash and the second half as the second hash. See checksum.c, for example. This can guarantee that hash values collide slightly less often than if the functions were truly independent.

What makes a good hash function?

Two things. First, the function should permute its internal state. That is, for every key, for every final internal state there should be exactly one initial internal state that can produce it. Second, the function should have no funnels. That is, for every set of input bits, changing those input bits in the key should cause at least that many output bits to change half the time.

A hash which permutes its internal state has every piece of the key affect the result equally. A hash with no funnels will cause outputs to look more different than inputs.

Is the hash I'm using OK?

Here are some hash functions for hash table lookup. "Is bad" means probably OK for up to 256 buckets, but causes excess collisions beyond that.

Is it OK if I just hash the first few bytes and the last few bytes of this terribly long key?

No. If two keys differ only in the middle bytes, they will have the same hash value. (That's a type of funnel.)

I want to use a hash to produce unique identifiers.

If you are hashing 2n keys and you want a chance of collision of at most 1 in 2m, then you need 2(n+m) bits in your hash value. There have been about 280 machine cycles executed in the history of the human race so far. Figure out how many bits you need. You should use a cryptographic hash.


Orders of magnitude: how big is 2n
Ye Olde Catalogue of Boy Scout Skits
ISAAC, a good random number generator
Table of Contents