I hear a lot of questions and complaints about hash functions in general. Here they are, linked to their standard answers.
If you knew what keys you were hashing before you designed the hash function, you could make a hash function that puts n things in n buckets with no collisions. See perfect hashing.
Otherwise, there are a gazillion more possible keys than buckets, and the best any hash function can do is map an equal number of those gazillion keys to each bucket. The number of collisions you get is expected to follow the chi2 distribution.
Here's how to compute that:
Another rule of thumb, somewhat simpler: if #keys = #buckets/x, then about #buckets/(x+.5) buckets should get filled with at least one key. See corollary buckets.
An interesting option is to produce a hash value that is twice as big as what you need, then use the first half as the first hash and the second half as the second hash. See checksum.c, for example. This can guarantee that hash values collide slightly less often than if the functions were truly independent.
A hash which permutes its internal state has every piece of the key affect the result equally. A hash with no funnels will cause outputs to look more different than inputs.
Orders of magnitude: how big is
2n
Ye Olde Catalogue of Boy Scout Skits
ISAAC, a good random number generator
Table of Contents