I hear a lot of questions and complaints about hash functions in general. Here they are, linked to their standard answers.
If you knew what keys you were hashing before you designed the hash function, you could make a hash function that puts n things in n buckets with no collisions. See perfect hashing.
Otherwise, there are a gazillion more possible keys than buckets, and the best any hash function can do is map an equal number of those gazillion keys to each bucket. The number of collisions you get is expected to follow the chi2 distribution.
Here's how to compute that:
Another rule of thumb, somewhat simpler: if #keys = #buckets/x, then about #buckets/(x+.5) buckets should get filled with at least one key. See corollary buckets.
Another rules (this one exact): if you place #keys into #buckets randomly, expect (#keys choose 2)/#buckets collisions. Proof by induction: it's true for 0 or 1 key (no collisions). If it's true for #keys, add another key. This adds #keys new pairs. For every placement of the previous keys, for every new pair, there is 1 out of #buckets placements that will cause a collision for that key, so #keys/#buckets new collisions for that new key over all new pairs, so a total of (#keys+1 choose 2)/#buckets collisions for #keys+1 keys.
The best structure for a hash table varies with how you plan to use it. Wikipedia is a good place to start. If your requirements are more interesting than normal, I suggest you search for "cache lines", "consistent hashing", "lock free hash table", "cuckoo hashing", "bit array", and "bloom filters".
A hash which permutes its internal state has every piece of the key affect the result equally. A hash with no funnels will cause outputs to look more different than inputs.
Orders of magnitude: how big is
2n
Ye Olde Catalogue of Boy Scout Skits
ISAAC, a good random number generator
Table of Contents