I've been requiring that just about everything is a permutation, not just a random mapping. Permutations always map n things to n things. Random mappings map them to m, with m <= n.
Theorem BIRTHDAY PARADOX: Applying a random mapping f : S -> S to \sqrt{2|S|} values is expected to produce about 1 collision.
Proof: Let ck be a list of choices, where each choice comes from S, and let Ck be the list of all such choices.
By definition, if there is a collision in ck in Ck, there will be an i,j in 0..k-1, i < j, such that cki = ckj.
Given i,j in 0..k-1, i < j, every value in (SxS) is equally likely to be the value of (cki, ckj). This follows from Ck being SxSxSx...xS.
There are |S|*|S| possible pairs in SxS,
|S| of which have the same first and second term,
so 1/|S| of all possible pairs have the same first and second term.
Each c in Csqrt(2|S|) has (sqrt(2|S|)(sqrt(2|S|)-1)/2) =
(|S|-sqrt(|S|/2)) pairs (ci, cj) where i < j.
Since every possible pair is equally likely, the chances that c has a
(ci, cj) with i < j and ci =
cj is (|S|-sqrt(|S|/2))/|S|, which is about 1.
Therefore a set of sqrt(2|S|) independent choices from S will have about 1 collision.
[Chuck Blake pointed out that hash functions produce a binomial distribution in each hash bucket, which in the limit is a Poisson distribution. That is, if there are n buckets and m items hashed into them, and a=m/n, then the probability of a given bucket having k items in it will be about (ake-a)/k! , so the total number of buckets with k items will be about (nake-a)/k! ].
We want the variable-length hash to be an equidistributed function, that is, it should map an equal number of keys to each output. The only equidistributed mixing functions are the permutations of the internal state S. If the mixing function m is not a permutation, the final mixing steps will guarantee that some hash values are not possible, so the results will not be equidistributed. Similarly, if the combining function g: B \times S -> S is not equidistributed for the set of all input blocks B given a fixed internal state, the hash will not be equidistributed even for one-bucket keys. What if g is not a permutation of S when the input block is fixed?
Let a hash function be given where, for every block in B, the combining step is an arbitrary mapping from S to S. Then, for a fixed key, the hash is effectively a sequence of arbitrary mappings applied to the initial internal state. (The function proposed in \cite{MERKLE} seems to be an example of such a hash.) If we start with all possible internal states, how many are left after y blocks have been hashed? The answer is about 2|S|/y, to be shown by theorem loss.
To show this, let's start with a simpler question: if we apply a random mapping to x out of n states, how many distinct results will there be?
Theorem HALF: Define q = n/x. Applying a random mapping to x out of n possible states is expected to produce about n/(q+(1/2)+o(1/q)) distinct results if q < \sqrt{n}.
Proof: First, find the straightforward formula for how many distinct
result there will be.
1/n | chance of a given input hitting a given result |
(n-1)/n | chance of a given input not hitting a given result |
((n-1)/n)x | chance of all x inputs not hitting a given result |
n(((n-1)/n)x) | number of results not hit at all by x inputs |
n(1-(((n-1)/n)x)) | number of distinct results hit by some input |
Next, look at the series expansion of 1-(((n-1)/n)x):
1 | -(((n-1)/n)x) | ||||
= | 1 | - ((n-1)x)/(nx) | |||
= | 1 | - (nx)/(nx) | + x(nx-1)/(nx) | - x(x-1)(nx-2)/(2nx) | + ... |
= | 1 | - 1 | + x/n | - (x2-x)/(2n2) | + o((x3)/(n3)) |
= | x/n | - (x2)/(2n2) | + x/(2n2) | + o((x3)/(n3)) | |
= | x/n | - (x2)/(2n2) | + o((x3)/(n3)) | (assuming n > (n2)/(x2)) | |
= | 1/q | - 1/(2q2) | + o(1/(q3)) |
Note that 1/(q+(1/2)+o(1/q)) is also
1/q - 1/(2q2) + o(1/(q3)), so
For example, n/10 states will map to about n/10.5 states.
Corollary BUCKETS: Hashing b/q keys into b buckets is expected to hit about b/(q+(1/2)) buckets, if q < \sqrt(b).
OK. Now we can look at the original question, applying a sequence of y random mappings to all n possible states.
Theorem LOSS: After applying y random mappings gi : S -> S to all elements of S, the number of elements remaining will be about (2|S|)/y when y < \sqrt{n}.
The exact formula for xi, the number of states after i mappings, is:
x0 | = | n |
xi+1 | = | n(1-(((n-1)/n)xi)) |
Hypothesis: after i mappings we are left with (2n)/(i+o(\log i)) states. The next mapping will leave us this many states:
n(1-((n-1)/n)xi) | ||||
= | n/{(n/(xi)) | + 1/2 | + o((xi)/n)} | (by half) |
= | n/{n/(2n/(i+o(\log i))) | + 1/2 | + o({2n/(i+o(\log i))}/n)} | (by hypothesis) |
= | n/{((i+o(\log i))/2) | + 1/2 | + o( 2 /(i+o(\log i))} | |
= | 2n/{ i+o(\log i) | + 1 | + o( 4 / i)} | |
= | 2n/{(i+1) | + o(\log (i+1))} | (since \sum{i=1}n{4/i} is o(\log n)) |
The base case is satisfied because o(\log i) can be any constant, in reality 2 for i=0. By induction, the number of states left after y mappings is (2n)/(y+o(\log y)). This sequence converges to the sequence 2n/y.
QED loss
Experiments with n=65536 agree with both of these results.
Using theorem loss we can see that if the combining function does not act as a permutation on S, changing the ith out of q blocks is expected to allow only \frac{2}{q-i} of all possible final states to be produced. If |S| = 28 (as in \cite{CACMhash}, which does not have this problem), such a loss would be disasterous. But when |S| = 2128 (as in \cite{MERKLE}) this loss may be negligible.
The birthday paradox: Applying a random mapping f : S -> S to \sqrt{2|S|} values is expected to produce about 1 collision.
A common question in random number generators can be answered as a corollary of the theorems above. If a random number generator is based on a random mapping f : S -> S, what is its expected cycle length? The answer is about \sqrt{|S|}.
Theorem SQRT The median number of distinct values in the sequence fi(x), where f : S -> S is a random mapping and x is an element in S, is between \sqrt{|S|} and \sqrt{2|S|}.
By theorem birthday, applying f to \sqrt{|S|} arbitrary values in S should yield about \frac{|S|}{\sqrt{|S|}+1/2 + o(1/\sqrt{|S|})} distinct values. That is, there should be about 1/2 a collision. Similarly, mapping a subset of \sqrt{2|S|} values is expected to cause about 1 collision. (Theorem half will produce the same results.)
If no subset had more than one collision, then an expectation of 1/2 a collision would imply every subset had a 1/2 chance of a at least one collision. Some subsets have more than one collision, though, so the chances of having at least one collision is less than 1/2 when the subset has \sqrt{|S|} values.
On the other hand, a set of \sqrt{2|S|} values is expected to produce 1 collision. These values could be chosen one at a time. If a collision does occur, there should be no more than 1 collision expected in the remaining values. If we solve for x in the equation \sumi=0\sqrt{2|S|}{ixi+1} = 1, then x is a lower bound on the fraction of subsets with at least one collision. Since the solution is x = 1/2, the probability of having at least one collision is more than 1/2 when f maps a set of \sqrt{2|S|} values.
Until one collision occurs, the sequence fi(x) is a list of arbitrary values chosen from S. According to the bounds above, over half the choices of (f, x) will go \sqrt{|S|} values without a collision, and less than half will go \sqrt{2|S|} values without a collision, so the median must be somewhere in between.
QED sqrt
Hash Functions for Hash Table Lookup
Back to the Table of Contents