A better way to do perfect hashing for large sets is due to Botelho and Ziviani. I don't know if it can be as fast as this, but it scales way better.
Their trick: instead of constructing the whole perfect hash function at once, use a normal hash h(key) to split the 200 zillion keys into a zillion buckets with about 200 keys each. For each bucket, use something like the code below to construct a perfect hash function that maps the n keys in that bucket to 0..n-1 uniquely, where n is about 200. In addition, build an offset table with one value per bucket, where the value is the sum of the number of keys in all previous buckets. Then b = h(key), hashValue = offset[b] + perfectHashFunction[b](key) is a minimal perfect hash of the whole 200 zillion keys. All zillion buckets can construct their perfect hash functions in parallel.
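The scheme is easy to see in miniature. Below is a toy instance with integer keys 0..7 standing in for the zillions: h() splits the keys into two buckets, each bucket gets its own (here hand-picked, trivial) perfect hash, and offset[] stacks the buckets' ranges end to end. The bucket hash and per-bucket functions are stand-ins I chose by hand, not Botelho and Ziviani's construction.

```c
#include <stdio.h>

static unsigned h(unsigned key)      { return key & 1; }   /* bucket chooser */
static unsigned phash0(unsigned key) { return key >> 1; }  /* {0,2,4,6} -> 0..3 */
static unsigned phash1(unsigned key) { return key >> 1; }  /* {1,3,5,7} -> 0..3 */

static unsigned (*bucket_phash[2])(unsigned) = { phash0, phash1 };
static const unsigned offset[2] = { 0, 4 };  /* keys in all previous buckets */

/* Minimal perfect hash of the whole key set, built from the pieces. */
static unsigned whole_hash(unsigned key)
{
    unsigned b = h(key);
    return offset[b] + bucket_phash[b](key);
}

int main(void)
{
    for (unsigned key = 0; key < 8; ++key)   /* prints a bijection onto 0..7 */
        printf("key %u -> hash %u\n", key, whole_hash(key));
    return 0;
}
```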
Perfect hashing guarantees that you get no collisions at all. It is possible when you know exactly what set of keys you are going to be hashing when you design your hash function. It's popular for hashing keywords for compilers. (Perfect hashes ought to be popular for optimizing switch statements, too.) Minimal perfect hashing guarantees that n keys will map to 0..n-1 with no collisions at all.
Here is my C code for minimal perfect hashing, plus a test case.
- Makefile, standard.h
- recycle.h, recycle.c
- lookupa.h, lookupa.c
- perfect.h, perfect.c
- perfhex.c
- sample input, sample output
- testperf.c, sanity test makefile
The generator is run like so, "perfect -nm < samperf.txt", and it produces the C files phash.h and phash.c. The sanity test program, which uses the generated hash to hash all the original keys, is run like so, "foo -nm < samperf.txt".
Usage
There are options (taken by both perfect and the sanity test):
perfect [-{NnIiHhDdAa}{MmPp}{FfSs}] < key.txt
Only one of NnIiHhDdAa may be specified. N is the default. These say how to interpret the keys. The input is always a list of keys, one key per line.
Inline mode expects the user to compute the initial hash themselves; the initial hash looks like this:

```c
hash = PHASHSALT;
for (i = 0; i < keylength; ++i) {
    hash = (hash ^ key[i]) + ((hash << 26) + (hash >> 6));
}
```

Note that this can be inlined in any user loop that walks through the key anyway, eliminating the loop overhead.
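For instance, a scanner that already walks the token's characters can fold the initial hash into that same loop. A sketch, with PHASHSALT faked as a constant (the real salt would come from the generated code):

```c
#include <ctype.h>
#include <stdio.h>

#define PHASHSALT 0x9e3779b9u   /* stand-in; not real generator output */

/* Copy one alphanumeric token into buf and hash it in the same pass. */
static unsigned scan_token(const char **pp, char *buf)
{
    unsigned hash = PHASHSALT;
    const char *p = *pp;
    while (isalnum((unsigned char)*p)) {
        hash = (hash ^ (unsigned char)*p) + ((hash << 26) + (hash >> 6));
        *buf++ = *p++;          /* the walk we needed to do anyway */
    }
    *buf = '\0';
    *pp = p;
    return hash;                /* initial hash, with no second pass */
}

int main(void)
{
    const char *src = "while x";
    char tok[32];
    unsigned hash = scan_token(&src, tok);
    printf("token \"%s\" hash %08x\n", tok, hash);
    return 0;
}
```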
Hex mode takes keys that are integers written in hex, like ffffffff. In the worst case for up to 8192 keys whose values are all less than 0x10000, the perfect hash is this:
```c
hash = key + CONSTANT;
hash += (hash >> 8);
hash ^= (hash << 4);
b = (hash >> j) & 7;
a = (hash + (hash << k)) >> 29;
return a ^ tab[b];
```

...and it's usually faster than that. Hashing 4 keys takes up to 8 instructions, 3 keys up to 4, 2 keys up to 2, and for a single key the hash is always "return 0".
Switch statements could be compiled as a perfect hash (perfect -hp < hex.txt), followed by a jump into a jump table by hash value, followed by a test of the case (since non-case values could have the same hash as a case value). That would be faster than the binary tree of branches currently used by gcc and the Solaris compiler.
For example, one such input compiles into:

```c
ub1 tab[] = {0, 7, 0, 2, 3, 0, 3, 0};
hash = key ^ tab[(key << 26) >> 29];
```
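The switch idea above is easy to sketch by hand. Here is a hypothetical compilation of a four-case switch: the hash key & 3 happens to be perfect for these made-up case values, where a compiler would get a hash from something like perfect -hp.

```c
#include <stdio.h>

static const unsigned caseval[4] = { 4, 9, 18, 27 };  /* hash -> expected key */

static void do_case4(void)   { puts("case 4");  }
static void do_case9(void)   { puts("case 9");  }
static void do_case18(void)  { puts("case 18"); }
static void do_case27(void)  { puts("case 27"); }
static void do_default(void) { puts("default"); }

static void (*jumptab[4])(void) = { do_case4, do_case9, do_case18, do_case27 };

static void dispatch(unsigned key)    /* switch (key) compiled by hand */
{
    unsigned hash = key & 3;          /* perfect hash of 4, 9, 18, 27 */
    if (caseval[hash] == key)         /* non-case keys can share a hash, so test */
        jumptab[hash]();
    else
        do_default();
}

int main(void)
{
    dispatch(18);   /* case 18 */
    dispatch(5);    /* default: hash 1, but key != 9 */
    return 0;
}
```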
AB mode expects an (A,B) pair with each key, in the format aaaaaaaa bbbbbbbb. This mode does nothing but find the values of tab[]. If n is the number of keys and 2^i <= n <= 2^(i+1), then A should be less than 2^i if the hash is minimal, otherwise less than 2^(i+1). The hash is A^tab[B], or A^scramble[tab[B]] if there is a B bigger than 2048. The user must figure out how to generate (A,B). Unlike the other modes, the generator cannot rechoose (A,B) if it has problems, so the user must be prepared to deal with failure in this mode. Unlike other modes, this mode will attempt to increase smax.
Parse tables for production rules, or any static sparse tables, could be efficiently compacted using this option. Make B the row and A the column. For parse tables, B would be the state and A would be the ID for the next token. -ap is a good option.
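A toy of that layout: five live entries of a sparse 4x4 parse table packed into eight dense slots, with B the row (state) and A the column (token). The tab[] below was filled in by hand so the entries land in distinct slots; a real one would come out of -a mode.

```c
#include <stdio.h>

typedef struct { int row, col, action; } entry;

static const unsigned char tab[4] = { 0, 0, 0, 4 };  /* hand-picked, not generated */

/* Dense array indexed by col ^ tab[row]; row == -1 marks an empty slot. */
static const entry dense[8] = {
    {  1,  0, 100 },   /* (1,0) -> 0 ^ tab[1] = 0 */
    {  0,  1, 101 },   /* (0,1) -> 1 ^ tab[0] = 1 */
    {  0,  2, 102 },   /* (0,2) -> 2 */
    {  2,  3, 103 },   /* (2,3) -> 3 */
    { -1, -1,   0 },
    {  3,  1, 104 },   /* (3,1) -> 1 ^ tab[3] = 5 */
    { -1, -1,   0 },
    { -1, -1,   0 },
};

static int lookup(int row, int col)   /* returns the action, or -1 if no entry */
{
    const entry *e = &dense[(unsigned)col ^ tab[row]];
    return (e->row == row && e->col == col) ? e->action : -1;
}

int main(void)
{
    printf("%d %d %d\n", lookup(0, 2), lookup(3, 1), lookup(2, 2));  /* 102 104 -1 */
    return 0;
}
```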
Only one of MmPp may be specified. M is the default. These say whether to do a minimal perfect hash or just a perfect hash.
Only one of FfSs may be specified. S is the default.
Timings were done on a 500MHz Pentium with 128MB of RAM, and the times are really counts of cursor blinks, not seconds. ispell.txt is a list of English words that comes with EMACS. mill.txt was a million keys, where each key was three random 4-byte numbers in hex. tab[] is always an array of 1-byte values. Normally I use a 166MHz machine with 32MB of RAM, but a million keys died thrashing virtual memory on that.
Usage | number of keys | Generation time (in seconds) | tab[] size | minimal? |
---|---|---|---|---|
perfect < samperf.txt | 58 | 0 | 64 | yes |
perfect -p < samperf.txt | 58 | 0 | 32 | no |
perfect < ispell.txt | 38470 | 11 | 16384 | yes |
perfect -p < ispell.txt | 38470 | 4 | 4096 | no |
perfect < mill.txt | 1000000 | 65 | 524288 | yes |
perfect -p < mill.txt | 1000000 | 100 | 524288 | no |
The perfect hash algorithm I use isn't a Pearson hash. My perfect hash algorithm uses an initial hash to find a pair (A,B) for each keyword, then it generates a mapping table tab[] so that A^tab[B] (or A^scramble[tab[B]]) is unique for each keyword. The size of tab[] is always a power of two. When tab[] has 4096 or more entries, scramble[] is used and tab[] holds 1-byte values. scramble[] is always 256 values (2-byte or 4-byte values, depending on the size of hash values).
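A hand-sized instance of that final form, with four keys whose (A,B) pairs and tab[] I picked by hand so the arithmetic can be checked; the real generator searches for tab[] itself.

```c
#include <stdio.h>

static const unsigned A[4] = { 0, 0, 1, 3 };        /* initial-hash outputs, */
static const unsigned B[4] = { 0, 1, 1, 2 };        /* one (A,B) per key     */
static const unsigned char tab[4] = { 0, 2, 2, 0 }; /* hand-filled mapping table */

int main(void)
{
    for (int k = 0; k < 4; ++k)   /* prints 0 2 3 1: minimal and perfect */
        printf("key %d -> %u\n", k, A[k] ^ tab[B[k]]);
    return 0;
}
```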
I found the idea of generating (A,B) from the keys in "Practical minimal perfect hash functions for large databases", Fox, Heath, Chen, and Daoud, Communications of the ACM, January 1992. (Dean Inada pointed me to that article shortly after I put code for a Pearson-style hash on my site.)
Any specific hash function may or may not produce a distinct (A,B) for each key. There is some probability of success. If the hash is good, the probability of success depends only on the size of the ranges of A and B compared to the number of keys. So the initial hash for this algorithm actually must be a set of independent hash functions ("universal hashing"). Different hashes are tried from the set until one is found which produces distinct (A,B). A probability of success of .5 is easy to achieve, but smaller ranges for B (which imply smaller probabilities of success) allow a smaller tab[].
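A sketch of that retry loop under a stand-in hash family (initial_hash below is an invented salted mixer, not the generator's lookup()):

```c
#include <stdio.h>

#define NKEYS 3

/* Salted stand-in for the initial hash: the salt selects one member
   of the family, and (a,b) are carved out of the mixed bits. */
static void initial_hash(const char *key, unsigned salt,
                         unsigned *a, unsigned *b)
{
    unsigned h = salt;
    for (; *key; ++key)
        h = (h ^ (unsigned char)*key) * 0x9e3779b1u;
    *a = h & 7;
    *b = (h >> 3) & 7;
}

int main(void)
{
    const char *keys[NKEYS] = { "if", "else", "while" };
    unsigned a[NKEYS], b[NKEYS];

    for (unsigned salt = 0; ; ++salt) {   /* try hashes until (A,B) are distinct */
        int distinct = 1;
        for (int i = 0; i < NKEYS; ++i)
            initial_hash(keys[i], salt, &a[i], &b[i]);
        for (int i = 0; i < NKEYS && distinct; ++i)
            for (int j = i + 1; j < NKEYS; ++j)
                if (a[i] == a[j] && b[i] == b[j]) { distinct = 0; break; }
        if (distinct) { printf("salt %u works\n", salt); break; }
    }
    return 0;
}
```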
The different input modes use different initial hashes. Normal mode uses my general hash lookup() (42+6n instructions), or checksum() if there are more than 2^18 keys. Inline mode requires the user to compute the initial hash themselves, preferably as part of tokenizing the key in the first place (to eliminate the loop overhead and the cost of fetching the characters in the key). Hex mode, which takes integer keys, does a brute-force search to find how little mixing it can get away with. AB mode gets the (A,B) pairs from the user, giving the user complete control over initial mixing.
The final hash is always A^tab[B] or A^scramble[tab[B]]. scramble[] is initialized with random distinct values up to smax, the smallest power of two greater than or equal to the number of keys. The trick is to fill in tab[]. Multiple keys may share the same B, and so share the same tab[B]. The elements of tab[] are handled in descending order by the number of keys they hold.
Finding values for tab[] such that A^tab[B] causes no collisions is known as the "sparse matrix compression problem", which is NP-complete. ("Computers and Intractability, A Guide to the Theory of NP-Completeness", Garey & Johnson, 1979.) Like most NP-complete problems, there are fast heuristics for getting reasonable (but not optimal) solutions. The heuristic I use involves spanning trees.
Spanning trees and augmenting paths (with elements of tab[] as nodes) are used to choose values for tab[b] and to rearrange existing values in tab[] to make room. The element being added is the root of the tree.
Spanning trees imply a graph with nodes and edges, right? The nodes are the elements of tab[]. Each element has a list of keys (tab[x] has all the keys with B=x), and needs to be assigned a value (the value for tab[x] when A^tab[B] is computed). Keys are added to the perfect hash one element of tab[] at a time. Each element may contain many keys. For each possible value of tab[x], we see what that value causes the keys to collide with. If the keys collide with keys in only one other element, that defines an edge from the other element pointing back to this element. If the keys collide with nothing, that's a leaf, and the augmenting path follows that leaf along the nodes back to the root.
If an augmenting path is found and all the nodes in it have one key apiece, it is guaranteed the augmenting path can be applied. Changing the leaf makes room for its parent's key, and so forth, until room is made for the one key being mapped. (If the element to be mapped has only one key, and there is no restriction on the values of tab[], augmenting paths aren't needed. A value for tab[B] can be chosen that maps that key directly to an open hash value.)
If the augmenting path contains nodes with multiple keys, there is no guarantee the augmenting path can be applied. Moving the keys for the leaf will make room for the keys of the parent that collided with that leaf originally, but there is no guarantee that the other keys in the leaf and the parent won't collide. Empirically, this happens about once per ten minimal perfect hashes generated. There must be code to handle rolling back an augmenting path that runs into this, but it's not worthwhile trying to avoid the problem.
A possible strategy for finding a perfect hash is to accept the first value tried for tab[x] that causes tab[x]'s keys to collide with keys in zero other already-mapped elements of tab[]. ("Collide" means some key in this element has the same hash value as some key in the other element.) This strategy is called "first fit descending" (recall that elements are tried in descending order of number of keys).
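A minimal sketch of first fit descending over hand-made (A,B) pairs. It has no spanning trees, so it simply gives up where the real generator would rearrange tab[] instead.

```c
#include <stdio.h>

#define NKEYS 5
#define RANGE 8   /* hash range and tab[] size, a power of two */

static const unsigned A[NKEYS] = { 0, 1, 2, 3, 3 };  /* invented (A,B) pairs */
static const unsigned B[NKEYS] = { 0, 0, 0, 1, 2 };

int main(void)
{
    unsigned char tab[RANGE] = { 0 };
    int used[RANGE] = { 0 }, count[RANGE] = { 0 }, placed[RANGE] = { 0 };

    for (int k = 0; k < NKEYS; ++k)
        count[B[k]]++;

    for (;;) {
        int b = -1;   /* biggest group of keys not yet placed */
        for (int i = 0; i < RANGE; ++i)
            if (count[i] && !placed[i] && (b < 0 || count[i] > count[b]))
                b = i;
        if (b < 0) break;   /* every group placed: the hash is perfect */

        int ok = 0;
        for (int t = 0; t < RANGE && !ok; ++t) {  /* first value with no collision */
            ok = 1;
            for (int k = 0; k < NKEYS; ++k)
                if (B[k] == (unsigned)b && used[A[k] ^ t]) { ok = 0; break; }
            if (ok) {
                tab[b] = (unsigned char)t;
                for (int k = 0; k < NKEYS; ++k)
                    if (B[k] == (unsigned)b) used[A[k] ^ t] = 1;
            }
        }
        if (!ok) { puts("first fit failed; the real code could still succeed"); return 1; }
        placed[b] = 1;
    }

    for (int k = 0; k < NKEYS; ++k)
        printf("key %d -> %u\n", k, A[k] ^ tab[B[k]]);
    return 0;
}
```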
A second, more sophisticated, strategy would allow the keys of tab[x] to collide with zero or one other elements in tab[]. If it collides with one other element, the problem changes to mapping that one other element. There is no upper bound on the running time of this strategy.
The use of spanning trees and augmenting paths is almost as powerful as the second strategy, and it is guaranteed to terminate within O(n^n) time (per element mapped). Is it faster or better on average? I don't know. I would guess it is. Spanning trees can ignore already-explored nodes, while the random jumping method doesn't.
How much do spanning trees help compared to first fit descending? First consider the case where tab[] values can be anything: minimal perfect hashes need tab[] 31% bigger and perfect hashes need tab[] 4% bigger (on average) if multikey spanning trees aren't used. Next consider the case where tab[] values must be one of 256 values: minimal perfect hashes cannot be found at all, and perfect hashes need tab[] 15% bigger on average.
It turns out that restricting the size of A in A^tab[B] is also a good way to limit the size of tab[].
Pointers to other implementations of perfect hashing
A minimal perfect Pearson hash looks like this:
```c
hash = 0;
for (i = 0; i < len; ++i)
    hash = tab[(hash + key[i]) % n];
```

It is almost always faster than my perfect hash in -i mode, by 0..3 cycles per character (mostly depending on whether your machine has a barrelshift instruction).
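A tiny runnable instance, with a table found by hand for two keys (the keys and tab[] are mine, picked so the hash works out):

```c
#include <stdio.h>
#include <string.h>

#define N 2                              /* number of keys */
static const unsigned char tab[N] = { 0, 1 };

static unsigned pearson(const char *key)
{
    unsigned hash = 0;
    size_t len = strlen(key);
    for (size_t i = 0; i < len; ++i)
        hash = tab[(hash + (unsigned char)key[i]) % N];
    return hash;
}

int main(void)
{
    /* "aa" -> 0 and "ab" -> 1: a minimal perfect hash of these two keys */
    printf("aa -> %u, ab -> %u\n", pearson("aa"), pearson("ab"));
    return 0;
}
```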
A minimal perfect hash maps n keys to a range of n elements with no collisions. A perfect hash maps n keys to a range of m elements, m>=n, with no collisions. If perfect hashing is implemented as a special table for Pearson's hash (the usual implementation), minimal perfect hashing is not always possible, with probabilities given in the table below. For example, the two binary strings {(0,1),(1,0)} are not perfectly hashed by the table [0,1] or the table [1,0], and those are all the choices available. For sets of 8 or more elements, the chance that no minimal perfect hash exists is negligible, specifically (1 - n!/n^n)^n!. Even if minimal perfect hashes do exist, finding one may be an intractable problem.
Number of elements | Chance that no minimal perfect hash exists |
---|---|
2 | .25 |
3 | .2213773 |
4 | .0941787 |
5 | .0091061 |
6 | .0000137 |
7 | .0000000000000366 |
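Those probabilities can be recomputed from the formula; a few lines of C, assuming the reading (1 - n!/n^n)^n!:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    for (int n = 2; n <= 7; ++n) {
        double fact = 1.0, power = 1.0;
        for (int i = 2; i <= n; ++i) fact *= i;   /* n!  */
        for (int i = 0; i < n; ++i) power *= n;   /* n^n */
        printf("%d  %.7g\n", n, pow(1.0 - fact / power, fact));
    }
    return 0;
}
```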
Although Pearson-style minimal perfect hashes do not always exist, minimal perfect hashes always exist. For example, the hash which stores a sorted table of all keywords, where the location of each keyword is its hash, is a minimal perfect hash of n keywords into 0..n-1. This sorted-table minimal perfect hash might not even be slower than Pearson perfect hashing, since the sorted-table hash must do a comparison per bit of key, while the Pearson hash must do an operation for every character of the key.
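For instance, a sketch of that sorted-table hash for five keywords:

```c
#include <stdio.h>
#include <string.h>

static const char *keyword[] = { "case", "else", "if", "switch", "while" };
#define NKEY (sizeof keyword / sizeof keyword[0])

/* Binary-search the sorted table; the index found is the hash.
   Returns -1 when the key is not a keyword at all. */
static int sorted_table_hash(const char *key)
{
    int lo = 0, hi = (int)NKEY - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = strcmp(key, keyword[mid]);
        if (c == 0) return mid;
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}

int main(void)
{
    printf("%d %d %d\n", sorted_table_hash("case"),
           sorted_table_hash("while"), sorted_table_hash("for"));  /* 0 4 -1 */
    return 0;
}
```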
Zbigniew J. Czech, an academic researcher of perfect hash functions