I have a lot of material related to hashing.

- Definitions and my offerings:
- My functions:
- LOOKUP3.C, for hash table lookup
- SpookyHash, for hash table lookup and noncryptographic checksums on 64-bit machines
- Maraca, a completely broken attempt at a cryptographic checksum
- A hash function for integers
- Minimal Perfect Hashing
- An in-memory Hash Table
- Musings about lock-free hash tables
- Pearson's Hash, for BASIC users
- A 1-bit error correction code
- Work in progress: two new block ciphers
- ISAAC, a random number generator
- a small noncryptographic random number generator
- RAND.C and RAND.H, which implement ISAAC.

- Tests (used to choose those functions):
- Analysis:
- Hash Functions for Hash Table Lookup
- The same paper, revised for Dr. Dobb's
- A FAQ for hash functions
- ISAAC and RC4, two stream ciphers
- A proof of the birthday paradox
- Random number results
- Orders of magnitude
- k XOR f(k XOR x) as a way to add a key to f
- Some new protocols for cryptography and security

A hash function for hash table lookup should be fast, and it should cause as few collisions as possible. If you know the keys you will be hashing before you choose the hash function, it is possible to get zero collisions -- this is called perfect hashing. Otherwise, the best you can do is to map an equal number of keys to each possible hash value and make sure that similar keys are not unusually likely to map to the same value.

The standard reference for this is Knuth's "The Art of Computer Programming", volume 3 "Sorting and Searching", chapter 6.4. He recommends the hash

for (hash=len; len--;) { hash = ((hash<<5)^(hash>>27))^*key++; } hash = hash % prime;Unfortunately, that hash is only mediocre. The problem is the per-character mixing: it only rotates bits, it doesn't really mix them. Every input bit affects only 1 bit of hash until the final %. If two input bits land on the same hash bit, they cancel each other out. Also, % can be extremely slow (230 times slower than addition on a Sparc).

I had an article in Dr. Dobb's Journal on hash functions for hash table lookup. A generally good, fast hash function is LOOKUP3.C. Simon Stapleton provided thumb2 assembly for lookup3.c. More recently I published SpookyHash, which is many times faster than lookup3 on 64-bit platforms, even for short keys.

If you are using a language with no shift or xor, like BASIC, try Pearson's hash.

I have a short spiel on 1-bit error correction codes prompted by a discussion on the A-Phi-O alumni mailing list.

**Error Correction Codes** are checksums of length m+2d+1 that
assume that no more than d of the m+2d+1 values will change. If d or
less values change, the checksum and the modified text can be used to
deduce what the original m values were. Values in the checksum can be
among the d values changed. If more than d values changed, then you
lose. It is easy to produce two documents with the same checksum (if
you try).

Error correction codes, I hear, have recently become much better. Look at McEliece's page on Turbo Codes and David MacKay's competing research on Gallager codes. These easily beat the old convolution codes, Goppa codes, Viterbi codes and Reed Solomon codes, I am told.

I just noticed that you could use Gallager codes as public key encryption, by sending something random as the message then putting your real message in the noise. This is similar to McEliece's public key encryption scheme. It's also an example of a public key scheme that could be broken by finding the smallest basis for a lattice. Algorithms for reducing bases of lattices are here.

Here's the problem. You've got two documents that are supposed to be
the same, but you're not sure. One of them is somewhere inaccessible,
either a remote version or a previous version of the local document,
so you can't compare them directly. So you want to find a checksum
for both and compare them, and if the checksums are the same then you
will be convinced the documents are the same. It should be
incomprehensibly rare for two documents to have the same checksum by
chance (that works out to probability of
2^{-128} or less). *We
assume that nobody is maliciously trying to produce documents with the
same checksum*. That same assumption holds for
error correction codes, which are often more
useful than simple checksums.

Cryptographic hashes produce hash values of 256 bits or greater, and are more than good enough quality for checksums. The current recommendation is SHA-3. Their main problem is they're slow, 12.5 cycles per byte.

Noncryptographic checksums are SpookyHash,
and Google's CityHash.
They produce 128-bit results. They're good if you have under
2^{n} keys and can tolerate a probability of 2^{2n-128}
of getting at least one collision. SpookyHash is .3 cycles per byte on
64-bit platforms. CityHash is .16 cycles per byte, but only on 64-bit
platforms with the SSE CRC32 instruction. Only use a noncryptographic
checksum if you don't have an opponent. If you need the probability of
getting any collisions at all to be less than 2^{-32}, um, I'd
say the chance of you having an unknown opponent are greater than that.

CRCs have a place too. A CRC has some quality issues that are tolerable in many circumstances, but it has this amazing property that you can calculate CRC(A+B) (the CRC of A with B concatenated to the end of it) straight from CRC(A), CRC(B), length(A), and length(B), without actually looking at A, B. CRCs come in 32-bit, 64-bit, 128-bit variants.

These are like checksums, but they are
designed to even thwart malicious people. A one-way function is
considered broken if you can find two keys that have the same hash
value. (If hash values have n bits, it should take 2^{n}
tries to find a key mapping to a given hash value, and
sqrt(2^{n}) tries to find two keys that map to some common
hash value.)

The current recommendation is SHA-3 (Keccak).

Go to the newsgroup sci.crypt to discuss one-way functions.

A **block cipher** is a reversible function g:KxB->C, which
maps a key in K and a block in B into a block in C. Usually B and C
are the same set, so the block cipher permutes B in a key-specific
way. There should be no way to deduce the key given any number of
pairs (b,g(b)) in (B,C), and no way to deduce g(b) from b,
or b from g(b), without the key. No efficient way, that is.

Go to the newsgroup sci.crypt to discuss block ciphers. You can find newsgroups through google groups.

See Wei Dai's page for source code or benchmarks for most popular block ciphers. Also see Ron Rivest's security page for more cryptographic pointers.

I tried designing one of these. It has a mixing function that is called several times, then wraps a key around that. Calling the mixing function 100 times is secure. Calling the mixing function 1 time isn't secure. Somewhere in between is the smallest number of iterations that is secure. My current guess is 12. Designing block ciphers is like that. Sufficient security is easy, it's just a question of performance, and of proving security (as in, unbreakable under all known attacks) at that level of performance.

I also wrote code to find characteristics in block ciphers, choose magic constants, and test for bias in supposedly random sequences. Go here to see how to add a key to a pseudorandom permutation, making it a block cipher.

**Pseudorandom number generators** mix up a state like hash
functions, but they don't take any input. They have a state of their
own, and they just keep churning it. They produce random-looking
sequences, which can be used as fake data.

The standard reference for this is Knuth's "The Art of Computer Programming", volume 2 "Seminumerical Algorithms", chapter 3. He recommends the random number generator

for (i=0; i<55; ++i) rsl[i] = rsl[i] + rsl[(i+24) % 55];although not quite as succinctly as that. It's really quite good, and it's hard to beat it for speed. Properly optimized, it takes about 8 instructions to produce each 32 bit word. The whole 32 bit word can be used, although the low-order bit is less random than the other 31 bits. Any set of 55 consecutive results has no statistically significant patterns, because every such set of results is possible.

If you have an application that is sensitive to obscure biases (like every result being the sum of the results 31 and 55 before it) then a better generator to use is ISAAC, which takes 19 instructions to produce each 32 bit word, and has no known biases. I also have a small generator, which initializes and runs faster than ISAAC and the Mersenne Twister and has only a four-word internal state, but still passes all known tests.

Here are some tests for detecting bias in sequences that are supposed to be random. The standard test suite is Marsaglia's DIEHARD (I need to add a link), but that's pretty easy to pass because it doesn't test a long enough sequence.

A **cryptographic pseudorandom number generator** is a random
number generator where it is intractible to deduce the generator's
internal state given any number of its results. To use such a random
number generator as a **stream cipher**, give Alice and Bob the
internal state of the generator. Alice generates the pseudorandom
sequence and adds it to a message which she sends. Bob generates the
pseudorandom sequence and subtracts it from the message he receives.
Eve, who is watching the stuff being sent, can't read the message.

Go to the newsgroup sci.crypt to discuss stream ciphers.

RC4 is a popular stream cipher. It is used in Lotus Notes and Netscape and (its name at least) is the property of RSA. ISAAC is one I wrote. I have code for ISAAC and a paper comparing ISAAC and RC4. There's also an external collection of analysis of RC4 on the web.

Why there are no perpetual
motion machines

Oracle SQL tricks

Noncolliding orbits that fill 3-space

A rather odd color chooser

Table of Contents