Hash Functions and Block Ciphers

I have a lot of material related to hashing.

Hash functions for hash table lookup

A hash function for hash table lookup should be fast, and it should cause as few collisions as possible. If you know the keys you will be hashing before you choose the hash function, it is possible to get zero collisions -- this is called perfect hashing. Otherwise, the best you can do is to map an equal number of keys to each possible hash value and make sure that similar keys are not unusually likely to map to the same value.

The standard reference for this is Knuth's "The Art of Computer Programming", volume 3 "Sorting and Searching", chapter 6.4. He recommends the hash

   for (hash=len; len--;) 
   {
     hash = ((hash<<5)^(hash>>27))^*key++;
   }
   hash = hash % prime;

Unfortunately, that hash is only mediocre. The problem is the per-character mixing: it only rotates bits, it doesn't really mix them. Every input bit affects only 1 bit of hash until the final %. If two input bits land on the same hash bit, they cancel each other out. Also, % can be extremely slow (230 times slower than addition on a Sparc).

I had an article in Dr. Dobb's Journal on hash functions for hash table lookup. A generally good, fast hash function is LOOKUP3.C. Simon Stapleton provided thumb2 assembly for lookup3.c. More recently I published SpookyHash, which is many times faster than lookup3 on 64-bit platforms, even for short keys.

If you are using a language with no shift or xor, like BASIC, try Pearson's hash.

Error Correction Codes

I have a short spiel on 1-bit error correction codes prompted by a discussion on the A-Phi-O alumni mailing list.

Error Correction Codes are checksums of length m+2d+1 that assume that no more than d of the m+2d+1 values will change. If d or less values change, the checksum and the modified text can be used to deduce what the original m values were. Values in the checksum can be among the d values changed. If more than d values changed, then you lose. It is easy to produce two documents with the same checksum (if you try).

Error correction codes, I hear, have recently become much better. Turbo codes and Gallager codes (now called Low Density Parity Codes, LDPC) easily beat the old convolution codes, Goppa codes, Viterbi codes and Reed Solomon codes, I am told.

I just noticed that you could use Gallager codes as public key encryption, by sending something random as the message then putting your real message in the noise. This is similar to McEliece's public key encryption scheme. It's also an example of a public key scheme that could be broken by finding the smallest basis for a lattice.

Checksums, to identify documents

Here's the problem. You've got two documents that are supposed to be the same, but you're not sure. One of them is somewhere inaccessible, either a remote version or a previous version of the local document, so you can't compare them directly. So you want to find a checksum for both and compare them, and if the checksums are the same then you will be convinced the documents are the same. It should be incomprehensibly rare for two documents to have the same checksum by chance (that works out to probability of 2^-128 or less). We assume that nobody is maliciously trying to produce documents with the same checksum. That same assumption holds for error correction codes, which are often more useful than simple checksums.

Cryptographic hashes produce hash values of 256 bits or greater, and are more than good enough quality for checksums. The current recommendation is SHA-3. Their main problem is they're slow, 12.5 cycles per byte.

Noncryptographic checksums are SpookyHash, and Google's CityHash. They produce 128-bit results. They're good if you have under 2ⁿ keys and can tolerate a probability of 2^2n-128 of getting at least one collision. SpookyHash is .3 cycles per byte on 64-bit platforms. CityHash is .16 cycles per byte, but only on 64-bit platforms with the SSE CRC32 instruction. Only use a noncryptographic checksum if you don't have an opponent. If you need the probability of getting any collisions at all to be less than 2^-32, um, I'd say the chance of you having an unknown opponent are greater than that.

CRCs have a place too. A CRC has some quality issues that are tolerable in many circumstances, but it has this amazing property that you can calculate CRC(A+B) (the CRC of A with B concatenated to the end of it) straight from CRC(A), CRC(B), length(A), and length(B), without actually looking at A, B. CRCs come in 32-bit, 64-bit, 128-bit variants.

One-Way Functions (cryptographic checksums)

These are like checksums, but they are designed to even thwart malicious people. A one-way function is considered broken if you can find two keys that have the same hash value. (If hash values have n bits, it should take 2ⁿ tries to find a key mapping to a given hash value, and sqrt(2ⁿ) tries to find two keys that map to some common hash value.)

The current recommendation is SHA-3 (Keccak).

The usenet newsgroup sci.crypt used to be a place where one-way functions were actively discussed.

Block Ciphers

A block cipher is a reversible function g:KxB->C, which maps a key in K and a block in B into a block in C. Usually B and C are the same set, so the block cipher permutes B in a key-specific way. There should be no way to deduce the key given any number of pairs (b,g(b)) in (B,C), and no way to deduce g(b) from b, or b from g(b), without the key. No efficient way, that is.

The usenet usegroup sci.crypt used to to discuss block ciphers.

I tried designing one of these. It has a mixing function that is called several times, then wraps a key around that. Calling the mixing function 100 times is secure. Calling the mixing function 1 time isn't secure. Somewhere in between is the smallest number of iterations that is secure. My current guess is 12. Designing block ciphers is like that. Sufficient security is easy, it's just a question of performance, and of proving security (as in, unbreakable under all known attacks) at that level of performance.

I also wrote code to find characteristics in block ciphers, choose magic constants, and test for bias in supposedly random sequences. Go here to see how to add a key to a pseudorandom permutation, making it a block cipher.

Random Number Generators (for simulation)

Pseudorandom number generators mix up a state like hash functions, but they don't take any input. They have a state of their own, and they just keep churning it. They produce random-looking sequences, which can be used as fake data.

The standard reference for this is Knuth's "The Art of Computer Programming", volume 2 "Seminumerical Algorithms", chapter 3. He recommends the random number generator

   for (i=0; i<55; ++i) rsl[i] = rsl[i] + rsl[(i+24) % 55];

although not quite as succinctly as that. It's really quite good, and it's hard to beat it for speed. Properly optimized, it takes about 8 instructions to produce each 32 bit word. The whole 32 bit word can be used, although the low-order bit is less random than the other 31 bits. Any set of 55 consecutive results has no statistically significant patterns, because every such set of results is possible.

If you have an application that is sensitive to obscure biases (like every result being the sum of the results 31 and 55 before it) then a better generator to use is ISAAC, which takes 19 instructions to produce each 32 bit word, and has no known biases. I also have a small generator, which initializes and runs faster than ISAAC and the Mersenne Twister and has only a four-word internal state, but still passes all known tests.

Here are some tests for detecting bias in sequences that are supposed to be random. The standard test suite is Marsaglia's DIEHARD (I need to add a link), but that's pretty easy to pass because it doesn't test a long enough sequence.

Random Number Generators (for stream ciphers)

A cryptographic pseudorandom number generator is a random number generator where it is intractible to deduce the generator's internal state given any number of its results. To use such a random number generator as a stream cipher, give Alice and Bob the internal state of the generator. Alice generates the pseudorandom sequence and adds it to a message which she sends. Bob generates the pseudorandom sequence and subtracts it from the message he receives. Eve, who is watching the stuff being sent, can't read the message.

Go to the newsgroup sci.crypt to discuss stream ciphers.

RC4 is a popular stream cipher. It is used in Lotus Notes and Netscape and (its name at least) is the property of RSA. ISAAC is one I wrote. I have code for ISAAC and a paper comparing ISAAC and RC4.

Why there are no perpetual motion machines
Trying to design a house
Noncolliding orbits that fill 3-space
Ye Olde Collection of Boy Scout Skits
Table of Contents