I have a lot of material related to hashing.
A hash function for hash table lookup should be fast, and it should cause as few collisions as possible. If you know the keys you will be hashing before you choose the hash function, it is possible to get zero collisions -- this is called perfect hashing. Otherwise, the best you can do is to map an equal number of keys to each possible hash value and make sure that similar keys are not unusually likely to map to the same value.
The standard reference for this is Knuth's "The Art of Computer Programming", volume 3 "Sorting and Searching", chapter 6.4. He recommends the hash
for (hash=len; len--;)
{
hash = ((hash<<5)^(hash>>27))^*key++;
}
hash = hash % prime;
Unfortunately, that hash is only mediocre. The problem is the
per-character mixing: it only rotates bits, it doesn't really mix
them. Every input bit affects only 1 bit of hash until the final %.
If two input bits land on the same hash bit, they cancel each other
out. Also, % can be extremely slow (230 times slower than addition on
a Sparc).
I just published an article in Dr. Dobb's Journal on hash functions for hash table lookup. A generally good, fast hash function is LOOKUP3.C. It comes with a self-test. lookup3.txt is the log produced by compiling and running the file lookup3.c on a Pentium. For 64-bit platforms I have lookup8.c, which also comes with a self-test and is even faster.
If you are using a language with no shift or xor, like BASIC, try Pearson's hash.
If your messages are aligned to word boundaries and are long (100 bytes or more), look at Rogaway's bucket hash. That can be much faster than LOOKUP2, but it only works for long aligned messages. It has lots of 4-bit funnels, but no funnels smaller than 4-bits, so it's not too bad. (The Dr. Dobb's article defines funnels. More thorough analysis, not so well edited, is in evahash.html.
I have a short spiel on 1-bit error correction codes prompted by a discussion on the A-Phi-O alumni mailing list.
Error Correction Codes are checksums of length m+2d+1 that assume that no more than d of the m+2d+1 values will change. If d or less values change, the checksum and the modified text can be used to deduce what the original m values were. Values in the checksum can be among the d values changed. If more than d values changed, then you lose. It is easy to produce two documents with the same checksum (if you try).
Error correction codes, I hear, have recently become much better. Look at McEliece's page on Turbo Codes and David MacKay's competing research on Gallager codes. These easily beat the old convolution codes, Goppa codes, Viterbi codes and Reed Solomon codes, I am told.
I just noticed that you could use Gallager codes as public key encryption, by sending something random as the message then putting your real message in the noise. This is similar to McEliece's public key encryption scheme. It's also an example of a public key scheme that could be broken by finding the smallest basis for a lattice. Algorithms for reducing bases of lattices are here and here.
Here's the problem. You've got two documents that are supposed to be the same, but you're not sure. One of them is somewhere inaccessible, either a remote version or a previous version of the local document, so you can't compare them directly. So you want to find a checksum for both and compare them, and if the checksums are the same then you will be convinced the documents are the same. It should be incomprehensibly rare for two documents to have the same checksum by chance (that works out to probability of 2-128 or less). We assume that nobody is maliciously trying to produce documents with the same checksum. That same assumption holds for error correction codes, which are often more useful than simple checksums.
Probably the way to go is to use md5sum.exe under the GNU license from cygnus. I know it's in their cdk.exe, I hear it's in the pgp package too.
I have a checksum of my own. It produces 256-bit checksums. The running time is 112+6.875n instructions to hash n bytes. Maximum instruction parallelism is 8::3. The hash has no funnels. See Evaluating Hash Functions for theory.
These are like checksums, but they are designed to even thwart malicious people. A one-way function is considered broken if you can find two keys that have the same hash value. (If hash values have n bits, it should take 2n tries to find a key mapping to a given hash value, and sqrt(2n) tries to find two keys that map to some common hash value.)
I haven't designed one of these myself yet. Examples of one-way functions include N-hash, MD5, and SHA. Bart Preneel is an expert on the design of these. Also see Wei Dai's CRYPTO++ page. MD4 has been broken so it should not be used.
Go to the newsgroup sci.crypt to discuss one-way functions.
An interesting set of papers on MACs (Message Authentication Codes, which are like cryptographic checksums that use a password) is here.
Probably the way to go is to use the free md5sum.exe from cygnus. I know it's in their cdk.exe, I hear it's in their pgp package too.
A block cipher is a reversible function g:KxB->C, which maps a key in K and a block in B into a block in C. Usually B and C are the same set, so the block cipher permutes B in a key-specific way. There should be no way to deduce the key given any number of pairs (b,g(b)) in (B,C), and no way to deduce g(b) from b, or b from g(b), without the key. No efficient way, that is.
Go to the newsgroup sci.crypt to discuss block ciphers. You can find newsgroups through DejaNews.
See Wei Dai's page for source code or benchmarks for most popular block ciphers. Also see Ron Rivest's security page for more cryptographic pointers.
I tried designing one of these. It has a mixing function that is called several times, then wraps a key around that. Calling the mixing function 100 times is secure. Calling the mixing function 1 time isn't secure. Somewhere in between is the smallest number of iterations that is secure. My current guess is 12. Designing block ciphers is like that. Sufficient security is easy, it's just a question of performance, and of proving security (as in, unbreakable under all known attacks) at that level of performance.
I also wrote code to find characteristics in block ciphers, choose magic constants, and test for bias in supposedly random sequences. Go here to see how to add a key to a pseudorandom permutation, making it a block cipher.
Pseudorandom number generators mix up a state like hash functions, but they don't take any input. They have a state of their own, and they just keep churning it. They produce random-looking sequences, which can be used as fake data.
The standard reference for this is Knuth's "The Art of Computer Programming", volume 2 "Seminumerical Algorithms", chapter 3. He recommends the random number generator
for (i=0; i<55; ++i) rsl[i] = rsl[i] + rsl[(i+24) % 55];although not quite as succinctly as that. It's really quite good, and it's hard to beat it for speed. Properly optimized, it takes about 8 instructions to produce each 32 bit word. The whole 32 bit word can be used, although the low-order bit is less random than the other 31 bits. Any set of 55 consecutive results has no statistically significant patterns, because every such set of results is possible.
If you have an application that is sensitive to obscure biases (like every result being the sum of the results 31 and 55 before it) then a better generator to use is ISAAC, which takes 19 instructions to produce each 32 bit word, and has no known biases.
Here are some tests for detecting bias in sequences that are supposed to be random.
A cryptographic pseudorandom number generator is a random number generator where it is intractible to deduce the generator's internal state given any number of its results. To use such a random number generator as a stream cipher, give Alice and Bob the internal state of the generator. Alice generates the pseudorandom sequence and adds it to a message which she sends. Bob generates the pseudorandom sequence and subtracts it from the message he receives. Eve, who is watching the stuff being sent, can't read the message.
Go to the newsgroup sci.crypt to discuss stream ciphers.
RC4 is a popular stream cipher. It is used in Lotus Notes and Netscape and (its name at least) is the property of RSA. ISAAC is one I wrote. I have code for ISAAC and a paper comparing ISAAC and RC4. There's also an external collection of analysis of RC4 on the web.
Why there are no perpetual
motion machines
Oracle SQL tricks
Noncolliding orbits that fill 3-space
A rather odd color chooser
Table of Contents