I have a lot of material related to hashing.
A hash function for hash table lookup should be fast, and it should cause as few collisions as possible. If you know the keys you will be hashing before you choose the hash function, it is possible to get zero collisions -- this is called perfect hashing. Otherwise, the best you can do is to map an equal number of keys to each possible hash value and make sure that similar keys are not unusually likely to map to the same value.
The standard reference for this is Knuth's "The Art of Computer Programming", volume 3 "Sorting and Searching", chapter 6.4. He recommends the hash
for (hash=len; len--;)
{
hash = ((hash<<5)^(hash>>27))^*key++;
}
hash = hash % prime;
Unfortunately, that hash is only mediocre. The problem is the
per-character mixing: it only rotates bits, it doesn't really mix
them. Every input bit affects only 1 bit of hash until the final %.
If two input bits land on the same hash bit, they cancel each other
out. Also, % can be extremely slow (230 times slower than addition on
a Sparc).
I had an article in Dr. Dobb's Journal on hash functions for hash table lookup. A generally good, fast hash function is LOOKUP3.C. Simon Stapleton provided thumb2 assembly for lookup3.c. More recently I published SpookyHash, which is many times faster than lookup3 on 64-bit platforms, even for short keys.
If you are using a language with no shift or xor, like BASIC, try Pearson's hash.
I have a short spiel on 1-bit error correction codes prompted by a discussion on the A-Phi-O alumni mailing list.
Error Correction Codes are checksums of length m+2d+1 that assume that no more than d of the m+2d+1 values will change. If d or less values change, the checksum and the modified text can be used to deduce what the original m values were. Values in the checksum can be among the d values changed. If more than d values changed, then you lose. It is easy to produce two documents with the same checksum (if you try).
Error correction codes, I hear, have recently become much better. Look at McEliece's page on Turbo Codes and David MacKay's competing research on Gallager codes. These easily beat the old convolution codes, Goppa codes, Viterbi codes and Reed Solomon codes, I am told.
I just noticed that you could use Gallager codes as public key encryption, by sending something random as the message then putting your real message in the noise. This is similar to McEliece's public key encryption scheme. It's also an example of a public key scheme that could be broken by finding the smallest basis for a lattice. Algorithms for reducing bases of lattices are here.
Here's the problem. You've got two documents that are supposed to be the same, but you're not sure. One of them is somewhere inaccessible, either a remote version or a previous version of the local document, so you can't compare them directly. So you want to find a checksum for both and compare them, and if the checksums are the same then you will be convinced the documents are the same. It should be incomprehensibly rare for two documents to have the same checksum by chance (that works out to probability of 2-128 or less). We assume that nobody is maliciously trying to produce documents with the same checksum. That same assumption holds for error correction codes, which are often more useful than simple checksums.
Cryptographic hashes produce hash values of 256 bits or greater, and are more than good enough quality for checksums. The current recommendation is SHA-3. Their main problem is they're slow, 12.5 cycles per byte.
Noncryptographic checksums are SpookyHash,
and Google's CityHash.
They produce 128-bit results. They're good if you have under
2n keys and can tolerate a probability of 22n-128
of getting at least one collision. SpookyHash is .3 cycles per byte on
64-bit platforms. CityHash is .16 cycles per byte, but only on 64-bit
platforms with the SSE CRC32 instruction. Only use a noncryptographic
checksum if you don't have an opponent. If you need the probability of
getting any collisions at all to be less than 2-32, um, I'd
say the chance of you having an unknown opponent are greater than that.
CRCs have a place too. A CRC has some quality issues that are tolerable
in many circumstances, but it has this amazing property that you can
calculate CRC(A+B) (the CRC of A with B concatenated to the end of it)
straight from CRC(A), CRC(B), length(A), and length(B), without actually
looking at A, B. CRCs come in 32-bit, 64-bit, 128-bit variants.
These are like checksums, but they are
designed to even thwart malicious people. A one-way function is
considered broken if you can find two keys that have the same hash
value. (If hash values have n bits, it should take 2n
tries to find a key mapping to a given hash value, and
sqrt(2n) tries to find two keys that map to some common
hash value.)
The current recommendation is SHA-3 (Keccak).
Go to the newsgroup
sci.crypt to
discuss one-way functions.
A block cipher is a reversible function g:KxB->C, which
maps a key in K and a block in B into a block in C. Usually B and C
are the same set, so the block cipher permutes B in a key-specific
way. There should be no way to deduce the key given any number of
pairs (b,g(b)) in (B,C), and no way to deduce g(b) from b,
or b from g(b), without the key. No efficient way, that is.
Go to the newsgroup sci.crypt to discuss block ciphers. You can
find newsgroups through
google groups.
See Wei Dai's page for source code or benchmarks for most
popular block ciphers. Also see
Ron Rivest's security page for more cryptographic pointers.
I tried designing one of
these. It has a mixing function that is called several times,
then wraps a key around that. Calling the mixing function 100 times
is secure. Calling the mixing function 1 time isn't secure.
Somewhere in between is the smallest number of iterations that is
secure. My current guess is 12. Designing block ciphers is like
that. Sufficient security is easy, it's just a question of
performance, and of proving security (as in, unbreakable under all
known attacks) at that level of performance.
I also wrote code to find
characteristics in block ciphers, choose
magic constants, and test for
bias in supposedly random sequences. Go here to see how to add a key to a
pseudorandom permutation, making it a block cipher.
Pseudorandom number generators mix up a state like hash
functions, but they don't take any input. They have a state of their
own, and they just keep churning it. They produce random-looking
sequences, which can be used as fake data.
The standard reference for this is Knuth's "The Art of Computer
Programming", volume 2 "Seminumerical Algorithms", chapter 3. He
recommends the random number generator
If you have an application that is sensitive to obscure biases (like
every result being the sum of the results 31 and 55 before it) then a
better generator to use is ISAAC,
which takes 19 instructions to produce each 32 bit word, and has no
known biases. I also have a small
generator, which initializes and runs faster than ISAAC and the
Mersenne Twister and has only a four-word internal state, but still
passes all known tests.
Here are some tests for
detecting bias in sequences that are supposed to be random. The
standard test suite is Marsaglia's DIEHARD (I need to add a link), but
that's pretty easy to pass because it doesn't test a long enough
sequence.
A cryptographic pseudorandom number generator is a random
number generator where it is intractible to deduce the generator's
internal state given any number of its results. To use such a random
number generator as a stream cipher, give Alice and Bob the
internal state of the generator. Alice generates the pseudorandom
sequence and adds it to a message which she sends. Bob generates the
pseudorandom sequence and subtracts it from the message he receives.
Eve, who is watching the stuff being sent, can't read the message.
Go to the newsgroup sci.crypt to discuss stream ciphers.
RC4 is a popular stream cipher. It is used in Lotus Notes and
Netscape and (its name at least) is the property of RSA. ISAAC is one
I wrote. I have code for ISAAC
and a paper comparing ISAAC and RC4.
There's also an external
collection of analysis of RC4 on the web.
Why there are no perpetual
motion machines
One-Way Functions (cryptographic checksums)
Block Ciphers
Random Number Generators (for simulation)
for (i=0; i<55; ++i) rsl[i] = rsl[i] + rsl[(i+24) % 55];
although not quite as succinctly as that. It's really quite good, and
it's hard to beat it for speed. Properly optimized, it takes about 8
instructions to produce each 32 bit word. The whole 32 bit word can
be used, although the low-order bit is less random than the other 31
bits. Any set of 55 consecutive results has no statistically
significant patterns, because every such set of results is possible.
Random Number Generators (for stream ciphers)
Oracle SQL tricks
Noncolliding orbits that fill 3-space
A rather odd color chooser
Table of Contents