Work in progress: aiming at SHA-3 candidate

NIST is having a competition to define a new cryptographic hash to replace SHA-2 (they'll dub it SHA-3). I think I'll enter, if I can come up with something worthwhile. This page is a work-in-progress of about the hash I'm currently considering. I'm weak on hardware and hash cryptanalysis, so I expect I'll be revising these hashes considerably as I learn more.

My first attempt, nandhash.c (Feb 10 2008), isn't crypto at all, it's just attempting to do an efficient hardware hash. It hashed 128 bits arranged in eight 4x4 blocks, with alternating 4x4 bit mixes and XORs between bits in neighboring blocks. The 4x4 bit mix consisted of an LFSR of 8 of the bits (3 or 4 XORs per bit), and (~s)^((w&x)|(y&z)) for the remaining 8 bits (for some w,x,y,z chosen from the preceding bits). In software, the 8 4x4 blocks can be mixed simultaneously by storing each of the 16 bits of each in 16 8-bit words, then manipulating the words.

NIST wants at least a 512-bit hash value. NIST also wants the ability to parameterize the hash with a 512-bit secret key. That dictates at least a 512-bit state and 512-bit key. I'm guessing I'll do this with a 1024-bit state, with a 1024-bit mix, where I XOR 512 bits of message to one side and 512 bits of key to the other side, then mix, with some setup and wrapup procedure. I expect to hash 220 byte subsequences, then hash the 1024-bit states of the subsequences if messages are longer than that, so that the hash can make use of massive parallelism for large streams. Or perhaps I should only XOR 128 bits of message and key at a time, but 4x as often?

One attack on crypto hashes is to find x that produces pattern y in reverse and pattern z forward. Then if you start with pattern y, run it forward you'll get x, then run it forward more you'll get z. So if it takes m rounds backwards to achieve avalanche, and n rounds forward, then any less than m+n rounds is clearly not good enough for a crypto hash. The right number of rounds is probably double or triple that, but m+n is the largest number of rounds I can easily show is no good. It's a good metric for comparing mixing functions.

The next hash, nisthash1.c (Feb 11 2008), was the same 4x4 mix, but there were 8x8 4x4 blocks (1024 bits in all). The 4x4 mix is now done in software with 64-bit words instead of 8-bit words, since there are now 64 4x4 blocks instead of 8 4x4 blocks. Horizontal and vertical mixing XORed all 8 corresponding horizontal and vertical bits, then XORed that again to the original bit. I tested one-bit diffs against random bases. It took 5 rounds forward and 5 rounds backward to achieve avalanche. The 4x4 mix is 6 nands deep, and the XOR is 9 nands deep, so (5+5)*(6+9)=150 nands deep minimum.

Nisthash1.c did awful if the initial state was mostly zero or mostly one. It also had symmetry between all the 4x4 blocks, which was very bad forever when the initial state was symmetric.

Next came the imaginatively named nisthash2.c (Feb 18 2008). This broke symmetry by complementing s in (~s)^((w&x)|(y&z)) only half the time. In software that was accomplished by XORing a constant to s. I chose "half the time" in a very regular way that guaranteed each 4x4 mix differed from every other one in at least 2 ways. Randomly chosen constants worked better than that very regular way. But NIST wants the design decisions to be fully explainable. I expect I'll eventually resort to randomly chosen constants, randomly generated by my small state prng with some seed. I've been using that a lot for hillclimbing. Also in this nisthash2.c, I allowed w,x,y,z to be optionally complemented. I tested this with almost-all-zero and almost-all-one states, as well as with random states. It took 6 rounds backwards and 6 rounds forwards for avalanche, so (6+6)*(6+9) = 180 nands deep minimum.

The next, nisthash3.c (Feb 19 2008), was basically the same as nisthash2.c, but it had a slightly better mixing function. I noticed that although I allowed the top eight bits to nonlinear mix any bit before them, the best functions usually used only the first eight bits, so in nisthash3.c I added that restriction, w,x,y,z had to come from the first eight bits. About the same time I notice a bug in my hillclimber that prevented it from using the previous bit, which might also explain why the first eight bits were favored. Still (6+6)*(6+9) = 180 nands deep minimum.

Several people have told me that wire length doesn't matter much, it's nand depth I should pay attention to. I don't have a hardware simulator to tell me one way or the other. But if wire length is irrelevant, perhaps I should eliminate the XORs in horizontal and vertical mixing. nisthash4.c (Feb 22 2008) does that. I also realized that the s side of s^((w&x)|(y&z)) isn't many nands deep, so I could get more mixing in by doing (s^t)^((x&x)|(y&z)). So I do an LFSR with 1 or 2 taps for the top 8 bits, as well as an LFSR with 3 or 4 taps for the bottom 8 bits. To keep this reversible I could only use w,x,y,z from the bottom 8 bits. 6 rounds backwards or 8 rounds forward for avalanche, so (6+8)*(6) = 84 nands deep minimum. Unfortunately this got rid of my symmetry-breaking, and the hash is horribly symmetric again.

Is 84 nands deep good? What's the best possible? XOR is 3 nands deep, and you need a 10-level tree to combine 1024 bits, so XOR trees forward and backwards would be (10+10)*3 = 60 nands deep. At 3 nands deep, XOR could be combining 8 variables, but it's only combining 2. If you look at (s^t)^~(~(w&x)&~(y&z)), the XORs are giving you avalanche, and the rest is giving you nonlinearity. It's 6 nands deep. Is there some sort of permutation that could give you nonlinearity and avalanche simultaneously that's less than 6 nands deep? I don't know. I don't think anybody has checked.

I wrote a program, balancenand.c (March 7 2008), to come up with all balanced binary functions that are n nands deep or less. After it generates the truth tables, I run sort and uniq on them. There are 116 distinct truth-tables for balanced binary functions 3 nands deep or less (none uses more than 4 variables), and 2,094,862 distinct truth-tables for balanced binary functions 4 nands deep or less using 6 variables or less. None of them do avalanche quite as well as XOR, but some of the 5-variable functions come close. 4 nands deep, 7 variables or less, has 26,771,332 distinct truth tables.

Another program, nandperm.c (March 13 2008), reads in a file of truth tables and randomly hillclimbs trying to build a permutation out of them. It picks one bit to replace, checks if a new balanced function can replace it, and if it can it also checks if the new balanced function can extend the permutation without replacing it. Once a whole permutation is made, it continues, replacing one bit function at a time, walking through many related permutations. I don't know if the graph of permutations is connected that way. For example (March 17 2008), here is its report for a 4-nand-deep permutation of 16 bits:

4   0, 1, 5,14   1111000100011100
4   2, 4, 6, 8   1010101001010110
4   1, 4, 6, 7   1111000000011110
4   2, 4, 6, 8   1111000100001110
5   0, 1, 5,12,14   10001001100011111001100110011001
5   5, 9,10,11,15   00000000111111110111011100001010
4   0, 1, 4, 5   1111000000011110
4   0, 1, 3, 6   1100001111010010
5   5, 9,10,11,15   00001111000011110101001011110010
5   5, 9,10,11,15   00110011001100110100110001101110
5   5, 9,12,13,15   11110000000010111111010100001011
4   0, 5,12,14   1010010110110010
5   0, 1, 5,12,14   00011100000111000001111011011110
4   0, 3, 4, 7   1010101101010100
5   5, 9,12,13,15   11110001111101010000101000001011
5   0, 1, 5,12,14   11111111000000000001010110101011
Here are a few more (March 20 2008).

(March 15 2008) It turns out 3-way XOR can be implemented in a tree only 5 NANDs deep. Does this generalize to N-way XOR being 2*ceil(log2(N))+1 NANDs deep, rather than 3*ceil(log2(N)) NANDs deep? If it does, does this require an exponential number of gates? I'm not sure yet. I wish I knew whether 2-way NANDs or 2-way NORs or 3-way NANDs or 4-way NORs or whatever were the standard component for circuits.

(March 16 2008) To measure the avalanche of a 0/1 function, for every input value, flip each input bit and add to a total if the the result changes, then divide the total by the number of input values. The maximum avalanche for an n-bit 0/1 function is n, and it is achieved by the XOR of those n bits. Consider a function !(x&y), where x and y are themselves 0/1 functions. What can we say about x and y if !(x&y) is balanced and has avalanche > Z? Some zeros come from x, some from y, some from both x and y. If x has (.5-q) zeros, then y must contribute exactly q zeros that aren't shared by x (note q is a fraction of the total, not a raw count). The avalanche of !(x&y) is at most the sum of the avalanche of x and the avalanche of y. And this is achieved only if x contributes all its zeros and y also contributes all its zeros, which is only possible when the zeros of x and y sum to .5. Suppose x is .5-q zeros but y is r zeros, where r > q. It's reasonable to expect at most q/r of the avalanche from y to be contributed to !(x&y), total avalanche probably under (avalanche x) + (q/r)(avalanche y). The avalanche in y might be concentrated in a subsets of the zeros, so we can do better, but if we're picking functions that have unusually high avalanche, this is unlikely. If the input bits that x and y depend on don't overlap, for !(x&y) to be balanced, we need (ones in x)*(ones in y) = .5, so both are sqrt(.5) on average. Similarly avalanche would be sqrt(.5)(avalanche in x + avalanche in y).

(March 22 2008) 2n-way XOR can be implemented only 2n+1 NANDs deep. The 1 is to turn all inputs into themselves and their complements: (a,!a) is implemented as (a, !(a&a)). Given (a,A),(b,B), where a and b are each the XOR of n variables and A,B are their complements, ((a^b),!(a^b)) is implemented as (!(!(a&B)&!(A&b)), !(!(a&b)&!(A&B))). This means XOR is quite a bit more competitive than I thought it was at achieving avalanche. My surveys so far find nothing 4-deep or 5-deep that approaches the ability of XOR to achieve avalanche. An XOR of 4 variables is 5 deep and has an avalanche of 4. The best 5-deep non-XOR I've found so far uses 5 variables and has avalanche of 3.5, 10100011010111000101110010100011. It's clearly a variant on XOR. XOR increases avalanche 1.4142x per level. Random trees with the golden ratio of bits set (which causes !(a&b) to also have the golden ratio of bits set) increases avalanche by about 1.236x per level, well short of XOR. I'll search 5-deep balanced functions a bit more, but at the moment I believe that Feistel networks are about the optimal way to achieve avalanche out of all possible NAND trees.

(March 27 2008) I realized that hillclimbing to find an initial 16-bit permutations (which takes days) can be skipped by instead initializing the hillclimber with two 8-bit permutations (which takes a few seconds). Here's a 5-deep 16-bit permutation where every balanced function has avalanche between 3.25 and 4.0. (There are no 6+ variable balanced functions that are 5-deep with avalanche greater than 3.125.) I'm not claiming it's a particularly good permutation, it's just a permutation.

1, 3, 4, 5, 6   10111000010001110100011110111000  (((1 (1 2)) ((0 0) (1 1))) ((3 4) ((3 3) (4 4))))  (((0 (0 1)) (1 2)) ((3 (3 4)) (4 (3 3))))
1, 3, 4, 5, 6   10001011011101000111010010001011  (((0 (0 1)) (1 (1 2))) ((3 (3 4)) (4 (3 3))))  (((1 2) ((0 0) (1 1))) ((3 4) ((3) (4 4))))
0, 2, 3, 7, 9   11100100000110110001101111100100  (((0 1) (2 (0 0))) ((3 (3 4)) (4 (3 3))))  (((0 (0 1)) ((0 0) (2 2))) ((3 4) ((3 3) (4 4))))
0, 3, 5, 7   1001011001101001  (((0 1) ((0 0) (1 1))) ((2 3) ((2 2) (3 3))))  (((0 (1 1)) ((0 0) 1)) ((2 (3 3)) ((2 2) 3)))
3, 4, 5, 6,13   10010110100101100110101101100001  (((0 (1 4)) (2 (3 4))) ((1 (0 2)) (4 (0 2))))  (((0 2) ((0 0) (2 2))) ((1 3) ((1 1) (3 3))))
2, 5,10,12,13   10100101010110101001011010010110  (((0 2) ((0 0) (2 2))) ((4 (1 1)) ((1 4) (3 3))))  (((0 (0 2)) (2 (0 0))) ((1 4) (3 (3 4))))
0, 4, 5, 6   1001011001101001  (((0 1) ((0 0) (1 1))) ((2 3) ((2 2) (3 3))))  (((0 (1 1)) ((0 0) 1)) ((2 (3 3)) ((2 2) 3)))
0, 3, 4, 6   1001011001101001  (((0 1) ((0 0) (1 1))) ((2 3) ((2 2) (3 3))))  (((0 (1 1)) ((0 0) 1)) ((2 (3 3)) ((2 2) 3)))
1, 3, 5, 6, 7   10101001010101100110010110011010  (((0 3) ((0 0) (3 3))) ((1 (1 2)) ((1 1) (4 4))))  (((0 (0 3)) (3 (0 0))) ((1 2) (4 (1 1))))
2, 4, 8, 9,12   10100101010110101001011010010110  (((0 2) ((0 0) (2 2))) ((4 (1 1)) ((1 4) (3 3))))  (((0 (0 2)) (2 (0 0))) ((1 4) (3 (3 4))))
4, 6,11,12,15   10011001011001100110001011011001  (((0 (1 4)) (3 (1 4))) ((1 (0 3)) (4 (0 2))))  (((0 2) ((0 0) (2 2))) ((1 3) ((1 1) (3 3))))
0, 4, 7,12,15   10010110110111100010000101101001  (((0 (0 2)) (2 (0 0))) ((1 (1 4)) (3 (1 1))))  (((0 2) ((0 0) (2 2))) ((1 3) ((1 1) (3 3))))
0, 4,11,12,14   10010010011000011101011011101001  (((0 3) ((0 0) (3 3))) ((1 2) ((1 1) (2 2))))  (((0 (0 3)) (3 (0 0))) ((1 (1 2)) (4 (1 1))))
1, 4, 5, 6,13   00011010001001011011010101111010  (((0 (0 3)) (3 (0 0))) ((1 (1 2)) (2 4)))  (((0 2) ((0 0) (2 2))) ((1 (1 3)) (3 (1 1))))
8, 9,11,13,14   00111100001011011100001111010010  (((0 3) ((0 2) (1 1))) ((2 (2 4)) (4 (2 2))))  (((1 (0 3)) (2 (2 4))) ((2 4) ((1 2) (1 4))))
2, 4, 6, 7,10   10010110011011011001011001100001  (((0 (1 3)) (2 (1 3))) ((1 (0 2)) (3 (2 4))))  (((0 2) ((0 0) (2 2))) ((1 3) ((1 1) (3 3))))
I suspect that this approach can't be competitive in software with table-based hashes, though it's likely faster in hardware. Table-based hashes take advantage of the multiple layers of gates that accomplish index lookup, while what I'm doing here is one software instruction per gate (OK per 64 gates). There might be some advantage in the table values being precomputed, too.

(April 9 2008) Here's a permutation that can claim that the difference between its mapping for 0 and its mapping for a value with 1 bit set will always have at least 2 bits different. That's something I couldn't accomplish with LFSRs and Feistel networks. NAND depth 5.

5   4, 9,10,12,14   10010110011110011000011001101001
5   3, 4, 5, 8,10   01111001100101110010100110010010
5   0, 1, 3, 6,13   10100101010110101001011010010110
5   1, 7,10,14,15   10010110100101000110101101101001
5   3, 7,10,11,15   10010110011110011001011000101001
5   2,10,12,14,15   10010101011010101010011001011001
5   0, 1, 6, 9,11   10011001111101100110011010010000
5   1, 2, 3, 6,13   00111100110000111001011010010110
5   0, 1, 3, 9,13   10011001011001100111001010110001
5   2, 3, 4, 7,10   10010110100100100110110101101001
5   0, 6, 9,11,13   01110110110110011001100001100010
5   5, 8,11,14,15   00111100110000111010001100111010
5   0, 6, 9,11,13   11010001011101000011110011000011
5   0, 6, 9,11,13   10010100011000011001011101101101
5   0, 2, 6, 9,11   00011100110100111100000100111101
5   0, 2, 3, 5, 8   10111000010001110100011110111000

(April 10 2008) To verify these 16-bit permutations are permutations, I'm checking all 216 values. Suppose I wanted to build a 512-bit permutation, it wouldn't be feasible to check all 2512 values. But, suppose each output bit depends on at most 5 input bits. Does that help? I think so. If I want to change the function f(a,b,c,d,e) for the last bit to g(a,b,c,d,e), I just have to find all output bits that have a,b,c,d,e as their inputs and evaluate them. So I map a bit to its inputs, back to all their outputs, then back to their inputs, and I go through all values for all those input bits. I think if the result is balanced then it's still a permutation with g(a,b,c,d,e) replacing f(a,b,c,d,e). I think that shows a,b,c,d,e are still happy, and it can't break anything other than a,b,c,d,e.

If that selects fewer than all input bits, that would be faster than evaluating all input bits. Doing three mappings like that could depend on up to 89 bits if both directions had a branching of 5. But inputs to outputs have a variable branching, average 5 sure, but some inputs are used by many more than 5 outputs. You could easily get unlucky and need to evaluate all possible settings of over a hundred of bits. That's a large theoretical improvement over 2512 values, but even 289 isn't feasible in practice. So, that optimization looks like a dead end.

Perhaps if I calculate the inverse of the permutation, then I'd only have to do two mappings, from the output to be changed back to the inputs back to the outputs? That's up to 21 output bits with a constant branching of 5, which is close to feasible. So far I have no idea how to make that work. I suspect it can't be made to work.

(April 11 2008) fivebalanced.txt is a list of a few thousand of the best-mixing 5-deep balanced functions.


(May 3 2008) Some of the better-mixing 16-bit permutations I've seen are actually two 8-bit permutations followed by a shuffle of all 16 bits. My framework has 64 of the 16-bit chunks to permute, which maps nicely to a 64-bit integer for each of the 16 bits. If I limit the hash to a single 8-bit permutation repeated twice, then I can compute that 16-bit permutation using SSE2 instructions, 128 bits at a time. I'd still have to split it into 16 64-bit integers to shuffle the bits and do the per-bit rotates.

Next up, I need to write a utility to map my nand-tree representations of permutations to C code, so I can time a few thousand of these designs.

ISAAC, a cryptographic pseudorandom number generator
A small-state pseudorandom number generator for simulations
A noncryptographic software hash function for table lookup
Specialized hashes of a 32-bit integer
brute.c and gather.c, tools for breaking up to 5-bit RC4 starting in the middle of the stream
a mostly-useless tool that uses Gaussian Elimination to locate characteristics of block ciphers
Table of Contents