## A small noncryptographic PRNG

This is a small, fast pseudorandom number generator, suitable for large statistical calculations but not of cryptographic quality. Although there is no guaranteed minimum cycle length, the average cycle length is expected to be about 2^126 results.

#### The fastest small unbiased noncryptographic PRNG that I could find (in C)

```c
typedef unsigned long int u4;   /* u4 must be exactly 32 bits */
typedef struct ranctx { u4 a; u4 b; u4 c; u4 d; } ranctx;

#define rot(x,k) (((x)<<(k))|((x)>>(32-(k))))

u4 ranval( ranctx *x ) {
    u4 e = x->a - rot(x->b, 27);
    x->a = x->b ^ rot(x->c, 17);
    x->b = x->c + x->d;
    x->c = x->d + e;
    x->d = e + x->a;
    return x->d;
}

void raninit( ranctx *x, u4 seed ) {
    u4 i;
    x->a = 0xf1ea5eed, x->b = x->c = x->d = seed;
    for (i=0; i<20; ++i) {
        (void)ranval(x);
    }
}
```

I have not found any test in any configuration that it does not pass. It passes DIEHARD. Sampling just the bottom byte, it passes the run test (both up and down), and the gap test (for gaps up to length 32) for 2 trillion values (chi.c). The frequency test (again on the bottom byte) looked suspicious at 1 trillion values (chi-square=5), so I ran it for 4 trillion values (chi-square=0.42), showing the earlier result was a fluke (freq.c). A test that counts bits per value for five consecutive values (countx.c) passes for at least 16 trillion values, both for normal and for graycoded values. It passes Geronimo Jones' nda test for at least 1 trillion values (4 trillion bytes).

The strongest test for small-state generators I've run across is Bob Blorp2's bitcounting test (countx.c). Near as I can tell, it indirectly measures how much avalanche happens by the time entropy is first recycled; for this generator that's 5 results. I can measure that directly too: for every bit of the internal state, start with two copies of a random state that differ in just that bit, run the generator for 5 results, and XOR the two 5th results. Complete avalanche would give an average of 16 differing bits. The program rngav.c measures this: for the PRNG above, at least 8.8 bits are affected on average. I used the shifts (27, 17). Others that achieve 8.8 bits of avalanche include (9,16), (9,24), (10,16), (10,24), (11,16), (11,24), (25,8), (25,16), (26,8), (26,16), (26,17), and (27,16). Avalanche is much better in reverse, but the code is also slower in reverse because of more read-after-write dependencies. It is possible to not fully achieve avalanche yet still pass all practical tests for randomness, because nearly identical states are very rare among the set of all possible states.

If raninit(seed) is used (one of those 2^32 seeds), the shortest cycle is expected to be about 2^94 (because 126-32 = 94), and it is expected that about one seed will run into another seed within 2^64 values (because 128-64-2*32 = 0). That's probabilistic; the actual shortest cycle or collision could be better or worse than that. I've tested each seed from raninit() to certain lengths to guarantee it does not cycle and does not hit any other seed. (You do that by starting a=0xf1ea5eed and b=c=d=seed, then checking the state after every value for b==c and b==d and a==0xf1ea5eed.) How far I tested is limited by computing resources on hand. I tested seed 0 (the first 2^0 seeds) for 2^50 values, seeds 0..31 (the first 2^5 seeds) for 2^46 values, the first 2^10 seeds for 2^40 values, the first 2^21 seeds for 2^32 values, and all 2^32 seeds for 2^20 values. No cycles or overlaps were found.

I wrote this PRNG. I place it in the public domain. Same goes for at least the implementation of all those tests linked to above.

I suspect that all you need for a statistically good pseudorandom number generator is:

• A reversible, nonlinear function where all internal state bits affect one another
• A big enough state to give you a long cycle length
• Sufficiently fast mixing, when run either forward or backward

If that's all there is to it, designing statistically good pseudorandom number generators is pretty much a solved problem. You can tell by inspection whether a generator has a big enough state and uses a reversible, nonlinear function in which all the state affects all the state, with no piece left unaffected by some other piece. rngav.c can very quickly tell you how fast it mixes. How big an internal state is needed? 96 bits is cutting it too close; 128 bits seems good enough. How fast does mixing have to be? 4 out of 32 bits of avalanche before recycling entropy isn't good enough. Is 8.8 bits good enough? Probably. Though I could be wrong.

(April 2009) Elias Yarrkov used a black-box solver to find some fixed points of this generator (which produce cycles of length 1):

```c
{ 0x00000000, 0x00000000, 0x00000000, 0x00000000 },
{ 0x77777777, 0x55555555, 0x11111111, 0x44444444 },
{ 0x5591F2E3, 0x69EBA6CD, 0x2A171E3D, 0x3FD48890 },
{ 0x47CB8D56, 0xAE9B35A7, 0x5C78F4A8, 0x522240FF }
```

(January 2016) David Blackman found two more, and claims that these six solutions are all that there are. I have not verified.

```c
{ 0x71AAC8F9, 0x66B4F5D3, 0x1E950B8F, 0x481FEA44 },
{ 0xAB23E5C6, 0xD3D74D9A, 0x542E3C7A, 0x7FA91120 }
```
These states are impossible to reach if the state is initialized with raninit(): the generator is reversible, so a fixed point's only predecessor is itself, and none of these states has a == 0xf1ea5eed, so raninit() can never start in one.

#### A third rotate would have improved avalanche

A third rotate could be introduced into ranval() without increasing the maximum parallel path through the code. This came out significantly slower with the Microsoft Visual Studio compiler, though. (MS VS uses pointer arithmetic for the additions, to trick Intel's compiler into using 3-variable instructions instead of 2-variable. The extra rotate thwarted that, thus the speed hit.) A third rotate could increase minimum avalanche to 13 bits, on average, like so:

```c
u4 ranval( ranctx *x ) {
    u4 e = x->a - rot(x->b, 23);
    x->a = x->b ^ rot(x->c, 16);
    x->b = x->c + rot(x->d, 11);
    x->c = x->d + e;
    x->d = e + x->a;
    return x->d;
}
```

Shift amounts achieving 13 bits of avalanche include (3,14,24), (3,25,15), (4,15,24), (6,16,28), (7,16,27), (8,14,3), (11,16,23), (12,16,22), (12,17,23), (13,16,22), (15,25,3), (16,9,3), (17,9,3), (17,27,7), (19,7,3), (23,15,11), (23,16,11), (23,17,11), (24,3,16), (24,4,16), (25,14,3), (27,16,6), (27,16,7). I imagine these are significantly more pseudorandom than the PRNG at the top. However, since they run slower and no test I know of can detect nonrandomness in either, I'm recommending the 2-rotate version.

#### Timings (June 2009)

My original timings had one file containing both the generator and a tight loop testing the generator. The compiler inlined the generators into the tight loop, keeping the state in registers. On the register-starved Intel platform, that would never happen for real uses of a generator: whatever the random numbers are used for would kick the generator's state out of registers.

Another approach is to use the generator to fill a large array, then read values out of that array one by one, as with this code in a .h file:

```c
static inline u4 ranval(random_context *r) {
    if (r->count >= SIZE) {
        fill(r);
        r->count = 0;
    }
    return r->result[r->count++];
}
```

The generator would then have a loop to produce SIZE=256 values, and it would be normal even in real use to keep the generator state in registers while the array is being filled.

Here are some new timings of generators producing 20000*65536 results each, variously inlined, used with a 256-item array, or called as a routine that calculates a single value. Timed with cygwin gcc 4.3.2 -O3 on a 1.86GHz Intel 6300.

• 2.52 seconds: Knuth's delayed fibonacci, 256-item array, inlined in the tight loop.
• 3.64 seconds: Knuth's delayed fibonacci, separate routine, fill and exhaust a 256-item array.
• 4.31 seconds: My 2-rotate small prng, inlined in the tight loop.
• 4.69 seconds: My 3-rotate small prng, inlined in the tight loop.
• 5.75 seconds: My 2-rotate small prng, separate routine, fill a 256-item array then exhaust the values.
• 6.58 seconds: Knuth's delayed fibonacci, separate routine, called directly (it calculates a new value per call, 1024 bytes total state).
• 6.72 seconds: My 3-rotate small prng, separate routine, fill a 256-item array then exhaust the values.
• 8.22 seconds: ISAAC, my cryptographic random number generator, which uses a 256-item array.
• 9.55 seconds: My 2-rotate small prng, separate routine, called directly (it calculates a new value per call, 16 bytes total state).
• 10.23 seconds: My 3-rotate small prng, separate routine, called directly (it calculates a new value per call, 16 bytes total state).
• 11.90 seconds: the Mersenne Twister, used as directed.

#### 64-bit variants

If you want to use 8-byte terms instead of 4-byte terms, the proper rotate amounts are (39,11) for the two-rotate version (yielding at least 13.3 bits of avalanche after 5 rounds) or (7,13,37) for the three-rotate version (yielding 18.4 bits of avalanche after 5 rounds). I think I'd go with the three-rotate version, because the ideal is 32 bits of avalanche, and 13.3 isn't even half of that.

```c
typedef unsigned long long u8;
typedef struct ranctx { u8 a; u8 b; u8 c; u8 d; } ranctx;

#define rot(x,k) (((x)<<(k))|((x)>>(64-(k))))

u8 ranval( ranctx *x ) {
    u8 e = x->a - rot(x->b, 7);
    x->a = x->b ^ rot(x->c, 13);
    x->b = x->c + rot(x->d, 37);
    x->c = x->d + e;
    x->d = e + x->a;
    return x->d;
}

void raninit( ranctx *x, u8 seed ) {
    u8 i;
    x->a = 0xf1ea5eed, x->b = x->c = x->d = seed;
    for (i=0; i<20; ++i) {
        (void)ranval(x);
    }
}
```

I don't think that there's an 8-byte rotate instruction on any 64-bit platform. And you only need 2 terms to get to 128 bits of internal state if you have 64-bit terms. Quite likely 64-bit deserves a whole different approach, not just different constants.

Dale Weiler implemented this PRNG as a compile-time constant.

#### An unlikely-to-be-secure extension of this PRNG

The small-state version isn't suitable for encryption. However, if x->b is replaced with an array and e is used for lookup in the array, it becomes something that is possibly of cryptographic strength:

```c
typedef unsigned long int u4;
typedef struct ranctx { u4 a; u4 b[256]; u4 c; u4 d; } ranctx;

#define rot(x,k) (((x)<<(k))|((x)>>(32-(k))))

void ranxor( ranctx *x, u4 m[256] ) {
    u4 i;
    for (i=0; i<256; ++i) {
        u4 e = x->a - rot(x->b[i], 27);
        x->a = x->b[e%256] ^ rot(x->c, 17);
        x->b[e%256] = x->c + x->d;
        x->c = x->d + e;
        x->d = e + x->a;
        m[i] ^= x->d;
    }
}

void raninit( ranctx *x, u4 seed[128] ) {
    u4 i;
    u4 junk[256];
    x->a = x->c = x->d = 0xf1ea5eed;
    for (i=0; i<128; ++i) {
        x->b[i] = x->b[i+128] = seed[i];
    }
    for (i=0; i<10; ++i) {
        ranxor(x, junk);
    }
}
```

The routine ranxor() takes a message m[] consisting of 256 4-byte values and XORs 256 pseudorandom 4-byte values with it.

I strongly recommend against using this for any cryptographic purpose. If it is secure, it is just barely so. I don't think it's even particularly fast. The only value I see in it is as a target for cryptanalysis and possible inspiration for future stream ciphers.

Although lack of bias comes from the same good mixing as the original generator, the security comes more from the pseudorandom lookups than from the mixing. The most blatant weakness is that there is one pseudorandom lookup per result returned, leaving a system with n+256 variables and n equations after seeing n results (counting pseudorandom lookups as separate variables from lookups at known positions). If you can make progress guessing or linking the variables, the whole thing should be solvable. Another weakness (and the most curious part of this generator) is that the positions to be updated in x->b[] are selected pseudorandomly rather than in a regular order. That means each pass leaves about 1/e of the positions in x->b[] unchanged, which means you can reduce the number of independent variables just by guessing which positions don't get changed between rounds.

I still don't know how to break it, but I haven't tried very hard.