(I feel I'm reinventing the wheel here -- tell me if I am! Also, where
can I put the C code for this? About 400 lines.)
This is a way to avoid computing probabilities by hand before running
tests on RNGs. It lets me make up tests where I don't know how to find
the probabilities. I can write new tests by putting bugs in old ones!
The method is simple:
Let X,Y be counts of independent events gathered from RNGs
such that X and Y should have the same probability distribution p_i,
and s=sum(x_i)=sum(y_i),
and all x_i > c for some small constant c (0..16 are good).
Let X be gathered from a control, and Y from the thing we want to test.
standard chi^2: sum((y_i-(sp_i))^2/(2sp_i)) is about n-1.
new method: sum((y_i- x_i )^2/(2 x_i)) is about n-1.
Justification:
Thm 1: sum( (y_i-x_i)^2 /(2sp_i)) has the same distribution as
sum(((y_i+x_i)-2sp_i)^2/(2sp_i)).
Thm 2: sum( (y_i-x_i)^2/(2 x_i)) has a distribution about the same as
sum( (y_i-x_i)^2/(2sp_i)) + sum(1/x_i)
Hence the new method produces results very close to the standard chi^2
distribution for X+Y.
Proof of thm 1:
Note that sum(x_i)=sum(y_i)=s=sum(sp_i).
sum((y_i-x_i)^2)
= sum((y_i+x_i)^2 - 4x_iy_i)
= sum((y_i+x_i)^2 - 4(sp_i)^2 - 4x_iy_i + 4(sp_i)^2)
= sum((y_i+x_i)^2 - 8(sp_i)^2 + 4(sp_i)^2 - 4x_iy_i + 4(sp_i)^2)
= sum((y_i+x_i)^2 - 4sp_i(y_i+x_i) + 4(sp_i)^2 - 4x_iy_i + 4(sp_i)^2)
= sum(((y_i+x_i)-2sp_i)^2 - 4x_iy_i + 4(sp_i)^2)
= sum(((y_i+x_i)-2sp_i)^2) because X and Y are independent.
Handwave at thm 2:
Let z,m,d,e be given such that z/(m+d) = z/m + e.
Then z+me = zm/(m+d),
and e = (zm - zm - zd)/m(m+d) = -zd/m(m+d).
Our problem has z=(y_i-x_i)^2, m+d=2x_i, and m=2sp_i.
For each m, d has a normal distribution with variance m.
sum((z/(m+d)) = sum(z/m) + sum(-zd/m(m+d)).
Set f=abs(d), we then have
sum(-zf/m(m+f)) + sum(zf/m(m-f))
== sum(-zf/m(m+f) + zf/m(m-f)) (both + and - occur equally often)
== sum((-zf(m-f) + zf(m+f))/m(m^2-f^2))
= sum(2zd^2/m(m^2-d^2)) (since d^2 = f^2)
== sum(2zd^2/m^3) (== means "is approximately")
== sum(2d^2/m^2) (since sum(z/m)==n-1 and z/m==1)
== sum(2/m) (since the variance of d is m)
== sum(1/x_i) (since 2x_i is about m)
Choose c such that x_i>c implies sum(1/x_i) is small compared to n-1.
All of this begs the question, what can be used as the control RNG.
Most RNGs can be scaled to have bigger or smaller internal states.
Big is usually much better than small. So find an RNG that does well
when small, scale it up to have a huge internal state, then only
test small RNGs against it.
- Bob Jenkins