Testing for randomness



The important thing to remember when running randomness tests on numbers is this: You can't test too many numbers! All these test come a lot closer to the ideal result as our datapool increases to infinity. If you only test a small amount of numbers, your data could be anything. When we test numbers, we test thousands at a time. We don't care about instances, but the limit of those instances as you take an infinite amount of them.

There is test data available for download, which are text files of random integers between 0 and 255. Also available are c++ sources for all the programs we use to test randomness. The test results for the numbers NetRand produced are also here, you can see how they stack up against the UNIX random number generator by the c function srand(). There are 5 ways we have been checking our numbers so far:

1. Entropy -
The number of bytes it takes to represent your data when your information is optimally compressed. Compression removes patterns from your data. from your data. If you compress a 10 byte (80 bits) string into 53 bits, you are using 5.3 bits to represent each character instead of 8. For truly random numbers, it should take 8 bits to represent a byte. If you can represent a string of bytes with less, then you have some patterns in your numbers, which indicate non-randomness. Look into hamming codes for more information.
2. Chi Square Method
This is the most common test for randomness. Chi-square distribution calculated as an absolute number, then a percentage of times true random data would exceed that value. This percentage should be around 50 percent. This test is extremely sensitive to shifts in distribution, and if your numbers are not random, you can expect to see percentages that fall under 1% or higher than 99%.
3. Arithmetic Mean
This is the first test one would think of, the most obvious of them all. If you have 8 random bits, you get signed magnitude numbers between 0 and 255. The ideal average around 127.5. If the mean departs too far from this value, the results are consistently too high or too low.
4. Monte Carlo Value if PI
each successive 4 bytes are used as 16 bit x and y coordinates in a square. If the distance to the point is less than the radius of a circle inscribed in the square, it's a hit. The percentage of hits can be used to calculate Pi. The closer your value is to pi, the more random your data is.
5. Serial Correlation Coefficient (SCC)
Measures the extent towhich each byte of the data relies on the previous byte. SCC's of random sequences should approach zero, while very predictable data such as uncompressed bitmaps yield SCC's close to 1.



Send mail to netrand
Return to main page