Probability Distributions for Beginning* Computer Science Classes

Draft 1.0

User guide to built-in random number generators
Your basic function for generating random numbers that is "built-in" to Java, C++, C generates a sequence of pseudo-random numbers. The system supplied random number generator (abbreviated random() starting now) should be regarded with a great deal of suspicion. For one thing, the sequence cycles -- it repeats when random() is called often enough. For another thing, the numbers themselves do not usually show the amount of randomness one really needs.

A typical implementation of random() uses this recurrence relation:
ij+1 = a * ij + c
That is, each number in the sequence is computed from a simple arithmetic expression using the previous number in the sequence. a and c are constants that are local to random().

The sequence is initialized with a seed (i0). The computed ij+1 is both returned to the caller and kept as a local static variable that is used on the subsequent call to random().

It is a very good idea when debugging to initialize the sequence with some fixed seed. That way, the random number sequences generated will be the same over all the program runs -- much easier to debug that way! After the program is debugged, you'll want to start the sequence with different seeds. The most common way to do that is to use some number related to the system clock (often, just a call to the system clock will do).

random() produces a uniform distribution
A uniform distribution has a flat histogram: all numbers generated are equally likely. One quick (but crude) test of random()'s quality is to build a histogram by calling it a gazillion times and incrementing the count for each number returned from random().

Scaling and translating
Typical beginning computer science problems calling for random numbers involve rolling a die or pulling a card out of a deck. In the case of the die, the random number you want will range between 1 and 6 to correspond to the pips on the different die faces. The random number generator you're using may supply ints between 0 and RAND_MAX (the maximum int value generated); or it may supply a double between 0 and 1; or it may supply an int between 0 and a user supplied k.

In any case, you'll almost certainly need to "massage" the number supplied by random() so that it fits in your desired range. The way to do this is to scale and translate. Suppose we want to work with a die (numbers ranging between 1 and 6). To scale, you divide the random number obtained by the range of values over which random() runs. Suppose random() returns n, an int number that range between 0 and RAND_MAX. Scale n by dividing by RAND_MAX + 1 and multiplying by 6. Watch out for integer division! Scaling with 6/(RAND_MAX+1) changes the range of [0 .. RAND_MAX] to [0 .. 5 ]. Now translate by adding 1 to get your desired range of [1..6] .

Gaussian distribution
random() produces a uniform distribution. Very often, you'll be wanting to generate random data that has a Gaussian (normal, bell-shaped curve) distribution. How do you get from uniform to Gaussian? Details can be found in Numerical Recipes (see References). But basically, you'll call a different function (called Gaussian() here) that uses random(). I'm thinking I got this code from Numerical Recipes, but I'm not sure now (and watch out for errors! I haven't checked it in years). This is C code for a function that returns Gaussian distributed random numbers and an example driver (main()). The random number generator supplied with C is called rand().

#include < math.h >

double Gaussian() {

static int iset = 0;
static float gset;
double fac, rsq, v1, v2;

if (iset == 0) {
  do {
    v1 = 2.0 * rand()/ (double) RAND_MAX - 1.0;
    v2 = 2.0 * rand()/ (double) RAND_MAX - 1.0;
    rsq = (v1 * v1) + (v2 * v2);
    } while (rsq >= 1.0 || rsq == 0.0);
  fac = sqrt ( -2.0 * log (rsq) / rsq);
  gset = v1 * fac;
  iset = 1;
  return v2 * fac;
  }
else {
  iset = 0;
  return gset;
  }

}

void main () {
 int	i;
 double result;
 printf("Calling Gaussian 10 times\n");

 for (i=0; i< 10; i++) {
   result = Gaussian();
   printf(">\t%f\n", result);
   }

 printf("Execution completed\n");
 }

Like "massaging" random() to get the right range, you'll need to massage the results of Gaussian to get the right mean and standard deviation. Gaussian() returns a sequence with mean 0.0, and standard deviation 1.0.

Poisson distribution
Most people have a good intuitive understanding of the Gaussian distribution, as well as for uniform distribution. However, for queueing problems you need the Poisson distribution. The Poisson distribution does the job for modelling random arrivals (in time) at the queue, when the arrivals are independent of each other.

Each number in a sequence of Poisson distributed numbers says how many arrivals there have been during the last time interval.
Suppose you obtain the Poisson distributed sequence: 0, 0, 1, 0, 3, 2, 0, 0, ...
The meaning of this is that during time interval t0, there were 0 arrivals. During time interval t1, also 0 arrivals. During t2 there was 1 arrival. During t3, 0. During t4, 3. Et cetera.

#include < stdlib.h > 
#include < math.h >
 
int Poisson ( double ev ) {
      int         n = 0;      // counter of iterations 
      double      em;         // e^(-ev), where v is the expected value
      double      x;          // pseudorandom number
 
      em = exp (-ev);
      x = rand() / (double) RAND_MAX;     // check your C compiler docs
                                          // for the correct constant name
      while (x > em) { 
            n++;
            x *=  rand() / (double) RAND_MAX;
            }
      return n;
      } 

main () {
  int i;
  for (i = 0; i< 1000; i++) {
    printf("new Poisson value: %d\n", Poisson(.133333) );
    }
  }

In order to use Poisson(), you need to know ev, the "load factor". This is the average number of arrivals per unit of time. Suppose, for example, the unit of time we are simulating is seconds. From empirical evidence, we also know that there are, on the average, 75 arrivals per hour, that is 75 arrivals in 60 seconds, that is 1.25 arrivals per second. ev, then, is 1.25.

The way this Poisson() is implemented, you can have interleaved calls to it with different values of ev.

Note also, that this Poisson handles the "massaging" differently. The only parameter necessary to get the proper Poisson distribution is the load factor, ev, and that is supplied as a parameter to Poisson(). On the other hand, random(), giving the uniform distribution, supplies numbers that you have to scale and translate to fit into a desired range; Gaussian() also must be scaled and translated.



* "beginning" means freshman or sophomore level, with no math background worth mentioning.

Written up by Susan Haynes

with my stunned admiration of Mssr Press, Flannery, Teukolsky and Vetterling.

Please send corrections, comments to s h a y n e s @ e m i c h . e d u
Use at your own risk. Please cite properly.