Programming Project #3                       COSC 311
Distributed 2/14/2013                        Due: 3/14/2013


Are Binary Search Trees O(log n) for the delete operation?

This project will require you to demonstrate that binary search trees are
O(log n) for deletion.  

This will involve collecting experimental (timing) data for trees of
various sizes n (at least 5 different values for n), graphing the results
on graph paper, and attempting to fit a curve c * log(n).

The graphing and curve-fitting may be done in a separate step, using 
pen-and-paper or Excel or other tools.

There is one major wrench thrown in the works. That is, you will need
to demonstrate BSTs are O(log n) for deletion on both (1) array-based, *in-memory*
BSTs and on (2) array-based, *on-disk* BSTs.  The experimental data for each 
kind of implementation must be kept separate. The on-disk, external BST
*must* be implemented using random access files in Java!

Because internal (in-memory) implementations are so much faster than external
(hard disk) implementations, the choices of n will be different for each
implementation. In each case, you should have at least 5 different values for
n and they should cover a reasonable range (producing significant differences
in time).

To simplify the data collection, a small perturbation is allowed: after you
generate a random BST of size n, then delete 0.01 * n nodes (rather than just
1 node), then divide the total delete time by 0.01*n in order to get 
time/single-node.  So, if n = 5*10^4, then you will delete 5*10^2 nodes. The
average time to delete a single node from the tree is the total time / 500.

Caveats:
(1) Do NOT time anything except for actual deletions. If a value targeted for
deletion is not in the tree, then you must not count the time required to find
out that the value is not in the tree.
(2) Do target values *randomly*.  Do not delete nodes in the same order they
were inserted or in any other ordered fashion.
(3) The BSTs you initially build must be random BSTs. That is the data generated
is randomly obtained. In order to reduce the probabilities of targeting a value
for deletion that has not been inserted, then choose values for insertion from
a fairly small subset of possibles and allow for multiple nodes with the same
value in the tree.  This means, a given value will appear multiple times in
a single BST.
(4) The type of data should not have an effect on the timing. This means,
keep the data as simple as possible so that tests for <= and = are as fast
as possible. This means use type int.
(5) A BST stored on disk may not (for this experiment to be valid) be stored
as a stream file. It must be stored in a random access file.


Here is the basic experimental setup for each implementation:

Repeat for several different values of n
  (1) Generate random BST size n.
  (2) Repeat k times:
         generate random value
         delete value
  (2*) if the value to delete is not in the tree, do not count the time to search
       for it in the tree.

  (3) Calculate average time to delete single node for size n.
  
  
Turn in:
(1) Hard copy of internal BST implementation
(2) Hard copy of external BST implementation
(3) Screen shot of one run of internal delete
(4) Screen schot of one run of external delete
(5) Tabulated data from each implementation
(6) Graph of data with fitted curve
(7) One to two paragraph conclusion: do both implementations have O(log n) behavior?
Give a proposal for any oddities in the data.

Grade based on:
50% for internal BST
50% for external BST

Each of those are graded based on:
  meets specs (i.e., compiles and works properly
  satisfying a code walk-thru if requested
  elegance of programming solution
  readability (i.e., documentation and "look") of programming solution
  written conclusion