COSC 311 Programming Project #4

Random Writer*

Distributed 3/26/2013      Due: 4/9/2013

You will be creating text that is generated by a simple probabilistic model.

Input: a source file of plain text words,
	a user-chosen k-gram size, k, 1 <= k <= 3
		k = 1 ==> unigram
		k = 2 ==> bigram
		k = 3 ==> trigram
	
Output: 25 words of generated text. (You can generate more text if you want.)


Algorithm:
  Pick a random k-gram from the text -- call the k-gram the 'seed'
  Repeat 25 times {
  	 Do while (don't have a new histogram) {
  	 	Make a histogram of every single word (unigram) that follows the seed
  	 	If no word follows the seed (e.g., the seed occurs only at the very end of the text),
  	 	the histogram is empty; pick a new random seed
  	 	}
  	 Randomly pick a word, called 'word', from the histogram, using the probabilities
  	 	calculated in previous step
  	 Output 'word'
  	 Remove the first word from the 'seed', then append 'word' to 'seed'
  	 }
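
The loop above could be sketched in Python as follows. This is only one possible
reading of the algorithm, not a required design: the names followers, random_seed,
and generate are made up for illustration, the text is assumed to be already split
into a list of words, and the list is assumed to hold more than k words.

```python
import random
from collections import Counter


def followers(words, seed):
    """Histogram of every word that immediately follows an occurrence of seed."""
    k = len(seed)
    hist = Counter()
    for i in range(len(words) - k):
        if tuple(words[i:i + k]) == seed:
            hist[words[i + k]] += 1
    return hist


def random_seed(words, k, rng):
    """Pick the k-gram that starts at a uniformly random position in the text."""
    start = rng.randrange(len(words) - k + 1)
    return tuple(words[start:start + k])


def generate(words, k, count=25, rng=random):
    """Return a list of count generated words."""
    if len(words) <= k:
        raise ValueError("need more than k words in the source text")
    seed = random_seed(words, k, rng)
    output = []
    for _ in range(count):
        hist = followers(words, seed)
        while not hist:  # nothing follows this seed: pick a new random seed
            seed = random_seed(words, k, rng)
            hist = followers(words, seed)
        candidates = list(hist)
        # Counts are used as weights, so a word seen twice is twice as likely.
        word = rng.choices(candidates, weights=[hist[w] for w in candidates])[0]
        output.append(word)
        seed = seed[1:] + (word,)  # drop the first word, append the new one
    return output
```

Passing rng=random.Random(some_number) makes a run repeatable, which may help when
debugging. Rescanning the whole text for every output word is slow on long sources;
building a table of followers once is a reasonable alternative.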
  	 
A 'word' is any sequence of alphabetic characters, single quotes, and hyphens: [a..zA..Z'-]
A 'word' is delimited in the text by any other character
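
A minimal sketch of this word definition using a regular expression. The name
tokenize is hypothetical, and keeping the original capitalization is an assumption,
since the spec does not say whether case should be folded.

```python
import re

# One or more letters, single quotes, or hyphens; anything else is a delimiter.
WORD = re.compile(r"[a-zA-Z'-]+")


def tokenize(text):
    """Return the words of text in order, with capitalization unchanged."""
    return WORD.findall(text)
```

For example, tokenize("peppers, where's the peck?") gives
["peppers", "where's", "the", "peck"].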

Sample run (pen and paper)
Example source file:
Peter Piper picked a peck of pickled peppers, a peck of pickled peppers Peter Piper picked, 
if Peter Piper picked a peck of pickled peppers, where's the peck of pickled peppers Peter Piper picked? 	 
  	 
Suppose k = 2 (user choice)

First pass in loop:
	The random bigram is seed = "pickled peppers"

	Histogram: ["a", "Peter", "where's"], with probabilities: [.25, .50, .25]

	Randomly pick a word from Histogram (using probabilities [.25, .5, .25])
  ==> "Peter"

	Output "Peter"

	seed = "peppers Peter"

Second pass in loop:
  	"peppers Peter" appears twice in the source file, each time followed by "Piper".
  	(Had no word followed the seed, a new random seed would be picked from the
  	source file, as in the algorithm above.)
   
  	Histogram: ["Piper"], with probability: [1.0]
  	Randomly pick a word from Histogram (using probabilities) ==> "Piper"
  
  	Output "Piper"
  
  	seed = "Peter Piper"
  	
  	
Runs:
  Find two different source files (English) from different genres or eras 
  (keep it clean). E.g., technical and mystery, 1800s and 2000s.
  Go to gutenberg.org for texts that can be downloaded in plain text.
  	(Remove boilerplate at beginning and end)
  
  Generate text for each source three times, once for each of k = 1, k = 2, and k = 3
  (six generated texts altogether).
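
Removing the boilerplate by hand in a text editor is fine. If you would rather do it
in code, here is a hedged sketch. It assumes the downloaded file marks the book's
body with lines beginning "*** START OF" and "*** END OF"; check your file, since
not every text uses those exact markers. The name strip_boilerplate is made up for
illustration.

```python
def strip_boilerplate(text):
    """Keep only the lines between the START and END marker lines, if both exist."""
    lines = text.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.startswith("*** START OF")),
        None,
    )
    end = next(
        (i for i, line in enumerate(lines) if line.startswith("*** END OF")),
        None,
    )
    if start is None or end is None or end <= start:
        return text  # markers not found: leave the text untouched
    return "\n".join(lines[start + 1:end])
```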

Turn in:
	Source code
	Six sample outputs (identify the source file and k for each)
	
Grade based on:
	Meets specs
	Code quality
	Meets coding standards
	
* modified from http://nifty.stanford.edu/2003/randomwriter/