COSC 311 Programming Project #4
Random Writer*

Distributed: 3/26/2013
Due: 4/9/2013

You will be creating text that is generated by a simple probabilistic model.

Input: a source file of plain-text words, a user-chosen seed, k, 1 <= k <= 3
    k = 1 ==> unigram
    k = 2 ==> bigram
    k = 3 ==> trigram

Output: 25 words of generated text. (You can generate more text, if you want.)

Algorithm:
    Pick a random k-gram from the text -- call the k-gram the 'seed'
    Repeat 25 times {
        Do while (don't have a new histogram) {
            Make a histogram of every single word (unigram) that follows the seed
            If no single word follows the seed, then the histogram is empty: pick a new random seed
        }
        Randomly pick a word, called 'word', from the histogram, using the probabilities calculated in the previous step
        Output 'word'
        Remove the first word from the 'seed', then append 'word' to 'seed'
    }

A 'word' is any sequence of alphabetic characters, single quotes, and hyphens: [a..zA..Z'-]
A 'word' is delimited in the text by any other character.

Sample run (pen and paper)

Example source file:
    Peter Piper picked a peck of pickled peppers, a peck of pickled peppers Peter Piper picked, if Peter Piper picked a peck of pickled peppers, where's the peck of pickled peppers Peter Piper picked?

Suppose k = 2 (user choice)

First pass in loop:
    The random bigram is seed = "pickled peppers"
    Histogram: ["a", "Peter", "where's"], with probabilities: [.25, .50, .25]
    Randomly pick a word from Histogram (using probabilities [.25, .50, .25]) ==> "Peter"
    Output "Peter"
    seed = "peppers Peter"

Second pass in loop:
    Suppose, to illustrate the empty-histogram case, that no word followed "peppers Peter" in the source file. Then randomly pick another seed from the source file: "a peck".
    Histogram: ["of"], with probability: [1.0]
    Randomly pick a word from Histogram (using probabilities) ==> "of"
    Output "of"
    seed = "peck of"

Runs:
    Find two different source files (English) from different genres or eras (keep it clean). E.g., technical and mystery, 1800s and 2000s.
    Go to gutenberg.org for texts that can be downloaded in plain text. (Remove boilerplate at beginning and end.)
    Generate text for each source three times (six generated texts all together): k = 1, k = 2, k = 3

Turn in:
    Source code
    Six sample outputs (the source files and k should be identified)

Grade based on:
    Meets specs
    Code quality
    Meets coding standards

* modified from http://nifty.stanford.edu/2003/randomwriter/
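
The Algorithm section above can be sketched in Python as follows. This is only an illustrative sketch, not the required solution or the course's reference implementation. The function names (tokenize, followers, random_seed, generate) are hypothetical, matching is assumed to be case-sensitive, and the word pattern takes the character class [a..zA..Z'-] literally, hyphen included. The optional rng parameter is an added convenience so that runs can be repeated.

```python
import random
import re
from collections import Counter

# Word characters as listed in the handout; everything else is a delimiter.
WORD_RE = re.compile(r"[a-zA-Z'-]+")


def tokenize(text):
    """Return the list of words in text, in order."""
    return WORD_RE.findall(text)


def followers(words, seed):
    """Histogram of the single words that directly follow seed (a tuple)."""
    k = len(seed)
    hist = Counter()
    for i in range(len(words) - k):
        if tuple(words[i:i + k]) == seed:
            hist[words[i + k]] += 1
    return hist


def random_seed(words, k, rng):
    """Pick a random k-gram from the text."""
    start = rng.randrange(len(words) - k + 1)
    return tuple(words[start:start + k])


def generate(words, k, n=25, rng=None):
    """Generate n words using k-gram seeds (assumes len(words) > k)."""
    if len(words) <= k:
        raise ValueError("source text is too short for this k")
    rng = rng or random.Random()
    seed = random_seed(words, k, rng)
    output = []
    for _ in range(n):
        hist = followers(words, seed)
        while not hist:
            # Empty histogram: nothing follows this seed, so re-seed.
            seed = random_seed(words, k, rng)
            hist = followers(words, seed)
        # Weighted pick: counts act as unnormalized probabilities.
        word = rng.choices(list(hist), weights=list(hist.values()))[0]
        output.append(word)
        # Drop the first word of the seed, then append the new word.
        seed = seed[1:] + (word,)
    return output
```

For the Peter Piper example, followers(tokenize(text), ("pickled", "peppers")) counts "a" once, "Peter" twice, and "where's" once, which corresponds to the probabilities [.25, .50, .25] in the sample run. Rescanning the whole text for every generated word is slow on long sources; a lookup table built once from k-grams to follower counts would be one alternative design.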