*********************************************************************
NOTES ON MY WarcToPlainText.py SCRIPT:

I've used a third-party Python module called BeautifulSoup to help extract the text from the HTML files. By default BeautifulSoup uses Python's built-in HTML parser, but the built-in parser in Python versions before 2.7.3 isn't very good and doesn't really work with BeautifulSoup, so I've also installed a third-party HTML parser that BeautifulSoup can work with. There are two third-party HTML parsers BeautifulSoup can work with: "html5lib" and "lxml" (which is an XML and HTML parser). "html5lib" is easier to install but is extremely slow (I tested it). "lxml" has external C dependencies but is much faster, so that's the one I've used with BeautifulSoup. (There's a rough sketch of the extraction logic in the appendix at the end of these notes.)

To install the external modules and dependencies necessary to run my script WarcToPlainText.py you'll need to type the following commands in the Cloudera VM terminal:

//install hanzo warctools
sudo pip install warctools

//install beautiful soup
sudo pip install bs4

//install dependencies for lxml parser
sudo yum install libxml2-devel.x86_64 libxslt-devel.x86_64 python-devel.x86_64

//install lxml parser
sudo pip install lxml

*******************************************************************
INSERTING THE FILES INTO HDFS:

My script converts each valid WARC record into a text file and sets the name of each file to the WARC ID of the record. Each file name looks something like the following:

<urn:uuid:ffffcd89-97b3-49ef-aa68-225ee3cc3b8a>

HDFS doesn't like the characters '<', '>', or ':', so before I was able to insert my files I had to run some batch file-renaming commands to convert the file names to the following form:

urn-uuid-ffffcd89-97b3-49ef-aa68-225ee3cc3b8a

I could have changed my conversion script to output file names in this form, but I didn't want to rerun the process because it took a long time to extract the 45 thousand valid records in the WARC file. I ran the following commands inside the folder with all of my converted text files to batch-change their names:

rename '<' '' *
rename '>' '' *
rename ':' '-' *
rename ':' '-' *

(The ':' command runs twice because rename only replaces the first occurrence in each file name, and each ID contains two colons.) Certainly not the slickest way to do it, but it worked.

**********************************************************
RUNNING THE MAP-REDUCE JOB:

I copied the files into HDFS and ran the map-reduce job just like you did in the word count tutorial on your website. Running the job and copying the results back to the regular file system was also just the same as in the tutorial. (A sketch of a streaming mapper and reducer is in the appendix.) My map-reduce job outputs one big tab-separated file with the following form on each line:

word    uri    count

NOTE!!!: The map-reduce job took 19.5 hrs.

**********************************************************
IMPORTING THE RESULTS INTO SQLITE3:

The Cloudera VM comes with sqlite3! All you have to do is create the database, create the table, and use the import command. Here's the sequence of commands I used (an example query against the finished table is in the appendix):

//from the command line, to create the database
sqlite3 InvertedIndex.db

//from the sqlite3 command line interface, to create the table
CREATE TABLE InvertedIndex(word TEXT, uri TEXT, count INTEGER, PRIMARY KEY(word, uri));

//from the sqlite3 command line interface, to import the map-reduce results
.mode csv                                   //change to csv mode
.separator "\t"                             //change separator from ',' to '\t'
.import HeritrixOutput.csv InvertedIndex    //import the map-reduce results into the table

**********************************************************
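APPENDIX: CODE SKETCHES

Here's a rough sketch of what the conversion boils down to. This is NOT the exact WarcToPlainText.py, just the idea: it assumes the hanzo warctools reading API (WarcRecord.open_archive / read_records) and bs4 with the lxml parser, and it strips the HDFS-unfriendly characters from the file names up front instead of fixing them with rename afterwards:

import os
from hanzo.warctools import WarcRecord
from bs4 import BeautifulSoup

def warc_to_text(warc_path, out_dir):
    stream = WarcRecord.open_archive(warc_path, gzip="auto")
    for offset, record, errors in stream.read_records(limit=None):
        if record is None or errors:
            continue                             # skip invalid records
        if record.type != WarcRecord.RESPONSE:
            continue                             # only responses hold page content
        content_type, body = record.content
        # response records carry the raw HTTP message; the HTML body
        # starts after the blank line that ends the headers
        parts = body.split('\r\n\r\n', 1)
        html = parts[1] if len(parts) == 2 else parts[0]
        text = BeautifulSoup(html, 'lxml').get_text()
        # record.id looks like <urn:uuid:...>; HDFS rejects '<', '>', ':'
        name = record.id.replace('<', '').replace('>', '').replace(':', '-')
        with open(os.path.join(out_dir, name), 'w') as f:
            f.write(text.encode('utf-8'))
    stream.close()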
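The map-reduce job was a Hadoop Streaming job in the style of the word count tutorial. Again this is a sketch, not the exact job: it assumes Streaming exposes the current input file path to the mapper through the map_input_file environment variable (mapreduce_map_input_file on newer Hadoop versions), and that the job is run with -D stream.num.map.output.key.fields=2 so the shuffle sorts on both the word and the uri fields:

# mapper.py: emit (word, uri, 1) for every word in the input
import os
import sys

# the input file name is the sanitized WARC record ID, which stands in
# for the uri (the lower-casing is a choice made for this sketch)
input_file = os.environ.get('map_input_file',
                            os.environ.get('mapreduce_map_input_file', ''))
uri = os.path.basename(input_file)
for line in sys.stdin:
    for word in line.strip().split():
        print '%s\t%s\t1' % (word.lower(), uri)

# reducer.py: sum the 1s for each (word, uri) pair; identical pairs
# arrive on consecutive lines because the shuffle key is both fields
import sys

current, count = None, 0
for line in sys.stdin:
    word, uri, n = line.rstrip('\n').split('\t')
    if (word, uri) != current:
        if current is not None:
            print '%s\t%s\t%d' % (current[0], current[1], count)
        current, count = (word, uri), 0
    count += int(n)
if current is not None:
    print '%s\t%s\t%d' % (current[0], current[1], count)

Running it would look something like the following (the streaming jar path varies by Cloudera release, and the input/output paths here are made up):

//run the streaming job
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 \
    -input warctext -output invertedindex \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py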
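Once the import finishes, the table works as an inverted index. For example, to see which pages mention a given word the most (the word 'hadoop' here is just a stand-in), from the sqlite3 command line:

SELECT uri, count FROM InvertedIndex
WHERE word = 'hadoop'
ORDER BY count DESC
LIMIT 10;

**********************************************************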