*********************************************************************
NOTES ON MY WarcToPlainText.py SCRIPT:

I've used a third-party Python module called BeautifulSoup to help extract the text from the HTML files. By default BeautifulSoup uses Python's built-in HTML parser, but the built-in parser in Python versions before 2.7.3 isn't very good and doesn't really work with BeautifulSoup, so I've also installed a third-party HTML parser that BeautifulSoup can work with. There are two third-party HTML parsers BeautifulSoup can work with: "html5lib" and "lxml" (which is an XML and HTML parser). "html5lib" is easier to install but is extremely slow (I tested it). "lxml" has external C dependencies but is much faster, so that's the one I've used with BeautifulSoup. (There's a rough sketch of the extraction logic in the appendix at the end of these notes.)

To install the external modules and dependencies necessary to run my script WarcToPlainText.py you'll need to type the following commands in the Cloudera VM terminal:

//install hanzo warctools
sudo pip install warctools

//install beautiful soup
sudo pip install bs4

//install dependencies for lxml parser
sudo yum install libxml2-devel.x86_64 libxslt-devel.x86_64 python-devel.x86_64

//install lxml parser
sudo pip install lxml

*******************************************************************
INSERTING THE FILES INTO HDFS:

My script converts each valid WARC record into a text file and sets the name of each file to the WARC ID of the record. Each file name looks something like the following:

<urn:uuid:ffffcd89-97b3-49ef-aa68-225ee3cc3b8a>

HDFS doesn't like the characters '<', '>', or ':', so before I was able to insert my files I had to run some batch file-renaming commands to convert the file names to the following form:

urn-uuid-ffffcd89-97b3-49ef-aa68-225ee3cc3b8a

I could have changed my conversion script to output file names in this form, but I didn't want to rerun the process because it took a long time to extract the 45 thousand valid records in the WARC file. I ran the following commands inside the folder with all of my converted text files to batch-change their names:

rename '<' '' *
rename '>' '' *
rename ':' '-' *
rename ':' '-' *

(The ':' command runs twice because rename only replaces the first occurrence in each file name, and each ID contains two colons.) Certainly not the slickest way to do it, but it worked.

**********************************************************
RUNNING THE MAP-REDUCE JOB:

I copied the files into HDFS and ran the map-reduce job just like you did in the word count tutorial on your website. Running the job and copying the results back to the regular file system was also just the same as in the tutorial. (A sketch of a streaming mapper and reducer is in the appendix.) My map-reduce job outputs one big tab-separated file with the following form on each line:

word    uri    count

NOTE!!!: The map-reduce job took 19.5 hrs.

**********************************************************
IMPORTING THE RESULTS INTO SQLITE3:

The Cloudera VM comes with sqlite3! All you have to do is create the database, create the table, and use the import command. Here's the sequence of commands I used (an example query against the finished table is in the appendix):

//from the command line, to create the database
sqlite3 InvertedIndex.db

//from the sqlite3 command line interface, to create the table
CREATE TABLE InvertedIndex(word TEXT, uri TEXT, count INTEGER, PRIMARY KEY(word, uri));

//from the sqlite3 command line interface, to import the map-reduce results
.mode csv                                   //change to csv mode
.separator "\t"                             //change separator from ',' to '\t'
.import HeritrixOutput.csv InvertedIndex    //import the map-reduce results into the table

**********************************************************
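APPENDIX: CODE SKETCHES

Here's a rough sketch of what the conversion boils down to. This is NOT the exact WarcToPlainText.py, just the idea: it assumes the hanzo warctools reading API (WarcRecord.open_archive / read_records) and bs4 with the lxml parser, and it strips the HDFS-unfriendly characters from the file names up front instead of fixing them with rename afterwards:

import os
from hanzo.warctools import WarcRecord
from bs4 import BeautifulSoup

def warc_to_text(warc_path, out_dir):
    stream = WarcRecord.open_archive(warc_path, gzip="auto")
    for offset, record, errors in stream.read_records(limit=None):
        if record is None or errors:
            continue                             # skip invalid records
        if record.type != WarcRecord.RESPONSE:
            continue                             # only responses hold page content
        content_type, body = record.content
        # response records carry the raw HTTP message; the HTML body
        # starts after the blank line that ends the headers
        parts = body.split('\r\n\r\n', 1)
        html = parts[1] if len(parts) == 2 else parts[0]
        text = BeautifulSoup(html, 'lxml').get_text()
        # record.id looks like <urn:uuid:...>; HDFS rejects '<', '>', ':'
        name = record.id.replace('<', '').replace('>', '').replace(':', '-')
        with open(os.path.join(out_dir, name), 'w') as f:
            f.write(text.encode('utf-8'))
    stream.close()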
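The map-reduce job was a Hadoop Streaming job in the style of the word count tutorial. Again this is a sketch, not the exact job: it assumes Streaming exposes the current input file path to the mapper through the map_input_file environment variable (mapreduce_map_input_file on newer Hadoop versions), and that the job is run with -D stream.num.map.output.key.fields=2 so the shuffle sorts on both the word and the uri fields:

# mapper.py: emit (word, uri, 1) for every word in the input
import os
import sys

# the input file name is the sanitized WARC record ID, which stands in
# for the uri (the lower-casing is a choice made for this sketch)
input_file = os.environ.get('map_input_file',
                            os.environ.get('mapreduce_map_input_file', ''))
uri = os.path.basename(input_file)
for line in sys.stdin:
    for word in line.strip().split():
        print '%s\t%s\t1' % (word.lower(), uri)

# reducer.py: sum the 1s for each (word, uri) pair; identical pairs
# arrive on consecutive lines because the shuffle key is both fields
import sys

current, count = None, 0
for line in sys.stdin:
    word, uri, n = line.rstrip('\n').split('\t')
    if (word, uri) != current:
        if current is not None:
            print '%s\t%s\t%d' % (current[0], current[1], count)
        current, count = (word, uri), 0
    count += int(n)
if current is not None:
    print '%s\t%s\t%d' % (current[0], current[1], count)

Running it would look something like the following (the streaming jar path varies by Cloudera release, and the input/output paths here are made up):

//run the streaming job
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 \
    -input warctext -output invertedindex \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py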
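Once the import finishes, the table works as an inverted index. For example, to see which pages mention a given word the most (the word 'hadoop' here is just a stand-in), from the sqlite3 command line:

SELECT uri, count FROM InvertedIndex
WHERE word = 'hadoop'
ORDER BY count DESC
LIMIT 10;

**********************************************************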