COSC 472/572 Big Data
Assignment 4 - Create an Inverted Index
Due: Monday April 13, 2015
This is a paired assignment. You must have one (and only one partner). You must submit a paper with both partners names on it by Monday March 9.
Inverted indexes are one of the main data structures employed by a
search engine. Basically, an inverted index has as a key value a
specific word (e.g. "cat"). It will then contain every files (html
document) that contains the word "cat". A more elaborate version of an
inverted index would contain a file name and word location. So for
example, we might have something like the following:
cat-
a.html , carnivores.html , jungle_animals.html , ............
The more elaborate model would look like:
cat-
a.html (3,7,103) , carnivores.html (10,23,44,56,1234) , etc
In the second instance, the list indicates the position of the word "cat" within the file.
Your assignment is to create an inverted index. This will be done in several steps:
1) Use heritrix to crawl the web. You must collect at least a gigabyte of data.
This will take a long time (10-20 hours) . Instructions on using
heritrix , including an excellent video done by graduate student Cade
Sperlich, can be found below:
Wikipedia - Inverted Indexes
Heritrix home page
Cade Sperlich's excellent Heritrix web crawl video
2) Heritrix creates .warc files . ".warc" stands for "web archive
file". But we will need text files as output. Again, Cade has provided
excellent documentation on this conversion, as well as a Python script:
Cade's Python script for converting .warc files to text files
Some notes from Cade
3) Now that you have text files, create a Hadoop/MapReduce job to
create the inverted index (again, refer to Cade's notes). This will
take some time.
4) Import your results into sqlite
5) Create a simple search engine! Write your program in either Python
or Java and interface to your sqlite database. The search engine will
have two input options:
- input a single word . We define word as and contiguous collection of non-blank roman letters like cat, dog, eastern, michigan. In this case, the program should return all documents containing the word
- input a pound sign (#) followed by a word, for
example #cat , #dog, #eastern . In this case, the program should return
with the total number of documents containing the word.
References for calling sqlite from Python or java can be found here:
Java and SQLite
Python and SQLite
What to hand in:
A paper copy of
your mapper and reducer programs. Make sure this code is well
documented and has both members names on each paper. You will also need
to demonstrate your program to me.