Mining the Web

References:

Books:

"Mining the Web: Analysis of Hypertext and Semi Structured Data", Soumen Chakrabarti, Morgan Kaufmann Publishers, 2003
"Modern Information Retrieval", by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 1999
"Managing Gigabytes: Compressing and Indexing Documents and Images", Witten, Moffat, Bell, Morgan Kaufmann Publishers, 1999

Useful Links:

WebMasterWorld.com - Excellent source for online search-engine related topics
IP Lists - IP addresses for major spiders. Good references here!
SearchTool.com - More good references
Robotstxt.org - Standards for writing well-behaved spiders and robots.
www.paulgraham.com - Paul Graham is an expert on spam and spam filters. Very good information here!
Spider/Crawler code links
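
A well-behaved spider checks a site's robots.txt before fetching a page. The sketch below uses Python's standard urllib.robotparser module to show the idea. The robots.txt content, the user-agent name "ClassCrawler", and the example.com URLs are all made up for illustration; a real crawler would download the file from the host it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch this file
# from the root of the site it wants to visit instead of hard-coding it.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(EXAMPLE_ROBOTS_TXT)

# "ClassCrawler" is an assumed user-agent name for this example.
allowed = parser.can_fetch("ClassCrawler", "http://example.com/index.html")
blocked = parser.can_fetch("ClassCrawler", "http://example.com/private/data.html")
print(allowed, blocked)
```

Parsing from a string keeps the example runnable without network access. RobotFileParser also offers set_url() and read() for fetching a live file.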

Sources of Text Data:

Many of the projects for this class will require a large source of text data. An easy source is a newsgroup archive (this will be demonstrated in class). Three excellent sources can be found here. Clicking on the link "Classified Web Pages" will direct you to a collection of web pages from 4 different universities. "20 Newsgroups DataSet" is a collection of newsgroup archives, and "7sectors DataSet" is a collection of web pages for people looking for a job.

Would you like to write a Bayesian spam filter? If so, you need a good supply of example spam. Paul Graham's web page (we will read some of Graham's papers on the subject) contains good links to spam archives.
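
To give a feel for what such a filter involves, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is not Graham's scoring method, only the textbook technique, and the class name, the tokenizer, and the four training messages are all invented for illustration. A usable filter needs a real spam archive like the ones linked from Graham's page.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9$']+", text.lower())


class NaiveBayesFilter:
    """Toy multinomial naive Bayes classifier with add-one smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def log_score(self, text, label):
        # Assumes at least one training message per label (else log(0)).
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)  # class prior
        for token in tokenize(text):
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        return score

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Made-up training messages, far too few for real use.
spam_filter = NaiveBayesFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("lecture notes for the class project", "ham")
spam_filter.train("meeting about the class schedule", "ham")

print(spam_filter.classify("free money now"))
print(spam_filter.classify("meeting notes for class"))
```

Working with log probabilities avoids numeric underflow on long messages, and the add-one smoothing keeps an unseen word from driving a score to zero.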

The standard data source for text is the Reuters Corpus (RCV1), available here. Please note that this collection is very large (several gigabytes) and probably not suitable for our projects.

FLASH!! The Google n-gram dataset has received a lot of attention lately. Do a Google search on "google ngram dataset".

Sources for Stopwords:

This is very easy. Do a Google search on "stopwords". Here is a list in German.
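
Once you have a stopword list, using it is a simple filtering step. The sketch below shows the idea with a tiny, made-up list of ten English stopwords; in a project you would load one of the much longer lists you find online.

```python
import re

# Tiny illustrative list (an assumption for this example); real stopword
# lists typically contain a few hundred words.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for", "on"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into words, and drop the stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(remove_stopwords("The spider is crawling the pages of a site"))
# prints ['spider', 'crawling', 'pages', 'site']
```

Storing the list as a set keeps each membership test fast, which matters when you process a large collection of documents.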

Porter's Stemming Algorithm:

Get a description of the algorithm and source code in the programming language of your choice directly from Porter himself here.
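
The algorithm strips suffixes in a series of rule-based steps. As a taste of how the rules look, here is a sketch of only the first one, step 1a, which handles plurals. The function name is made up for this example, and this is not a complete stemmer; use Porter's own code for real work.

```python
def porter_step_1a(word):
    """Apply only step 1a of Porter's algorithm (plural suffixes).

    The rules are tried in order and the first matching suffix wins:
        SSES -> SS, IES -> I, SS -> SS, S -> (removed)
    """
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for example in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(example, "->", porter_step_1a(example))
```

Note that the output ("poni", "ti") need not be a dictionary word. The stemmer only has to map related word forms to the same string.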

A Very Useful Tool (Download now!)

Sphinx - A Simple Web Crawler by Rob Miller (MIT)