References:
Books:
"Mining the Web: Analysis of Hypertext and Semi Structured Data", Soumen Chakrabarti, Morgan Kaufmann Publishers, 2003
"Modern Information Retrieval", Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley, 1999
"Managing Gigabytes: Compressing and Indexing Documents and Images", Witten, Moffat, Bell, Morgan Kaufmann Publishers, 1999
Useful Links:
WebMasterWorld.com - Excellent source for online search-engine related topics
IP Lists - IP numbers for major spiders. Good references here!
SearchTool.com - More good references
Robotstxt.org - Standards for writing well-behaved spiders and robots.
www.paulgraham.com - Paul Graham is an expert on spam and spam filters. Very good information here!
Spider/Crawler code links
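As a small illustration of what "well-behaved" means for a spider, the sketch below checks a URL against robots.txt rules before fetching it, using Python's standard `urllib.robotparser` module. The robots.txt content, the user-agent name `ClassSpider`, and the example.com URLs are assumptions made up for this example; a real spider would download the file from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this example).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def make_parser(robots_text):
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = make_parser(ROBOTS_TXT)

# A polite spider asks before every fetch.
print(parser.can_fetch("ClassSpider", "http://example.com/index.html"))
print(parser.can_fetch("ClassSpider", "http://example.com/private/data.html"))

# Seconds to wait between requests, if the site specifies a delay.
print(parser.crawl_delay("ClassSpider"))
```

The first check is allowed, the second is blocked by the `Disallow` rule, and the delay tells the spider how long to pause between requests.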
Sources of Text Data:
Many of the projects for this class will require a large
source of text data. An easy source is newsgroup archives (this will
be demonstrated in class). Three excellent sources can be found here.
The link "Classified Web Pages" leads to a collection of web pages
from 4 different universities. "20 Newsgroups DataSet" is a collection
of newsgroup archives, and "7sectors DataSet" is a collection of web
pages for people looking for a job.
Would you like to write a Bayesian spam filter? If so, you need a good
supply of example spam. Paul Graham's web page (we will read some of
Graham's papers on the subject) contains good links to spam archives.
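To give a feel for why the example spam matters, here is a minimal sketch of a Bayesian filter. It is a plain multinomial naive Bayes classifier with add-one smoothing, not Graham's exact scoring scheme, and the four training messages are toy data invented for this example. A usable filter needs thousands of real messages from archives like the ones linked above.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

class NaiveBayesSpamFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one message labeled 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        """Log prior plus summed log word likelihoods, add-one smoothed."""
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / (total + vocab))
        return score

    def is_spam(self, text):
        return self.log_score(text, "spam") > self.log_score(text, "ham")

# Toy training data (invented for illustration only).
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("buy cheap pills now", "spam")
spam_filter.train("cheap money offer now", "spam")
spam_filter.train("meeting notes for class project", "ham")
spam_filter.train("project report due monday", "ham")

print(spam_filter.is_spam("cheap pills offer"))
print(spam_filter.is_spam("class project meeting"))
```

Working in log space avoids numeric underflow when a message contains many words, and the smoothing keeps a single unseen word from forcing a probability of zero.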
The standard data source for text is the Reuters Corpus (RCV1),
available here. Please note that this collection is very large
(several gigabytes) and probably not suitable for our projects.
FLASH!! The Google n-gram dataset has received a lot of attention lately. Search Google for "google ngram dataset"!
Sources for Stopwords:
This is very easy. Do a Google search on "stopwords". Here is a
list in German.
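Once you have a stopword list, applying it takes only a few lines. The sketch below is an assumption-laden example: the ten-word `STOPWORDS` set is a tiny stand-in for a real downloaded list, and the tokenizer keeps only runs of the letters a to z.

```python
import re

# Tiny stand-in list (an assumption); real lists hold hundreds of words.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase, tokenize on letters, and drop any token in the stopword set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords("The analysis of hypertext is a topic in the class"))
# prints ['analysis', 'hypertext', 'topic', 'class']
```

Storing the list in a set keeps each lookup constant-time, which matters once the text collection grows large.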
Porter's Stemming Algorithm:
Get a description of the algorithm and source code in the programming
language of your choice directly from Porter himself here.
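To show the flavor of the algorithm, here is a sketch of Step 1a only, which handles plural endings. The full algorithm has several more steps and a "measure" condition on the stem that this sketch does not cover, so use Porter's own source code for real work. The function name `porter_step_1a` is made up for this example.

```python
def porter_step_1a(word):
    """Apply only Step 1a of Porter's algorithm (plural endings).

    Rules are tried longest suffix first:
    SSES -> SS, IES -> I, SS -> SS, S -> (removed).
    """
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Note that the output need not be a dictionary word ("poni", "ti"); a stemmer only has to map related word forms to the same string.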
A Very Useful Tool (Download now!)
Sphinx - A Simple Web Crawler by Rob Miller (MIT)