Simple Web Crawling
Telnet:
You can telnet to port 80 and issue the appropriate HTTP request by hand. For
example, the following retrieves the root page of the server named on the
command line:
telnet www.it.uu.se 80    (wait a few seconds for the server to respond)
GET / HTTP/1.0\n\n
HTTP is a bit picky about capital letters, so it is easiest to always use
them. Note that every HTTP request is terminated with two newline characters,
i.e. a blank line (strictly speaking the standard specifies \r\n\r\n, but most
servers also accept \n\n). After typing the above, an HTTP response header and
then the entire HTML document will be downloaded. The forward slash names the
file to download; here it is the root document (typically index.html). To
download a file called sil.html located in the root directory, one would type:
GET /sil.html HTTP/1.0\n\n
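For reference, the server's reply starts with a status line and headers,
followed by a blank line and then the document itself. The exact headers vary
from server to server; the values below are only illustrative:
HTTP/1.0 200 OK
Content-Type: text/html
Content-Length: 1234

<html>
...rest of the document...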
A Simple Perl Script:
#!/usr/bin/perl -w
use strict;
use IO::Socket;

my $PORT = 80;    # standard HTTP port
my $URL;          # host name taken from the command line
my $input;

# Unlike C, $ARGV[0] holds the first argument, not the program name
$URL = $ARGV[0];

# Open a TCP connection to the web server
my $sock_obj = IO::Socket::INET->new(PeerAddr => $URL,
                                     PeerPort => $PORT,
                                     Proto    => 'tcp')
    or die "Could not create socket\n";

# Send the request for the root document
print $sock_obj "GET / HTTP/1.0\n\n";

# Read the response one line at a time
while ($input = <$sock_obj>)
{
    # A meager attempt to parse URLs follows
    # if ($input =~ /<a href=([^>]+)>/i)
    # { print "$1\n"; }
    print "$input";    # Comment out this line and uncomment the two
                       # code lines above for a simple crawler
}
$sock_obj->close;
exit;
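To try the script, save it as, say, crawler.pl (the file name is arbitrary)
and pass the server's host name as the only argument:
perl crawler.pl www.it.uu.se
Note that the argument must be a bare host name: IO::Socket::INET's PeerAddr
expects a host to connect to, so despite the variable name $URL, passing a
full URL such as http://www.it.uu.se would cause the connection to fail.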
Sources on the Web:
Links to source code for crawlers may be found at http://www.searchtools.com/robots/robot-code.html