Simple Web Crawling

Telnet:

You can telnet to port 80 of a web server and issue the appropriate HTTP request by hand. For example, the following retrieves a server's home page (the server's host name is given as an argument to telnet):

    telnet www.it.uu.se 80    (wait a few seconds for the server to respond)   
    GET / HTTP/1.0\n\n

HTTP is picky about capitalization; it is easiest to always write the method and version in capital letters. Note also that every HTTP request is terminated by a blank line, written here as two newline characters (\n\n); in a telnet session this means pressing Enter twice. (Strictly speaking, HTTP specifies \r\n\r\n, but most servers also accept bare newlines.) After typing the preceding request, an HTTP response header and then the entire HTML document will be downloaded. The forward slash indicates the file to download, in this case the root document (typically index.html). To download a file called sil.html located in the root directory, one would type:
    GET /sil.html HTTP/1.0\n\n
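
The first lines of the response look something like the following (an illustrative example only; the exact headers vary from server to server):

    HTTP/1.0 200 OK
    Server: Apache
    Content-Type: text/html

    <html>
    ...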

A Simple Perl Script:

#! /usr/bin/perl -w

use strict;
use IO::Socket;

my $PORT=80;
my $URL;
my $input;

$URL=$ARGV[0];        # Unlike C, $ARGV[0] is the first argument, not the program name

my $sock_obj=IO::Socket::INET->new(PeerAddr=>$URL,
                                   PeerPort=>$PORT,
                                   Proto=>'tcp')
    or die "Could not create socket\n";

print $sock_obj "GET / HTTP/1.0\n\n";   # strictly, HTTP specifies \r\n\r\n, but most servers accept bare newlines
while ($input = <$sock_obj>)
{

#   A meager attempt to parse URLs follows

#   if ($input =~ /<a href=([^>]+)>/i)
#   {  print "$1\n"; }

    print "$input";     # Comment this line out and uncomment the if statement above for a simple crawler

}
$sock_obj->close;
exit;
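
Assuming the script is saved as webget.pl (the name is arbitrary) and made executable, it is run with the server's host name as its only argument:

    chmod +x webget.pl
    ./webget.pl www.it.uu.se

As written it prints the response header and the page to standard output. With the print line commented out and the if statement uncommented, it instead prints whatever follows href= in each anchor tag, e.g. "http://example.com/" from <a href="http://example.com/">.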



Sources on the Web

Links to source code for crawlers may be found at http://www.searchtools.com/robots/robot-code.html