Class Crawler

java.lang.Object
   |
   +----java.lang.Thread
           |
           +----Crawler

public class Crawler
extends Thread

NetRand Project

Software Engineering - CS536

University of Wisconsin - Milwaukee

Authors:


File: Crawler.java
Note: This file was originally part of a Web Crawler program written by Tim Macinta in 1997 that gathered links from the internet and built a web search database. The files containing the crawling logic were taken from that program and slightly modified for the NetRand project.
Calling the Crawler's start() method will cause the Crawler to index all of the sites in its queue and then replace the main index with the updated index when it completes. The Crawler's queue should be filled with the starting URLs before calling start().


Variable Index

 o eng_prefs
 o exit_when_done
 o indexer
 o q
 o urls_done
 o working_dir

Constructor Index

 o Crawler(File, EnginePrefs)
"working_dir" should be a directory that only this Crawler and a given Indexer will be accessing.

Method Index

 o addURL(URL)
Takes "url_to_queue" and adds it to this Crawler's queue of URLs.
 o main(File, EnginePrefs, boolean)
 o main(String[])
This is the method that is called when this class is invoked from the command line.
 o run()
This is where the actual crawling occurs.
 o simplify(URL)
Takes "url" and removes all references to "/./" and "/../".

Variables

 o working_dir
 File working_dir
 o indexer
 Indexer indexer
 o q
 FIFOQueue q
 o urls_done
 Hashtable urls_done
 o eng_prefs
 EnginePrefs eng_prefs
 o exit_when_done
 boolean exit_when_done

Constructors

 o Crawler
 public Crawler(File working_dir,
                EnginePrefs eng_prefs)
"working_dir" should be a directory that only this Crawler and a given Indexer will be accessing. This means that if several Crawlers are running simultaneously, they should all be given different "working_dir" directories. Also, no other threads should write to this directory (except for the selected Indexer).

Methods

 o addURL
 public void addURL(URL url_to_queue)
Takes "url_to_queue" and adds it to this Crawler's queue of URLs. This method should be used to add all of the desired starting URLs to the queue before the Crawler is started. If the URL has already been processed, or if it is a disallowed URL, it is not added.
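
The queueing rule described above (skip URLs that were already handled, skip URLs that are not allowed) can be illustrated with a minimal sketch. This is not the Crawler's actual code: the class name UrlQueueSketch, the "allowed" predicate, and the next() and pending() helpers are hypothetical, and the sketch records a URL as seen when it is enqueued, which is an assumption about how "already processed" is tracked. It also stores URLs as strings, because java.net.URL's equals() and hashCode() may perform host name resolution.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-in for the Crawler's "q" and "urls_done" fields.
final class UrlQueueSketch {
    private final Queue<String> queue = new ArrayDeque<>(); // FIFO order
    private final Set<String> seen = new HashSet<>();       // URLs already accepted
    private final Predicate<String> allowed;                // assumed filter for permitted URLs

    UrlQueueSketch(Predicate<String> allowed) {
        this.allowed = allowed;
    }

    // Returns true only if the URL was actually queued.
    boolean addURL(String url) {
        if (!allowed.test(url)) {
            return false; // disallowed URL: not added
        }
        if (!seen.add(url)) {
            return false; // already seen: not added again
        }
        queue.add(url);
        return true;
    }

    // Removes and returns the oldest queued URL, or null if the queue is empty.
    String next() {
        return queue.poll();
    }

    int pending() {
        return queue.size();
    }
}
```

Under these assumptions, adding the same starting URL twice leaves a single entry in the queue, so repeated links cannot cause the same page to be fetched again.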

 o simplify
 URL simplify(URL url)
Takes "url" and removes all references to "/./" and "/../". This can be used to help eliminate looping. Also removes all anchors (i.e., everything after and including a '#').
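
As an illustration of this kind of normalization, the sketch below resolves "." and ".." path segments and drops the anchor. It is a hedged approximation, not the Crawler's implementation: the class name UrlSimplifySketch and the String overload are hypothetical, and the sketch makes choices the description above does not specify (it collapses empty path segments such as "//", keeps the query string, drops any user info, and returns "/" for an empty path).

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper; only the behavior described in the text is taken from the source.
final class UrlSimplifySketch {

    static URL simplify(URL url) {
        String path = url.getPath();
        Deque<String> kept = new ArrayDeque<>();
        for (String segment : path.split("/")) {
            if (segment.isEmpty() || segment.equals(".")) {
                continue;            // "/./" and "//" contribute nothing (assumption for "//")
            }
            if (segment.equals("..")) {
                kept.pollLast();     // "/../" cancels the previous segment, if there is one
                continue;
            }
            kept.addLast(segment);
        }
        StringBuilder file = new StringBuilder();
        for (String segment : kept) {
            file.append('/').append(segment);
        }
        // Keep a trailing slash when the original path named a directory.
        boolean directory = kept.isEmpty() || path.endsWith("/")
                || path.endsWith("/.") || path.endsWith("/..");
        if (directory) {
            file.append('/');
        }
        if (url.getQuery() != null) {
            file.append('?').append(url.getQuery());
        }
        try {
            // The anchor (ref) is dropped simply by not including it in the rebuilt URL.
            return new URL(url.getProtocol(), url.getHost(), url.getPort(), file.toString());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Convenience overload for callers that hold the URL as text.
    static String simplify(String spec) {
        try {
            return simplify(new URL(spec)).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With this sketch, "http://example.com/a/./b/../c.html#top" becomes "http://example.com/a/c.html". Normalizing before the duplicate check is what helps against looping: two spellings of the same page then compare equal.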

 o run
 public void run()
This is where the actual crawling occurs.

Overrides:
run in class Thread
 o main
 public static void main(String arg[])
This is the method that is called when this class is invoked from the command line. Calling this method will cause a Crawler to be created and started, with the starting URLs listed in a file specified by the first argument (arg[0]). The file should contain only URLs, one per line. Blank lines are allowed, and lines beginning with "#" are treated as comments and ignored.
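
The file format described above can be read with a few lines of code. The following is a minimal sketch, not the Crawler's own parser: the class UrlListSketch and its method names are hypothetical, and trimming surrounding whitespace before checking for "#" is an assumption that goes slightly beyond the description.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for reading a starting-URL list file.
final class UrlListSketch {

    // Keeps one URL per line, skipping blank lines and "#" comment lines.
    static List<String> parseUrlList(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim(); // assumption: surrounding whitespace is not significant
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            urls.add(line);
        }
        return urls;
    }

    // Reads the file named by the first command-line argument (arg[0] in the text above).
    static List<String> readUrlList(Path file) throws IOException {
        return parseUrlList(Files.readAllLines(file));
    }
}
```

Each returned string would then be turned into a URL and passed to addURL before the Crawler is started.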

 o main
 public static void main(File file,
                         EnginePrefs prefs,
                         boolean exit)