Class Crawler
java.lang.Object
   |
   +----java.lang.Thread
           |
           +----Crawler
public class Crawler
extends Thread
Netrand Project
Software Engineering - CS536
University of Wisconsin - Milwaukee
Authors:
- Spring 1998 - Francis William Kasper
File: Crawler.java
Note:
This file was originally part of a Web Crawler program written by
Tim Macinta in 1997 that gathered links from the Internet and built
a web search database. The files containing the logic for crawling
across the Internet were taken from that program and slightly modified
for the purpose of the NetRand project.
Calling the Crawler's start() method will cause the Crawler to
index all of the sites in its queue and then replace the main
index with the updated index when it completes. The Crawler's
queue should be filled with the starting URLs before calling
start().
eng_prefs
exit_when_done
indexer
q
urls_done
working_dir
Crawler(File, EnginePrefs)
- "working_dir" should be a directory that only this
Crawler and a given Indexer will be accessing.
addURL(URL)
- Takes "url_to_queue" and adds it to this Crawler's queue of URLs.
main(File, EnginePrefs, boolean)
main(String[])
- This is the method that is called when this class is invoked from
the command line.
run()
- This is where the actual crawling occurs.
simplify(URL)
- Takes "url" and removes all references to "/./" and "/../".
working_dir
File working_dir
indexer
Indexer indexer
q
FIFOQueue q
urls_done
Hashtable urls_done
eng_prefs
EnginePrefs eng_prefs
exit_when_done
boolean exit_when_done
Crawler
public Crawler(File working_dir,
EnginePrefs eng_prefs)
- "working_dir" should be a directory that only this
Crawler and a given Indexer will be accessing. If several
Crawlers are running simultaneously, each must be given a
different "working_dir" directory. No other thread, apart
from the selected Indexer, should write to this directory.
addURL
public void addURL(URL url_to_queue)
- Takes "url_to_queue" and adds it to this Crawler's queue of URLs.
This method should be used to add all of the desired starting URLs to
the queue before the Crawler is started. If the URL has already
been processed, or if it is a disallowed URL, it is not added.
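The queue-then-start pattern described above can be sketched roughly as follows. This is not the NetRand Crawler: SeedQueueSketch, isAllowed, and visited are hypothetical names, String stands in for URL, and the "allowed" policy (http and https only) is an assumption made purely for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Hypothetical stand-in for Crawler; illustrates the pattern only. */
final class SeedQueueSketch extends Thread {
    private final Queue<String> queue = new ArrayDeque<>(); // plays the role of "q"
    private final Set<String> seen = new HashSet<>();       // plays the role of "urls_done"
    private final List<String> visited = new ArrayList<>();

    /** Queues the URL unless it is disallowed or was already seen; returns true if queued. */
    synchronized boolean addURL(String url) {
        if (!isAllowed(url) || !seen.add(url)) {
            return false;
        }
        queue.add(url);
        return true;
    }

    /** Assumed policy for this sketch only: accept http and https URLs. */
    private static boolean isAllowed(String url) {
        return url.startsWith("http://") || url.startsWith("https://");
    }

    /** Drains the queue; a real crawler would fetch and index each URL here. */
    @Override
    public void run() {
        while (true) {
            synchronized (this) {
                String next = queue.poll();
                if (next == null) {
                    return;
                }
                visited.add(next);
            }
        }
    }

    /** Returns a copy of the URLs handled so far. */
    synchronized List<String> visited() {
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) throws InterruptedException {
        SeedQueueSketch crawler = new SeedQueueSketch();
        crawler.addURL("http://example.com/");
        crawler.addURL("http://example.com/");    // duplicate, not queued again
        crawler.addURL("ftp://example.com/file"); // disallowed in this sketch
        crawler.start();                          // seeds are queued before start()
        crawler.join();
        System.out.println(crawler.visited());
    }
}
```

Checking the "already seen" set inside addURL keeps the queue free of duplicates, so the run loop never has to filter them out later.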
simplify
URL simplify(URL url)
- Takes "url" and removes all references to "/./" and "/../". This
can be used to help eliminate looping. Also removes any anchor
(i.e., everything from a '#' onward).
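As an illustration of this kind of simplification, here is a minimal sketch built on java.net.URI.normalize(), which is a swapped-in technique and not necessarily how Crawler.simplify works internally. UrlSimplifierSketch is a hypothetical name, and the sketch takes and returns a String rather than a URL.

```java
import java.net.URI;

/** Hypothetical helper; not the Crawler.simplify implementation. */
final class UrlSimplifierSketch {

    /** Drops the anchor, then collapses "." and ".." path segments. */
    static String simplify(String url) {
        // Everything from the first '#' onward is the anchor (fragment).
        int hash = url.indexOf('#');
        String withoutAnchor = (hash >= 0) ? url.substring(0, hash) : url;
        // URI.normalize() removes "." segments and resolves ".." against
        // the segment that precedes it.
        return URI.create(withoutAnchor).normalize().toString();
    }

    public static void main(String[] args) {
        // prints http://example.com/a/c.html
        System.out.println(simplify("http://example.com/a/./b/../c.html#top"));
    }
}
```

Two differently written paths to the same page then compare equal as strings, which is what allows a "done" table keyed by URL to detect the repeat.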
run
public void run()
- This is where the actual crawling occurs.
- Overrides:
- run in class Thread
main
public static void main(String arg[])
- This is the method that is called when this class is invoked from
the command line. Calling this method will cause a Crawler to be
created and started, with the starting URLs read from the file
named by the first argument (arg[0]). That file should contain
only URLs, one per line. Blank lines are allowed, and lines
beginning with "#" are treated as comments and ignored.
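The URL file format described above could be read along the following lines. This is only a sketch under stated assumptions: SeedListSketch and parse are hypothetical names, and trimming surrounding whitespace before testing for a blank line or a leading "#" is an assumption that the original text does not confirm.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader for the one-URL-per-line file format. */
final class SeedListSketch {

    /** Returns the URL lines, skipping blank lines and "#" comments. */
    static List<String> parse(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim(); // assumption: whitespace is not significant
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            urls.add(trimmed);
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // With a file argument, read that file; otherwise use a small inline sample.
        List<String> lines = (args.length > 0)
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList("# starting points", "", "http://example.com/");
        for (String url : parse(lines)) {
            System.out.println(url);
        }
    }
}
```

Each returned string would then be turned into a URL and handed to addURL before the Crawler is started.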
main
public static void main(File file,
EnginePrefs prefs,
boolean exit)