Class EnginePrefs

java.lang.Object
   |
   +----EnginePrefs

public class EnginePrefs
extends Object

NetRand Project

Software Engineering - CS536

University of Wisconsin - Milwaukee

Authors:


File: EnginePrefs.java
Note: This file was originally part of a web crawler program written by Tim Macinta in 1997 that gathered links from the internet and built a web search database. The files containing the crawling logic were taken from that program and slightly modified for the NetRand project.
Encapsulates the preferences for the crawler and the search engine.


Variable Index

 o email_address
 o excluded
 o filter_cgi
 o hosts
 o included
 o main_dir
 o maxCacheFiles
 o pause_time
The time to pause between URL fetches (in seconds).
 o port
 o rules
 o url_list
 o user_agent
 o working_dir

Constructor Index

 o EnginePrefs()

Method Index

 o getMainDir()
 o getRulesFile()
The rules file contains rules that determine which URLs are allowed and which URLs should be excluded.
 o getStartingFile()
 o getWorkingDir()
Returns the working directory for use by a crawler.
 o pauseBetweenURLs()
Pauses for the amount of time that has been specified for pausing between URL fetches.
 o readRobotsDotText(String, int)
Reads the "robots.txt" file on the given host and uses the results to determine what files on "host" are crawlable.
 o readRulesFile()
Causes the inclusion/exclusion rules to be read.
 o URLAllowed(URL)
Returns true if "url" is allowed to be indexed and false otherwise.
 o URLNotIndexable(URL)
Returns true if this URL represents a file type that is not indexable.

Variables

 o pause_time
 public int pause_time
The time to pause between URL fetches (in seconds).

 o maxCacheFiles
 public int maxCacheFiles
 o main_dir
 File main_dir
 o rules
 File rules
 o url_list
 File url_list
 o working_dir
 File working_dir
 o excluded
 Vector excluded
 o included
 Vector included
 o hosts
 Hashtable hosts
 o user_agent
 String user_agent
 o email_address
 String email_address
 o filter_cgi
 boolean filter_cgi
 o port
 public static int port

Constructors

 o EnginePrefs
 public EnginePrefs()

Methods

 o URLAllowed
 public boolean URLAllowed(URL url)
Returns true if "url" is allowed to be indexed and false otherwise.

 o pauseBetweenURLs
 public void pauseBetweenURLs()
Pauses for the amount of time that has been specified for pausing between URL fetches.
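
The pause described above can be sketched roughly as follows. This is only an illustration, assuming the delay comes from a seconds value like pause_time; the class name PauseSketch and its constructor are hypothetical and not part of EnginePrefs.

```java
/** Hypothetical stand-in for the pausing behaviour; not the original source. */
final class PauseSketch {
    // Assumed to mirror pause_time: seconds to wait between URL fetches.
    private final int pauseTimeSeconds;

    PauseSketch(int pauseTimeSeconds) {
        this.pauseTimeSeconds = pauseTimeSeconds;
    }

    /** Sleeps for the configured number of seconds; returns at once if it is not positive. */
    void pauseBetweenURLs() {
        if (pauseTimeSeconds <= 0) {
            return;
        }
        try {
            // Thread.sleep takes milliseconds; 1000L avoids int overflow.
            Thread.sleep(pauseTimeSeconds * 1000L);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so the caller can react to it.
            Thread.currentThread().interrupt();
        }
    }
}
```

A crawler loop would call pauseBetweenURLs() once after each fetch so that it does not overload the hosts it visits.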

 o getMainDir
 public File getMainDir()
 o getWorkingDir
 public File getWorkingDir()
Returns the working directory for use by a crawler. If more than one crawler is running at the same time they should be given different working directories.

 o getStartingFile
 public File getStartingFile()
 o getRulesFile
 public File getRulesFile()
The rules file contains rules that determine which URLs are allowed and which URLs should be excluded. A line of the form:
  include http://gsd.mit.edu/
  
will cause all URLs that start with "http://gsd.mit.edu/" to be included. Similarly, to exclude URLs, use the keyword "exclude" instead of "include". Blank lines and lines starting with "#" are ignored.

When a URL is checked against the inclusion/exclusion rules, the exclusion rules are checked first; if the URL matches an exclusion rule, it is not included. If a URL is not covered by either kind of rule, it is not included, unless it is a "file://" URL, in which case it is included by default.
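
The checking order described above can be illustrated with a minimal sketch. It assumes the rules are held as plain prefix strings; the class RuleOrderSketch and the method isAllowed are hypothetical names, and the real URLAllowed(URL) may differ in detail.

```java
import java.util.List;

/** Hypothetical illustration of the documented rule order; not the original source. */
final class RuleOrderSketch {
    /**
     * Applies the documented order: exclusion prefixes first, then inclusion
     * prefixes, then the default (only file URLs are allowed).
     */
    static boolean isAllowed(String url, List<String> excluded, List<String> included) {
        for (String prefix : excluded) {
            if (url.startsWith(prefix)) {
                return false; // an exclusion match always wins
            }
        }
        for (String prefix : included) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        // Not covered by any rule: allowed only for file URLs.
        // Assumption: testing the "file:" scheme prefix is close enough to "file://".
        return url.startsWith("file:");
    }
}
```

Because exclusions are tested first, a narrow exclude rule can carve a subtree out of a broader include rule.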

 o readRulesFile
 public void readRulesFile() throws IOException
Causes the inclusion/exclusion rules to be read. This method should be called if the rules file is changed.
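
As a rough sketch of what re-reading the rules might involve, the following parses "include" and "exclude" lines and skips blank lines and lines starting with "#", as documented for the rules file. The class RulesFileSketch, its fields, and the use of a Reader argument are assumptions made for illustration; the original reads the file returned by getRulesFile().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical rules parser; not the original source. */
final class RulesFileSketch {
    final List<String> included = new ArrayList<>();
    final List<String> excluded = new ArrayList<>();

    /** Discards any previously loaded rules and loads them again from the reader. */
    void read(Reader source) throws IOException {
        included.clear();
        excluded.clear();
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // blank lines and comments are ignored
            }
            // Split into keyword and URL prefix.
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length < 2) {
                continue; // keyword without a prefix: skipped in this sketch
            }
            if ("include".equals(parts[0])) {
                included.add(parts[1].trim());
            } else if ("exclude".equals(parts[0])) {
                excluded.add(parts[1].trim());
            }
        }
    }
}
```

Clearing both lists before parsing means a second call reflects only the current contents of the file, which matches the advice to call the method again after the rules file changes.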

 o readRobotsDotText
 public void readRobotsDotText(String host,
                               int port)
Reads the "robots.txt" file on the given host and uses the results to determine what files on "host" are crawlable.
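
To illustrate how robots.txt content can determine crawlable paths, here is a simplified sketch that extracts the "Disallow" prefixes applying to a given user agent. It only parses text that has already been fetched; the class RobotsSketch and the method disallowedPaths are hypothetical, and the matching is deliberately loose (rules for "*" and for a matching agent are merged), which may not be what the original method does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical robots.txt parser; not the original source. */
final class RobotsSketch {
    /** Returns the Disallow path prefixes in groups for "*" or for the given agent. */
    static List<String> disallowedPaths(String robotsTxt, String userAgent) {
        List<String> disallowed = new ArrayList<>();
        String agent = userAgent.toLowerCase();
        boolean applies = false;      // does the current group apply to us?
        boolean lastWasAgent = false; // consecutive User-agent lines share a group
        for (String raw : robotsTxt.split("\r?\n")) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean match = value.equals("*")
                        || (!value.isEmpty() && agent.contains(value.toLowerCase()));
                applies = lastWasAgent ? (applies || match) : match;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means nothing is disallowed.
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
        return disallowed;
    }
}
```

A URL on the host would then be treated as not crawlable if its path starts with any of the returned prefixes.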

 o URLNotIndexable
 public boolean URLNotIndexable(URL url)
Returns true if this URL represents a file type that is not indexable.
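
One plausible way to implement such a check is by file extension, sketched below. The documentation does not say which file types are treated as not indexable, so the extension list here is purely an assumption, as are the names IndexableSketch and urlNotIndexable.

```java
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical extension-based check; not the original source. */
final class IndexableSketch {
    // Assumed list of binary formats; the real list is not documented.
    private static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("gif", "jpg", "jpeg", "png", "zip", "exe"));

    /** Returns true if the last path segment ends in one of the skipped extensions. */
    static boolean urlNotIndexable(URL url) {
        String path = url.getPath(); // excludes any query string
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash) {
            return false; // no extension in the last path segment
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SKIPPED.contains(ext);
    }
}
```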