Class EnginePrefs
java.lang.Object
|
+----EnginePrefs
- public class EnginePrefs
- extends Object
NetRand Project
Software Engineering - CS536
University of Wisconsin - Milwaukee
Authors:
- Spring 1998 - Francis William Kasper
File: EnginePrefs.java
Note:
This file was originally part of a Web Crawler program written by
Tim Macinta in 1997 that gathered links from the Internet and built
a web search database. The files containing the crawling logic were
taken from that program and slightly modified for the NetRand project.
Encapsulates the preferences for the crawler and the search
engine.
email_address
excluded
filter_cgi
hosts
included
main_dir
maxCacheFiles
pause_time
- The time to pause between URL fetches (in seconds).
port
rules
url_list
user_agent
working_dir
EnginePrefs()
getMainDir()
getRulesFile()
- The rules file contains rules which determine which URLs are allowed
and which URLs should be excluded.
getStartingFile()
getWorkingDir()
- Returns the working directory for use by a crawler.
pauseBetweenURLs()
- Pauses for the amount of time that has been specified for pausing
between URL fetches.
readRobotsDotText(String, int)
- Reads the "robots.txt" file on the given host and uses the results
to determine what files on "host" are crawlable.
readRulesFile()
- Causes the inclusion/exclusion rules to be read.
URLAllowed(URL)
- Returns true if "url" is allowed to be indexed and false otherwise.
URLNotIndexable(URL)
- Returns true if this URL represents a file type that is not indexable.
pause_time
public int pause_time
- The time to pause between URL fetches (in seconds).
maxCacheFiles
public int maxCacheFiles
main_dir
File main_dir
rules
File rules
url_list
File url_list
working_dir
File working_dir
excluded
Vector excluded
included
Vector included
hosts
Hashtable hosts
user_agent
String user_agent
email_address
String email_address
filter_cgi
boolean filter_cgi
port
public static int port
EnginePrefs
public EnginePrefs()
URLAllowed
public boolean URLAllowed(URL url)
- Returns true if "url" is allowed to be indexed and false otherwise.
pauseBetweenURLs
public void pauseBetweenURLs()
- Pauses for the amount of time that has been specified for pausing
between URL fetches.
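A minimal sketch of how such a pause could be implemented, assuming the pause is a plain Thread.sleep driven by a seconds value like pause_time. The class PauseSketch and its members are hypothetical stand-ins for illustration, not the original EnginePrefs source.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the pause logic; not the original EnginePrefs code. */
final class PauseSketch {
    /** Mirrors the documented pause_time field: seconds between URL fetches. */
    int pauseTime = 5;

    /** Converts the pause to milliseconds; negative values are treated as zero. */
    long pauseMillis() {
        return TimeUnit.SECONDS.toMillis(Math.max(0, pauseTime));
    }

    /** Sleeps for the configured pause and restores the interrupt flag if interrupted. */
    void pauseBetweenURLs() {
        try {
            Thread.sleep(pauseMillis());
        } catch (InterruptedException e) {
            // Assumption: the crawler thread should see the interrupt later.
            Thread.currentThread().interrupt();
        }
    }
}
```

A crawler loop would call pauseBetweenURLs() once after each fetch so that a single host is not hit with back-to-back requests.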
getMainDir
public File getMainDir()
getWorkingDir
public File getWorkingDir()
- Returns the working directory for use by a crawler. If more than
one crawler is running at the same time they should be given different
working directories.
getStartingFile
public File getStartingFile()
getRulesFile
public File getRulesFile()
- The rules file contains rules which determine which URLs are allowed
and which URLs should be excluded. A line of the form:
include http://gsd.mit.edu/
causes all URLs that start with "http://gsd.mit.edu/" to be
included. Similarly, to exclude URLs, use the keyword "exclude"
instead of "include". Blank lines and lines starting with "#" are
ignored.
When a URL is checked against the inclusion/exclusion rules, the
exclusion rules are checked first, and a URL that matches an
exclusion rule is not included. A URL that matches neither kind of
rule is not included either, unless it is a "file://" URL, in
which case it is included by default.
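The rule format and the order of checks described above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the class RuleSketch and its method names are hypothetical, prefixes are compared case-sensitively, lines are trimmed before the "#" test, and unknown keywords are silently skipped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the include/exclude rules; not the original code. */
final class RuleSketch {
    private final List<String> included = new ArrayList<>();
    private final List<String> excluded = new ArrayList<>();

    /** Parses rules text with one "include <prefix>" or "exclude <prefix>" per line. */
    void readRules(String text) {
        included.clear();
        excluded.clear();
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank lines and comment lines are ignored
            }
            String[] parts = line.split("\\s+", 2);
            if (parts.length < 2) {
                continue; // assumption: a keyword without a prefix is skipped
            }
            if (parts[0].equals("include")) {
                included.add(parts[1]);
            } else if (parts[0].equals("exclude")) {
                excluded.add(parts[1]);
            }
        }
    }

    /** Exclusion rules win; otherwise an include rule or a file:// URL is required. */
    boolean urlAllowed(String url) {
        for (String prefix : excluded) {
            if (url.startsWith(prefix)) {
                return false;
            }
        }
        for (String prefix : included) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        return url.startsWith("file://"); // default for URLs covered by no rule
    }
}
```

Because exclusion is checked first, a rule such as "exclude http://gsd.mit.edu/private/" carves a hole out of a broader "include http://gsd.mit.edu/" rule regardless of the order in which the two lines appear in the file.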
readRulesFile
public void readRulesFile() throws IOException
- Causes the inclusion/exclusion rules to be read. This method should
be called if the rules file is changed.
readRobotsDotText
public void readRobotsDotText(String host,
int port)
- Reads the "robots.txt" file on the given host and uses the results
to determine what files on "host" are crawlable.
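As an illustration of what reading "robots.txt" involves, the sketch below extracts the Disallow prefixes that apply to a given user agent from text that has already been fetched. It is a simplification under stated assumptions: the class RobotsSketch and its methods are hypothetical, fetching over the network is omitted, plain HTTP is assumed for the file location, and records for "*" and for a matching agent name are merged rather than prioritized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of robots.txt handling; not the original code. */
final class RobotsSketch {
    /** Location of robots.txt for a host and port (plain HTTP assumed). */
    static String robotsLocation(String host, int port) {
        return "http://" + host + ":" + port + "/robots.txt";
    }

    /** Returns the Disallow path prefixes that apply to the given user agent. */
    static List<String> disallowedPaths(String robotsTxt, String userAgent) {
        List<String> disallowed = new ArrayList<>();
        String agent = userAgent.toLowerCase(Locale.ROOT);
        boolean applies = false;      // does the current record apply to us?
        boolean inAgentLines = false; // are we inside a run of User-agent lines?
        for (String raw : robotsTxt.split("\n")) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!inAgentLines) {
                    applies = false; // a new record starts here
                    inAgentLines = true;
                }
                String v = value.toLowerCase(Locale.ROOT);
                if (v.equals("*") || (!v.isEmpty() && agent.contains(v))) {
                    applies = true;
                }
            } else {
                inAgentLines = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
        return disallowed;
    }

    /** A path is crawlable when it starts with none of the disallowed prefixes. */
    static boolean pathCrawlable(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

The per-host results would presumably be cached, for example keyed by host name, so that the file is read once per host rather than once per URL.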
URLNotIndexable
public boolean URLNotIndexable(URL url)
- Returns true if this URL represents a file type that is not indexable.
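The documentation does not say how a file type is judged non-indexable. One plausible approach, shown here purely as an assumption, is to look at the file extension in the URL path and, when a flag like filter_cgi is set, to skip CGI paths as well. The class IndexableSketch, the extension list, and the "/cgi-bin/" test are all hypothetical.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a file-type check; the real criteria are not documented. */
final class IndexableSketch {
    // Assumed list of extensions that carry no indexable text.
    private static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("gif", "jpg", "jpeg", "png", "zip", "exe"));

    /** Returns true when the URL looks like a file type that should not be indexed. */
    static boolean notIndexable(String url, boolean filterCgi) {
        // URI.create throws IllegalArgumentException for syntactically invalid input.
        String path = URI.create(url).getPath();
        String lower = (path == null ? "" : path).toLowerCase(Locale.ROOT);
        if (filterCgi && lower.contains("/cgi-bin/")) {
            return true; // assumption: CGI output is skipped when filtering is on
        }
        int slash = lower.lastIndexOf('/');
        int dot = lower.lastIndexOf('.');
        if (dot <= slash) {
            return false; // no extension in the last path segment
        }
        return SKIPPED.contains(lower.substring(dot + 1));
    }
}
```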