Introduction
Spiders are programs that can visit Web sites and follow hyperlinks. By using a spider, you can quickly map out all of the pages contained on a Web site. This article will show you how to use the Java programming language to construct a spider. A reusable Spider class that encapsulates a basic spider will be presented. Then, an example will be shown of how to create a specific spider that will scan a Web site and find broken links.
Java is a particularly good choice as a language to construct a spider. Java has built-in support for the HTTP protocol, which is used to transfer most Web information. Java also has an HTML parser built in. These two features make Java an ideal choice for spiders.
Using the Spider
The example program, seen in Listing 1 at the bottom of the article, will scan a Web site, looking for bad links. To use the program, you must enter a URL and click the "Begin" button. As the spider begins, you will notice that the "Begin" button becomes a "Cancel" button. As the spider scans through the site, the progress is indicated below the "Cancel" button. The page currently being examined is displayed, along with a running count of good and bad links. Any bad links are displayed in the scrolling text area at the bottom of the program. Clicking "Cancel" will stop this process and allow you to enter a new URL. If "Cancel" is not selected, the program will run until no additional pages can be found. At this point, the "Cancel" button will switch back to a "Begin" button, indicating that the program is no longer running.
Now, you will be shown how this example program communicates with the reusable Spider class. The example program is contained in the CheckLinks class, as seen in Listing 1. This class implements the ISpiderReportable interface, as seen in Listing 2. This interface allows the Spider class to communicate with the example application. This interface defines three methods. The first method, named "spiderFoundURL", is called each time the spider locates a URL. Returning true from this method indicates that the spider should pursue this URL and find links there as well. The second method, named "spiderURLError", is called when any of the URLs that the spider is examining results in an error (such as a 404 "page not found"). The third method, named "spiderFoundEMail", is called by the spider each time an e-mail address is found. By using these three methods, the Spider class is able to communicate its findings back to the application that created it.
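Before looking at the Swing-based example, it may help to see the interface in isolation. The following is a minimal, hypothetical console implementation of "ISpiderReportable"; the class name "ConsoleReport" is invented for this sketch, and it follows the same same-host policy that the CheckLinks example uses.
import java.net.*;
// A hypothetical console-only implementation of ISpiderReportable.
public class ConsoleReport implements ISpiderReportable {
  public boolean spiderFoundURL(URL base,URL url)
  {
    System.out.println("Found: " + url);
    // pursue only links that stay on the starting host,
    // the same policy the CheckLinks example uses
    return url.getHost().equalsIgnoreCase(base.getHost());
  }
  public void spiderURLError(URL url)
  {
    System.out.println("Bad link: " + url);
  }
  public void spiderFoundEMail(String email)
  {
    System.out.println("E-mail: " + email);
  }
}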
The spider begins processing when the "Begin" button is clicked. To allow the example program to maintain its user interface, the spider runs as a separate thread. Clicking the "Begin" button starts this background spider thread. When the background thread begins, the run method of the "CheckLinks" class is called. The run method begins by instantiating the Spider object. This can be seen here:
spider = new Spider(this);
spider.clear();
base = new URL(url.getText());
spider.addURL(base);
spider.begin();
First, a new Spider object is instantiated. The Spider object's constructor requires that an "ISpiderReportable" object be passed to it. Because the "CheckLinks" class implements the "ISpiderReportable" interface, you simply pass the current object, represented by the keyword this, to the constructor. The spider maintains a list of URLs it has visited. The "clear" method is called to ensure that the spider starts with an empty URL list. For the spider to do anything at all, one URL must be added to its processing list. The base URL, the URL that the user entered into the example program, is added to the initial list. The spider will begin by scanning this page and will, hopefully, find other pages linked to this starting URL. Finally, the "begin" method is called to start the spider. The begin method will not return until the spider is done, or is canceled.
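Incidentally, the same five calls can drive the spider without any user interface at all. The following hypothetical console driver, which reuses the "ConsoleReport" sketch shown earlier, takes the starting URL from the command line:
import java.net.*;
// A hypothetical console driver for the Spider class.
public class ConsoleSpider {
  public static void main(String args[]) throws Exception
  {
    Spider spider = new Spider(new ConsoleReport());
    spider.clear();
    spider.addURL(new URL(args[0])); // the starting URL
    spider.begin(); // returns when the workload is empty or cancel() is called
  }
}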
As the spider runs, the three methods implemented by the "ISpiderReportable" interface are called to report what the spider is currently doing. Most of the work done by the example program is taken care of in the "spiderFoundURL" method. When the spider finds a new URL, it is first checked to see if it is valid. If this URL results in an error, the URL is reported as a bad link. If the link is found to be valid, the link is examined to see if it is on a different server. If the link is on the same server, the "spiderFoundURL" method returns true, indicating that the spider should pursue this URL and find other links there. Links on other servers are not scanned for additional links because this would cause the spider to endlessly browse the Internet, looking for more and more Web sites. The program is looking for links only on the Web site that the user indicated.
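Stripped of the Swing status updates, the decision made in "spiderFoundURL" (see Listing 1 for the complete method) reduces to the following:
if ( !checkLink(url) ) { // the connection attempt failed
  badLinksCount++;
  return false; // report a bad link; do not follow it
}
goodLinksCount++;
// follow the link only if it is on the same host the user entered
return url.getHost().equalsIgnoreCase(base.getHost());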
Constructing the Spider Class
The previous section showed you how to use the Spider class, as seen in Listing 3. The Spider class and the "ISpiderReportable" interface allow you to easily add spider capabilities to your own programs. This section will show you how the Spider class actually works.
The Spider class must keep track of which URLs it has visited. This must be done so that the spider ensures that it does not visit the same URL more than once. Further, the spider must divide these URLs into three separate groups. The first group, stored in the "workloadWaiting" property, contains a list of URLs that the spider has encountered but has not yet had an opportunity to process. The first URL that the spider is to visit is placed into this collection to allow the spider to begin. The second group, stored in the "workloadProcessed" collection, contains the URLs that the spider has already processed and does not need to revisit. The third group, stored in the "workloadError" property, contains the URLs that resulted in an error.
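This three-way split is what lets the spider guarantee that no URL is visited twice. The "addURL" method of the Spider class, condensed here from Listing 3 (the log call is omitted), consults all three collections before queuing a URL:
public void addURL(URL url)
{
  if ( getWorkloadWaiting().contains(url) )
    return; // already waiting to be processed
  if ( getWorkloadError().contains(url) )
    return; // already failed once
  if ( getWorkloadProcessed().contains(url) )
    return; // already visited
  getWorkloadWaiting().add(url);
}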
The begin method contains the main loop of the Spider class. The begin method repeatedly loops through the "workloadWaiting" collection and processes each page. Of course, as these pages are processed, other URLs are likely added to the "workloadWaiting" collection. The begin method continues this process until either the spider is canceled, by calling the Spider class's cancel method, or there are no URLs remaining in the "workloadWaiting" collection. This process is shown here:
cancel = false;
while ( !getWorkloadWaiting().isEmpty() && !cancel ) {
  Object list[] = getWorkloadWaiting().toArray();
  for ( int i=0;(i<list.length)&&!cancel;i++ )
    processURL((URL)list[i]);
}
As the preceding code loops through the "workloadWaiting" collection, it passes each of the URLs that are to be processed to the "processURL" method. This method will actually read and then parse the HTML stored at each URL.
Reading and Parsing HTML
Java contains support both for accessing the contents of URLs and for parsing HTML. The "processURL" method, which is called for each URL encountered, does both. Reading the contents of a URL is relatively easy in Java. The following code, from the "processURL" method, begins this process.
URLConnection connection = url.openConnection();
if ( (connection.getContentType()!=null) &&
     !connection.getContentType().toLowerCase()
       .startsWith("text/") ) {
  getWorkloadWaiting().remove(url);
  getWorkloadProcessed().add(url);
  log("Not processing because content type is: " +
      connection.getContentType() );
  return;
}
First, a "URLConnection" object is constructed from whatever URL, stored in the variable "url", was passed in. There are many different types of documents found on Web sites. A spider is only interested in those documents that contain HTML, specifically text-based documents. The preceding code makes sure that the content type of the document starts with "text/". If the document type is not textual, the URL is removed from the waiting workload and added to the processed workload. This ensures that this URL will not be investigated again.
Now that a connection has been opened to the specified URL, the contents must be parsed. The following lines of code allow you to open the URL connection, as though it were a file, and read the contents.
InputStream is = connection.getInputStream();
Reader r = new InputStreamReader(is);
You now have a Reader object that you can use to read the contents of this URL. For this spider, you will simply pass the contents on to the HTML parser. The HTML parser used in this example is the Swing HTML parser, which is built into Java. Java's support for HTML parsing is half-hearted at best: to gain access to the parser, you must call the "getParser" method of the "HTMLEditorKit" class, and unfortunately, Sun made this method protected. The only workaround is to create your own class that overrides the "getParser" method and makes it public. This is done by the provided "HTMLParse" class, as seen in Listing 4.
import javax.swing.text.html.*;
public class HTMLParse extends HTMLEditorKit {
  public HTMLEditorKit.Parser getParser()
  {
    return super.getParser();
  }
}
This class is used in the "processURL" method of the Spider class, as follows. As you can see, the Reader object (r) that was created to read the contents of the Web page is passed to the "HTMLEditorKit.Parser" object that was just obtained.
HTMLEditorKit.Parser parse = new HTMLParse().getParser();
parse.parse(r,new Parser(url),true);
You will also notice that a new Parser class is constructed. The Parser class is an inner class of the Spider class provided in the example. It is a callback class that contains methods that are called as each type of HTML tag is encountered. There are several callback methods, which are documented in the API documentation, but only two are of concern in this article: the methods called when a simple tag (a tag with no ending tag, such as <br>) or a begin tag is found. These two methods are named "handleSimpleTag" and "handleStartTag". Because the processing for each is identical, the "handleStartTag" method simply calls "handleSimpleTag". The "handleSimpleTag" method is then responsible for extracting hyperlinks from the document. These hyperlinks will be used to locate other pages for the spider to visit. The "handleSimpleTag" method begins by checking to see whether the current tag being parsed has an "href", or hypertext reference, attribute.
String href = (String)a.getAttribute(HTML.Attribute.HREF);
if ( (href==null) && (t==HTML.Tag.FRAME) )
  href = (String)a.getAttribute(HTML.Attribute.SRC);
if ( href==null )
  return;
If there is no "href" attribute, the current tag is checked to see if it is a Frame. Frames point to their pages using an "src" attribute. A typical hyperlink will appear as follows in HTML:
<a href="linkedpage.html">Click Here</a>
The "href" attribute in the above link points to the page be linked to. But the page "linkedpage.html" is not an address. You couldn't type "linkedpage.html" into a browser and go anywhere. The "linkedpage.html" simply specifies a page somewhere on the Web server. This is called a relative URL. The relative URL must be resolved to a full, absolute URL that specifies the page. This is done by using the following line of code:
URL url = new URL(base,str);
This constructs a URL, where str is the relative URL and base is the page that the URL was found on. Using this form of the URL class's constructor allows you to construct a full, absolute URL. With the URL now in its correct, absolute form, the URL is checked to see whether it has already been processed, by making sure it's not in any of the workload collections. If this URL has not been processed, it is added to the waiting workload. Later on, it will be processed as well and perhaps add other hyperlinks to the waiting workload.
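For example, the two-argument form of the URL constructor resolves the relative reference against the directory of the base page. A small sketch, using a made-up host name:
URL base = new URL("http://www.host.com/dir/index.html");
URL url = new URL(base,"linkedpage.html");
// prints http://www.host.com/dir/linkedpage.html
System.out.println(url);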
Conclusions
This article showed you how to create a simple spider that can visit every page on a Web site. The example program presented here could easily be a starting point for many other spider programs. More advanced spiders, ones that must handle a very large volume of pages, would likely make use of such things as multi-threading and SQL databases. Unfortunately, Java's built-in HTML parsing is not multi-thread safe, so building such a spider can be a somewhat complex task. Topics such as these are covered in my book Programming Spiders, Bots and Aggregators in Java (Sybex).
Listing 1: Finding the bad links (CheckLinks.java)
import java.awt.*;
import javax.swing.*;
import java.net.*;
import java.io.*;
/**
* This example uses a Java spider to scan a Web site
* and check for broken links. Written by Jeff Heaton.
* Jeff Heaton is the author of "Programming Spiders,
* Bots, and Aggregators" by Sybex. Jeff can be contacted
* through his Web site at http://www.jeffheaton.com.
*
* @author Jeff Heaton(http://www.jeffheaton.com)
* @version 1.0
*/
public class CheckLinks extends javax.swing.JFrame implements
Runnable,ISpiderReportable {
/**
* The constructor. Perform setup here.
*/
public CheckLinks()
{
//{{INIT_CONTROLS
setTitle("Find Broken Links");
getContentPane().setLayout(null);
setSize(405,288);
setVisible(false);
label1.setText("Enter a URL:");
getContentPane().add(label1);
label1.setBounds(12,12,84,12);
begin.setText("Begin");
begin.setActionCommand("Begin");
getContentPane().add(begin);
begin.setBounds(12,36,84,24);
getContentPane().add(url);
url.setBounds(108,36,288,24);
errorScroll.setAutoscrolls(true);
errorScroll.setHorizontalScrollBarPolicy(javax.swing.
ScrollPaneConstants.HORIZONTAL_SCROLLBAR_ALWAYS);
errorScroll.setVerticalScrollBarPolicy(javax.swing.
ScrollPaneConstants.VERTICAL_SCROLLBAR_ALWAYS);
errorScroll.setOpaque(true);
getContentPane().add(errorScroll);
errorScroll.setBounds(12,120,384,156);
errors.setEditable(false);
errorScroll.getViewport().add(errors);
errors.setBounds(0,0,366,138);
current.setText("Currently Processing: ");
getContentPane().add(current);
current.setBounds(12,72,384,12);
goodLinksLabel.setText("Good Links: 0");
getContentPane().add(goodLinksLabel);
goodLinksLabel.setBounds(12,96,192,12);
badLinksLabel.setText("Bad Links: 0");
getContentPane().add(badLinksLabel);
badLinksLabel.setBounds(216,96,96,12);
//}}
//{{INIT_MENUS
//}}
//{{REGISTER_LISTENERS
SymAction lSymAction = new SymAction();
begin.addActionListener(lSymAction);
//}}
}
/**
* Main method for the application
*
* @param args Not used
*/
static public void main(String args[])
{
(new CheckLinks()).setVisible(true);
}
/**
* Add notifications.
*/
public void addNotify()
{
// Record the size of the window prior to calling parent's
// addNotify.
Dimension size = getSize();
super.addNotify();
if ( frameSizeAdjusted )
return;
frameSizeAdjusted = true;
// Adjust size of frame according to the insets and menu bar
Insets insets = getInsets();
javax.swing.JMenuBar menuBar = getRootPane().getJMenuBar();
int menuBarHeight = 0;
if ( menuBar != null )
menuBarHeight = menuBar.getPreferredSize().height;
setSize(insets.left + insets.right + size.width, insets.top +
insets.bottom + size.height +
menuBarHeight);
}
// Used by addNotify
boolean frameSizeAdjusted = false;
//{{DECLARE_CONTROLS
javax.swing.JLabel label1 = new javax.swing.JLabel();
/**
* The begin or cancel button
*/
javax.swing.JButton begin = new javax.swing.JButton();
/**
* The URL being processed
*/
javax.swing.JTextField url = new javax.swing.JTextField();
/**
* Scroll the errors.
*/
javax.swing.JScrollPane errorScroll =
new javax.swing.JScrollPane();
/**
* A place to store the errors created
*/
javax.swing.JTextArea errors = new javax.swing.JTextArea();
javax.swing.JLabel current = new javax.swing.JLabel();
javax.swing.JLabel goodLinksLabel = new javax.swing.JLabel();
javax.swing.JLabel badLinksLabel = new javax.swing.JLabel();
//}}
//{{DECLARE_MENUS
//}}
/**
* The background spider thread
*/
protected Thread backgroundThread;
/**
* The spider object being used
*/
protected Spider spider;
/**
* The URL that the spider began with
*/
protected URL base;
/**
* How many bad links have been found
*/
protected int badLinksCount = 0;
/**
* How many good links have been found
*/
protected int goodLinksCount = 0;
/**
* Internal class used to dispatch events
*
* @author Jeff Heaton
* @version 1.0
*/
class SymAction implements java.awt.event.ActionListener {
public void actionPerformed(java.awt.event.ActionEvent event)
{
Object object = event.getSource();
if ( object == begin )
begin_actionPerformed(event);
}
}
/**
* Called when the begin or cancel buttons are clicked
*
* @param event The event associated with the button.
*/
void begin_actionPerformed(java.awt.event.ActionEvent event)
{
if ( backgroundThread==null ) {
begin.setText("Cancel");
backgroundThread = new Thread(this);
backgroundThread.start();
goodLinksCount=0;
badLinksCount=0;
} else {
spider.cancel();
}
}
/**
* Perform the background thread operation. This method
* runs on the background thread.
*/
public void run()
{
try {
errors.setText("");
spider = new Spider(this);
spider.clear();
base = new URL(url.getText());
spider.addURL(base);
spider.begin();
Runnable doLater = new Runnable()
{
public void run()
{
begin.setText("Begin");
}
};
SwingUtilities.invokeLater(doLater);
backgroundThread=null;
} catch ( MalformedURLException e ) {
UpdateErrors err = new UpdateErrors();
err.msg = "Bad address.";
SwingUtilities.invokeLater(err);
}
}
/**
* Called by the spider when a URL is found. It is here
* that links are validated.
*
* @param base The page that the link was found on.
* @param url The actual link address.
*/
public boolean spiderFoundURL(URL base,URL url)
{
UpdateCurrentStats cs = new UpdateCurrentStats();
cs.msg = url.toString();
SwingUtilities.invokeLater(cs);
if ( !checkLink(url) ) {
UpdateErrors err = new UpdateErrors();
err.msg = url + " (on page " + base + ")\n";
SwingUtilities.invokeLater(err);
badLinksCount++;
return false;
}
goodLinksCount++;
if ( !url.getHost().equalsIgnoreCase(base.getHost()) )
return false;
else
return true;
}
/**
* Called when a URL error is found
*
* @param url The URL that resulted in an error.
*/
public void spiderURLError(URL url)
{
}
/**
* Called internally to check whether a link is good
*
* @param url The link that is being checked.
* @return True if the link was good, false otherwise.
*/
protected boolean checkLink(URL url)
{
try {
URLConnection connection = url.openConnection();
connection.connect();
return true;
} catch ( IOException e ) {
return false;
}
}
/**
* Called when the spider finds an e-mail address
*
* @param email The email address the spider found.
*/
public void spiderFoundEMail(String email)
{
}
/**
* Internal class used to update the error information
* in a Thread-Safe way
*
* @author Jeff Heaton
* @version 1.0
*/
class UpdateErrors implements Runnable {
public String msg;
public void run()
{
errors.append(msg);
}
}
/**
* Used to update the current status information
* in a "Thread-Safe" way
*
* @author Jeff Heaton
* @version 1.0
*/
class UpdateCurrentStats implements Runnable {
public String msg;
public void run()
{
current.setText("Currently Processing: " + msg );
goodLinksLabel.setText("Good Links: " + goodLinksCount);
badLinksLabel.setText("Bad Links: " + badLinksCount);
}
}
}
Listing 2: Reporting spider events (ISpiderReportable.java)
import java.net.*;
interface ISpiderReportable {
public boolean spiderFoundURL(URL base,URL url);
public void spiderURLError(URL url);
public void spiderFoundEMail(String email);
}
Listing 3: A reusable spider (Spider.java)
import java.util.*;
import java.net.*;
import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
/**
* This class implements a reusable spider
*
* @author Jeff Heaton(http://www.jeffheaton.com)
* @version 1.0
*/
public class Spider {
/**
* A collection of URLs that resulted in an error
*/
protected Collection workloadError = new ArrayList(3);
/**
* A collection of URLs that are waiting to be processed
*/
protected Collection workloadWaiting = new ArrayList(3);
/**
* A collection of URLs that were processed
*/
protected Collection workloadProcessed = new ArrayList(3);
/**
* The class that the spider should report its URLs to
*/
protected ISpiderReportable report;
/**
* A flag that indicates whether this process
* should be canceled
*/
protected boolean cancel = false;
/**
* The constructor
*
* @param report A class that implements the ISpiderReportable
* interface, that will receive information that the
* spider finds.
*/
public Spider(ISpiderReportable report)
{
this.report = report;
}
/**
* Get the URLs that resulted in an error.
*
* @return A collection of URLs.
*/
public Collection getWorkloadError()
{
return workloadError;
}
/**
* Get the URLs that were waiting to be processed.
* You should add one URL to this collection to
* begin the spider.
*
* @return A collection of URLs.
*/
public Collection getWorkloadWaiting()
{
return workloadWaiting;
}
/**
* Get the URLs that were processed by this spider.
*
* @return A collection of URLs.
*/
public Collection getWorkloadProcessed()
{
return workloadProcessed;
}
/**
* Clear all of the workloads.
*/
public void clear()
{
getWorkloadError().clear();
getWorkloadWaiting().clear();
getWorkloadProcessed().clear();
}
/**
* Set a flag that will cause the begin
* method to return before it is done.
*/
public void cancel()
{
cancel = true;
}
/**
* Add a URL for processing.
*
* @param url The URL to be added to the workload.
*/
public void addURL(URL url)
{
if ( getWorkloadWaiting().contains(url) )
return;
if ( getWorkloadError().contains(url) )
return;
if ( getWorkloadProcessed().contains(url) )
return;
log("Adding to workload: " + url );
getWorkloadWaiting().add(url);
}
/**
* Called internally to process a URL
*
* @param url The URL to be processed.
*/
public void processURL(URL url)
{
try {
log("Processing: " + url );
// get the URL's contents
URLConnection connection = url.openConnection();
if ( (connection.getContentType()!=null) &&
!connection.getContentType().toLowerCase()
.startsWith("text/") ) {
getWorkloadWaiting().remove(url);
getWorkloadProcessed().add(url);
log("Not processing because content type is: " +
connection.getContentType() );
return;
}
// read the URL
InputStream is = connection.getInputStream();
Reader r = new InputStreamReader(is);
// parse the URL
HTMLEditorKit.Parser parse = new HTMLParse().getParser();
parse.parse(r,new Parser(url),true);
} catch ( IOException e ) {
getWorkloadWaiting().remove(url);
getWorkloadError().add(url);
log("Error: " + url );
report.spiderURLError(url);
return;
}
// mark URL as complete
getWorkloadWaiting().remove(url);
getWorkloadProcessed().add(url);
log("Complete: " + url );
}
/**
* Called to start the spider
*/
public void begin()
{
cancel = false;
while ( !getWorkloadWaiting().isEmpty() && !cancel ) {
Object list[] = getWorkloadWaiting().toArray();
for ( int i=0;(i<list.length)&&!cancel;i++ )
processURL((URL)list[i]);
}
}
/**
* An HTML parser callback used by this class to detect links
*
* @author Jeff Heaton
* @version 1.0
*/
protected class Parser
extends HTMLEditorKit.ParserCallback {
protected URL base;
public Parser(URL base)
{
this.base = base;
}
public void handleSimpleTag(HTML.Tag t,
MutableAttributeSet a,int pos)
{
String href = (String)a.getAttribute(HTML.Attribute.HREF);
if( (href==null) && (t==HTML.Tag.FRAME) )
href = (String)a.getAttribute(HTML.Attribute.SRC);
if ( href==null )
return;
int i = href.indexOf('#');
if ( i!=-1 )
href = href.substring(0,i);
if ( href.toLowerCase().startsWith("mailto:") ) {
report.spiderFoundEMail(href);
return;
}
handleLink(base,href);
}
public void handleStartTag(HTML.Tag t,
MutableAttributeSet a,int pos)
{
handleSimpleTag(t,a,pos); // handle the same way
}
protected void handleLink(URL base,String str)
{
try {
URL url = new URL(base,str);
if ( report.spiderFoundURL(base,url) )
addURL(url);
} catch ( MalformedURLException e ) {
log("Found malformed URL: " + str );
}
}
}
/**
* Called internally to log information
* This basic method just writes the log
* out to the stdout.
*
* @param entry The information to be written to the log.
*/
public void log(String entry)
{
System.out.println( (new Date()) + ":" + entry );
}
}
Listing 4: Parsing HTML (HTMLParse.java)
import javax.swing.text.html.*;
public class HTMLParse extends HTMLEditorKit {
public HTMLEditorKit.Parser getParser()
{
return super.getParser();
}
}
Author Bio: Jeff is the author of JSTL: JSP Standard Tag Library (Sams, 2002) and Programming Spiders, Bots, and Aggregators (Sybex, 2002). Jeff is a member of IEEE and a graduate student at Washington University in St. Louis. Jeff can be contacted through his Web site at http://www.jeffheaton.com.