THE UNIVERSITY OF THE GAMBIA
SENIOR PROJECT
WEB CRAWLER
DOCUMENTATION
Written by:
Seedy Ahmed Jallow 2121210
Salieu Sallah 2112465
Landing Jatta 2121750
Table of Contents
INTRODUCTION
DESCRIPTION
THEORETICAL BACKGROUND
DOM PARSER
Using A DOM Parser
SOFTWARE ANALYSIS
Problem Definition
Functional Requirement
Non-Functional Requirements
Target User
Requirement Specification
Acceptance Criteria
System Assumption
Relationship Description
Structure of the website
SOFTWARE DESIGN
System Development Environment
System Development Languages
Classes
Main Class
Web Crawler Class
SOFTWARE TESTING
BIBLIOGRAPHY AND REFERENCES
INTRODUCTION
This is an implementation of a web crawler written in the Java programming language. The project is implemented fully from scratch, using a DOM parser to parse our XML files. The project is about taking a fully built XML website and recursively visiting all the pages that are present in the website, searching for links, saving them in a hash table and later printing them. In other words, the web crawler fetches data from the already built XML site. Starting with an initial URL, which is not limited to the index page of the website, it crawls through all the pages of the website recursively. The DOM approach also provides a powerful technique for traversing the document hierarchy and generating DOM events instead of outputting an XML document directly, so different content handlers can be plugged in to do different things or generate different versions of the XML.
The Internet has become a basic necessity, and without it life would be very difficult. With the help of the Internet, a person can get a huge amount of information related to any topic. A person uses a search engine to get information about a topic of interest: the user enters a keyword, or sometimes a longer query string, in the text field of a search engine to get the related information. The links for different web pages appear in the form of a ranked list, generated by the necessary processing inside the system. This ranking is essentially due to the indexing done inside the system in order to show the relevant results containing the exact information the user is looking for. The user clicks on a relevant link in the ranked list of web pages and navigates through the respective pages. Similarly, there is sometimes a need to get the text of a web page using a parser, and for this purpose many HTML parsers are available that return the data in the form of text. Once the tags are removed from a web page, some processing is needed on the remaining text in order to index the words and learn about the words and the data present in that page.
DESCRIPTION
A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in a software context) a Web scutter.
Web search engines and some other sites use Web crawling or spidering software to update their own web content or their indexes of other sites' web content. Web crawlers can copy all the pages they visit for later processing by a search engine, which indexes the downloaded pages so that users can search much more efficiently.
Crawlers can validate hyperlinks and HTML/XML code. They can also be used for web scraping.
Web crawlers are a key component of web search engines, where they are used to collect
the pages that are to be indexed. Crawlers have many applications beyond general
search, for example in web data mining (e.g. Attributor, a service that mines the web for
copyright violations, or ShopWiki, a price comparison service).
THEORETICAL BACKGROUND
Web crawlers are almost as old as the web itself. In the spring of 1993, just months after
the release of NCSA Mosaic, Matthew Gray wrote the first web crawler,
the World Wide Web Wanderer, which was used from 1993 to 1996 to compile statistics
about the growth of the web. A year later, David Eichmann wrote the
first research paper containing a short description of a web crawler, the RBSE spider.
Burner provided the first detailed description of the architecture of a web crawler, namely the original Internet Archive crawler. Brin and Page's seminal paper on the (early) architecture of the Google search engine contained a brief description of the Google crawler, which used a central database for coordinating the crawling.
Conceptually, the algorithm executed by a web crawler is extremely simple: select a
URL from a set of candidates, download the associated web pages, extract the URLs
(hyperlinks) contained therein, and add those URLs that have not been encountered
before to the candidate set. Indeed, it is quite possible to implement a simple functioning
web crawler in a few lines of a high-level scripting language such as Perl. However,
building a web-scale web crawler imposes major engineering challenges, all of which
are ultimately related to scale. In order to maintain a search engine corpus of say, ten
billion web pages, in a reasonable state of freshness, say with pages being refreshed
every 4 weeks on average, the crawler must download over 4,000 pages/second. In order
to achieve this, the crawler must be distributed over multiple computers, and each
crawling machine must pursue multiple downloads in parallel. But if a distributed and
highly parallel web crawler were to issue many concurrent requests to a single web
server, it would in all likelihood overload and crash that web server. Therefore, web
crawlers need to implement politeness policies that rate-limit the amount of traffic
directed to any particular web server (possibly informed by that server’s observed
responsiveness). There are many possible politeness policies; one that is particularly
easy to implement is to disallow concurrent requests to the same web server; a slightly
more sophisticated policy would be to wait for time proportional to the last download
time before contacting a given web server again. In some web crawler designs (e.g. the original Google crawler and PolyBot) the page downloading processes are distributed,
while the major data structures – the set of discovered URLs and the set of URLs that
have to be downloaded – are maintained by a single machine. This design is
conceptually simple, but it does not scale indefinitely; eventually the central data
structures become a bottleneck. The alternative is to partition the major data structures
over the crawling machines.
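Conceptually, the loop described above can be sketched in a few lines of Java. The sketch below is illustrative only: extractLinks() is a hypothetical helper standing in for the download-and-parse step, and the class and variable names are not taken from the project code.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimpleCrawler {

    // Hypothetical helper: download the page at the given URL and return the
    // hyperlinks found in it.
    static List<String> extractLinks(String url) {
        throw new UnsupportedOperationException("download and parse the page here");
    }

    public static void crawl(String seedUrl) {
        Deque<String> candidates = new ArrayDeque<String>(); // URLs still to be visited
        Set<String> seen = new HashSet<String>();            // URLs already encountered
        candidates.add(seedUrl);
        seen.add(seedUrl);

        while (!candidates.isEmpty()) {
            String url = candidates.remove();            // select a URL from the candidate set
            for (String link : extractLinks(url)) {      // download the page and extract its URLs
                if (seen.add(link)) {                    // keep only URLs not encountered before
                    candidates.add(link);
                }
            }
        }
    }
}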
This program starts by creating hash tables of Strings to store the attributes and the hyperlinks:
static Hashtable<String, String> openList = new Hashtable<String, String>();
static Hashtable<String, String> extList = new Hashtable<String, String>();
static Hashtable<String, String> closeList = new Hashtable<String, String>();
A hash table is a data structure used to implement an associative array, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found. In the context of this web crawler, it is used to map our key (a) to our value (href).
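As a small illustration of how the crawler uses such a table (the class name and the href value below are hypothetical), a link is stored only if it has not been seen before, which is what later prevents circular references:

import java.util.Hashtable;

public class HashtableDemo {
    public static void main(String[] args) {
        Hashtable<String, String> closeList = new Hashtable<String, String>();

        String href = "file:///site/feathers.xml";   // a link found on a page (illustrative)
        if (!closeList.containsKey(href)) {          // has this link already been stored?
            closeList.put(href, href);               // remember it so it is not crawled twice
        }
        System.out.println(closeList.containsKey(href)); // prints true
    }
}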
After importing all the necessary classes, we then parse the XML files into the DOM. The
Document Object Model (DOM) is a programming interface for HTML, XML and SVG
documents. It provides a structured representation of the document (a tree) and it defines
a way that the structure can be accessed from programs so that they can change the
document structure, style and content. The DOM provides a representation of the
document as a structured group of nodes and objects that have properties and methods.
Nodes can also have event handlers attached to them, and once that event is triggered the
event handlers get executed. Essentially, it connects web pages to scripts or
programming languages.
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Hashtable;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
public static void parsePage(URL url) {
    // Convert the URL of the page into a file path that the DOM parser can read.
    String xmlPath = url.getFile();
    File xmlFile = new File(xmlPath);
    String page = null;
    // ... the rest of the method parses the file and extracts its links, as described later.
The Document Object Model (DOM) is a set of language-independent interfaces for programmatic access to the logical XML document. We use the Java DOM interfaces, which correspond to the language-independent DOM Level 1 interfaces specified by the W3C; IBM's XML4J parser tracks the latest version of these interfaces as soon as it becomes available.
As we have learned, the structure of a well-formed XML document can be expressed logically as a tree. The single interface that encapsulates the structural connections between the XML constructs is called Node. Node contains the member functions that express those structural connections, such as Node#getChildNodes(), Node#getNextSibling() and Node#getParentNode().
The DOM interfaces also contain separate interfaces for XML's high-level constructs, such as Element. Each of these interfaces extends Node. For example, there are interfaces for Element, Attribute, Comment, Text, and so on, and each has getter and setter functions for its own specific data. For example, the Attribute interface has Attribute#getName() and other Attribute member functions, while the Element interface has the means to get and set attributes via functions like Element#getAttributeNode(java.lang.String) and Element#setAttributeNode(Attribute).
Always remember that the various high-level interfaces such as Element, Attribute, Text, Comment, and so on, all extend Node. This means that the structural member functions of Node (such as getNodeName()) are available to Element, Attribute and all of these types. A further illustration of this is that any node, such as an Element or a Text node, knows what it is by re-implementing getNodeType(). This allows the programmer to query the type using Node#getNodeType() instead of Java's more expensive run-time instanceof check.
So, in Java you can write a simple recursive function to traverse a DOM tree:
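The function itself is not reproduced in this document; a minimal sketch of such a recursive traversal, using only the standard org.w3c.dom interfaces, might look like this:

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DomTraversal {
    // Recursively visit every node in the DOM tree, printing its name and type.
    public static void traverse(Node node) {
        System.out.println(node.getNodeName() + " (type " + node.getNodeType() + ")");
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            traverse(children.item(i));    // recurse into each child node
        }
    }
}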
The root of the DOM tree is the Document interface. We have waited until now to introduce it because it serves multiple purposes. First, it represents the whole document and contains the methods by which you can get to the global document information and the root Element.
Second, it serves as a general constructor or factory for all XML types, providing methods to create the various constructs of an XML document. If an XML parser gives you a DOM Document reference, you may still invoke the create methods to build more DOM nodes, and use appendChild() and other functions to add them to the document node or to other nodes. If the client programmer changes, adds or removes nodes in the DOM tree, there is no DOM requirement to check validity; this burden is left to the programmer (with possible help from the specific DOM or parser implementation).
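For example, a Document obtained from a JAXP builder (the same javax.xml.parsers classes imported earlier) can be used as a factory in roughly the following way; the element and attribute names here are purely illustrative:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BuildDocument {
    public static void main(String[] args) throws Exception {
        // Obtain an empty Document from a document builder.
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder()
                                             .newDocument();
        Element root = doc.createElement("site");   // factory method on Document
        Element page = doc.createElement("page");
        page.setAttribute("href", "index.xml");
        root.appendChild(page);                     // build up the tree with appendChild()
        doc.appendChild(root);
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "site"
    }
}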
Obviously the final step, doing something with the parsed document, is the complicated one. Once you know the contents of the XML document, you might want to, for example, generate a Web page, create a purchase order, or build a pie chart. Considering the infinite range of data that could be contained in an XML document, the task of writing an application that correctly processes any potential input is intimidating. Fortunately, the common XML parsing tools discussed here can make the task much, much simpler.
DOM PARSER
The XML Parser for Java provides a way for your applications to work
with XML data on the Web. The XML Parser provides classes for
parsing, generating, manipulating, and validating XML documents. You
can include the XML Parser in Business-to-Business (B2B) and other
applications that manage XML documents, work with metacontent,
interface with databases, and exchange messages and data. The XML
Parser is written entirely in Java, and conforms to the XML 1.0
Recommendation and associated standards, such as Document Object
Model (DOM) 1.0, Simple API for XML (SAX) 1.0, and the XML
Namespaces Recommendation.
DOM implementations
The Document Object Model is an application programmer’s interface
to XML data. XML parsers produce a DOM representation of the parsed
XML. Your application uses the methods defined by the DOM to access
and manipulate the parsed XML. The IBM XML Parser provides two
DOM implementations:
– Standard DOM: provides the standard DOM Level 1 API, and is highly
tuned for performance
– TX Compatibility DOM: provides a large number of features not
provided by the standard DOM API, and is not tuned for performance.
You choose the DOM implementation you need for your application
when you write your code. You cannot, however, use both DOM’s in
the XML Parser at the same time. In the XML Parser, the DOM API is
implemented using the SAX API.
Modular design
The XML Parser has a modular architecture. This means that you can customize the XML Parser in a variety of different ways, including the following:
– Construct different types of parsers using the classes provided, including:
– Validating and non-validating SAX parsers
– Validating and non-validating DOM parsers
– Validating and non-validating TXDOM parsers
To see all the classes for the XML Parser, look in the XML Parser for Java project and the com.ibm.xml.parsers package.
– Specify two catalog file formats: the SGML Open catalog and the XCatalog format.
– Replace the DTD-based validator with a validator based on some other method, such as the Document Content Description (DCD), Schema for Object-Oriented XML (SOX), or Document Definition Markup Language (DDML) proposals under consideration by the World Wide Web Consortium (W3C).
Constructing a parser with only the features your application needs reduces the number of class files or the size of the JAR file you need. For more information about constructing the XML Parser, refer to the parser's documentation.
Constructing a parser
You construct a parser by instantiating one of the classes in the
com.ibm.xml.parsers package. You can instantiate the classes in one
of the following ways:
– Using a parser factory
– Explicitly instantiating a parser class
– Extending a parser class
For more information about constructing a parser, refer to the parser's documentation.
Samples
We provide the following sample programs in the IBM XML Parser for
Java Examples project. The sample programs demonstrate the
features of the XML Parser using the SAX and DOM APIs:
– SAXWriter and DOMWriter: parse a file, and print out the file in XML
format.
– SAXCount and DOMCount: parse your input file, and output the total parse time along with counts of elements, attributes, text characters, and ignorable white space characters. SAXCount and DOMCount also display any errors or warnings that occurred during the parse.
– DOMFilter: searches for specific elements in your XML document.
– TreeViewer: displays the input XML file in a graphical tree-style
interface. It also
highlights lines that have validation errors or are not well-formed.
Creating a DOM parser
You can construct a parser in your application in one of the following ways:
– Using a parser factory
– Explicitly instantiating a parser class
– Extending a parser class
To create a DOM parser, use one of the methods listed above, and specify
com.ibm.xml.parsers.DOMParser to get a validating parser, or
com.ibm.xml.parsers.NonValidatingDOMParser to get a non-validating parser. To access
the DOM tree, your application can call the getDocument() method on the parser.
For more information about constructing a parser, refer to the parser's documentation.
Using A DOM Parser
import com.ibm.xml.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

// Constructing a parser by instantiating a parser object,
// in this case from DOMParser.
public class example2 {
    static public void main(String[] argv) {
        String xmlFile = "file:///xml_document_to_parse";
        DOMParser parser = new DOMParser();
        try {
            parser.parse(xmlFile);
        } catch (SAXException se) {
            se.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
        // The next lines are only for DOM parsers.
        Document doc = parser.getDocument();
        if (doc != null) {
            try {
                // Use the print method from the dom.DOMWriter sample to output the document.
                (new dom.DOMWriter(false)).print(doc);
            } catch (UnsupportedEncodingException ex) {
                ex.printStackTrace();
            }
        }
    }
}
SOFTWARE ANALYSIS
Problem Definition
For our senior project, we were asked to write a search engine program that lists all the pages that are present in a particular off-line website, as well as all the external links that are reachable from any of the internal pages.
Search engines consist of many features such as web crawling, word extraction, indexing, ranking, searching, query handling, etc. In this project I am concentrating only on crawling through the website, indexing the pages and outputting them, as well as the external links that are reachable through one of the internal pages.
Functional Requirement
Functional requirements describe the functional modules that are going to be produced by the proposed system. The only functional module for this system is web crawling. The crawler takes the index page of the website as input. It then scans through all the elements on the page, extracting the hyperlink references to other pages and storing them in a list to be scanned through later. The crawler scans through the pages recursively, storing all scanned pages in a hash table to make sure that circular references are handled.
Non-Functional Requirements
This program is not meant to be an end-user program, so very little emphasis is placed on the user interface. As a result, no user interface was developed; input and output are through the terminal. It is also worth noting that this is not a professional program, so issues such as product security and the like are not considered.
Target User
The target users of this program, aside from the project instructor and supervisor, are the general programming community who want to see a very basic implementation of a search engine. They are allowed to use, reuse and share my code as long as I am credited for it.
Requirement Specification
The logical model is a data flow diagram giving an overview of the processes required for the proposed system. Details of the processes are explained in the process description below.
Process Description
Input: The index page of the website that is to be crawled is inputted by the user.
Create URL: Creates a URL from the path of a file.
Parse Page: Creates document builders to break down the structure of the page into a tree of nodes, and traverses the nodes to collect hyperlink references to other pages.
Save Links: Stores the hyperlink references in a list and provides links to the crawler.
Internal Links: Gets all the URLs whose references are internal pages of the website, or in other words have the "file" protocol.
External Links: Gets all the URLs that reference pages external to the website, or in other words have the "http" protocol.
Save in table: Stores all the links in their respective hash tables.
Html Page: Checks whether the URL references a valid HTML page and not an image, a port, etc.
Print: Outputs the URLs.
Acceptance Criteria
On the day of completion of the project, all the features explained above will be
provided, mainly web crawling.
System Assumption
It is a URL search only: all processes of the web crawler are made to process URL information only. It does not handle other kinds of searches, such as image searching, and the results for other kinds of input are unexpected, though the use of anything other than URLs will not lead to system errors. It also assumes that the user is well versed with command-line input, output and the other command-line operations that are necessary to run the program.
Relationship Description
Each page has many links, so the relationship between pages and links is one to many.
Structure of the website
Federn is the name of the website to be crawled. The website contains information on feathers. There are hundreds of feathers whose descriptions and identification are given in this website. The website is available in three languages: German, English and French.
Each page of the website has a link tab at the top of the page. That tab contains links to the home page, feathers, identification, news, bibliography, services and help.
The home page of Federn contains the description of the idea behind the website, the
authors, the concept behind this project and the acknowledgement of the contribution of
others in the development of the website. As you can see it contains a lot of links
referring to other pages. All the links though are internal links.
The feathers page contains all the feathers that were identified and described in this
website. The list of feathers is arranged in two formats. Firstly, they are arranged
according to their Genus and Family names on one side of the page, and arranged
according to alphabetical order on the other side. Each feather name is a link to the page
containing the description of the feathers and scanned images.
The identification page contains an image of the feather with a picture of a bird which
had that type of feather. It also contains detailed descriptions of the different types,
colors and shapes of that feather and the main function of the feather in flight and
temperature regulation.
There is a news tab that contains any new information found on the feathers or any discoveries made about feathers.
The bibliography contains links to the resources from which the information on this website was gathered. The website also contains service and help pages.
As you can see, this website is a huge one. Each page, aside from the main index page, has three copies of itself in three different languages.
SOFTWARE DESIGN
System Development Environment
System Development Languages
The only language used in the development of this program is Java. Java is a highly dynamic language, and it contains most of the functionality that was needed in the development of the program.
The Java IO package allowed me to utilize the File class, which I used to create file objects that are fed to the parser to parse the pages of the website. I created the files by using the absolute file paths that were extracted from the URLs.
The java.net package contains classes that were used to create URLs from file paths. A URL can be created, as in my case, by passing the absolute file path of the parent page and the page name of the URL being processed. The use of URLs in my program is crucial, bearing in mind that I needed to check the referenced locations of the URLs being processed to make sure whether they refer to pages that are local to the website being crawled, or in other words are stored in the file system. You can check the protocol of a URL by using the getProtocol() method: if it returns "file", the page being referenced by the URL is local to the file system; if it returns "http", the URL is referring to a page outside the website being crawled.
Getting the baseURI:
Creating a URL:
Checking the protocol of a url
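The three operations above are shown as screenshots in the original report; a minimal sketch of what they amount to is given below. The method name and the assumption that the link comes from a parsed a-element are illustrative.

import java.net.MalformedURLException;
import java.net.URL;
import org.w3c.dom.Element;

public class LinkResolver {
    // Resolve the href attribute of an a-element against the base URI of the
    // file that contains it, and classify the result by protocol.
    public static void handleLink(Element anchor) throws MalformedURLException {
        String base = anchor.getBaseURI();         // absolute URI of the file being crawled
        String href = anchor.getAttribute("href"); // relative or absolute reference

        URL link = new URL(new URL(base), href);   // create a URL from the base and the page name

        // getProtocol() returns "file" for pages local to the file system
        // and "http" for pages outside the website being crawled.
        if ("file".equals(link.getProtocol())) {
            System.out.println("internal link: " + link);
        } else if ("http".equals(link.getProtocol())) {
            System.out.println("external link: " + link);
        }
    }
}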
The java.util package enables us to use a structure called a hash table. Hash tables are used to store data objects, in this case URLs. I created two instances of the Hashtable class: one to store URLs found on the website that refer to pages internal to the website, and one to store URLs that refer to pages outside it. Though there are other storage options that could be used, such as MySQL, array lists and arrays, the hash table was chosen because, unlike MySQL, it is very simple to implement and use, and unlike array lists, it is fast at storing, searching and retrieving data, which is very important considering that thousands of URLs can be stored and searched through over and over again.
Creating hash tables to store internal and external links: these are the Hashtable declarations shown earlier in the Theoretical Background section.
The Java library also has a very important package, which is by far the most important tool used in my program: the XML parsers package. This package contains the document builder factory, which is used to create document builders containing the parsers we are going to use to break down the pages. The parser parses the content of the file it is fed as an XML document and returns a new DOM Document object. This package also contains methods which validate the XML documents and verify whether the documents are well formed.
Getting the document builder factory and document builder, and parsing an XML document into a DOM Document object:
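The original report shows this as a screenshot; a minimal equivalent using the javax.xml.parsers classes imported earlier would be (the file path is illustrative):

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ParseSample {
    public static void main(String[] args) throws Exception {
        File xmlFile = new File("/path/to/site/index.xml");        // page to be parsed (illustrative)
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();    // the parser
        Document doc = builder.parse(xmlFile);                     // returns a new DOM Document
        doc.getDocumentElement().normalize();                      // normalize the root element
        System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
    }
}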
It is very important that the parser does not validate the XML pages, because validation would require Internet access and, as you already know, the program is crawling off-line web pages. If the parser does try to validate them, it will lead to system errors.
Turning off the validating features of the parser:
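The exact flags used in the project are shown as a screenshot; one common way to turn validation off with a JAXP document builder factory is shown below. The load-external-dtd feature name assumes a Xerces-based parser (the default in the JDK) and is an assumption, not taken from the project code.

import javax.xml.parsers.DocumentBuilderFactory;

public class NonValidatingFactory {
    public static DocumentBuilderFactory create() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);   // do not validate against a DTD or schema

        // Also stop the parser from fetching external DTDs over the network,
        // which would fail for an off-line website (Xerces-specific feature).
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        return factory;
    }
}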
The external org.w3c.dom package is used to store the parsed page content in a Document object. The document contains elements in a tree-like structure, with each element corresponding to a node of the tree. Traversing the tree with any appropriate traversal method, all the nodes containing a-element tags are collected and stored in a list of nodes. Looping through that list of nodes, one is able to extract all a-element tags containing hyperlink references.
Classes
In my implementation of the program, I used only two classes. The first class contains the main method, while the second class contains the main implementation of the web crawler.
Main Class
The main class contains the main method. The main method prompts the user to enter the absolute file path of the index page of the website to be crawled. When the user complies, the path is converted to a URL object and stored in the hash table containing internal links. The main method also contains the first call of the recursive method processPage(URL url).
Structure of the main method:
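In the original report this is a screenshot; a sketch that follows the description above is given below. The class name WebCrawler, the choice of openList as the table of internal links, and the use of Scanner for input are assumptions.

import java.io.File;
import java.net.URL;
import java.util.Scanner;

public class Main {
    public static void main(String[] args) throws Exception {
        // Prompt for the absolute file path of the index page of the website to crawl.
        Scanner in = new Scanner(System.in);
        System.out.print("Enter the absolute path of the index page: ");
        String path = in.nextLine();

        // Convert the path to a URL, record it as an internal link, and start crawling.
        URL index = new File(path).toURI().toURL();
        WebCrawler.openList.put(index.toString(), index.toString());
        WebCrawler.processPage(index);
    }
}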
Web Crawler Class
This class contains 80% of the implementation. It has only one method definition, that of the recursive method processPage(). At the beginning of the class, the hash tables are declared, followed by the definition of the processPage() method.
The processPage() method takes only one parameter, the URL object that is passed in. Inside the method, the absolute path of the URL is extracted and a file object is created from it. The document builder object is created from the declaration and initialization of the document builder factory and document builder in the preceding lines of code. The method also contains the code snippet making sure that the parser does not validate the XML pages. The parser is then called to parse the XML document, and the resulting DOM document is normalized. Thereafter, the root element of the document is extracted and the traversal of the nodes of the document begins. All a-element tags are selected and stored in a list of nodes. The nodes are then looped through; for every a-element tag containing the "href" attribute, the attribute value is extracted and a URL is created for the page that the href references. As explained before, the URL is created by combining the base URL of the parent file of the page being referenced and the page name of that file.
The protocol of the URL is then checked. If it is "file", the program proceeds to check that the URL refers to an actual page and not an image, a port, etc. It then makes sure that the URL is not already stored in the hash table containing links to internal pages of the website. If it is already stored in the hash table, the link is discarded and the next link on the node list is processed. If it is not stored in the hash table, the URL is stored and that page is processed for more URLs.
If the protocol check returns "http", the program proceeds to check whether the URL is already stored in the hash table containing links to external pages. If it is, the URL is discarded and the next link on the list is processed. If not, the URL is stored in the table and then printed to the screen.
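The full listing of this class is shown as screenshots in the original report. The sketch below reconstructs the logic from the description above; the class and field names, the ".xml" check used to skip non-page resources, and the printed output are assumptions, and only two of the three Hashtables declared earlier are used.

import java.io.File;
import java.net.URL;
import java.util.Hashtable;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class WebCrawler {

    static Hashtable<String, String> openList = new Hashtable<String, String>(); // internal pages
    static Hashtable<String, String> extList  = new Hashtable<String, String>(); // external links

    public static void processPage(URL url) throws Exception {
        // Parse the page referenced by the URL into a DOM document.
        File xmlFile = new File(url.getFile());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);                 // the site is off-line, so do not validate
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(xmlFile);
        doc.getDocumentElement().normalize();

        // Collect all a-element tags and examine their href attributes.
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            String href = a.getAttribute("href");
            if (href.length() == 0) {
                continue;                             // no hyperlink reference on this a-element
            }
            // Resolve the reference against the base URI of the page being crawled.
            URL link = new URL(new URL(a.getBaseURI()), href);

            if ("file".equals(link.getProtocol())) {
                // Internal page: skip non-page resources and circular references.
                if (link.getFile().endsWith(".xml") && !openList.containsKey(link.toString())) {
                    openList.put(link.toString(), link.toString());
                    System.out.println("internal: " + link);
                    processPage(link);                // crawl the newly discovered page
                }
            } else if (!extList.containsKey(link.toString())) {
                // External link: store it and print it, but do not crawl it.
                extList.put(link.toString(), link.toString());
                System.out.println("external: " + link);
            }
        }
    }
}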
SOFTWARE TESTING
During the testing of the program, many problems were encountered. One of the first problems we had during the initial tests was with the validation of the parser: it is standard that all XML documents are checked to see whether they are well formed and valid.
The website we are crawling, as you already know, is off-line, and if the parser tries to validate it, errors will occur because it needs to connect to the Internet to perform the checks.
To solve this, we set all features of the document builder factory that could start the validation of the XML documents to false, as shown earlier.
Another problem we encountered during the implementation of the program was how to turn the relative links found on each crawled page into absolute paths. All that the crawler returned was the names of the files referenced from the page being crawled. What we later did was to get the base URI of the file being crawled, which returns the absolute file path of that file, and append to it the names of the pages that were found in that file. That way we were able to create a URL for every link and process them.
Aside from the problems mentioned above, the program passed the final tests without any major bugs, bringing us successfully to the end of the implementation. Although it was not an easy ride, it was worth every bit of effort we invested in it. Below are the compilation and running of the program and the results, i.e. the links on the website being crawled. No graphical interface was developed, so the terminal is used for all input and output.
Command to compile the Program:
Running the Program:
Prompt and input of Index page:
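These are terminal screenshots in the original report; the equivalent commands would look roughly like the following (the source file names and the path to the index page are assumptions):

javac Main.java WebCrawler.java
java Main
Enter the absolute path of the index page: /path/to/federn/index.xml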
The program ran smoothly and proceeded to print out all the links that were found on the website, labelling them internal or external depending on where they reference and the protocol they contain.
BIBLIOGRAPHY AND REFERENCES
[HREF1] What is a "Web Crawler"? ( http://research.compaq.com/SRC/mercator/faq.html )
[HREF2] Inverted index ( http://burks.brighton.ac.uk/burks/foldoc/86/59.htm )
[MARC] Marckini, Fredrick. Secrets to Making Your Internet Web Pages Achieve Top Rankings (ResponseDirect.com, Inc., c1999)
http://en.wikipedia.org/wiki/Web_crawler
http://research.microsoft.com/pubs/102936/eds-webcrawlerarchitecture.pdf
Mais conteúdo relacionado

Mais procurados

Smart crawlet A two stage crawler for efficiently harvesting deep web interf...
Smart crawlet A two stage crawler  for efficiently harvesting deep web interf...Smart crawlet A two stage crawler  for efficiently harvesting deep web interf...
Smart crawlet A two stage crawler for efficiently harvesting deep web interf...Rana Jayant
 
Research on Key Technology of Web Reptile
Research on Key Technology of Web ReptileResearch on Key Technology of Web Reptile
Research on Key Technology of Web ReptileIRJESJOURNAL
 
Colloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web CrawlerColloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web CrawlerAkshay Pratap Singh
 
Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.
Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.
Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.iosrjce
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawlerishmecse13
 
Working of a Web Crawler
Working of a Web CrawlerWorking of a Web Crawler
Working of a Web CrawlerSanchit Saini
 
IRJET - Review on Search Engine Optimization
IRJET - Review on Search Engine OptimizationIRJET - Review on Search Engine Optimization
IRJET - Review on Search Engine OptimizationIRJET Journal
 
Web crawler with seo analysis
Web crawler with seo analysis Web crawler with seo analysis
Web crawler with seo analysis Vikram Parmar
 
What is a web crawler and how does it work
What is a web crawler and how does it workWhat is a web crawler and how does it work
What is a web crawler and how does it workSwati Sharma
 
IRJET- A Two-Way Smart Web Spider
IRJET- A Two-Way Smart Web SpiderIRJET- A Two-Way Smart Web Spider
IRJET- A Two-Way Smart Web SpiderIRJET Journal
 
The glory of REST in Java: Spring HATEOAS, RAML, Temenos IRIS
The glory of REST in Java: Spring HATEOAS, RAML, Temenos IRISThe glory of REST in Java: Spring HATEOAS, RAML, Temenos IRIS
The glory of REST in Java: Spring HATEOAS, RAML, Temenos IRISGeert Pante
 

Mais procurados (19)

Smart crawlet A two stage crawler for efficiently harvesting deep web interf...
Smart crawlet A two stage crawler  for efficiently harvesting deep web interf...Smart crawlet A two stage crawler  for efficiently harvesting deep web interf...
Smart crawlet A two stage crawler for efficiently harvesting deep web interf...
 
Smart Crawler
Smart CrawlerSmart Crawler
Smart Crawler
 
Research on Key Technology of Web Reptile
Research on Key Technology of Web ReptileResearch on Key Technology of Web Reptile
Research on Key Technology of Web Reptile
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
WebCrawler
WebCrawlerWebCrawler
WebCrawler
 
Colloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web CrawlerColloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web Crawler
 
Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.
Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.
Smart Crawler: A Two Stage Crawler for Concept Based Semantic Search Engine.
 
Web crawling
Web crawlingWeb crawling
Web crawling
 
Seminar on crawler
Seminar on crawlerSeminar on crawler
Seminar on crawler
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
SemaGrow demonstrator: “Web Crawler + AgroTagger”
SemaGrow demonstrator: “Web Crawler + AgroTagger”SemaGrow demonstrator: “Web Crawler + AgroTagger”
SemaGrow demonstrator: “Web Crawler + AgroTagger”
 
Working of a Web Crawler
Working of a Web CrawlerWorking of a Web Crawler
Working of a Web Crawler
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
IRJET - Review on Search Engine Optimization
IRJET - Review on Search Engine OptimizationIRJET - Review on Search Engine Optimization
IRJET - Review on Search Engine Optimization
 
Web crawler with seo analysis
Web crawler with seo analysis Web crawler with seo analysis
Web crawler with seo analysis
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
What is a web crawler and how does it work
What is a web crawler and how does it workWhat is a web crawler and how does it work
What is a web crawler and how does it work
 
IRJET- A Two-Way Smart Web Spider
IRJET- A Two-Way Smart Web SpiderIRJET- A Two-Way Smart Web Spider
IRJET- A Two-Way Smart Web Spider
 
The glory of REST in Java: Spring HATEOAS, RAML, Temenos IRIS
The glory of REST in Java: Spring HATEOAS, RAML, Temenos IRISThe glory of REST in Java: Spring HATEOAS, RAML, Temenos IRIS
The glory of REST in Java: Spring HATEOAS, RAML, Temenos IRIS
 

Semelhante a Senior Project Documentation.

Ruby On Rails Siddhesh
Ruby On Rails SiddheshRuby On Rails Siddhesh
Ruby On Rails SiddheshSiddhesh Bhobe
 
Web Crawler For Mining Web Data
Web Crawler For Mining Web DataWeb Crawler For Mining Web Data
Web Crawler For Mining Web DataIRJET Journal
 
Working Of Search Engine
Working Of Search EngineWorking Of Search Engine
Working Of Search EngineNIKHIL NAIR
 
Inverted textindexing
Inverted textindexingInverted textindexing
Inverted textindexingKhwaja Aamer
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the webVan-Duyet Le
 
Simile Exhibit @ VGSom : A tutorial
Simile Exhibit @ VGSom : A tutorialSimile Exhibit @ VGSom : A tutorial
Simile Exhibit @ VGSom : A tutorialKanishka Chakraborty
 
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...ijmech
 
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...ijmech
 
Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...
Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...
Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...ijmech
 
What are the different types of web scraping approaches
What are the different types of web scraping approachesWhat are the different types of web scraping approaches
What are the different types of web scraping approachesAparna Sharma
 
Web 2.0 Mashups
Web 2.0 MashupsWeb 2.0 Mashups
Web 2.0 Mashupshchen1
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
HIGWGET-A Model for Crawling Secure Hidden WebPages
HIGWGET-A Model for Crawling Secure Hidden WebPagesHIGWGET-A Model for Crawling Secure Hidden WebPages
HIGWGET-A Model for Crawling Secure Hidden WebPagesijdkp
 
Discovering Heterogeneous Resources in the Internet
Discovering Heterogeneous Resources in the InternetDiscovering Heterogeneous Resources in the Internet
Discovering Heterogeneous Resources in the InternetRazzakul Chowdhury
 
Web Crawling Using Location Aware Technique
Web Crawling Using Location Aware TechniqueWeb Crawling Using Location Aware Technique
Web Crawling Using Location Aware Techniqueijsrd.com
 

Semelhante a Senior Project Documentation. (20)

Ruby On Rails Siddhesh
Ruby On Rails SiddheshRuby On Rails Siddhesh
Ruby On Rails Siddhesh
 
Web Crawler For Mining Web Data
Web Crawler For Mining Web DataWeb Crawler For Mining Web Data
Web Crawler For Mining Web Data
 
Working Of Search Engine
Working Of Search EngineWorking Of Search Engine
Working Of Search Engine
 
Web 2 0 Tools
Web 2 0 ToolsWeb 2 0 Tools
Web 2 0 Tools
 
Inverted textindexing
Inverted textindexingInverted textindexing
Inverted textindexing
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web
 
E017624043
E017624043E017624043
E017624043
 
webcrawler.pptx
webcrawler.pptxwebcrawler.pptx
webcrawler.pptx
 
Simile Exhibit @ VGSom : A tutorial
Simile Exhibit @ VGSom : A tutorialSimile Exhibit @ VGSom : A tutorial
Simile Exhibit @ VGSom : A tutorial
 
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
 
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR...
 
Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...
Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...
Design and Implementation of Carpool Data Acquisition Program Based on Web Cr...
 
What are the different types of web scraping approaches
What are the different types of web scraping approachesWhat are the different types of web scraping approaches
What are the different types of web scraping approaches
 
Web 2.0 Mashups
Web 2.0 MashupsWeb 2.0 Mashups
Web 2.0 Mashups
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
HIGWGET-A Model for Crawling Secure Hidden WebPages
HIGWGET-A Model for Crawling Secure Hidden WebPagesHIGWGET-A Model for Crawling Secure Hidden WebPages
HIGWGET-A Model for Crawling Secure Hidden WebPages
 
L017447590
L017447590L017447590
L017447590
 
Discovering Heterogeneous Resources in the Internet
Discovering Heterogeneous Resources in the InternetDiscovering Heterogeneous Resources in the Internet
Discovering Heterogeneous Resources in the Internet
 
Resource Discovery Paper.PDF
Resource Discovery Paper.PDFResource Discovery Paper.PDF
Resource Discovery Paper.PDF
 
Web Crawling Using Location Aware Technique
Web Crawling Using Location Aware TechniqueWeb Crawling Using Location Aware Technique
Web Crawling Using Location Aware Technique
 

Senior Project Documentation.

  • 1. THE UNIVERSITY OF THE GAMBIA SENIOR PROJECT WEB CRAWLER DOCUMENTATION Written by: Seedy Ahmed Jallow 2121210 Salieu Sallah 2112465 Landing Jatta 2121750
  • 2. Table of Contents INTRODUCTION........................................................................................................3 DESCRIPTION...........................................................................................................3 THEORITICAL BACKGROUND....................................................................................4 DOM PARSER............................................................................................................8 Using A DOM Parser...........................................................................................11 SOFTWARE ANALYSIS.............................................................................................12 Problem Definition..............................................................................................12 Functional Requirement.....................................................................................12 Non Functional Requirements............................................................................12 Target User.........................................................................................................13 Requirement Specification.................................................................................13 Acceptance Criteria............................................................................................14 System Assumption............................................................................................15 Relationship Description.....................................................................................15 Structure of the website.....................................................................................15 SOFTWARE DESIGN................................................................................................16 System Development Environment....................................................................16 System Development Languages..................................................................16 Classes...............................................................................................................19 Main Class......................................................................................................19 Web Crawler Class..........................................................................................20 SOFTWARE TESTING...............................................................................................22 BIBLIOGRAPHY AND REFERENCES..........................................................................25
  • 3. INTRODUCTION This is an implementation of a web crawler using the Java programming language. This project is implemented fully from scratch using a DOM parser to parse our XML files. This is project is about taking a fully built XML website and visit recursively all the pages that are present in the website searching for links and saving them in a hash table and later printing the links recursively. In other words the Web crawler fetches data from the already built XML site. Starting with an initial URL, which is not only limited to the index page of the website, it crawls through all the pages of the website recursively. However, the articles show a powerful technique to traverse the hierarchy and generate DOM events, instead of outputting an XML document directly. Now I can plug-in different content handlers that do different things or generate different versions of the XML. Internet has become a basic necessity and without it, life is going to be very difficult. With the help of Internet, a person can get a huge amount of information related to any topic. A person uses a search engine to get information about the topic of interest. The user just enters a keyword and sometimes a string in the text-field of a search engine to get the related information. The links for different web-pages appear in the form of list and this is a ranked list generated by the necessary processing in the system. This is basically due to the indexing done inside the system in order to show the relevant results containing exact information to the user. The user clicks on the relevant link of web page from the ranked list of web-pages and navigates through the respective web pages. Similarly, sometimes there is a need to get the text of a web page using a parser and for this purpose many html parsers are available to get the data in the form of text. When the tags are removed from a web page then in order to do the indexing of words, some processing is needed to be done in the text and get some relevant results to know about the words and the set of data present in that web page respectively. DESCRIPTION
  • 4. A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in software context) a Web strutter. Web search engines and some other sites use Web crawling or spidering softwares to update their web content or indexes of others sites' web content. Web crawlers can copy all the pages they visit for later processing by a search engine which indexes the downloaded pages so the users can search much more efficiently. Crawlers can validate hyper links and HTML /XML code. They can also be used for web scraping. Web crawlers are a key component of web search engines, where they are used to collect the pages that are to be indexed. Crawlers have many applications beyond general search, for example in web data mining (e.g. Attributor, a service that mines the web for copyright violations, or ShopWiki, a price comparison service). THEORITICAL BACKGROUND Web crawlers are almost as old as the web itself. In the spring of 1993, just months after the release of NCSA Mosaic, Matthew Gray wrote the first web crawler, the World Wide Web Wanderer, which was used from 1993 to 1996 to compile statistics about the growth of the web. A year later, David Eichmann wrote the first research paper containing a short description of a web crawler, the RBSE spider. Burner provided the first detailed description of the architecture of a web crawler, namely the original Internet Archive crawler Brin and Page’s seminal paper on the (early) architecture of the Google search engine contained a brief description of the Google crawler, which used a central database for coordinating the crawling. Conceptually, the algorithm executed by a web crawler is extremely simple: select a URL from a set of candidates, download the associated web pages, extract the URLs (hyperlinks) contained therein, and add those URLs that have not been encountered before to the candidate set. Indeed, it is quite possible to implement a simple functioning web crawler in a few lines of a high-level scripting language such as Perl. However,
  • 5. building a web-scale web crawler imposes major engineering challenges, all of which are ultimately related to scale. In order to maintain a search engine corpus of say, ten billion web pages, in a reasonable state of freshness, say with pages being refreshed every 4 weeks on average, the crawler must download over 4,000 pages/second. In order to achieve this, the crawler must be distributed over multiple computers, and each crawling machine must pursue multiple downloads in parallel. But if a distributed and highly parallel web crawler were to issue many concurrent requests to a single web server, it would in all likelihood overload and crash that web server. Therefore,web crawlers need to implement politeness policies that rate-limit the amount of traffic directed to any particular web server (possibly informed by that server’s observed responsiveness). There are many possible politeness policies; one that is particularly easy to implement is to disallow concurrent requests to the same web server; a slightly more sophisticated policy would be to wait for time proportional to the last download time before contacting a given web server again. In some web crawler designs (e.g. the original Google crawler and PolyBot the page downloading processes are distributed, while the major data structures – the set of discovered URLs and the set of URLs that have to be downloaded – are maintained by a single machine. This design is conceptually simple, but it does not scale indefinitely; eventually the central data structures become a bottleneck. The alternative is to partition the major data structures over the crawling machines. This program starts by creating a hash table of Strings to store the attributes and the hyper links.. static Hashtable<String, String> openList = new Hashtable<String, String>(); static Hashtable<String, String> extList = new Hashtable<String, String>(); static Hashtable<String, String> closeList = new Hashtable<String, String>(); A HASHTABLE is a data structure used to implement an associative array, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found. In the context of this web ,crawler it use to map our key (a) and our value (href). After importing all the necessary files we then parse the XML files to the DOM. The Document Object Model (DOM) is a programming interface for HTML, XML and SVG documents. It provides a structured representation of the document (a tree) and it defines a way that the structure can be accessed from programs so that they can change the
  • 6. document structure, style and content. The DOM provides a representation of the document as a structured group of nodes and objects that have properties and methods. Nodes can also have event handlers attached to them, and once that event is triggered the event handlers get executed. Essentially, it connects web pages to scripts or programming languages. import java.io.File; import java.net.MalformedURLException; import java.net.URL; import java.util.Hashtable; import javax.xml.parsers.DocumentBuilderFactory; import javax.xml.parsers.DocumentBuilder; import org.w3c.dom.Document; import org.w3c.dom.NodeList; import org.w3c.dom.Node; import org.w3c.dom.Element; public static void parsePage(URL url) { String xmlPath = url.getFile(); File xmlFile = new File(xmlPath); String page = null; The Document Object Model (DOM) is a set of language-independent interfaces for programmatic access to the logical XML document. We will use the latest Java DOM Interfaces. These correspond to the latest version of the language- DOM Level 1 interface as specified by the W3C, which is always accessible through this link. IBM’s XML4J parser all latest version, pretty much as soon as it is available. . As we have learned, the structure of a well formed XML document can be expressed logically as a tree with a single Interface that encapsulates the structural connections between the XML constructs is called the Node. The Node con that express structural connections such as Node# getChildNodes(), Node# getNextSibling(), Node# getParentNode(),
  • 7. The DOM Interfaces also contain separate interfaces for XML’s high-level constructs such as Element . Each of these interfaces extends Node . For example, there are interfaces for Element, Attribute, Comment, Text, and so on. Each of these specific and setter functions for their own specific data. For example, the Attribute interface has Attribute# getName(), and Attribute member functions. The Element interface has the means to get and set attributes via functions like Element# getAttributeNode(java.lang.String), and Element# setAttributeNode(Attribute). Always remember that various high-level interfaces such as Element, Attribute, Text, Comment, and so on, all extend means the structural member functions of Node (such as getNodeName() ) are available to Element, Attribute , and all these t illustration of this is that any node such as Element or Text knows what it is, by re-implementing getNodeType() . This allow programmer to query the type using Node# getNodeType() instead of Java’s more expensive run-time type instance of f So, in Java you can write a simple recursive function to traverse a DOM tree: The root of the DOM tree is the Document interface. We have waited until now to introduce it because it serves multiple purposes. It represents the whole document and contains the methods by which you can get to the global document information and the root Element . Second, it serves as a general constructor or factory for all XML types, providing methods to create the various cons an XML document. If an XML parser gives you a DOM Document reference, you may still invoke the create methods with to build more DOM nodes and use append Child and other functions to add them to the document node or other nodes if the client programmer changes, adds, or removes nodes from the DOM tree, there is no DOM requirement to check validity. This burden is left to the programmer (with possible help from the specific DOM or parser implementation). Obviously the third step is the complicated one. Once you know the contents of the XML document, you might want to, for example, generate a Web page, create a purchase order, or build a pie chart. Considering the infinite range of data that could be contained in an XML document, the task of writing an application that correctly processes any potential input is intimidating. Fortunately, the common XML parsing tools discussed here can make the task much, much simpler.
• 8. DOM PARSER
The XML Parser for Java provides a way for your applications to work with XML data on the Web. The XML Parser provides classes for parsing, generating, manipulating, and validating XML documents. You can include the XML Parser in Business-to-Business (B2B) and other applications that manage XML documents, work with metacontent, interface with databases, and exchange messages and data. The XML Parser is written entirely in Java, and conforms to the XML 1.0 Recommendation and associated standards, such as Document Object Model (DOM) 1.0, Simple API for XML (SAX) 1.0, and the XML Namespaces Recommendation.

DOM implementations
The Document Object Model is an application programmer's interface to XML data. XML parsers produce a DOM representation of the parsed XML, and your application uses the methods defined by the DOM to access and manipulate it. The IBM XML Parser provides two DOM implementations:
– Standard DOM: provides the standard DOM Level 1 API and is highly tuned for performance.
– TX Compatibility DOM: provides a large number of features not offered by the standard DOM API, and is not tuned for performance.
You choose the DOM implementation you need for your application when you write your code. You cannot, however, use both DOMs in the XML Parser at the same time. In the XML Parser, the DOM API is implemented using the SAX API.

Modular design
The XML Parser has a modular architecture, which means that you can
• 9. customize the XML Parser in a variety of ways, including the following:
– Construct different types of parsers using the classes provided, including: validating and non-validating SAX parsers, validating and non-validating DOM parsers, and validating and non-validating TXDOM parsers. To see all the classes for the XML Parser, look in the IBM XML Parser for Java project, in the com.ibm.xml.parsers package.
– Specify two catalog file formats: the SGML Open catalog and the X-Catalog format.
– Replace the DTD-based validator with a validator based on some other method, such as the Document Content Description (DCD), Schema for Object-Oriented XML (SOX), or Document Definition Markup Language (DDML) proposals under consideration by the World Wide Web Consortium (W3C).
Constructing a parser with only the features your application needs reduces the number of class files, or the size of the JAR file, you need.

Constructing a parser
You construct a parser by instantiating one of the classes in the com.ibm.xml.parsers package. You can instantiate the classes in one of the following ways:
– Using a parser factory
– Explicitly instantiating a parser class
• 10. – Extending a parser class

Samples
The IBM XML Parser for Java Examples project provides the following sample programs, which demonstrate the features of the XML Parser using the SAX and DOM APIs:
– SAXWriter and DOMWriter: parse a file and print it out in XML format.
– SAXCount and DOMCount: parse your input file and output the total parse time along with counts of elements, attributes, text characters, and ignorable white space characters. They also display any errors or warnings that occurred during the parse.
– DOMFilter: searches for specific elements in your XML document.
– TreeViewer: displays the input XML file in a graphical tree-style interface and highlights lines that have validation errors or are not well-formed.

Creating a DOM parser
You can construct a parser in your application in one of the following ways:
– Using a parser factory
– Explicitly instantiating a parser class
– Extending a parser class
To create a DOM parser, use one of the methods listed above and specify com.ibm.xml.parsers.DOMParser to get a validating parser, or com.ibm.xml.parsers.NonValidatingDOMParser to get a non-validating parser. To access the DOM tree, your application can call the getDocument() method on the parser.
• 11. Using A DOM Parser

    import com.ibm.xml.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.xml.sax.SAXException;
    import java.io.IOException;
    import java.io.UnsupportedEncodingException;

    // Constructing a parser by instantiating the parser object,
    // in this case DOMParser.
    public class example2 {
        static public void main(String[] argv) {
            String xmlFile = "file:///xml_document_to_parse";
            DOMParser parser = new DOMParser();
            try {
                parser.parse(xmlFile);
            } catch (SAXException se) {
                se.printStackTrace();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
            // The next lines apply to DOM parsers only.
            Document doc = parser.getDocument();
            if (doc != null) {
                try {
                    // Use the print method from the sample dom.DOMWriter.
                    (new dom.DOMWriter(false)).print(doc);
                } catch (UnsupportedEncodingException ex) {
                    ex.printStackTrace();
                }
            }
        }
    }
• 12. SOFTWARE ANALYSIS

Problem Definition
For our senior project, we were asked to write a search engine program that lists all the pages present in a particular off-line website, as well as all the external links reachable from one of the internal pages. Search engines consist of many features, such as web crawling, word extraction, indexing, ranking, searching, and query handling. In this project we concentrate on crawling through the website, indexing its pages, and outputting them together with the external links that are reachable from one of the internal pages.

Functional Requirement
Functional requirements describe the concrete modules the proposed system must deliver. The only functional module for this system is web crawling. The crawler takes the index page of the website as input. It then scans through all the elements on the page, extracting the hyperlink references to other pages and storing them in a list to be scanned later. The crawler scans through the pages recursively, storing every scanned page in a hash table so that circular references are handled.

Non Functional Requirements
This program is not meant to be an end-user program, so very little emphasis is placed on the user interface; as a result, no graphical user interface was developed, and input and output go through the terminal. It is also worth noting that this is not a professional product, so issues such as product security are not considered.
• 13. Target User
The target users of this program, aside from the project instructor and supervisor (obviously), are the general programming community who want to see a very basic implementation of a search engine. They are free to use, reuse, and share our code as long as we are credited for it.

Requirement Specification
• 14. The logical model above is a data flow diagram giving an overview of the processes required for the proposed system. Details of the processes are explained in the physical design below.
– Input: the index page of the website to be crawled is entered by the user.
– Create URL: creates a URL from the path of a file.
– Parse Page: creates document builders to break the structure of the page down into a tree of nodes, and traverses the nodes to collect hyperlink references to other pages.
– Save Links: stores the hyperlink references in a list and provides links to the crawler.
– Internal Links: gets all the URLs that reference internal pages of the website, in other words those that use the file:// protocol.
– External Links: gets all the URLs that reference pages external to the website, in other words those that use the http:// protocol.
– Save in table: stores all the links in their respective hash tables.
– Html Page: checks whether the URL references a valid HTML page and not an image, a port, etc. (a sketch of this check appears at the end of this section).
– Print: outputs the URLs.

Acceptance Criteria
On the day of completion of the project, all the features explained above, mainly web crawling, will be provided.
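As a sketch of the "Html Page" check described in the list above, a hypothetical helper along the following lines could filter out URLs that point to images or other non-HTML resources; the class name, method name and extension list are our own illustration, not part of the original code:

    import java.net.URL;
    import java.util.Locale;

    public class LinkFilter {

        // Hypothetical helper: accept a URL only if its path looks like an HTML page.
        public static boolean looksLikeHtmlPage(URL url) {
            String path = url.getPath().toLowerCase(Locale.ROOT);
            // Reject obvious non-page resources such as images.
            if (path.endsWith(".png") || path.endsWith(".jpg")
                    || path.endsWith(".jpeg") || path.endsWith(".gif")) {
                return false;
            }
            // Accept .html/.htm pages (and, for this sketch, extension-less paths).
            return path.endsWith(".html") || path.endsWith(".htm") || !path.contains(".");
        }
    }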
• 15. System Assumption
The crawler performs URL search only; all of its processes handle URL information. It does not deal with other kinds of search, such as image search. The results of other kinds of input are unexpected, though using anything other than URLs will not lead to system errors. It also assumes that the user is comfortable with command-line input, output, and the other command-line conventions necessary to run the program.

Relationship Description
Each page has many links, so the relationship between pages and links is one to many.

Structure of the website
Federn is the name of the website to be crawled. The website contains information on feathers: there are hundreds of feathers whose description and identification are given on this website. The website is available in three languages: German, English and French. Each page of the website has a link tab at the top of the page. That tab contains links to the home page, feathers, identification, news, bibliography, services and help. The home page of Federn describes the idea behind the website, the authors, the concept behind the project, and acknowledges the contribution of others to the development of the website. As you can see, it contains a lot of links referring to other pages, though all of them are internal links. The feathers page lists all the feathers that are identified and described on this website. The list of feathers is arranged in two formats: by Genus and Family name on one side of the page, and in alphabetical order on the other. Each feather name is a link to the page containing the description of that feather and scanned images. The identification page contains an image of the feather together with a picture of a bird carrying that type of feather. It also contains detailed descriptions of the different types, colors and shapes of the feather and of its main function in flight and temperature regulation.
• 16. There is a news tab that contains any new information found on the feathers or any discoveries made about them. The bibliography contains links to the resources from which the information on this website was gathered. There are also service and help pages. As you can see, this is a huge website: each page, aside from the main index page, has three copies of itself in three different languages.

SOFTWARE DESIGN

System Development Environment

System Development Languages
The only language used in the development of this program is Java. Java is a highly dynamic language and contains most of the functionality that was needed for the development of the program.
• 17. The Java IO package allowed us to use the File class, which we used to create the file objects that are fed to the parser when parsing the pages of the website. The files are created from the absolute file paths extracted from the URLs. The Java NET package contains the classes used to create URLs from file paths; in our case a URL is created by passing the absolute file path of the parent page together with the name of the page being processed. Using URLs in the program is crucial, because we need to check the locations that the URLs reference to make sure they refer to pages local to the website being crawled, in other words pages stored in the file system. You can check the protocol of a URL with the getProtocol() method: if it returns "file" (a file:// URL), the page referenced by the URL is local to the file system; if it returns "http", the URL refers to a page outside the website being crawled.

Getting the base URI, creating a URL, and checking the protocol of a URL:
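The three snippets captioned above were screenshots in the original report. A hedged reconstruction of what they likely showed, with helper names of our own choosing, is:

    import java.net.MalformedURLException;
    import java.net.URL;
    import org.w3c.dom.Document;

    public class UrlHelpers {

        // Getting the base URI of the document currently being crawled
        // (getBaseURI() is part of the standard org.w3c.dom.Node interface).
        public static String baseUriOf(Document doc) {
            return doc.getBaseURI();
        }

        // Creating a URL for an href found on the page, resolved against the base URI.
        public static URL resolve(String baseUri, String href) throws MalformedURLException {
            return new URL(new URL(baseUri), href);
        }

        // Checking the protocol of a URL: getProtocol() returns "file" for
        // local pages and "http" for external ones.
        public static boolean isLocal(URL url) {
            return "file".equals(url.getProtocol());
        }
    }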
• 18. The Java UTIL package provides the Hashtable class. Hash tables are used to store data objects, in this case URLs. We created two instances of the Hashtable class: one to store the URLs found on the website that refer to its own internal pages, and one to store the URLs that refer to pages outside it. There are other storage options, such as MySQL, array lists, or plain arrays, but unlike MySQL a hash table is very simple to set up and use, and unlike array lists it is fast at storing, searching and retrieving data, which matters when thousands of URLs may be stored and searched through over and over again.

Creating the hash tables to store internal and external links, and getting the document builder factory and document builder to parse an XML document into a DOM Document object:
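The snippets captioned above were also screenshots. A sketch of what they plausibly contained follows; the class name is ours, and the load-external-dtd feature string is a Xerces-specific assumption used here so that the parser never reaches for a DTD over the network:

    import java.io.File;
    import java.io.IOException;
    import java.net.URL;
    import java.util.Hashtable;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import org.w3c.dom.Document;
    import org.xml.sax.SAXException;

    public class ParserSetup {

        // Two tables: one for pages inside the website, one for external links.
        // Using the URL string as the key keeps lookups simple and fast.
        static Hashtable<String, URL> internalLinks = new Hashtable<String, URL>();
        static Hashtable<String, URL> externalLinks = new Hashtable<String, URL>();

        // Build a DocumentBuilder, switch off validation so no Internet access
        // is needed, and parse the given file into a DOM Document.
        public static Document parse(File xmlFile)
                throws ParserConfigurationException, SAXException, IOException {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            // Also stop the parser from fetching the external DTD (Xerces feature).
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(xmlFile);
        }
    }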
• 19. It is very important that the parser does not validate the XML pages, because validation would require Internet access and, as you already know, the program crawls off-line web pages; if the parser tried to validate them, it would lead to system errors. Switching off the validating features of the parser is shown in the sketch above.

The external org.w3c.dom package is used to represent the parsed page content as a Document. The document contains elements in a tree-like structure, with each element corresponding to a node of the tree. Traversing the tree with any appropriate traversal method, all the nodes containing a-element tags are collected and stored in a list of nodes. Looping through that list, one can extract all the a-element tags containing hyperlink references.

Classes
The program is implemented with only two classes. The first class contains the main method, while the second contains the main implementation of the web crawler.

Main Class
The main class contains the main method. The main method prompts the user to enter the absolute file path of the index page of the website to be crawled. When the user complies, the path is converted into a URL object and stored in the hash table containing internal links. The main method also contains the first call of the recursive method processPage(URL url).

Structure of the main method:
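The "Structure of the main method" figure was likewise a screenshot. A sketch of what the main class plausibly looks like, with class and field names assumed to match the WebCrawler sketch given later, is:

    import java.io.File;
    import java.net.URL;
    import java.util.Scanner;

    public class Main {

        public static void main(String[] args) throws Exception {
            // Prompt for the absolute path of the website's index page.
            Scanner in = new Scanner(System.in);
            System.out.print("Enter the absolute path of the index page: ");
            String path = in.nextLine().trim();

            // Turn the path into a file:// URL and seed the internal-links table.
            URL indexUrl = new File(path).toURI().toURL();
            WebCrawler.internalLinks.put(indexUrl.toString(), indexUrl);

            // First call of the recursive crawl.
            WebCrawler.processPage(indexUrl);
        }
    }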
• 20. Web Crawler Class
This class contains about 80% of the implementation. It has only one method definition, that of the recursive method processPage(). At the beginning of the class the hash tables are declared, followed by the definition of the processPage() method. The processPage() method takes a single parameter, the URL object that is passed in. Inside the method, the absolute path of the URL is extracted and a file object is created from it. The document builder object is created from the declaration and initialisation of the document builder factory and document builder in the preceding lines of code. The method also contains the code snippet ensuring that the parser does not validate the XML pages. The parser is then called to parse the XML document, and the DOM document is normalized. Thereafter the root element of the document is extracted and the traversal of the document's nodes begins. All a-element tags are selected and stored in a list of nodes. The nodes are then looped through, and for all a-element tags containing an "href" attribute, the values of the
• 21. attributes are extracted and a URL is created for the page that the href references. As explained before, the URL is created from the base URL of the parent file of the page being referenced and the name of that page. The protocol of the URL is then checked. If it is "file", the program first checks that the URL does not refer to an image, a port, etc., but to an actual page, and then makes sure that the URL is not already stored in the hash table containing links to internal pages of the website. If it is already stored, the link is discarded and the next link in the node list is processed; if it is not, the URL is stored and that page is processed for more URLs. If the protocol check returns "http", the program checks whether the URL is already stored in the hash table containing links to external pages. If it is, the URL is discarded and the next link in the list is processed; if not, the URL is stored in the table and printed to the screen. A sketch of this class is shown below.
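Putting the description above together, a hedged sketch of the WebCrawler class follows. The names processPage, internalLinks and externalLinks come from the text; everything else, including the HTML-page check and the exact parser feature used to avoid network access, is our reconstruction rather than the original source:

    import java.io.File;
    import java.net.URL;
    import java.util.Hashtable;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class WebCrawler {

        // Visited internal pages and discovered external links, keyed by URL string
        // so circular references are detected in constant time.
        static Hashtable<String, URL> internalLinks = new Hashtable<String, URL>();
        static Hashtable<String, URL> externalLinks = new Hashtable<String, URL>();

        public static void processPage(URL url) {
            try {
                // Turn the file:// URL back into a File and parse it into a DOM tree.
                File xmlFile = new File(url.getFile());
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setValidating(false);
                factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
                DocumentBuilder builder = factory.newDocumentBuilder();
                Document doc = builder.parse(xmlFile);
                doc.getDocumentElement().normalize();

                // Collect every <a> element and inspect its href attribute.
                NodeList anchors = doc.getElementsByTagName("a");
                for (int i = 0; i < anchors.getLength(); i++) {
                    Element a = (Element) anchors.item(i);
                    String href = a.getAttribute("href");
                    if (href.isEmpty()) {
                        continue;
                    }
                    // Resolve the (possibly relative) href against the current page.
                    URL link = new URL(url, href);

                    if ("file".equals(link.getProtocol())) {
                        // Internal page: skip images etc., skip already-visited pages,
                        // otherwise record it, print it, and crawl it recursively.
                        if (isHtmlPage(link) && !internalLinks.containsKey(link.toString())) {
                            internalLinks.put(link.toString(), link);
                            System.out.println("internal: " + link);
                            processPage(link);
                        }
                    } else if ("http".equals(link.getProtocol())) {
                        // External link: record and print it once, but do not crawl it.
                        if (!externalLinks.containsKey(link.toString())) {
                            externalLinks.put(link.toString(), link);
                            System.out.println("external: " + link);
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // Crude check that the URL points at an HTML page rather than an image etc.
        // (mirrors the LinkFilter sketch shown earlier).
        private static boolean isHtmlPage(URL link) {
            String path = link.getPath().toLowerCase();
            return path.endsWith(".html") || path.endsWith(".htm");
        }
    }

Keeping the visited-page check on the URL string means each internal page is parsed exactly once, which is what prevents the crawl from looping on circular references between pages.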
• 22. SOFTWARE TESTING
During the testing of the program, many problems were encountered. One of the first problems during the initial tests was with parser validation. It is standard for XML documents to be checked for well-formedness and validity, but the website being crawled is, as you already know, off-line; if the parser tries to validate it, errors like the one shown below occur, because validation needs to connect to the Internet to perform its checks.
• 23. To solve it, we set all the features of the document builder factory that could trigger validation of the XML documents to false, as shown earlier. Another problem we encountered during the implementation was how to obtain absolute paths from the relative paths of the pages found on each page we had already crawled: all the crawler returned were the names of the files referenced from the page being crawled. The solution was to get the base URI of the file being crawled, which returns its absolute file path, and to append to it the names of the pages found in that file. That way we were able to create a URL for every link and process them all. Aside from the problems mentioned above, the program passed the final tests without any major bugs, bringing us successfully to the end of the implementation. Although it was not an easy ride, it was worth every bit of effort we invested in it. Below are terminal images showing the compilation and running of the program and its results, i.e. the links of the website being crawled. No graphical interface was developed, so the default interface, the terminal, is used.

Command to compile the program:
Running the program:
Prompt and input of the index page:

The program ran smoothly and proceeded to print out all the links found on the website, labelling them internal or external depending on where they reference and the protocol they contain.
• 24. (Terminal screenshots showing the crawler's output: the internal and external links found on the website.)
• 25. BIBLIOGRAPHY AND REFERENCES
[HREF1] What is a "Web Crawler"? (http://research.compaq.com/SRC/mercator/faq.html)
[HREF2] Inverted index (http://burks.brighton.ac.uk/burks/foldoc/86/59.htm)
[MARC] Marckini, Fredrick. Secrets to Making Your Internet Web Pages Achieve Top Rankings (ResponseDirect.com, Inc., c1999)
http://en.wikipedia.org/wiki/Web_crawler
http://research.microsoft.com/pubs/102936/eds-webcrawlerarchitecture.pdf