Intelligent crawling and indexing using lucene

Intelligent Crawling and Indexing
using Lucene

By
Shiva Thatipelli
Mohammad Zubair (Advisor)


Contents
 Searching
 Indexing
 Lucene
 Indexing with Lucene
 Indexing Static and Dynamic Pages
 Extracting and Indexing Dynamic Pages
 Implementation
 Screens

Searching
 Looking up words in an index
 Factors Affecting Search
 Precision: how well the system filters out irrelevant results
 Speed
 Single- and multi-phrase queries, result ranking, sorting, wildcard queries, and range query support

Indexing
 Sequential search does not scale
 An index speeds up selection
 An index is a special data structure that allows rapid searching.
 Different Index Implementations
- B Trees
- Hash Map

Search Process

[Diagram: documents (Docs) pass through the Indexing API into the Index; a Query run against the Index returns Hits.]

Lucene

 High-performance, full-featured text
search engine library
 Written entirely in Java
 Easy-to-use yet powerful API
 An Apache Jakarta product with strong open source community support

Why Lucene?
 Open source (Not proprietary)
 Easy to use, good documentation
 Interoperable, e.g. an index generated by Java can be used by VB, ASP, or Perl applications
 Powerful and highly scalable
 Index Format
 Designed for interoperability
 Well documented
 Resides on File System, RAM, custom store

Why Lucene? (Continued)
 Algorithms
 Efficient, fast and optimized
• Incremental indexing
• Boolean, fuzzy, range, multi-phrase, and wildcard queries, among others
• Content tagging: documents are treated as collections of terms
 Heterogeneous documents: useful when different MIME types carry different sets of metadata

Indexing With Lucene
 What type of documents can be indexed?
 Any document that can be fetched over the net with a URL and from which text can be extracted
 Uses an inverted index
- The index stores statistics about terms in order to make term-based search more efficient.
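To make the idea of an inverted index with term statistics concrete, here is a minimal sketch in plain Java. It is not Lucene's implementation or index format; the class name `TinyInvertedIndex`, its methods, and the whitespace-and-punctuation tokenization are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of Lucene.
final class TinyInvertedIndex {
    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, TreeMap<Integer, Integer>> postings = new HashMap<>();

    // Splits the text into lowercase terms and records one occurrence per token.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Document frequency: the number of documents that contain the term.
    int docFreq(String term) {
        TreeMap<Integer, Integer> p = postings.get(term);
        return p == null ? 0 : p.size();
    }

    // Term frequency: how often the term occurs in one document.
    int termFreq(String term, int docId) {
        TreeMap<Integer, Integer> p = postings.get(term);
        return p == null ? 0 : p.getOrDefault(docId, 0);
    }
}
```

A term lookup touches only that term's posting list rather than every document, and the stored counts are the kind of statistics a ranking function can use.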

Indexing With Lucene (Continued)

[Diagram: HTML, XLS, Word, and PDF documents each pass through a format-specific parser; the extracted text goes to the Analyzer and then into the Index.]
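The analysis step between text extraction and the index can be sketched roughly as follows. This is a simplified stand-in, not one of Lucene's analyzers: the class name `SketchAnalyzer` and its short stop-word list are hypothetical, and real analyzers handle many more cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class, not part of Lucene.
final class SketchAnalyzer {
    // A tiny, made-up stop-word list; real analyzers ship longer ones.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "is", "to"));

    // Lowercases the extracted text, splits it on anything that is not a
    // letter or digit, and drops stop words. The result is the term list
    // that would be handed to the index.
    static List<String> analyze(String extractedText) {
        List<String> terms = new ArrayList<>();
        String lower = extractedText.toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

Because the parsers reduce every file type to plain text first, the same analysis code can serve HTML, XLS, Word, and PDF sources alike.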

Indexing Static and Dynamic
Pages
 Static pages are HTML, XLS, Word, and PDF documents on the web that search engines like Google and Yahoo can easily crawl and index.
 Static pages on the internet can be passed into Lucene by their direct URLs, then indexed and searched.
 Dynamic pages are generated from submitted parameters. Examples are search results pages and database hidden pages; they cannot be indexed with direct URLs.
 To index dynamic pages, we need the parameters that users submitted to generate those pages.

Extracting and Indexing Dynamic
Pages
 Extracting dynamic web pages, also called database hidden pages, requires some kind of input to generate the URLs.
 To get the input parameters, we used Apache access logs, which record each user request as a URL.
 A sample entry in an Apache access log is as follows:
127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "GET /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title HTTP/1.1" 200 9560
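An entry like the sample above, joined onto one line, can be split into its fields with a regular expression. The sketch below assumes the log uses the Common Log Format; the class name `AccessLogEntry` and its field names are hypothetical and are not taken from the project's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class for one parsed access-log line.
final class AccessLogEntry {
    // Common Log Format:
    // host ident user [timestamp] "method requestUrl protocol" status bytes
    private static final Pattern CLF = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+|-)$");

    final String host;
    final String timestamp;
    final String method;
    final String requestUrl;
    final String protocol;
    final int status;

    private AccessLogEntry(Matcher m) {
        host = m.group(1);
        timestamp = m.group(4);
        method = m.group(5);
        requestUrl = m.group(6);
        protocol = m.group(7);
        status = Integer.parseInt(m.group(8));
    }

    // Returns the parsed entry, or null when the line is not in the expected format.
    static AccessLogEntry parse(String line) {
        Matcher m = CLF.matcher(line.trim());
        return m.matches() ? new AccessLogEntry(m) : null;
    }
}
```

Returning null for lines that do not match lets a caller skip malformed entries instead of aborting the whole log.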

Extracting and Indexing Dynamic
Pages Contd...
 The entry contains the IP address of the computer making the request, the date and time of access, the method called, the request URL, the HTTP version, and the HTTP status code.
 The request URL carries all the input parameters, in this case formname=simple, fulltext=maly, group=subject, and sort=title.
 The results page is dynamic and depends on the parameters passed.
 A full URL like
http://archon.cs.odu.edu:8066/archon/servlet/searc
can be generated from the request URL by prepending the website address.
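The two steps described above, reading the parameters out of a request URL and building a full URL from it, might look like the following sketch. The class and method names are hypothetical. The site address in the usage note is the host and port from the example URL on this slide, and parameter values are left percent-encoded for simplicity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class, not taken from the project's code.
final class RequestUrls {
    // Splits the query string of a request URL into name/value pairs,
    // keeping the order in which they appear. Values are not URL-decoded.
    static Map<String, String> queryParams(String requestUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = requestUrl.indexOf('?');
        if (q < 0) {
            return params;
        }
        for (String pair : requestUrl.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(name, value);
        }
        return params;
    }

    // Puts the site address in front of a request URL taken from the log,
    // making sure exactly one slash separates the two parts.
    static String fullUrl(String siteAddress, String requestUrl) {
        String base = siteAddress.endsWith("/")
                ? siteAddress.substring(0, siteAddress.length() - 1)
                : siteAddress;
        String path = requestUrl.startsWith("/") ? requestUrl : "/" + requestUrl;
        return base + path;
    }
}
```

For example, `RequestUrls.fullUrl("http://archon.cs.odu.edu:8066", requestUrl)` joins the site address with the logged request URL, query string included, so that fetching the result reproduces the page the user saw.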

Indexing Dynamic Pages…

[Diagram: Apache logs are parsed to generate a URL; the results page fetched from that URL, which could be any file type, goes to the Analyzer and then into the Index.]

Implementation
 The flow chart above describes how Apache logs are parsed and URLs are generated.
 It shows how the results pages are fetched from the URLs and their text extracted.
 Each results page is sent for analysis, and Lucene then generates the index used for future searches.
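The flow described above can be outlined in code as a small pipeline. This is a sketch under assumptions, not the project's actual implementation. The fetch-and-extract step and the Lucene indexing call are passed in as functions because the slides do not show them. Restricting the run to GET requests that returned status 200, and skipping duplicate URLs, are illustrative choices.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not taken from the project's code.
final class LogIndexingPipeline {
    // Captures the request URL and status code of GET requests in a log line.
    private static final Pattern GET_REQUEST =
            Pattern.compile("\"GET (\\S+) \\S+\" (\\d{3})\\b");

    // Step 1: parse the log and generate distinct full URLs for successful requests.
    static Set<String> generateUrls(List<String> logLines, String siteAddress) {
        Set<String> urls = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = GET_REQUEST.matcher(line);
            if (m.find() && m.group(2).equals("200")) {
                urls.add(siteAddress + m.group(1));
            }
        }
        return urls;
    }

    // Steps 2 and 3: fetch each results page, then hand its text to the indexer.
    // Returns the number of pages passed to the indexer.
    static int run(List<String> logLines, String siteAddress,
                   Function<String, String> fetchText,
                   BiConsumer<String, String> indexer) {
        int indexed = 0;
        for (String url : generateUrls(logLines, siteAddress)) {
            String text = fetchText.apply(url);
            if (text != null && !text.isEmpty()) {
                indexer.accept(url, text);
                indexed++;
            }
        }
        return indexed;
    }
}
```

Keeping fetching and indexing behind function parameters means the log-parsing logic can be exercised without network access or a Lucene index on disk.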

 Results:
 Hardware Environment
 Dedicated machine for indexing: no, but usage was nominal at the time of indexing.
 CPU: Intel x86 P4 2.8 GHz
 RAM: 512 MB DDR
 Drive configuration: IDE 7200rpm
 Software environment
 Lucene Version: 1.4
 Java Version: 1..2
 OS Version: Windows 2000
 Apache Web server version 1.3 to 2.0
 Location of index: local

Create Index
The IndexByLog.java file reads the access logs on the local computer, generates the URLs, fetches the results pages from those URLs, extracts their text, indexes it, and stores the index in the LuceneIndex folder.

File Extraction and Index Creation

Conclusion
 It is easy to implement efficient and powerful search engines using Lucene.
 Lucene can be used to index dynamic pages and database hidden pages.
 Web server access logs can be used to generate URLs, and Java with the Lucene API can be used to fetch and index database hidden pages.
 There are some security risks: the index can reveal which users ran which searches, along with other sensitive information.
