2. Contents
Searching
Indexing
Lucene
Indexing with Lucene
Indexing Static and Dynamic Pages
Extracting and Indexing Dynamic Pages
Implementation
Screens
3. Searching
Looking up words in an index
Factors Affecting Search
Precision – How well the system can
filter
Speed
Query features: single- and multiple-phrase
queries, results ranking, sorting, wildcard
queries, range query support
4. Indexing
Sequential Search is bad (Not Scalable)
Index speeds up selection
Index is a special data structure which
allows rapid searching.
Different Index Implementations
- B Trees
- Hash Map
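The difference between a sequential search and an index lookup can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `IndexStructures` and the sample words are hypothetical, and `TreeMap` (a red-black tree) stands in for a B-tree only because both keep keys in sorted order; it is not how Lucene stores its index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IndexStructures {
    // Sequential search: examines entries one by one, so cost grows with the collection size.
    public static int sequentialFind(List<String> words, String target) {
        for (int i = 0; i < words.size(); i++) {
            if (words.get(i).equals(target)) {
                return i;
            }
        }
        return -1;
    }

    // Hash map index: word -> first position, suited to exact-match lookups.
    public static Map<String, Integer> buildHashIndex(List<String> words) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            index.putIfAbsent(words.get(i), i);
        }
        return index;
    }

    // Sorted index (stand-in for a B-tree): keys stay ordered, so range lookups are possible.
    public static TreeMap<String, Integer> buildSortedIndex(List<String> words) {
        TreeMap<String, Integer> index = new TreeMap<>();
        for (int i = 0; i < words.size(); i++) {
            index.putIfAbsent(words.get(i), i);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("lucene", "index", "search", "analyzer", "parser");
        System.out.println(sequentialFind(words, "search"));          // scans the list
        System.out.println(buildHashIndex(words).get("search"));      // direct lookup
        // All words w with "i" <= w < "m", read straight from the sorted keys.
        System.out.println(buildSortedIndex(words).subMap("i", "m").keySet());
    }
}
```

The hash map answers exact-match lookups only, while the sorted structure can also serve range-style lookups, which is one reason both kinds of index implementation are listed.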
6. Lucene
High-performance, full-featured text
search engine library
Written 100% in pure Java
Easy to use yet powerful API
Apache Jakarta project. Strong open
source community support.
7. Why Lucene?
Open source (Not proprietary)
Easy to use, good documentation
Interoperable - e.g. an index generated by Java
can be used by VB, ASP, or Perl applications
Powerful And Highly Scalable
Index Format
Designed for interoperability
Well Documented
Resides on File System, RAM, custom store
8. Continued
Algorithms
Efficient, fast and optimized
• Incremental Indexing
• Boolean Query, Fuzzy Query, Range Query,
Multi Phrase Query, Wild Card Query etc…
• Content Tagging – Documents as a collection
of terms
Heterogeneous documents - useful when
different sets of metadata are present for
different MIME types
9. Indexing With Lucene
What type of documents can be
indexed?
Any document that can be fetched over the
net with a URL and from which text can be
extracted
Uses an Inverted Index
- The index maps each term to the documents
that contain it and stores statistics about
terms in order to make term-based
search more efficient.
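The idea of an inverted index can be sketched as a map from each term to the documents that contain it, together with simple per-term statistics. This is a minimal sketch and not Lucene's actual index format or API: the class `InvertedIndexSketch`, its method names, and the tokenization rule (lowercase, split on non-alphanumeric characters) are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class InvertedIndexSketch {
    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    // Tokenize the text (assumed rule: lowercase, split on non-alphanumerics) and record each term.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Documents containing the term: one map lookup instead of scanning every document.
    public Set<Integer> search(String term) {
        Map<Integer, Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<Integer>emptySet() : docs.keySet();
    }

    // Statistic: how many documents contain the term.
    public int documentFrequency(String term) {
        return search(term).size();
    }

    // Statistic: how often the term occurs in one document.
    public int termFrequency(String term, int docId) {
        Map<Integer, Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? 0 : docs.getOrDefault(docId, 0);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a search library");
        index.add(2, "Search engines use an index; Lucene builds the index");
        System.out.println(index.search("lucene"));
        System.out.println(index.documentFrequency("search"));
        System.out.println(index.termFrequency("index", 2));
    }
}
```

Statistics such as document frequency and term frequency are the kind of per-term information a search engine can use for ranking, which is why they are kept alongside the term-to-document mapping here.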
10. Indexing With Lucene Contd…
HTML, XLS, WORD, PDF documents
-> Parser (one per format; text is extracted)
-> Analyzer
-> Index
11. Indexing Static and Dynamic
Pages
Static Pages are HTML, XLS, WORD, and PDF
documents on the web which can be easily crawled and
indexed by search engines like Google and Yahoo.
Static Pages on the internet can be passed into
Lucene, indexed, and searched using their direct URLs.
Dynamic Pages are generated as a result of
parameters submitted by users; search results pages
and database hidden pages, for example, cannot be
indexed with direct URLs.
To index Dynamic Pages we need the parameters
submitted by users to generate those pages.
12. Extracting and Indexing Dynamic
Pages
Extracting dynamic web pages, which are also
called database hidden pages, requires some kind of
input to generate the URLs.
To get the input parameters, we used Apache
access logs, which record each user request as a URL.
A sample entry in Apache access log is as follows:
127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "GET /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title HTTP/1.1" 200 9560
13. Extracting and Indexing Dynamic
Pages Contd...
Each entry records the IP address of the computer
making the request, the date and time of the request,
the method called, the Request URL, the HTTP version, the
HTTP status code, and the response size.
The Request URL carries all the input parameters,
in this case formname=simple,
fulltext=maly, group=subject, sort=title
The Results page is dynamic and depends on the parameters
passed.
A full URL like
http://archon.cs.odu.edu:8066/archon/servlet/searc
can be generated from the Request URL by prepending the
website address.
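Turning a log entry into a fetchable URL can be sketched with a regular expression over the access log line. This is a hedged illustration, not the authors' code: the class `AccessLogUrl`, its method names, and the base address `http://example.org` are hypothetical, the pattern assumes the Common Log Format shown in the sample entry, and parameter values are not URL-decoded.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccessLogUrl {
    // Assumed layout: host ident user [timestamp] "METHOD path PROTOCOL" status bytes
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+)$");

    // Extract the Request URL (path plus query string) from one log line, if the line matches.
    public static Optional<String> requestPath(String logLine) {
        Matcher m = LINE.matcher(logLine.trim());
        return m.matches() ? Optional.of(m.group(4)) : Optional.empty();
    }

    // Build the full URL by putting the website address in front of the Request URL.
    public static Optional<String> toUrl(String baseAddress, String logLine) {
        return requestPath(logLine).map(path -> baseAddress + path);
    }

    // Split the query string into parameters, keeping their original order (no URL decoding).
    public static Map<String, String> queryParams(String path) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = path.indexOf('?');
        if (q < 0 || q == path.length() - 1) {
            return params;
        }
        for (String pair : path.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(key, value);
        }
        return params;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] \"GET "
                + "/archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title"
                + " HTTP/1.1\" 200 9560";
        // "http://example.org" is a placeholder base address, not the site from the slides.
        System.out.println(toUrl("http://example.org", line).orElse("no match"));
        System.out.println(queryParams(requestPath(line).orElse("")));
    }
}
```

Lines that do not match the assumed layout yield an empty `Optional`, so malformed log entries can simply be skipped when generating URLs.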
14. Indexing Dynamic Pages…
Apache Logs
-> Parse and generate URL
-> Results page (could be any file type)
-> Analyzer
-> Index
15. Implementation
The above flow chart describes the way
Apache logs are parsed and URLs are
generated
It shows how the Results pages are
fetched and extracted from the URLs
The Results page is sent for analysis,
then Lucene generates the index, which
will be used for future searches.
17. Results:
Hardware Environment
Dedicated machine for indexing: No, but nominal usage at time
of indexing.
CPU: Intel x86 P4 2.8 GHz
RAM: 512 MB DDR
Drive configuration: IDE 7200 rpm
Software environment
Lucene Version: 1.4
Java Version: 1..2
OS Version: Windows 2000
Apache Web server version 1.3 to 2.0
Location of index: local
18. Create Index
The IndexByLog.java file reads the access logs on the local computer, generates
the URLs, fetches the results pages from those URLs, extracts their text,
indexes them, and stores the index in the LuceneIndex folder.
23. Conclusion
It is very easy to implement efficient and
powerful search engines using Lucene
Lucene can be used to index dynamic pages
and database hidden pages
Web server access logs can be used to
generate URLs, and the Java and Lucene APIs
can be used to fetch and index database hidden
pages.
There are some security risks involved, as the
index can reveal which users ran which
searches, along with other sensitive information.