Intelligent Crawling and Indexing using Lucene

By Shiva Thatipelli
Mohammad Zubair (Advisor)

Contents
   Searching
   Indexing
   Lucene
   Indexing with Lucene
   Indexing Static and Dynamic Pages
   Extracting and Indexing Dynamic Pages
   Implementation
   Screens
Searching
   Looking up words in an index
   Factors affecting search
   Precision: how well the system filters out irrelevant results
   Speed
   Support for single- and multi-phrase queries, result ranking, sorting, wildcard queries, and range queries
Indexing
   Sequential search does not scale
   An index speeds up selection
   An index is a special data structure that allows rapid searching
   Different index implementations
        - B-trees
        - Hash maps
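
As a rough illustration of the two index implementations listed above, the sketch below builds the same small term-count index twice in Java. The class `IndexStructuresSketch` and its method names are hypothetical. Java's standard library has no B-tree, so `TreeMap` (a red-black tree) stands in for the ordered-tree case, and neither structure is meant to represent Lucene's own index format.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: the same term -> occurrence-count data in two index structures.
final class IndexStructuresSketch {

    // Hash-based index: fast exact-match lookup, but keys have no useful order.
    static Map<String, Integer> buildHashIndex(String[] terms) {
        Map<String, Integer> index = new HashMap<>();
        for (String term : terms) {
            index.merge(term, 1, Integer::sum);
        }
        return index;
    }

    // Tree-based index: keys stay sorted, so a range of terms can be read in order.
    // TreeMap is a red-black tree, used here only as a stand-in for a B-tree.
    static SortedMap<String, Integer> buildTreeIndex(String[] terms) {
        SortedMap<String, Integer> index = new TreeMap<>();
        for (String term : terms) {
            index.merge(term, 1, Integer::sum);
        }
        return index;
    }

    public static void main(String[] args) {
        String[] terms = {"lucene", "index", "crawl", "index", "log"};
        // Exact-match lookup of one term.
        System.out.println(buildHashIndex(terms).get("index"));
        // Range lookup: all terms from "i" (inclusive) to "m" (exclusive).
        System.out.println(buildTreeIndex(terms).subMap("i", "m").keySet());
    }
}
```

The hash version answers "is this exact term present?" quickly, while the sorted version also supports range-style lookups, which is one reason ordered structures are attractive for range and prefix queries.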
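Search Process
   [Diagram with the labels Docs, Indexing API, Index, Query, Hits, Docs: documents enter the index through the Indexing API, and a query against the index returns hits]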
Lucene
   High-performance, full-featured text search engine library
   Written entirely in Java
   Easy-to-use yet powerful API
   Apache Jakarta project with strong open-source community support
Why Lucene?
   Open source (not proprietary)
   Easy to use, with good documentation
   Interoperable: for example, an index generated by Java can be used by VB, ASP, or Perl applications
   Powerful and highly scalable
   Index format
       Designed for interoperability
       Well documented
       Resides on the file system, in RAM, or in a custom store
Why Lucene? Contd…
   Algorithms
       Efficient, fast, and optimized
   Incremental indexing
   Boolean, fuzzy, range, multi-phrase, and wildcard queries, among others
   Content tagging: documents are treated as collections of terms
   Heterogeneous documents: useful when different MIME types carry different sets of metadata
Indexing With Lucene
   What type of documents can be indexed?
       Any document that can be fetched over the net with a URL and from which text can be extracted
   Uses an inverted index
       The index stores statistics about terms in order to make term-based search more efficient.
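
To make the inverted-index idea concrete, here is a minimal sketch in plain Java. It is not Lucene's implementation: the class `InvertedIndexSketch`, its methods, and the naive tokenizer are all assumptions made for illustration. It maps each term to the set of document ids containing it and exposes document frequency as one example of a per-term statistic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch of an inverted index: term -> sorted set of document ids.
final class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Naive analysis step (an assumption): lowercase, then split on
    // anything that is not an ASCII letter or digit.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<Integer>()).add(docId);
            }
        }
    }

    // Ids of the documents containing the term, in ascending order.
    List<Integer> search(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids == null) {
            return new ArrayList<Integer>();
        }
        return new ArrayList<Integer>(ids);
    }

    // Document frequency: the number of documents that contain the term.
    int documentFrequency(String term) {
        return search(term).size();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene builds an index");
        index.add(2, "Apache access logs");
        index.add(3, "Index the logs with Lucene");
        System.out.println(index.search("lucene"));
        System.out.println(index.documentFrequency("logs"));
    }
}
```

A lookup touches only the entry for the queried term, so its cost does not grow with the total amount of text indexed the way a sequential scan does.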
Indexing With Lucene Contd…
   [Diagram: HTML, XLS, Word, and PDF documents each go through their own parser; the extracted text is passed to the Analyzer and then written to the Index]
Indexing Static and Dynamic Pages
   Static pages are HTML, XLS, Word, and PDF documents on the web that can be easily crawled and indexed by search engines like Google and Yahoo.
   Static pages on the internet can be passed to Lucene by their direct URLs, then indexed and searched.
   Dynamic pages are generated from submitted parameters, for example search results pages. These database-hidden pages cannot be indexed with direct URLs.
   To index dynamic pages, we need the parameters that users submitted to generate them.
Extracting and Indexing Dynamic Pages
   Extracting dynamic web pages, also called database-hidden pages, requires some input from which to generate the URLs.
   To get the input parameters, we used Apache access logs, which record each user request as a URL.
   A sample entry in an Apache access log:
    127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "GET /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title HTTP/1.1" 200 9560
Extracting and Indexing Dynamic Pages Contd...
   The entry contains the IP address of the computer making the request, the date and time of access, the method called, the request URL, the HTTP version, and the HTTP status code.
   The request URL carries all the input parameters, in this case formname=simple, fulltext=maly, group=subject, and sort=title.
   The results page is dynamic and depends on the parameters passed.
   A full URL like
    http://archon.cs.odu.edu:8066/archon/servlet/searc
    can be generated from the request URL by prefixing it with the website address.
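
The URL-generation step can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the authors' code: the class `AccessLogUrlSketch`, the method `toFullUrl`, and the regular expression are hypothetical, and the pattern assumes entries follow the Common Log Format layout of the sample entry. The site address is the one from the slide's example URL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pull the request URL out of one access-log line
// and turn it into an absolute URL.
final class AccessLogUrlSketch {
    // Assumed layout (Common Log Format):
    // host ident user [timestamp] "METHOD request-url PROTOCOL" status bytes
    private static final Pattern LOG_LINE = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+)$");

    // Returns siteAddress followed by the request URL,
    // or null when the line does not match the assumed layout.
    static String toFullUrl(String siteAddress, String logLine) {
        Matcher m = LOG_LINE.matcher(logLine.trim());
        if (!m.matches()) {
            return null;
        }
        String requestUrl = m.group(6); // path plus query string
        return siteAddress + requestUrl;
    }

    public static void main(String[] args) {
        // The sample entry from the slide, joined onto one line.
        String line = "127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "
                + "\"GET /archon/servlet/search?formname=simple&fulltext=maly"
                + "&group=subject&sort=title HTTP/1.1\" 200 9560";
        System.out.println(toFullUrl("http://archon.cs.odu.edu:8066", line));
    }
}
```

Returning null for lines that do not match keeps malformed entries from producing bogus URLs; a real crawler would probably also want to log or count such lines.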
Indexing Dynamic Pages…
   [Diagram: Apache Logs -> Parse and generate URL -> Results page (could be any file type) -> Analyzer -> Index]
Implementation
   The flow chart above describes how Apache logs are parsed and URLs are generated.
   It shows how the results pages are fetched from the URLs and their text extracted.
   Each results page is sent for analysis; Lucene then generates the index, which is used for future searches.
Demo
   Results:
   Hardware environment
       Dedicated machine for indexing: no, but usage was nominal at the time of indexing
       CPU: Intel x86 P4 2.8 GHz
       RAM: 512 DDR
       Drive configuration: IDE 7200 rpm
   Software environment
       Lucene version: 1.4
       Java version: 1..2
       OS version: Windows 2000
       Apache web server version: 1.3 to 2.0
       Location of index: local
Create Index
IndexByLog.java reads the access logs on the local computer, generates the URLs, fetches the results pages from those URLs, extracts their text, indexes it, and stores the index in the LuceneIndex folder.
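
The first stage of such a program, turning a batch of log lines into a list of URLs to fetch, might look roughly like the sketch below. It does not reproduce IndexByLog.java, whose source is not shown here. The class `LogUrlCollectorSketch` is hypothetical, and two design choices are assumptions of this sketch: only GET requests answered with status 200 are kept, and repeated URLs are collapsed so each results page would be fetched once. The 404 line in `main` is a made-up variation of the sample entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "read logs, generate URLs" stage.
final class LogUrlCollectorSketch {

    // Collects distinct absolute URLs from log lines, keeping first-seen order.
    // Assumptions: only GET requests with status 200 are worth re-fetching,
    // and the request is the quoted section of each line.
    static List<String> collectUrls(String siteAddress, List<String> logLines) {
        Set<String> urls = new LinkedHashSet<String>();
        for (String line : logLines) {
            int open = line.indexOf('"');
            int close = line.lastIndexOf('"');
            if (open < 0 || close <= open) {
                continue; // no quoted request section, skip the line
            }
            String[] request = line.substring(open + 1, close).split(" ");
            String[] tail = line.substring(close + 1).trim().split(" ");
            boolean isGet = request.length == 3 && request[0].equals("GET");
            boolean isOk = tail[0].equals("200");
            if (isGet && isOk) {
                urls.add(siteAddress + request[1]);
            }
        }
        return new ArrayList<String>(urls);
    }

    public static void main(String[] args) {
        // First line: the slide's sample entry. Second line: invented for illustration.
        String ok = "127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "
                + "\"GET /archon/servlet/search?formname=simple&fulltext=maly"
                + "&group=subject&sort=title HTTP/1.1\" 200 9560";
        String notFound = "127.0.0.1 - - [31/Aug/2005:18:45:10 -0400] "
                + "\"GET /archon/servlet/nosuchpage HTTP/1.1\" 404 312";
        List<String> urls = collectUrls("http://archon.cs.odu.edu:8066",
                Arrays.asList(ok, notFound, ok));
        for (String url : urls) {
            System.out.println(url);
        }
    }
}
```

Each collected URL would then be fetched, its text extracted, and the result handed to Lucene for analysis and indexing; those stages are left out because they depend on the Lucene version and the parsers in use.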
File Extraction and Index Creation
Searching at the prompt
Searching on the web
Results on the web
Conclusion
   It is very easy to implement efficient and powerful search engines using Lucene.
   Lucene can be used to index dynamic pages and database-hidden pages.
   Web server access logs can be used to generate URLs, and Java with the Lucene API can be used to fetch and index database-hidden pages.
   There are some security risks: the index can reveal which users ran which searches, along with other sensitive information.
Questions?

Intelligent Crawling and Indexing using Lucene

  • 1. Intelligent Crawling and Indexing using Lucene. By Shiva Thatipelli; Mohammad Zubair (Advisor)
  • 2. Contents
      - Searching
      - Indexing
      - Lucene
      - Indexing with Lucene
      - Indexing Static and Dynamic Pages
      - Extracting and Indexing Dynamic Pages
      - Implementation
      - Screens
  • 3. Searching
      - Searching means looking up words in an index
      - Factors affecting search:
          - Precision: how well the system can filter
          - Speed
          - Support for single- and multiple-phrase queries, results ranking, sorting, wildcard queries, and range queries
  • 4. Indexing
      - Sequential search does not scale
      - An index speeds up selection
      - An index is a special data structure that allows rapid searching
      - Different index implementations:
          - B-trees
          - Hash maps
  • 5. Search Process (diagram with the labels: Docs, Indexing API, Index, Query, Hits, Docs)
  • 6. Lucene
      - High-performance, full-featured text search engine library
      - Written entirely in Java
      - Easy-to-use yet powerful API
      - An Apache Jakarta product with strong open source community support
  • 7. Why Lucene?
      - Open source (not proprietary)
      - Easy to use, with good documentation
      - Interoperable: for example, an index generated by Java can be used by VB, ASP, or Perl applications
      - Powerful and highly scalable
      - Index format:
          - Designed for interoperability
          - Well documented
          - Resides on the file system, in RAM, or in a custom store
  • 8. Why Lucene? (continued)
      - Algorithms: efficient, fast, and optimized
      - Incremental indexing
      - Boolean, fuzzy, range, multi-phrase, and wildcard queries, among others
      - Content tagging: documents are treated as collections of terms
      - Heterogeneous documents: useful when different sets of metadata are present for different MIME types
  • 9. Indexing With Lucene
      - What type of documents can be indexed? Any document that can be fetched over the net with a URL and from which text can be extracted
      - Lucene uses an inverted index: the index stores statistics about terms in order to make term-based search more efficient
  • 10. Indexing With Lucene (continued) (diagram: HTML, XLS, Word, and PDF documents each pass through a parser; the extracted text goes to the Analyzer and then into the Index)
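The inverted index mentioned on slide 9 can be illustrated with a toy structure that maps each term to the documents containing it. The sketch below is only an illustration of the idea in plain Java; the class name `TinyInvertedIndex` and its methods are hypothetical, and it does not reflect Lucene's actual index format or API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted set of document ids containing it.
// Hypothetical illustration only; Lucene's real index is far more elaborate.
class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Lowercase the text, split it on non-alphanumeric characters, and record each term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Return the ids of the documents containing the term (empty set if none).
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null
                ? Collections.<Integer>emptySortedSet()
                : Collections.unmodifiableSortedSet(hits);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene is a text search engine library");
        index.add(2, "An index speeds up search");
        index.add(3, "Sequential search does not scale");
        System.out.println(index.search("search")); // prints [1, 2, 3]
        System.out.println(index.search("lucene")); // prints [1]
    }
}
```

A lookup touches only the entry for the queried term, which is why this scales better than scanning every document sequentially.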
  • 11. Indexing Static and Dynamic Pages
      - Static pages are HTML, XLS, Word, and PDF documents on the web that can be easily crawled and indexed by search engines such as Google and Yahoo.
      - Static pages on the internet can be passed into Lucene, indexed, and searched using their direct URLs.
      - Dynamic pages are generated from submitted parameters. Examples are search results pages and database-hidden pages; they cannot be indexed using direct URLs.
      - To index dynamic pages, we need the parameters that users submitted to generate those pages.
  • 12. Extracting and Indexing Dynamic Pages
      - Extracting dynamic web pages, also called database-hidden pages, requires some kind of input to generate the URLs.
      - To get the input parameters, we used Apache access logs, which record each user request as a URL.
      - A sample entry in an Apache access log:
        127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "GET /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title HTTP/1.1" 200 9560
  • 13. Extracting and Indexing Dynamic Pages (continued)
      - The entry contains the IP address of the computer making the request, the date and time of access, the method called, the request URL, the HTTP version, and the HTTP status code.
      - The request URL carries all the input parameters, in this case formname=simple, fulltext=maly, group=subject, and sort=title.
      - The results page is dynamic and depends on the parameters passed.
      - A full URL like http://archon.cs.odu.edu:8066/archon/servlet/searc... can be generated from the request URL by prepending the website address.
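As a rough sketch of the step described on slides 12 and 13, the following plain-Java code pulls the request URL out of an access log entry and joins it to the website address. The class and method names are hypothetical, the regular expression assumes the Common Log Format shown in the sample entry, and the choice to keep only GET requests with status 200 is an assumption made here, not something the slides state.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns Apache access log entries into full URLs.
class AccessLogUrlBuilder {
    // Common Log Format: host ident user [timestamp] "METHOD path PROTOCOL" status bytes
    private static final Pattern ENTRY = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+)$");

    // Returns the full URL for a successful GET entry, or null if the line is skipped.
    // Assumption: only GET requests answered with status 200 are worth re-fetching.
    static String toUrl(String siteAddress, String logLine) {
        Matcher m = ENTRY.matcher(logLine.trim());
        if (!m.matches()) {
            return null; // not a well-formed entry
        }
        boolean isGet = "GET".equals(m.group(3));
        boolean isOk = "200".equals(m.group(6));
        return (isGet && isOk) ? siteAddress + m.group(4) : null;
    }

    // Splits the query string of a URL into name/value pairs, in order of appearance.
    // Values are left as-is (no percent-decoding) to keep the sketch short.
    static Map<String, String> queryParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params;
        }
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(name, value);
        }
        return params;
    }

    public static void main(String[] args) {
        String site = "http://archon.cs.odu.edu:8066"; // website address from the slides
        String line = "127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "
                + "\"GET /archon/servlet/search?formname=simple&fulltext=maly"
                + "&group=subject&sort=title HTTP/1.1\" 200 9560";
        String url = toUrl(site, line);
        System.out.println(url);
        System.out.println(queryParams(url));
    }
}
```

Run against the sample entry, `queryParams` yields the four parameters listed on slide 13: formname, fulltext, group, and sort.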
  • 14. Indexing Dynamic Pages (flow chart: Apache logs; parse and generate URL; results page, which could be any file type; Analyzer; Index)
  • 15. Implementation
      - The flow chart above shows how the Apache logs are parsed and the URLs are generated.
      - It also shows how the results pages are fetched from the URLs and their text extracted.
      - Each results page is sent for analysis, and Lucene then generates the index that will be used for future searches.
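The flow described above (generate URLs, fetch each results page, extract its text, index it) might be sketched as follows. This is a simplified stand-in, not the authors' implementation: the class `LogIndexingSketch` is hypothetical, the fetcher is passed in as a function so the example runs without network access, the tag stripping is a crude regular expression rather than a real parser, and a plain map takes the place of the Lucene index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch of the log-to-index flow; a map stands in for the Lucene index.
class LogIndexingSketch {
    // Builds unique URLs from request paths, fetches each page through the supplied
    // fetcher, strips markup, and records which URLs contain each term.
    static Map<String, Set<String>> indexPages(String site, List<String> requestPaths,
                                               Function<String, String> fetcher) {
        // The same request often appears many times in a log; fetch each URL once.
        Set<String> urls = new LinkedHashSet<>();
        for (String path : requestPaths) {
            urls.add(site + path);
        }

        Map<String, Set<String>> index = new TreeMap<>();
        for (String url : urls) {
            String page = fetcher.apply(url);
            if (page == null) {
                continue; // page could not be fetched
            }
            // Crude text extraction: replace anything that looks like a tag with a space.
            String text = page.replaceAll("<[^>]*>", " ");
            for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new LinkedHashSet<>()).add(url);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Stub "web" so the example is self-contained; example.org is a placeholder site.
        Map<String, String> fakeWeb = new HashMap<>();
        fakeWeb.put("http://example.org/search?q=a", "<html><body>Lucene results</body></html>");
        fakeWeb.put("http://example.org/search?q=b", "<p>More results</p>");

        List<String> paths = Arrays.asList("/search?q=a", "/search?q=b", "/search?q=a");
        Map<String, Set<String>> index = indexPages("http://example.org", paths, fakeWeb::get);
        System.out.println(index.get("results"));
        System.out.println(index.get("lucene"));
    }
}
```

Passing the fetcher as a parameter keeps the indexing logic independent of how pages are retrieved, so a real HTTP client or a format-specific parser could be substituted later.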
  • 16. Demo
  • 17. Results
      - Hardware environment
          - Dedicated machine for indexing: no, but only nominal usage at the time of indexing
          - CPU: Intel x86 P4, 2.8 GHz
          - RAM: 512 DDR
          - Drive configuration: IDE, 7200 rpm
      - Software environment
          - Lucene version: 1.4
          - Java version: 1..2
          - OS version: Windows 2000
          - Apache web server version: 1.3 to 2.0
          - Location of index: local
  • 18. Create Index: the IndexByLog.java file reads the access logs on the local computer, generates the URLs, fetches the results pages from those URLs, extracts their text, indexes them, and stores the index in the LuceneIndex folder.
  • 19. Files extraction and Index Creation
  • 23. Conclusion
      - Lucene makes it easy to implement efficient and powerful search engines.
      - Lucene can be used to index dynamic pages and database-hidden pages.
      - Web server access logs can be used to generate URLs, and Java with the Lucene API can be used to fetch and index database-hidden pages.
      - There are some security risks, because the logs can reveal which searches users performed, along with other sensitive information.