Intelligent Crawling and Indexing
using Lucene


                 By
           Shiva Thatipelli
      Mohammad Zubair (Advisor)

Contents
   Searching
   Indexing
   Lucene
   Indexing with Lucene
   Indexing Static and Dynamic Pages
   Extracting and Indexing Dynamic Pages
   Implementation
   Screens
Searching
   Looking up words in an index
   Factors Affecting Search
   Precision: how well the system can
    filter out irrelevant results
   Speed
   Support for single and multiple phrase queries, results
    ranking, sorting, wildcard queries, and
    range queries
Indexing
   Sequential search is bad (not scalable)
   An index speeds up selection
   An index is a special data structure that
    allows rapid searching.
   Different index implementations
        - B-trees
        - Hash maps
Search Process
   [Diagram] Indexing: Docs -> Indexing API -> Index
   [Diagram] Searching: Query -> Index -> Hits -> Docs
Lucene

   High-performance, full-featured text
    search engine library
   Written 100% in pure Java
   Easy to use yet powerful API
   Apache Jakarta project with strong open
    source community support.
Why Lucene?
   Open source (Not proprietary)
   Easy to use, good documentation
   Interoperable, e.g. an index generated by Java
    can be used by VB, ASP, or Perl applications
   Powerful and highly scalable
   Index format
       Designed for interoperability
       Well documented
       Resides on the file system, in RAM, or in a custom store
Why Lucene? (continued)
   Algorithms
       Efficient, fast and optimized
   Incremental indexing
   Boolean Query, Fuzzy Query, Range Query,
    Multi Phrase Query, Wildcard Query, etc.
   Content tagging: documents as collections
    of terms
   Heterogeneous documents: useful when a
    different set of metadata is present for different
    MIME types
Indexing With Lucene
   What type of documents can be
    indexed?
       Any document that can be fetched over the
        net with a URL and from which text can be
        extracted
   Uses an inverted index
       The index stores statistics about
        terms in order to make term-based
        search more efficient.
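
The inverted index idea can be sketched in a few lines of plain Java. This is a toy in-memory structure, not Lucene's actual index format; the class name `TinyInvertedIndex`, its methods, and the sample sentences are all made up for illustration, and the code assumes a modern Java (8 or later) rather than the version used in the demo.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (hypothetical, not Lucene's format):
// maps each term to the sorted set of ids of the documents that contain it.
class TinyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // Adds a document and returns its id (its position in insertion order).
    int add(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Looks the term up directly; no document text is scanned at query time.
    Set<Integer> search(String term) {
        TreeSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<>() : new TreeSet<>(hits);
    }

    // Document frequency, one example of a per-term statistic.
    int docFreq(String term) {
        return search(term).size();
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Lucene is a search library");
        index.add("An index speeds up search");
        index.add("Sequential scan is slow");
        System.out.println("search -> " + index.search("search"));
        System.out.println("docFreq(is) = " + index.docFreq("is"));
    }
}
```

A query for a term is a single map lookup, which is why the cost does not grow with the total amount of document text the way a sequential scan does.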
Indexing With Lucene (continued)
   [Diagram] HTML, XLS, WORD, PDF -> Parser (one per document type) -> extracted text -> Analyzer -> Index
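
As a rough illustration of what an analysis step does with extracted text, the sketch below lowercases the text, splits it into tokens, and drops a few stop words. It is a simplified stand-in written in plain Java, not Lucene's analyzer classes; the name `AnalyzerSketch` and the stop-word list are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified analysis step (hypothetical, not Lucene's Analyzer API).
class AnalyzerSketch {
    // Illustrative stop-word list; an assumption, not Lucene's built-in list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "is", "to"));

    // Turns text produced by a parser into the list of terms to be indexed.
    static List<String> analyze(String extractedText) {
        List<String> terms = new ArrayList<>();
        for (String token : extractedText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Index of Lucene is FAST."));
    }
}
```

Because every parser hands over plain text, the same analysis step can serve all document types in the diagram.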
Indexing Static and Dynamic
Pages
   Static pages are HTML, XLS, WORD, and PDF
    documents on the web that can be easily crawled and
    indexed by search engines like Google and Yahoo.
   Static pages on the internet can be passed to
    Lucene by direct URL, then indexed and searched.
   Dynamic pages are generated from submitted
    parameters, for example search results pages and
    database hidden pages; they cannot be indexed with
    direct URLs.
   To index dynamic pages we need the parameters
    submitted by users to generate those pages.
Extracting and Indexing Dynamic
Pages
   Extracting dynamic web pages, also called
    database hidden pages, needs some kind of
    input to generate the URLs.
   To get the input parameters, we used Apache
    access logs, which record each user request as a URL.
   A sample entry in an Apache access log is as follows:
    127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "GET /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title HTTP/1.1" 200 9560
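
A log entry of this shape can be split into its fields with a regular expression. The sketch below is one possible way to do it, assuming the entries follow the layout of the sample above; the class `AccessLogParser` and its `Entry` type are hypothetical names, and this is not the code of the original implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for access log lines shaped like the sample entry.
class AccessLogParser {
    // Layout: host ident user [time] "method url protocol" status bytes
    private static final Pattern LOG_LINE = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+)$");

    static final class Entry {
        final String clientIp;
        final String timestamp;
        final String method;
        final String requestUrl;
        final String protocol;
        final int status;

        Entry(String clientIp, String timestamp, String method,
              String requestUrl, String protocol, int status) {
            this.clientIp = clientIp;
            this.timestamp = timestamp;
            this.method = method;
            this.requestUrl = requestUrl;
            this.protocol = protocol;
            this.status = status;
        }
    }

    // Returns the parsed entry, or null when the line does not match the layout.
    static Entry parse(String line) {
        Matcher m = LOG_LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new Entry(m.group(1), m.group(4), m.group(5),
                m.group(6), m.group(7), Integer.parseInt(m.group(8)));
    }

    public static void main(String[] args) {
        String sample = "127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "
                + "\"GET /archon/servlet/search?formname=simple&fulltext=maly"
                + "&group=subject&sort=title HTTP/1.1\" 200 9560";
        Entry e = parse(sample);
        System.out.println(e.method + " " + e.requestUrl + " " + e.status);
    }
}
```

Lines that do not match, such as malformed requests, come back as null so the caller can skip them.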
Extracting and Indexing Dynamic
Pages Contd...
   The entry contains the IP address of the computer
    accessing the information, the date and time of access,
    the method called, the request URL, the HTTP version, and the HTTP status code.
   The request URL is the part that carries all the input parameters,
    in this case formname=simple, fulltext=maly, group=subject,
    and sort=title.
   The results page is dynamic and depends on the parameters
    passed.
   A full URL like
    http://archon.cs.odu.edu:8066/archon/servlet/searc
    can be generated from the request URL by prepending the website
    address.
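
The two steps on this slide, reading the parameters out of a request URL and building the full URL, can be sketched as follows. The class `UrlGenerator` and its method names are hypothetical; the sketch does no URL decoding and only assumes that the request URL looks like the one in the sample log entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: request URL from the log -> parameters and full URL.
class UrlGenerator {
    // Splits the query string into name/value pairs, keeping their order.
    // Values are left exactly as logged (no percent-decoding).
    static Map<String, String> queryParams(String requestUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = requestUrl.indexOf('?');
        if (q < 0) {
            return params;
        }
        for (String pair : requestUrl.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            if (!name.isEmpty()) {
                params.put(name, value);
            }
        }
        return params;
    }

    // Puts the website address in front of the request URL.
    static String fullUrl(String siteAddress, String requestUrl) {
        String base = siteAddress.endsWith("/")
                ? siteAddress.substring(0, siteAddress.length() - 1)
                : siteAddress;
        String path = requestUrl.startsWith("/") ? requestUrl : "/" + requestUrl;
        return base + path;
    }

    public static void main(String[] args) {
        String request = "/archon/servlet/search?formname=simple&fulltext=maly"
                + "&group=subject&sort=title";
        System.out.println(queryParams(request));
        System.out.println(fullUrl("http://archon.cs.odu.edu:8066", request));
    }
}
```

Keeping the query string intact when building the full URL matters, since the parameters are what make the server regenerate the same results page.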
Indexing Dynamic Pages…
   [Diagram] Apache logs -> Parse and generate URL -> Results page (could be any file type) -> Analyzer -> Index
Implementation
   The above flow chart describes the way
    Apache logs are parsed and URLs are
    generated.
   It shows how the results pages are
    fetched from the URLs and their text extracted.
   The results page is sent for analysis;
    Lucene then generates the index, which
    is used for future searches.
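
The flow from log lines to text ready for indexing can be sketched end to end in plain Java. Everything here is an illustrative assumption: the class `LogIndexingPipeline`, the crude tag-stripping extraction, and the `fetcher` function that stands in for the HTTP fetch so the sketch runs without a network. The indexing step itself is left out; the returned map is what would be handed to the indexer.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical end-to-end sketch: log lines -> URLs -> pages -> extracted text.
class LogIndexingPipeline {
    // Pulls the request URL out of the quoted part of a log line.
    // Returns null for lines that are not simple GET requests.
    static String requestUrl(String logLine) {
        int open = logLine.indexOf('"');
        int close = logLine.lastIndexOf('"');
        if (open < 0 || close <= open) {
            return null;
        }
        String[] parts = logLine.substring(open + 1, close).split(" ");
        return parts.length == 3 && parts[0].equals("GET") ? parts[1] : null;
    }

    // Crude text extraction for HTML: drop tags, collapse whitespace.
    static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Returns full URL -> extracted text, in the order the URLs first appear.
    static Map<String, String> run(List<String> logLines, String siteAddress,
                                   Function<String, String> fetcher) {
        Map<String, String> extracted = new LinkedHashMap<>();
        for (String line : logLines) {
            String request = requestUrl(line);
            if (request == null) {
                continue; // skip lines we cannot turn into a URL
            }
            String url = siteAddress + request;
            if (extracted.containsKey(url)) {
                continue; // fetch each distinct URL only once
            }
            String page = fetcher.apply(url);
            if (page != null) {
                extracted.put(url, extractText(page));
            }
        }
        return extracted;
    }
}
```

Passing the fetch step in as a function keeps the sketch testable offline; a real run would supply a function that opens the URL and reads the response body.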
Demo
   Results:
   Hardware environment
       Dedicated machine for indexing: No, but nominal usage at time
        of indexing.
       CPU: Intel x86 P4 2.8 GHz
       RAM: 512 DDR
       Drive configuration: IDE 7200 rpm
   Software environment
       Lucene version: 1.4
       Java version: 1..2
       OS version: Windows 2000
       Apache web server version: 1.3 to 2.0
       Location of index: local
Create Index
The IndexByLog.java file reads the access logs on the local computer, generates
the URLs, fetches the results pages from those URLs, extracts their text,
indexes them, and stores the index in the LuceneIndex folder.
Files extraction and Index
Creation
Searching at the prompt
Searching on the web
Results on the web
Conclusion
   It is very easy to implement efficient and
    powerful search engines using Lucene
   Lucene can be used to index dynamic pages
    and database hidden pages
   Web server access logs can be used to
    generate URLs, and Java with the Lucene API can be
    used to fetch and index database hidden
    pages.
   There are some security risks involved, as the
    logs can reveal which users ran which
    searches, along with other sensitive information.
Questions?

Mais conteúdo relacionado

Mais procurados

Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Adrien Grand
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingabial
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsOpenSource Connections
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Improved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert MuirImproved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert Muirlucenerevolution
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAsad Abbas
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetuprcmuir
 

Mais procurados (20)

Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
 
Introduction To Apache Lucene
Introduction To Apache LuceneIntroduction To Apache Lucene
Introduction To Apache Lucene
 
Lucene
LuceneLucene
Lucene
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Azure search
Azure searchAzure search
Azure search
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Building a Search Engine Using Lucene
Building a Search Engine Using LuceneBuilding a Search Engine Using Lucene
Building a Search Engine Using Lucene
 
Search Lucene
Search LuceneSearch Lucene
Search Lucene
 
Improved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert MuirImproved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert Muir
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Solr Architecture
Solr ArchitectureSolr Architecture
Solr Architecture
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 

Destaque

Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisJosiane Gamgo
 
The search engine index
The search engine indexThe search engine index
The search engine indexCJ Jenkins
 
Search Engine Powerpoint
Search Engine PowerpointSearch Engine Powerpoint
Search Engine Powerpoint201014161
 
Perspectivas da web semântica para a biblioteconomia
Perspectivas da web semântica para a biblioteconomiaPerspectivas da web semântica para a biblioteconomia
Perspectivas da web semântica para a biblioteconomiaNaira Michelle Alves Pereira
 
Construcción de una ontología OWL con protégé 4
Construcción de una ontología OWL con protégé 4Construcción de una ontología OWL con protégé 4
Construcción de una ontología OWL con protégé 4Taniana Rodriguez
 
Blogging With Word Press -Social Media Bootcamp
Blogging With Word Press -Social Media BootcampBlogging With Word Press -Social Media Bootcamp
Blogging With Word Press -Social Media Bootcampwesleyzhao
 
Ted Dunning - Whither Hadoop
Ted Dunning - Whither HadoopTed Dunning - Whither Hadoop
Ted Dunning - Whither HadoopEd Kohlwey
 
Difference Between Crawling, Indexing and Caching
Difference Between Crawling, Indexing and Caching Difference Between Crawling, Indexing and Caching
Difference Between Crawling, Indexing and Caching Laxman Kotte
 
Search Engine Optimization - Social Media Bootcamp
Search Engine Optimization - Social Media BootcampSearch Engine Optimization - Social Media Bootcamp
Search Engine Optimization - Social Media Bootcampwesleyzhao
 
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Carlos Castillo (ChaTo)
 
Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneJosiane Gamgo
 
Apache Solr-Webinar
Apache Solr-WebinarApache Solr-Webinar
Apache Solr-WebinarEdureka!
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scriptingTony Fabeen
 

Destaque (15)

Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
The search engine index
The search engine indexThe search engine index
The search engine index
 
Search Engine Powerpoint
Search Engine PowerpointSearch Engine Powerpoint
Search Engine Powerpoint
 
Perspectivas da web semântica para a biblioteconomia
Perspectivas da web semântica para a biblioteconomiaPerspectivas da web semântica para a biblioteconomia
Perspectivas da web semântica para a biblioteconomia
 
Construcción de una ontología OWL con protégé 4
Construcción de una ontología OWL con protégé 4Construcción de una ontología OWL con protégé 4
Construcción de una ontología OWL con protégé 4
 
Blogging With Word Press -Social Media Bootcamp
Blogging With Word Press -Social Media BootcampBlogging With Word Press -Social Media Bootcamp
Blogging With Word Press -Social Media Bootcamp
 
Ted Dunning - Whither Hadoop
Ted Dunning - Whither HadoopTed Dunning - Whither Hadoop
Ted Dunning - Whither Hadoop
 
Difference Between Crawling, Indexing and Caching
Difference Between Crawling, Indexing and Caching Difference Between Crawling, Indexing and Caching
Difference Between Crawling, Indexing and Caching
 
Search Engine Optimization - Social Media Bootcamp
Search Engine Optimization - Social Media BootcampSearch Engine Optimization - Social Media Bootcamp
Search Engine Optimization - Social Media Bootcamp
 
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
 
Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache Lucene
 
Solr
SolrSolr
Solr
 
Apache Solr-Webinar
Apache Solr-WebinarApache Solr-Webinar
Apache Solr-Webinar
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scripting
 
Index types
Index typesIndex types
Index types
 

Semelhante a Intelligent crawling and indexing using lucene

Solr中国6月21日企业搜索
Solr中国6月21日企业搜索Solr中国6月21日企业搜索
Solr中国6月21日企业搜索longkeyy
 
SharePoint Developer Education Day Palo Alto
SharePoint  Developer Education Day  Palo  AltoSharePoint  Developer Education Day  Palo  Alto
SharePoint Developer Education Day Palo Altollangit
 
PyCon India 2012: Rapid development of website search in python
PyCon India 2012: Rapid development of website search in pythonPyCon India 2012: Rapid development of website search in python
PyCon India 2012: Rapid development of website search in pythonChetan Giridhar
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Netgramana
 
Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerIOSR Journals
 
EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...
EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...
EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...EPC Group
 
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, LucidworksIntroduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, LucidworksLucidworks
 
A customized web search engine [autosaved]
A customized web search engine [autosaved]A customized web search engine [autosaved]
A customized web search engine [autosaved]Mustafa Elkhiat
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache StanbolAlkuvoima
 
Design a share point 2013 architecture – the basics
Design a share point 2013 architecture – the basicsDesign a share point 2013 architecture – the basics
Design a share point 2013 architecture – the basicsAlexander Meijers
 
Apace Solr Web Development.pdf
Apace Solr Web Development.pdfApace Solr Web Development.pdf
Apace Solr Web Development.pdfAbanti Aazmin
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!Alex Kursov
 
Enterprise Search in SharePoint 2010
Enterprise Search in SharePoint 2010Enterprise Search in SharePoint 2010
Enterprise Search in SharePoint 2010bgerman
 
KnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge baseKnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge baseLaurent Alquier
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...DESTIN-Informatique.com
 

Semelhante a Intelligent crawling and indexing using lucene (20)

Solr中国6月21日企业搜索
Solr中国6月21日企业搜索Solr中国6月21日企业搜索
Solr中国6月21日企业搜索
 
Apache Lucene Searching The Web
Apache Lucene Searching The WebApache Lucene Searching The Web
Apache Lucene Searching The Web
 
SharePoint Developer Education Day Palo Alto
SharePoint  Developer Education Day  Palo  AltoSharePoint  Developer Education Day  Palo  Alto
SharePoint Developer Education Day Palo Alto
 
PyCon India 2012: Rapid development of website search in python
PyCon India 2012: Rapid development of website search in pythonPyCon India 2012: Rapid development of website search in python
PyCon India 2012: Rapid development of website search in python
 
Solr -
Solr - Solr -
Solr -
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
 
Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal Computer
 
EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...
EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...
EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...
 
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, LucidworksIntroduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
 
A customized web search engine [autosaved]
A customized web search engine [autosaved]A customized web search engine [autosaved]
A customized web search engine [autosaved]
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
 
Design a share point 2013 architecture – the basics
Design a share point 2013 architecture – the basicsDesign a share point 2013 architecture – the basics
Design a share point 2013 architecture – the basics
 
Apace Solr Web Development.pdf
Apace Solr Web Development.pdfApace Solr Web Development.pdf
Apace Solr Web Development.pdf
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
 
Enterprise Search in SharePoint 2010
Enterprise Search in SharePoint 2010Enterprise Search in SharePoint 2010
Enterprise Search in SharePoint 2010
 
KnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge baseKnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge base
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...
 
Ikenstudiolive
IkenstudioliveIkenstudiolive
Ikenstudiolive
 
R01765113122
R01765113122R01765113122
R01765113122
 

Último

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 

Último (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web

Intelligent Crawling and Indexing using Lucene

  • 1. Intelligent Crawling and Indexing using Lucene. By Shiva Thatipelli; Mohammad Zubair (Advisor)
  • 2. Contents
      - Searching
      - Indexing
      - Lucene
      - Indexing with Lucene
      - Indexing Static and Dynamic Pages
      - Extracting and Indexing Dynamic Pages
      - Implementation
      - Screens
  • 3. Searching
      - Looking up words in an index
      - Factors affecting search
      - Precision: how well the system can filter
      - Speed
      - Support for single and multiple phrase queries, results ranking, sorting, wildcard queries, and range queries
  • 4. Indexing
      - Sequential search is bad (not scalable)
      - An index speeds up selection
      - An index is a special data structure that allows rapid searching
      - Different index implementations
          - B-trees
          - Hash map
  • 5. Search Process [diagram with the labels: Query, Docs, Indexing API, Index, Hits]
  • 6. Lucene
      - High-performance, full-featured text search engine library
      - Written 100% in pure Java
      - Easy to use yet powerful API
      - Jakarta Apache product with strong open source community support
  • 7. Why Lucene?
      - Open source (not proprietary)
      - Easy to use, good documentation
      - Interoperable, e.g. an index generated by Java can be used by VB, ASP, or Perl applications
      - Powerful and highly scalable
      - Index format
          - Designed for interoperability
          - Well documented
          - Resides on the file system, in RAM, or in a custom store
  • 8. Continued
      - Algorithms
          - Efficient, fast, and optimized
      - Incremental indexing
      - Boolean query, fuzzy query, range query, multi-phrase query, wildcard query, etc.
      - Content tagging: documents as collections of terms
      - Heterogeneous documents: useful when a different set of metadata is present for different MIME types
  • 9. Indexing With Lucene
      - What type of documents can be indexed?
          - Any document that can be fetched over the net with a URL and from which text can be extracted
      - Uses an inverted index: the index stores statistics about terms in order to make term-based search more efficient
  • 10. Indexing With Lucene Contd… [diagram: HTML, XLS, Word, and PDF documents are each extracted by a Parser, then passed to the Analyzer and into the Index]
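The inverted index mentioned on slide 9 can be illustrated without Lucene. The following is a minimal sketch in modern Java (8 or later, not the Java version used in the original work); the class name `TinyInvertedIndex` and its tokenizing rule are assumptions made for illustration, and the real Lucene index additionally stores term statistics and positions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted ids of the documents containing it.
// Hypothetical illustration only; this is not Lucene's index format or API.
class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private final List<String> docNames = new ArrayList<>();

    // Adds a document and returns its id.
    int add(String name, String text) {
        int id = docNames.size();
        docNames.add(name);
        // Crude stand-in for an analyzer: lower-case, then split on non-alphanumerics.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Returns the names of the documents containing the term, in insertion order.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids != null) {
            for (int id : ids) {
                hits.add(docNames.get(id));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("a.html", "Lucene is a search library");
        index.add("b.html", "An index speeds up search");
        System.out.println(index.search("search"));
        System.out.println(index.search("lucene"));
    }
}
```

A lookup touches only the posting set for the queried term, which is why it scales better than the sequential search described on slide 4.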
  • 11. Indexing Static and Dynamic Pages
      - Static pages are HTML, XLS, Word, and PDF documents on the web that can be easily crawled and indexed by search engines like Google and Yahoo.
      - Static pages on the internet can be passed into Lucene, indexed, and searched using direct URLs.
      - Dynamic pages are generated from submitted parameters; search results pages and database-hidden pages, for example, cannot be indexed with direct URLs.
      - To index dynamic pages we need the parameters that users submitted to generate those pages.
  • 12. Extracting and Indexing Dynamic Pages
      - Extracting dynamic web pages, which can also be called database-hidden pages, needs some kind of input to generate the URLs.
      - To get the input parameters, we used Apache access logs, which contain each user request as a URL.
      - A sample entry in an Apache access log is as follows:
        127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "GET /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title HTTP/1.1" 200 9560
  • 13. Extracting and Indexing Dynamic Pages Contd...
      - The entry contains the IP address of the computer accessing the information, the date and time of access, the method called, the request URL, the HTTP version, and the HTTP code.
      - The request URL is the part that carries all the input parameters, in this case formname=simple, fulltext=maly, group=subject, and sort=title.
      - The results page is dynamic and depends on the parameters passed.
      - A full URL like http://archon.cs.odu.edu:8066/archon/servlet/searc can be generated from the request URL by adding the website address in front of it.
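The log-to-URL step on slides 12 and 13 could look roughly like the sketch below. It assumes the log uses the Common Log Format shown in the sample entry, and it keeps only GET requests that returned status 200; that filter, the class name `AccessLogUrls`, and both method names are assumptions for illustration rather than the authors' code. Parameter values are not URL-decoded.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns access-log lines into full URLs for dynamic pages.
class AccessLogUrls {
    // Common Log Format: host ident user [timestamp] "METHOD url PROTOCOL" status bytes
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+)$");

    // Returns siteAddress + request URL for a successful GET, or empty for any other line.
    static Optional<String> toUrl(String siteAddress, String logLine) {
        Matcher m = LINE.matcher(logLine.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String method = m.group(3);
        String requestUrl = m.group(4);
        int status = Integer.parseInt(m.group(6));
        // Assumption: only pages that were actually served are worth re-fetching.
        if (!method.equals("GET") || status != 200) {
            return Optional.empty();
        }
        return Optional.of(siteAddress + requestUrl);
    }

    // Splits the query string of a request URL into name/value pairs (no URL decoding).
    static Map<String, String> queryParams(String requestUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = requestUrl.indexOf('?');
        if (q < 0) {
            return params;
        }
        for (String pair : requestUrl.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "
            + "\"GET /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title HTTP/1.1\" 200 9560";
        System.out.println(toUrl("http://archon.cs.odu.edu:8066", line).orElse("(skipped)"));
        System.out.println(queryParams("/archon/servlet/search?formname=simple&fulltext=maly"));
    }
}
```

Passing the website address as an argument keeps the parser independent of any one server, since the log itself records only the request path and query string.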
  • 14. Indexing Dynamic Pages… [flow chart: Apache logs -> parse and generate URL -> results page (could be any file type) -> Analyzer -> Index]
  • 15. Implementation
      - The flow chart above describes how Apache logs are parsed and URLs are generated.
      - It shows how the results pages are fetched and extracted from the URLs.
      - The results page is sent for analysis, and Lucene then generates the index that will be used for future searches.
  • 16. Demo
  • 17. Results
      - Hardware environment
          - Dedicated machine for indexing: No, but nominal usage at time of indexing
          - CPU: Intel x86 P4 2.8 GHz
          - RAM: 512 DDR
          - Drive configuration: IDE 7200 rpm
      - Software environment
          - Lucene version: 1.4
          - Java version: 1..2
          - OS version: Windows 2000
          - Apache web server version: 1.3 to 2.0
          - Location of index: local
  • 18. Create Index: the IndexByLog.java file reads the access logs on the local computer, generates the URLs, fetches and extracts the results pages from those URLs, indexes them, and stores the index in the LuceneIndex folder.
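The fetch-and-extract stage that slide 18 attributes to IndexByLog.java is not shown in the slides, so the following is only a sketch of how such a stage might be arranged. The class `ResultPageExtractor`, its methods, and the injected `fetcher` function are all hypothetical; the tag-stripping regex is a rough stand-in for a real HTML parser (it does not remove script or style content), and non-HTML results would need their own parser. The returned map is what would then be handed to the Lucene analyzer and index writer.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the "fetch and extract" stage; not the authors' IndexByLog.java.
class ResultPageExtractor {
    // Very rough HTML-to-text step: replace tags with spaces, then collapse whitespace.
    static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Fetches each distinct URL once and returns url -> extracted text, in first-seen order.
    // The fetcher is injected so the network call can be swapped out or stubbed.
    static Map<String, String> extractAll(List<String> urls, Function<String, String> fetcher) {
        Map<String, String> pages = new LinkedHashMap<>();
        for (String url : new LinkedHashSet<>(urls)) {
            String body = fetcher.apply(url);
            if (body != null) { // a null body means the fetch failed; skip that URL
                pages.put(url, extractText(body));
            }
        }
        return pages;
    }
}
```

De-duplicating the URLs first matters here because the same search is likely to appear many times in an access log, and each repeat would otherwise cost another request to the server.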
  • 19. Files extraction and Index Creation
  • 23. Conclusion
      - It is easy to implement efficient and powerful search engines using Lucene.
      - Lucene can be used to index dynamic pages and database-hidden pages.
      - Web server access logs can be used to generate URLs, and Java with the Lucene API can be used to fetch and index database-hidden pages.
      - There are some security risks involved, because this approach can reveal which searches users performed, along with other sensitive information.