Can you be dynamic and fast? “Miss Marple and the case of the Missing MIPS” Zoë Slattery
Agenda
Index and search
1. Lagoze, C., Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing. Volume 9(3), 16-18, 2005.
Options for information retrieval

                          Egothor    Xapian                          Lucene
Implementation language   Java       C++                             Java
Language bindings         None       Perl, Python, PHP, Java, TCL    None
Language ports            None       None                            C++, Perl, PHP, C#
License                   BSD-like   GPL                             Apache 2
Lucene [2]
[Figure. Application: gather data (DB, web, file system), get user query, present search results. Lucene: index documents, search index. Both sit between the user and the index.]
2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005.
Lucene indexing
start -> 1. Documents -> Analysis -> 2. Token stream -> Index creation -> 3. Inverted index -> Optimise -> 4. Optimised inverted index -> end
1. Documents: "Oh for a muse of fire that would ascend the brightest heaven of invention....."
2. Token stream: [fire] [ascend] [bright] [heaven]
3. Inverted index (Terms -> Documents):
   fire   -> Henry V, Scouting for boys...
   ascend -> Aerospace, Henry V...
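The term-to-documents mapping on this slide can be sketched in a few lines of PHP. This is only an illustration under stated assumptions: `build_inverted_index()` is a hypothetical helper, not part of Lucene or Zend Search Lucene, the "Aerospace" text is made up, and the sketch does no stemming (so "brightest" is not reduced to "bright" as on the slide).

```php
<?php
// Hypothetical helper for illustration only; not a Lucene API.

/**
 * Build a map from each term to the names of the documents containing it.
 *
 * @param array    $documents document name => document text
 * @param string[] $stopWords terms to leave out of the index
 * @return array   term => sorted list of document names
 */
function build_inverted_index(array $documents, array $stopWords = []): array
{
    $index = [];
    foreach ($documents as $name => $text) {
        // Analysis: lower-case the text and split it into runs of letters.
        preg_match_all('/[a-z]+/', strtolower($text), $matches);
        // Index creation: record each distinct term once per document.
        foreach (array_unique($matches[0]) as $term) {
            if (in_array($term, $stopWords, true)) {
                continue;
            }
            $index[$term][] = $name;
        }
    }
    // Keep terms and document lists in a stable, sorted order.
    ksort($index);
    foreach ($index as &$names) {
        sort($names);
    }
    unset($names);
    return $index;
}

$index = build_inverted_index(
    [
        'Henry V'   => 'Oh for a muse of fire that would ascend the brightest heaven of invention',
        'Aerospace' => 'Rockets ascend quickly', // made-up example text
    ],
    ['a', 'for', 'of', 'oh', 'that', 'the', 'would']
);

echo implode(', ', $index['ascend']), "\n"; // documents containing "ascend"
```

Looking up a term is then a single array access, which is the property that makes an inverted index suitable for search.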
Indexing speed

                            Java + JIT   Java   PHP
Time to index /seconds      4            32     167
Time to optimise /seconds   0.3          3      43
Total time /seconds         4.3          35     210

Ouch! Nearly 50 times as fast in Java
Why is the performance so bad?
Analysis - Java

Analyzing "A Quick Brown Fox jumped over the Lazy Dog"
  StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
  SimpleAnalyzer:   [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
  StopAnalyzer:     [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
Analyzing "XY&Z Corporation - xyz@example.com"
  StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
  SimpleAnalyzer:   [xy] [z] [corporation] [xyz] [example] [com]
  StopAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
Analysis - PHP

Analysing "A Quick Brown Fox jumped over the Lazy Dog"
  Default (lower case) filter: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
  Stop words filter:           [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
  Short words filter:          [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
Analysing "XY&Z Corporation - xyz@example.com"
  Default (lower case) filter: [xy] [z] [corporation] [xyz] [example] [com]
  Stop words filter:           [xy] [z] [corporation] [xyz] [example] [com]
  Short words filter:          [xy] [corporation] [xyz] [example] [com]
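The three PHP filter outputs above can be reproduced with a small sketch. The function names are hypothetical and are not the Zend Search Lucene classes. The sketch assumes that tokens are runs of ASCII letters, that the stop list is just "a" and "the", and that "short" means fewer than two characters. These assumptions were chosen because they match the output on the slide.

```php
<?php
// Illustrative helpers only; not the Zend_Search_Lucene analysis classes.

/** Default (lower case) filter: split into runs of letters, lower-cased. */
function tokenize_lower(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

/** Stop words filter: drop every token found in the stop list. */
function remove_stop_words(array $tokens, array $stopWords): array
{
    return array_values(array_diff($tokens, $stopWords));
}

/** Short words filter: drop tokens shorter than $minLength characters. */
function remove_short_words(array $tokens, int $minLength = 2): array
{
    $kept = [];
    foreach ($tokens as $token) {
        if (strlen($token) >= $minLength) {
            $kept[] = $token;
        }
    }
    return $kept;
}

/** Render tokens in the bracketed style used on the slides. */
function format_tokens(array $tokens): string
{
    $parts = [];
    foreach ($tokens as $token) {
        $parts[] = '[' . $token . ']';
    }
    return implode(' ', $parts);
}

$tokens = tokenize_lower('XY&Z Corporation - xyz@example.com');
echo format_tokens($tokens), "\n";
echo format_tokens(remove_stop_words($tokens, ['a', 'the'])), "\n";
echo format_tokens(remove_short_words($tokens)), "\n";
```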
Compare indexes
[Figure: the Java and PHP indexes contain the same 663 terms.]
Execution profiles
Java profile
Small problems with TPTP...

                            Java   Java + profile
Time to index /seconds      2.3    687258
Time to optimise /seconds   0.3    673851
% time in indexing          88     50
PHP profile
No problems with this tool

                            PHP   PHP + profile
Time to index /seconds      5     70
Time to optimise /seconds   3     55
% time in indexing          63    56
Look at the normalize() code

public function normalize(Token $srcToken)
{
    $newToken = new Token(
        strtolower($srcToken->getTermText()),
        $srcToken->getStartOffset(),
        $srcToken->getEndOffset()
    );
    $newToken->setPositionIncrement($srcToken->getPositionIncrement());
    return $newToken;
}
The normalize() function
Sum( ) = 2.92; 18.99 - 2.92 = 16.07
Micro benchmark

<?php
require_once "Token.php";
require_once "LowerCase.php";

$token = new Token("GO", 105, 107);
$filter = new LowerCase();

for ($i = 0; $i < 10000000; $i++) {
    $norm_token = $filter->normalize($token);
}
?>
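The slide's benchmark depends on Token.php and LowerCase.php, which are not shown. A self-contained variant might look like the sketch below. The `Token` and `LowerCase` classes here are minimal stand-ins written for this example, modelled on the normalize() code from the earlier slide, and the iteration count is reduced so the script finishes quickly. The timing wrapper `time_normalize()` is also an addition, not part of the original benchmark.

```php
<?php
// Stand-in classes: the real Token.php and LowerCase.php are not reproduced here.
class Token
{
    private $termText;
    private $startOffset;
    private $endOffset;
    private $positionIncrement = 1;

    public function __construct(string $termText, int $startOffset, int $endOffset)
    {
        $this->termText = $termText;
        $this->startOffset = $startOffset;
        $this->endOffset = $endOffset;
    }

    public function getTermText(): string { return $this->termText; }
    public function getStartOffset(): int { return $this->startOffset; }
    public function getEndOffset(): int { return $this->endOffset; }
    public function getPositionIncrement(): int { return $this->positionIncrement; }
    public function setPositionIncrement(int $increment): void { $this->positionIncrement = $increment; }
}

class LowerCase
{
    // Same shape as the normalize() shown on the slide: copy, then lower-case.
    public function normalize(Token $srcToken): Token
    {
        $newToken = new Token(
            strtolower($srcToken->getTermText()),
            $srcToken->getStartOffset(),
            $srcToken->getEndOffset()
        );
        $newToken->setPositionIncrement($srcToken->getPositionIncrement());
        return $newToken;
    }
}

/** Run normalize() $iterations times and return the elapsed wall-clock seconds. */
function time_normalize(int $iterations): float
{
    $token = new Token('GO', 105, 107);
    $filter = new LowerCase();
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $normToken = $filter->normalize($token);
    }
    return microtime(true) - $start;
}

$iterations = 100000; // the slide uses 10000000
printf("%d calls took %.3f s\n", $iterations, time_normalize($iterations));
```

Isolating the one function in a loop like this removes file I/O and the rest of the indexing pipeline from the measurement, so the result reflects only the cost of the call and its callees.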
normalize() opcodes

compiled vars: !0 = $srcToken, !1 = $newToken
line  #   op                      ext  return  operands
----------------------------------------------------------------------------
11    0   RECV                                 1
13    1   ZEND_FETCH_CLASS             :0      'Token'
      2   NEW                          $1      :0
      3   ZEND_INIT_METHOD_CALL                !0, 'getTermText'
      4   DO_FCALL_BY_NAME        0
      5   SEND_VAR_NO_REF                      $3
      6   DO_FCALL                1            'strtolower'
      7   SEND_VAR_NO_REF                      $4
14    8   ZEND_INIT_METHOD_CALL                !0, 'getStartOffset'
      9   DO_FCALL_BY_NAME        0
      10  SEND_VAR_NO_REF                      $6
15    11  ZEND_INIT_METHOD_CALL                !0, 'getEndOffset'
      12  DO_FCALL_BY_NAME        0
      13  SEND_VAR_NO_REF                      $8
      14  DO_FCALL_BY_NAME        3
      15  ASSIGN                               !1, $1
      16  ......
System profile
1. Convert to lower case
2. Look up opcodes
How Xdebug works

Script execution:
  ZEND_INIT_METHOD_CALL
  DO_FCALL_BY_NAME
    Call out to profiler - start time
    Execute function
    Call out to profiler - end time
The normalize() function
Sum( ) = 2.92; 18.99 - 2.92 = 16.07
The remaining 16.07 is consumed in setting up functions to be run
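The subtraction on this slide is the usual "self time" calculation: the time attributed to a function minus the time spent in the functions it calls. The sketch below assumes that reading of the figures, with 18.99 as the inclusive time for normalize() and 2.92 as the total for its callees. The slide does not label the units, and `self_time()` is a hypothetical helper.

```php
<?php
/**
 * Self time of a function: inclusive time minus the time spent in its callees.
 * Rounded to two decimals to match the precision shown on the slide.
 */
function self_time(float $inclusive, array $calleeTimes): float
{
    return round($inclusive - array_sum($calleeTimes), 2);
}

// The slide gives only the callee total (2.92), so it is passed as a single entry.
printf("%.2f\n", self_time(18.99, [2.92])); // prints 16.07
```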
Why is function calling faster in Java?
PHP profile
Look at the call to normalize()

$token = $this->normalize(
    new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos)
);

public function normalize(Token $srcToken)
{
    $newToken = new Token(
        strtolower($srcToken->getTermText()),
        $srcToken->getStartOffset(),
        $srcToken->getEndOffset()
    );
    $newToken->setPositionIncrement($srcToken->getPositionIncrement());
    return $newToken;
}
Look at the call to normalize(): normalize() recoded

$token = $this->normalize(
    new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos)
);

public function normalize(Token $srcToken)
{
    // The caller passes a freshly created token, so it can be modified in place.
    $srcToken->setTermText(strtolower($srcToken->getTermText()));
    return $srcToken;
}
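A quick way to check that the recoded version is equivalent for this call site is to run both variants on identical fresh tokens and compare the results. The sketch below does that with a stand-in `SketchToken` class and two free functions. All three names are hypothetical, and the real Zend_Search_Lucene_Analysis_Token is not reproduced.

```php
<?php
// Stand-in token class for illustration; not the Zend Framework class.
class SketchToken
{
    private $termText;
    private $startOffset;
    private $endOffset;
    private $positionIncrement = 1;

    public function __construct(string $termText, int $startOffset, int $endOffset)
    {
        $this->termText = $termText;
        $this->startOffset = $startOffset;
        $this->endOffset = $endOffset;
    }

    public function getTermText(): string { return $this->termText; }
    public function setTermText(string $termText): void { $this->termText = $termText; }
    public function getStartOffset(): int { return $this->startOffset; }
    public function getEndOffset(): int { return $this->endOffset; }
    public function getPositionIncrement(): int { return $this->positionIncrement; }
    public function setPositionIncrement(int $increment): void { $this->positionIncrement = $increment; }
}

/** Original shape: build a second token and copy every field across. */
function normalize_copy(SketchToken $src): SketchToken
{
    $new = new SketchToken(
        strtolower($src->getTermText()),
        $src->getStartOffset(),
        $src->getEndOffset()
    );
    $new->setPositionIncrement($src->getPositionIncrement());
    return $new;
}

/** Recoded shape: lower-case the term text on the token that was passed in. */
function normalize_in_place(SketchToken $src): SketchToken
{
    $src->setTermText(strtolower($src->getTermText()));
    return $src;
}

$copied  = normalize_copy(new SketchToken('GO', 105, 107));
$inPlace = normalize_in_place(new SketchToken('GO', 105, 107));

// == on two objects of the same class compares their properties.
var_dump($copied == $inPlace);
```

The in-place version saves one object construction and four method calls per token. It is only safe because the caller never reuses the token it passes in. A caller that kept a reference to the original would see its term text change.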
After fix
Performance improvement?

                            PHP + fix   PHP   Java   Java + JIT
Time to index /seconds      151         167   32     4
Time to optimise /seconds   43          43    3      0.3
Total time /seconds         194         210   35     4.3

9.5 % improvement (in time to index, 167 to 151 seconds)
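The improvement figure can be checked directly from the table. The sketch below uses a hypothetical `percent_improvement()` helper. On these numbers, the indexing-time improvement works out to about 9.6 %, which corresponds to the slide's 9.5 %, while the improvement in total time is smaller because optimisation time is unchanged.

```php
<?php
/** Percentage reduction from $before to $after. */
function percent_improvement(float $before, float $after): float
{
    return ($before - $after) / $before * 100.0;
}

printf("Time to index: %.1f %%\n", percent_improvement(167, 151)); // 9.6 %
printf("Total time:    %.1f %%\n", percent_improvement(210, 194)); // 7.6 %
```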
Conclusions
3. http://framework.zend.com/issues/browse/ZF-3683
4. Prechelt, L. An empirical comparison of seven programming languages. Computer. Volume 33(10), 23-29, 2000.
Options for PHP
[Figure: decision flowchart with Y/N branches.]
Questions: Do you care about speed? Only need basic features? Can support Java environment? Use a Web Service?
Outcomes: Use Zend Search Lucene; Use Lucene via a Java bridge; Use SOLR as web service; No Lucene solution today [5]
5. http://pecl.php.net/package/clucene
Other useful links
