SlideShare uma empresa Scribd logo
1 de 36
Can you be dynamic and fast? “ Miss Marple and the case of the Missing MIPS” Zoë Slattery
Agenda ,[object Object],[object Object],[object Object],[object Object]
Index and search ,[object Object],[object Object],[object Object],[object Object],[object Object],1. Lagoze, C. Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing. Volume 9(3), 16-18, 2005.
Options for information retrieval ,[object Object],[object Object],[object Object],[object Object],Egothor Xapian Lucene Implementation language Language bindings Language ports License Java None None BSD like C++ Perl, Python, PHP, Java, TCL None GPL Java None C++, Perl,  PHP, C# Apache 2
Lucene [2] 2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005. DB Web File system Get user  query Present search  results Index Index documents Search index Gather data Lucene Application User
Lucene indexing start 3. Inverted index 1. Documents Analysis Index creation Optimise 4. Optimised inverted index . Oh for a muse of fire that would  acsend the brightest  heaven of  invention..... fire ascend ... Henry V, Scouting for boys... Aerospace, Henry V... Terms Documents end [fire]  [ascend]  [bright] [heaven] 2. Token stream
Agenda ,[object Object],[object Object],[object Object],[object Object]
Indexing speed ,[object Object],[object Object],[object Object],Java + JIT Java PHP 4 32 167 Time to index /seconds 0.3 3 43 Time to optimise /seconds 4.3 35 210 Total time Ouch! nearly 50 times as fast in Java
Why is the performance so bad? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Analysis - Java Analyzing "A Quick Brown Fox jumped over the Lazy Dog" StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]  SimpleAnalyzer: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]  StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]  Analyzing "XY&Z Corporation - xyz@example.com" StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]  SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]  StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
Analysis - PHP Analysing "A Quick Brown Fox jumped over the Lazy Dog" Default (lower case) filter: [a]  [quick]  [brown]  [fox]  [jumped]  [over]  [the]  [lazy]  [dog]  Stop words filter: [quick]  [brown]  [fox]  [jumped]  [over]  [lazy]  [dog]  Short words filter: [quick]  [brown]  [fox]  [jumped]  [over]  [the]  [lazy]  [dog]  Analysing "XY&Z Corporation - xyz@example.com" Default (lower case) filter: [xy]  [z]  [corporation]  [xyz]  [example]  [com]  Stop words filter: [xy]  [z]  [corporation]  [xyz]  [example]  [com]  Short words filter: [xy]  [corporation]  [xyz]  [example]  [com]
Compare indexes Same 663 terms java php
Agenda ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Execution profiles ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Java profile
Small problems with TPTP... ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Java Java + profile 2.3 687258 Time to index /seconds 0.3 673851 Time to optimise /seconds 88 50 % time in indexing
PHP profile
No problems with this tool ,[object Object],[object Object],[object Object],[object Object],[object Object],PHP PHP + profile 5 70 Time to index /seconds 3 55 Time to optimise /seconds 63 56 % time in indexing
look at the normalize() code public function normalize(Token $srcToken ) {   $newToken = new Token( strtolower( $srcToken->getTermText() ), $srcToken->getStartOffset(), $srcToken->getEndOffset()); $newToken->setPositionIncrement($srcToken->getPositionIncrement()); return $newToken; }
The normalize() function Sum( ) = 2.92;  18.99 – 2.92 =  16.07
Micro benchmark <?php  require_once &quot;Token.php&quot;;  require_once &quot;LowerCase.php&quot;;  $token = new Token(&quot;GO&quot;, 105, 107);  $filter = new LowerCase();  for ($i=0; $i < 10000000; $i++) {  $norm_token = $filter->normalize($token);  }  ?>
normalize() opcodes compiled vars:  !0 = $srcToken, !1 = $newToken  line  #  op  ext  return  operands  ----------------------------------------------------------------------------  11  0  RECV 1  13  1  ZEND_FETCH_CLASS :0 'Token'  2  NEW $1 :0  3  ZEND_INIT_METHOD_CALL !0, 'getTermText'  4  DO_FCALL_BY_NAME 0  5  SEND_VAR_NO_REF $3  6  DO_FCALL 1  'strtolower'  7  SEND_VAR_NO_REF $4  14  8  ZEND_INIT_METHOD_CALL !0, 'getStartOffset'  9  DO_FCALL_BY_NAME 0  10  SEND_VAR_NO_REF $6  15  11  ZEND_INIT_METHOD_CALL  !0, 'getEndOffset'  12  DO_FCALL_BY_NAME 0  13  SEND_VAR_NO_REF $8  14  DO_FCALL_BY_NAME 3  15  ASSIGN  !1, $1  16  ......
System profile 1. Convert to lower case 2. Look up opcodes
How Xdebug works Script execution ,[object Object],[object Object],Execute function Call out to profiler – start time  Call out to profiler – end time  ZEND_INIT_METHOD_CALL DO_FCALL_BY_NAME
The normalize() function Sum( ) = 2.92;  18.99 – 2.92 =  16.07   Is consumed in setting up functions to be run
Why is function calling faster in Java? ,[object Object],[object Object],[object Object]
Agenda ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
PHP profile
look at the call to normalize() $token = $this->normalize( new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos)); public function normalize(Token $srcToken ) {   $newToken = new Token(strtolower( $srcToken->getTermText() ), $srcToken->getStartOffset(), $srcToken->getEndOffset()); $newToken->setPositionIncrement($srcToken->getPositionIncrement()); return $newToken; }
look at the call to normalize() normalize() recoded.... $token = $this->normalize( new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos)); public function normalize (Token $srcToken) { $ srcToken->setTermText(strtolower($srcToken->getTermtext())); return $srcToken; }
After fix
Performance improvement? PHP + fix PHP 151 167 Time to index /seconds 43 43 Time to optimise /seconds Java  32 3 35 194 210 Total time 9.5 % improvement Java + JIT 4 0.3 4.3
Agenda ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Conclusions ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],3.  http://framework.zend.com/issues/browse/ZF-3683 4. Prechelt, L. An empirical comparison of seven programming languages. Computer. Volume 33(10), 23-29, 2000.
Options for PHP  Y Y Y N N N N Y 5.  http://pecl.php.net/package/clucene Do you  care about  speed? Use Zend  Search Lucene Only  need basic  features? Can  support Java  environment? Use a Web  Service? Use Lucene via a Java bridge No Lucene  solution  today [5] Use SOLR as  web service
Other useful links ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Mais conteúdo relacionado

Mais procurados

Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingabial
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Luceneotisg
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Adrien Grand
 
Improved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert MuirImproved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert Muirlucenerevolution
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetuprcmuir
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneSwapnil & Patil
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy SokolenkoProvectus
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Introduction to apache lucene
Introduction to apache luceneIntroduction to apache lucene
Introduction to apache luceneShrikrishna Parab
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsOpenSource Connections
 
Search at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, TwitterSearch at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, TwitterLucidworks
 
ElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedBeyondTrees
 

Mais procurados (20)

Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
 
Improved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert MuirImproved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert Muir
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Introduction to apache lucene
Introduction to apache luceneIntroduction to apache lucene
Introduction to apache lucene
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Lucene
LuceneLucene
Lucene
 
Search at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, TwitterSearch at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, Twitter
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
ElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learned
 
Azure search
Azure searchAzure search
Azure search
 

Destaque

Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneJosiane Gamgo
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scriptingTony Fabeen
 
From Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityFrom Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityStéphane Gamard
 
Lucandra
LucandraLucandra
Lucandraotisg
 
An introduction to inverted index
An introduction to inverted indexAn introduction to inverted index
An introduction to inverted indexweedge
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisJosiane Gamgo
 
The search engine index
The search engine indexThe search engine index
The search engine indexCJ Jenkins
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneRahul Jain
 
Using Solr Cloud to Tame an Index Explosion
Using Solr Cloud to Tame an Index ExplosionUsing Solr Cloud to Tame an Index Explosion
Using Solr Cloud to Tame an Index ExplosionLucidworks (Archived)
 
Scaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextScaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextRafał Kuć
 
Search Engine Powerpoint
Search Engine PowerpointSearch Engine Powerpoint
Search Engine Powerpoint201014161
 

Destaque (18)

Solr
SolrSolr
Solr
 
Introduction To Apache Lucene
Introduction To Apache LuceneIntroduction To Apache Lucene
Introduction To Apache Lucene
 
Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache Lucene
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scripting
 
Index types
Index typesIndex types
Index types
 
From Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityFrom Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalability
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
Lucene
LuceneLucene
Lucene
 
Lucandra
LucandraLucandra
Lucandra
 
Inverted index
Inverted indexInverted index
Inverted index
 
An introduction to inverted index
An introduction to inverted indexAn introduction to inverted index
An introduction to inverted index
 
Introduction to solr
Introduction to solrIntroduction to solr
Introduction to solr
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
The search engine index
The search engine indexThe search engine index
The search engine index
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Using Solr Cloud to Tame an Index Explosion
Using Solr Cloud to Tame an Index ExplosionUsing Solr Cloud to Tame an Index Explosion
Using Solr Cloud to Tame an Index Explosion
 
Scaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextScaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - Sematext
 
Search Engine Powerpoint
Search Engine PowerpointSearch Engine Powerpoint
Search Engine Powerpoint
 

Semelhante a Search Lucene

PHP Performance: Principles and tools
PHP Performance: Principles and toolsPHP Performance: Principles and tools
PHP Performance: Principles and tools10n Software, LLC
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierDatabricks
 
540slidesofnodejsbackendhopeitworkforu.pdf
540slidesofnodejsbackendhopeitworkforu.pdf540slidesofnodejsbackendhopeitworkforu.pdf
540slidesofnodejsbackendhopeitworkforu.pdfhamzadamani7
 
これからのPerlプロダクトのかたち(YAPC::Asia 2013)
これからのPerlプロダクトのかたち(YAPC::Asia 2013)これからのPerlプロダクトのかたち(YAPC::Asia 2013)
これからのPerlプロダクトのかたち(YAPC::Asia 2013)goccy
 
Docker interview Questions-3.pdf
Docker interview Questions-3.pdfDocker interview Questions-3.pdf
Docker interview Questions-3.pdfYogeshwaran R
 
Nt1310 Unit 3 Language Analysis
Nt1310 Unit 3 Language AnalysisNt1310 Unit 3 Language Analysis
Nt1310 Unit 3 Language AnalysisNicole Gomez
 
T3CON09 - FLOW3-based Intranet – first Experiences
T3CON09 - FLOW3-based Intranet – first ExperiencesT3CON09 - FLOW3-based Intranet – first Experiences
T3CON09 - FLOW3-based Intranet – first Experienceselementare teilchen GmbH
 
Celery in the Django
Celery in the DjangoCelery in the Django
Celery in the DjangoWalter Liu
 
Building scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftBuilding scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftTalentica Software
 
Python tools for testing web services over HTTP
Python tools for testing web services over HTTPPython tools for testing web services over HTTP
Python tools for testing web services over HTTPMykhailo Kolesnyk
 
Unit Test for ZF SlideShare Component
Unit Test for ZF SlideShare ComponentUnit Test for ZF SlideShare Component
Unit Test for ZF SlideShare Componentzftalk
 

Semelhante a Search Lucene (20)

PHP Performance: Principles and tools
PHP Performance: Principles and toolsPHP Performance: Principles and tools
PHP Performance: Principles and tools
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
 
540slidesofnodejsbackendhopeitworkforu.pdf
540slidesofnodejsbackendhopeitworkforu.pdf540slidesofnodejsbackendhopeitworkforu.pdf
540slidesofnodejsbackendhopeitworkforu.pdf
 
これからのPerlプロダクトのかたち(YAPC::Asia 2013)
これからのPerlプロダクトのかたち(YAPC::Asia 2013)これからのPerlプロダクトのかたち(YAPC::Asia 2013)
これからのPerlプロダクトのかたち(YAPC::Asia 2013)
 
Docker interview Questions-3.pdf
Docker interview Questions-3.pdfDocker interview Questions-3.pdf
Docker interview Questions-3.pdf
 
Nt1310 Unit 3 Language Analysis
Nt1310 Unit 3 Language AnalysisNt1310 Unit 3 Language Analysis
Nt1310 Unit 3 Language Analysis
 
T3CON09 - FLOW3-based Intranet – first Experiences
T3CON09 - FLOW3-based Intranet – first ExperiencesT3CON09 - FLOW3-based Intranet – first Experiences
T3CON09 - FLOW3-based Intranet – first Experiences
 
Dutch PHP Conference 2013: Distilled
Dutch PHP Conference 2013: DistilledDutch PHP Conference 2013: Distilled
Dutch PHP Conference 2013: Distilled
 
Celery in the Django
Celery in the DjangoCelery in the Django
Celery in the Django
 
Demo
DemoDemo
Demo
 
Apache Lucene Searching The Web
Apache Lucene Searching The WebApache Lucene Searching The Web
Apache Lucene Searching The Web
 
Building scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftBuilding scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thrift
 
Python tools for testing web services over HTTP
Python tools for testing web services over HTTPPython tools for testing web services over HTTP
Python tools for testing web services over HTTP
 
first pitch
first pitchfirst pitch
first pitch
 
werwr
werwrwerwr
werwr
 
sdfsdf
sdfsdfsdfsdf
sdfsdf
 
college
collegecollege
college
 
first pitch
first pitchfirst pitch
first pitch
 
Greenathan
GreenathanGreenathan
Greenathan
 
Unit Test for ZF SlideShare Component
Unit Test for ZF SlideShare ComponentUnit Test for ZF SlideShare Component
Unit Test for ZF SlideShare Component
 

Mais de Jeremy Coates

Cyber Security and GDPR
Cyber Security and GDPRCyber Security and GDPR
Cyber Security and GDPRJeremy Coates
 
Aspect Oriented Programming
Aspect Oriented ProgrammingAspect Oriented Programming
Aspect Oriented ProgrammingJeremy Coates
 
Testing with Codeception
Testing with CodeceptionTesting with Codeception
Testing with CodeceptionJeremy Coates
 
An introduction to Phing the PHP build system (PHPDay, May 2012)
An introduction to Phing the PHP build system (PHPDay, May 2012)An introduction to Phing the PHP build system (PHPDay, May 2012)
An introduction to Phing the PHP build system (PHPDay, May 2012)Jeremy Coates
 
An introduction to Phing the PHP build system
An introduction to Phing the PHP build systemAn introduction to Phing the PHP build system
An introduction to Phing the PHP build systemJeremy Coates
 
Insects in your mind
Insects in your mindInsects in your mind
Insects in your mindJeremy Coates
 
Hudson Continuous Integration for PHP
Hudson Continuous Integration for PHPHudson Continuous Integration for PHP
Hudson Continuous Integration for PHPJeremy Coates
 
The Uncertainty Principle
The Uncertainty PrincipleThe Uncertainty Principle
The Uncertainty PrincipleJeremy Coates
 
Exploiting Php With Php
Exploiting Php With PhpExploiting Php With Php
Exploiting Php With PhpJeremy Coates
 
What's new, what's hot in PHP 5.3
What's new, what's hot in PHP 5.3What's new, what's hot in PHP 5.3
What's new, what's hot in PHP 5.3Jeremy Coates
 
Mysql Explain Explained
Mysql Explain ExplainedMysql Explain Explained
Mysql Explain ExplainedJeremy Coates
 
Introduction to Version Control
Introduction to Version ControlIntroduction to Version Control
Introduction to Version ControlJeremy Coates
 
PHPNW Conference Update
PHPNW Conference UpdatePHPNW Conference Update
PHPNW Conference UpdateJeremy Coates
 

Mais de Jeremy Coates (17)

Cyber Security and GDPR
Cyber Security and GDPRCyber Security and GDPR
Cyber Security and GDPR
 
Aspect Oriented Programming
Aspect Oriented ProgrammingAspect Oriented Programming
Aspect Oriented Programming
 
Why is PHP Awesome
Why is PHP AwesomeWhy is PHP Awesome
Why is PHP Awesome
 
Testing with Codeception
Testing with CodeceptionTesting with Codeception
Testing with Codeception
 
An introduction to Phing the PHP build system (PHPDay, May 2012)
An introduction to Phing the PHP build system (PHPDay, May 2012)An introduction to Phing the PHP build system (PHPDay, May 2012)
An introduction to Phing the PHP build system (PHPDay, May 2012)
 
An introduction to Phing the PHP build system
An introduction to Phing the PHP build systemAn introduction to Phing the PHP build system
An introduction to Phing the PHP build system
 
Insects in your mind
Insects in your mindInsects in your mind
Insects in your mind
 
Phing
PhingPhing
Phing
 
Hudson Continuous Integration for PHP
Hudson Continuous Integration for PHPHudson Continuous Integration for PHP
Hudson Continuous Integration for PHP
 
The Uncertainty Principle
The Uncertainty PrincipleThe Uncertainty Principle
The Uncertainty Principle
 
Exploiting Php With Php
Exploiting Php With PhpExploiting Php With Php
Exploiting Php With Php
 
What's new, what's hot in PHP 5.3
What's new, what's hot in PHP 5.3What's new, what's hot in PHP 5.3
What's new, what's hot in PHP 5.3
 
Kiss Phpnw08
Kiss Phpnw08Kiss Phpnw08
Kiss Phpnw08
 
Regex Basics
Regex BasicsRegex Basics
Regex Basics
 
Mysql Explain Explained
Mysql Explain ExplainedMysql Explain Explained
Mysql Explain Explained
 
Introduction to Version Control
Introduction to Version ControlIntroduction to Version Control
Introduction to Version Control
 
PHPNW Conference Update
PHPNW Conference UpdatePHPNW Conference Update
PHPNW Conference Update
 

Último

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Último (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Search Lucene

  • 1. Can you be dynamic and fast? “ Miss Marple and the case of the Missing MIPS” Zoë Slattery
  • 2.
  • 3.
  • 4.
  • 5. Lucene [2] 2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005. DB Web File system Get user query Present search results Index Index documents Search index Gather data Lucene Application User
  • 6. Lucene indexing start 3. Inverted index 1. Documents Analysis Index creation Optimise 4. Optimised inverted index . Oh for a muse of fire that would acsend the brightest heaven of invention..... fire ascend ... Henry V, Scouting for boys... Aerospace, Henry V... Terms Documents end [fire] [ascend] [bright] [heaven] 2. Token stream
  • 7.
  • 8.
  • 9.
  • 10. Analysis - Java Analyzing &quot;A Quick Brown Fox jumped over the Lazy Dog&quot; StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog] SimpleAnalyzer: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog] Analyzing &quot;XY&Z Corporation - xyz@example.com&quot; StandardAnalyzer: [xy&z] [corporation] [xyz@example.com] SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com] StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
  • 11. Analysis - PHP Analysing &quot;A Quick Brown Fox jumped over the Lazy Dog&quot; Default (lower case) filter: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] Stop words filter: [quick] [brown] [fox] [jumped] [over] [lazy] [dog] Short words filter: [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] Analysing &quot;XY&Z Corporation - xyz@example.com&quot; Default (lower case) filter: [xy] [z] [corporation] [xyz] [example] [com] Stop words filter: [xy] [z] [corporation] [xyz] [example] [com] Short words filter: [xy] [corporation] [xyz] [example] [com]
  • 12. Compare indexes Same 663 terms java php
  • 13.
  • 14.
  • 16.
  • 18.
  • 19. look at the normalize() code public function normalize(Token $srcToken ) { $newToken = new Token( strtolower( $srcToken->getTermText() ), $srcToken->getStartOffset(), $srcToken->getEndOffset()); $newToken->setPositionIncrement($srcToken->getPositionIncrement()); return $newToken; }
  • 20. The normalize() function Sum( ) = 2.92; 18.99 – 2.92 = 16.07
  • 21. Micro benchmark <?php require_once &quot;Token.php&quot;; require_once &quot;LowerCase.php&quot;; $token = new Token(&quot;GO&quot;, 105, 107); $filter = new LowerCase(); for ($i=0; $i < 10000000; $i++) { $norm_token = $filter->normalize($token); } ?>
  • 22. normalize() opcodes compiled vars: !0 = $srcToken, !1 = $newToken line # op ext return operands ---------------------------------------------------------------------------- 11 0 RECV 1 13 1 ZEND_FETCH_CLASS :0 'Token' 2 NEW $1 :0 3 ZEND_INIT_METHOD_CALL !0, 'getTermText' 4 DO_FCALL_BY_NAME 0 5 SEND_VAR_NO_REF $3 6 DO_FCALL 1 'strtolower' 7 SEND_VAR_NO_REF $4 14 8 ZEND_INIT_METHOD_CALL !0, 'getStartOffset' 9 DO_FCALL_BY_NAME 0 10 SEND_VAR_NO_REF $6 15 11 ZEND_INIT_METHOD_CALL !0, 'getEndOffset' 12 DO_FCALL_BY_NAME 0 13 SEND_VAR_NO_REF $8 14 DO_FCALL_BY_NAME 3 15 ASSIGN !1, $1 16 ......
  • 23. System profile 1. Convert to lower case 2. Look up opcodes
  • 24.
  • 25. The normalize() function Sum( ) = 2.92; 18.99 – 2.92 = 16.07 Is consumed in setting up functions to be run
  • 26.
  • 27.
  • 29. look at the call to normalize() $token = $this->normalize( new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos)); public function normalize(Token $srcToken ) { $newToken = new Token(strtolower( $srcToken->getTermText() ), $srcToken->getStartOffset(), $srcToken->getEndOffset()); $newToken->setPositionIncrement($srcToken->getPositionIncrement()); return $newToken; }
  • 30. look at the call to normalize() normalize() recoded.... $token = $this->normalize( new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos)); public function normalize (Token $srcToken) { $ srcToken->setTermText(strtolower($srcToken->getTermtext())); return $srcToken; }
  • 32. Performance improvement? PHP + fix PHP 151 167 Time to index /seconds 43 43 Time to optimise /seconds Java 32 3 35 194 210 Total time 9.5 % improvement Java + JIT 4 0.3 4.3
  • 33.
  • 34.
  • 35. Options for PHP Y Y Y N N N N Y 5. http://pecl.php.net/package/clucene Do you care about speed? Use Zend Search Lucene Only need basic features? Can support Java environment? Use a Web Service? Use Lucene via a Java bridge No Lucene solution today [5] Use SOLR as web service
  • 36.