SlideShare a Scribd company logo
1 of 44
Download to read offline
Challenges in maintaing a high-performance
Search-Engine written in Java
Simon Willnauer
Apache Lucene Core Committer & PMC Chair
simonw@apache.org / simon.willnauer@searchworkings.com
Who am I?



• Lucene Core Committer
• Project Management Committee Chair (PMC)
• Apache Member
• Co-Founder BerlinBuzzwords
• Co-Founder on Searchworkings.org / Searchworkings.com




                                                          2
http://www.searchworkings.org

• Community Portal targeting OpenSource Search




                                                 3
Agenda



• Its all about performance ...eerrr community
• It’s Java so its fast?
• Challenges we faced and solved in the last years
    • Testing, Performance, Concurrency and Resource Utilization
• Questions




                                                                   4
Lets talk about Lucene



• Apache TLP since 2001
• Grandfather of projects like Mahout, Hadoop, Nutch, Tika
• Used by thousands of applications world wide
• Apache 2.0 licensed
• Core has Zero-Dependency
• Developed and Maintained by Volunteers




                                                             5
Just a search engine - so what’s the big deal?

• True - Just software!
• Massive community - with big expectations (YOU?)
• Mission critical for lots of companies
• End-user expects instant results independent of the request complexity
• New features sometimes require major changes
• Our contract is trust - we need to maintain trust!




                                                                           6
Trust & Passion

• ~ 30 committers (~ 10 active, some are payed to work on Lucene)
• All technical communication are public (JIRA, Mailinglist, IRC)
• Consensus is king! - usually :)
• No lead developer or architect
• No stand-ups, meetings or roadmap
• Up to 10k mails per month
• No passion, no progress!
• The Apache way: Community over Code


                                                                    7
Community? / 0-Dependencies?




   yesterday           today   tomorrow




                                          8
What are folks looking for?


                    Confident Users
                     API - Stability

                                 Lucene 3x




                      Lucene 4



     Innovation                         Performance
     New Users                          Happy Users

                                                      9
Maintaining a Library - it’s tricky!

• hides a lot technical details
    • synchronization, data-structures, algorithms
• Most of the users don’t look behind the scenes
• If you don’t look behind the scenes, it’s on us to make sure that the
 library is as efficient as possible...

• A good reputation needs to be maintained....
   • Lucene 4 promises a lot but it has to prove itself




                                                                          10
But lets talk about fun challenges...

• Written in Java... ;)
• A ton of different use-cases
• Nothing is impossible - or why LowercaseFilter supports supplementary
 characters

• There are Bugs, there are always Bugs!
• Mission Critical Software - shipped for free




                                                                          11
We are working in Java so....

• No need to know the machine & your environment
• Use JDK Collections, they are fast
• Short Lived Objects are Free
• Concurrency is straight forward
• IO is trivial
• Method Calls are fast - there is a JIT, no?
• Unicode is there and it works
• No memory problems - there is a GC, right?


                                                   12
Really?




          No!
                13
Mechanical Sympathy (Jackie Steward / Martin Thompson)




“The most amazing achievement of the
computer software industry is its continuing
cancellation of the steady and staggering
gains made by the computer hardware
industry.”       Henry Peteroski




                                                         14
Where to focus on?

                                                 CPU Cache Utilization
        Cost of a Monitor / CAS
                                                                      Space/CPU Utilization
      Need of mutability     Concurrency
                                                           Cost & Need of a Multiple Writers Model
                        Compute up-front?
 Concrete or Virtual?
                                              JVM memory allocation
   Environment                                                             Can we allow stack allocation?

             JIT - Friendliness?                             Impact on GC
                    Can we specialized a data-strucutres                Amount of Objects (Long & Short Living)


Any exploitable data properties
                                   Compression                             Can I reuse this object in the short term?

                                    Do we need 2 bytes per Character?




                                                                                                              15
What we focus on...

                                                             Materialized Data structures for Java HEAP
          Write, Commit, Merge
                                             Prevent False Sharing
                                                                     Space/CPU Utilization
 Write Once & Read - Only
                                Concurrency
                                                        Single Writer - Multiple Readers
        No Java Collections where scale is an issue


carefully test what JIT likes

               Environment                                  Guarantee continuous memory allocation

 prevent invokeVirtual where possible
                                                       Impact on GC
                                                                     Data Structures with Constant number of objects
                      Finite State Transducers / Machines
                                                                   MemoryMap | NIO
Strings can share prefix & suffix   Compression
                                                                             Exploit FS / OS Caches
                                    Materialize strings to bytes
  UTF-8 by default or custom encoding


Write, Commit, Merge



                                                                                                             16
Making things fast and stable is...




    an engineering effort! & takes time!



          Go tell it your boss!


                                           17
But is it necessary?

• Yes & No - it all depends on finding the hotspots
• Measure & Optimize for you use-case.
   • Data-structures are not general purpose (like the don’t support
    deletes)

• Follow the 80 / 20 rule
• Enforce Efficiency by design
   • Java Iterators are a good example of how not to do it!
• Remember you OS is highly optimized, make use of it!



                                                                       18
Enough high level - concrete problems please!



• Challenge: Idle is no-good!
• Challenge: One Data-Structure to rule them all?
• Challenge: How how to test a library
• Challenge: What’s needed for a 20000% performance improvement




                                                                  19
Challenge: Idle is no-good

• Building an index is a CPU & IO intensive task
• Lucene is full of indexes (thats basically all it does)
• Ultimate Goal is to scale up with CPUs and saturate IO at the same time

Don’t go crazy!
• Keep your code complexity in mind
   • Other people might need to maintain / extend this




                                                                        20
Here is the problem




                      WTF?


                             21
A closer look...


                               IndexWriter




                                                                 Multi-Threaded
                            DocumentsWriter

                   Thread    Thread   Thread   Thread   Thread
                    State     State    State    State    State




                    do
                    d
                    d
                    d         do
                              d
                              d
                              d        do
                                       d
                                       d
                                       d        do
                                                d
                                                d
                                                d        do
                                                         d
                                                         d
                                                         d


                       merge segments in memory




                                                                 Single-Threaded
                                                                                    Answer: it gives
     Merge on flush                    do
                                      do                                             you threads a
                                       do
                                       do
                                        do
                                        doc
                                                                                      break and it’s
                                                                                   having a drink with
           Flush to Disk                                                            your slow-as-s**t
                                                                                       IO System
                                   Directory

                                                                                                         22
Our Solution




                          IndexWriter
                        DocumentsWriter




                                                         Multi-Threaded
               DWPT      DWPT     DWPT     DWPT   DWPT




                  ddo
                   d
                   d     ddo
                         dd       ddo
                                   d
                                   d       ddo
                                            d
                                            d     ddo
                                                  dd




  Flush to Disk
                               Directory




                                                                          23
The Result




                                                                 vs. 620 sec on 3.x




   Indexing Ingest Rate over time with Lucene 4.0 & DWPT Indexing 7 Million
                           4kb wikipedia documents


                                                                               24
Challenge: One Data-Structure to Rule them all?

• Like most other systems writing datastructures to disk Lucene didn’t
 expose it for extension

• Major problem for researchers, engineers who know what they are doing
• Special use-cases need special solutions
   • Unique ID Field usually is a 1 to 1 key to document mapping
   • Holding a posting list pointer is a wasteful
   • Term lookup + disk seek vs. Term lookup + read
• Research is active in this area (integer encoding for instance)



                                                                         25
10000 ft view




        IndexWriter                IndexReader




                      Directory

                      FileSystem




                                                 26
Introducing an extra layer




         IndexWriter                 IndexReader




                         Flex API

                             Codec

                         Directory

                        FileSystem
                                                   27
For Backwards Compatibility you know?

                                                                             Available Codecs
 Index
                                                                                     ?                ?

                                                                                  Lucene 5       Lucene 4



              segment                                                                        ?

                                                                                          Lucene 3


      id            title
                                << read >>


   Lucene 4       Lucene 4

                                              Index   Index
                                             Reader   Writer                 Index



              segment

                                                               << merge >>
                                                                                           segment
         id             title

    Lucene 3       Lucene 3

                                                                                     id              title

                                                                                  Lucene 5       Lucene 5


                                                                                                             28
Using the right tool for the job..




                                Switching to Memory PostingsFormat


                                                                     29
Using the right tool for the job..




          Switching to BlockTreeTermIndex


                                            30
Challenge: How to test a library

• A library typically has:
   • lots of interfaces & abstract classes
   • tons of parameters
   • needs to handle user input gracefully
• Ideally we test all combinations of Interfaces, parameters and user
 inputs?

• Yeah - right!




                                                                        31
What’s wrong with Unit-Test

• Short answer: Nothing!
• But...
   • 1 Run == 1000 Runs? (only cover regression?)
   • Boundaries are rarely reached
   • Waste of CPU cycles
   • Test usually run against a single implementation
   • How to test against the full Unicode-Range?




                                                        32
An Example

  The method to test:



  The test:




  The result:




                        33
Can it fail?


      It can! ...after 53139 Runs




• Boundaries are everywhere
• There is no positive value for Integer.MIN
• But how to repeat / debug?




                                               34
Solution: A Randomized UnitTest Framework

• Disclaimer: this stuff has been around for ages - not our invention!
• Random selection of:
   • Interface Implementations
   • Input Parameters like # iterations, # threads, # cache sizes,
    intervals, ...

  • Random Valid Unicode Strings (Breaking JVM for fun and profit)
  • Throttling IO
  • Random Low Level Data-Strucutures
  • And many more...

                                                                         35
Make sure your unit tests fail - eventually!

• Framework is build for Lucene
   • Currently factored out into a general purpose framework
   • Check it out on: https://github.com/carrotsearch/randomizedtesting
• Wanna help the Lucene Project?
   • Run our tests and report the failure!




                                                                          36
Challenge: What’s needed for a 20k%
Performance improvement.




                                      BEER!
  COFFEE!
                      FUN!

                                              37
The Problem: Fuzzy Search

• Retrieve all documents containing a given term within a Levenshtein
 Distance of <= 2

• Given: a sorted dictionary of terms
• Trivial Solution: Brute Force - filter(terms, LD(2, queryTerm))
• Problem: it’s damn slow!
   • O(t) terms examined, t=number of terms in all docs for that field.
    Exhaustively compares each term. We would prefer O(log2t) instead.

  • O(n2) comparison function, n=length of term. Levenshtein dynamic
    programming. We would prefer O(n) instead.




                                                                          38
Solution: Turn Queries into Automatons

• Read a crazy Paper about building Levenshtein Automaton and
 implement it. (sounds easy - right?)

• Only explore subtrees that can lead to an accept state of some finite
 state machine.

• AutomatonQuery traverses the term dictionary and the state machine in
 parallel

• Imagine the index as a state machine that recognizes Terms and
 transduces matching Documents.

  • AutomatonQuery represents a user’s search needs as a FSM.
  • The intersection of the two emits search results


                                                                          39
Solution: Turn Queries into Finite State Machines


Example DFA for “dogs” Levenshtein Distance 1
                       Finite-State Queries in Lucene
 Accepts: “dugs”                                      Robert Muir
              o                                    rmuir@apache.org


                   u0000-f, g ,h-n, o, p-uffff

       d          g




                                                                      40
Turns out to be a massive improvement!




         In Lucene 3 this is about 0.1 - 0.2 QPS

                                                   41
Berlin Buzzwords 2012




 • Conference on High-Scalability, NoSQL and Search
 • 600+ Attendees, 50 Sessions, Trainings etc.
 • http://www.berlinbuzzwords.com


                                                      42
Who we are? Who builds Lucene?



• There is no Who!
• There is no such thing as “the creator of Lucene”
• It’s a true team effort
• That and only that makes Lucene what it is today!




                                                      43
Questions anybody?




                     ?
                         44

More Related Content

What's hot

SAP Virtualization Week 2012 - The Lego Cloud
SAP Virtualization Week 2012 - The Lego CloudSAP Virtualization Week 2012 - The Lego Cloud
SAP Virtualization Week 2012 - The Lego Cloudaidanshribman
 
Enhancing Live Migration Process for CPU and/or memory intensive VMs running...
Enhancing Live Migration Process for CPU and/or  memory intensive VMs running...Enhancing Live Migration Process for CPU and/or  memory intensive VMs running...
Enhancing Live Migration Process for CPU and/or memory intensive VMs running...Benoit Hudzia
 
Which Problems Does a Multi-Language Virtual Machine Need to Solve in the Mul...
Which Problems Does a Multi-Language Virtual Machine Need to Solve in the Mul...Which Problems Does a Multi-Language Virtual Machine Need to Solve in the Mul...
Which Problems Does a Multi-Language Virtual Machine Need to Solve in the Mul...Stefan Marr
 
2장. Runtime Data Areas
2장. Runtime Data Areas2장. Runtime Data Areas
2장. Runtime Data Areas김 한도
 
BM Real-time Technologies for SUSE Linux Enterprise Real Time
BM Real-time Technologies for SUSE Linux Enterprise Real TimeBM Real-time Technologies for SUSE Linux Enterprise Real Time
BM Real-time Technologies for SUSE Linux Enterprise Real TimeNovell
 
1장 Java란 무엇인가.key
1장 Java란 무엇인가.key1장 Java란 무엇인가.key
1장 Java란 무엇인가.key김 한도
 

What's hot (8)

SAP Virtualization Week 2012 - The Lego Cloud
SAP Virtualization Week 2012 - The Lego CloudSAP Virtualization Week 2012 - The Lego Cloud
SAP Virtualization Week 2012 - The Lego Cloud
 
Enhancing Live Migration Process for CPU and/or memory intensive VMs running...
Enhancing Live Migration Process for CPU and/or  memory intensive VMs running...Enhancing Live Migration Process for CPU and/or  memory intensive VMs running...
Enhancing Live Migration Process for CPU and/or memory intensive VMs running...
 
Which Problems Does a Multi-Language Virtual Machine Need to Solve in the Mul...
Which Problems Does a Multi-Language Virtual Machine Need to Solve in the Mul...Which Problems Does a Multi-Language Virtual Machine Need to Solve in the Mul...
Which Problems Does a Multi-Language Virtual Machine Need to Solve in the Mul...
 
Android Virtualization: Opportunity and Organization
Android Virtualization: Opportunity and OrganizationAndroid Virtualization: Opportunity and Organization
Android Virtualization: Opportunity and Organization
 
2장. Runtime Data Areas
2장. Runtime Data Areas2장. Runtime Data Areas
2장. Runtime Data Areas
 
BM Real-time Technologies for SUSE Linux Enterprise Real Time
BM Real-time Technologies for SUSE Linux Enterprise Real TimeBM Real-time Technologies for SUSE Linux Enterprise Real Time
BM Real-time Technologies for SUSE Linux Enterprise Real Time
 
1장 Java란 무엇인가.key
1장 Java란 무엇인가.key1장 Java란 무엇인가.key
1장 Java란 무엇인가.key
 
Ccna PrepCenter - IP Subnetting from Networkers
Ccna PrepCenter - IP Subnetting from NetworkersCcna PrepCenter - IP Subnetting from Networkers
Ccna PrepCenter - IP Subnetting from Networkers
 

Viewers also liked

Multiservicios DMRD, C.A.
Multiservicios DMRD, C.A.Multiservicios DMRD, C.A.
Multiservicios DMRD, C.A.Rafael Roa
 
Reporte de la pelicula un monstruo en paris.
Reporte de la pelicula un monstruo en paris.Reporte de la pelicula un monstruo en paris.
Reporte de la pelicula un monstruo en paris.ARIADNAGM
 
BIBLIA CATOLICA, ANTIGUO TESTAMENTO, CRONICAS, PARTE 13 DE 47
BIBLIA CATOLICA, ANTIGUO TESTAMENTO, CRONICAS, PARTE 13 DE 47BIBLIA CATOLICA, ANTIGUO TESTAMENTO, CRONICAS, PARTE 13 DE 47
BIBLIA CATOLICA, ANTIGUO TESTAMENTO, CRONICAS, PARTE 13 DE 47sifexol
 
BIT 198 Carta Decano y Portada
BIT 198 Carta Decano y PortadaBIT 198 Carta Decano y Portada
BIT 198 Carta Decano y PortadaEugenio Fontán
 
Aplicattion form-admision Master in Film Direction 2016/2017. ENGLISH
Aplicattion form-admision Master in Film Direction 2016/2017.  ENGLISHAplicattion form-admision Master in Film Direction 2016/2017.  ENGLISH
Aplicattion form-admision Master in Film Direction 2016/2017. ENGLISHBande á Part Escuela de Cine
 
Dossier premsa The Family Run 2015 - Sant Cugat
Dossier premsa The Family Run 2015 - Sant CugatDossier premsa The Family Run 2015 - Sant Cugat
Dossier premsa The Family Run 2015 - Sant CugatPremsa Sant Cugat
 
Marina Angelaki - PASTEUR4OA: Supporting Open Access Policies
Marina Angelaki - PASTEUR4OA: Supporting Open Access PoliciesMarina Angelaki - PASTEUR4OA: Supporting Open Access Policies
Marina Angelaki - PASTEUR4OA: Supporting Open Access PoliciesOpenAIRE
 
Bissell Steam Mop
Bissell Steam Mop Bissell Steam Mop
Bissell Steam Mop Bissell1
 
Relaciones de tablas en una base de datos
Relaciones de tablas en una base de datosRelaciones de tablas en una base de datos
Relaciones de tablas en una base de datoserickksanti14
 
Catalogo muebles de bano Altai Visobath Kerlanic
Catalogo muebles de bano Altai Visobath KerlanicCatalogo muebles de bano Altai Visobath Kerlanic
Catalogo muebles de bano Altai Visobath KerlanicKerlanic Tile
 
Social Media und Internet: Wo stehen wir heute, welche Trends und Visionen ko...
Social Media und Internet: Wo stehen wir heute, welche Trends und Visionen ko...Social Media und Internet: Wo stehen wir heute, welche Trends und Visionen ko...
Social Media und Internet: Wo stehen wir heute, welche Trends und Visionen ko...Jochen Robes
 
"El Principito" para MJ
"El Principito" para MJ"El Principito" para MJ
"El Principito" para MJIrene Hidalgo
 
Caso compañía manufacturera k
Caso compañía manufacturera kCaso compañía manufacturera k
Caso compañía manufacturera kDaniel Jurado
 
AUTOEVALUACIÓN CULTURA Y ESTRUCTURA AFECTIVA
AUTOEVALUACIÓN  CULTURA Y ESTRUCTURA AFECTIVAAUTOEVALUACIÓN  CULTURA Y ESTRUCTURA AFECTIVA
AUTOEVALUACIÓN CULTURA Y ESTRUCTURA AFECTIVAFANNY JEM WONG MIÑÁN
 
Seguridad en las empresas
Seguridad en las empresasSeguridad en las empresas
Seguridad en las empresasOlatic
 

Viewers also liked (20)

Multiservicios DMRD, C.A.
Multiservicios DMRD, C.A.Multiservicios DMRD, C.A.
Multiservicios DMRD, C.A.
 
Reporte de la pelicula un monstruo en paris.
Reporte de la pelicula un monstruo en paris.Reporte de la pelicula un monstruo en paris.
Reporte de la pelicula un monstruo en paris.
 
BIBLIA CATOLICA, ANTIGUO TESTAMENTO, CRONICAS, PARTE 13 DE 47
BIBLIA CATOLICA, ANTIGUO TESTAMENTO, CRONICAS, PARTE 13 DE 47BIBLIA CATOLICA, ANTIGUO TESTAMENTO, CRONICAS, PARTE 13 DE 47
BIBLIA CATOLICA, ANTIGUO TESTAMENTO, CRONICAS, PARTE 13 DE 47
 
BIT 198 Carta Decano y Portada
BIT 198 Carta Decano y PortadaBIT 198 Carta Decano y Portada
BIT 198 Carta Decano y Portada
 
Aplicattion form-admision Master in Film Direction 2016/2017. ENGLISH
Aplicattion form-admision Master in Film Direction 2016/2017.  ENGLISHAplicattion form-admision Master in Film Direction 2016/2017.  ENGLISH
Aplicattion form-admision Master in Film Direction 2016/2017. ENGLISH
 
160111 corp presentation
160111 corp presentation160111 corp presentation
160111 corp presentation
 
Its
ItsIts
Its
 
Chuong1@tkw
Chuong1@tkwChuong1@tkw
Chuong1@tkw
 
New Goals of PARES: Spanish Archives Web Portal
New Goals of PARES: Spanish Archives Web PortalNew Goals of PARES: Spanish Archives Web Portal
New Goals of PARES: Spanish Archives Web Portal
 
Dossier premsa The Family Run 2015 - Sant Cugat
Dossier premsa The Family Run 2015 - Sant CugatDossier premsa The Family Run 2015 - Sant Cugat
Dossier premsa The Family Run 2015 - Sant Cugat
 
Marina Angelaki - PASTEUR4OA: Supporting Open Access Policies
Marina Angelaki - PASTEUR4OA: Supporting Open Access PoliciesMarina Angelaki - PASTEUR4OA: Supporting Open Access Policies
Marina Angelaki - PASTEUR4OA: Supporting Open Access Policies
 
Bissell Steam Mop
Bissell Steam Mop Bissell Steam Mop
Bissell Steam Mop
 
Relaciones de tablas en una base de datos
Relaciones de tablas en una base de datosRelaciones de tablas en una base de datos
Relaciones de tablas en una base de datos
 
Catalogo muebles de bano Altai Visobath Kerlanic
Catalogo muebles de bano Altai Visobath KerlanicCatalogo muebles de bano Altai Visobath Kerlanic
Catalogo muebles de bano Altai Visobath Kerlanic
 
Social Media und Internet: Wo stehen wir heute, welche Trends und Visionen ko...
Social Media und Internet: Wo stehen wir heute, welche Trends und Visionen ko...Social Media und Internet: Wo stehen wir heute, welche Trends und Visionen ko...
Social Media und Internet: Wo stehen wir heute, welche Trends und Visionen ko...
 
"El Principito" para MJ
"El Principito" para MJ"El Principito" para MJ
"El Principito" para MJ
 
Caso compañía manufacturera k
Caso compañía manufacturera kCaso compañía manufacturera k
Caso compañía manufacturera k
 
El Proyecto de Reintroducción del Quebrantahuesos (Gypaetus barbatus) en Anda...
El Proyecto de Reintroducción del Quebrantahuesos (Gypaetus barbatus) en Anda...El Proyecto de Reintroducción del Quebrantahuesos (Gypaetus barbatus) en Anda...
El Proyecto de Reintroducción del Quebrantahuesos (Gypaetus barbatus) en Anda...
 
AUTOEVALUACIÓN CULTURA Y ESTRUCTURA AFECTIVA
AUTOEVALUACIÓN  CULTURA Y ESTRUCTURA AFECTIVAAUTOEVALUACIÓN  CULTURA Y ESTRUCTURA AFECTIVA
AUTOEVALUACIÓN CULTURA Y ESTRUCTURA AFECTIVA
 
Seguridad en las empresas
Seguridad en las empresasSeguridad en las empresas
Seguridad en las empresas
 

Similar to Challenges in Maintaining a High Performance Search Engine Written in Java

Reverse engineering
Reverse engineeringReverse engineering
Reverse engineeringSaswat Padhi
 
Dynamic Languages in Production: Progress and Open Challenges
Dynamic Languages in Production: Progress and Open ChallengesDynamic Languages in Production: Progress and Open Challenges
Dynamic Languages in Production: Progress and Open Challengesbcantrill
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
Engineered Systems: Oracle’s Vision for the Future
Engineered Systems: Oracle’s Vision for the FutureEngineered Systems: Oracle’s Vision for the Future
Engineered Systems: Oracle’s Vision for the FutureBob Rhubart
 
Application Profiling for Memory and Performance
Application Profiling for Memory and PerformanceApplication Profiling for Memory and Performance
Application Profiling for Memory and Performancepradeepfn
 
Azug - successfully breeding rabits
Azug - successfully breeding rabitsAzug - successfully breeding rabits
Azug - successfully breeding rabitsYves Goeleven
 
VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012Eonblast
 
Java in High Frequency Trading
Java in High Frequency TradingJava in High Frequency Trading
Java in High Frequency TradingViktor Sovietov
 
Session 01 - Introduction to Java
Session 01 - Introduction to JavaSession 01 - Introduction to Java
Session 01 - Introduction to JavaPawanMM
 
Application Profiling for Memory and Performance
Application Profiling for Memory and PerformanceApplication Profiling for Memory and Performance
Application Profiling for Memory and PerformanceWSO2
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
Introduction to Java Part-2
Introduction to Java Part-2Introduction to Java Part-2
Introduction to Java Part-2RatnaJava
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications OpenEBS
 
Network support for resource disaggregation in next-generation datacenters
Network support for resource disaggregation in next-generation datacentersNetwork support for resource disaggregation in next-generation datacenters
Network support for resource disaggregation in next-generation datacentersSangjin Han
 

Similar to Challenges in Maintaining a High Performance Search Engine Written in Java (20)

Reverse engineering
Reverse engineeringReverse engineering
Reverse engineering
 
Dynamic Languages in Production: Progress and Open Challenges
Dynamic Languages in Production: Progress and Open ChallengesDynamic Languages in Production: Progress and Open Challenges
Dynamic Languages in Production: Progress and Open Challenges
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
Engineered Systems: Oracle’s Vision for the Future
Engineered Systems: Oracle’s Vision for the FutureEngineered Systems: Oracle’s Vision for the Future
Engineered Systems: Oracle’s Vision for the Future
 
Application Profiling for Memory and Performance
Application Profiling for Memory and PerformanceApplication Profiling for Memory and Performance
Application Profiling for Memory and Performance
 
Azug - successfully breeding rabits
Azug - successfully breeding rabitsAzug - successfully breeding rabits
Azug - successfully breeding rabits
 
VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012
 
Java in High Frequency Trading
Java in High Frequency TradingJava in High Frequency Trading
Java in High Frequency Trading
 
Session 01 - Introduction to Java
Session 01 - Introduction to JavaSession 01 - Introduction to Java
Session 01 - Introduction to Java
 
Surge2012
Surge2012Surge2012
Surge2012
 
Application Profiling for Memory and Performance
Application Profiling for Memory and PerformanceApplication Profiling for Memory and Performance
Application Profiling for Memory and Performance
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Introduction to Java Part-2
Introduction to Java Part-2Introduction to Java Part-2
Introduction to Java Part-2
 
Loom promises: be there!
Loom promises: be there!Loom promises: be there!
Loom promises: be there!
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
Java on the Mainframe
Java on the MainframeJava on the Mainframe
Java on the Mainframe
 
Network support for resource disaggregation in next-generation datacenters
Network support for resource disaggregation in next-generation datacentersNetwork support for resource disaggregation in next-generation datacenters
Network support for resource disaggregation in next-generation datacenters
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Recently uploaded

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Recently uploaded (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Challenges in Maintaining a High Performance Search Engine Written in Java

  • 1. Challenges in maintaing a high-performance Search-Engine written in Java Simon Willnauer Apache Lucene Core Committer & PMC Chair simonw@apache.org / simon.willnauer@searchworkings.com
  • 2. Who am I? • Lucene Core Committer • Project Management Committee Chair (PMC) • Apache Member • Co-Founder BerlinBuzzwords • Co-Founder on Searchworkings.org / Searchworkings.com 2
  • 4. Agenda • Its all about performance ...eerrr community • It’s Java so its fast? • Challenges we faced and solved in the last years • Testing, Performance, Concurrency and Resource Utilization • Questions 4
  • 5. Lets talk about Lucene • Apache TLP since 2001 • Grandfather of projects like Mahout, Hadoop, Nutch, Tika • Used by thousands of applications world wide • Apache 2.0 licensed • Core has Zero-Dependency • Developed and Maintained by Volunteers 5
  • 6. Just a search engine - so what’s the big deal? • True - Just software! • Massive community - with big expectations (YOU?) • Mission critical for lots of companies • End-user expects instant results independent of the request complexity • New features sometimes require major changes • Our contract is trust - we need to maintain trust! 6
  • 7. Trust & Passion • ~ 30 committers (~ 10 active, some are payed to work on Lucene) • All technical communication are public (JIRA, Mailinglist, IRC) • Consensus is king! - usually :) • No lead developer or architect • No stand-ups, meetings or roadmap • Up to 10k mails per month • No passion, no progress! • The Apache way: Community over Code 7
  • 8. Community? / 0-Dependencies? yesterday today tomorrow 8
  • 9. What are folks looking for? Confident Users API - Stability Lucene 3x Lucene 4 Innovation Performance New Users Happy Users 9
  • 10. Maintaining a Library - it’s tricky! • hides a lot technical details • synchronization, data-structures, algorithms • Most of the users don’t look behind the scenes • If you don’t look behind the scenes, it’s on us to make sure that the library is as efficient as possible... • A good reputation needs to be maintained.... • Lucene 4 promises a lot but it has to prove itself 10
  • 11. But lets talk about fun challenges... • Written in Java... ;) • A ton of different use-cases • Nothing is impossible - or why LowercaseFilter supports supplementary characters • There are Bugs, there are always Bugs! • Mission Critical Software - shipped for free 11
  • 12. We are working in Java so.... • No need to know the machine & your environment • Use JDK Collections, they are fast • Short Lived Objects are Free • Concurrency is straight forward • IO is trivial • Method Calls are fast - there is a JIT, no? • Unicode is there and it works • No memory problems - there is a GC, right? 12
  • 13. Really? No! 13
  • 14. Mechanical Sympathy (Jackie Steward / Martin Thompson) “The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.” Henry Peteroski 14
  • 15. Where to focus on? CPU Cache Utilization Cost of a Monitor / CAS Space/CPU Utilization Need of mutability Concurrency Cost & Need of a Multiple Writers Model Compute up-front? Concrete or Virtual? JVM memory allocation Environment Can we allow stack allocation? JIT - Friendliness? Impact on GC Can we specialized a data-strucutres Amount of Objects (Long & Short Living) Any exploitable data properties Compression Can I reuse this object in the short term? Do we need 2 bytes per Character? 15
  • 16. What we focus on... Materialized Data structures for Java HEAP Write, Commit, Merge Prevent False Sharing Space/CPU Utilization Write Once & Read - Only Concurrency Single Writer - Multiple Readers No Java Collections where scale is an issue carefully test what JIT likes Environment Guarantee continuous memory allocation prevent invokeVirtual where possible Impact on GC Data Structures with Constant number of objects Finite State Transducers / Machines MemoryMap | NIO Strings can share prefix & suffix Compression Exploit FS / OS Caches Materialize strings to bytes UTF-8 by default or custom encoding Write, Commit, Merge 16
  • 17. Making things fast and stable is... an engineering effort! & takes time! Go tell it your boss! 17
  • 18. But is it necessary? • Yes & No - it all depends on finding the hotspots • Measure & Optimize for you use-case. • Data-structures are not general purpose (like the don’t support deletes) • Follow the 80 / 20 rule • Enforce Efficiency by design • Java Iterators are a good example of how not to do it! • Remember you OS is highly optimized, make use of it! 18
  • 19. Enough high level - concrete problems please! • Challenge: Idle is no-good! • Challenge: One Data-Structure to rule them all? • Challenge: How how to test a library • Challenge: What’s needed for a 20000% performance improvement 19
  • 20. Challenge: Idle is no-good • Building an index is a CPU & IO intensive task • Lucene is full of indexes (thats basically all it does) • Ultimate Goal is to scale up with CPUs and saturate IO at the same time Don’t go crazy! • Keep your code complexity in mind • Other people might need to maintain / extend this 20
  • 21. Here is the problem WTF? 21
  • 22. A closer look... IndexWriter Multi-Threaded DocumentsWriter Thread Thread Thread Thread Thread State State State State State do d d d do d d d do d d d do d d d do d d d merge segments in memory Single-Threaded Answer: it gives Merge on flush do do you threads a do do do doc break and it’s having a drink with Flush to Disk your slow-as-s**t IO System Directory 22
  • 23. Our Solution IndexWriter DocumentsWriter Multi-Threaded DWPT DWPT DWPT DWPT DWPT ddo d d ddo dd ddo d d ddo d d ddo dd Flush to Disk Directory 23
  • 24. The Result vs. 620 sec on 3.x Indexing Ingest Rate over time with Lucene 4.0 & DWPT Indexing 7 Million 4kb wikipedia documents 24
  • 25. Challenge: One Data-Structure to Rule them all? • Like most other systems writing datastructures to disk Lucene didn’t expose it for extension • Major problem for researchers, engineers who know what they are doing • Special use-cases need special solutions • Unique ID Field usually is a 1 to 1 key to document mapping • Holding a posting list pointer is a wasteful • Term lookup + disk seek vs. Term lookup + read • Research is active in this area (integer encoding for instance) 25
  • 26. 10000 ft view IndexWriter IndexReader Directory FileSystem 26
  • 27. Introducing an extra layer IndexWriter IndexReader Flex API Codec Directory FileSystem 27
  • 28. For Backwards Compatibility you know? Available Codecs Index ? ? Lucene 5 Lucene 4 segment ? Lucene 3 id title << read >> Lucene 4 Lucene 4 Index Index Reader Writer Index segment << merge >> segment id title Lucene 3 Lucene 3 id title Lucene 5 Lucene 5 28
  • 29. Using the right tool for the job.. Switching to Memory PostingsFormat 29
  • 30. Using the right tool for the job.. Switching to BlockTreeTermIndex 30
  • 31. Challenge: How to test a library • A library typically has: • lots of interfaces & abstract classes • tons of parameters • needs to handle user input gracefully • Ideally we test all combinations of Interfaces, parameters and user inputs? • Yeah - right! 31
  • 32. What’s wrong with Unit-Test • Short answer: Nothing! • But... • 1 Run == 1000 Runs? (only cover regression?) • Boundaries are rarely reached • Waste of CPU cycles • Test usually run against a single implementation • How to test against the full Unicode-Range? 32
  • 33. An Example The method to test: The test: The result: 33
  • 34. Can it fail? It can! ...after 53139 Runs • Boundaries are everywhere • There is no positive value for Integer.MIN • But how to repeat / debug? 34
  • 35. Solution: A Randomized UnitTest Framework • Disclaimer: this stuff has been around for ages - not our invention! • Random selection of: • Interface Implementations • Input Parameters like # iterations, # threads, # cache sizes, intervals, ... • Random Valid Unicode Strings (Breaking JVM for fun and profit) • Throttling IO • Random Low Level Data-Strucutures • And many more... 35
  • 36. Make sure your unit tests fail - eventually! • Framework is build for Lucene • Currently factored out into a general purpose framework • Check it out on: https://github.com/carrotsearch/randomizedtesting • Wanna help the Lucene Project? • Run our tests and report the failure! 36
  • 37. Challenge: What’s needed for a 20k% Performance improvement. BEER! COFFEE! FUN! 37
  • 38. The Problem: Fuzzy Search • Retrieve all documents containing a given term within a Levenshtein Distance of <= 2 • Given: a sorted dictionary of terms • Trivial Solution: Brute Force - filter(terms, LD(2, queryTerm)) • Problem: it’s damn slow! • O(t) terms examined, t=number of terms in all docs for that field. Exhaustively compares each term. We would prefer O(log2t) instead. • O(n2) comparison function, n=length of term. Levenshtein dynamic programming. We would prefer O(n) instead. 38
  • 39. Solution: Turn Queries into Automatons • Read a crazy Paper about building Levenshtein Automaton and implement it. (sounds easy - right?) • Only explore subtrees that can lead to an accept state of some finite state machine. • AutomatonQuery traverses the term dictionary and the state machine in parallel • Imagine the index as a state machine that recognizes Terms and transduces matching Documents. • AutomatonQuery represents a user’s search needs as a FSM. • The intersection of the two emits search results 39
  • 40. Solution: Turn Queries into Finite State Machines Example DFA for “dogs” Levenshtein Distance 1 Finite-State Queries in Lucene Accepts: “dugs” Robert Muir o rmuir@apache.org u0000-f, g ,h-n, o, p-uffff d g 40
  • 41. Turns out to be a massive improvement! In Lucene 3 this is about 0.1 - 0.2 QPS 41
  • 42. Berlin Buzzwords 2012 • Conference on High-Scalability, NoSQL and Search • 600+ Attendees, 50 Sessions, Trainings etc. • http://www.berlinbuzzwords.com 42
  • 43. Who we are? Who builds Lucene? • There is no Who! • There is no such thing as “the creator of Lucene” • It’s a true team effort • That and only that makes Lucene what it is today! 43