How Cisco's Pulse uses
Lucene/Solr to put Social
Networks to Work
                 24 Jun 2010


     Sonali Sambhus
     Thangam Arumugam
     Stephen Bochinski
Introduction

Slides posted for download at the end of this presentation;
full replay available within ~48 hours of live webcast




                    Lucid Imagination, Inc. – http://www.lucidimagination.com
About the presenters


         Sonali Sambhus
         Senior search architect and Engineering Manager, Cisco Pulse
         Platform; founding member of the Cisco Pulse Business Unit.
         M.S. Computer Engineering, Rutgers.

         Thangam Arumugam
         Senior software architect, Cisco Systems. Architect, Cisco Pulse
         Platform; founding member of the Cisco Pulse Business Unit.
         BE Computer Science and Engineering, Bharathiar University, India.

         Stephen Bochinski
         Software Engineer at Cisco Systems.
         Solr Software developer for the Cisco Pulse Platform.
         BSc Computer Science, UC San Diego


Agenda

    About Cisco Pulse™
    Performance Use Case
    Optimizing stored field retrieval performance
      using Field Cache
    Optimizing query performance using MMAP
    Real Time Snapshot Feature
    Performance efficient methods for highlighting text
    Q&A


About Cisco Pulse




Background for Cisco Pulse

                  Cisco is not just about Routers & Switches!



                    Cisco’s Emerging Technology Group (ETG)
                   focuses on new markets and technologies
                     which will be the ‘next’ wave for Cisco



                            Cisco Pulse is a brainchild of ETG



                    Cisco Pulse is a shipping product targeted
                            for Enterprise Customers

Cisco Pulse


 Automatically discover
     what people know
     who they know
     the information they value
 [Figure: Cisco Network]



Cisco Pulse Search and
Analytics Platform

       Enabling Collaboration Across Boundaries

  Automatically discover expertise
  Collaborate in a single click
  Surface and share info in real-time
  Navigate video to the spoken word
  Integrated into the tools people use




      “If we knew what we know, we would be three
         times more productive than we are today!”
How We Do It (1)
    Automated Network Discovery
    [Figure: Pulse Collect Appliance; Cisco Network]




How We Do It (2)
    Social Search and Analytics
    [Figure: Pulse Connect; Social Graph with Expertise, Documents, Profile, Media]




Performance Use Case Description




An Approach to specifying a
Performance Use Case

  (1) Data
          Number of records
          Index size (GB)
          Size per record
  (2) Search Application Requirements
          Search features: faceting, sorting, etc.
          Number of records retrieved per query
  (3) Query Performance Goals
          Concurrent query rate


Three Dimensions of
Query Performance
     Data                                                Nature of Content
       Number of records in the index                    35 million
       Index size                                        6 GB
       Number and size of stored fields per record       14 string fields, each of size 85 bytes
       Term distribution in the index                    Follows Zipf's law

     Search Application Needs                            Requirements
       Number of records retrieved per search query      500 records
       Number of stored fields retrieved per query       14 string fields
       Search features (such as sorting/faceting/...)    Boolean queries without any advanced search features

Three Dimensions of Query
Performance (CONTINUED)


   Query Performance Needs                               Goals
     Concurrent query rate                               3 QPS
     Average query length                                3 terms
     Query response time budget                          Less than 300 ms
     First-time vs. subsequent queries                   Less than 300 ms for first-time as well as subsequent queries
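
The figures in the use case imply how much stored-field work each query has to do. The following is a back-of-the-envelope sketch only; the variable names are illustrative and the numbers are taken from the tables above.

```python
# Load implied by the use case (all input figures come from the slides).
RECORDS_PER_QUERY = 500      # records retrieved per search query
STORED_FIELDS = 14           # string fields retrieved per record
BYTES_PER_FIELD = 85         # size of each stored field
QUERIES_PER_SECOND = 3       # concurrent query rate goal

field_reads_per_query = RECORDS_PER_QUERY * STORED_FIELDS
bytes_per_query = field_reads_per_query * BYTES_PER_FIELD
field_reads_per_second = field_reads_per_query * QUERIES_PER_SECOND

print(field_reads_per_query)    # 7000
print(bytes_per_query)          # 595000
print(field_reads_per_second)   # 21000
```

Each query therefore touches 7,000 stored field values, which has to fit inside the 300 ms budget even on a cold cache.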




Optimizing Stored Field Retrieval
 Performance using Field Cache




Standard Caching Capabilities in Solr
   Solr 1.4 has LRU and Fast LRU caches
     Filter Cache: used for filter queries, faceting, and sorting
     Query Result Cache: stores the doc IDs that match a specific query
     Document Cache: stores Lucene Document objects that have been fetched
   Lucene Field Cache
     Used for sorting and faceting; not managed by Solr
   Limitations of these caches with respect to the use case:
     First-time queries are slow; subsequent
     queries hit the cache and are fast
     Even among subsequent queries, only recently
     used queries are fast, not all queries
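
Both limitations follow directly from how an LRU cache behaves. The sketch below is a toy LRU cache, not Solr's implementation; `LRUCache` and `slow_search` are hypothetical names used only for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Toy LRU cache: keeps at most `capacity` entries, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)      # mark as most recently used
            return self.entries[key]
        self.misses += 1                       # first-time query: pay the full cost
        value = load(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        return value


def slow_search(query):
    # Stand-in for an uncached search that has to go to disk.
    return f"results for {query}"


cache = LRUCache(capacity=2)
for q in ["a", "a", "b", "c", "a"]:
    cache.get(q, slow_search)

print(cache.hits, cache.misses)   # 1 4
```

The second lookup of `a` is a hit, but by the time `a` is requested again it has been evicted by `b` and `c`, so it is slow once more.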

Root Cause of poor query
performance in our use case
   What Lucene does when it gets a query:
     #1 Retrieve the top documents
     #2 Retrieve stored fields for these documents
   Stored fields are kept in the .fdt and .fdx Lucene files
   In this use case we retrieve a large number of documents (500)
   and their stored fields, so the cost of step #2 was high
   due to increased disk I/O




Optimizing stored field retrieval:
Leverage Field Cache
   What is Field Cache?
     The Lucene FieldCache class caches field values per segment, per doc ID,
     per field
   Solution for optimizing stored field retrieval:
   Solr customization (JIRA SOLR-1961)
     SolrIndexSearcher holds the Field Cache for its own segment
     When retrieving stored fields, the data is read from the Field Cache
     instead of from disk
     The Field Cache is warmed whenever a new Searcher comes up
     Selective field caching, i.e. the ability to configure which fields are cached
   Performance improvement:
     Query time reduced from 3 seconds to 1.5 seconds
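
The idea can be sketched in a few lines. This is a simplified model, not the SOLR-1961 patch: `StoredFieldReader` and `FieldCacheSketch` are hypothetical classes, and "disk" is simulated by a counter.

```python
class StoredFieldReader:
    """Stand-in for disk-backed stored fields; counts simulated disk reads."""

    def __init__(self, docs):
        self.docs = docs            # list of dicts, list index = doc ID
        self.disk_reads = 0

    def read(self, doc_id, field):
        self.disk_reads += 1
        return self.docs[doc_id][field]


class FieldCacheSketch:
    """Per-field arrays indexed by doc ID, filled once when a searcher opens."""

    def __init__(self, reader, cached_fields):
        self.reader = reader
        self.cached_fields = list(cached_fields)   # selective field caching
        self.columns = {}

    def warm(self):
        # Done once per new searcher: pull every cached field into memory.
        doc_count = len(self.reader.docs)
        for field in self.cached_fields:
            self.columns[field] = [self.reader.read(d, field) for d in range(doc_count)]

    def fetch(self, doc_ids, fields):
        rows = []
        for doc_id in doc_ids:
            row = {}
            for field in fields:
                if field in self.columns:
                    row[field] = self.columns[field][doc_id]        # memory lookup
                else:
                    row[field] = self.reader.read(doc_id, field)    # falls back to disk
            rows.append(row)
        return rows


docs = [{"name": f"user{i}", "dept": f"dept{i % 3}"} for i in range(5)]
reader = StoredFieldReader(docs)
cache = FieldCacheSketch(reader, cached_fields=["name"])
cache.warm()
reads_after_warm = reader.disk_reads

rows = cache.fetch([0, 3], ["name"])       # served entirely from memory
reads_after_cached_fetch = reader.disk_reads
cache.fetch([1], ["dept"])                 # "dept" is not cached, so this hits disk
```

The cost moves from query time to warm-up time: after `warm()`, fetching a cached field for any document needs no further disk reads.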
Limitations of Field Cache Solution
   Lucene Field Cache does not support multivalued fields
   The field has to be an indexed field
   Lucene only supports a finite number of distinct field values per
   field for the FieldCache class
   JVM memory consumption increases due to holding the FieldCache in
   memory.

  Memory consumption =
  Number of fields cached * Number of documents
  * 8 bytes per reference
  + SUM (Number of unique values of the field
         * average length of term)
  * 2 (chars use 2 bytes) * String overhead (40 bytes)
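
A worked instance of the estimate is sketched below. The slide's notation is ambiguous, so this assumes one particular reading: the 40-byte String overhead is added once per unique value, and the 85-byte field size is treated as 85 characters. The function name and the figure of 1 million unique values per field are made up for illustration.

```python
def field_cache_bytes(num_docs, unique_values_per_field, avg_term_length,
                      ref_bytes=8, char_bytes=2, string_overhead=40):
    """Estimate FieldCache memory under the assumptions stated in the lead-in."""
    num_fields = len(unique_values_per_field)
    # One reference per document for every cached field.
    reference_bytes = num_fields * num_docs * ref_bytes
    # Each unique value is stored once: its characters plus per-String overhead.
    string_bytes = sum(
        unique * (avg_term_length * char_bytes + string_overhead)
        for unique in unique_values_per_field
    )
    return reference_bytes + string_bytes


# 35 million documents and 14 fields come from the use case;
# 1,000,000 unique values per field is a hypothetical figure.
total = field_cache_bytes(35_000_000, [1_000_000] * 14, 85)
print(total)   # 6860000000
```

Under these assumptions the per-document references alone account for about 3.9 GB, which is why selective field caching matters.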



Optimizing Query Performance using
             MMAP




Optimizing query performance
Next Step: Leverage Lucene MMAP
   What is MMAP?
     Lucene MMAP provides a way to map index files into memory.
   Optimizing query performance: leverage MMAP
     Reduce the cost of disk I/O by memory-mapping the select index files
     that are used in computing the document list
     Added a Lucene customization to MMAP only select files (SOLR-1969)
     Added a customization to MMAP new index files after a commit or
     optimize
   Caveats
     Increases the memory usage of the JVM process by as much as the size
     of the mapped index files (mapped files live outside the Java heap).
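
Memory mapping itself is an operating-system facility, so it can be illustrated outside Lucene. The sketch below uses Python's standard `mmap` module on a throwaway file; the file contents are a stand-in and have nothing to do with a real Lucene index format.

```python
import mmap
import os
import tempfile

# Write a small stand-in "index file" to disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"term-dictionary-bytes")
    path = f.name

try:
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS loads pages on first access
        # and later reads are served from memory instead of read() calls.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            first_four = mapped[:4]
            size = len(mapped)
finally:
    os.remove(path)

print(first_four, size)
```

Mapping only selected files, as the slides describe, keeps the mapped footprint limited to the files that dominate query-time I/O.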




Performance Optimization Result
For this Use Case


   [Bar chart: Average Query Response Time Speedup (ms)]
     Default Solr                                 3226 ms
     Customized Solr (Field Cache + MMAP)          377 ms
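
The two measurements translate into a speedup factor as follows (a small arithmetic sketch; the variable names are illustrative).

```python
default_ms = 3226       # Default Solr
customized_ms = 377     # Customized Solr (Field Cache + MMAP)

speedup = default_ms / customized_ms
reduction_pct = (1 - customized_ms / default_ms) * 100

print(f"{speedup:.1f}x faster, {reduction_pct:.1f}% less time")
# 8.6x faster, 88.3% less time
```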

Operational Optimization
 for Full Index Backups:
   Real Time Snapshot




Operational optimization with full index
hot backups (Real Time Snapshot)
   Currently Solr does not provide a direct method to take an explicit
   snapshot while the index is being written
   The Cisco Pulse team came up with a solution and packaged a script
   that uses replication methods to take an online snapshot
   The snapshot can be taken at any time and replicated

USE CASES
     Index restore
     Adding another node to a cluster
     Snapshot for offline analysis
     Useful in case of real-time indexing and querying


About Lucene Index
   Segments & Commit
     segments_N file: N is the latest generation number of the segments file;
     it holds references to all the files of the segments that are active.
     A commit creates a new segments_N file
   Commit Point
     Includes the index version and the files.
   Index Deletion Policy
     Controls the index segment cleanup
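
How these three pieces make a hot snapshot possible can be shown with a toy model. This is not Lucene's API: `ToyIndex` and its methods are hypothetical, and the file names are only examples.

```python
class ToyIndex:
    """Toy model of commit points and a snapshot-aware deletion policy."""

    def __init__(self):
        self.commits = {}          # generation N -> files referenced by segments_N
        self.generation = 0
        self.snapshotted = set()   # generations protected from cleanup

    def commit(self, files):
        # Each commit writes a new commit point listing the active files.
        self.generation += 1
        self.commits[self.generation] = set(files)
        return self.generation

    def snapshot(self):
        # Pin the latest commit point so its files survive later cleanups.
        self.snapshotted.add(self.generation)
        return self.generation

    def release(self, generation):
        self.snapshotted.discard(generation)

    def cleanup(self):
        # Deletion policy: keep the latest commit and any pinned ones.
        keep = self.snapshotted | {self.generation}
        self.commits = {g: files for g, files in self.commits.items() if g in keep}
        live = set()
        for files in self.commits.values():
            live |= files
        return live


index = ToyIndex()
index.commit(["_0.fdt", "_0.fdx"])
pinned = index.snapshot()                    # a backup starts copying these files
index.commit(["_1.fdt", "_1.fdx"])           # indexing continues meanwhile
live_during_backup = index.cleanup()
index.release(pinned)                        # the backup has finished
live_after_backup = index.cleanup()
```

While the commit point is pinned, its files stay on disk even though a newer commit no longer references them, so writing never has to pause.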


Lucene Hot SnapShot
   Snapshot procedure
   Follows the Solr replication strategy:
     Commit first: this ensures that all the data written
     to the index is included in the commit point.
       http://master_host:port/solr/core/ingest?commit=true

     Take a snapshot: no need to halt index writing here.
       http://master_host:port/solr/core/replication?command=backup
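
The two requests can be scripted. The sketch below is not the Cisco Pulse script; `snapshot_urls` and `take_snapshot` are hypothetical helpers, the `/ingest` handler path is copied from the slide, and port 8983 is only an assumed example value.

```python
from urllib.request import urlopen


def snapshot_urls(host, port, core):
    """Build the commit URL and the backup URL, in the order they must be called."""
    base = f"http://{host}:{port}/solr/{core}"
    return [
        f"{base}/ingest?commit=true",            # step 1: commit first
        f"{base}/replication?command=backup",    # step 2: take the snapshot
    ]


def take_snapshot(host, port, core, timeout=30):
    """Issue both requests against a running master; returns the HTTP status codes."""
    statuses = []
    for url in snapshot_urls(host, port, core):
        with urlopen(url, timeout=timeout) as response:
            statuses.append(response.status)
    return statuses


urls = snapshot_urls("master_host", 8983, "core")
print(urls[0])
print(urls[1])
```

`take_snapshot` is defined but not called here, since it needs a live Solr master to talk to.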




SnapShot Configuration
solrconfig.xml: a snapshot can be taken on an optimize or commit event.
<requestHandler name="/replication"
  class="solr.ReplicationHandler" >
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
    <str name="replicateAfter">startup</str>
  </lst>
</requestHandler>




Performance Efficient Method for
       Highlighting Text




Highlighting Use Case
   Index Stats
     Small index: 20K documents
     Our queries return up to 4000 documents at a time
     The documents have many small fields or larger fields with unordered
     text. We must be able to use highlighting on these fields.
   Performance Considerations
     Queries must complete in < 300 ms
     With default Solr highlighting, this could not be achieved for all
     queries
   Rule of Thumb
     Optimize the slowest part of the query (try to get the biggest bang
     for the buck)

About Solr Highlighting Capabilities
   Solr Highlighting Features
     Useful for standard search engine highlighting where context and
     document content is important
     Works on indices with and without term vectors (slower w/o term
     vectors)
     Has a definite impact on query time
     Very configurable; allows more advanced options such as using regexes
     for generating fragments
   Solr Highlighting Implementation
     Gets fragments of the text in a given field and uses this for matching
     Finds matching terms and returns the match along with the terms
     surrounding it

An Approach To Performance
Efficient Highlighting
   Performance Efficient Highlighting Features
     Has very little performance impact
     Works well where the context around a term match is unimportant
     Only works with term vectors
     Doesn't retain position information for matches (terms are returned
     in the order they appear in the query)
   Performance Efficient Highlighting Implementation
     Is done by iterating through the term vectors for each field
     When a match is found, it is added as a highlighted term (with support
     for matching phrases)
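
A minimal sketch of this approach follows. It is not the Cisco Pulse implementation: term vectors are modeled as plain sets of terms per field, the `<em>` markup is an assumed choice, and phrase matching is left out for brevity.

```python
def highlight_from_term_vectors(query_terms, term_vectors):
    """Return highlighted terms per field without reading or fragmenting field text.

    term_vectors maps field name -> set of terms that occur in that field.
    Matches come back in query-term order, since positions are not used.
    """
    highlights = {}
    for field, terms in term_vectors.items():
        matched = [term for term in query_terms if term in terms]
        if matched:
            highlights[field] = [f"<em>{term}</em>" for term in matched]
    return highlights


doc_vectors = {
    "skills": {"lucene", "solr", "java"},
    "title": {"search", "architect"},
    "location": {"san", "jose"},
}
result = highlight_from_term_vectors(["solr", "architect", "lucene"], doc_vectors)
print(result)
```

Because only set lookups are involved, the cost per document is proportional to the number of query terms and fields, not to the length of the stored text.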




Modified Highlighting Performance
   Performance gains become more apparent when many results are
   returned.
   Our implementation sees a 5x decrease in query time.
   4000-document query
     Solr Highlighting                                           ~800 ms
     Modified Highlighting                                       ~160 ms




Conclusion and Q&A




Recap


  Query Performance Optimizations
    Built-in Solr capabilities
    Field Cache Customization
    MMAP Customization
  Real Time Snapshot
    Used for Index Backup/Restore
  Performance Efficient Methods of Highlighting


              Hope you found this useful!
Q&A

Slides posted at http://bit.ly/lucid-cisco
(Full replay available within ~48 hours of live webcast)


Mais conteúdo relacionado

Destaque

Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Lucidworks (Archived)
 
IAMAS 2010 First presentation
IAMAS 2010 First presentationIAMAS 2010 First presentation
IAMAS 2010 First presentation
ocrock
 
Hellosong
HellosongHellosong
Hellosong
tanica
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Lucidworks (Archived)
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Lucidworks (Archived)
 

Destaque (17)

Presentacion Ingles
Presentacion InglesPresentacion Ingles
Presentacion Ingles
 
Presentation
PresentationPresentation
Presentation
 
What’s new in apache solr 1.4
What’s new in apache solr 1.4What’s new in apache solr 1.4
What’s new in apache solr 1.4
 
Jonh Lennon
Jonh LennonJonh Lennon
Jonh Lennon
 
Sudarshan Gaikaiwari - Lucene @ Yelp
Sudarshan Gaikaiwari - Lucene @ YelpSudarshan Gaikaiwari - Lucene @ Yelp
Sudarshan Gaikaiwari - Lucene @ Yelp
 
Shining new light on lucene solr performance and monitoring
Shining new light on lucene solr performance and monitoringShining new light on lucene solr performance and monitoring
Shining new light on lucene solr performance and monitoring
 
HTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコルHTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコル
 
Solr lucene search revolution
Solr lucene search revolutionSolr lucene search revolution
Solr lucene search revolution
 
Mains aux fleurs
Mains aux fleursMains aux fleurs
Mains aux fleurs
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Customized Navigation Using SOLR
Customized Navigation Using SOLRCustomized Navigation Using SOLR
Customized Navigation Using SOLR
 
IAMAS 2010 First presentation
IAMAS 2010 First presentationIAMAS 2010 First presentation
IAMAS 2010 First presentation
 
Hellosong
HellosongHellosong
Hellosong
 
Adobe Photoshop
Adobe PhotoshopAdobe Photoshop
Adobe Photoshop
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
 

Mais de Lucidworks (Archived)

Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Lucidworks (Archived)
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Lucidworks (Archived)
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Lucidworks (Archived)
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Lucidworks (Archived)
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
Lucidworks (Archived)
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Lucidworks (Archived)
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Lucidworks (Archived)
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Lucidworks (Archived)
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
Lucidworks (Archived)
 
Seeley yonik solr performance key innovations
Seeley yonik   solr performance key innovationsSeeley yonik   solr performance key innovations
Seeley yonik solr performance key innovations
Lucidworks (Archived)
 
Implementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise
Implementing Click-through Relevance Ranking in Solr and LucidWorks EnterpriseImplementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise
Implementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise
Lucidworks (Archived)
 
Building specialized industry applications using Solr, and migration from FAS...
Building specialized industry applications using Solr, and migration from FAS...Building specialized industry applications using Solr, and migration from FAS...
Building specialized industry applications using Solr, and migration from FAS...
Lucidworks (Archived)
 

Mais de Lucidworks (Archived) (20)

Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
The Data-Driven Paradigm
The Data-Driven ParadigmThe Data-Driven Paradigm
The Data-Driven Paradigm
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
 
What's new in solr june 2014
What's new in solr june 2014What's new in solr june 2014
What's new in solr june 2014
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
 
Seeley yonik solr performance key innovations
Seeley yonik   solr performance key innovationsSeeley yonik   solr performance key innovations
Seeley yonik solr performance key innovations
 
Implementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise
Implementing Click-through Relevance Ranking in Solr and LucidWorks EnterpriseImplementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise
Implementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise
 
Building specialized industry applications using Solr, and migration from FAS...
Building specialized industry applications using Solr, and migration from FAS...Building specialized industry applications using Solr, and migration from FAS...
Building specialized industry applications using Solr, and migration from FAS...
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Último (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
AWS Community Day CPH - Three problems of Terraform
How Cisco's Pulse uses Lucene/Solr to put Social Networks to Work

  • 1. How Cisco's Pulse uses Lucene/Solr to put Social Networks to Work. 24 Jun 2010. Sonali Sambhus, Thangam Arumugam, Stephen Bochinski.
  • 2. Introduction. Slides posted for download at the end of this presentation; full replay available within ~48 hours of live webcast. Lucid Imagination, Inc. – http://www.lucidimagination.com
  • 3. About the presenters. Sonali Sambhus: Senior search architect and Engineering Manager, Cisco Pulse Platform; founding member of the Cisco Pulse Business Unit. M.S. Computer Engineering, Rutgers. Thangam Arumugam: Senior software architect, Cisco Systems. Architect, Cisco Pulse Platform; founding member of the Cisco Pulse Business Unit. BE Computer Science and Engineering, Bharathiar University, India. Stephen Bochinski: Software Engineer at Cisco Systems. Solr software developer for the Cisco Pulse Platform. BSc Computer Science, UC San Diego.
  • 4. Agenda: About Cisco Pulse™; Performance Use Case; Optimizing stored field retrieval performance using Field Cache; Optimizing query performance using MMAP; Real Time Snapshot Feature; Performance efficient methods for highlighting text; Q&A.
  • 5. About Cisco Pulse
  • 6. Background for Cisco Pulse. Cisco is not just about Routers & Switches! Cisco’s Emerging Technology Group (ETG) focuses on new markets and technologies which will be the ‘next’ wave for Cisco. Cisco Pulse is a brainchild of ETG. Cisco Pulse is a shipping product targeted at enterprise customers.
  • 7. Cisco Pulse: automatically discover what people know, who they know, and the information they value. (Diagram label: Cisco Network.)
  • 8. Cisco Pulse Search and Analytics Platform: Enabling Collaboration Across Boundaries. Automatically discover expertise; collaborate in a single click; surface and share info in real time; navigate video to the spoken word; integrated into the tools people use. “If we knew what we know, we would be three times more productive than we are today!”
  • 9. How We Do It (1): Automated Network Discovery. Pulse Collect Appliance; Cisco Network.
  • 10. How We Do It (2): Social Search and Analytics. Pulse Connect: Social Graph, Expertise, Documents, Profile, Media.
  • 11. Performance Use Case Description
  • 12. An Approach to Specifying a Performance Use Case. (1) Data: number of records; index size (GB); size per record. (2) Search application requirements: search features (faceting, sorting, etc.); number of records retrieved per query. (3) Query performance goals: concurrent query rate.
  • 13. Three Dimensions of Query Performance. Data (nature of content): number of records in the index: 35 million; index size: 6 GB; number and size of stored fields per record: 14 string fields, each of size 85 bytes; term distribution in the index: following Zipf's law. Search requirements (application needs): number of records retrieved per search query: 500 records; number of stored fields retrieved per query: 14 string fields; search features (such as sorting/faceting/..): Boolean queries without any advanced search features.
  • 14. Three Dimensions of Query Performance (continued). Query performance goals (needs): concurrent query rate: 3 QPS; average query length: 3 terms; query response time budget: less than 300 ms; first-time vs subsequent queries: less than 300 ms for first-time as well as subsequent queries.
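The figures on slides 13 and 14 imply a stored-field read load that can be worked out directly. The sketch below is illustrative arithmetic only: the function name `stored_field_load` is hypothetical, and it assumes that every one of the 14 fields is exactly 85 bytes and that queries arrive at a steady 3 QPS.

```python
def stored_field_load(records_per_query, fields_per_record, bytes_per_field, qps):
    """Return (field reads per query, bytes per query, bytes per second)."""
    reads_per_query = records_per_query * fields_per_record
    bytes_per_query = reads_per_query * bytes_per_field
    return reads_per_query, bytes_per_query, bytes_per_query * qps


# Use-case figures from the slides: 500 records, 14 fields, 85 bytes, 3 QPS.
reads, per_query, per_second = stored_field_load(500, 14, 85, 3)
print(reads, per_query, per_second)
```

Under these assumptions each query touches 7,000 stored field values, roughly 595 KB of field data, all of which has to be fetched within the 300 ms budget.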
  • 15. Optimizing Stored Field Retrieval Performance using Field Cache
  • 16. Standard Caching Capabilities in Solr. Solr 1.4 has LRU and Fast LRU caches. Filter Cache: used for filter queries, faceting, sorting. QueryResultCache: stores doc IDs specific to a query. Document Cache: stores Lucene Document objects that have been fetched. Lucene Field Cache: used for sorting and faceting; not managed by Solr. Limitations of these caches with respect to the use case: first-time queries are slow; subsequent queries hit the cache and are fast. Even for subsequent queries, only recently used queries are fast, not all queries.
  • 17. Root cause of poor query performance in our use case. What Lucene does when it gets a query: (#1) retrieve the top documents; (#2) retrieve stored fields for these documents. Stored fields are stored in the .fdt and .fdx Lucene files. In this use case, since we retrieve a large number of documents (500) and their stored fields, the cost of #2 was high due to increased disk I/O.
  • 18. Optimizing stored field retrieval: Leverage Field Cache. What is Field Cache: the Lucene class FieldCache caches field values per segment, per doc ID, per field. Solution for optimizing stored field retrieval: a Solr customization (JIRA SOLR-1961). SolrIndexSearcher holds the Field Cache for its own segment. When retrieving stored fields, the data is read from the Field Cache instead of from disk. The Field Cache is warmed whenever a new Searcher comes up. Selective field caching, i.e. the ability to configure select fields. Performance improvement: query time reduced from 3 seconds to 1.5 seconds.
  • 19. Limitations of Field Cache Solution. Lucene Field Cache does not support multi-valued fields. The field has to be an indexed field. Lucene only supports a finite number of distinct field values per field for the FieldCache class. JVM memory consumption increases due to holding the FieldCache in memory. Memory consumption = Number of fields cached * Number of documents * 8 bytes per reference + SUM (Number of unique values of the field * average length of term) * 2 (chars use 2 bytes) * String overhead (40 bytes).
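The memory formula on slide 19 can be turned into a rough estimator. The sketch below is one reading of that formula, not the presenters' code: it assumes the 40-byte String overhead is added once per unique value (the slide's wording is ambiguous on this point), and the function name `field_cache_bytes` and the figure of 1,000,000 unique values per field are hypothetical inputs chosen for illustration.

```python
def field_cache_bytes(num_docs, fields, ref_bytes=8, char_bytes=2, string_overhead=40):
    """Estimate FieldCache memory in bytes.

    fields: one (unique_values, avg_term_length) pair per cached field.
    Assumption: string_overhead is charged once per unique value.
    """
    # One reference per document for every cached field.
    reference_arrays = len(fields) * num_docs * ref_bytes
    # Each unique value is held once: its characters plus String overhead.
    unique_values = sum(
        unique * (avg_len * char_bytes + string_overhead)
        for unique, avg_len in fields
    )
    return reference_arrays + unique_values


# 35 million documents and 14 fields of about 85 characters (from the slides);
# 1,000,000 unique values per field is an assumed, illustrative number.
estimate = field_cache_bytes(35_000_000, [(1_000_000, 85)] * 14)
print(f"{estimate / 1e9:.2f} GB")
```

The reference arrays alone account for 14 * 35,000,000 * 8 bytes, about 3.92 GB, which is why selective caching of only the fields that are actually retrieved matters for JVM sizing.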
  • 20. Optimizing Query Performance using MMAP
  • 21. Optimizing query performance, next step: Leverage Lucene MMAP. What is MMAP: Lucene MMAP provides a way to map index files into memory. Optimizing query performance with MMAP: reduce the cost of disk I/O by memory-mapping select index files which are used in computation of the document list. Added a Lucene customization to MMAP only select files (SOLR-1969). Added a customization to MMAP new index files after a commit and optimize. Caveats: increases JVM memory usage by as much as the size of the index files.
  • 22. Performance Optimization Result for this Use Case. [Bar chart: average query response time (ms). Default Solr: 3226; Customized Solr (Field Cache + MMAP): 377.]
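The idea of memory-mapping only selected index files can be illustrated with the operating-system facility that memory mapping relies on. The sketch below uses Python's standard `mmap` module rather than Lucene's own API, so it shows the mechanism and not the SOLR-1969 customization itself. The helper names `map_selected_files` and `demo`, and the choice of the `.tis` and `.frq` extensions, are assumptions made for illustration; the slides do not say which files were mapped.

```python
import mmap
import os
import tempfile


def map_selected_files(index_dir, extensions):
    """Memory-map only the files in index_dir whose extension is in extensions."""
    mapped = {}
    for name in sorted(os.listdir(index_dir)):
        path = os.path.join(index_dir, name)
        if os.path.splitext(name)[1] not in extensions:
            continue  # leave unselected files to ordinary file I/O
        if os.path.getsize(path) == 0:
            continue  # an empty file cannot be memory-mapped
        with open(path, "rb") as fh:
            # Length 0 maps the whole file; the mapping outlives the file object.
            mapped[name] = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    return mapped


def demo():
    """Build a tiny fake index directory and map only the selected files."""
    with tempfile.TemporaryDirectory() as index_dir:
        for name, payload in [("_0.tis", b"terms"), ("_0.fdt", b"stored")]:
            with open(os.path.join(index_dir, name), "wb") as fh:
                fh.write(payload)
        mapped = map_selected_files(index_dir, {".tis", ".frq"})
        contents = {name: m[:] for name, m in mapped.items()}
        for m in mapped.values():
            m.close()  # release mappings before the directory is removed
        return contents


print(demo())
```

Re-running `map_selected_files` after each commit or optimize would pick up newly written files, which mirrors the second customization the slide describes.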
  • 23. Operational Optimization for Full Index Backups: Real Time Snapshot
  • 24. Operational optimization with full index hot backups (Real Time Snapshot). Currently Solr does not provide a direct method to explicitly take a snapshot while the index is being written. The Cisco Pulse team came up with a solution and packaged a script that uses replication methods to take an online snapshot. The snapshot can be taken at any time and replicated. Use cases: index restore; adding another node to a cluster; snapshot for offline analysis; useful in case of real-time indexing and querying.
  • 25. About Lucene Index Segments & Commit. Segments_N file: N is the latest generation number; the file holds the references to all the files in the segments that are active. A commit creates a new segments_N file. Commit point: includes the index version and the files. Index deletion policy: controls the index segment cleanup.
  • 26. Lucene Hot SnapShot. Snapshot procedure, following the Solr replication strategy: (1) Commit first: this ensures that all the data written to the index is included in the commit point. http://master_host:port/solr/core/ingest?commit=true (2) Take a snapshot: no need to halt index writing here. http://master_host:port/solr/core/replication?command=backup
  • 27. SnapShot Configuration. solrconfig.xml: a snapshot can be taken on an optimize or commit event.
        <requestHandler name="/replication" class="solr.ReplicationHandler">
          <lst name="master">
            <str name="replicateAfter">commit</str>
            <str name="replicateAfter">optimize</str>
            <str name="replicateAfter">startup</str>
          </lst>
        </requestHandler>
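The two-step procedure on slide 26 (commit, then backup) could be scripted roughly as follows. This is a minimal sketch and not the script the Cisco Pulse team packaged: the function names `snapshot_urls` and `take_snapshot` are hypothetical, the `/ingest` and `/replication` paths are copied from the slide as-is, and port 8983 in the usage line is an assumed value.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def snapshot_urls(host, port, core):
    """Build the commit URL and the backup URL, in the order they must be called."""
    base = f"http://{host}:{port}/solr/{core}"
    commit_url = f"{base}/ingest?{urlencode({'commit': 'true'})}"
    backup_url = f"{base}/replication?{urlencode({'command': 'backup'})}"
    return commit_url, backup_url


def take_snapshot(host, port, core, timeout=30):
    """Commit first, then request the backup; return the HTTP status codes."""
    statuses = []
    for url in snapshot_urls(host, port, core):
        with urlopen(url, timeout=timeout) as response:
            statuses.append(response.status)
    return statuses


# Only builds the URLs; take_snapshot() needs a reachable Solr master.
print(snapshot_urls("master_host", 8983, "core"))
```

Keeping URL construction separate from the network calls makes the ordering easy to check without a running Solr instance.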
  • 28. Performance Efficient Method for Highlighting Text
  • 29. Highlighting Use Case. Index stats: using a small index, 20K documents. Our queries return up to 4000 documents at a time. The documents have many small fields or larger fields with unordered text. We must be able to use highlighting on these fields. Performance considerations: queries must take less than 300 ms. Using default Solr highlighting, this could not be achieved for all queries. Rule of thumb: optimize the slowest part of the query (try to get the biggest bang for the buck).
  • 30. About Solr Highlighting Capabilities. Solr highlighting features: useful for standard search engine highlighting where context and document content are important; works on indices with and without term vectors (slower without term vectors); has a definite impact on query time; very configurable, allowing more advanced options such as using regexes for generating fragments. Solr highlighting implementation: gets fragments of the text in a given field and uses these for matching; finds matching terms and returns the match along with the terms surrounding it.
  • 31. An Approach to Performance Efficient Highlighting. Performance-efficient highlighting features: has very little performance impact; works well where context around the term match is unimportant; will only work using term vectors; doesn’t retain position information in the match (matches are returned in the order of terms in the query). Performance-efficient highlighting implementation: done by iterating through term vectors for each field; when a match is found, it is added as a highlighted term (with support for matching phrases).
  • 32. Modified Highlighting Performance. Performance gains become more apparent when many results are returned. Our implementation sees a 5x decrease in query time. 4000-document query: Solr Highlighting ~800 ms; Modified Highlighting ~160 ms.
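The term-vector approach on slide 31 can be sketched in a few lines. This is an illustration of the idea under simplifying assumptions, not the presenters' implementation: a term vector is modelled as a plain list of terms per field, the function name `highlight_from_term_vectors` is hypothetical, and the phrase matching mentioned on the slide is left out.

```python
def highlight_from_term_vectors(term_vectors, query_terms):
    """Return {field: matched terms}, with matches in query-term order.

    term_vectors: {field name: iterable of terms stored for that field}.
    No surrounding context or position information is produced.
    """
    highlights = {}
    for field, terms in term_vectors.items():
        present = set(terms)  # membership test only, positions are ignored
        matches = [term for term in query_terms if term in present]
        if matches:
            highlights[field] = matches
    return highlights


doc_vectors = {
    "title": ["solr", "search", "cisco"],
    "body": ["network"],
}
print(highlight_from_term_vectors(doc_vectors, ["cisco", "solr", "lucene"]))
```

Because the work per document is a set lookup per query term rather than fragment generation over the stored text, the cost grows slowly with the number of returned documents, which is consistent with the gains the slides report for 4000-document queries.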
  • 33. Conclusion and Q&A
  • 34. Recap. Query performance optimizations: built-in Solr capabilities; Field Cache customization; MMAP customization. Real Time Snapshot: used for index backup/restore. Performance-efficient methods of highlighting. Hope you found this useful!
  • 35. Q&A. Slides posted at http://bit.ly/lucid-cisco (full replay available within ~48 hours of live webcast).