Cisco’s new Pulse™ is a platform that uses embedded Lucene/Solr search technology to harvest expertise hidden in your organization’s social network. Using enterprise search techniques developed at Cisco with the help of Lucid Imagination, Pulse combines automated content discovery with social search and analytics, indexing a broad range of media in real time (including email, documents, and even video) and delivering results into the tools you use every day.
1. How Cisco's Pulse uses Lucene/Solr to put Social Networks to Work
24 Jun 2010
Sonali Sambhus
Thangam Arumugam
Stephen Bochinski
2. Introduction
Slides posted for download at the end of this presentation; full replay available within ~48 hours of live webcast
Lucid Imagination, Inc. – http://www.lucidimagination.com
3. About the presenters
Sonali Sambhus
Senior search architect and Engineering Manager, Cisco Pulse
Platform; founding member of the Cisco Pulse Business Unit.
M.S. Computer Engineering, Rutgers.
Thangam Arumugam
Senior software architect, Cisco Systems. Architect, Cisco Pulse
Platform; founding member of the Cisco Pulse Business Unit.
BE Computer Science and Engineering, Bharathiar University, India.
Stephen Bochinski
Software Engineer at Cisco Systems.
Solr Software developer for the Cisco Pulse Platform.
BSc Computer Science, UC San Diego
4. Agenda
About Cisco Pulse™
Performance Use Case
Optimizing stored field retrieval performance using Field Cache
Optimizing query performance using MMAP
Real Time Snapshot Feature
Performance efficient methods for highlighting text
Q&A
6. Background for Cisco Pulse
Cisco is not just about Routers & Switches!
Cisco’s Emerging Technology Group (ETG) focuses on new markets and technologies which will be the ‘next’ wave for Cisco
Cisco Pulse is a brainchild of ETG
Cisco Pulse is a shipping product targeted at Enterprise Customers
7. Cisco Pulse
Automatically discover what people know, who they know, and the information they value
Cisco Network
8. Cisco Pulse Search and Analytics Platform
Enabling Collaboration Across Boundaries
Automatically discover expertise
Collaborate in a single click
Surface and share info in real-time
Navigate video to the spoken word
Integrated into the tools people use
“If we knew what we know, we would be three
times more productive than we are today!”
9. How We Do It (1)
Automated Network Discovery
Pulse Collect Appliance
Cisco Network
10. How We Do It (2)
Social Search and Analytics
Pulse Connect
Social Graph
Expertise Documents
Profile Media
11. Performance Use Case Description
12. An Approach to Specifying a Performance Use Case
(1) Data
Number of records
Index size (GB)
Size per record
(2) Search Application Requirements
Search features: faceting, sorting, etc.
Number of records retrieved per query
(3) Query Performance Goals
Concurrent query rate
13. Three Dimensions of Query Performance
Data (Nature of Content)
Number of records in the index: 35 million
Index size: 6 GB
Number & size of stored fields per record: 14 string fields, each of size 85 bytes
Term distribution in the index: following Zipf’s Law
Application Needs (Search Requirements)
Number of records retrieved per search query: 500 records
Number of stored fields retrieved per query: 14 string fields
Search features (such as sorting/faceting/..): Boolean queries without any advanced search features
14. Three Dimensions of Query Performance (CONTINUED)
Query Performance Needs (Goals)
Concurrent query rate: 3 QPS
Average query length: 3 terms
Query response time budget: less than 300 ms
First-time vs. subsequent queries: less than 300 ms for first-time as well as subsequent queries
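To make the size of the retrieval task concrete, here is a rough back-of-the-envelope sketch built only from the figures in the two tables above. It assumes the "85 bytes" figure applies to each stored field (one possible reading of the slide); the function names are illustrative and not part of Solr.

```python
# Figures copied from the use-case tables.
# Assumption: "85 bytes" is the size of EACH of the 14 stored string fields.
RECORDS_PER_QUERY = 500
STORED_FIELDS_PER_RECORD = 14
BYTES_PER_FIELD = 85
QUERIES_PER_SECOND = 3
RESPONSE_BUDGET_MS = 300


def stored_bytes_per_query(records=RECORDS_PER_QUERY,
                           fields=STORED_FIELDS_PER_RECORD,
                           bytes_per_field=BYTES_PER_FIELD):
    """Raw stored-field payload that one query has to fetch."""
    return records * fields * bytes_per_field


def per_record_budget_ms(budget_ms=RESPONSE_BUDGET_MS,
                         records=RECORDS_PER_QUERY):
    """Time available per retrieved record if the whole budget went to retrieval."""
    return budget_ms / records


if __name__ == "__main__":
    payload = stored_bytes_per_query()
    print("stored bytes per query:", payload)
    print("stored bytes per second at target rate:", payload * QUERIES_PER_SECOND)
    print("budget per record (ms):", per_record_budget_ms())
```

The payload itself is small; the point of the arithmetic is that the budget leaves well under a millisecond per retrieved record, so any per-record disk access quickly dominates.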
15. Optimizing Stored Field Retrieval Performance using Field Cache
16. Standard Caching Capabilities in Solr
Solr 1.4 has LRU & Fast LRU Caches
Filter Cache: Used for filter queries, faceting, sorting
QueryResultCache: Used to store doc ids specific to a query
Document Cache: Stores Lucene Document objects that have been fetched
Lucene Field Cache
Used for sorting and faceting. Not managed by Solr
Limitations of these caches with respect to the use case:
First-time queries are slow; subsequent queries hit the cache and are fast
Even among subsequent queries, only recently used queries are fast, not all queries.
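The two limitations above follow directly from how any LRU cache behaves. The sketch below is a minimal, generic LRU cache written for illustration; it is not Solr's LRUCache or FastLRUCache implementation, and the class and method names are assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        """Return the cached value, or call load(key) on a miss and cache it."""
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = load(key)  # the slow path, e.g. running the query
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return value


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    for query in ["q1", "q1", "q2", "q3", "q1"]:
        cache.get(query, load=lambda q: "results for " + q)
    print("hits:", cache.hits, "misses:", cache.misses)
```

In the demo sequence the first request for each query misses, and "q1" misses again at the end because it was evicted in the meantime, which mirrors the "first time" and "only recently used" limitations.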
17. Root Cause of poor query performance in our use case
What does Lucene do when it gets a query:
#1 Retrieve the top documents
#2 Retrieve stored fields for these documents
Stored fields are stored in the .fdt & .fdx Lucene files
In this use case, we retrieve a large number of documents (500) and their stored fields, so the cost of step #2 was high due to increased disk IO
18. Optimizing stored field retrieval: Leverage Field Cache
What is Field Cache
The Lucene class FieldCache caches field values per segment, per doc id, per field
Solution for optimizing stored field retrieval
Solr Customization (JIRA SOLR-1961)
SolrIndexSearcher holds the Field Cache for its own segment
When retrieving stored fields, the data is read from the Field Cache instead of from disk
The Field Cache is warmed whenever a new Searcher comes up
Selective field caching, i.e. the ability to configure which fields are cached
Performance Improvement:
Query time reduced from 3 seconds to 1.5 seconds
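The idea on this slide can be sketched conceptually as follows. This is not the SOLR-1961 patch and not the Lucene FieldCache API; it is a small stand-in with hypothetical names (`SegmentFieldCache`, `retrieve_fields`, `load_value`) that shows values for selected fields being loaded once at warm-up and then served from memory by doc id, with uncached fields still going to the slow path.

```python
class SegmentFieldCache:
    """In-memory values of selected fields for one segment, indexed by doc id."""

    def __init__(self, segment_name, cached_fields):
        self.segment_name = segment_name
        self.cached_fields = tuple(cached_fields)
        self._values = {}  # field name -> list of values, position = doc id

    def warm(self, load_value, max_doc):
        """Load every cached field for every doc once (done when a searcher opens)."""
        for field in self.cached_fields:
            self._values[field] = [load_value(doc_id, field)
                                   for doc_id in range(max_doc)]

    def get(self, doc_id, field):
        return self._values[field][doc_id]


def retrieve_fields(cache, doc_ids, fields, load_value):
    """Build result rows, using the cache where possible and load_value otherwise."""
    rows = []
    for doc_id in doc_ids:
        row = {}
        for field in fields:
            if field in cache.cached_fields:
                row[field] = cache.get(doc_id, field)    # memory lookup
            else:
                row[field] = load_value(doc_id, field)   # slow path (disk)
        rows.append(row)
    return rows
```

The warm-up cost is paid once per searcher, which is why first-time queries no longer pay the disk penalty for the cached fields.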
19. Limitations of Field Cache Solution
Lucene Field Cache does not support multi-valued fields
The field has to be an indexed field
Lucene only supports a finite number of distinct field values per field for the FieldCache class
JVM memory consumption increases due to holding the FieldCache in memory.
Memory consumption =
  Number of fields cached * Number of documents * 8 bytes per reference
  + SUM over cached fields (Number of unique values of the field
      * (average length of term * 2 bytes per char + 40 bytes String overhead))
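The estimate above can be turned into a small calculator. This sketch reads the 40-byte String overhead as a cost added once per unique value (an assumption about the slide's intent), and the function name and parameters are illustrative only.

```python
def field_cache_bytes(num_docs, fields, ref_bytes=8, char_bytes=2,
                      string_overhead=40):
    """Estimate FieldCache memory in bytes.

    fields: one (unique_value_count, average_term_length) pair per cached field.
    Assumption: string_overhead is added once per unique value.
    """
    # One reference per document for every cached field.
    references = len(fields) * num_docs * ref_bytes
    # Each unique value is held once as a string.
    strings = sum(unique * (avg_len * char_bytes + string_overhead)
                  for unique, avg_len in fields)
    return references + strings


if __name__ == "__main__":
    # Reference array alone for the use case: 14 cached fields, 35 million docs.
    # The unique-value counts are unknown, so they are left at zero here.
    print(field_cache_bytes(35_000_000, [(0, 0)] * 14))
```

For the use case in this talk, the per-document references alone come to about 3.9 GB if all 14 fields are cached, which is why selective field caching matters.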
21. Optimizing query performance
Next Step: Leverage Lucene MMAP
What is MMAP
Lucene MMAP provides a way to map index files into memory.
Optimizing query performance: Leverage MMAP
Reduce the cost of disk IO by memory-mapping select index files which are used in computing the document list
Added Lucene customization to MMAP only select files (SOLR-1969)
Added customization to MMAP new index files after a commit & optimize
Caveats
Increases JVM memory usage by as much as the size of the mapped index files.
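As a language-neutral illustration of selective memory mapping, the sketch below uses Python's standard `mmap` module rather than Lucene's Java classes. The extension list is an assumption chosen for the example (the slides do not say which files were mapped), and `should_map` and `read_range` are hypothetical helper names.

```python
import mmap

# Assumption for illustration: map only term dictionary and postings files.
MAPPED_EXTENSIONS = (".tis", ".frq")


def should_map(filename, extensions=MAPPED_EXTENSIONS):
    """Decide by file extension whether a file is memory-mapped."""
    return filename.endswith(tuple(extensions))


def read_range(path, offset, length, use_mmap):
    """Read `length` bytes at `offset`, via mmap or an ordinary seek and read."""
    with open(path, "rb") as handle:
        if use_mmap:
            # Map the whole (non-empty) file read-only and slice it like bytes.
            with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                return mapped[offset:offset + length]
        handle.seek(offset)
        return handle.read(length)
```

Both paths return the same bytes; the difference is that repeated reads of a mapped file are served from memory pages instead of issuing a new read call each time.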
22. Performance Optimization Result For this Use Case
Average Query Response Time Speedup (ms)
Default Solr: 3226
Customized Solr (Field Cache + MMAP): 377
23. Operational Optimization for Full Index Backups: Real Time Snapshot
24. Operational optimization with full index hot backups (Real Time Snapshot)
Currently Solr does not provide a direct method to explicitly take a snapshot while the index is being written
The Cisco Pulse team came up with a solution & packaged a script that uses replication methods to take an online snapshot
The snapshot can be taken at any time and replicated
USE CASES
Index Restore
Adding another node to a cluster
Snapshot for offline analysis
Useful in case of real time indexing & querying
25. About Lucene Index
Segments & Commit
Segment N file (segments_N): N is the latest generation number of the segments file
Holds the references to all the files in the segments that are active.
Commit creates a new Segment N file
Commit Point
Includes the index version and the files.
Index Deletion Policy
Controls the index segment cleanup
26. Lucene Hot SnapShot
SnapShot Procedures
Follows the Solr replication strategy:
Commit First: This ensures that all data written to the index is included in the commit point.
http://master_host:port/solr/core/ingest?commit=true
Take a Snapshot: No need to halt index writing here.
http://master_host:port/solr/core/replication?command=backup
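The two steps above could be scripted roughly as follows. This is a sketch, not the script the Pulse team packaged: the URL shapes are copied from the slide (including its `ingest` handler path), while the function names, the timeout, and the host, port and core values in the usage note are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def snapshot_urls(host, port, core, update_handler="ingest"):
    """Return the commit URL and the backup URL, in the order they must run."""
    base = f"http://{host}:{port}/solr/{core}"
    commit_url = f"{base}/{update_handler}?" + urlencode({"commit": "true"})
    backup_url = f"{base}/replication?" + urlencode({"command": "backup"})
    return [commit_url, backup_url]


def take_snapshot(host, port, core, timeout=30):
    """Commit first, then ask the replication handler for a backup."""
    for url in snapshot_urls(host, port, core):
        # urlopen raises on HTTP errors, so a failed commit stops the backup.
        with urlopen(url, timeout=timeout) as response:
            response.read()
```

A call such as `take_snapshot("master_host", 8983, "core")` would issue both requests against a running master; indexing does not need to be paused while it runs.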
27. SnapShot Configuration
solrconfig.xml: SnapShot can be taken on Optimize or Commit event.
<requestHandler name="/replication"
class="solr.ReplicationHandler" >
<lst name="master">
<str name="replicateAfter">commit</str>
<str name="replicateAfter">optimize</str>
<str name="replicateAfter">startup</str>
</lst>
</requestHandler>
29. Highlighting Use Case
Index Stats
Using a small index, 20K documents
Our queries return up to 4000 documents at a time
The documents have many small fields or larger fields with unordered text. We must be able to use highlighting on these fields.
Performance Considerations
Queries must take < 300 ms
Using default Solr Highlighting, this could not be achieved for all queries
Rule of Thumb
Optimize the slowest part of the query (try to get the biggest bang for the buck)
30. About Solr Highlighting Capabilities
Solr Highlighting Features
Useful for standard search engine highlighting where context and document content are important
Works on indices with and without term vectors (slower w/o term vectors)
Has a definite impact on query time
Very configurable, allows more advanced options like using regexes for generating fragments
Solr Highlighting Implementation
Gets fragments of the text in a given field and uses these for matching
Finds matching terms and returns each match along with the terms surrounding it
31. An Approach To Performance Efficient Highlighting
Performance Efficient Highlighting Features
Has very little performance impact
Works well where context around term match is unimportant
Will only work using term vectors
Doesn’t retain position information in match (returns matches in order of terms in query)
Performance Efficient Highlighting Implementation
Is done by iterating through term vectors for each field
When a match is found, add it as a highlighted term (support for matching phrases)
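A minimal sketch of the approach described above, under simplifying assumptions: each field's term vector is modeled as a plain collection of terms, phrase matching is left out, and `highlight_terms` is a hypothetical name rather than a Solr API.

```python
def highlight_terms(query_terms, term_vectors):
    """Return, per field, the query terms that occur in that field's term vector.

    query_terms: terms of the query, in query order.
    term_vectors: mapping of field name -> iterable of terms in that field.
    Matches come back in query order; no positions or surrounding text are kept.
    """
    ordered_query = list(dict.fromkeys(query_terms))  # de-duplicate, keep order
    matches = {}
    for field, terms in term_vectors.items():
        present = set(terms)  # one pass over the field's term vector
        found = [term for term in ordered_query if term in present]
        if found:
            matches[field] = found
    return matches


if __name__ == "__main__":
    vectors = {
        "title": ["cisco", "pulse"],
        "body": ["lucene", "solr", "cisco"],
        "tags": ["video"],
    }
    print(highlight_terms(["solr", "cisco"], vectors))
```

Because nothing is re-analyzed and no fragments are built, the work per document is a set lookup per query term, which is where the low overhead comes from.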
32. Modified Highlighting Performance
Performance gains become more apparent when many results are returned.
Our implementation sees a 5x decrease in query time.
4000 document query
Solr Highlighting: ~800 ms
Modified Highlighting: ~160 ms
34. Recap
Query Performance Optimizations
Built-in Solr capabilities
Field Cache Customization
MMAP Customization
Real Time Snapshot
Used for Index Backup/Restore
Performance Efficient Methods of Highlighting
Hope you found this useful!
35. Q&A
Slides posted at http://bit.ly/lucid-cisco
(Full replay available within ~48 hours of live webcast)