High Performance Solr
Shalin Shekhar Mangar
shalin@apache.org
https://twitter.com/shalinmangar
Performance constraints
• CPU
• Memory
• Disk
• Network
Tuning (CPU) Queries
• Phrase query
• Boolean query (AND)
• Boolean query (OR)
• Wildcard
• Fuzzy
• Soundex
• …roughly in order of increasing cost
• Query performance inversely proportional to matches (doc frequency)
Tuning (CPU) Queries
• Reduce frequent-term queries
• Remove stopwords
• Try CommonGramsFilter
• Index pruning (advanced)
• Some function queries match ALL documents -
terribly inefficient
Tuning (CPU) Queries
• Make efficient use of caches
• Watch those eviction counts
• Beware of NOW in date range queries. Use NOW/DAY or NOW/HOUR
• No need to cache every filter
• Use fq={!cache=false}year:[2005 TO *]
• Specify cost for non-cached filters for efficiency
• fq={!geofilt sfield=location pt=22,-127 d=50 cache=false cost=50}
• Use PostFilters for very expensive filters (cache=false, cost >= 100)
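The caching hints on this slide can be combined in one request. The following is a minimal sketch that only builds the query string and sends nothing; the helper name `build_filter_params` and the `timestamp` field are assumptions for illustration, while the `year` and `location` filters are taken from the bullets above.

```python
from urllib.parse import urlencode


def build_filter_params(user_query):
    """Return Solr request parameters that apply the caching hints above."""
    return [
        ("q", user_query),
        # Rounded date math: the filter string stays identical for every
        # request made on the same day, so a cached entry can be reused.
        # ("timestamp" is a hypothetical field name.)
        ("fq", "timestamp:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]"),
        # A filter that is not worth a filterCache entry.
        ("fq", "{!cache=false}year:[2005 TO *]"),
        # Non-cached filters are evaluated in order of increasing cost.
        ("fq", "{!geofilt sfield=location pt=22,-127 d=50 cache=false cost=50}"),
    ]


query_string = urlencode(build_filter_params("solr"))
print(query_string)
```

An unrounded `NOW` changes every millisecond, so a filter built on it would produce a new cache key on every request.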
Tuning (CPU) Queries
• Warm those caches
• Auto-warming
• Warming queries
• firstSearcher
• newSearcher
Tuning (CPU) Queries
• Stop using primitive number/date fields if you are performing range queries
• facet.query (sometimes) or facet.range are also range queries
• Use Trie* Fields
• When performing range queries on a string field (rare use-case), use frange
to trade off memory for speed
• It will un-invert the field
• No additional cost is paid if the field is already being used for sorting or
other function queries
• fq={!frange l=martin u=rowling}author_last_name instead of
fq=author_last_name:[martin TO rowling]
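As a small sketch of the two forms on this slide, the helper below (a hypothetical name, not part of any Solr client) builds either the frange filter or the standard range syntax for the same bounds.

```python
def range_filter(field, lower, upper, use_frange=True):
    """Build an fq value for a range over a string field.

    use_frange=True produces the function-range form from the slide,
    which works on the un-inverted field values; False produces the
    standard range query syntax.
    """
    if use_frange:
        return "{!frange l=%s u=%s}%s" % (lower, upper, field)
    return "%s:[%s TO %s]" % (field, lower, upper)


print(range_filter("author_last_name", "martin", "rowling"))
print(range_filter("author_last_name", "martin", "rowling", use_frange=False))
```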
Tuning (CPU) Queries
• Faceting methods
• facet.method=enum - great for fields with few unique values
• facet.enum.cache.minDf - minimum document frequency at which a term uses the filter cache; below it, Solr iterates through DocsEnum
• facet.method=fc
• facet.method=fcs (per-segment)
• facet.sort=index is faster than facet.sort=count but useless in typical cases
Tuning (CPU) Queries
• ReRankQueryParser
• Like a PostFilter but for queries!
• Run expensive queries last, on only the top-ranked documents
• Solr 4.9+ only (soon to be released)
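A minimal sketch of a re-rank request follows. It only builds the parameters; the helper name and the default values for `top_n` and `weight` are assumptions, and `rqq` is an arbitrary parameter name that `$rqq` dereferences.

```python
from urllib.parse import urlencode


def rerank_params(cheap_query, expensive_query, top_n=1000, weight=3):
    """Run cheap_query over the whole index, then re-score only the
    top_n results with expensive_query."""
    return {
        "q": cheap_query,
        "rq": "{!rerank reRankQuery=$rqq reRankDocs=%d reRankWeight=%s}"
              % (top_n, weight),
        # Referenced from rq as $rqq.
        "rqq": expensive_query,
    }


params = rerank_params("greetings", "(hi hello hey hallo)")
print(urlencode(params))
```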
Tuning (CPU) Queries
• Divide and conquer
• Shard’em out
• Use multiple CPUs
• Sometimes multiple cores are the answer, even for small indexes and especially for high update rates
Tuning Memory Usage
• Use DocValues for sorting/faceting/grouping
• The docValuesFormat setting accepts {‘default’, ‘memory’, ‘direct’}, each with different trade-offs.
• default - Helps avoid OOM but uses disk and OS
page cache
• memory - compressed in-memory format
• direct - no-compression, in-memory format
Tuning Memory Usage
• termIndexInterval - Choose how often terms are loaded into the in-memory term index. Default is 128.
Tuning Memory Usage
• Garbage Collection pauses kill search performance
• GC pauses expire ZK sessions in SolrCloud
leading to many problems
• Large heap sizes are almost never the answer
• Leave a lot of memory for the OS page cache
• http://wiki.apache.org/solr/ShawnHeisey
Tuning Disk usage
• Atomic updates are costly
• Lookup from transaction log
• Lookup from Index (all stored fields)
• Combine
• Index
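The four steps above can be pictured with a toy in-memory model. This is not Solr's implementation: the dictionaries standing in for the transaction log and the index, and the function name, are assumptions made only to show why every atomic update rewrites the whole document.

```python
def atomic_update(tlog, index, doc_id, changes):
    """Toy model of an atomic update; tlog and index are plain dicts."""
    # 1. Look up the latest version in the transaction log first ...
    current = tlog.get(doc_id)
    if current is None:
        # 2. ... then fall back to the index, reading all stored fields.
        current = index.get(doc_id)
    if current is None:
        raise KeyError(doc_id)
    # 3. Combine the existing fields with the requested changes.
    merged = dict(current)
    merged.update(changes)
    # 4. Index the complete document again, not just the changed field.
    tlog[doc_id] = merged
    return merged


index = {"1": {"id": "1", "title": "Solr", "price": 10}}
tlog = {}
print(atomic_update(tlog, index, "1", {"price": 12}))
```

Even though only `price` changed, the full document is read and written back, which is where the extra disk work comes from.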
Tuning Disk Usage
• Experiment with merge policies
• TieredMergePolicy is great but
LogByteSizeMergePolicy can be better if multiple
indexes are sharing a single disk
• Increase buffer size - ramBufferSizeMB (>1024M
doesn’t help, may reduce performance)
Tuning Disk Usage
• Always hard commit once in a while
• Best to use autoCommit and maxDocs
• Trims transaction logs
• Solution for slow startup times
• Use autoSoftCommit for new searchers
• commitWithin is a great way to commit frequently
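As a sketch of commitWithin, the snippet below prepares a batched JSON update request without sending it. The base URL, the collection name, and the 10-second window are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(base_url, docs, commit_within_ms=10000):
    """Prepare one POST that carries a batch of documents and asks Solr
    to make them visible within commit_within_ms milliseconds."""
    query = urlencode({"commitWithin": commit_within_ms})
    return Request(
        url="%s/update?%s" % (base_url.rstrip("/"), query),
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and collection name.
req = build_update_request(
    "http://localhost:8983/solr/collection1",
    [{"id": "1"}, {"id": "2"}],
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a Solr server.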
Tuning Network
• Batch writes together as much as possible
• Use CloudSolrServer in SolrCloud always
• Routes updates intelligently to correct leader
• ConcurrentUpdateSolrServer (previously known as StreamingUpdateSolrServer) for indexing in non-Cloud mode
• Don’t use it for querying!
Tuning Network
• Share HttpClient instance for all Solrj clients or just re-use the same client object
• Disable retries on HttpClient
Tuning Network
• Distributed Search is optimised if you ask for
fl=id,score only
• Avoid numShard*rows stored field lookups
• Saves numShard network calls
Tuning Network
• Consider setting up a caching proxy such as squid or varnish in front of
your Solr cluster
• Solr can emit the right cache headers if configured in solrconfig.xml
• Last-Modified and ETag headers are generated based on the
properties of the index such as last searcher open time
• You can even force new ETag headers by changing the ETag seed
value
• <httpCaching never304="false"><cacheControl>max-age=30, public</cacheControl></httpCaching>
• The above config will set responses to be cached for 30s by your caching proxy unless the index is modified.
Avoid wastage
• Don’t store what you don’t need back
• Use stored=false
• Don’t index what you don’t search
• Use indexed=false
• Don’t retrieve what you don’t need back
• Don’t use fl=* unless necessary
• Don’t use rows=10 when all you need is numFound; use rows=0
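A minimal sketch of a count-only request: it asks for zero rows and reads `numFound` from the JSON response. The helper names are hypothetical, and the sample body is hand-written in the shape of Solr's JSON response purely for illustration.

```python
import json
from urllib.parse import urlencode


def count_params(query):
    """Query string for a request that needs only the hit count."""
    return urlencode({"q": query, "rows": 0, "wt": "json"})


def num_found(response_body):
    """Extract the total hit count from a Solr JSON response body."""
    return json.loads(response_body)["response"]["numFound"]


# Hand-written sample response, not real data.
sample = (
    '{"responseHeader": {"status": 0},'
    ' "response": {"numFound": 42, "start": 0, "docs": []}}'
)
print(count_params("title:solr"))
print(num_found(sample))
```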
Reduce indexed info
• omitNorms=true - Use if you don’t need index-time boosts or length normalization
• omitTermFreqAndPositions=true - Use if you don’t need
term frequencies and positions
• No fuzzy query, no phrase queries
• Can do simple exists check, can do simple AND/OR
searches on terms
• No scoring difference whether the term exists once or a
thousand times
DocValue tricks & gotchas
• DocValue field should be stored=false, indexed=false
• It can still be retrieved using fl=field(my_dv_field)
• If you store DocValue field, it uses extra space as a stored
field also.
• In future, update-able doc value fields will be supported
by Solr but they’ll work only if stored=false,
indexed=false
• DocValues also save disk space (storing all values of a field next to each other leads to very efficient compression)
Deep paging
• Bulk exporting documents from Solr will bring it to
its knees
• Enter deep paging and the cursorMark parameter
• Specify cursorMark=* on the first request
• Use the returned ‘nextCursorMark’ value as the cursorMark parameter of the next request
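The cursor loop can be sketched as follows. `fetch_page` is a hypothetical callable that takes the request parameters and returns the parsed JSON response; `id` is assumed to be the uniqueKey field. Solr requires the sort to include the uniqueKey when a cursor is used, and the export is complete when the returned cursor equals the one that was sent.

```python
def export_all(fetch_page, rows=100):
    """Yield every matching document by following cursorMark."""
    cursor = "*"  # the first request always starts with cursorMark=*
    while True:
        page = fetch_page({
            "q": "*:*",
            "rows": rows,
            "sort": "id asc",  # must include the uniqueKey field
            "cursorMark": cursor,
        })
        for doc in page["response"]["docs"]:
            yield doc
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:
            # The cursor did not advance: nothing is left to fetch.
            break
        cursor = next_cursor
```

Because `fetch_page` is passed in, the loop can be exercised with a stub before it is pointed at a real HTTP client.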
Classic paging vs Deep paging
LucidWorks Open Source
• Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr): https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash
LucidWorks
• We’re hiring!
• Work on open source Apache Lucene/Solr
• Help our customers win
• Work remotely from home! Location no bar!
• Contact me at shalin@apache.org

  • 6. Tuning (CPU) Queries • Warm those caches • Auto-warming • Warming queries • firstSearcher • newSearcher
  • 7. Tuning (CPU) Queries • Stop using primitive number/date fields if you are performing range queries • facet.query (sometimes) or facet.range are also range queries • Use Trie* Fields • When performing range queries on a string field (rare use-case), use frange to trade off memory for speed • It will un-invert the field • No additional cost is paid if the field is already being used for sorting or other function queries • fq={!frange l=martin u=rowling}author_last_name instead of fq=author_last_name:[martin TO rowling]
  • 8. Tuning (CPU) Queries • Faceting methods • facet.method=enum - great for fields with few unique values • facet.enum.cache.minDf - use filter cache or iterate through DocsEnum • facet.method=fc • facet.method=fcs (per-segment) • facet.sort=index is faster than facet.sort=count but useless in typical cases
  • 9. Tuning (CPU) Queries • ReRankQueryParser • Like a PostFilter but for queries! • Run expensive queries last • Solr 4.9+ only (soon to be released)
  • 10. Tuning (CPU) Queries • Divide and conquer • Shard ’em out • Use multiple CPUs • Sometimes multiple cores are the answer, even for small indexes and especially for high update rates
  • 11. Tuning Memory Usage • Use DocValues for sorting/faceting/grouping • The docValuesFormat options {‘default’, ‘memory’, ‘direct’} have different trade-offs. • default - Helps avoid OOM but uses disk and OS page cache • memory - compressed in-memory format • direct - uncompressed in-memory format
  • 12. Tuning Memory usage • termIndexInterval - Choose how often terms are loaded into term dictionary. Default is 128.
  • 13. Tuning Memory Usage • Garbage Collection pauses kill search performance • GC pauses expire ZK sessions in SolrCloud leading to many problems • Large heap sizes are almost never the answer • Leave a lot of memory for the OS page cache • http://wiki.apache.org/solr/ShawnHeisey
  • 14. Tuning Disk usage • Atomic updates are costly • Lookup from transaction log • Lookup from Index (all stored fields) • Combine • Index
  • 15. Tuning Disk Usage • Experiment with merge policies • TieredMergePolicy is great but LogByteSizeMergePolicy can be better if multiple indexes are sharing a single disk • Increase buffer size - ramBufferSizeMB (>1024M doesn’t help, may reduce performance)
  • 16. Tuning Disk Usage • Always hard commit once in a while • Best to use autoCommit and maxDocs • Trims transaction logs • Solution for slow startup times • Use autoSoftCommit for new searchers • commitWithin is a great way to commit frequently
  • 17. Tuning Network • Batch writes together as much as possible • Use CloudSolrServer in SolrCloud always • Routes updates intelligently to correct leader • ConcurrentUpdateSolrServer (previously known as StreamingUpdateSolrServer) for indexing in non-Cloud mode • Don’t use it for querying!
  • 18. Tuning Network • Share one HttpClient instance across all SolrJ clients, or just re-use the same client object • Disable retries on HttpClient
  • 19. Tuning Network • Distributed Search is optimised if you ask for fl=id,score only • Avoid numShard*rows stored field lookups • Saves numShard network calls
  • 20. Tuning Network • Consider setting up a caching proxy such as squid or varnish in front of your Solr cluster • Solr can emit the right cache headers if configured in solrconfig.xml • Last-Modified and ETag headers are generated based on the properties of the index such as last searcher open time • You can even force new ETag headers by changing the ETag seed value • <httpCaching never304="true"><cacheControl>max-age=30, public</cacheControl></httpCaching> • The above config will set responses to be cached for 30s by your caching proxy unless the index is modified.
  • 21. Avoid wastage • Don’t store what you don’t need back • Use stored=false • Don’t index what you don’t search • Use indexed=false • Don’t retrieve what you don’t need back • Don’t use fl=* unless necessary • Don’t use rows=10 when all you need is numFound; use rows=0
  • 22. Reduce indexed info • omitNorms=true - Use if you don’t need index-time boosts • omitTermFreqAndPositions=true - Use if you don’t need term frequencies and positions • No fuzzy query, no phrase queries • Can do simple exists check, can do simple AND/OR searches on terms • No scoring difference whether the term exists once or a thousand times
  • 23. DocValue tricks & gotchas • DocValue field should be stored=false, indexed=false • It can still be retrieved using fl=field(my_dv_field) • If you also store a DocValue field, it takes extra space as a stored field. • In the future, updatable doc value fields will be supported by Solr but they’ll work only if stored=false, indexed=false • DocValues also save disk space (values stored next to each other compress very efficiently)
  • 24. Deep paging • Bulk exporting documents from Solr with classic paging will bring it to its knees • Enter deep paging and the cursorMark parameter • Specify cursorMark=* on the first request • Use the returned ‘nextCursorMark’ value as the cursorMark parameter of the next request
  • 25. Classic paging vs Deep paging
  • 26. LucidWorks Open Source • Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk • Logstash for Solr: https://github.com/LucidWorks/solrlogmanager • Banana (Kibana for Solr): https://github.com/LucidWorks/banana • Data Quality Toolkit: https://github.com/LucidWorks/data-quality • Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash
  • 27. LucidWorks • We’re hiring! • Work on open source Apache Lucene/Solr • Help our customers win • Work remotely from home! Location no bar! • Contact me at shalin@apache.org
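
The filter advice on slide 5 can be sketched as a set of request parameters. This is only an illustration built with the Python standard library: the field name last_modified and the helper build_filter_params are assumptions made for the example, while the two non-cached filters are copied from the slide.

```python
from urllib.parse import urlencode, parse_qsl


def build_filter_params(user_query):
    """Return Solr request parameters as a list of (name, value) pairs."""
    return [
        ("q", user_query),
        # Rounded date math: every request made on the same day produces the
        # same filter string, so its filter cache entry can be reused.
        # (last_modified is a hypothetical field name.)
        ("fq", "last_modified:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]"),
        # A filter that is not worth caching: skip the filter cache.
        ("fq", "{!cache=false}year:[2005 TO *]"),
        # A non-cached filter with an explicit cost; lower costs run first.
        ("fq", "{!geofilt sfield=location pt=22,-127 d=50 cache=false cost=50}"),
    ]


params = build_filter_params("solr")
query_string = urlencode(params)  # percent-encodes braces, slashes, spaces
print(query_string)
```

Passing a list of pairs rather than a dict keeps the repeated fq parameters, one per filter, so each filter is cached (or not) independently.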
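
Slide 17 recommends batching writes. A minimal sketch of client-side batching follows, assuming a caller-supplied send_batch callable (hypothetical; in practice it would wrap whatever client posts documents to Solr). It is not tied to any particular Solr client library.

```python
from itertools import islice


def batched(docs, batch_size):
    """Yield lists of at most batch_size documents from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def index_all(docs, send_batch, batch_size=1000):
    """Send docs in batches and return the number of batches sent.

    send_batch is a hypothetical callable that takes one list of documents
    and performs a single update request with it.
    """
    sent = 0
    for batch in batched(docs, batch_size):
        send_batch(batch)
        sent += 1
    return sent
```

Because batched consumes the input lazily, the same helper works for a generator that streams documents from a database or a file without holding them all in memory.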
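
The cursorMark steps on slide 24 can be sketched as a loop. This is a sketch under stated assumptions: fetch_page is a hypothetical caller-supplied callable that sends the given parameters to Solr and returns the parsed JSON response as a dict, and id is assumed to be the uniqueKey field, since cursor paging requires the sort to include the uniqueKey.

```python
def export_all(fetch_page, query="*:*", sort="id asc", rows=500):
    """Yield every matching document using cursorMark paging.

    fetch_page is a hypothetical callable: it takes a dict of request
    parameters and returns the parsed JSON response as a dict.
    """
    cursor = "*"  # the first request always starts with cursorMark=*
    while True:
        response = fetch_page(
            {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        )
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        # When the returned cursor equals the one just sent, nothing is left.
        if next_cursor == cursor:
            return
        cursor = next_cursor
```

Unlike classic paging with a growing start offset, each request here carries only the cursor returned by the previous one, so the cost per page stays roughly constant however deep the export goes.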