SolrCloud: Searching Big Data
Shalin Shekhar Mangar
What is SolrCloud?
Subset of optional features in Solr to enable and
simplify horizontal scaling of a search index using
sharding and replication.
Goals
performance, scalability, high-availability,
simplicity, and elasticity
Terminology
• ZooKeeper: Distributed coordination service that provides centralized configuration, cluster state management, and leader election
• Node: JVM process bound to a specific port on a machine; hosts the Solr web application
• Collection: Search index distributed across multiple nodes; each collection has a name, shard count, and replication factor
• Replication Factor: Number of copies of a document in a collection
• Shard: Logical slice of a collection; each shard has a name, hash range, leader, and replication factor. Documents are assigned to one and only one shard per collection using a hash-based document routing strategy
• Replica: Solr index that hosts a copy of a shard in a collection; behind the scenes, each replica is implemented as a Solr core
• Leader: Replica in a shard that assumes special duties needed to support distributed indexing in Solr; each shard has one and only one leader at any time, and leaders are elected using ZooKeeper
High-level Architecture
Collection == Distributed Index
A collection is a distributed index defined by:
• named configuration stored in ZooKeeper
• number of shards: documents are distributed across N partitions of the index
• document routing strategy: how documents get assigned to shards
• replication factor: how many copies of each document in the collection
Collections API:
http://localhost:8983/solr/admin/collections?action=create&name=logstash4solr&replicationFactor=2&numShards=2&collection.configName=logs
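A minimal sketch of how the Collections API request above could be assembled in code. The helper names `create_collection_url` and `total_cores` are hypothetical; the parameter names and values are the ones from the slide. Nothing is sent over the network.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor, config_name):
    """Build the Collections API request URL shown on the slide (no request is sent)."""
    params = {
        "action": "create",
        "name": name,
        "replicationFactor": replication_factor,
        "numShards": num_shards,
        "collection.configName": config_name,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


def total_cores(num_shards, replication_factor):
    """Each shard is hosted by replication_factor replicas, and each replica is a Solr core."""
    return num_shards * replication_factor


url = create_collection_url("http://localhost:8983/solr", "logstash4solr", 2, 2, "logs")
print(url)
print(total_cores(2, 2))
```

With 2 shards and a replication factor of 2, the collection is backed by 4 cores spread over the cluster.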
Sharding
• Collection has a fixed number of shards
- existing shards can be split
• When to shard?
- Large number of docs
- Large document sizes
- Parallelization during indexing and queries
- Data partitioning (custom hashing)
Document Routing
• Each shard covers a hash range
• Default: Hash ID into 32-bit integer, map to range
- leads to (roughly) balanced shards
• Custom hashing (example in a few slides)
• Tri-level: app!user!doc
• Implicit: no hash range set for shards
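The default routing idea (hash the ID into a 32-bit integer, then map it to a shard's range) can be sketched as below. This is an illustration under assumptions: `zlib.crc32` stands in for Solr's real hash function, the ranges are unsigned, and `shard_ranges` and `route` are hypothetical helper names.

```python
import zlib


def shard_ranges(num_shards):
    """Split the unsigned 32-bit hash space into num_shards contiguous ranges."""
    size = 2 ** 32
    bounds = [size * i // num_shards for i in range(num_shards + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_shards)]


def route(doc_id, ranges):
    """Return the index of the shard whose hash range contains the hash of doc_id."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, not the one Solr uses
    for index, (low, high) in enumerate(ranges):
        if low <= h <= high:
            return index
    raise ValueError("hash falls outside every shard range")


ranges = shard_ranges(2)
for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "-> shard", route(doc_id, ranges))
```

Because the ranges cover the whole hash space without overlap, every document lands in exactly one shard.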
Replication
• Why replicate?
- High-availability
- Load balancing
• How does it work in SolrCloud?
- Near-real-time, not master-slave
- Leader forwards to replicas in parallel, waits for response
- Error handling during indexing is tricky
Example: Indexing
Example: Querying
Distributed Indexing
1. Get cluster state from ZK
2. Route document directly to leader (hash on doc ID)
3. Persist document on durable storage (tlog)
4. Forward to healthy replicas
5. Acknowledge write success to client
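Steps 3 to 5 of the indexing flow can be sketched with a toy in-memory model. Everything here is hypothetical (`Replica`, `index_document`, the `healthy` flag); the real leader forwards to replicas in parallel, while this sketch loops sequentially for brevity.

```python
class Replica:
    """Toy replica: a list stands in for the tlog, a dict for the index."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.tlog = []
        self.index = {}

    def apply(self, doc):
        self.tlog.append(doc)          # persist to the (simulated) transaction log first
        self.index[doc["id"]] = doc    # then add the document to the index


def index_document(leader, replicas, doc):
    """Persist on the leader, forward to healthy replicas, then acknowledge."""
    leader.apply(doc)
    forwarded = []
    for replica in replicas:
        if replica.healthy:
            replica.apply(doc)
            forwarded.append(replica.name)
    return {"status": "ok", "forwarded_to": forwarded}


leader = Replica("leader")
replicas = [Replica("r1"), Replica("r2", healthy=False)]
print(index_document(leader, replicas, {"id": "httpd!2", "level_s": "ERROR"}))
```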
Shard Leader
• Additional responsibilities during indexing only! Not a master node
• Leader is a replica (handles queries)
• Accepts update requests for the shard
• Increments the _version_ on the new or updated doc
• Sends updates (in parallel) to all replicas
Distributed Queries
1. Query client can be ZK aware or just query via a load balancer
2. Client can send query to any node in the cluster
3. Controller node distributes the query to a replica for each shard to identify documents matching query
4. Controller node sorts the results from step 3 and issues a second query for all fields for a page of results
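Steps 3 and 4 describe a two-phase scatter-gather. A toy sketch follows, assuming each shard is a plain dict of documents and that every document carries a precomputed `score` field (in a real query the score is computed per request). The function name `distributed_query` is hypothetical.

```python
import heapq


def distributed_query(shards, matches, rows):
    """Two-phase query: collect (score, id) per shard, merge, then fetch full docs."""
    # Phase 1: each shard reports only the score and ID of its best matches.
    candidates = []
    for shard in shards:
        hits = [(doc["score"], doc["id"]) for doc in shard.values() if matches(doc)]
        candidates.extend(heapq.nlargest(rows, hits))

    # The controller merges the per-shard lists and keeps one page of results.
    page = heapq.nlargest(rows, candidates)

    # Phase 2: fetch all fields, but only for the documents on the page.
    by_id = {doc_id: doc for shard in shards for doc_id, doc in shard.items()}
    return [by_id[doc_id] for _, doc_id in page]


shard_a = {
    "a1": {"id": "a1", "level_s": "ERROR", "score": 0.9},
    "a2": {"id": "a2", "level_s": "INFO", "score": 0.8},
}
shard_b = {
    "b1": {"id": "b1", "level_s": "ERROR", "score": 0.95},
    "b2": {"id": "b2", "level_s": "ERROR", "score": 0.4},
}
results = distributed_query([shard_a, shard_b], lambda d: d["level_s"] == "ERROR", rows=2)
print([doc["id"] for doc in results])  # ['b1', 'a1']
```

Keeping the first phase down to IDs and scores means full documents cross the network only for the page that is actually returned.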
Scalability / Stability Highlights
• All nodes in cluster perform indexing and execute queries; no master node
• Distributed indexing: No SPoF, high throughput via direct updates to leaders, automated failover to new leader
• Distributed queries: Add replicas to scale out QPS; parallelize complex query computations; fault tolerance
• Indexing / queries continue so long as there is 1 healthy replica per shard
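The last point can be written as a one-line predicate. A minimal sketch, assuming cluster health is given as a mapping from shard name to a list of per-replica health flags; the function name `cluster_is_serving` and the shard names are hypothetical.

```python
def cluster_is_serving(shards):
    """True when every shard has at least one healthy replica."""
    return all(any(replicas) for replicas in shards.values())


print(cluster_is_serving({"shard1": [True, False], "shard2": [False, True]}))   # True
print(cluster_is_serving({"shard1": [True, True], "shard2": [False, False]}))   # False
```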
SolrCloud and CAP
• A distributed system should be: Consistent, Available, and Partition tolerant
• CAP says pick 2 of the 3! (slightly more nuanced than that in reality)
• SolrCloud favors consistency over write-availability (CP)
• All replicas in a shard have the same data
• Active replica sets concept (writes accepted so long as a shard has at least one active replica available)
• No tools to detect or fix consistency issues in Solr
– Reads go to one replica; no concept of quorum
– Writes must fail if consistency cannot be guaranteed (SOLR-5468)
ZooKeeper
• Is a very good thing ... clusters are a zoo!
• Centralized configuration management
• Cluster state management
• Leader election (shard leader and overseer)
• Overseer distributed work queue
• Live Nodes
– Ephemeral znodes used to signal a server is gone
• Needs 3 nodes for quorum in production
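The quorum requirement comes down to majority arithmetic: an ensemble keeps working only while more than half of its servers are up. A small sketch with hypothetical helper names:

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be lost while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 4, 5):
    print(n, "servers: quorum", quorum_size(n), "tolerates", tolerated_failures(n), "failure(s)")
```

Three servers is the smallest ensemble that survives the loss of one; a fourth server adds no extra tolerance.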
ZooKeeper: Centralized Configuration
• Store config files in ZooKeeper
• Solr nodes pull config during core initialization
• Config sets can be “shared” across collections
• Changes are uploaded to ZK and then collections should be reloaded
ZooKeeper: State Management
• Keep track of /live_nodes znode
• Ephemeral nodes
• ZooKeeper client timeout
• Collection metadata and replica state in /clusterstate.json
• Every core has watchers for /live_nodes and /clusterstate.json
• Leader election
• ZooKeeper sequence numbers on ephemeral znodes
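Leader election with sequence numbers on ephemeral znodes can be modelled in a few lines. This is a toy simulation of the general recipe, not Solr's or ZooKeeper's code: `ToyElection` and its methods are hypothetical, and a dict stands in for the ephemeral sequential znodes.

```python
class ToyElection:
    """Each candidate gets an increasing sequence number; the lowest one leads."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}  # sequence number -> node name

    def join(self, node):
        """Simulate creating an ephemeral sequential znode for this node."""
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = node
        return seq

    def session_expired(self, node):
        """Simulate ZooKeeper removing a node's ephemeral znodes after a timeout."""
        self._candidates = {s: n for s, n in self._candidates.items() if n != node}

    def leader(self):
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


election = ToyElection()
for name in ("node-a", "node-b", "node-c"):
    election.join(name)
print(election.leader())             # node-a
election.session_expired("node-a")
print(election.leader())             # node-b
```

A node that rejoins after a timeout gets a new, higher sequence number, so it queues behind the current leader instead of displacing it.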
Overseer
• What does it do?
– Persists collection state change events to ZooKeeper
– Controller for Collection API commands
– Ordered updates
– One per cluster (for all collections); elected using leader election
• How does it work?
– Asynchronous (pub/sub messaging)
– ZooKeeper as distributed queue recipe
– Automated failover to a healthy node
– Can be assigned to a dedicated node (SOLR-5476)
Custom Hashing
• Route documents to specific shards based on a shard key component in the document ID
• Send all log messages from the same system to the same shard
• Direct queries to specific shards: q=...&_route_=httpd
{
"id" : "httpd!2",
"level_s" : "ERROR",
"lang_s" : "en",
...
},
Hash:
shardKey!docID
Custom Hashing Highlights
• Co-locate documents having a common property in the same shard
- e.g. docs having IDs httpd!21 and httpd!33 will be in the same shard
• Scale up the replicas for specific shards to address high query and/or indexing volume from specific apps
• Not as much control over the distribution of keys
- httpd, mysql, and collectd all in same shard
Shard Splitting
• Can split unbalanced shards when using custom hashing
• Can split shards into two sub-shards
• Live splitting! No downtime needed!
• Requests start being forwarded to sub-shards automatically
• Expensive operation: Use as required during low traffic
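Splitting a shard into two sub-shards amounts to halving its hash range. The sketch below shows that arithmetic plus how a split request could be assembled with the Collections API's SPLITSHARD action. The helper names are hypothetical, the shard name `shard1` is an assumed example, and no request is sent.

```python
from urllib.parse import urlencode


def split_range(low, high):
    """Divide an inclusive hash range into two adjacent halves."""
    mid = low + (high - low) // 2
    return (low, mid), (mid + 1, high)


def split_shard_url(base_url, collection, shard):
    """Build a SPLITSHARD request URL (nothing is sent)."""
    params = {"action": "SPLITSHARD", "collection": collection, "shard": shard}
    return f"{base_url}/admin/collections?{urlencode(params)}"


print(split_range(0, 2 ** 31 - 1))
print(split_shard_url("http://localhost:8983/solr", "logstash4solr", "shard1"))
```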
Other features / highlights
• Near-Real-Time Search: Documents are visible within a second or so after being indexed
• Partial Document Update: Just update the fields you need to change on existing documents
• Optimistic Locking: Ensure updates are applied to the correct version of a document
• Transaction log: Better recoverability; peer-sync between nodes after hiccups
• HTTPS
• Use HDFS for storing indexes
• Use MapReduce for building index (SOLR-1301)
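Partial document update and optimistic locking fit together: a client changes only some fields and states which version it expects to be changing. A toy in-memory sketch follows; `ToyIndex`, `VersionConflict`, and the counter-based `_version_` values are hypothetical and only model the idea.

```python
class VersionConflict(Exception):
    """Raised when an update targets a stale version of a document."""


class ToyIndex:
    def __init__(self):
        self._docs = {}
        self._clock = 0

    def _next_version(self):
        self._clock += 1
        return self._clock

    def add(self, doc):
        """Store a whole document and stamp it with a fresh _version_."""
        stored = dict(doc)
        stored["_version_"] = self._next_version()
        self._docs[doc["id"]] = stored
        return stored["_version_"]

    def partial_update(self, doc_id, changes, expected_version):
        """Change only the given fields, and only if the version still matches."""
        current = self._docs[doc_id]
        if current["_version_"] != expected_version:
            raise VersionConflict(f"expected {expected_version}, found {current['_version_']}")
        current.update(changes)
        current["_version_"] = self._next_version()
        return current["_version_"]

    def get(self, doc_id):
        return dict(self._docs[doc_id])


index = ToyIndex()
v1 = index.add({"id": "httpd!2", "level_s": "ERROR", "lang_s": "en"})
v2 = index.partial_update("httpd!2", {"level_s": "WARN"}, expected_version=v1)
print(index.get("httpd!2"))
```

A second writer still holding `v1` would get a `VersionConflict` instead of silently overwriting the first change.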
More?
• Workshop: Apache Solr in Minutes tomorrow
• https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
• shalin@apache.org
• http://twitter.com/shalinmangar
• http://shal.in
Attributions
• Tim Potter's slides on “Introduction to SolrCloud” at
Lucene/Solr Exchange 2014
– http://twitter.com/thelabdude
• Erik Hatcher's slides on “Solr: Search at the speed of
light” at JavaZone 2009
– http://twitter.com/ErikHatcher
GIDS2014: SolrCloud: Searching Big Data

Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 

GIDS2014: SolrCloud: Searching Big Data

  • 1. SolrCloud: Searching Big Data Shalin Shekhar Mangar
  • 2. What is SolrCloud?
      Subset of optional features in Solr to enable and simplify horizontally scaling a search index using sharding and replication.
      Goals: performance, scalability, high-availability, simplicity, and elasticity
  • 3. Terminology
      ● ZooKeeper: Distributed coordination service that provides centralized configuration, cluster state management, and leader election
      ● Node: JVM process bound to a specific port on a machine; hosts the Solr web application
      ● Collection: Search index distributed across multiple nodes; each collection has a name, shard count, and replication factor
      ● Replication Factor: Number of copies of a document in a collection
  • 4. Terminology
      • Shard: Logical slice of a collection; each shard has a name, hash range, leader, and replication factor. Documents are assigned to one and only one shard per collection using a hash-based document routing strategy
      • Replica: Solr index that hosts a copy of a shard in a collection; behind the scenes, each replica is implemented as a Solr core
      • Leader: Replica in a shard that assumes special duties needed to support distributed indexing in Solr; each shard has one and only one leader at any time and leaders are elected using ZooKeeper
  • 6. Collection == Distributed Index
      A collection is a distributed index defined by:
      • named configuration stored in ZooKeeper
      • number of shards: documents are distributed across N partitions of the index
      • document routing strategy: how documents get assigned to shards
      • replication factor: how many copies of each document in the collection
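The four defining properties map onto the parameters of a Collections API request. The sketch below builds such a request URL in Python, using the values from the talk's own example (collection `logstash4solr`, 2 shards, replication factor 2, config set `logs`). The helper names `create_collection_url` and `total_replicas` are hypothetical and exist only for this illustration.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor, config_name):
    """Build a Collections API CREATE request URL (hypothetical helper)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,
    }
    return base_url + "/admin/collections?" + urlencode(params)


def total_replicas(num_shards, replication_factor):
    """Each shard is hosted by `replication_factor` replicas (Solr cores)."""
    return num_shards * replication_factor


url = create_collection_url("http://localhost:8983/solr", "logstash4solr", 2, 2, "logs")
print(url)
print(total_replicas(2, 2))
```

With 2 shards and a replication factor of 2, the cluster has to host 4 cores for this collection.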
  • 8. Sharding
      ● Collection has a fixed number of shards
          - existing shards can be split
      ● When to shard?
          - Large number of docs
          - Large document sizes
          - Parallelization during indexing and queries
          - Data partitioning (custom hashing)
  • 9. Document Routing
      ● Each shard covers a hash-range
      ● Default: Hash ID into 32-bit integer, map to range
          - leads to balanced (roughly) shards
      ● Custom-hashing (example in a few slides)
      ● Tri-level: app!user!doc
      ● Implicit: no hash-range set for shards
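A minimal sketch of the default routing idea: hash the document ID into a 32-bit integer, then find the shard whose range contains it. Two simplifications are assumptions made for this illustration: CRC32 stands in for Solr's actual hash function (MurmurHash3), and the hash space is treated as unsigned. The function names are hypothetical.

```python
import zlib


def shard_ranges(num_shards):
    """Split the 32-bit hash space [0, 2**32) into contiguous, near-equal ranges."""
    total = 2 ** 32
    bounds = [i * total // num_shards for i in range(num_shards + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_shards)]


def hash32(doc_id):
    # Stand-in hash for illustration only (assumption); Solr uses MurmurHash3.
    return zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF


def route(doc_id, ranges):
    """Return the index of the shard whose hash range covers this document ID."""
    h = hash32(doc_id)
    for shard, (lo, hi) in enumerate(ranges):
        if lo <= h <= hi:
            return shard
    raise ValueError("hash ranges do not cover the full 32-bit space")


ranges = shard_ranges(4)
for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "-> shard", route(doc_id, ranges))
```

Because a reasonable hash spreads IDs evenly over the 32-bit space, equal-width ranges give roughly balanced shards.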
  • 10. Replication
      • Why replicate?
          - High-availability
          - Load balancing
      ● How does it work in SolrCloud?
          - Near-real-time, not master-slave
          - Leader forwards to replicas in parallel, waits for response
          - Error handling during indexing is tricky
  • 13. Distributed Indexing
      1. Get cluster state from ZK
      2. Route document directly to leader (hash on doc ID)
      3. Persist document on durable storage (tlog)
      4. Forward to healthy replicas
      5. Acknowledge write success to client
  • 14. Shard Leader
      ● Additional responsibilities during indexing only! Not a master node
      ● Leader is a replica (handles queries)
      ● Accepts update requests for the shard
      ● Increments the _version_ on the new or updated doc
      ● Sends updates (in parallel) to all replicas
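The write path through a shard leader (log the document, assign a `_version_`, forward to healthy replicas, acknowledge) can be sketched as an in-memory toy model. This is not Solr code: the class names are hypothetical, the version is a simple counter chosen for illustration, and forwarding is sequential here rather than parallel. Cluster-state lookup and routing to the right leader are assumed to have already happened.

```python
import itertools


class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.tlog = []    # append-only list standing in for the durable transaction log
        self.index = {}   # doc id -> latest copy of the document

    def apply(self, doc):
        self.tlog.append(doc)            # persist to the log first
        self.index[doc["id"]] = doc      # then update the index


class ShardLeader(Replica):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers
        self._versions = itertools.count(1)  # assumption: plain counter for _version_

    def update(self, doc):
        versioned = dict(doc, _version_=next(self._versions))
        self.apply(versioned)                # leader logs and indexes the doc itself
        acked = []
        for replica in self.followers:       # sequential here; the talk says parallel
            if replica.healthy:
                replica.apply(versioned)
                acked.append(replica.name)
        # Acknowledge to the client only after the forwarding step has finished.
        return {"status": "ok", "_version_": versioned["_version_"], "replicas": acked}


r1, r2 = Replica("r1"), Replica("r2", healthy=False)
leader = ShardLeader("leader", [r1, r2])
print(leader.update({"id": "1", "level_s": "ERROR"}))
```

The leader is itself a replica in this model, which mirrors the point that it also serves queries and only takes on extra duties during indexing.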
  • 15. Distributed Queries
      1. Query client can be ZK aware or just query via a load balancer
      2. Client can send query to any node in the cluster
      3. Controller node distributes the query to a replica for each shard to identify documents matching query
      4. Controller node sorts the results from step 3 and issues a second query for all fields for a page of results
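Steps 3 and 4 describe a two-phase scatter-gather. The sketch below models it in memory: each shard is a plain dict, every document carries a precomputed `score` field (an assumption that keeps the example small), and `distributed_query` is a hypothetical name.

```python
import heapq


def distributed_query(shards, matches, rows):
    """shards: list of dicts mapping doc id -> doc; each doc has 'id' and 'score'."""
    # Phase 1: each shard reports only (score, id) pairs for its own top `rows` hits.
    candidates = []
    for shard in shards:
        hits = [(doc["score"], doc["id"]) for doc in shard.values() if matches(doc)]
        candidates.extend(heapq.nlargest(rows, hits))

    # The controller merges the per-shard candidates and keeps one page of results.
    page = heapq.nlargest(rows, candidates)

    # Phase 2: fetch all fields, but only for the documents on that page.
    results = []
    for _score, doc_id in page:
        owner = next(shard for shard in shards if doc_id in shard)
        results.append(owner[doc_id])
    return results


shards = [
    {"a": {"id": "a", "score": 1.0, "level_s": "ERROR"},
     "b": {"id": "b", "score": 3.0, "level_s": "ERROR"}},
    {"c": {"id": "c", "score": 2.0, "level_s": "ERROR"},
     "d": {"id": "d", "score": 9.0, "level_s": "INFO"}},
]
top = distributed_query(shards, lambda doc: doc["level_s"] == "ERROR", rows=2)
print([doc["id"] for doc in top])
```

Sending only IDs and scores in the first phase keeps the merge step cheap; full documents cross the network only for the page the client asked for.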
  • 16. Scalability / Stability Highlights
      ● All nodes in cluster perform indexing and execute queries; no master node
      ● Distributed indexing: No SPoF, high throughput via direct updates to leaders, automated failover to new leader
      ● Distributed queries: Add replicas to scale-out qps; parallelize complex query computations; fault-tolerance
      ● Indexing / queries continue so long as there is 1 healthy replica per shard
  • 17. SolrCloud and CAP
      ● A distributed system should be: Consistent, Available, and Partition tolerant
      ● CAP says pick 2 of the 3! (slightly more nuanced than that in reality)
      ● SolrCloud favors consistency over write-availability (CP)
      ● All replicas in a shard have the same data
      ● Active replica sets concept (writes accepted so long as a shard has at least one active replica available)
  • 18. SolrCloud and CAP
      • No tools to detect or fix consistency issues in Solr
          – Reads go to one replica; no concept of quorum
          – Writes must fail if consistency cannot be guaranteed (SOLR-5468)
  • 19. ZooKeeper
      ● Is a very good thing ... clusters are a zoo!
      ● Centralized configuration management
      ● Cluster state management
      ● Leader election (shard leader and overseer)
      ● Overseer distributed work queue
      ● Live Nodes
          – Ephemeral znodes used to signal a server is gone
      ● Needs 3 nodes for quorum in production
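The "3 nodes for quorum" recommendation follows from majority arithmetic: a ZooKeeper ensemble stays available only while a strict majority of its servers can talk to each other. A short sketch (function names are hypothetical):

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Servers that can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 4, 5):
    print(n, "servers: quorum", quorum_size(n), "tolerates", tolerated_failures(n), "failure(s)")
```

Three servers is the smallest ensemble that survives the loss of one server. Going from three to four adds no extra tolerance, which is why odd ensemble sizes are the usual choice.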
  • 20. ZooKeeper: Centralized Configuration
      ● Store config files in ZooKeeper
      ● Solr nodes pull config during core initialization
      ● Config sets can be “shared” across collections
      ● Changes are uploaded to ZK and then collections should be reloaded
  • 21. ZooKeeper: State Management
      ● Keep track of /live_nodes znode
      ● Ephemeral nodes
      ● ZooKeeper client timeout
      ● Collection metadata and replica state in /clusterstate.json
      ● Every core has watchers for /live_nodes and /clusterstate.json
      ● Leader election
      ● ZooKeeper sequence numbers on ephemeral znodes
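Leader election with sequence numbers on ephemeral znodes can be sketched without a ZooKeeper client. Each candidate registers an ephemeral node that receives an increasing sequence suffix, and the candidate holding the lowest live sequence number is the leader. The znode names below are hypothetical and only follow the general "name_sequence" shape; this is a model of the recipe, not of Solr's actual znode layout.

```python
def sequence_number(znode):
    """Extract the numeric suffix from a name such as 'replica-a_0000000002'."""
    return int(znode.rsplit("_", 1)[1])


def elect_leader(live_znodes):
    """The candidate with the lowest sequence number among live znodes wins."""
    return min(live_znodes, key=sequence_number)


candidates = ["replica-b_0000000003", "replica-a_0000000002", "replica-c_0000000007"]
print(elect_leader(candidates))

# When the leader's session expires, its ephemeral znode disappears
# and the next-lowest candidate takes over.
candidates.remove("replica-a_0000000002")
print(elect_leader(candidates))
```

Because the znodes are ephemeral, a crashed server drops out of the candidate list on its own once its ZooKeeper client session times out.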
  • 22. Overseer
      ● What does it do?
          – Persists collection state change events to ZooKeeper
          – Controller for Collection API commands
          – Ordered updates
          – One per cluster (for all collections); elected using leader election
      ● How does it work?
          – Asynchronous (pub/sub messaging)
          – ZooKeeper as distributed queue recipe
          – Automated failover to a healthy node
          – Can be assigned to a dedicated node (SOLR-5476)
  • 23. Custom Hashing
      ● Route documents to specific shards based on a shard key component in the document ID
      ● Send all log messages from the same system to the same shard
      ● Direct queries to specific shards: q=...&_route_=httpd
      Example document: { "id" : "httpd!2", "level_s" : "ERROR", "lang_s" : "en", ... }
      Hash: shardKey!docID
  • 24. Custom Hashing Highlights
      ● Co-locate documents having a common property in the same shard
          - e.g. docs having IDs httpd!21 and httpd!33 will be in the same shard
      • Scale-up the replicas for specific shards to address high query and/or indexing volume from specific apps
      • Not as much control over the distribution of keys
          - httpd, mysql, and collectd all in same shard
      • Can split unbalanced shards when using custom hashing
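A sketch of why IDs of the form shardKey!docID end up co-located: if the high bits of the final hash come from the shard key and the low bits from the rest of the ID, every document sharing a key falls into the same narrow slice of the hash space. The 16/16 bit split, the CRC32 stand-in hash, and the function names are assumptions made for this illustration, and `shard_for` only models equal-width ranges with a power-of-two shard count.

```python
import zlib


def hash32(text):
    # Stand-in hash for illustration only (assumption), not Solr's hash function.
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def composite_hash(doc_id):
    """Upper 16 bits from the shard key, lower 16 bits from the rest of the ID."""
    if "!" not in doc_id:
        return hash32(doc_id)
    shard_key, rest = doc_id.split("!", 1)
    return (hash32(shard_key) & 0xFFFF0000) | (hash32(rest) & 0x0000FFFF)


def shard_for(doc_id, num_shards):
    """Map the hash onto equal-width ranges (num_shards assumed a power of two)."""
    return composite_hash(doc_id) * num_shards // 2 ** 32


for doc_id in ["httpd!21", "httpd!33", "mysql!7", "collectd!5"]:
    print(doc_id, "-> shard", shard_for(doc_id, 4))
```

The same model shows the downside noted above: which keys share a shard depends on the hash of each key, so several busy keys can land together and leave that shard unbalanced.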
  • 25. Shard Splitting
      • Can split shards into two sub-shards
      • Live splitting! No downtime needed!
      • Requests start being forwarded to sub-shards automatically
      • Expensive operation: Use as required during low traffic
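Conceptually, splitting a shard divides its hash range into two halves, one per sub-shard. The sketch below shows that arithmetic and builds a Collections API SPLITSHARD request URL. The shard name `shard1` and the helper names are assumptions for the example; the collection name is the one used in the talk.

```python
from urllib.parse import urlencode


def split_range(lo, hi):
    """Divide an inclusive hash range into two contiguous halves."""
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid + 1, hi)


def split_shard_url(base_url, collection, shard):
    """Build a SPLITSHARD request URL (hypothetical helper)."""
    params = {"action": "SPLITSHARD", "collection": collection, "shard": shard}
    return base_url + "/admin/collections?" + urlencode(params)


low_half, high_half = split_range(0x00000000, 0xFFFFFFFF)
print([hex(bound) for bound in low_half], [hex(bound) for bound in high_half])
print(split_shard_url("http://localhost:8983/solr", "logstash4solr", "shard1"))
```

Every document of the parent shard belongs to exactly one of the two halves, which is what lets requests be forwarded to the sub-shards once the split completes.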
  • 26. Other features / highlights
      • Near-Real-Time Search: Documents are visible within a second or so after being indexed
      • Partial Document Update: Just update the fields you need to change on existing documents
      • Optimistic Locking: Ensure updates are applied to the correct version of a document
      • Transaction log: Better recoverability; peer-sync between nodes after hiccups
      • HTTPS
      • Use HDFS for storing indexes
      • Use MapReduce for building index (SOLR-1301)
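Optimistic locking relies on the `_version_` field: the client sends the version it last saw, and the update is rejected if the stored document has changed since. The in-memory sketch below follows the `_version_` conventions as commonly documented for Solr (greater than 1: must match exactly; 1: document must exist; negative: document must not exist; 0 or absent: no check). The class names and the version counter are hypothetical and only illustrate the rule.

```python
class VersionConflict(Exception):
    """Raised when the caller's _version_ expectation does not hold."""


class VersionedStore:
    def __init__(self):
        self.docs = {}
        self._last_version = 100  # arbitrary start above 1; real values look different

    def update(self, doc):
        expected = doc.get("_version_", 0)
        current = self.docs.get(doc["id"])
        if expected > 1 and (current is None or current["_version_"] != expected):
            raise VersionConflict("stored version differs from the expected one")
        if expected == 1 and current is None:
            raise VersionConflict("document does not exist")
        if expected < 0 and current is not None:
            raise VersionConflict("document already exists")
        self._last_version += 1
        self.docs[doc["id"]] = dict(doc, _version_=self._last_version)
        return self._last_version


store = VersionedStore()
v1 = store.update({"id": "httpd!2", "level_s": "ERROR", "_version_": -1})  # create only
v2 = store.update({"id": "httpd!2", "level_s": "WARN", "_version_": v1})   # matches
try:
    store.update({"id": "httpd!2", "level_s": "INFO", "_version_": v1})    # stale version
except VersionConflict as err:
    print("rejected:", err)
```

A client that gets a conflict would typically re-read the document, reapply its change to the fresh copy, and retry with the new version.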
  • 27. More?
      • Workshop: Apache Solr in Minutes tomorrow
      • https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
      • shalin@apache.org
      • http://twitter.com/shalinmangar
      • http://shal.in
  • 28. Attributions
      • Tim Potter's slides on “Introduction to SolrCloud” at Lucene/Solr Exchange 2014 – http://twitter.com/thelabdude
      • Erik Hatcher's slides on “Solr: Search at the speed of light” at JavaZone 2009 – http://twitter.com/ErikHatcher