The Fifth Elephant 2013, Bangalore
12th July 2013
SolrCloud and NoSQL
Anshum Gupta
Who am I?
• Anshum Gupta
• Search and related stuff for around 8 years now
• Apache Lucene since 2006, Solr since 2010
• Currently:
• Helped launch the first AWS search service, CloudSearch.
• Places I've worked at:
Big Data
• Real Value = Process + Store + Search
• Search
- No longer expensive
- Affordable
- Necessity
- Can get as complicated as you'd want it to get.
[Figure: repeated "Loads of Data" blocks, labeled Data and Search]
NoSQL Databases
• Wikipedia says: A NoSQL database provides a mechanism for storage and retrieval of data that use looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. Some authors refer to them as "Not only SQL" to emphasize that some NoSQL systems do allow SQL-like query language to be used.
• Non-traditional data stores
• Doesn't use / isn't designed around SQL
• May not give full ACID guarantees
- Offers other advantages such as greater scalability as a tradeoff
• Distributed, fault-tolerant architecture
DB Rankings: Overall
Source: http://db-engines.com/en/ranking
Search Engine Rankings
Source: http://db-engines.com/en/ranking/search+engine
MongoDB
• Data Model: BSON
• Distributed Model: Sharded master-slave async replication.
• Consistency: Per table write lock.
• Search:
- Built-in full-text search, with large gaps compared to dedicated 'search' players.
- Alternate and popular solution: use another search solution (Solr?) along with MongoDB. This brings consistency issues and more.
Cassandra
• Data Model: Column based data store.
• Distributed Model: Uses consistent hashing for
distributed updates.
• Consistency: Timestamps for consistency.
• Search
- Lucandra : Lucene based search.
- Solandra : Solr based search.
Riak
• Implements principles from the Amazon Dynamo paper.
• Riak Search - Distributed index and full-text search engine.
- Merge Index: Storage backend used by Riak Search. It's a pure Erlang storage format and, among other things, uses the Apache Lucene file format.
- Riak Solr: Adds a subset of Apache Solr HTTP capabilities to Riak Search.
• Yokozuna
- "next generation of Riak Search that marries Riak with Apache Solr".
- Sits alongside Riak.
The story so far…
• Different approaches for:
- Data Model
- Distributed Update handling
- Consistency management
• Work reasonably well on different fronts as far as
storage is concerned.
• Search:
- There's barely anything native and in the core.
- (Almost) Everyone is trying to fuse together with Lucene/Solr.
Adding Search to NoSQL
• To begin with, NoSQL stores weren't built for search
• Compromises
• Integration is the buzzword.
• Lucandra, Solandra…No strong contender yet.
Adding NoSQL to Search
• Already store documents
• With growing data, more intuitive for this to happen
• More intuitive = makes more sense = easier (perhaps)
• No key player as yet.
Apache Solr 4 at a glance
• Document Oriented NoSQL Search Server
- Data-format agnostic (JSON, XML, CSV, binary)
- Schema-less options (more coming soon)
• Distributed
- Multi-tenanted
• Fault Tolerant
- HA + No single points of failure
• Atomic Updates
• Optimistic Concurrency
• Near Real-time Search
• Full-Text search + Hit Highlighting
• Tons of specialized queries: Faceted
search, grouping, pseudo-join, spatial search, functions
The desire for these features drove some of the "SolrCloud" architecture.
SolrCloud Design Goals
• Automatic Distributed Indexing
• HA for Writes
• Durable Writes
• Near Real-time Search
• Real-time get
• Optimistic Concurrency
SolrCloud
• Distributed Indexing designed from the ground up to
accommodate desired features
• CAP Theorem
- Consistency, Availability, Partition Tolerance (saying goes “choose 2”)
- Reality: Must handle P – the real choice is tradeoffs between C and A
• Ended up with a CP system (roughly)
- Value Consistency over Availability
- Eventual consistency is incompatible with optimistic concurrency
- Closest to MongoDB in architecture
• We still do well with Availability
- All N replicas of a shard must go down before we lose writability for that
shard
- For a network partition, the "big" partition remains active (i.e. Availability isn't "on" or "off")
SolrCloud
[Diagram: SolrCloud cluster coordinated by a ZooKeeper quorum]
Query: http://.../solr/collection1/query?q=awesome, served through load-balanced sub-requests
shard1: replica1, replica2, replica3
shard2: replica1, replica2, replica3
ZooKeeper quorum: 5 ZK nodes
ZooKeeper tree:
/configs
  /myconf
    solrconfig.xml
    schema.xml
/clusterstate.json
/aliases.json
/livenodes
  server1:8983/solr
  server2:8983/solr
/collections
  /collection1
    configName=myconf
    /shards
      /shard1
        server1:8983/solr
        server2:8983/solr
      /shard2
        server3:8983/solr
        server4:8983/solr
ZooKeeper holds cluster state
• Nodes in the cluster
• Collections in the cluster
• Schema & config for each collection
• Shards in each collection
• Replicas in each shard
• Collection aliases
Distributed Indexing
[Diagram: a document update sent to http://.../solr/collection1/update. Shard1 holds Replica1 and Replica2; Shard2 holds Replica3 and Replica4. Legend: leader, non-leading replica]
• Update sent to any node
• Solr determines which shard the document belongs to and forwards it to the shard leader
• Shard leader versions the document and forwards it to all other shard replicas
• HA for updates (if one leader fails, another takes its place)
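The versioning, forwarding and failover bullets above can be sketched with a toy model. This is a minimal illustration, not Solr code: the `Replica` and `Shard` classes, the integer version counter and the "first live replica is leader" rule are all assumptions made for the example.

```python
class Replica:
    """Hypothetical stand-in for one replica of a shard."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.docs = {}


class Shard:
    """Toy shard: one leader plus non-leading replicas."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.version = 0  # assumed simple counter, not Solr's version scheme

    def leader(self):
        # Assumption: the first live replica acts as leader.
        for replica in self.replicas:
            if replica.alive:
                return replica
        raise RuntimeError("no live replica: shard is not writable")

    def update(self, doc):
        leader = self.leader()
        self.version += 1
        versioned = dict(doc, _version_=self.version)  # leader versions the doc
        leader.docs[doc["id"]] = versioned
        for replica in self.replicas:                  # forward to other replicas
            if replica.alive and replica is not leader:
                replica.docs[doc["id"]] = versioned
        return leader.name
```

If `replica1` is marked as not alive, the next `update` call is handled by `replica2`; writes fail only when every replica of the shard is down.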
Optimistic Concurrency
• Conditional update based on document version
[Diagram: exchange between client and Solr]
2. Modify document, retaining _version_
4. Go back to step #1 if the update fails with code=409
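The retry loop can be sketched against an in-memory stand-in. Only steps 2 and 4 survive in the slide text, so steps 1 and 3 here (read the document, then submit a conditional update) are assumptions, as are `ToyStore`, `VersionConflict` and `update_with_retry`; none of them is a real Solr or SolrJ API.

```python
class VersionConflict(Exception):
    """Raised when the supplied _version_ is stale (the slide's code=409)."""
    status = 409


class ToyStore:
    """Hypothetical in-memory stand-in for Solr."""

    def __init__(self):
        self._docs = {}
        self._clock = 0

    def get(self, doc_id):
        # Return a copy so callers cannot mutate stored state directly.
        return dict(self._docs[doc_id])

    def update(self, doc):
        current = self._docs.get(doc["id"])
        if current is not None and doc.get("_version_") != current["_version_"]:
            raise VersionConflict(doc["id"])
        self._clock += 1
        self._docs[doc["id"]] = dict(doc, _version_=self._clock)
        return self._clock


def update_with_retry(store, doc_id, modify, max_attempts=5):
    for _ in range(max_attempts):
        doc = store.get(doc_id)       # step 1 (assumed): read doc and _version_
        modify(doc)                   # step 2: modify, retaining _version_
        try:
            return store.update(doc)  # step 3 (assumed): conditional update
        except VersionConflict:
            continue                  # step 4: go back to step 1 on 409
    raise RuntimeError("gave up after repeated version conflicts")
```

A writer holding a stale copy gets a conflict instead of silently overwriting a newer version, which is why the slides pair this feature with a consistency-first design.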
Distributed Query Requests
• Distributed query across all shards in the collection
http://localhost:8983/solr/collection1/query?q=foo
• Explicitly specify node addresses to load-balance across
shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
- Equivalent nodes are separated by "|"
- Different phases of the same distributed request use the same node
• Specify logical shards to search across
shards=NY,NJ,CT
• Specify multiple collections to search across
collection=collection1,collection2
• public CloudSolrServer(String zkHost)
- ZK-aware SolrJ Java client that load-balances across all nodes in the cluster
- Calculates where a document belongs and sends it directly to the shard leader (new)
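The `shards` parameter syntax from the Distributed Query Requests slide can be assembled programmatically. A small sketch, assuming the hypothetical helpers `shards_param` and `query_url`; only the URL shapes and parameter names come from the slide.

```python
from urllib.parse import urlencode


def shards_param(shard_groups):
    """Join equivalent replicas with '|' and distinct shards with ','."""
    return ",".join("|".join(replicas) for replicas in shard_groups)


def query_url(base, collection, q, **extra):
    """Build a query URL; ',', '|', ':' and '/' are left unescaped for readability."""
    params = {"q": q, **extra}
    return f"{base}/{collection}/query?{urlencode(params, safe=',|:/')}"


groups = [
    ["localhost:8983/solr", "localhost:8900/solr"],  # replicas of one shard
    ["localhost:7574/solr", "localhost:7500/solr"],  # replicas of another shard
]
url = query_url("http://localhost:8983/solr", "collection1", "foo",
                shards=shards_param(groups))
```

Passing `shards="NY,NJ,CT"` or `collection="collection1,collection2"` instead yields the logical-shard and multi-collection forms.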
Document Routing
[Diagram: hash ring, numShards=4, router=compositeId]
Hash ranges, one per shard (shard1 to shard4): 00000000-3fffffff, 40000000-7fffffff, 80000000-bfffffff, c0000000-ffffffff
id = BigCo!doc5 -> hash 9f27 3c71 (MurmurHash3)
q=my_query with shard.keys=BigCo! -> hash range 9f27 0000 to 9f27 ffff -> shard1
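The routing in the diagram can be approximated in a few lines. This is a sketch under loud assumptions: `zlib.crc32` is swapped in for MurmurHash3 to stay within the standard library, so the hash values will not match the slide, and only shard1's range (80000000-bfffffff) is confirmed by the slide's example; the other three range assignments are guesses.

```python
import zlib

# Only shard1's range is confirmed by the slide; the rest are assumed.
SHARD_RANGES = {
    "shard1": (0x80000000, 0xBFFFFFFF),
    "shard2": (0xC0000000, 0xFFFFFFFF),
    "shard3": (0x00000000, 0x3FFFFFFF),
    "shard4": (0x40000000, 0x7FFFFFFF),
}


def hash32(text):
    # Stand-in for MurmurHash3 (assumption): CRC32 as an unsigned 32-bit value.
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def composite_hash(doc_id):
    """Top 16 bits from the routing prefix, bottom 16 bits from the rest."""
    if "!" not in doc_id:
        return hash32(doc_id)
    prefix, rest = doc_id.split("!", 1)
    return (hash32(prefix) & 0xFFFF0000) | (hash32(rest) & 0x0000FFFF)


def prefix_range(shard_key):
    """Hash range covered by a shard key such as 'BigCo!'."""
    top = hash32(shard_key.rstrip("!")) & 0xFFFF0000
    return top, top | 0xFFFF


def shard_for(h):
    for name, (lo, hi) in SHARD_RANGES.items():
        if lo <= h <= hi:
            return name
    raise ValueError(f"hash {h:08x} is not covered by any shard")
```

Because every `BigCo!` document shares the same top 16 bits, a query restricted with `shard.keys=BigCo!` only needs the shard that covers that 16-bit slice of the ring.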
Durable Writes
• Lucene flushes writes to disk on a "commit"
- Uncommitted docs are lost on a crash (at the Lucene level)
• Solr 4 maintains its own transaction log
- Contains uncommitted documents
- Services real-time get requests
- Recovery (log replay on restart)
- Supports distributed "peer sync"
• Writes forwarded to multiple shard replicas
- A replica can go away forever without collection data loss
- A replica can do a fast "peer sync" if it's only slightly out of date
- A replica can do a full index replication (copy) from a leader.
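The role of the transaction log can be illustrated with a toy index. This is a simplified sketch, not Solr's implementation: `ToyIndex` is hypothetical, the log is a plain list assumed to survive a crash, and clearing the log on commit is a simplification.

```python
class ToyIndex:
    """Hypothetical index with a transaction log for uncommitted documents."""

    def __init__(self):
        self.committed = {}  # stands in for Lucene segments flushed on commit
        self.pending = {}    # in-memory, uncommitted documents
        self.tlog = []       # transaction log; assumed to survive a crash

    def add(self, doc):
        self.tlog.append(dict(doc))  # log first, then buffer in memory
        self.pending[doc["id"]] = dict(doc)

    def realtime_get(self, doc_id):
        # Uncommitted documents are visible to real-time get.
        if doc_id in self.pending:
            return self.pending[doc_id]
        return self.committed.get(doc_id)

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()
        self.tlog.clear()    # simplification: the log is dropped after a commit

    def crash(self):
        self.pending.clear()  # in-memory state is lost

    def recover(self):
        for doc in self.tlog:  # log replay on restart
            self.pending[doc["id"]] = dict(doc)
```

Without the log, a document added after the last commit would disappear on a crash; with it, `recover()` restores the document before any new commit happens.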
Collections API
• Create a new document collection
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=3
• Actions: CREATE, DELETE, ALIAS, SPLITSHARD, DELETESHARD, RELOAD
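The CREATE request is an ordinary URL with query parameters, so it can be built with the standard library. A small sketch; `collections_api_url` is a hypothetical helper, and only the endpoint path and parameter names are taken from the slide.

```python
from urllib.parse import urlencode


def collections_api_url(base, action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{base}/admin/collections?{query}"


create_url = collections_api_url(
    "http://localhost:8983/solr",
    "CREATE",
    name="mycollection",
    numShards=4,
    replicationFactor=3,
)
```

The same helper covers the other listed actions by changing `action` and the keyword arguments.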
Solr 4.3: Seamless Online Shard Splitting
[Diagram: Shard1, Shard2 and Shard3, each with a leader and a replica; an update arrives while Shard2 is split into Shard2_0 and Shard2_1]
1. http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=Shard2
2. New sub-shards are created in the "construction" state
3. Leader starts forwarding applicable updates, which are buffered by the sub-shards
4. Leader index is split and installed on the sub-shards
5. Sub-shards apply buffered updates, then become "active" leaders, and the old shard becomes "inactive"
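Steps 2 to 5 can be walked through with a toy model. This is a sketch under stated assumptions: the parent's hash range is split at its midpoint, documents are keyed directly by their hash, and `SubShard`, `split_range` and `split_shard` are hypothetical names, not Solr internals.

```python
def split_range(lo, hi):
    """Split an inclusive hash range at its midpoint (assumed split policy)."""
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid + 1, hi)


class SubShard:
    def __init__(self, name, hash_range):
        self.name = name
        self.range = hash_range
        self.state = "construction"  # step 2
        self.buffer = []
        self.docs = {}


def split_shard(name, hash_range, parent_docs, updates_during_split):
    """parent_docs maps hash -> doc; updates_during_split is a list of (hash, doc)."""
    subs = [SubShard(f"{name}_{i}", r)
            for i, r in enumerate(split_range(*hash_range))]

    def owner(h):
        return next(s for s in subs if s.range[0] <= h <= s.range[1])

    for h, doc in updates_during_split:  # step 3: buffer forwarded updates
        owner(h).buffer.append((h, doc))
    for h, doc in parent_docs.items():   # step 4: split the parent index
        owner(h).docs[h] = doc
    for sub in subs:                     # step 5: replay buffer, become active
        for h, doc in sub.buffer:
            sub.docs[h] = doc
        sub.buffer.clear()
        sub.state = "active"
    return subs
```

Buffering before the index copy is what keeps the split online: updates that arrive mid-split are replayed on top of the copied data rather than lost.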
Solr 4.4: Schemaless
• "Schemaless" normally means that the client(s) have an implicit schema.
• "No Schema" is impossible for anything based on Lucene
- A field must be indexed the same way across documents
• Dynamic fields: convention over configuration
- Only pre-define types of fields, not fields themselves
- No guessing. Any field name ending in _i is an integer
• "Guessed Schema" or "Type Guessing"
- For previously unknown fields, guess using JSON type as a hint
- Coming soon (4.4?) based on the Dynamic Schema work
• Many disadvantages to guessing
- Lose ability to catch field naming errors
- Can't optimize based on types
- Guessing incorrectly means having to start over
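The difference between dynamic fields and type guessing can be shown in a short sketch. Assumptions: only the `_i` suffix comes from the slide, the other suffixes are illustrative, and `infer_schema` is a hypothetical helper rather than anything Solr exposes.

```python
import json

# Suffix -> type. Only "_i" (integer) comes from the slide; the rest are assumed.
DYNAMIC_SUFFIXES = {"_i": "int", "_s": "string", "_f": "float", "_b": "boolean"}


def dynamic_field_type(name):
    """Convention over configuration: the field name alone decides the type."""
    for suffix, field_type in DYNAMIC_SUFFIXES.items():
        if name.endswith(suffix):
            return field_type
    return None


def guess_type(value):
    """Guess a field type using the JSON type as a hint."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    return "string"


def infer_schema(json_text, schema=None):
    schema = dict(schema or {})
    for name, value in json.loads(json_text).items():
        if name in schema:
            continue  # a field must keep one type across documents
        schema[name] = dynamic_field_type(name) or guess_type(value)
    return schema
```

Feeding `{"rating": 4}` first and `{"rating": 4.5}` later leaves `rating` typed as an integer, which illustrates the "guessing incorrectly means having to start over" point; a misspelled field name would likewise be accepted silently as a new field.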
Bangalore Apache Lucene/Solr Meetup
• 1 meetup already
• Almost 150 members
• Another one coming up soon…
• Join us at: http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/
Twitter: @anshumgupta
LinkedIn: http://www.linkedin.com/in/anshumgupta
Blog: http://www.anshumgupta.net
Thanks!



Editor's Notes

  1. You can see the range of any shard in clusterstate.json. Hashing based on the "id" only has some advantages over hashing based on a different field: clients can be more generic and need not know or care what addressing scheme is being used when dealing with individual documents. The "id" always fully defines where a document lives. This enables highly scalable multi-tenanted applications.