© Rocana, Inc. All Rights Reserved. | 1
Joey Echeverria, Platform Technical Lead - @fwiffo
Data Day Seattle 2016
Real-time Search on Terabytes of Data Per Day
Lessons Learned
© Rocana, Inc. All Rights Reserved. | 2
Joey
• Where I work: Rocana – Platform Technical Lead
• Where I used to work: Cloudera (’11-’15), NSA
• Distributed systems, security, data processing, big data
© Rocana, Inc. All Rights Reserved. | 3
© Rocana, Inc. All Rights Reserved. | 4
Context
• We built a system for large-scale, real-time collection, processing, and analysis of event-oriented machine data
• On-prem or in the cloud, but not SaaS
• Supportability is a big deal for us
• Predictability of performance under load and failures
• Ease of configuration and operation
• Behavior in wacky environments
• All of our decisions are informed by this - YMMV
© Rocana, Inc. All Rights Reserved. | 5
What I mean by “scale”
• Typical: 10s of TB of new data per day
• Average event size ~200-500 bytes
• 20 TB per day
• @200 bytes = 1.2M events / second, ~109.9B events / day, ~40.1T events / year
• @500 bytes = 509K events / second, ~43.9B events / day, ~16T events / year
• Retaining years of data online for query
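These figures are simple arithmetic on the daily volume. A quick back-of-the-envelope sketch in Java (assuming the "20 TB" is the binary 20 TiB that the published numbers line up with; the two event sizes are just the endpoints of the range above):

public class EventRates {
  public static void main(String[] args) {
    double bytesPerDay = 20.0 * 1024 * 1024 * 1024 * 1024; // 20 TiB/day
    for (int eventSize : new int[] {200, 500}) {           // bytes per event
      double eventsPerDay = bytesPerDay / eventSize;
      double eventsPerSecond = eventsPerDay / 86_400;
      double eventsPerYear = eventsPerDay * 365;
      System.out.printf("@%d bytes: %.2fM events/sec, %.1fB events/day, %.1fT events/year%n",
          eventSize, eventsPerSecond / 1e6, eventsPerDay / 1e9, eventsPerYear / 1e12);
    }
  }
}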
© Rocana, Inc. All Rights Reserved. | 6
General purpose search – the good parts
• We originally built against SolrCloud (but most of this goes for Elasticsearch too)
• Amazing feature set for general purpose search
• Good support for moderate scale
• Excellent at
• Content search – news sites, document repositories
• Finite size datasets – product catalogs, job postings, things you prune
• Low(er) cardinality datasets that (mostly) fit in memory
© Rocana, Inc. All Rights Reserved. | 7
Problems with general purpose search systems
• Fixed shard allocation models – always N partitions
• Multi-level and semantic partitioning is painful without building your own macro query planner
• All shards open all the time; poor resource control for high retention
• APIs are record-at-a-time focused for NRT indexing; poor ingest performance (aka: please stop making everything REST!)
• Ingest concurrency is wonky
• High write amplification on data we know won’t change
• Other smaller stuff…
© Rocana, Inc. All Rights Reserved. | 8
“Well actually…”
Plenty of ways to push general purpose systems
(We tried many of them)
• Using multiple collections as partitions, macro query planning
• Running multiple JVMs per node for better utilization
• Pushing historical searches into another system
• Building weirdo caches of things
At some point the cost of hacking outweighed the cost of building
© Rocana, Inc. All Rights Reserved. | 9
Warning!
• This is not a condemnation of general purpose search systems!
• Unless the sky is falling, use one of those systems
© Rocana, Inc. All Rights Reserved. | 10
We built a thing: Rocana Search
High cardinality, low latency, parallel search system for time-oriented events
© Rocana, Inc. All Rights Reserved. | 11
Key Goals for Rocana Search
• Higher indexing throughput per node than Solr for time-oriented event data
• Scale horizontally better than Solr
• Support an arbitrary number of dynamically created partitions
• Arbitrarily large amounts of indexed data on disk
• All data queryable without wasting resources for infrequently used data
• Ability to add/remove Search nodes dynamically without any manual restarts or rebalances
© Rocana, Inc. All Rights Reserved. | 12
Some Key Features of Rocana Search
• Fully parallelized ingest and query, built for large clusters
• Every node is an indexer
[Diagram: four Hadoop nodes, each running a Rocana Search indexer, all consuming from Kafka]
© Rocana, Inc. All Rights Reserved. | 13
Some Key Features of Rocana Search
• Every node is a query coordinator and executor
[Diagram: a query client connected to four Rocana Search nodes, each running a coordinator and an executor]
© Rocana, Inc. All Rights Reserved. | 14
Architecture
(A single node)
[Diagram: a single Rocana Search (RS) node with index management, metadata, coordinator, executor, and Lucene index components; it stores indexes in HDFS, coordinates via ZooKeeper, and consumes from Kafka, which is fed by data producers; queries arrive from a query client]
© Rocana, Inc. All Rights Reserved. | 15
Sharding Model: datasets, partitions, and slices
• A search dataset is split into partitions by a partition strategy
• Think: “By year, month, day”
• Partitioning invisible to queries (e.g. `time:[x TO y] AND host:z` works normally)
• Partitions are divided into slices to support lock-free parallel writes
• Think: “This day has 20 slices, each of which is independent for write”
• Number of slices == Kafka partitions
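To make the split concrete, here is a minimal sketch of what a time-based partition strategy and slice routing could look like. The interface, class, and method names are illustrative assumptions, not Rocana Search's actual API:

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical partition strategy: route events to a partition by event time
// ("by year, month, day") and to a slice by the Kafka partition they arrived on.
interface PartitionStrategy {
  String partitionFor(long eventTimestampMillis);
}

class DailyPartitionStrategy implements PartitionStrategy {
  private static final DateTimeFormatter DAY =
      DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);

  @Override
  public String partitionFor(long eventTimestampMillis) {
    return DAY.format(Instant.ofEpochMilli(eventTimestampMillis)); // e.g. "2016/01/01"
  }
}

class SliceRouter {
  // Number of slices == number of Kafka partitions, so the slice is simply
  // the Kafka partition the event was consumed from.
  int sliceFor(int kafkaPartition) {
    return kafkaPartition;
  }
}

Queries never name partitions; the coordinator maps time predicates onto them, as the query section below describes.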
© Rocana, Inc. All Rights Reserved. | 16
Datasets, partitions, and slices
[Diagram: dataset "events" contains partitions "2016/01/01" and "2016/01/02", each divided into slices 0, 1, 2, … N]
© Rocana, Inc. All Rights Reserved. | 17
From events to partitions to slices
[Diagram: events from Kafka topic "events" (partitions KP 0 and KP 1) are routed by timestamp: Event 1 (2016/01/01) and Event 2 (2016/01/01) land in partition 2016/01/01, Event 3 (2016/01/02) lands in partition 2016/01/02, each written to the slice (Slice 0 or Slice 1) matching its Kafka partition]
© Rocana, Inc. All Rights Reserved. | 18
Assigning slices to nodes
[Diagram: Kafka topic "events" has partitions KP 0 through KP 3; Node 1 owns slices 0 and 2 of every date partition (2016/01/01 through 2016/01/04), while Node 2 owns slices 1 and 3 of the same partitions]
© Rocana, Inc. All Rights Reserved. | 19
The write path
• One of the search nodes is the exclusive owner of KP 0 and KP 1
• Consume a batch of events
• Use the partition strategy to figure out which RS partition each event belongs to
• Kafka messages carry their partition, so we know the slice
• Each event is written to the proper partition/slice
• Eventually the indexes are committed
• If the partition or slice is new, the metadata service is informed
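A minimal sketch of that consume-and-index loop, reusing the hypothetical DailyPartitionStrategy from the sharding section; the Kafka and Lucene calls are real, but the field names, paths, and overall structure are illustrative, not the actual Rocana Search writer:

import java.io.IOException;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class SliceWriter {
  private final Map<String, IndexWriter> writers = new HashMap<>();
  private final DailyPartitionStrategy strategy = new DailyPartitionStrategy();

  public void run(KafkaConsumer<String, byte[]> consumer) throws IOException {
    // This node is the exclusive owner of Kafka partitions 0 and 1.
    consumer.assign(Arrays.asList(new TopicPartition("events", 0), new TopicPartition("events", 1)));
    while (true) {
      ConsumerRecords<String, byte[]> batch = consumer.poll(Duration.ofMillis(500));
      for (ConsumerRecord<String, byte[]> record : batch) {
        String partition = strategy.partitionFor(record.timestamp()); // e.g. "2016/01/01"
        int slice = record.partition();                               // slice == Kafka partition
        Document doc = new Document();
        doc.add(new LongPoint("time", record.timestamp()));
        doc.add(new StoredField("body", record.value()));
        writerFor(partition, slice).addDocument(doc);
      }
      // Commits (and offset checkpoints) can be infrequent: Kafka is the
      // reliable log, so anything not yet committed can simply be replayed.
      for (IndexWriter writer : writers.values()) {
        writer.commit();
      }
    }
  }

  private IndexWriter writerFor(String partition, int slice) throws IOException {
    String key = partition + "/" + slice;
    IndexWriter writer = writers.get(key);
    if (writer == null) {
      // New partition/slice: open a Lucene index for it; the real system would
      // also tell the metadata service that it now exists.
      writer = new IndexWriter(
          FSDirectory.open(Paths.get("/data/search", partition, Integer.toString(slice))),
          new IndexWriterConfig(new StandardAnalyzer()));
      writers.put(key, writer);
    }
    return writer;
  }
}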
© Rocana, Inc. All Rights Reserved. | 20
Query
• Queries submitted to coordinator via RPC
• Coordinator parses query and aggressively prunes partitions to search by analyzing predicates
• Coordinator schedules and monitors fragments, merges results, responds to client
• Fragments are submitted to executors for processing
• Executors search exactly what they’re told, stream to coordinator
• Fragment is generated for every slice that may contain data
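A rough sketch of the pruning and fragment-generation step, again using the hypothetical day-granularity partitions from the sharding section; the types and method names are illustrative, not the actual Rocana Search planner:

import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner: a predicate like `time:[x TO y]` lets the coordinator
// prune to the day partitions that overlap the range, then emit one fragment
// per slice of each surviving partition for the executors to search.
class FragmentPlanner {
  record Fragment(String partition, int slice) {}

  private final DailyPartitionStrategy strategy = new DailyPartitionStrategy();

  List<Fragment> plan(Instant from, Instant to, int slicesPerPartition) {
    List<Fragment> fragments = new ArrayList<>();
    for (Instant day = from.truncatedTo(ChronoUnit.DAYS);
         !day.isAfter(to);
         day = day.plus(Duration.ofDays(1))) {
      String partition = strategy.partitionFor(day.toEpochMilli());
      for (int slice = 0; slice < slicesPerPartition; slice++) {
        fragments.add(new Fragment(partition, slice)); // executors search exactly these
      }
    }
    return fragments;
  }
}

The remaining predicates and facets are evaluated by the executors within those fragments, and the coordinator merges the streamed results.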
© Rocana, Inc. All Rights Reserved. | 21
Some benefits of the design
• Search processes are on the same nodes as the HDFS DataNode
• First replica of any event received by search from Kafka is written locally
• Unless nodes fail, all reads are local (HDFS short circuit reads)
• Linux kernel page cache is useful here
• HDFS caching could also be used (not yet doing this)
• Search uses off-heap block cache as well
• In case of failure, any search node can read any index
• HDFS overhead winds up being very small, and we still get its advantages
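Short-circuit reads are plain HDFS client/DataNode configuration rather than anything Rocana-specific; a typical client-side setup looks roughly like this (the socket path is illustrative and must match the DataNode's own configuration):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class LocalReadConfig {
  public static FileSystem open() throws IOException {
    Configuration conf = new Configuration();
    // Let a client co-located with the DataNode read blocks directly from
    // local disk over a domain socket instead of going through the DataNode.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");
    return FileSystem.get(conf);
  }
}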
© Rocana, Inc. All Rights Reserved. | 22
Benchmarks
© Rocana, Inc. All Rights Reserved. | 23
Disclaimer
• Yes we are a vendor
• No you shouldn't take our word
• Ask me about POCs
• Yes I'll show you results anyway
© Rocana, Inc. All Rights Reserved. | 24
Event ingest and indexing
• Most recent data we have for Rocana Search vs. Solr
• G1GC garbage collector
• CDH 5.5.1
• AWS (d2.2xlarge) – 8 CPUs, 60 GB RAM
• 4 data nodes, 8 Kafka partitions
• 32 Solr shards
• 4 day run, ~1.4 TiB indexed on disk
© Rocana, Inc. All Rights Reserved. | 25
Event ingest and indexing
                 Day 1 (325 byte events)    Day 2 (685 byte events)    Day 3 (2.2 KiB events)    Day 4 (685 byte events)
Solr             12,500 eps / 3.9 MiB/s     9,000 eps / 5.9 MiB/s      2,500 eps / 5.4 MiB/s     5,500 eps / 3.6 MiB/s
Rocana Search    46,800 eps / 14.5 MiB/s    43,800 eps / 28.6 MiB/s    15,000 eps / 32 MiB/s     43,800 eps / 28.6 MiB/s
RS vs. Solr      3.7x faster                4.9x faster                6.0x faster               8.0x faster
© Rocana, Inc. All Rights Reserved. | 26
Query
• 6 hour query, facet by 4 fields (host, service, location, event_type_id)
• 2 scenarios
• Query under no active ingest
• Query while ingesting events for the time period being queried
• Queries repeated ~105 times against each system
• Take advantage of OS page cache and Solr/RS block cache
© Rocana, Inc. All Rights Reserved. | 27
Query (No Ingest)
© Rocana, Inc. All Rights Reserved. | 28
Query (No Ingest)
No ingest: 6 hour query, 325 byte events
Percentile    Solr (sec)    Rocana Search (sec)    Comparison
50th          1.8           2.95                   Solr 1.6x faster
90th          1.9           3.12                   Solr 1.6x faster
95th          2.1           3.25                   Solr 1.5x faster
© Rocana, Inc. All Rights Reserved. | 29
Query (Simultaneous Ingest)
© Rocana, Inc. All Rights Reserved. | 30
Query (Simultaneous Ingest)
Simultaneous ingest: 6 hour query, 325 byte events
Percentile    Solr (sec)    Rocana Search (sec)    Comparison
50th          5.7           10.0                   Solr 1.75x faster
75th          9.8           11.5                   Solr 1.17x faster
90th          24.2          12.2                   RS 2x faster
95th          34.3          13.0                   RS 2.6x faster
© Rocana, Inc. All Rights Reserved. | 31
What we’ve really shown
In the context of search, scale means:
• High cardinality: Billions of unique events per day
• High speed ingest: Hundreds of thousands of events per second
• Not having to age data out of the dataset
• Handling large, concurrent queries, while ingesting data
• Fully utilizing modern hardware
These things are very possible
© Rocana, Inc. All Rights Reserved. | 32
Thank you!
Questions?
joey@rocana.com
@fwiffo
The Rocana Search Team:
• Michael Peterson - @quux00
• Mark Tozzi - @not_napoleon
• Brad Cupit - @bradcupit
• Brett Hoerner - @bretthoerner
• Joey Echeverria - @fwiffo
• Eric Sammer - @esammer
• Marvin Anderson
Editor's Notes
• YMMV – not necessarily true for you. Enterprise software means shipping stuff to people. Fine-grained events: logs, user behavior, etc. For everything: solving the problem of "enterprise-wide" ops, so it's everything from everywhere from everyone for all time (until they run out of money for nodes). This isn't a condemnation of general purpose search engines so much as what we had to do for our domain.
• It does most of what you want for most cases most of the time. They've solved some really hard problems. Content search (e.g. news sites, document repos), finite-size datasets (e.g. product catalogs), low-cardinality datasets that fit in memory. Not us.
• Flexible systems with a bevy of full text search features. Moderate and fixed document count: big by historical standards, small by ours. Design reflects these assumptions.
• Fixed sharding at index creation: partition events into N buckets. For long-retention, time-based systems, this isn't how we think. Let's keep it until it's painful. Then we add boxes. When that's painful, we prune. Not sure what that looks like. Repartitioning is not feasible at scale. Partition count should be dynamic.
• Multi-level partitioning is painful without building your own query layer; by range(time), then hash(region) or identity(region).
• All shards are open all the time. Implicit assumption that either you 1. have queries that touch the data evenly or 2. have infinite resources. Recent events are hotter than distant ones, but distant data still needs to be available for query.
• Poor cache control. Recent data should be in cache. Historical scans shouldn't push recent data out of cache.
• APIs are extremely "single record" focused. REST with record-at-a-time is absolutely abysmal for high throughput systems. Batch indexing is not useful. No in between.
• Read replicas are expensive and homogeneous. Ideally we have 3 read replicas for the last N days and 1 for others. Replicas (for performance) should take up space in memory, but not on disk.
• Ingest concurrency tends to be wonky; whole lotta locking going on. Anecdotally, it's difficult to get SolrCloud to light up all cores on a box without running multiple JVMs; something is weird.
• We can get the benefits of NRT indexing speed with fewer writer checkpoints because our ingest pipeline acts as a reliable log. We recover from Kafka based on the last time the writer checkpointed, so we can checkpoint very infrequently if we want.
• We know our data doesn't change, or changes very little, after a certain point, so we can optimize and freeze indexes, reducing write amplification from compactions.
• There are plenty of ways we could have pushed the general purpose systems, and we did. We layered our own partitioning and shard selection on top of SolrCloud with time-based collection round-robining. That got us pretty far, but not far enough. We were starting to do a lot of query rewriting and scheduling. Run multiple JVMs per box: gross, unsupportable. Push historical queries out of search to a system such as Spark. Build weird caches of frequent data sets. At some point, the cost of hacking outweighed the cost of building.