Druid was adopted at Conviva to improve its streaming analytics capabilities. Previously the team relied on Hadoop batch jobs and Spark Streaming, but faced reliability and performance problems such as query timeouts. With Druid they saw improvements through optimizations such as data-locality tuning, an additional query tier, and Kubernetes improvements on Google Cloud, which helped add a "9" to their reliability. Challenges remain around cost, high-cardinality queries, and rapid disaster recovery, which they continue to work on.
3. Highlights
● Streaming Analytics at Conviva
● Before Druid
● Druid Usage and Challenges (Query Timeouts, Reliable Data Ingestion...)
● Solutions
● Outcomes (add a 9 to our reliability...)
12. Before (2019 and earlier)
Data Pipeline
● Hadoop MR Batch Jobs (5m) with rollup (Hourly and Daily)
● Spark Streaming (mini batches)
● Serving from HBase using Phoenix (SQL)
● SQL Query Gateway
Data Center Locations
● On-premise
● Cloud (AWS): Hot Backup
14. Druid since 2019
Data Pipeline
● Started at Druid 3.x
● Native Druid Query Gateway
● Hadoop and Spark for batch ingestion
● Spark Streaming (Real Time, mini batches)
● Akka (Scala) Streaming and Spark Streaming
● Elastic and Imply Clarity for query/log analysis
Data Center Locations
● On-premise
● Amazon Web Services (AWS)
● Google Cloud Platform (GCP)
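The batch side of the pipeline above loads data into Druid through native batch ingestion specs. As a minimal sketch of what such a spec looks like (the `session_metrics` datasource, column names, and input path are all hypothetical, not from the talk):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "session_metrics",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["customer", "asset", "cdn"] },
      "granularitySpec": {
        "segmentGranularity": "hour",
        "queryGranularity": "minute",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/data/batch", "filter": "*.json" },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```

Note `"rollup": true` with a coarser `queryGranularity`: this is the same pre-aggregation idea as the earlier Hadoop MR rollup jobs, but performed at ingestion time by Druid.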
18. Analytics over Query Logs
● Query Start/End Timestamps
● Run Times
● Entity Distribution
● Time-out Distribution
Detailed Study of Query Access Patterns, Timestamps, Time-outs, etc.
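The study above derives run-time and time-out distributions from query logs. A minimal sketch of that kind of analysis, assuming a hypothetical log record shape with `start`, `end`, and `timed_out` fields (the field names and sample values are illustrative, not Conviva's actual schema):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

def run_time_seconds(start: str, end: str) -> float:
    """Query run time from its start/end timestamps (ISO 8601, no offset)."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()

def timeout_rate(records) -> float:
    """Fraction of logged queries that timed out."""
    return sum(1 for r in records if r["timed_out"]) / len(records)

# Hypothetical log records for illustration.
logs = [
    {"start": "2021-05-01T10:00:00", "end": "2021-05-01T10:00:02", "timed_out": False},
    {"start": "2021-05-01T10:00:05", "end": "2021-05-01T10:00:35", "timed_out": True},
]

run_times = [run_time_seconds(r["start"], r["end"]) for r in logs]
```

From `run_times` one can then bucket queries into a run-time histogram and compare slow queries against the configured timeout, which is the kind of access-pattern study the slide describes.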
19. Challenges/Solutions (Reliability)
Reliability Issues
● Query Timeouts
● Reliable Ingestion & Query Speed Balance
● Query Performance (High Avg Time)
● Query Runtime Fluctuations
● Random Ingestion Task Failures
Solutions
● Druid Configuration & Tuning
● Data Locality, Dynamic Partitions, Multi Tenancy
● Tuning Brokers; Extra Tier 3 for recent Data and Queries
● Query/Context Updates
● Resource and Configuration Adjustment
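The "Query/Context Updates" item refers to Druid's per-query context parameters, which control timeouts, priority, and caching without cluster-wide changes. A minimal sketch of a native query carrying such a context (the datasource, interval, and specific values are illustrative assumptions, not the talk's actual settings):

```json
{
  "queryType": "timeseries",
  "dataSource": "session_metrics",
  "intervals": ["2021-05-01/2021-05-02"],
  "granularity": "minute",
  "aggregations": [{ "type": "longSum", "name": "plays", "fieldName": "plays" }],
  "context": {
    "timeout": 60000,
    "priority": 10,
    "useCache": true,
    "populateCache": true
  }
}
```

Raising `priority` for interactive dashboards and bounding `timeout` per query is one way to balance reliable ingestion against query speed, as listed under Reliability Issues above.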
20. Challenges/Solutions
Solutions
● Resource Optimizations; On-prem+GCP
● Created an in-house dimensional index
● Using Native Query instead of Druid SQL
● K8s+Helm Chart Improvements on GCP
● Supervisor Optimizations and disabling Historicals
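"Using Native Query instead of Druid SQL" means sending Druid's JSON query language directly to the `/druid/v2` endpoint, bypassing SQL planning overhead. As a sketch, a SQL query like `SELECT customer, COUNT(*) FROM session_metrics GROUP BY customer` could instead be issued natively (the `session_metrics` datasource and column names are hypothetical):

```json
{
  "queryType": "groupBy",
  "dataSource": "session_metrics",
  "intervals": ["2021-05-01/2021-05-02"],
  "granularity": "all",
  "dimensions": ["customer"],
  "aggregations": [{ "type": "count", "name": "sessions" }]
}
```

Native queries also expose engine-specific query types (e.g. `topN`, `timeseries`) that can be cheaper than their SQL equivalents for the same result.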
Challenges
● Cost (especially on Cloud)
● Querying High Cardinality Measures
● SQL Metadata Performance due to Wide Rows
● Rapid Disaster Recovery
● Real Time Cluster