1 © Hortonworks Inc. 2011–2018. All rights reserved
An Introduction to Druid
Nishant Bangarwa
Software Developer
Agenda
History and Motivation
Introduction
Data Storage Format
Druid Architecture – Indexing and Querying Data
Druid In Production
Recent Improvements
HISTORY
• Druid open sourced in late 2012
• Initial use case
• Power an ad-tech analytics product
• Requirements
• Query any combination of metrics and dimensions
• Scalability: trillions of events/day
• Real-time: data freshness
• Streaming ingestion
• Interactive: low-latency queries
How big is the initial use case?
MOTIVATION
• Business intelligence queries
• Arbitrary slicing and dicing of data
• Interactive real-time visualizations on complex data streams
• Answer BI questions
• How many unique male visitors visited my website last month?
• How many products were sold last quarter, broken down by demographic and product category?
• Not interested in dumping the entire dataset
Introduction
What is Druid?
• Column-oriented distributed datastore
• Sub-second query times
• Realtime streaming ingestion
• Arbitrary slicing and dicing of data
• Automatic data summarization
• Approximate algorithms (HyperLogLog, theta sketches)
• Scalable to petabytes of data
• Highly available
Companies Using Druid
Druid Architecture
Node Types
• Realtime Nodes
• Historical Nodes
• Broker Nodes
• Coordinator Nodes
Druid Architecture
(Diagram: streaming data is ingested by realtime index tasks, which build segments and hand them off to historical nodes; batch data is indexed directly into historical nodes; broker nodes serve queries over both.)
Druid Architecture
(Diagram, continued: queries arrive at broker nodes, which fan out to realtime index tasks and historical nodes; coordinator nodes manage segment load on historical nodes, coordinating through ZooKeeper and a metadata store; realtime tasks hand segments off to historical nodes.)
Storage Format
Druid: Segments
• Data in Druid is stored in segment files
• Partitioned by time
• Ideally, each segment file is smaller than 1 GB
• If files grow larger, partition by smaller time intervals
(Diagram: a timeline of daily segments — Segment 1: Monday, Segment 2: Tuesday, Segment 3: Wednesday, Segment 4: Thursday, Segments 5_1 and 5_2: Friday, split into two partitions.)
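The time partitioning described above can be sketched in a few lines of Python — a toy model of assigning events to daily time chunks, not Druid's actual implementation (the event data is from the example dataset, the granularity is illustrative):

```python
from datetime import datetime

def segment_key(event_ts: str) -> str:
    """Map an ISO-8601 event timestamp to the daily time chunk it belongs to."""
    ts = datetime.fromisoformat(event_ts.replace("Z", "+00:00"))
    return ts.strftime("%Y-%m-%d")

events = ["2011-01-01T00:01:35Z", "2011-01-01T23:59:59Z", "2011-01-02T00:08:35Z"]
chunks = {}
for e in events:
    chunks.setdefault(segment_key(e), []).append(e)
# chunks holds one entry per day: two events on 2011-01-01, one on 2011-01-02
```

Each chunk would become one (or more) segment files; a chunk that grows too large is split into numbered partitions like 5_1 and 5_2 above.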
Example Wikipedia Edit Dataset
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:03:53Z Justin Bieber en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
Timestamp Dimensions Metrics
Data Rollup
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:03:53Z Justin Bieber en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
timestamp page language city country count sum_added sum_deleted min_added max_added ….
2011-01-01T00:00:00Z Justin Bieber en SF USA 3 57 172 10 32
2011-01-01T00:00:00Z Ke$ha en Calgary CA 2 60 186 17 43
2011-01-02T00:00:00Z Selena Gomes en Calgary CA 1 12 53 12 12
Rollup by hour
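The hourly rollup above amounts to a group-by on the truncated timestamp plus the dimensions. A toy sketch of what Druid does at ingest time (column names follow the example table; only a subset of dimensions and metrics is modeled):

```python
from collections import defaultdict

# Raw rows: (timestamp, page, added, deleted)
rows = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", 10, 65),
    ("2011-01-01T00:03:53Z", "Justin Bieber", 15, 62),
    ("2011-01-01T00:04:51Z", "Justin Bieber", 32, 45),
    ("2011-01-01T00:05:35Z", "Ke$ha", 17, 87),
    ("2011-01-01T00:06:41Z", "Ke$ha", 43, 99),
    ("2011-01-02T00:08:35Z", "Selena Gomes", 12, 53),
]

rollup = defaultdict(lambda: {"count": 0, "sum_added": 0, "sum_deleted": 0})
for ts, page, added, deleted in rows:
    hour = ts[:13] + ":00:00Z"  # truncate the timestamp to the hour
    agg = rollup[(hour, page)]
    agg["count"] += 1
    agg["sum_added"] += added
    agg["sum_deleted"] += deleted
```

Six raw rows collapse into three summarized rows, matching the second table above.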
Dictionary Encoding
• Create and store an integer ID for each distinct value
• e.g. page column
⬢ Values - Justin Bieber, Ke$ha, Selena Gomes
⬢ Encoding - Justin Bieber : 0, Ke$ha: 1, Selena Gomes: 2
⬢ Column Data - [0 0 0 1 1 2]
• city column - [0 0 0 1 1 1]
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:03:53Z Justin Bieber en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
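The encoding above can be sketched in a few lines — an illustrative model, not Druid's actual on-disk layout:

```python
def dictionary_encode(values):
    """Assign an integer ID to each distinct value in order of first appearance,
    and return the ID dictionary plus the encoded column."""
    ids, encoded = {}, []
    for v in values:
        if v not in ids:
            ids[v] = len(ids)
        encoded.append(ids[v])
    return ids, encoded

pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 2 + ["Selena Gomes"]
ids, column = dictionary_encode(pages)
# ids == {"Justin Bieber": 0, "Ke$ha": 1, "Selena Gomes": 2}
# column == [0, 0, 0, 1, 1, 2]
```

Storing small integers instead of repeated strings is what makes the column compact and fast to scan.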
Bitmap Indices
• Store Bitmap Indices for each value
⬢ Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0]
⬢ Ke$ha -> [3, 4] -> [0 0 0 1 1 0]
⬢ Selena Gomes -> [5] -> [0 0 0 0 0 1]
• Queries
⬢ Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]
⬢ language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]
• Indexes compressed with Concise or Roaring encoding
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
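The boolean filters above reduce to bitwise operations over the indices. A toy sketch (real Druid indexes are Concise- or Roaring-compressed bitmaps, not Python lists):

```python
def bitmap(value, column):
    """One bit per row: 1 where the row equals the value."""
    return [1 if v == value else 0 for v in column]

def bm_or(a, b):
    return [x | y for x, y in zip(a, b)]

def bm_and(a, b):
    return [x & y for x, y in zip(a, b)]

pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 2 + ["Selena Gomes"]
languages = ["en"] * 6
countries = ["USA"] * 3 + ["CA"] * 3

# page = Justin Bieber OR page = Ke$ha
jb_or_kesha = bm_or(bitmap("Justin Bieber", pages), bitmap("Ke$ha", pages))
# language = en AND country = CA
en_and_ca = bm_and(bitmap("en", languages), bitmap("CA", countries))
```

The results match the worked examples on this slide: `[1 1 1 1 1 0]` and `[0 0 0 1 1 1]`.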
Approximate Sketch Columns
timestamp page userid language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber user1111111 en SF USA 10 65
2011-01-01T00:03:53Z Justin Bieber user1111111 en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber user2222222 en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha user3333333 en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha user4444444 en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes user1111111 en Calgary CA 12 53
timestamp page language city country count sum_added sum_deleted min_added userid_sketch ….
2011-01-01T00:00:00Z Justin Bieber en SF USA 3 57 172 10 {sketch}
2011-01-01T00:00:00Z Ke$ha en Calgary CA 2 60 186 17 {sketch}
2011-01-02T00:00:00Z Selena Gomes en Calgary CA 1 12 53 12 {sketch}
Rollup by hour
Approximate Sketch Columns
• Better rollup for high-cardinality columns, e.g. userid
• Reduced storage size
• Use cases
• Fast approximate distinct counts
• Approximate histograms
• Funnel/retention analysis
• Limitations
• Exact counts are not possible
• Cannot filter on individual row values
Indexing Data
Indexing Service
• Indexing is performed by
• Overlord
• Middle Managers
• Peons
• Middle Managers spawn peons, which run ingestion tasks
• Each peon runs one task at a time
• A task definition specifies which task to run and its properties
Streaming Ingestion : Realtime Index Tasks
• Ability to ingest streams of data
• Stores data in a write-optimized structure
• Periodically converts the write-optimized structure to read-optimized segments
• Events are queryable as soon as they are ingested
• Supports both push- and pull-based ingestion
Streaming Ingestion : Tranquility
• Helper library for coordinating streaming ingestion
• Simple API to send events to Druid
• Transparently manages
• Realtime index task creation
• Partitioning and replication
• Schema evolution
• Can be used with your favourite ETL framework, e.g. Flink, NiFi, Samza, Spark, Storm
• At-least-once ingestion
Kafka Indexing Service (experimental)
• Supports exactly-once ingestion
• Messages pulled by Kafka index tasks
• Each Kafka index task consumes from a set of partitions with specific start and end offsets
• Each message is verified to ensure it is in sequence
• Kafka offsets and the corresponding segments are persisted atomically in the same metadata transaction
• Kafka supervisor
• Embedded inside the overlord
• Manages Kafka index tasks
• Retries failed tasks
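The atomic offset-plus-segment commit is the key to exactly-once ingestion: it can be modeled with any transactional store. A toy sketch using SQLite standing in for Druid's metadata store (table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_offset INTEGER)")
conn.execute("CREATE TABLE segments (segment_id TEXT PRIMARY KEY)")

def commit_handoff(conn, part, next_offset, segment_id):
    """Persist the new segment and its ending Kafka offset in ONE transaction,
    so a crash can never record a segment without its offset (or vice versa)."""
    with conn:  # the sqlite3 connection context manager wraps a single transaction
        conn.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)", (part, next_offset))
        conn.execute("INSERT INTO segments VALUES (?)", (segment_id,))

commit_handoff(conn, 0, 1500, "wikipedia_2011-01-01_task1")
```

On restart, a task resumes from the stored offset; since the segment and offset were committed together, no message is counted twice or lost.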
Batch Ingestion
• HadoopIndexTask
• Peon launches a Hadoop MR job
• Mappers read data
• Reducers create Druid segment files
• Index Task
• Runs in a single JVM, i.e. a peon
• Suitable for small data sizes (< 1 GB)
• Integrations with Apache Hive and Spark for batch ingestion
Querying Data
Querying Data from Druid
• Druid supports
• JSON queries over HTTP
• Built-in SQL (experimental)
• Querying libraries available for
• Python
• R
• Ruby
• JavaScript
• Clojure
• PHP
• Multiple open source UI tools
JSON Over HTTP
• HTTP Rest API
• Queries and results expressed in JSON
• Multiple Query Types
• Time Boundary
• Timeseries
• TopN
• GroupBy
• Select
• Segment Metadata
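A minimal timeseries query body looks like the following — the field names follow Druid's JSON query API, while the dataSource, interval, and aggregator are illustrative:

```python
import json

query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "day",
    "aggregations": [
        # sum of the "added" metric per day
        {"type": "longSum", "name": "added", "fieldName": "added"}
    ],
    "intervals": ["2011-01-01/2011-01-03"],
}
body = json.dumps(query)
# POST the body to a broker (by default http://<broker-host>:8082/druid/v2)
# with Content-Type: application/json; results come back as JSON rows.
```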
Built-in SQL (experimental)
• Apache Calcite based parser and planner
• Ability to connect Druid to any BI tool that supports JDBC
• SQL via JSON over HTTP
• Supports approximate queries
• APPROX_COUNT_DISTINCT(col)
• Ability to do fast approximate TopN queries
• APPROX_QUANTILE(column, probability)
Integrated with multiple Open Source UI tools
• Superset
• Developed at Airbnb
• In the Apache Incubator since May 2017
• Grafana – Druid plugin
• Metabase
• With built-in SQL, connect with any BI tool supporting JDBC
Druid in Production
Druid in Production
 Is Druid suitable for my use case?
 Will Druid meet my performance requirements at scale?
 How complex is it to operate and manage a Druid cluster?
 How do I monitor a Druid cluster?
 High availability?
 How do I upgrade a Druid cluster without downtime?
 Security?
Suitable Use Cases
• Powering Interactive user facing applications
• Arbitrary slicing and dicing of large datasets
• User behavior analysis
• measuring distinct counts
• retention analysis
• funnel analysis
• A/B testing
• Exploratory analytics/root cause analysis
• Not interested in dumping entire dataset
Performance and Scalability : Fast Facts
• Most events per day: 300 billion events/day (Metamarkets)
• Most computed metrics: 1 billion metrics/min (Jolata)
• Largest cluster: 200 nodes (Metamarkets)
• Largest hourly ingestion: 2 TB per hour (Netflix)
Performance Numbers
• Query latency
• average – 500 ms
• 90%ile < 1 sec
• 95%ile < 5 sec
• 99%ile < 10 sec
• Query volume
• Thousands of queries per minute
• Benchmarking code
• https://github.com/druid-io/druid-benchmark
Simplified Druid Cluster Management with Ambari
 Install, configure and manage Druid and all external dependencies from Ambari
 Easy to enable HA, Security, Monitoring …
Monitoring a Druid Cluster
• Each Druid Node emits metrics for
• Query performance
• Ingestion Rate
• JVM Health
• Query Cache performance
• System health
• Emitted as JSON objects to a runtime log file or over HTTP to other services
• Emitters available for Ambari Metrics Server, Graphite, StatsD, Kafka
• Easy to implement your own metrics emitter
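A custom emitter only needs to serialize metric events to some sink. Druid emitters are Java extensions; the Python sketch below just models the JSON-lines event shape, and the field names are illustrative:

```python
import io
import json

class JsonLineEmitter:
    """Toy metrics emitter: writes each metric event as one JSON object per line."""

    def __init__(self, sink):
        self.sink = sink

    def emit(self, service, metric, value, dims=None):
        # One flat JSON object per event: service, metric name, value, plus dimensions
        event = {"service": service, "metric": metric, "value": value}
        event.update(dims or {})
        self.sink.write(json.dumps(event) + "\n")

buf = io.StringIO()
emitter = JsonLineEmitter(buf)
emitter.emit("druid/broker", "query/time", 120, {"dataSource": "wikipedia"})
```

Swapping the sink for an HTTP client, a Kafka producer, or a StatsD socket gives the other emitter flavours listed above.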
Monitoring using Ambari Metrics Server
• HDP 2.6.1 contains pre-defined Grafana dashboards
• Health of Druid Nodes
• Ingestion
• Query performance
• Easy to create new dashboards and setup alerts
• Auto configured when both Druid and Ambari Metrics Server are installed
High Availability
• Deploy Coordinator/Overlord on multiple instances
• Leader election via ZooKeeper
• Broker – install multiple brokers
• Use the Druid Router or any load balancer to route queries across brokers
• Realtime index tasks – create redundant tasks
• Historical nodes – create a load rule with replication factor >= 2 (default = 2)
Rolling Upgrades
 Shared-nothing architecture
⬢ Maintains backwards compatibility
⬢ Data redundancy
 Upgrade one Druid component at a time
⬢ No downtime
Security
• Supports authentication via Kerberos / SPNEGO
• Easy wizard-based Kerberos security enablement via Ambari
(Diagram: 1. the user's browser obtains a token from the KDC server via kinit; 2. the token is presented to Druid.)
Summary
• Easy installation and management via Ambari
• Real-time
• Ingestion latency < seconds
• Query latency < seconds
• Arbitrarily slice and dice big data like a ninja
• No more pre-canned drill-downs
• Query with more fine-grained granularity
• High availability and rolling deployment capabilities
• Secure and production ready
• Vibrant and active community
• Available as Tech Preview in HDP 2.6.1
Useful Resources
• Druid website – http://druid.io
• Druid User Group – users@druid.incubator.apache.org
• Druid Dev Group – dev@druid.incubator.apache.org
Thank you
Twitter - @NishantBangarwa
Email - nbangarwa@hortonworks.com
Questions?
Extending Core Druid
• Plugin-based architecture
• Leverages Guice to load extensions at runtime
• Extensions can
• Add a new deep storage implementation
• Add a new Firehose for ingestion
• Add aggregators
• Add complex metrics
• Add new query types
• Add new Jersey resources
• Bundle your extension with all the other Druid extensions
Performance : Approximate Algorithms
• Ability to store approximate data sketches for high-cardinality columns, e.g. userid
• Reduced storage size
• Use cases
• Fast approximate distinct counts
• Approximate Top-K queries
• Approximate histograms
• Funnel/retention analysis
• Limitations
• Exact counts are not possible
• Cannot filter on individual row values
Superset
• Python backend
• Flask app builder
• Authentication
• Pandas for rich analytics
• SqlAlchemy for SQL toolkit
• JavaScript frontend
• React, NVD3
• Deep integration with Druid
Superset Rich Dashboarding Capabilities: Treemaps
Superset Rich Dashboarding Capabilities: Sunburst
Superset UI Provides Powerful Visualizations
Rich library of dashboard visualizations:
Basic:
• Bar Charts
• Pie Charts
• Line Charts
Advanced:
• Sankey Diagrams
• Treemaps
• Sunburst
• Heatmaps
And More!
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Último (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

An Introduction to Druid

  • 11. Druid Architecture
    (diagram: streaming data flows into realtime index tasks, which hand segments off to historical nodes; batch data is loaded directly into historical nodes; broker nodes sit in front of both)
  • 12. Druid Architecture (continued)
    (diagram: the same data flow plus the control plane: coordinator nodes, the metadata store and ZooKeeper coordinate the cluster; queries arrive at broker nodes)
  • 13. Storage Format
  • 14. Druid: Segments
    • Data in Druid is stored in segment files.
    • Partitioned by time.
    • Ideally, segment files are each smaller than 1 GB.
    • If files are larger, smaller time partitions are needed.
    (diagram: one segment per day, Monday through Thursday, with Friday split into two shards, Segment 5_1 and Segment 5_2)
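The time partitioning above can be sketched in a few lines of Python. This is an illustration of the idea, not Druid's actual code; the timestamps are made up and day-level granularity is assumed:

```python
from collections import defaultdict

# Bucket events into one "segment" per day (the segment granularity).
events = [
    "2011-01-01T00:01:35Z",
    "2011-01-01T23:59:00Z",
    "2011-01-02T00:08:35Z",
]

segments = defaultdict(list)
for ts in events:
    day = ts[:10]  # truncate the ISO timestamp to the day boundary
    segments[day].append(ts)

# Two segments: 2011-01-01 holds two events, 2011-01-02 holds one.
print({day: len(evts) for day, evts in segments.items()})
```

If one day's bucket grows too large, the same idea applies with a smaller granularity (hour instead of day), which is exactly the trade-off the slide describes.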
  • 15. Example Wikipedia Edit Dataset
    timestamp            | page          | language | city    | country | added | deleted
    2011-01-01T00:01:35Z | Justin Bieber | en       | SF      | USA     |    10 |      65
    2011-01-01T00:03:63Z | Justin Bieber | en       | SF      | USA     |    15 |      62
    2011-01-01T00:04:51Z | Justin Bieber | en       | SF      | USA     |    32 |      45
    2011-01-01T00:05:35Z | Ke$ha         | en       | Calgary | CA      |    17 |      87
    2011-01-01T00:06:41Z | Ke$ha         | en       | Calgary | CA      |    43 |      99
    2011-01-02T00:08:35Z | Selena Gomes  | en       | Calgary | CA      |    12 |      53
    (timestamp is the timestamp column; page, language, city and country are dimensions; added and deleted are metrics)
  • 16. Data Rollup
    The raw rows from slide 15 are rolled up by hour:
    timestamp            | page          | language | city    | country | count | sum_added | sum_deleted | min_added | max_added
    2011-01-01T00:00:00Z | Justin Bieber | en       | SF      | USA     |     3 |        57 |         172 |        10 |        32
    2011-01-01T00:00:00Z | Ke$ha         | en       | Calgary | CA      |     2 |        60 |         186 |        17 |        43
    2011-01-02T00:00:00Z | Selena Gomes  | en       | Calgary | CA      |     1 |        12 |          53 |        12 |        12
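The hourly rollup can be sketched as follows. This is illustrative Python mirroring the slide's rows (only the added metric is aggregated here); Druid performs this summarization at ingestion time:

```python
from collections import defaultdict

# Raw rows: (timestamp, page dimension, "added" metric), as on slide 15.
rows = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", 10),
    ("2011-01-01T00:03:63Z", "Justin Bieber", 15),
    ("2011-01-01T00:04:51Z", "Justin Bieber", 32),
    ("2011-01-01T00:05:35Z", "Ke$ha", 17),
    ("2011-01-01T00:06:41Z", "Ke$ha", 43),
    ("2011-01-02T00:08:35Z", "Selena Gomes", 12),
]

rollup = defaultdict(lambda: {"count": 0, "sum_added": 0,
                              "min_added": None, "max_added": None})
for ts, page, added in rows:
    hour = ts[:13] + ":00:00Z"  # truncate the timestamp to the hour
    agg = rollup[(hour, page)]  # one output row per (hour, dimensions)
    agg["count"] += 1
    agg["sum_added"] += added
    agg["min_added"] = added if agg["min_added"] is None else min(agg["min_added"], added)
    agg["max_added"] = added if agg["max_added"] is None else max(agg["max_added"], added)
```

Six raw rows collapse into three pre-aggregated rows, matching the rolled-up table above.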
  • 17. Dictionary Encoding
    • Create and store an id for each distinct value, e.g. for the page column:
      • Values: Justin Bieber, Ke$ha, Selena Gomes
      • Encoding: Justin Bieber: 0, Ke$ha: 1, Selena Gomes: 2
      • Column data: [0 0 0 1 1 2]
    • city column: [0 0 0 1 1 1]
    (same example dataset as slide 15)
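Dictionary encoding of the page column can be sketched directly; ids are assigned in order of first appearance, reproducing the slide's encoding:

```python
# The page column from the example dataset.
column = ["Justin Bieber", "Justin Bieber", "Justin Bieber",
          "Ke$ha", "Ke$ha", "Selena Gomes"]

dictionary = {}  # value -> integer id
encoded = []     # the column stores only the ids
for value in column:
    if value not in dictionary:
        dictionary[value] = len(dictionary)
    encoded.append(dictionary[value])

# dictionary == {"Justin Bieber": 0, "Ke$ha": 1, "Selena Gomes": 2}
# encoded == [0, 0, 0, 1, 1, 2]
```

Storing small integers instead of repeated strings is what makes the compressed, column-oriented layout (and the bitmap indices on the next slide) cheap.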
  • 18. Bitmap Indices
    • Store a bitmap index for each value:
      • Justin Bieber -> rows [0, 1, 2] -> [1 1 1 0 0 0]
      • Ke$ha -> rows [3, 4] -> [0 0 0 1 1 0]
      • Selena Gomes -> row [5] -> [0 0 0 0 0 1]
    • Queries:
      • Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]
      • language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]
    • Indexes compressed with Concise or Roaring encoding
    (same example dataset as slide 15)
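The bitmap queries above can be reproduced with Python ints used as bitsets (bit i set means row i matches). This only illustrates the boolean algebra; real Druid stores these bitmaps Concise- or Roaring-compressed:

```python
def bitmap(rows):
    """Build a bitset with bit i set for each matching row index."""
    bits = 0
    for r in rows:
        bits |= 1 << r
    return bits

def to_list(bits, n_rows):
    """Expand a bitset to the slide's [1 0 ...] notation."""
    return [(bits >> i) & 1 for i in range(n_rows)]

n = 6
justin = bitmap([0, 1, 2])      # Justin Bieber rows
kesha = bitmap([3, 4])          # Ke$ha rows
en = bitmap([0, 1, 2, 3, 4, 5]) # language = en rows
ca = bitmap([3, 4, 5])          # country = CA rows

print(to_list(justin | kesha, n))  # Justin Bieber OR Ke$ha
print(to_list(en & ca, n))         # language = en AND country = CA
```

A filter over any combination of dimension values thus reduces to cheap bitwise OR/AND over the per-value bitmaps.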
  • 19. Approximate Sketch Columns
    The raw rows now carry a high-cardinality userid dimension (user1111111, user2222222, …). Rolling up by hour replaces the raw userids with a userid_sketch column:
    timestamp            | page          | language | city    | country | count | sum_added | sum_deleted | min_added | userid_sketch
    2011-01-01T00:00:00Z | Justin Bieber | en       | SF      | USA     |     3 |        57 |         172 |        10 | {sketch}
    2011-01-01T00:00:00Z | Ke$ha         | en       | Calgary | CA      |     2 |        60 |         186 |        17 | {sketch}
    2011-01-02T00:00:00Z | Selena Gomes  | en       | Calgary | CA      |     1 |        12 |          53 |        12 | {sketch}
  • 20. Approximate Sketch Columns
    • Better rollup for high-cardinality columns, e.g. userid
    • Reduced storage size
    • Use cases:
      • Fast approximate distinct counts
      • Approximate histograms
      • Funnel/retention analysis
    • Limitations:
      • Exact counts are not possible
      • Cannot filter on individual row values
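The approximate distinct counts that sketch columns enable can be illustrated with a toy k-minimum-values (KMV) estimator, the idea underlying theta sketches: hash every value to [0, 1), keep only the k smallest hashes, and estimate the cardinality from the k-th smallest. This is a sketch of the concept, not Druid's actual sketch implementation:

```python
import hashlib

def kmv_estimate(values, k=64):
    """Toy KMV (theta-sketch-style) distinct-count estimator."""
    hashes = sorted({
        int(hashlib.md5(v.encode()).hexdigest(), 16) / 2**128  # hash -> [0, 1)
        for v in values
    })
    if len(hashes) < k:
        return len(hashes)           # fewer than k distinct values: exact
    return (k - 1) / hashes[k - 1]   # estimate from the k-th smallest hash

# 10,000 distinct userids, each seen three times (repeats don't matter).
users = [f"user{i}" for i in range(10_000)] * 3
estimate = kmv_estimate(users)
# Typically within roughly 1/sqrt(k) (about 12%) of 10,000,
# while storing only k hash values instead of every userid.
print(estimate)
```

The fixed-size state (k hashes) is also why sketches merge cheaply across segments, and why exact counts and per-row filtering are out of reach, as the limitations above note.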
  • 21. Indexing Data
  • 22. Indexing Service
    • Indexing is performed by the Overlord, Middle Managers and Peons
    • Middle Managers spawn peons, which run ingestion tasks
    • Each peon runs one task
    • A task definition specifies which task to run and its properties
  • 23. Streaming Ingestion: Realtime Index Tasks
    • Ability to ingest streams of data
    • Store data in a write-optimized structure
    • Periodically convert the write-optimized structure into read-optimized segments
    • Events are queryable as soon as they are ingested
    • Both push- and pull-based ingestion
  • 24. Streaming Ingestion: Tranquility
    • Helper library for coordinating streaming ingestion
    • Simple API to send events to Druid
    • Transparently manages:
      • Realtime index task creation
      • Partitioning and replication
      • Schema evolution
    • Can be used with your favourite ETL framework, e.g. Flink, NiFi, Samza, Spark, Storm
    • At-least-once ingestion
  • 25. Kafka Indexing Service (experimental)
    • Supports exactly-once ingestion
    • Messages are pulled by Kafka index tasks
    • Each Kafka index task consumes a set of partitions with specific start and end offsets
    • Each message is verified to ensure sequence
    • Kafka offsets and the corresponding segments are persisted atomically in the same metadata transaction
    • Kafka supervisor:
      • Embedded inside the overlord
      • Manages Kafka index tasks
      • Retries failed tasks
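A Kafka supervisor is defined by a JSON spec POSTed to the overlord. The sketch below is heavily trimmed: the topic, broker host and granularities are placeholders, and a real spec also needs a complete dataSchema (parser, metricsSpec) and tuningConfig; consult the Druid docs for the full shape:

```python
import json

# Trimmed-down Kafka supervisor spec (placeholder values throughout).
supervisor_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "wikipedia",
        "granularitySpec": {"segmentGranularity": "HOUR",
                            "queryGranularity": "MINUTE"},
    },
    "ioConfig": {
        "topic": "wikipedia-edits",
        "consumerProperties": {"bootstrap.servers": "kafkabroker:9092"},
        "taskCount": 2,   # reading tasks; partitions are split among them
        "replicas": 1,    # redundant task replicas for availability
    },
}

payload = json.dumps(supervisor_spec)
# Submit to the overlord (not executed here):
#   requests.post("http://overlord:8090/druid/indexer/v1/supervisor",
#                 data=payload, headers={"Content-Type": "application/json"})
print(payload)
```

Once submitted, the supervisor embedded in the overlord spawns and retries the Kafka index tasks described above.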
  • 26. Batch Ingestion
    • HadoopIndexTask
      • Peon launches a Hadoop MR job
      • Mappers read the data; reducers create Druid segment files
    • Index Task
      • Runs in a single JVM, i.e. a peon
      • Suitable for small data sizes (< 1 GB)
    • Integrations with Apache Hive and Spark for batch ingestion
  • 27. Querying Data
  • 28. Querying Data from Druid
    • Druid supports:
      • JSON queries over HTTP
      • Built-in SQL (experimental)
    • Querying libraries available for Python, R, Ruby, JavaScript, Clojure and PHP
    • Multiple open source UI tools
  • 29. JSON over HTTP
    • HTTP REST API
    • Queries and results expressed in JSON
    • Multiple query types: Time Boundary, Timeseries, TopN, GroupBy, Select, Segment Metadata
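A minimal timeseries query over the example dataset might look like this: total "added" per hour for one day. The broker host and port are placeholders:

```python
import json

# Druid native (JSON-over-HTTP) timeseries query.
query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "hour",
    "intervals": ["2011-01-01/2011-01-02"],
    "aggregations": [
        {"type": "longSum", "name": "added", "fieldName": "added"}
    ],
}

body = json.dumps(query)
# POST to the broker (not executed here):
#   requests.post("http://broker:8082/druid/v2/?pretty",
#                 data=body, headers={"Content-Type": "application/json"})
print(body)
```

The other query types listed above (TopN, GroupBy, …) use the same envelope with a different queryType and type-specific fields.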
  • 30. Built-in SQL (experimental)
    • Apache Calcite based parser and planner
    • Ability to connect Druid to any BI tool that supports JDBC
    • SQL via JSON over HTTP
    • Supports approximate queries:
      • APPROX_COUNT_DISTINCT(col)
      • Fast approximate TopN queries
      • APPROX_QUANTILE(column, probability)
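"SQL via JSON over HTTP" means the statement is wrapped in a small JSON object and POSTed to the broker. The example below uses APPROX_COUNT_DISTINCT, which maps to the sketch columns described earlier; the host and data source names are placeholders:

```python
import json

# Approximate unique-user counts per page via Druid SQL.
sql = """
SELECT page, APPROX_COUNT_DISTINCT(userid) AS unique_users
FROM wikipedia
GROUP BY page
"""

body = json.dumps({"query": sql})
# POST to the broker's SQL endpoint (not executed here):
#   requests.post("http://broker:8082/druid/v2/sql",
#                 data=body, headers={"Content-Type": "application/json"})
print(body)
```

The same statements work over JDBC, which is what lets any JDBC-capable BI tool connect.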
  • 31. Integrated with multiple open source UI tools
    • Superset
      • Developed at Airbnb
      • In Apache incubation since May 2017
    • Grafana (Druid plugin)
    • Metabase
    • With built-in SQL, connect any BI tool supporting JDBC
  • 32. Druid in Production
  • 33. Druid in Production
    • Is Druid suitable for my use case?
    • Will Druid meet my performance requirements at scale?
    • How complex is it to operate and manage a Druid cluster?
    • How do I monitor a Druid cluster?
    • High availability?
    • How do I upgrade a Druid cluster without downtime?
    • Security?
  • 34. Suitable Use Cases
    • Powering interactive user-facing applications
    • Arbitrary slicing and dicing of large datasets
    • User behavior analysis:
      • Measuring distinct counts
      • Retention analysis
      • Funnel analysis
      • A/B testing
    • Exploratory analytics/root cause analysis
    • Not interested in dumping the entire dataset
  • 35. Performance and Scalability: Fast Facts
    • Most events per day: 300 billion events/day (Metamarkets)
    • Most computed metrics: 1 billion metrics/min (Jolata)
    • Largest cluster: 200 nodes (Metamarkets)
    • Largest hourly ingestion: 2 TB per hour (Netflix)
  • 36. Performance Numbers
    • Query latency:
      • Average: 500 ms
      • 90%ile < 1 sec, 95%ile < 5 sec, 99%ile < 10 sec
    • Query volume: 1000s of queries per minute
    • Benchmarking code: https://github.com/druid-io/druid-benchmark
  • 37. Simplified Druid Cluster Management with Ambari
    • Install, configure and manage Druid and all external dependencies from Ambari
    • Easy to enable HA, security, monitoring, …
  • 38. Simplified Druid Cluster Management with Ambari (screenshot)
  • 39. Monitoring a Druid Cluster
    • Each Druid node emits metrics for query performance, ingestion rate, JVM health, query cache performance and system health
    • Metrics are emitted as JSON objects to a runtime log file or over HTTP to other services
    • Emitters available for Ambari Metrics Server, Graphite, StatsD, Kafka
    • Easy to implement your own metrics emitter
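Because metric events are plain JSON objects, a custom consumer can be very small. The sketch below parses a few sample log lines and averages query latency; the sample lines are invented for illustration (though query/time is a real Druid metric name), and real events carry more fields:

```python
import json

# Hypothetical excerpt of a Druid runtime metrics log.
log_lines = [
    '{"metric": "query/time", "value": 120, "service": "druid/broker"}',
    '{"metric": "query/time", "value": 340, "service": "druid/broker"}',
    '{"metric": "jvm/gc/time", "value": 15, "service": "druid/broker"}',
]

# Keep only query-latency events and average them.
latencies = [
    event["value"]
    for event in map(json.loads, log_lines)
    if event["metric"] == "query/time"
]
avg_query_ms = sum(latencies) / len(latencies)
print(avg_query_ms)
```

A custom emitter does the inverse: it receives these same JSON events in-process and forwards them to the sink of your choice.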
  • 40. Monitoring using Ambari Metrics Server
    • HDP 2.6.1 contains pre-defined Grafana dashboards for:
      • Health of Druid nodes
      • Ingestion
      • Query performance
    • Easy to create new dashboards and set up alerts
    • Auto-configured when both Druid and Ambari Metrics Server are installed
  • 41. Monitoring using Ambari Metrics Server (screenshot)
  • 42. Monitoring using Ambari Metrics Server (screenshot)
  • 43. 43 © Hortonworks Inc. 2011–2018. All rights reserved High Availability • Deploy Coordinator/Overlord on multiple instances • Leader election via ZooKeeper • Broker – install multiple brokers • Use the Druid Router or any load balancer to route queries to brokers • Realtime Index Tasks – create redundant tasks • Historical Nodes – create a load rule with replication factor >= 2 (default = 2)
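The historical-node replication bullet above maps to a Druid retention ("load") rule. A minimal sketch of what such a rule body could look like, built in Python; the datasource name in the comment is hypothetical, and the field names follow Druid's retention-rule JSON format:

```python
import json

# A "loadForever" rule asking historicals to keep 2 replicas of every
# segment in the default tier. Rules are posted as a JSON array to the
# coordinator, e.g. /druid/coordinator/v1/rules/<dataSourceName>.
load_rules = [
    {
        "type": "loadForever",
        "tieredReplicants": {"_default_tier": 2},  # replication factor = 2
    }
]
payload = json.dumps(load_rules)
```

With two replicas of each segment, one historical node can be lost (or upgraded) without any segment becoming unqueryable.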
  • 44. 44 © Hortonworks Inc. 2011–2018. All rights reserved Rolling Upgrades • Shared-nothing architecture • Maintain backwards compatibility • Data redundancy • Upgrade one Druid component at a time • No downtime
  • 45. 45 © Hortonworks Inc. 2011–2018. All rights reserved Security • Supports authentication via Kerberos/SPNEGO • Easy wizard-based Kerberos security enablement via Ambari (diagram: the user's browser obtains a token via kinit against the KDC, then presents it to Druid)
  • 46. 46 © Hortonworks Inc. 2011–2018. All rights reserved Summary • Easy installation and management via Ambari • Real-time • Ingestion latency < seconds • Query latency < seconds • Arbitrary slicing and dicing of big data like a ninja • No more pre-canned drill-downs • Query with more fine-grained granularity • High availability and rolling-deployment capabilities • Secure and production ready • Vibrant and active community • Available as Tech Preview in HDP 2.6.1
  • 47. 47 © Hortonworks Inc. 2011–2018. All rights reserved Useful Resources • Druid website – http://druid.io • Druid User Group – users@druid.incubator.apache.org • Druid Dev Group – dev@druid.incubator.apache.org
  • 48. 48 © Hortonworks Inc. 2011–2018. All rights reserved Thank you Twitter - @NishantBangarwa Email - nbangarwa@hortonworks.com
  • 49. 49 © Hortonworks Inc. 2011–2018. All rights reserved Questions?
  • 50. 50 © Hortonworks Inc. 2011–2018. All rights reserved Extending Core Druid • Plugin-based architecture • Leverages Guice to load extensions at runtime • Extensions can • Add a new deep storage implementation • Add a new Firehose for ingestion • Add aggregators • Add complex metrics • Add new query types • Add new Jersey resources • Bundle your extension with all the other Druid extensions
  • 51. 51 © Hortonworks Inc. 2011–2018. All rights reserved Performance : Approximate Algorithms • Ability to store approximate data sketches for high-cardinality columns, e.g. userid • Reduced storage size • Use cases • Fast approximate distinct counts • Approximate top-K queries • Approximate histograms • Funnel/retention analysis • Limitations • Exact counts are not possible • Cannot filter on individual row values
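To make the "fast approximate distinct counts" bullet concrete, here is a toy HyperLogLog sketch in Python. It is an illustrative sketch of the algorithm idea only, not Druid's implementation (Druid uses the DataSketches/HLL aggregators); the class and parameters are made up for this example:

```python
import hashlib
import math

def _hash64(value: str) -> int:
    # Stable 64-bit hash for the demo (MD5 truncated to 64 bits).
    return int(hashlib.md5(value.encode()).hexdigest(), 16) & ((1 << 64) - 1)

class HyperLogLog:
    """Toy HLL: m = 2^p registers, each storing the max 'rank' seen."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        h = _hash64(value)
        idx = h >> (64 - self.p)               # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits
        # rank = 1-based position of the leftmost set bit in 'rest'
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        e = alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            # small-range correction: fall back to linear counting
            e = self.m * math.log(self.m / zeros)
        return e
```

A sketch like this needs only a few kilobytes of registers regardless of how many distinct userids it has seen, which is why storing sketches instead of raw high-cardinality columns cuts storage so dramatically.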
  • 52. 52 © Hortonworks Inc. 2011–2018. All rights reserved Superset • Python backend • Flask app builder • Authentication • Pandas for rich analytics • SqlAlchemy for SQL toolkit • Javascript frontend • React, NVD3 • Deep integration with Druid
  • 53. 53 © Hortonworks Inc. 2011–2018. All rights reserved Superset Rich Dashboarding Capabilities: Treemaps
  • 54. 54 © Hortonworks Inc. 2011–2018. All rights reserved Superset Rich Dashboarding Capabilities: Sunburst
  • 55. 55 © Hortonworks Inc. 2011–2018. All rights reserved Superset UI Provides Powerful Visualizations Rich library of dashboard visualizations: Basic: • Bar Charts • Pie Charts • Line Charts Advanced: • Sankey Diagrams • Treemaps • Sunburst • Heatmaps And More!

Editor's Notes

  1. Motivation; Druid introduction and use case; Demo; Druid Architecture; Storage Internals; Recent Improvements
  2. Initial Use Case: power an ad-tech analytics product at Metamarkets. As shown in the picture on the right, a dashboard where you can visualize timeseries data and do arbitrary filtering and grouping on any combination of dimensions. Requirements - Arbitrary queries: the data store needs to let users filter and group on any combination of dimensions. Scalability: should be able to handle trillions of events/day. Interactive: since the data store was going to power an interactive dashboard, low-latency queries were a must. Real-time: the time between when an event occurs and when it is visible on the dashboard should be minimal (on the order of a few seconds). High availability: no central point of failure. Rolling upgrades: the architecture was required to support rolling upgrades.
  3. MOTIVATION Interactive real-time visualizations on complex data streams. Answer BI questions: How many unique male visitors visited my website last month? How many products were sold last quarter, broken down by demographic and product category? Not interested in dumping the entire dataset. Suppose I am running an ad campaign and I want to understand what kinds of impressions there are, what my click-through rate is, and how many users decided to purchase my services. We have a user activity stream and may want to know how the users are behaving. We may have a stream of firewall events and want to detect anomalies in those streams in real time. Also, for very large distributed clusters there is a need to answer questions about application performance: How is each individual node in my cluster behaving? Are there any anomalies in query response time? All the above use cases can involve data streams that are huge in volume, depending on the scale of the business. How do I analyze this information? How do I get insights from these streams of events in real time?
  4. What is Druid? Column-oriented distributed datastore – data is stored in columnar format. In general, many datasets have a large number of dimensions, e.g. 100s or 1000s, but most queries only need 5-10 columns; the column-oriented format lets Druid scan only the required columns. Sub-second query times – it uses techniques like bitmap indexes for fast filtering, memory-mapped files to serve data from memory, data summarization and compression, and query caching, and it has highly optimized algorithms for different query types, so it is able to achieve sub-second query times. Realtime streaming ingestion from almost any ETL pipeline. Arbitrary slicing and dicing of data – no need to create pre-canned drill-downs. Automatic data summarization – during ingestion it can summarize your data; e.g. if my dashboard only shows events aggregated by HOUR, we can optionally configure Druid to do pre-aggregation at ingestion time. Approximate algorithms (HyperLogLog, theta sketches) – for fast approximate answers. Scalable to petabytes of data. Highly available.
  5. This shows some of the production users. I can talk about some of the large ones, which have common use cases. Alibaba and Ebay use Druid for ecommerce and user behavior analytics. Cisco has a realtime analytics product for analyzing network flows. Yahoo uses Druid for user behavior analytics and realtime cluster monitoring. Hulu does interactive analysis of user and application behavior. Paypal and SK Telecom use Druid for business analytics.
  6. Realtime Nodes - handle real-time ingestion; support both pull- and push-based ingestion. Store data in a row-oriented, write-optimized structure. Periodically convert the write-optimized structure to a read-optimized structure. Ability to serve queries as soon as data is ingested. Historical Nodes - main workhorses of a Druid cluster. Use memory-mapped files to load columnar data. Respond to user queries. Broker Nodes - keep track of which node is serving which portion of the data. Ability to scatter a query across multiple historical and realtime nodes. Caching layer.
  7. Druid has the concept of different node types, where each node is designed and optimized to perform a specific set of tasks. Realtime Index Tasks / Realtime Nodes - handle real-time ingestion; support both pull- and push-based ingestion. Handle queries - ability to serve queries as soon as data is ingested. Store data in a write-optimized structure on heap, periodically convert it to read-optimized, time-partitioned immutable segments, and persist those to deep storage. In case you need to do any ETL, like data enrichment or joining multiple streams of data, you can do it in a separate ETL pipeline and send your massaged data to Druid. Deep storage can be any distributed FS and acts as a permanent backup of the data. Historical Nodes - main workhorses of a Druid cluster. Use memory-mapped files to load immutable segments. Respond to user queries. Now let's see how data can be queried. Broker Nodes - keep track of the data chunks being loaded by each node in the cluster. Ability to scatter a query across multiple historical and realtime nodes. Caching layer. Now let's discuss another case, when you do not have streaming data but want to ingest batch data into Druid. Batch ingestion can be done using either a Hadoop MR or Spark job, which converts your data into time-partitioned segments and persists them to deep storage.
  8. With many historical nodes in a cluster there is a need to balance the load across them; this is done by the Coordinator Nodes - use Zookeeper for coordination. Ask historical nodes to load or drop data. They also move data across historical nodes to balance load in the cluster. Manage data replication. External dependencies - Metadata storage: for storing metadata about the segments, i.e. the location of segments, information on how to load the segments, etc. Memcache/Redis cache: you can optionally add a memcache or redis cache which can be used to cache partial query results.
  9. Druid: Segments. Data in Druid is stored in segment files, partitioned by time. Ideally, segment files are each smaller than 1 GB. If files are larger, smaller time partitions are needed.
  10. Example Wikipedia Edit Dataset
  11. Data Rollup Rollup by hour
  12. Dictionary Encoding Create and store Ids for each value e.g. page column Values - Justin Bieber, Ke$ha, Selena Gomes Encoding - Justin Bieber : 0, Ke$ha: 1, Selena Gomes: 2 Column Data - [0 0 0 1 1 2] city column - [0 0 0 1 1 1]
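The dictionary encoding described above is easy to sketch: map each distinct string to an integer id (here, in order of first appearance) and store the column as a list of ids. A minimal illustrative version, using the slide's sample data:

```python
def dictionary_encode(column):
    """Return (dictionary, encoded column) for a list of string values."""
    dictionary = {}
    encoded = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)  # next unused id
        encoded.append(dictionary[value])
    return dictionary, encoded

pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 2 + ["Selena Gomes"]
dictionary, encoded = dictionary_encode(pages)
# dictionary: {"Justin Bieber": 0, "Ke$ha": 1, "Selena Gomes": 2}
# encoded:    [0, 0, 0, 1, 1, 2]
```

Besides shrinking the column, the integer ids are what make the bitmap indexes on the next slide cheap to build and store.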
  13. Bitmap Indices Store Bitmap Indices for each value Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0] Ke$ha -> [3, 4] -> [0 0 0 1 1 0] Selena Gomes -> [5] -> [0 0 0 0 0 1] Queries Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0] language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1] Indexes compressed with Concise or Roaring encoding
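The bitmap operations above can be demonstrated directly. In this illustrative sketch each value id gets a bitset (a Python int) with bit i set when row i holds that value, so OR/AND filters become single bitwise operations; production systems compress these bitmaps with Concise or Roaring rather than storing them raw:

```python
def build_bitmaps(encoded_column, cardinality):
    """One bitmap per dictionary id; bit i is set when row i has that value."""
    bitmaps = [0] * cardinality
    for row, value_id in enumerate(encoded_column):
        bitmaps[value_id] |= 1 << row
    return bitmaps

def matching_rows(bitmap, num_rows):
    """Expand a bitmap back into the list of matching row numbers."""
    return [r for r in range(num_rows) if bitmap & (1 << r)]

# page column from the slide: Justin Bieber=0, Ke$ha=1, Selena Gomes=2
page_bitmaps = build_bitmaps([0, 0, 0, 1, 1, 2], 3)
either = page_bitmaps[0] | page_bitmaps[1]  # "Justin Bieber OR Ke$ha"
```

The filter never touches the raw column: it combines a few small bitmaps and only then visits the matching rows.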
  14. Data Rollup Rollup by hour
  15. The Indexing Service is a highly available, distributed service that runs indexing-related tasks. It is composed of three main components: Overlord - responsible for accepting tasks, coordinating task distribution, creating locks around tasks, and returning statuses to callers. Middle Managers - worker nodes that execute submitted tasks; they launch peons that actually run the tasks. Peons - managed by middle managers; each runs a single task. A peon gets a task definition, which is a JSON spec file describing the task to perform. All the coordination and communication for task assignment and announcing task statuses is done via Zookeeper.
  16. Streaming Ingestion. Done by Realtime Index Tasks. Ability to ingest streams of data. Stores data in a write-optimized structure - a row-oriented key-value store, indexed by time and dimension values. Periodically, based on either a time interval or a threshold on the number of rows, it converts the write-optimized structure into read-optimized segments. Events are queryable as soon as they are ingested. Both push- and pull-based ingestion.
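The write-optimized-then-read-optimized flow described above can be sketched as a tiny buffer class. This is an illustrative sketch only; the class and field names are invented, not Druid's actual code:

```python
class RealtimeBuffer:
    """Buffer events row-oriented for cheap writes; on a row threshold,
    convert the batch into a column-oriented, read-optimized structure."""

    def __init__(self, max_rows=5):
        self.max_rows = max_rows
        self.rows = []        # write-optimized: list of event dicts
        self.segments = []    # read-optimized: dicts of column -> value list

    def add(self, event):
        self.rows.append(event)
        if len(self.rows) >= self.max_rows:
            self.persist()

    def persist(self):
        if not self.rows:
            return
        # pivot rows into columns (the "read-optimized" form)
        columns = {k: [r[k] for r in self.rows] for k in self.rows[0]}
        self.segments.append(columns)
        self.rows = []

buf = RealtimeBuffer(max_rows=2)
buf.add({"page": "a", "count": 1})
buf.add({"page": "b", "count": 2})  # threshold hit: rows become a segment
```

Queries against a realtime task scan both the in-memory rows and the already-persisted segments, which is how events stay queryable the moment they arrive.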
  17. Tranquility is a helper library for Druid which provides easy coordination and task management for performing streaming ingestion into Druid. It has a very simple API which you can use to send events to Druid. On the right-hand side you can see a simple example sending an event to Druid. You just create a Tranquilizer with a config; the config contains the location of the Druid overlord, the name of your datasource, and other ingestion-related properties. Simply call send on the tranquilizer, and it automatically takes care of creating a Druid task, managing the lifecycle of the task, discovering the location of the task, and sending data to that task.
  18. We have also added experimental support for ingesting data from Kafka that supports exactly-once consumption of data. Kafka works as follows - each message written to Kafka is placed into an ordered and immutable sequence called a partition and is assigned a sequentially incrementing identifier called an offset. Messages are pulled by Druid tasks, which verify the offsets to ensure ordering. Then, at the time of persisting the data, both the segments and the associated Kafka offsets are persisted in a single transaction. Since the offsets are in the metadata, in case of failure we can start reading from that offset again.
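The single-transaction trick that gives exactly-once semantics can be sketched with SQLite standing in for the metadata store. The schema and names here are invented for illustration, not Druid's actual metadata tables: the point is only that segment metadata and the consumed Kafka offset commit atomically, so a crash can never persist data without its offset (or vice versa):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE segments (id TEXT PRIMARY KEY, row_count INTEGER)")
conn.execute(
    "CREATE TABLE offsets (topic TEXT, part INTEGER, next_offset INTEGER,"
    " PRIMARY KEY (topic, part))"
)

def persist(conn, segment_id, row_count, topic, part, next_offset):
    # 'with conn' wraps both statements in ONE transaction: commit on
    # success, rollback if anything raises in between.
    with conn:
        conn.execute("INSERT INTO segments VALUES (?, ?)",
                     (segment_id, row_count))
        conn.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)",
                     (topic, part, next_offset))

persist(conn, "wikipedia_2018-01-01T00", 1000, "wiki-edits", 0, 1000)
```

On restart, the task reads `next_offset` from the metadata and resumes from there; any messages consumed but not committed are simply re-read, so no event is counted twice and none is lost.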
  19. Batch Ingestion - ingest data in batches. HadoopIndexTask: a peon launches a Hadoop MR job; mappers read the data and reducers create Druid segment files. Index Task: suitable for small data sizes (<1 GB).
  20. Druid broker nodes expose HTTP endpoints where users can POST queries. Queries and results are expressed in JSON, and there are multiple query types. On the right we have an example of a groupBy query. In the JSON query you can specify the datasource; the granularity - the time bucket for your data; any filter you may want to use; the list of aggregations you need to perform; and any post-aggregations, like averages.
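A groupBy query body of the shape described above can be built in Python and serialized to JSON. The field structure follows Druid's native query format; the datasource, dimension, and metric names here are hypothetical sample values, and the body would be POSTed to a broker's HTTP endpoint (e.g. /druid/v2):

```python
import json

query = {
    "queryType": "groupBy",
    "dataSource": "wikipedia",
    "granularity": "hour",                       # time bucket for results
    "dimensions": ["page"],
    "filter": {"type": "selector", "dimension": "language", "value": "en"},
    "aggregations": [
        {"type": "longSum", "name": "edits", "fieldName": "count"},
        {"type": "longSum", "name": "chars_added", "fieldName": "added"},
    ],
    "postAggregations": [
        {"type": "arithmetic", "name": "chars_per_edit", "fn": "/",
         "fields": [{"type": "fieldAccess", "fieldName": "chars_added"},
                    {"type": "fieldAccess", "fieldName": "edits"}]},
    ],
    "intervals": ["2018-01-01T00:00:00Z/2018-01-02T00:00:00Z"],
}
body = json.dumps(query)
```

Note how the post-aggregation computes a derived value (characters per edit) from two summed aggregators at query time, rather than storing it.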
  21. The second and easier way to query Druid is using SQL (support for built-in SQL is experimental at present). We leverage Apache Calcite for parsing and planning the query. It also uses Avatica, which is a framework for building JDBC drivers for databases. So using this, you can connect any BI tool that supports JDBC to Druid. Druid also defines some new operators for supporting approximate queries.
  22. Retention analysis
  23. Most Events per Day 300 Billion Events / Day (Metamarkets) Most Computed Metrics 1 Billion Metrics / Min (Jolata) Largest Cluster 200 Nodes (Metamarkets) Largest Hourly Ingestion 2TB per Hour (Netflix)
  24. Query Latency average - 500ms 90%ile < 1sec 95%ile < 5sec 99%ile < 10 sec Query Volume 1000s queries per minute
  25. Query performance – query time, segment scan time … Ingestion Rate – events ingested, events persisted … JVM Health – JVM Heap usage, GC stats … Cache Related – cache hits, cache misses, cache evictions … System related – cpu, disk, network, swap usage etc..
  26. No downtime; data redundancy; rolling upgrades.
  27. You can secure Druid nodes using Kerberos and use the SPNEGO mechanism to interact with Druid HTTP endpoints.
  28. Summary It is easy to install and manage druid via Ambari Realtime with ingestion and query latency of the order of few secs. Arbitrary slicing and dicing of data
  30. Guice, which is a lightweight dependency injection framework.