SlideShare uma empresa Scribd logo
1 de 26
Druid SQL Interface
Calcite
Problem
- Druid is used extensively on our team and at Oath
- Druid is hard to interact with due to its JSON input format
- Many at Oath are not familiar with how to optimize Druid queries
Why use Druid?
- Able to ingest and serve data in real-time with low latency
- Good for ad-hoc queries
- Good for storing aggregate data
- Scalable to ingest millions of events/sec
Using SQL to bridge the gap
● SQL is the lingua franca of data
● Most at Oath are already familiar with SQL
● SQL is easier to write and more concise than JSON
● All BI tools we use support SQL
SQL vs Druid JSON
Here is a sample SQL query for a given dataset:
SELECT
SUM("store_sales") filter (where "store_state" = 'CA'),
SUM("store_cost") filter (where "store_state" = 'OR')
FROM
"foodmart"
WHERE
"the_month" == 'October'
LIMIT
10
The same query in Druid JSON format is much less readable
SQL vs Druid JSON
{
"queryType":"groupBy",
"dataSource":"foodmart",
"granularity":"all",
"dimensions":[],
"limitSpec":{
"type":"default",
"limit":10,
"columns":[]
},
"filter":{
"type":"and",
"fields":[
{
"type":"or",
"fields":[
{
"type":"selector",
"dimension":"store_state",
"value":"CA"
},
{
"type":"selector",
"dimension":"store_state",
"value":"OR"
}
]
},
{
"type":"not",
"field":{
"type":"selector",
"dimension":"the_month",
"value":"October"
}
}
]
},
"aggregations":[
{
"type":"filtered",
"filter":{
"type":"selector",
"dimension":"store_state",
"value":"CA"
},
"aggregator":{
"type":"doubleSum",
"name":"EXPR$0",
"fieldName":"store_sales"
}
},
{
"type":"filtered",
"filter":{
"type":"selector",
"dimension":"store_state",
"value":"OR"
},
"aggregator":{
"type":"doubleSum",
"name":"EXPR$1",
"fieldName":"store_cost"
}
}
],
"intervals":["1900-01-09T00:00:00.000/2992-
01-10T00:00:00.000"]
}
Pre-existing Solutions
- Druid SQL services
- Hive Druid connection
- Apache Calcite
Druid SQL Services
- Druid has SQL support via Apache Calcite
- Pros:
- Significantly simplifies query JSON
- Already supported in Druid
- Cons:
- Support is experimental
- Doesn’t support DataSketch aggregators
curl -XPOST -H 'Content-Type: application/json' http://BROKER:8082/druid/v2/sql/ -d @query.json
{
"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar' AND __time > TIMESTAMP '2000-01-
01 00:00:00'",
"context" : {"sqlTimeZone" : "America/Los_Angeles"}
}
Hive Druid Connection
- Hive also has some level of Druid support via Apache Calcite
- Pros:
- Many BI tools already support Hive
- Cons:
- Lacks support for sketches
Apache Calcite
- Translator between SQL and Druid JSON
- Industry-standard SQL parser
- Represent your query in relational algebra, transform using planning rules,
and optimize according to a cost model
- Open source
Our Solution
- Use Apache Calcite directly
- Address the deficiencies of Calcite and contribute back to the open source
community
Calcite relational algebra
- Relational logic tree translated
from SQL query
- Each node has its cost based on
context
- SELECT SUM(a) as c FROM
table1 WHERE b=1 ORDER BY c
TableScan On table1
Filter (b=1)
Project (table1.a -> a)
Aggregate (sum(a))
Sort on c
Query Planning
- Apply rules on the
Relational logic tree
- Transform certain logic
subtree into Druid Query
Node
TableScan Filter Project Aggregate Sort
Druid GroupBy Query node Sort
Druid TopN Query node
Or
Optimization
- Use cost model to estimate the
performance of different
transformed logic tree
- Basic idea is to leverage more
computation in Druid
Druid GroupBy Query node Sort
Druid TopN Query node
Cost = 10 Cost = 10
Cost = 15
Renderer
- Render the druid json query to be
sent out
- If any computation cannot be
pushed to json query, run it locally
in Calcite.
Druid TopN Query node
{
"queryType":"TopN",
"dataSource":"foodmart",
"Granularity":"all",
…
Major Problems
- Did not support Post-Aggregation
- AVERAGE function
- Could run out of memory
- Did not support Filtered Aggregations
- Could cause Druid query all rows and process them in memory
- Did not support Distinct Count Aggregators using ThetaSketches
- Calcite will always try to give the user exact results
- Distinct count aggregations are not pushed to Druid
Post Aggregation Support
- New Rule to merge Post
aggregation node
- New Render that can
generate druid query with
post aggregation
TableScan Project Aggregate
Druid GroupBy Query node
Aggregate
Aggregate
Druid GroupBy Query node
New Rule
Filtered Aggregations Support
- New Rule to move Filter
operation from Calcite to
Druid
- Optimization on filters
- New rule to extract
common filter into outer
filter
- New rule to combine filter
with logical ORs to outer
filter
TableScan Filter1 Project
Aggregate
Aggregate
Filter2
Filter2
TableScan
Filter1
Project
Aggregate
AggregateFilter2
Performance
- Avoid unnecessary rows scan
in Druid
- Greatly reduce the runtime of
when filters are involved
Why ThetaSketch
- Sketches are a class of streaming, stochastic
algorithms
- Trade off accuracy for speed – orders of
magnitude faster
- Exact up to configurable thresholds and
approximate after
- Mathematically provable error bounds
- Bounded in space
- Set operations – union, intersect, difference Sketches logo from http://datasketches.github.io
ThetaSketch Support
- New rule to translate Distinct count aggregator node to Thetasketches node
- Allow users to config whether approximate cardinality is allowed
Performance
- Reduced the running time of the
query with count distinct aggregator
when cardinality estimation is
allowed
- Sketches column can be utilized
now
- With post aggregation support,
more operation can be applied
User Interface
- Superset is commonly used
with Druid
- Superset SQL Lab is popular
on SQL-like database
From superset documentation: https://superset.incubator.apache.org/
Superset Calcite Connection
- Superset is python application
- Standard python DBAPI is created
- Able to use SQL lab to run ad-hoc query on Druid
Perform Query
Parsing,
Planning
Internal
Computation
Druid Adapter
SQL Lab
Calcite JDBC
Output
User
Superset
Calcite
Druid
Questions

Mais conteúdo relacionado

Mais procurados

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomyDongmin Yu
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedis Labs
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLScyllaDB
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...Altinity Ltd
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudAnshum Gupta
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1ScyllaDB
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Building an Event Streaming Architecture with Apache Pulsar
Building an Event Streaming Architecture with Apache PulsarBuilding an Event Streaming Architecture with Apache Pulsar
Building an Event Streaming Architecture with Apache PulsarScyllaDB
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchpmanvi
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...Altinity Ltd
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberXiang Fu
 

Mais procurados (20)

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ Twitter
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
 
Apache flink
Apache flinkApache flink
Apache flink
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Building an Event Streaming Architecture with Apache Pulsar
Building an Event Streaming Architecture with Apache PulsarBuilding an Event Streaming Architecture with Apache Pulsar
Building an Event Streaming Architecture with Apache Pulsar
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
Apache Airflow overview
Apache Airflow overviewApache Airflow overview
Apache Airflow overview
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
 

Semelhante a Querying Druid in SQL with Superset

Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Julian Hyde
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...AboutYouGmbH
 
1 extreme performance - part i
1   extreme performance - part i1   extreme performance - part i
1 extreme performance - part isqlserver.co.il
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersLucidworks
 
At the core you will have KUSTO
At the core you will have KUSTOAt the core you will have KUSTO
At the core you will have KUSTORiccardo Zamana
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame APIdatamantra
 
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudInteractive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudAlluxio, Inc.
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Lucas Jellema
 
[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)Steve Min
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllMichael Mior
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseHBaseCon
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Sudhir Mallem
 
zData Inc. Big Data Consulting and Services - Overview and Summary
zData Inc. Big Data Consulting and Services - Overview and SummaryzData Inc. Big Data Consulting and Services - Overview and Summary
zData Inc. Big Data Consulting and Services - Overview and SummaryzData Inc.
 
Challenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopChallenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopDataWorks Summit
 
NewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACIDNewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACIDTony Rogerson
 

Semelhante a Querying Druid in SQL with Superset (20)

Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
1 extreme performance - part i
1   extreme performance - part i1   extreme performance - part i
1 extreme performance - part i
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
At the core you will have KUSTO
At the core you will have KUSTOAt the core you will have KUSTO
At the core you will have KUSTO
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudInteractive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
 
Oow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BIOow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BI
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
 
[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 
zData Inc. Big Data Consulting and Services - Overview and Summary
zData Inc. Big Data Consulting and Services - Overview and SummaryzData Inc. Big Data Consulting and Services - Overview and Summary
zData Inc. Big Data Consulting and Services - Overview and Summary
 
Challenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopChallenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on Hadoop
 
NewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACIDNewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACID
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 

Último (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 

Querying Druid in SQL with Superset

  • 2. Problem - Druid is used extensively on our team and at Oath - Druid is hard to interact with due to its JSON input format - Many at Oath are not familiar with how to optimize Druid queries
  • 3. Why use Druid? - Able to ingest and serve data in real-time with low latency - Good for ad-hoc queries - Good for storing aggregate data - Scalable to ingest millions of events/sec
  • 4. Using SQL to bridge the gap ● SQL is the lingua franca of data ● Most at Oath are already familiar with SQL ● SQL is easier to write and more concise than JSON ● All BI tools we use support SQL
  • 5. SQL vs Druid JSON Here is a sample SQL query for a given dataset: SELECT SUM("store_sales") filter (where "store_state" = 'CA'), SUM("store_cost") filter (where "store_state" = 'OR') FROM "foodmart" WHERE "the_month" == 'October' LIMIT 10 The same query in Druid JSON format is much less readable
  • 6. SQL vs Druid JSON { "queryType":"groupBy", "dataSource":"foodmart", "granularity":"all", "dimensions":[], "limitSpec":{ "type":"default", "limit":10, "columns":[] }, "filter":{ "type":"and", "fields":[ { "type":"or", "fields":[ { "type":"selector", "dimension":"store_state", "value":"CA" }, { "type":"selector", "dimension":"store_state", "value":"OR" } ] }, { "type":"not", "field":{ "type":"selector", "dimension":"the_month", "value":"October" } } ] }, "aggregations":[ { "type":"filtered", "filter":{ "type":"selector", "dimension":"store_state", "value":"CA" }, "aggregator":{ "type":"doubleSum", "name":"EXPR$0", "fieldName":"store_sales" } }, { "type":"filtered", "filter":{ "type":"selector", "dimension":"store_state", "value":"OR" }, "aggregator":{ "type":"doubleSum", "name":"EXPR$1", "fieldName":"store_cost" } } ], "intervals":["1900-01-09T00:00:00.000/2992- 01-10T00:00:00.000"] }
  • 7. Pre-existing Solutions - Druid SQL services - Hive Druid connection - Apache Calcite
  • 8. Druid SQL Services - Druid has SQL support via Apache Calcite - Pros: - Significantly simplifies query JSON - Already supported in Druid - Cons: - Support is experimental - Doesn’t support DataSketch aggregators curl -XPOST -H 'Content-Type: application/json' http://BROKER:8082/druid/v2/sql/ -d @query.json { "query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar' AND __time > TIMESTAMP '2000-01- 01 00:00:00'", "context" : {"sqlTimeZone" : "America/Los_Angeles"} }
  • 9. Hive Druid Connection - Hive also has some level of Druid support via Apache Calcite - Pros: - Many BI tools already support Hive - Cons: - Lacks support for sketches
  • 10. Apache Calcite - Translator between SQL and Druid JSON - Industry-standard SQL parser - Represent your query in relational algebra, transform using planning rules, and optimize according to a cost model - Open source
  • 11. Our Solution - Use Apache Calcite directly - Address the deficiencies of Calcite and contribute back to the open source community
  • 12. Calcite relational algebra - Relational logic tree translated from SQL query - Each node has its cost based on context - SELECT SUM(a) as c FROM table1 WHERE b=1 ORDER BY c TableScan On table1 Filter (b=1) Project (table1.a -> a) Aggregate (sum(a)) Sort on c
  • 13. Query Planning - Apply rules on the Relational logic tree - Transform certain logic subtree into Druid Query Node TableScan Filter Project Aggregate Sort Druid GroupBy Query node Sort Druid TopN Query node Or
  • 14. Optimization - Use cost model to estimate the performance of different transformed logic tree - Basic idea is to leverage more computation in Druid Druid GroupBy Query node Sort Druid TopN Query node Cost = 10 Cost = 10 Cost = 15
  • 15. Renderer - Render the druid json query to be sent out - If any computation cannot be pushed to json query, run it locally in Calcite. Druid TopN Query node { "queryType":"TopN", "dataSource":"foodmart", "Granularity":"all", …
  • 16. Major Problems - Did not support Post-Aggregation - AVERAGE function - Could run out of memory - Did not support Filtered Aggregations - Could cause Druid query all rows and process them in memory - Did not support Distinct Count Aggregators using ThetaSketches - Calcite will always try to give the user exact results - Distinct count aggregations are not pushed to Druid
  • 17. Post Aggregation Support - New Rule to merge Post aggregation node - New Render that can generate druid query with post aggregation TableScan Project Aggregate Druid GroupBy Query node Aggregate Aggregate Druid GroupBy Query node New Rule
  • 18. Filtered Aggregations Support - New Rule to move Filter operation from Calcite to Druid - Optimization on filters - New rule to extract common filter into outer filter - New rule to combine filter with logical ORs to outer filter TableScan Filter1 Project Aggregate Aggregate Filter2 Filter2 TableScan Filter1 Project Aggregate AggregateFilter2
  • 19. Performance - Avoid unnecessary rows scan in Druid - Greatly reduce the runtime of when filters are involved
  • 20. Why ThetaSketch - Sketches are a class of streaming, stochastic algorithms - Trade off accuracy for speed – orders of magnitude faster - Exact up to configurable thresholds and approximate after - Mathematically provable error bounds - Bounded in space - Set operations – union, intersect, difference Sketches logo from http://datasketches.github.io
  • 21. ThetaSketch Support - New rule to translate Distinct count aggregator node to Thetasketches node - Allow users to config whether approximate cardinality is allowed
  • 22. Performance - Reduced the running time of the query with count distinct aggregator when cardinality estimation is allowed - Sketches column can be utilized now - With post aggregation support, more operation can be applied
  • 23. User Interface - Superset is commonly used with Druid - Superset SQL Lab is popular on SQL-like database From superset documentation: https://superset.incubator.apache.org/
  • 24. Superset Calcite Connection - Superset is python application - Standard python DBAPI is created - Able to use SQL lab to run ad-hoc query on Druid
  • 25. Perform Query Parsing, Planning Internal Computation Druid Adapter SQL Lab Calcite JDBC Output User Superset Calcite Druid

Notas do Editor

  1. Druid emerges recently as a is an open-source data store designed for sub-second queries on real-time and historical data. It is usually used to query event data. It can ingest event data in real time and allows flexible data exploration and data aggregation. Right now in Oath, a lot of data sciences want to do a ad-hoc query on users event data and Druid will be their first choice. Now, we will go over how to work with druid when we need to do a adhoc query.
  2. In this presentation, we will first go over the reason why we want a SQL interface on the top of Druid. In the meantime, we will briefly introduce Druid and other existing solution. After that, we will focus on the improvement needed to be done for the SQL interface. This part will include our contribution on the open source project. At last, we will introduce how we combine the SQL interface with a neat User interface for a wider range of users to work on Druid.
  3. Let’s say I am a data scientist and I may want to run this SQL on druid. In this SQL query, we are looking for summation of store_sales in California and store_costs in Oregon. The data in October should be excluded. Since now the data is in druid, we will have to write a query that Druid understands. In this case, we have to write a JSON and send it to druid within a http request. So how the json will be like?
  4. Probably after several experience with druid, ones will be familiar with this format. However, for someone who is already familiar with SQL to translate SQL into this format, it will need some time to learn the druid documentation and probably the syntax of json. It will be great if we can have a SQL interface to query druid data without losing the great performance of druid. Let’s go ahead and explore some SQL interfaces that Druid could work with.
  5. First option we can have is to use the SQL services provided by Druid. It works similarly as the druid query and we need to send a http request with the SQL statement in a json object. This function is supported by a open source tool called Apache Calcite.
  6. Need more
  7. Obviously, Calcite is the core part of the SQL interface on Druid. To develop a SQL interface on Druid, it is necessary to learn about Calcite first. We will briefly introduce Calcite and how it works then go over the contribution we made on Open source Calcite.
  8. The most important concept in Calcite is the relational algebra. Calcite basically translate SQL statement into a tree structure with nodes representing the logic in it. For example, the tree on the left side is a simple example algebra tree for the SQL statement on the left side. First, the TableScan to lock down the SQL on certain table. Then the filter contains the logic of WHERE value of b column equals to 1. The Project node is not so obvious but in many SQL database the actual column name may contain namespace and the project node will take care these kinds of translation. The Aggregate node includes the logic of summation function we used. The last one is the sort node which corresponds to order by part in SQL statement.
  9. With the relational logic tree, we now can transform the tree to another tree representing the equivalent logic. After transformation, the new tree should be translated into Druid Query. The transformation rules will subtrees with equivalent logic to be transformed between each other. Like the example here, the first four node can be transformed as a druid groupby query through rules we specified. In the other transformation path, the whole tree can be transformed into Druid topN query node which contains equivalent logic. Now, the question is to pick a certain tree as our final result.
  10. The final result will be determined by the optimizer in Calcite. In calcite, each node will have cost to quantify the computation power it required to run the logic in the node. Different output trees, therefore, will have different cost. The result will be the final tree with minimum cost. Back to our example, the cost of final tree on the top is 10 + 10 20, but another output tree only have one node with 15 cost. The optimizer will then pick the Druid TopN query node as the final result. Now, the final job is to render the druid json query with the logic we have in the druid query node.
  11. Now, the final job is to render the druid json query with the logic we have in the druid query node. Basically it is like a json writer that generate the query based on the information we had in the node.
  12. The whole idea of Calcite is brilliant, but at the time when we try to use it, it still has missing piece that can affect the performance. We the decided to contribute to the open source repository to enhance Calcite. These are two major problems we worked on. First, Calcite in that time did not have support on post-aggregation, so part of the computation in function like AVERAGE will be performed in Calcite instead of Druid. Moving those computation to Druid can save memory and time when running query. The second one is the support of filtered Aggregation. Without the support, calcite may assign druid to query all rows and do the filter in Calcite local machine. This could cause a huge fallback in performance when filtered aggregation is involved.
  13. To add post aggregation support, we add new rules to merge post aggregation node in supported druid query node, so the final tree can include post aggregation logic. Also, a new renderer is needed to render query with post aggregator. Now, when possible, the post aggregation will be pushed to Druid query node.
  14. Similar as the post aggregation support, we add new rules to deal with the filter node right before the aggregate node. The general idea here is to move inner filter to outer filter