SlideShare uma empresa Scribd logo
1 de 28
Baixar para ler offline
2017.06.08
SQL on Druid
4th Druid@Seoul Meetup
You sun Jeong (jerryjung@sk.com)
Index
1. What is Druid?
2. Benchmark
3. SQL on Druid
4. Q&A
2
What is Druid?
http://druid.io/
Druidis an open-source data storedesigned for sub-
second queries on real-time and historical data. It is primarily used for
business intelligence(OLAP) queries on event data. Druid provides low
latency (real-time) data ingestion, flexible data exploration, and
fast data aggregation. Existing Druid deployments have scaled to trillions of events and
petabytes of data. Druid is most commonly used to power user-facing analytic
applications.
3
Big Data Discovery
4
Druid Features
5http://www.popit.kr/ultra-fast_olap_druid/
https://hortonworks.com/blog/apache-hive-druid-part-1-3/
Pre-Aggregation & Roll-up
6
minute
Segment Management
7
Druid Architecture
8
https://en.wikipedia.org/wiki/Druid_(open-source_data_store)
Agenda
9
1. What is Druid?
2. Benchmark
3. SQL on Druid
4. Q&A
Druid vs Spark
10
http://
www.popit.kr/
druid-spark-
performance/
Druid vs Spark(Cached) vs Spark(HDFS)
11
http://www.popit.kr/druid-spark-
performance/
Interactive Analysis Capability
Druid vs Spark
12
http://www.popit.kr/druid-spark-
performance/
Agenda
13
1. What is Druid?
2. Benchmark
3. SQL on Druid
4. Q&A
Things that bother me in the druid
14
Ingestion
Query
"dataSchema" : {
"dataSource" : "wikipedia",
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "json",
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
{
"queryType": "groupBy",
"dataSource": "sample_datasource",
"granularity": "day",
"dimensions": ["country", "device"],
pyDruid 

RDruid
Join?
SQL on Druid
15
Hive Integration - Benchmark
16
https://hortonworks.com/blog/apache-hive-druid-part-1-3/
http://www.popit.kr/ultra-fast_olap_druid2/
Hive Integration - Druid Storage Handler
17
https://hortonworks.com/blog/apache-hive-druid-part-1-3/
http://www.popit.kr/ultra-fast_olap_druid2/
Hive 2.2.0 or higher
Hive Integration - Types of Analytics
18
https://hortonworks.com/blog/apache-hive-druid-part-1-3/
http://www.popit.kr/ultra-fast_olap_druid2/
Hive Integration - Create Datasource
19
CREATE EXTERNAL TABLE druid_table_1

STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'

TBLPROPERTIES ("druid.datasource" = "wikiticker");
CREATE TABLE druid_table_1

STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'

TBLPROPERTIES ("druid.datasource" = "wikiticker",
"druid.segment.granularity" = "DAY")

AS

SELECT __time, page, user, c_added, c_removed

FROM src;
Inference	of	Druid	column	types	(timestamp,	dimensions,	metrics)	
depends	on	Hive	column	type
Hive Integration - Querying Druid
20
Automatic	rewriting	when	query	is	expressed	over	Druid	table	
Powered	by	Apache	Calcite	
Main	challenge:	identify	patterns	in	logical	plan	corresponding	to	different	
kinds	of	Druid	queries	(Timeseries,	TopN,	GroupBy,	Select)	
Translate	(sub)plan	of	operators	into	valid	Druid	JSON	query	
Druid	query	is	encapsulated	within	Hive	TableScan	operator	
Hive	TableScan	uses	Druid	input	format	
Submits	query	to	Druid	and	generates	records	out	of	the	query	results	
It	might	not	be	possible	to	push	all	computation	to	Druid	
Our	contract	is	that	the	query	should	always	be	executed
https://www.slideshare.net/HadoopSummit/interactive-analytics-at-scale-in-apache-hive-using-
druid
Hive Integration - Querying Druid
21
SELECT `user`, sum(`c_added`) AS s

FROM druid_table_1

WHERE EXTRACT(year FROM `__time`)

BETWEEN 2010 AND 2011

GROUP BY `user`

ORDER BY s DESC

LIMIT 10; {	
		"queryType":	"groupBy",	
		"dataSource":	"users_index",	
		"granularity":	"all",	
		"dimension":	"user",	
		"aggregations":	[	{	"type":	"longSum",	"name":	"s",	"fieldName":	"c_added"	}	],	
		"limitSpec":	{	
				"limit":	10,	
				"columns":	[	{"dimension":	"s",	"direction":	"descending"	}	]	
		},	
		"intervals":	[	"2010-01-01T00:00:00.000/2012-01-01T00:00:00.000"	]	
}
Hive Integration - Join
22
SELECT a.channel, b.col1
FROM
(
SELECT `channel`, max(delta) as m, sum(added)
FROM druid_table_1
GROUP BY `channel`, `floor_year`(`__time`)
ORDER BY m DESC
LIMIT 1000
) a
JOIN
(
SELECT col1, col2
FROM hive_table_1
) b
ON a.channel = b.col2;
Query that runs across Druid and Hive
Spark Integration
23
CREATE TABLE if not exists orderLineItemPartSupplier
USING org.sparklinedata.druid
OPTIONS (sourceDataframe "orderLineItemPartSupplierBase",
timeDimensionColumn "l_shipdate",
druidDatasource "tpch",
druidHost "localhost",
zkQualifyDiscoveryNames "true",
columnMapping '{ "l_quantity" : "sum_l_quantity", "ps_availqty" : "sum_ps_availqty", "cn_name" : "c_nation",
"cr_name" : "c_region", "sn_name" : "s_nation", "sr_name" : "s_region" } ',
numProcessingThreadsPerHistorical '1',
starSchema ' { "factTable" : "orderLineItemPartSupplier", "relations" : [] } ')
;
JAVA_TOOL_OPTIONS=-Duser.timezone=UTC sh start-sparklinedatathriftserver.sh ~/server/spark-druid-olap/
scripts/spl-accel-assembly-0.5.0-SNAPSHOT.jar 
--driver-memory 19g --master yarn --deploy-mode client --conf spark.scheduler.mode=FAIR --properties-
file sparkline.properties
https://github.com/SparklineData/spark-druid-olap
Druid - Built-in SQL
24
Druid 0.10.0 or higher
// Connect to /druid/v2/sql/avatica/ on your broker.
String url = "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica/";
// Set any connection context parameters you need here (see "Connection context"
below).
// Or leave empty for default behavior.
Properties connectionProperties = new Properties();
try (Connection connection = DriverManager.getConnection(url, connectionProperties))
{
try (ResultSet resultSet = connection.createStatement().executeQuery("SELECT
COUNT(*) AS cnt FROM data_source")) {
while (resultSet.next()) {
// Do something
}
}
}
Druid includes a native SQL layer with an Apache Calcite-based parser and planner.
Druid - Built-in SQL
25
{
"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"
}
SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE
TABLE_SCHEMA = 'druid' AND TABLE_NAME = 'foo'
SELECT x, COUNT(*)
FROM data_source_1
WHERE x IN (SELECT x FROM data_source_2 WHERE y = 'baz')
GROUP BY x
http://druid.io/docs/latest/querying/sql.html
Agenda
26
1. What is Druid?
2. Benchmark
3. SQL on Druid
4. Q&A
May the force be with you!
27
2017.06.08
THANK YOU

Mais conteúdo relacionado

Mais procurados

Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
DataStax
 

Mais procurados (20)

Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Web analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.comWeb analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.com
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexes
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on Databricks
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
 
More Algorithms and Tools for Genomic Analysis on Apache Spark with Ryan Will...
More Algorithms and Tools for Genomic Analysis on Apache Spark with Ryan Will...More Algorithms and Tools for Genomic Analysis on Apache Spark with Ryan Will...
More Algorithms and Tools for Genomic Analysis on Apache Spark with Ryan Will...
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
 
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
 
DataStax and Esri: Geotemporal IoT Search and Analytics
DataStax and Esri: Geotemporal IoT Search and AnalyticsDataStax and Esri: Geotemporal IoT Search and Analytics
DataStax and Esri: Geotemporal IoT Search and Analytics
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Intro to py spark (and cassandra)
Intro to py spark (and cassandra)
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and Hadoop
 

Semelhante a Druid meetup 4th_sql_on_druid

Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
MongoDB
 
[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)
Steve Min
 
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
SegFaultConf
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Masayuki Matsushita
 

Semelhante a Druid meetup 4th_sql_on_druid (20)

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David Giard
 
[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
 
Bids talk 9.18
Bids talk 9.18Bids talk 9.18
Bids talk 9.18
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesReal-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
 
Hydra - Getting Started
Hydra - Getting StartedHydra - Getting Started
Hydra - Getting Started
 
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
 
Webinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseWebinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick Database
 
MongoDB Europe 2016 - Debugging MongoDB Performance
MongoDB Europe 2016 - Debugging MongoDB PerformanceMongoDB Europe 2016 - Debugging MongoDB Performance
MongoDB Europe 2016 - Debugging MongoDB Performance
 
Data science apps: beyond notebooks
Data science apps: beyond notebooksData science apps: beyond notebooks
Data science apps: beyond notebooks
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5
 
#MBLTdev: Разработка первоклассных SDK для Android (Twitter)
#MBLTdev: Разработка первоклассных SDK для Android (Twitter)#MBLTdev: Разработка первоклассных SDK для Android (Twitter)
#MBLTdev: Разработка первоклассных SDK для Android (Twitter)
 
Data Science Apps: Beyond Notebooks - Natalino Busa - Codemotion Amsterdam 2017
Data Science Apps: Beyond Notebooks - Natalino Busa - Codemotion Amsterdam 2017Data Science Apps: Beyond Notebooks - Natalino Busa - Codemotion Amsterdam 2017
Data Science Apps: Beyond Notebooks - Natalino Busa - Codemotion Amsterdam 2017
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
 

Mais de Yousun Jeong (10)

Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
Spark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesSpark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on Kubernetes
 
Kubernetes on aws
Kubernetes on awsKubernetes on aws
Kubernetes on aws
 
Kafka for begginer
Kafka for begginerKafka for begginer
Kafka for begginer
 
Data Analytics with Druid
Data Analytics with DruidData Analytics with Druid
Data Analytics with Druid
 
IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
 
Enterprise 환경에서의 오픈소스 기반 아키텍처 적용 사례
Enterprise 환경에서의 오픈소스 기반 아키텍처 적용 사례Enterprise 환경에서의 오픈소스 기반 아키텍처 적용 사례
Enterprise 환경에서의 오픈소스 기반 아키텍처 적용 사례
 
2012 07 28_cloud_reference_architecture_openplatform
2012 07 28_cloud_reference_architecture_openplatform2012 07 28_cloud_reference_architecture_openplatform
2012 07 28_cloud_reference_architecture_openplatform
 

Último

The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 

Último (20)

%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
SHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions PresentationSHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions Presentation
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Generic or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 

Druid meetup 4th_sql_on_druid

  • 1. 2017.06.08 SQL on Druid 4th Druid@Seoul Meetup You sun Jeong (jerryjung@sk.com)
  • 2. Index 1. What is Druid? 2. Benchmark 3. SQL on Druid 4. Q&A 2
  • 3. What is Druid? http://druid.io/ Druidis an open-source data storedesigned for sub- second queries on real-time and historical data. It is primarily used for business intelligence(OLAP) queries on event data. Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data aggregation. Existing Druid deployments have scaled to trillions of events and petabytes of data. Druid is most commonly used to power user-facing analytic applications. 3
  • 9. Agenda 9 1. What is Druid? 2. Benchmark 3. SQL on Druid 4. Q&A
  • 11. Druid vs Spark(Cached) vs Spark(HDFS) 11 http://www.popit.kr/druid-spark- performance/ Interactive Analysis Capability
  • 13. Agenda 13 1. What is Druid? 2. Benchmark 3. SQL on Druid 4. Q&A
  • 14. Things that bother me in the druid 14 Ingestion Query "dataSchema" : { "dataSource" : "wikipedia", "parser" : { "type" : "string", "parseSpec" : { "format" : "json", "timestampSpec" : { "column" : "timestamp", "format" : "auto" { "queryType": "groupBy", "dataSource": "sample_datasource", "granularity": "day", "dimensions": ["country", "device"], pyDruid 
 RDruid Join?
  • 16. Hive Integration - Benchmark 16 https://hortonworks.com/blog/apache-hive-druid-part-1-3/ http://www.popit.kr/ultra-fast_olap_druid2/
  • 17. Hive Integration - Druid Storage Handler 17 https://hortonworks.com/blog/apache-hive-druid-part-1-3/ http://www.popit.kr/ultra-fast_olap_druid2/ Hive 2.2.0 or higher
  • 18. Hive Integration - Types of Analytics 18 https://hortonworks.com/blog/apache-hive-druid-part-1-3/ http://www.popit.kr/ultra-fast_olap_druid2/
  • 19. Hive Integration - Create Datasource 19 CREATE EXTERNAL TABLE druid_table_1
 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
 TBLPROPERTIES ("druid.datasource" = "wikiticker"); CREATE TABLE druid_table_1
 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
 TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY")
 AS
 SELECT __time, page, user, c_added, c_removed
 FROM src; Inference of Druid column types (timestamp, dimensions, metrics) depends on Hive column type
  • 20. Hive Integration - Querying Druid 20 Automatic rewriting when query is expressed over Druid table Powered by Apache Calcite Main challenge: identify patterns in logical plan corresponding to different kinds of Druid queries (Timeseries, TopN, GroupBy, Select) Translate (sub)plan of operators into valid Druid JSON query Druid query is encapsulated within Hive TableScan operator Hive TableScan uses Druid input format Submits query to Druid and generates records out of the query results It might not be possible to push all computation to Druid Our contract is that the query should always be executed https://www.slideshare.net/HadoopSummit/interactive-analytics-at-scale-in-apache-hive-using- druid
  • 21. Hive Integration - Querying Druid 21 SELECT `user`, sum(`c_added`) AS s
 FROM druid_table_1
 WHERE EXTRACT(year FROM `__time`)
 BETWEEN 2010 AND 2011
 GROUP BY `user`
 ORDER BY s DESC
 LIMIT 10; { "queryType": "groupBy", "dataSource": "users_index", "granularity": "all", "dimension": "user", "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ], "limitSpec": { "limit": 10, "columns": [ {"dimension": "s", "direction": "descending" } ] }, "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ] }
  • 22. Hive Integration - Join 22 SELECT a.channel, b.col1 FROM ( SELECT `channel`, max(delta) as m, sum(added) FROM druid_table_1 GROUP BY `channel`, `floor_year`(`__time`) ORDER BY m DESC LIMIT 1000 ) a JOIN ( SELECT col1, col2 FROM hive_table_1 ) b ON a.channel = b.col2; Query that runs across Druid and Hive
  • 23. Spark Integration 23 CREATE TABLE if not exists orderLineItemPartSupplier USING org.sparklinedata.druid OPTIONS (sourceDataframe "orderLineItemPartSupplierBase", timeDimensionColumn "l_shipdate", druidDatasource "tpch", druidHost "localhost", zkQualifyDiscoveryNames "true", columnMapping '{ "l_quantity" : "sum_l_quantity", "ps_availqty" : "sum_ps_availqty", "cn_name" : "c_nation", "cr_name" : "c_region", "sn_name" : "s_nation", "sr_name" : "s_region" } ', numProcessingThreadsPerHistorical '1', starSchema ' { "factTable" : "orderLineItemPartSupplier", "relations" : [] } ') ; JAVA_TOOL_OPTIONS=-Duser.timezone=UTC sh start-sparklinedatathriftserver.sh ~/server/spark-druid-olap/ scripts/spl-accel-assembly-0.5.0-SNAPSHOT.jar --driver-memory 19g --master yarn --deploy-mode client --conf spark.scheduler.mode=FAIR --properties- file sparkline.properties https://github.com/SparklineData/spark-druid-olap
  • 24. Druid - Built-in SQL 24 Druid 0.10.0 or higher // Connect to /druid/v2/sql/avatica/ on your broker. String url = "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica/"; // Set any connection context parameters you need here (see "Connection context" below). // Or leave empty for default behavior. Properties connectionProperties = new Properties(); try (Connection connection = DriverManager.getConnection(url, connectionProperties)) { try (ResultSet resultSet = connection.createStatement().executeQuery("SELECT COUNT(*) AS cnt FROM data_source")) { while (resultSet.next()) { // Do something } } } Druid includes a native SQL layer with an Apache Calcite-based parser and planner.
  • 25. Druid - Built-in SQL 25 { "query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'" } SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'druid' AND TABLE_NAME = 'foo' SELECT x, COUNT(*) FROM data_source_1 WHERE x IN (SELECT x FROM data_source_2 WHERE y = 'baz') GROUP BY x http://druid.io/docs/latest/querying/sql.html
  • 26. Agenda 26 1. What is Druid? 2. Benchmark 3. SQL on Druid 4. Q&A
  • 27. May the force be with you! 27