
Druid meetup 4th_sql_on_druid

How to use SQL on Druid.

Published in: Software

  1. 2017.06.08 SQL on Druid, 4th Druid@Seoul Meetup. You sun Jeong (jerryjung@sk.com)
  2. Index: 1. What is Druid? 2. Benchmark 3. SQL on Druid 4. Q&A
  3. What is Druid? http://druid.io/ Druid is an open-source data store designed for sub-second queries on real-time and historical data. It is primarily used for business intelligence (OLAP) queries on event data. Druid provides low-latency (real-time) data ingestion, flexible data exploration, and fast data aggregation. Existing Druid deployments have scaled to trillions of events and petabytes of data. Druid is most commonly used to power user-facing analytic applications.
  4. Big Data Discovery
  5. Druid Features
     http://www.popit.kr/ultra-fast_olap_druid/
     https://hortonworks.com/blog/apache-hive-druid-part-1-3/
  6. Pre-Aggregation & Roll-up (minute granularity)
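The roll-up idea on this slide can be sketched in a few lines: at ingestion time, timestamps are truncated to the configured granularity and metrics are pre-aggregated, so one row per (time bucket, dimension combination) is stored instead of one row per raw event. The event data below is illustrative, not from the talk.

```python
from collections import defaultdict
from datetime import datetime

# Raw events: (timestamp, page, count). Roll-up truncates each timestamp to the
# ingestion granularity (here: minute) and sums the metric per bucket, so the
# store keeps one row per (minute, page) instead of one row per event.
events = [
    ("2017-06-08T10:00:12", "home", 1),
    ("2017-06-08T10:00:45", "home", 1),
    ("2017-06-08T10:01:03", "home", 1),
    ("2017-06-08T10:00:30", "about", 1),
]

def rollup(events):
    rolled = defaultdict(int)
    for ts, page, count in events:
        minute = datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H:%M")
        rolled[(minute, page)] += count
    return dict(rolled)

print(rollup(events))
# {('2017-06-08T10:00', 'home'): 2, ('2017-06-08T10:01', 'home'): 1, ('2017-06-08T10:00', 'about'): 1}
```

Four raw rows collapse into three stored rows; with high-cardinality event streams the reduction is usually far larger, which is what makes sub-second aggregation queries feasible.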
  7. Segment Management
  8. Druid Architecture
     https://en.wikipedia.org/wiki/Druid_(open-source_data_store)
  9. Agenda: 1. What is Druid? 2. Benchmark 3. SQL on Druid 4. Q&A
  10. Druid vs Spark
      http://www.popit.kr/druid-spark-performance/
  11. Druid vs Spark(Cached) vs Spark(HDFS): Interactive Analysis Capability
      http://www.popit.kr/druid-spark-performance/
  12. Druid vs Spark
      http://www.popit.kr/druid-spark-performance/
  13. Agenda: 1. What is Druid? 2. Benchmark 3. SQL on Druid 4. Q&A
  14. Things that bother me in Druid
      Ingestion spec (excerpt):
        "dataSchema" : {
          "dataSource" : "wikipedia",
          "parser" : {
            "type" : "string",
            "parseSpec" : {
              "format" : "json",
              "timestampSpec" : { "column" : "timestamp", "format" : "auto" } ...
      Query (excerpt):
        {
          "queryType": "groupBy",
          "dataSource": "sample_datasource",
          "granularity": "day",
          "dimensions": ["country", "device"], ...
      pyDruid? RDruid? Join?
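The complaint on this slide is that native Druid queries are hand-written JSON. Client libraries such as pyDruid essentially build that JSON for you; a minimal sketch of what that looks like is below. The datasource name, interval, and broker URL are illustrative, not from a real cluster.

```python
import json

# Build a Druid native groupBy query spec programmatically, roughly what
# wrappers like pyDruid do before POSTing it to the broker.
def group_by_spec(datasource, dimensions, aggregations, intervals, granularity="day"):
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": granularity,
        "dimensions": dimensions,
        "aggregations": aggregations,
        "intervals": intervals,
    }

spec = group_by_spec(
    "sample_datasource",
    dimensions=["country", "device"],
    aggregations=[{"type": "longSum", "name": "total", "fieldName": "count"}],
    intervals=["2017-06-01/2017-06-08"],
)
print(json.dumps(spec, indent=2))
# The spec would then be POSTed to the broker's native query endpoint, e.g.
#   requests.post("http://localhost:8082/druid/v2/", json=spec)
```

Even with a helper like this there is no join support in the native query language, which is what motivates the SQL integrations in the rest of the talk.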
  15. SQL on Druid
  16. Hive Integration - Benchmark
      https://hortonworks.com/blog/apache-hive-druid-part-1-3/
      http://www.popit.kr/ultra-fast_olap_druid2/
  17. Hive Integration - Druid Storage Handler (Hive 2.2.0 or higher)
      https://hortonworks.com/blog/apache-hive-druid-part-1-3/
      http://www.popit.kr/ultra-fast_olap_druid2/
  18. Hive Integration - Types of Analytics
      https://hortonworks.com/blog/apache-hive-druid-part-1-3/
      http://www.popit.kr/ultra-fast_olap_druid2/
  19. Hive Integration - Create Datasource
      Register an existing Druid datasource as an external table:
        CREATE EXTERNAL TABLE druid_table_1
        STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
        TBLPROPERTIES ("druid.datasource" = "wikiticker");
      Create a new Druid datasource from a Hive table:
        CREATE TABLE druid_table_1
        STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
        TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY")
        AS
        SELECT __time, page, user, c_added, c_removed
        FROM src;
      Inference of Druid column types (timestamp, dimensions, metrics) depends on the Hive column type.
  20. Hive Integration - Querying Druid
      - Automatic rewriting when a query is expressed over a Druid table, powered by Apache Calcite
      - Main challenge: identify patterns in the logical plan corresponding to the different kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
      - Translate the (sub)plan of operators into a valid Druid JSON query
      - The Druid query is encapsulated within a Hive TableScan operator
      - Hive TableScan uses the Druid input format: it submits the query to Druid and generates records out of the query results
      - It might not be possible to push all computation to Druid; the contract is that the query should always be executed
      https://www.slideshare.net/HadoopSummit/interactive-analytics-at-scale-in-apache-hive-using-druid
  21. Hive Integration - Querying Druid
      Hive SQL:
        SELECT `user`, sum(`c_added`) AS s
        FROM druid_table_1
        WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011
        GROUP BY `user`
        ORDER BY s DESC
        LIMIT 10;
      Generated Druid query:
        {
          "queryType": "groupBy",
          "dataSource": "users_index",
          "granularity": "all",
          "dimension": "user",
          "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
          "limitSpec": { "limit": 10, "columns": [ { "dimension": "s", "direction": "descending" } ] },
          "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
        }
  22. Hive Integration - Join: a query that runs across Druid and Hive
        SELECT a.channel, b.col1
        FROM (
          SELECT `channel`, max(delta) AS m, sum(added)
          FROM druid_table_1
          GROUP BY `channel`, `floor_year`(`__time`)
          ORDER BY m DESC
          LIMIT 1000
        ) a
        JOIN (
          SELECT col1, col2
          FROM hive_table_1
        ) b
        ON a.channel = b.col2;
  23. Spark Integration (https://github.com/SparklineData/spark-druid-olap)
        CREATE TABLE IF NOT EXISTS orderLineItemPartSupplier
        USING org.sparklinedata.druid
        OPTIONS (
          sourceDataframe "orderLineItemPartSupplierBase",
          timeDimensionColumn "l_shipdate",
          druidDatasource "tpch",
          druidHost "localhost",
          zkQualifyDiscoveryNames "true",
          columnMapping '{ "l_quantity" : "sum_l_quantity", "ps_availqty" : "sum_ps_availqty", "cn_name" : "c_nation", "cr_name" : "c_region", "sn_name" : "s_nation", "sr_name" : "s_region" }',
          numProcessingThreadsPerHistorical '1',
          starSchema '{ "factTable" : "orderLineItemPartSupplier", "relations" : [] }'
        );
      Starting the Sparkline thrift server:
        JAVA_TOOL_OPTIONS=-Duser.timezone=UTC sh start-sparklinedatathriftserver.sh ~/server/spark-druid-olap/scripts/spl-accel-assembly-0.5.0-SNAPSHOT.jar --driver-memory 19g --master yarn --deploy-mode client --conf spark.scheduler.mode=FAIR --properties-file sparkline.properties
  24. Druid - Built-in SQL (Druid 0.10.0 or higher)
      Druid includes a native SQL layer with an Apache Calcite-based parser and planner.
        // Connect to /druid/v2/sql/avatica/ on your broker.
        String url = "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica/";
        // Set any connection context parameters you need here (see "Connection context" below).
        // Or leave empty for default behavior.
        Properties connectionProperties = new Properties();
        try (Connection connection = DriverManager.getConnection(url, connectionProperties)) {
          try (ResultSet resultSet = connection.createStatement()
              .executeQuery("SELECT COUNT(*) AS cnt FROM data_source")) {
            while (resultSet.next()) {
              // Do something
            }
          }
        }
  25. Druid - Built-in SQL (http://druid.io/docs/latest/querying/sql.html)
      Query over HTTP:
        { "query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'" }
      Metadata:
        SELECT * FROM INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_SCHEMA = 'druid' AND TABLE_NAME = 'foo';
      Subquery in WHERE:
        SELECT x, COUNT(*)
        FROM data_source_1
        WHERE x IN (SELECT x FROM data_source_2 WHERE y = 'baz')
        GROUP BY x;
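Besides the Avatica JDBC driver shown on the previous slide, the same SQL can be sent as a plain HTTP POST to the broker's SQL endpoint. A minimal sketch of building that request body follows; the host, port, and table name are illustrative, and sending the request obviously requires a running broker.

```python
import json

# Druid's built-in SQL layer accepts an HTTP POST to /druid/v2/sql/ on the
# broker; the request body is a JSON object carrying the query text.
payload = {"query": "SELECT COUNT(*) AS cnt FROM data_source WHERE foo = 'bar'"}
body = json.dumps(payload)
print(body)
# Sending it to an (assumed) local broker:
#   requests.post("http://localhost:8082/druid/v2/sql/",
#                 data=body,
#                 headers={"Content-Type": "application/json"})
```

This is the lowest-friction path for scripts and dashboards that just need result rows back, with no JDBC driver or hand-written native JSON query involved.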
  26. Agenda: 1. What is Druid? 2. Benchmark 3. SQL on Druid 4. Q&A
  27. May the force be with you!
  28. 2017.06.08 THANK YOU
