Introduction to Presto
Making SQL Scalable
Taro L. Saito

leo@treasure-data.com
Treasure Data, Inc.
How do we make SQL scalable?
• Problem
• Count access logs of each web page:
• SELECT page, count(*) FROM weblog GROUP BY page
• A Challenge
• How do you process millions of records in a second?
• Making SQL scalable enough to handle large data sets
2
3
Hive
• Translate SQL into MapReduce (Hadoop) programs
• MapReduce:
• Does the same job by using many machines
(Diagram: Single CPU Job vs. Distributed Processing; the input on HDFS is split into blocks A0-A2 and B0-B3, processed through map, reduce, and merge steps, and written back to HDFS.)
SQL to MapReduce
• Mapping SQL stages into a MapReduce program
• SELECT page, count(*) FROM weblog GROUP BY page
4
(Diagram: the query mapped onto the split/map/reduce/merge flow over HDFS: TableScan(weblog) feeds GroupBy(hash(page)), which feeds count(weblog of a page), producing the result.)
HDFS is the bottleneck
• HDFS (Hadoop Distributed File System)
• Used for storing intermediate results
• Provides fault-tolerance, but slow
5
(Diagram: the same split/map/reduce/merge flow, with HDFS used to store the intermediate results between stages.)
Presto
• Distributed query engine developed by Facebook
• Uses HTTP for data transfer
• No intermediate storage like HDFS
• No fault-tolerance (but failure rate is less than 0.2%)
• Pipelining data transfer and data processing
6
(Diagram: the same TableScan(weblog), GroupBy(hash(page)), count, result flow, but with no HDFS between stages; data is pipelined directly from one operator to the next over HTTP.)
Architecture Comparison
7
• Performance: Hive = Slow; Presto = Fast; Spark = Fast; BigQuery = Ultra Fast (using many disks)
• Intermediate Storage: Hive = HDFS; Presto = None; Spark = Memory/Disk; BigQuery = Colossus (?)
• Data Transfer: Hive = HTTP; Presto = HTTP; Spark = HTTP; BigQuery = ?
• Query Execution: Hive = Stage-wise MapReduce; Presto = Run all stages at once (pipelining); Spark = Stage-wise; BigQuery = ?
• Fault Tolerance: Hive = Yes; Presto = None (but TD will retry the query from scratch); Spark = Yes, but limited; BigQuery = ?
• Multiple Job Support: Hive = Good, can handle many jobs; Presto = Limited (~5 concurrent queries per account in TD); Spark = Requires another resource manager (e.g. YARN, Mesos); BigQuery = Limited (query queue)
Presto Usage Stats
• More than 99.8% of queries finish without any error
• ~90% of queries finish within 1 minute
• Treasure Data Presto stats:
• Processing more than 100,000 queries / day
• Processing 15 trillion records / day
• Facebook's stats:
• 30,000~100,000 queries / day
• 1 trillion records / day
• Treasure Data is the No. 1 Presto user in the world
8
Presto can process more than 1M rows/sec.
9
Presto Overview
• A distributed SQL engine developed by Facebook
• For interactive analysis on petabyte-scale datasets
• As a replacement for Hive
• Nov. 2013: Open-sourced on GitHub
• Facebook now has 12 engineers working on Presto
• Code
• In-memory query engine, written in Java
• Based on ANSI SQL syntax
• Isolates the query execution layer from the storage access layer
• Connectors provide data access methods
• Cassandra / Hive / JMX / Kafka / MySQL / PostgreSQL / MongoDB / System / TPCH connectors
• td-presto is our connector to access PlazmaDB (a columnar MessagePack database)
10
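A connector is addressed in SQL through a catalog-qualified name, catalog.schema.table. As a rough sketch only, with hypothetical catalog, schema, and table names for the Hive and MySQL connectors listed above, a single query can even combine data from two connectors:

SELECT w.page, count(*) AS views
FROM hive.web.weblog w
JOIN mysql.crm.page_owners p
  ON w.page = p.page
GROUP BY w.page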
Architectural overview
11
https://prestodb.io/overview.html
With Hive connector
Presto Users
• Facebook
12
• Dropbox
13
• Airbnb
14
Interactive Analysis with TD Presto + Jupyter
15
• https://github.com/treasure-data/td-jupyter-notebooks/blob/master/imported/pandas-td-tutorial.ipynb
Presto Internal

Query Execution
Presto Architecture
(Diagram: a query is decomposed into stages (Stage 0, Stage 1, Stage 2) corresponding to Output, Aggregation (GROUP BY), and TableScan (FROM). Each stage runs as tasks (Task 0.0; Task 1.0, 1.1, 1.2; Task 2.0, 2.1, 2.2) scheduled on workers (@worker#0, @worker#2, @worker#3), and each task processes one or more splits.)
Logical Query Plan

select
  c.nationkey,
  count(1)
from orders o
join customer c
  on o.custkey = c.custkey
where
  o.orderpriority = '1-URGENT'
group by c.nationkey

Output[nationkey, _col1] => [nationkey:bigint, count:bigint]
    - _col1 := count
  Exchange[GATHER] => nationkey:bigint, count:bigint
    Aggregate(FINAL)[nationkey] => [nationkey:bigint, count:bigint]
        - count := "count"("count_15")
      Exchange[REPARTITION] => nationkey:bigint, count_15:bigint
        Aggregate(PARTIAL)[nationkey] => [nationkey:bigint, count_15:bigint]
            - count_15 := "count"("expr")
          Project => [nationkey:bigint, expr:bigint]
              - expr := 1
            InnerJoin[("custkey" = "custkey_0")] => [custkey:bigint, custkey_0:bigint, nationkey:bigint]
              Project => [custkey:bigint]
                Filter[("orderpriority" = '1-URGENT')] => [custkey:bigint, orderpriority:varchar]
                  TableScan[tpch:tpch:orders:sf0.01, original constraint=('1-URGENT' = "orderpriority")] => [custkey:bigint, orderpriority:varchar]
                      - custkey := tpch:custkey:1
                      - orderpriority := tpch:orderpriority:5
              Exchange[REPLICATE] => custkey_0:bigint, nationkey:bigint
                TableScan[tpch:tpch:customer:sf0.01, original constraint=true] => [custkey_0:bigint, nationkey:bigint]
                    - custkey_0 := tpch:custkey:0
                    - nationkey := tpch:nationkey:3
Table Scan
(The same logical plan and query, annotated with Stage 3: the table scans.)
Logical Plan Optimization
(The same logical plan and query, now annotated with Stage 3 and Stage 2.)
(The same logical plan and query, annotated with Stage 3, Stage 2, and Stage 1.)
(The same logical plan and query, annotated with Stage 3 through Stage 0; Stage 0 outputs the query results as JSON.)
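A plan like the one above can be inspected directly from the Presto client with EXPLAIN (the exact output format varies by Presto version); EXPLAIN (TYPE DISTRIBUTED) additionally shows how the plan is split into stages:

EXPLAIN (TYPE DISTRIBUTED)
select
  c.nationkey,
  count(1)
from orders o
join customer c
  on o.custkey = c.custkey
where
  o.orderpriority = '1-URGENT'
group by c.nationkey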
TD Storage Architecture
23
(Diagram: incoming logs first land in Real-Time Storage as many small log files. A Hadoop MapReduce log-merge job consolidates them into Archive Storage as 1-hour partitions using time column-based partitioning: 2015-09-29 01:00:00, 02:00:00, 03:00:00, …. Hive and Presto run as distributed SQL query engines on top of this storage.)
Utilizing Time Index
24
(Diagram: the table is split into 1-hour partitions by time column-based partitioning: 2015-09-29 01:00:00, 02:00:00, 03:00:00, ….)
• Partial Scan: TD_TIME_RANGE(time, '2015-09-29 02:00:00', '2015-09-29 03:00:00') lets Hive/Presto read only the matching 1-hour partitions to produce the query results.
• Full Scan: TD_TIME_RANGE(non_time_column, '2015-09-29 02:00:00', '2015-09-29 03:00:00') cannot use the time index, so the whole data set is scanned.
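For example, with the weblog table from the earlier slides (a sketch; non_time_column stands for any column other than time):

-- Partial scan: only the 1-hour partitions inside the range are read
SELECT page, count(*)
FROM weblog
WHERE TD_TIME_RANGE(time, '2015-09-29 02:00:00', '2015-09-29 03:00:00')
GROUP BY page

-- Full scan: the time index cannot be used on other columns
SELECT page, count(*)
FROM weblog
WHERE TD_TIME_RANGE(non_time_column, '2015-09-29 02:00:00', '2015-09-29 03:00:00')
GROUP BY page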
Queries with huge results
• SELECT col1, col2, col3, … FROM …
• Presto reads the query results in JSON (a single-thread task: slow)
• INSERT INTO (table) SELECT col1, col2, …
• or CREATE TABLE AS
• Directly creates 1-hour partitions (msgpack.gz on Amazon S3) from the query results
• Runs in parallel: fast
25
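A sketch of the two patterns (table and column names are illustrative):

-- Slow: all results flow back through the single-threaded JSON output
SELECT col1, col2, col3 FROM weblog

-- Fast: results are written out as table partitions in parallel
INSERT INTO weblog_extract
SELECT col1, col2, col3 FROM weblog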
Memory Consuming Operators
• DISTINCT col1, col2, … (duplicate elimination)
• Needs to store the whole data set in a single node
• COUNT(DISTINCT col1), etc.
• Use approx_distinct(col1) instead
• ORDER BY col1, col2, …
• A single-node task (in Presto)
• UNION
• Performs duplicate elimination (single node)
• Use UNION ALL instead
26
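Two rewrites in the spirit of this slide (user_id and the monthly tables are illustrative names):

-- Exact distinct count: every value must be gathered on a single node
SELECT count(DISTINCT user_id) FROM weblog
-- Approximate distinct count: runs distributed, with a small bounded error
SELECT approx_distinct(user_id) FROM weblog

-- UNION deduplicates rows on a single node; UNION ALL simply concatenates
SELECT page FROM weblog_2015_09
UNION ALL
SELECT page FROM weblog_2015_10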
Finding bottlenecks
• Table scan range
• Check the TD_TIME_RANGE condition
• DISTINCT
• Duplicate elimination of all selected columns (single node)
• Slow and memory-consuming
• Huge result output
• Output Stage (0) becomes the bottleneck
• Use DROP TABLE IF EXISTS …, then CREATE TABLE AS SELECT …
27
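For the huge-result case, the suggestion on this slide looks roughly like this (page_counts is an illustrative target table):

DROP TABLE IF EXISTS page_counts;

CREATE TABLE page_counts AS
SELECT page, count(*) AS cnt
FROM weblog
WHERE TD_TIME_RANGE(time, '2015-09-29 00:00:00', '2015-09-30 00:00:00')
GROUP BY page;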
Resources
• Presto Query FAQs
• https://docs.treasuredata.com/articles/presto-query-faq
• Presto Documentation
• https://prestodb.io/docs
28

Mais conteúdo relacionado

Mais procurados

Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaShiao-An Yuan
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
 
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with  Apache Pulsar and Apache PinotBuilding a Real-Time Analytics Application with  Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with Apache Pulsar and Apache PinotAltinity Ltd
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기NAVER D2
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine kiran palaka
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producerconfluent
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)DataStax Academy
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 

Mais procurados (20)

Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with  Apache Pulsar and Apache PinotBuilding a Real-Time Analytics Application with  Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producer
 
Druid
DruidDruid
Druid
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
Druid deep dive
Druid deep diveDruid deep dive
Druid deep dive
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 

Destaque

Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CAPresto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CAkbajda
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...viirya
 
Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkkbajda
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016kbajda
 
C#でもメタプログラミングがしたい!!
C#でもメタプログラミングがしたい!!C#でもメタプログラミングがしたい!!
C#でもメタプログラミングがしたい!!TATSUYA HAYAMIZU
 
RuntimeUnitTestToolkit for Unity(English)
RuntimeUnitTestToolkit for Unity(English)RuntimeUnitTestToolkit for Unity(English)
RuntimeUnitTestToolkit for Unity(English)Yoshifumi Kawai
 
Presto - Analytical Database. Overview and use cases.
Presto - Analytical Database. Overview and use cases.Presto - Analytical Database. Overview and use cases.
Presto - Analytical Database. Overview and use cases.Wojciech Biela
 
Spark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanSpark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanTaro L. Saito
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentationCyanny LIANG
 
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例Yoshifumi Kawai
 
Sparkをノートブックにまとめちゃおう。Zeppelinでね!(Hadoopソースコードリーディング 第19回 発表資料)
Sparkをノートブックにまとめちゃおう。Zeppelinでね!(Hadoopソースコードリーディング 第19回 発表資料)Sparkをノートブックにまとめちゃおう。Zeppelinでね!(Hadoopソースコードリーディング 第19回 発表資料)
Sparkをノートブックにまとめちゃおう。Zeppelinでね!(Hadoopソースコードリーディング 第19回 発表資料)NTT DATA OSS Professional Services
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Sadayuki Furuhashi
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Amazon Web Services
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis FirehoseAmazon Web Services
 
A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...
A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...
A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...SlideShare
 
2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShareSlideShare
 
What to Upload to SlideShare
What to Upload to SlideShareWhat to Upload to SlideShare
What to Upload to SlideShareSlideShare
 
Getting Started With SlideShare
Getting Started With SlideShareGetting Started With SlideShare
Getting Started With SlideShareSlideShare
 

Destaque (20)

Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CAPresto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
 
Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talk
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016
 
C#でもメタプログラミングがしたい!!
C#でもメタプログラミングがしたい!!C#でもメタプログラミングがしたい!!
C#でもメタプログラミングがしたい!!
 
Amazon S3 Overview
Amazon S3 OverviewAmazon S3 Overview
Amazon S3 Overview
 
RuntimeUnitTestToolkit for Unity(English)
RuntimeUnitTestToolkit for Unity(English)RuntimeUnitTestToolkit for Unity(English)
RuntimeUnitTestToolkit for Unity(English)
 
Presto - Analytical Database. Overview and use cases.
Presto - Analytical Database. Overview and use cases.Presto - Analytical Database. Overview and use cases.
Presto - Analytical Database. Overview and use cases.
 
Spark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanSpark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in Japan
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例
 
Sparkをノートブックにまとめちゃおう。Zeppelinでね!(Hadoopソースコードリーディング 第19回 発表資料)
Sparkをノートブックにまとめちゃおう。Zeppelinでね!(Hadoopソースコードリーディング 第19回 発表資料)Sparkをノートブックにまとめちゃおう。Zeppelinでね!(Hadoopソースコードリーディング 第19回 発表資料)
Sparkをノートブックにまとめちゃおう。Zeppelinでね!(Hadoopソースコードリーディング 第19回 発表資料)
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
 
A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...
A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...
A Guide to SlideShare Analytics - Excerpts from Hubspot's Step by Step Guide ...
 
2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare
 
What to Upload to SlideShare
What to Upload to SlideShareWhat to Upload to SlideShare
What to Upload to SlideShare
 
Getting Started With SlideShare
Getting Started With SlideShareGetting Started With SlideShare
Getting Started With SlideShare
 

Semelhante a Introduction to Making SQL Scalable with Presto

Presto in Treasure Data (presented at db tech showcase Sapporo 2015)
Presto in Treasure Data (presented at db tech showcase Sapporo 2015)Presto in Treasure Data (presented at db tech showcase Sapporo 2015)
Presto in Treasure Data (presented at db tech showcase Sapporo 2015)Mitsunori Komatsu
 
Timeseries - data visualization in Grafana
Timeseries - data visualization in GrafanaTimeseries - data visualization in Grafana
Timeseries - data visualization in GrafanaOCoderFest
 
Anais Dotis-Georgiou & Faith Chikwekwe [InfluxData] | Top 10 Hurdles for Flux...
Anais Dotis-Georgiou & Faith Chikwekwe [InfluxData] | Top 10 Hurdles for Flux...Anais Dotis-Georgiou & Faith Chikwekwe [InfluxData] | Top 10 Hurdles for Flux...
Anais Dotis-Georgiou & Faith Chikwekwe [InfluxData] | Top 10 Hurdles for Flux...InfluxData
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADtab0ris_1
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAPEDB
 
Building a Complex, Real-Time Data Management Application
Building a Complex, Real-Time Data Management ApplicationBuilding a Complex, Real-Time Data Management Application
Building a Complex, Real-Time Data Management ApplicationJonathan Katz
 
Social Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDBSocial Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDBTakahiro Inoue
 
MySQL performance monitoring using Statsd and Graphite
MySQL performance monitoring using Statsd and GraphiteMySQL performance monitoring using Statsd and Graphite
MySQL performance monitoring using Statsd and GraphiteDB-Art
 
How we switched to columnar at SpendHQ
How we switched to columnar at SpendHQHow we switched to columnar at SpendHQ
How we switched to columnar at SpendHQMariaDB plc
 
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0Petr Zapletal
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevAltinity Ltd
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015Debashis Saha
 
Postgres & Redis Sitting in a Tree- Rimas Silkaitis, Heroku
Postgres & Redis Sitting in a Tree- Rimas Silkaitis, HerokuPostgres & Redis Sitting in a Tree- Rimas Silkaitis, Heroku
Postgres & Redis Sitting in a Tree- Rimas Silkaitis, HerokuRedis Labs
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...InfluxData
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure TechnologiesKoray Kocabas
 

Semelhante a Introduction to Making SQL Scalable with Presto (20)

Presto in Treasure Data (presented at db tech showcase Sapporo 2015)
Presto in Treasure Data (presented at db tech showcase Sapporo 2015)Presto in Treasure Data (presented at db tech showcase Sapporo 2015)
Presto in Treasure Data (presented at db tech showcase Sapporo 2015)
 
Presto in Treasure Data
Presto in Treasure DataPresto in Treasure Data
Presto in Treasure Data
 
Timeseries - data visualization in Grafana
Timeseries - data visualization in GrafanaTimeseries - data visualization in Grafana
Timeseries - data visualization in Grafana
 
Anais Dotis-Georgiou & Faith Chikwekwe [InfluxData] | Top 10 Hurdles for Flux...
Anais Dotis-Georgiou & Faith Chikwekwe [InfluxData] | Top 10 Hurdles for Flux...Anais Dotis-Georgiou & Faith Chikwekwe [InfluxData] | Top 10 Hurdles for Flux...
Anais Dotis-Georgiou & Faith Chikwekwe [InfluxData] | Top 10 Hurdles for Flux...
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
Building a Complex, Real-Time Data Management Application
Building a Complex, Real-Time Data Management ApplicationBuilding a Complex, Real-Time Data Management Application
Building a Complex, Real-Time Data Management Application
 
Social Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDBSocial Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDB
 
MySQL performance monitoring using Statsd and Graphite
MySQL performance monitoring using Statsd and GraphiteMySQL performance monitoring using Statsd and Graphite
MySQL performance monitoring using Statsd and Graphite
 
How we switched to columnar at SpendHQ
How we switched to columnar at SpendHQHow we switched to columnar at SpendHQ
How we switched to columnar at SpendHQ
 
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015
 
Flink internals web
Flink internals web Flink internals web
Flink internals web
 
Postgres & Redis Sitting in a Tree- Rimas Silkaitis, Heroku
Postgres & Redis Sitting in a Tree- Rimas Silkaitis, HerokuPostgres & Redis Sitting in a Tree- Rimas Silkaitis, Heroku
Postgres & Redis Sitting in a Tree- Rimas Silkaitis, Heroku
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure Technologies
 

Mais de Taro L. Saito

Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Taro L. Saito
 
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Taro L. Saito
 
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020Taro L. Saito
 
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020Taro L. Saito
 
Airframe Meetup #3: 2019 Updates & AirSpec
Airframe Meetup #3: 2019 Updates & AirSpecAirframe Meetup #3: 2019 Updates & AirSpec
Airframe Meetup #3: 2019 Updates & AirSpecTaro L. Saito
 
Presto At Arm Treasure Data - 2019 Updates
Presto At Arm Treasure Data - 2019 UpdatesPresto At Arm Treasure Data - 2019 Updates
Presto At Arm Treasure Data - 2019 UpdatesTaro L. Saito
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of PrestoTaro L. Saito
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataTaro L. Saito
 
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018Taro L. Saito
 
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17Taro L. Saito
 
Tips For Maintaining OSS Projects
Tips For Maintaining OSS ProjectsTips For Maintaining OSS Projects
Tips For Maintaining OSS ProjectsTaro L. Saito
 
Learning Silicon Valley Culture
Learning Silicon Valley CultureLearning Silicon Valley Culture
Learning Silicon Valley CultureTaro L. Saito
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure DataTaro L. Saito
 
Scala at Treasure Data
Scala at Treasure DataScala at Treasure Data
Scala at Treasure DataTaro L. Saito
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoTaro L. Saito
 
Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Taro L. Saito
 
Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Taro L. Saito
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringTaro L. Saito
 

Mais de Taro L. Saito (20)

Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
 
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
 
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
 
Airframe RPC
Airframe RPCAirframe RPC
Airframe RPC
 
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
 
Airframe Meetup #3: 2019 Updates & AirSpec
Airframe Meetup #3: 2019 Updates & AirSpecAirframe Meetup #3: 2019 Updates & AirSpec
Airframe Meetup #3: 2019 Updates & AirSpec
 
Presto At Arm Treasure Data - 2019 Updates
Presto At Arm Treasure Data - 2019 UpdatesPresto At Arm Treasure Data - 2019 Updates
Presto At Arm Treasure Data - 2019 Updates
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of Presto
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
 
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
 
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
 
Tips For Maintaining OSS Projects
Tips For Maintaining OSS ProjectsTips For Maintaining OSS Projects
Tips For Maintaining OSS Projects
 
Learning Silicon Valley Culture
Learning Silicon Valley CultureLearning Silicon Valley Culture
Learning Silicon Valley Culture
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
Scala at Treasure Data
Scala at Treasure DataScala at Treasure Data
Scala at Treasure Data
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. Tokyo
 
Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015
 
Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例
 
JNuma Library
JNuma LibraryJNuma Library
JNuma Library
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoring
 

Último

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Último (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Introduction to Making SQL Scalable with Presto

  • 1. Introduction to Presto Making SQL Scalable Taro L. Saito
 leo@treasure-data.com Treasure Data, Inc.
  • 2. How do we make SQL scalable? • Problem • Count access logs of each web page: • SELECT page, count(*) FROM weblog
 GROUP BY page • A Challenge • How do you process millions of records in a second? • Making SQL scalable enough to handle large data set 2
  • 3. 3 HDFS • Translate SQL into MapReduce (Hadoop) programs • MapReduce: • Does the same job by using many machines Hive A B A0 B0 A1 A2 B B1 B2 B3 A map reduce mergesplit HDFS Single CPU Job Distributed Processing
  • 4. SQL to MapReduce • Mapping SQL stages into MapReduce program • SELECT page, count(*) FROM weblog
 GROUP BY page 4 HDFS A0 B0 A1 A2 B B1 B2 B3 A map reduce mergesplit HDFS TableScan(weblog) GroupBy(hash(page)) count(weblog of a page) result
  • 5. HDFS is the bottleneck • HDFS (Hadoop File System) • Used for storing intermediate results • Provides fault-tolerance, but slow 5 HDFS A0 B0 A1 A2 B B1 B2 B3 A map reduce mergesplit HDFS TableScan(weblog) GroupBy(hash(page)) count(weblog of a page) result
  • 6. Presto • Distributed query engine developed by Facebook • Uses HTTP for data transfer • No intermediate storage like HDFS • No fault-tolerance (but failure rate is less than 0.2%) • Pipelining data transfer and data processing 6 A0 B0 A1 A2 B B1 B2 B3 A map reduce mergesplit TableScan(weblog) GroupBy(hash(page)) count(weblog of a page) result
  • 7. Architecture Comparison 7 Hive Presto Spark BigQuery Performance Slow Fast Fast Ultra Fast (using many disks) Intermediate Storage HDFS None Memory/Disk Colossus (?) Data Transfer HTTP HTTP HTTP ? Query Execution Stage-wize
 MapReduce Run all stages
 at once (pipelining) Stage-wise ? Fault Tolerance Yes None (but, TD will retry the query) fromscratch) Yes, but limited ? Multiple Job Support Good
 Can handle many jobs limited (~ 5 concurrent queries per account in TD)
 Require another resource manager (e.g. YARN, mesos) limited (Query queue)
  • 8. Presto Usage Stats • More than 99.8% queries finishes without any error • 90%~ of queries finishes within 1 minute • Treasure Data Presto Stats • Processing more than 100,000 queries / day • Processing 15 trillion records / day • Facebook’s stat: • 30,000~100,000 queries / day • 1 trillion records / day • Treasure data is No.1 Presto user in the world 8
  • 9. Presto can process more than 1M rows /sec. • N 9
  • 10. Presto Overview • A distributed SQL Engine developed by Facebook • For interactive analysis on peta-scale dataset • As a replacement of Hive • Nov. 2013: Open sourced at GitHub • Facebook now has 12 engineers working on Presto • Code • In-memory query engine, written in Java • Based on ANSI SQL syntax • Isolating query execution layer and storage access layer • Connector provides data access methods • Cassandra / Hive / JMX / Kafka / MySQL / PostgreSQL / MongoDB / System / TPCH connectors • td-presto is our connector to access PlazmaDB (Columnar Message Pack Database) 10
  • 15. Interactive Analysis with TD Presto + Jupyter 15 • https://github.com/treasure-data/td- jupyter-notebooks/blob/master/ imported/pandas-td-tutorial.ipynb
  • 17. Stage 1 Stage 2 Stage 0 Presto Architecture Query Task 0.0 Split Task 1.0 Split Task 1.1 Task 1.2 Split Split Split Task 2.0 Split Task 2.1 Task 2.2 Split Split Split Split Split Split Split Split TableScan (FROM) Aggregation (GROUP BY) Output @worker#2 @worker#3 @worker#0
  • 18. Logical Query Plan
    Output[nationkey, _col1] => [nationkey:bigint, count:bigint]
      - _col1 := count
    Exchange[GATHER] => nationkey:bigint, count:bigint
    Aggregate(FINAL)[nationkey] => [nationkey:bigint, count:bigint]
      - count := "count"("count_15")
    Exchange[REPARTITION] => nationkey:bigint, count_15:bigint
    Aggregate(PARTIAL)[nationkey] => [nationkey:bigint, count_15:bigint]
      - count_15 := "count"("expr")
    Project => [nationkey:bigint, expr:bigint]
      - expr := 1
    InnerJoin[("custkey" = "custkey_0")] => [custkey:bigint, custkey_0:bigint, nationkey:bigint]
      Project => [custkey:bigint]
      Filter[("orderpriority" = '1-URGENT')] => [custkey:bigint, orderpriority:varchar]
      TableScan[tpch:tpch:orders:sf0.01, original constraint=('1-URGENT' = "orderpriority")] => [custkey:bigint, orderpriority:varchar]
        - custkey := tpch:custkey:1
        - orderpriority := tpch:orderpriority:5
      Exchange[REPLICATE] => custkey_0:bigint, nationkey:bigint
      TableScan[tpch:tpch:customer:sf0.01, original constraint=true] => [custkey_0:bigint, nationkey:bigint]
        - custkey_0 := tpch:custkey:0
        - nationkey := tpch:nationkey:3
    The plan above corresponds to the query:
    select
      c.nationkey,
      count(1)
    from orders o join customer c
      on o.custkey = c.custkey
    where o.orderpriority = '1-URGENT'
    group by c.nationkey
  • 19. Stage assignment, step 1: the same logical plan and query as slide 18, with the TableScan side grouped into Stage 3 (Table Scan).
  • 20. Logical Plan Optimization: the same plan and query, now split into Stage 3 and Stage 2.
  • 21. The same plan and query, split into Stages 3, 2 and 1.
  • 22. The same plan and query, split into Stages 3, 2, 1 and 0; Stage 0 (Output) returns the query results as JSON. (A sketch of how to view such plans with EXPLAIN follows below.)
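  Plans like the one on slide 18 can also be generated directly from SQL. As a rough sketch (assuming the TPC-H connector is mounted as the tpch catalog with its built-in tiny schema; adjust the names to your setup), EXPLAIN prints the logical plan and EXPLAIN (TYPE DISTRIBUTED) shows how it is cut into the stages highlighted on slides 19 to 22:

    -- Logical plan, similar to the output shown on slide 18
    EXPLAIN
    SELECT c.nationkey, count(1)
    FROM tpch.tiny.orders o JOIN tpch.tiny.customer c
      ON o.custkey = c.custkey
    WHERE o.orderpriority = '1-URGENT'
    GROUP BY c.nationkey;

    -- Distributed plan, split into per-stage fragments
    EXPLAIN (TYPE DISTRIBUTED)
    SELECT c.nationkey, count(1)
    FROM tpch.tiny.orders o JOIN tpch.tiny.customer c
      ON o.custkey = c.custkey
    WHERE o.orderpriority = '1-URGENT'
    GROUP BY c.nationkey;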
  • 23. TD Storage Architecture (diagram): incoming logs arrive as many small log files in Real-Time Storage; a Hadoop MapReduce log-merge job compacts them into 1-hour partitions in Archive Storage using time column-based partitioning (2015-09-29 01:00:00, 02:00:00, 03:00:00, …); Hive and Presto, the distributed SQL query engines, read from both storages.
  • 24. Utilizing the Time Index • Partial scan: TD_TIME_RANGE(time, '2015-09-29 02:00:00', '2015-09-29 03:00:00') lets Hive/Presto read only the 1-hour partitions that overlap the requested range (time column-based partitioning) • Full scan: TD_TIME_RANGE(non_time_column, '2015-09-29 02:00:00', '2015-09-29 03:00:00') cannot use the time index, so the whole data set is scanned (a sketch follows below)
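  As a hedged illustration of the two cases above (the www_access table is hypothetical, and non_time_column stands for any column other than the partitioning time column), the first query reads only the partitions covering 02:00 to 03:00, while the second touches every partition:

    -- Partial scan: the predicate is on the time column, so only matching partitions are read
    SELECT count(*)
    FROM www_access
    WHERE TD_TIME_RANGE(time, '2015-09-29 02:00:00', '2015-09-29 03:00:00');

    -- Full scan: the predicate is not on the time column, so every partition is scanned
    SELECT count(*)
    FROM www_access
    WHERE TD_TIME_RANGE(non_time_column, '2015-09-29 02:00:00', '2015-09-29 03:00:00');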
  • 25. Queries with huge results • A plain SELECT col1, col2, col3, … FROM … makes Presto read the query results back as JSON in a single-threaded output task: slow • INSERT INTO (table) SELECT col1, col2, … or CREATE TABLE AS instead creates 1-hour partitions (msgpack.gz on Amazon S3) directly from the query results, and runs in parallel: fast (sketched below)
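  A minimal sketch of the two patterns, assuming a hypothetical www_access source table and a pre-created page_views destination table: the first query streams a large result back through the single-threaded output stage, while the second writes partitions directly and in parallel.

    -- Slow: the whole result set is read back as JSON through one output task
    SELECT path, referer, agent
    FROM www_access;

    -- Fast: the result is written straight into a table as partitions on S3
    INSERT INTO page_views
    SELECT path, referer, agent
    FROM www_access;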
  • 26. Memory-consuming operators • DISTINCT col1, col2, … (duplicate elimination) • Needs to hold the whole data set on a single node • COUNT(DISTINCT col1), etc. • Use approx_distinct(col1) instead • ORDER BY col1, col2, … • A single-node task (in Presto) • UNION • Performs duplicate elimination (single node) • Use UNION ALL (examples below)
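  A short sketch of the substitutions suggested above, using hypothetical table names and a user_id column; approx_distinct and UNION ALL are standard Presto SQL:

    -- Exact distinct count: all distinct user_ids must be held on a single node
    SELECT count(DISTINCT user_id) FROM www_access;

    -- Approximate distinct count: small, bounded memory usage
    SELECT approx_distinct(user_id) FROM www_access;

    -- UNION deduplicates on one node; UNION ALL simply concatenates the inputs
    SELECT user_id FROM www_access_2015
    UNION ALL
    SELECT user_id FROM www_access_2016;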
  • 27. Finding bottlenecks • Table scan range • Check the TD_TIME_RANGE condition • DISTINCT • Duplicate elimination over all selected columns (single node): slow and memory-consuming • Huge result output • The output stage (Stage 0) becomes the bottleneck • Use DROP TABLE IF EXISTS …, then CREATE TABLE AS SELECT … (example below)
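  Putting the advice together, a hedged sketch (top_pages and www_access are hypothetical names) that limits the scan range with TD_TIME_RANGE and avoids a huge Stage 0 output by materializing the result as a table:

    -- Recreate the result table instead of pulling a huge result through Stage 0
    DROP TABLE IF EXISTS top_pages;

    CREATE TABLE top_pages AS
    SELECT path, count(*) AS cnt
    FROM www_access
    WHERE TD_TIME_RANGE(time, '2015-09-29 00:00:00', '2015-09-30 00:00:00')
    GROUP BY path;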
  • 28. Resources • Presto Query FAQs • https://docs.treasuredata.com/articles/presto-query-faq • Presto Documentation • https://prestodb.io/docs