1. Understanding Presto
Presto meetup @ Tokyo #1
Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, Inc.
2. A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
> Open-source hacker
> MessagePack - Efficient object serializer
> Fluentd - A unified data collection tool
> Prestogres - PostgreSQL protocol gateway for Presto
> Embulk - A bulk data loader with plugin-based architecture
> ServerEngine - A Ruby framework to build multiprocess servers
> LS4 - A distributed object storage with cross-region replication
> kumofs - A distributed, strongly consistent key-value data store
16. [Diagram: Presto with PostgreSQL, HDFS / Metastore, and MySQL; operations shown: JOIN, INSERT INTO]
create table mysql.presto_test.recent_user_info
as
select users.id, users.email, count(1) as count
from orders
join mysql.presto_test.users
on orders.custkey = users.id
group by 1, 2;
17. 1. Distributed & Plug-in architecture
> 3 types of servers
> Coordinator, Worker, Discovery server
> Get data/metadata through connector plugins.
> Presto is stateless (Presto is NOT a database).
> Presto can provide distributed SQL to any data store.
• Connectors are loosely coupled (which may cause some overhead)
> Client protocol is HTTP + JSON
> Language bindings: Ruby, Python, PHP, Java, R, etc.
> ODBC & JDBC support by Prestogres
> https://github.com/treasure-data/prestogres
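Because the client protocol is HTTP + JSON, a language binding can be quite small. The sketch below is a minimal illustration, not the actual client code. The slides do not show the wire format, so the `/v1/statement` path, the `X-Presto-*` header names, and the `nextUri` / `data` response fields are assumptions based on how the Presto client protocol is commonly described. The helper names (`build_statement_request`, `collect_rows`, `fake_fetch`) are hypothetical, and the network call is replaced by canned JSON pages.

```python
import json


def build_statement_request(server, sql, user, catalog, schema):
    """Describe the initial POST that submits a query (assumed wire format)."""
    return {
        "method": "POST",
        "url": server.rstrip("/") + "/v1/statement",  # assumed endpoint path
        "headers": {
            "X-Presto-User": user,        # assumed header names
            "X-Presto-Catalog": catalog,
            "X-Presto-Schema": schema,
        },
        "body": sql.encode("utf-8"),
    }


def collect_rows(first_response, fetch):
    """Follow nextUri links until the coordinator stops returning one."""
    rows = []
    response = first_response
    while True:
        rows.extend(response.get("data", []))
        next_uri = response.get("nextUri")
        if next_uri is None:
            return rows
        response = fetch(next_uri)


# Canned JSON pages standing in for coordinator responses (hypothetical data).
FAKE_PAGES = {
    "page/1": '{"data": [["alice", 3]], "nextUri": "page/2"}',
    "page/2": '{"data": [["bob", 1]]}',
}


def fake_fetch(uri):
    """Stand-in for an HTTP GET that returns the decoded JSON body."""
    return json.loads(FAKE_PAGES[uri])
```

A real binding would send the request with an HTTP library and pass a `fetch` that performs a GET; only the paging loop would stay the same.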
18. Presto’s other features
> Comprehensive SQL features
> WITH cte as (SELECT …) SELECT * FROM cte …;
> implicit JOIN (join criteria at WHERE)
> VIEW
> INSERT INTO … VALUES (1,2,3)
> Time & Date types & functions
compatible with both MySQL & PostgreSQL
> Cluster management using SQL
> SELECT * FROM sys.node;
> sys.task, sys.query
20. Presto’s execution model
> Presto is NOT MapReduce
> Presto’s query plan is based on a DAG
> more like Spark or traditional MPP databases
21. MapReduce vs. Presto
MapReduce (map -> disk -> reduce -> disk -> map -> reduce)
> Write data to disk
> Wait between stages
Presto (task -> task -> task)
> All stages are pipelined
✓ No wait time
✓ No fault-tolerance
> Memory-to-memory data transfer
✓ No disk IO
✓ Data chunk must fit in memory
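The difference between waiting between stages and pipelining can be sketched with Python generators. This is an analogy under stated assumptions, not Presto's implementation: all function names are hypothetical, everything runs in one process, and `list(...)` stands in for MapReduce writing a stage's full output to disk before the next stage starts.

```python
def table_scan(rows, log):
    """Upstream stage: emit rows one at a time."""
    for row in rows:
        log.append(("scan", row))
        yield row


def filter_stage(rows, predicate, log):
    """Middle stage: pass along rows that match the predicate."""
    for row in rows:
        if predicate(row):
            log.append(("filter", row))
            yield row


def output_stage(rows, log):
    """Final stage: collect whatever arrives."""
    result = []
    for row in rows:
        log.append(("output", row))
        result.append(row)
    return result


def is_even(row):
    return row % 2 == 0


def run_pipelined(rows):
    """Presto-like: each row flows through all stages without waiting."""
    log = []
    scanned = table_scan(rows, log)
    kept = filter_stage(scanned, is_even, log)
    return output_stage(kept, log), log


def run_staged(rows):
    """MapReduce-like: each stage finishes and materializes its output first."""
    log = []
    scanned = list(table_scan(rows, log))             # stands in for "write to disk"
    kept = list(filter_stage(scanned, is_even, log))  # next stage starts only now
    return output_stage(kept, log), log
```

In `run_pipelined` the first output row is logged before the scan has finished, while in `run_staged` every scan event precedes the first filter event. The trade-off from the slide also shows up here: the pipelined version has no materialized intermediate result to restart from.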
22. Query Planner
SQL:
SELECT
name,
count(*) AS c
FROM access
GROUP BY name
Table schema:
TABLE access (
name varchar,
time bigint
)
Logical query plan:
Table scan (name:varchar) -> GROUP BY (name, count(*)) -> Output (name, c)
Distributed query plan:
Table scan -> Partial aggregation -> Sink -> Exchange -> Final aggregation -> Sink -> Exchange -> Output
23. Query Planner - Stages
Stage-2: Table scan -> Partial aggregation -> Sink
(inter-worker data transfer)
Stage-1: Exchange -> Final aggregation -> Sink (pipelined aggregation)
(inter-worker data transfer)
Stage-0: Exchange -> Output
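The partial / final aggregation split for the `GROUP BY name` query can be sketched in a few lines of Python. This is an illustration of the idea only, assuming in-process lists in place of workers and network exchanges. The function names are hypothetical, and hash partitioning by name is an assumption about how rows are routed between the two aggregation stages.

```python
from collections import Counter


def partial_aggregation(split):
    """Scan one split and pre-count names locally (one task of the scan stage)."""
    return Counter(row["name"] for row in split)


def exchange(partials, num_partitions):
    """Route partial counts so that each name lands on exactly one final task."""
    partitions = [[] for _ in range(num_partitions)]
    for partial in partials:
        for name, count in partial.items():
            partitions[hash(name) % num_partitions].append((name, count))
    return partitions


def final_aggregation(partition):
    """Sum the partial counts received for each name."""
    totals = Counter()
    for name, count in partition:
        totals[name] += count
    return totals


def run_query(splits, num_partitions=2):
    """SELECT name, count(*) AS c FROM access GROUP BY name, over in-memory splits."""
    partials = [partial_aggregation(split) for split in splits]
    partitions = exchange(partials, num_partitions)
    output = {}
    for partition in partitions:
        output.update(final_aggregation(partition))  # names never overlap across partitions
    return output
```

Pre-counting per split means only one `(name, count)` pair per distinct name per task crosses the exchange, rather than every scanned row.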
27. 2. Query Planning
> SQL is converted into stages, tasks and splits
> All tasks run in parallel
> No wait time between stages (pipelined)
> If one task fails, all tasks fail at once (query fails)
> Memory-to-memory data transfer
> No disk IO
> If hash-partitioned aggregated data doesn’t fit in memory,
the query fails
• Note: The query dies but the worker doesn’t.
Memory consumption is fully managed.
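The "query dies but worker doesn't die" behavior can be sketched as a memory budget checked inside the aggregation. This is a simplified illustration, not Presto's memory manager: the names are hypothetical, and `max_groups` (a cap on distinct keys) is an assumed stand-in for a real per-query byte limit.

```python
class QueryMemoryExceeded(Exception):
    """Raised when one query's aggregation state outgrows its budget."""


def final_aggregate(names, max_groups):
    """Count names in memory, failing the query instead of spilling to disk."""
    counts = {}
    for name in names:
        if name not in counts and len(counts) >= max_groups:
            raise QueryMemoryExceeded("too many groups: limit is %d" % max_groups)
        counts[name] = counts.get(name, 0) + 1
    return counts


def run_worker(queries, max_groups):
    """Run several queries on one worker; a failed query does not stop the rest."""
    results = {}
    for query_id, names in queries.items():
        try:
            results[query_id] = final_aggregate(names, max_groups)
        except QueryMemoryExceeded:
            results[query_id] = "FAILED"
    return results
```

With `max_groups=2`, a query over three distinct names is marked `"FAILED"` while a later query on the same worker still returns its counts.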