Presto @ Zalando - Big Data Tech Warsaw 2020
1. Presto @ Zalando
Max Schultze - max.schultze@zalando.de
Wojciech Biela - wojciech.biela@starburstdata.com
Piotr Findeisen - piotr.findeisen@starburstdata.com
27-02-2020
A cloud journey for Europe’s leading online fashion retailer
@mcs1408 @wbiela @findepi
2. Who are we?
Max Schultze
● Lead Data Engineer
● MSc in Computer Science
● Took part in early development of Apache Flink
● Retired semi-professional Magic: the Gathering player
Wojciech Biela
● Senior Engineering Director
● Starburst Co-founder
● MSc in Computer Science
● Prev: Engineering lead at Hadapt (interactive SQL-on-Hadoop pioneer)
● Prev: Head of engineering @ Empik.com
3. Who are we?
Piotr Findeisen
● Presto Committer & maintainer
● Starburst Co-founder
● MSc in Computer Science
● Prev: Presto Engineer at Teradata
14. What is Presto?
Community-driven open source project
Separation of compute and storage
No vendor lock-in
• No Hadoop distro vendor lock-in
• No storage engine vendor lock-in
• No cloud vendor lock-in
High performance ANSI SQL engine
• Proven scalability
• High concurrency
19. Presto Extensibility with Connectors
Presto Coordinator
● Metadata SPI: Hive, Cassandra, Kafka, MySQL, Custom
● Data Statistics SPI: Hive, Cassandra, Kafka, MySQL, Custom
● Data Location SPI: Hive, Cassandra, Kafka, MySQL, Custom
Presto Worker
● Data Stream SPI: Hive, Cassandra, Kafka, MySQL, Custom
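The division of labor in this diagram can be illustrated with a toy model. This is a minimal sketch and not Presto's actual Java SPI: the class and method names (`ToyConnector`, `get_columns`, `get_row_count`, `get_splits`, `read_split`) and the sample table are hypothetical. It assumes the coordinator asks a connector for metadata, statistics, and data locations (splits), while workers stream the rows of the splits assigned to them.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Split:
    """A unit of work: one chunk of a table that a single worker can read."""
    table: str
    chunk: int


class ToyConnector:
    """Hypothetical in-memory connector; all names here are illustrative."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[List[tuple]]]]):
        # table name -> (column names, list of row chunks)
        self._tables = tables

    # "Metadata": used by the coordinator during analysis and planning.
    def get_columns(self, table: str) -> List[str]:
        return self._tables[table][0]

    # "Data statistics": used by the coordinator for cost estimates.
    def get_row_count(self, table: str) -> int:
        return sum(len(chunk) for chunk in self._tables[table][1])

    # "Data location": the coordinator enumerates splits to schedule.
    def get_splits(self, table: str) -> List[Split]:
        return [Split(table, i) for i in range(len(self._tables[table][1]))]

    # "Data stream": a worker reads the rows of one assigned split.
    def read_split(self, split: Split) -> Iterator[tuple]:
        yield from self._tables[split.table][1][split.chunk]


connector = ToyConnector({
    "orders": (["order_id", "amount"], [[(1, 10.0), (2, 5.5)], [(3, 7.25)]]),
})

# Coordinator side: plan using metadata, then hand out splits.
columns = connector.get_columns("orders")
splits = connector.get_splits("orders")

# Worker side: each split could be read on a different worker.
rows = [row for split in splits for row in connector.read_split(split)]
```

Supporting a new data source then means supplying another object with the same four capabilities, which is the idea behind the "Custom" entry.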
20. Query Execution Performance
• In-memory processing, Pipelined execution across nodes MPP-style
• Vectorized columnar processing
• Multithreaded execution keeps all CPU cores busy
• Presto is written in highly tuned Java
○ Efficient data structures (minimizes GC)
○ Very careful coding of inner loops
○ Runtime bytecode generation
• Optimized ORC & Parquet readers
21. Apache Hive Connector
• Access data stored in scalable and cost-effective storage
○ HDFS
○ AWS S3
○ Google GCS
○ Azure Blob & ADLS (Gen 1 and 2)
○ S3-Compatible (e.g. MinIO)
• Schema information stored in Hive Metastore or AWS Glue Data Catalog
• Uses “Hive-Style” Table format
• Partitions and Bucketing are recognized and used
• Does not use Hive runtime to perform execution
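Why recognizing partitions helps can be shown with a small sketch. It assumes the common Hive-style layout in which partition values appear in directory names as `key=value`. The bucket, table, and file names are hypothetical, and the functions only illustrate partition pruning, not the connector's actual code.

```python
from typing import Dict, List


def parse_partition(path: str) -> Dict[str, str]:
    """Extract key=value partition segments from a Hive-style path."""
    values = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def prune(paths: List[str], wanted: Dict[str, str]) -> List[str]:
    """Keep only paths whose partition values match every wanted key."""
    return [
        p for p in paths
        if all(parse_partition(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical object-store layout for a table partitioned by dt and country.
files = [
    "s3://bucket/sales/dt=2020-02-26/country=DE/part-0.orc",
    "s3://bucket/sales/dt=2020-02-27/country=DE/part-0.orc",
    "s3://bucket/sales/dt=2020-02-27/country=PL/part-0.orc",
]

# A filter such as WHERE dt = '2020-02-27' AND country = 'PL'
# only needs to touch one of the three files.
selected = prune(files, {"dt": "2020-02-27", "country": "PL"})
```

Files in non-matching partitions are never opened, so the saving grows with the number of partitions a query can skip.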
22. Relational Database Connectors (JDBC based)
• Uses the relational database's JDBC driver for Presto workers to connect to the data source
• Filtering is pushed down into the database for a performance benefit
• MySQL
• PostgreSQL
• Redshift
• SQL Server
• Google BigQuery
• Oracle
• DB2
• Teradata
• Snowflake
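The benefit of pushing filters down can be sketched with a stand-in database. The example uses Python's built-in `sqlite3` in place of a remote source, which is an assumption made only so the sketch runs on its own. The table, data, and function names are hypothetical, and this is not how Presto implements pushdown internally.

```python
import sqlite3

# sqlite3 stands in for a remote relational database (an assumption for
# this sketch; Presto would talk to the source through its JDBC driver).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
db.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "DE"), (2, "PL"), (3, "DE"), (4, "FR")],
)


def scan_without_pushdown(country: str):
    """Fetch every row, then filter on the engine side."""
    fetched = db.execute(
        "SELECT id, country FROM customers ORDER BY id"
    ).fetchall()
    return len(fetched), [row for row in fetched if row[1] == country]


def scan_with_pushdown(country: str):
    """Send the filter to the database so only matching rows are transferred."""
    fetched = db.execute(
        "SELECT id, country FROM customers WHERE country = ? ORDER BY id",
        (country,),
    ).fetchall()
    return len(fetched), fetched


transferred_all, result_a = scan_without_pushdown("DE")
transferred_few, result_b = scan_with_pushdown("DE")
```

Both scans return the same rows, but the second transfers only the matching ones, which is where the performance benefit comes from.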
23. Non-Relational Data Sources
• Apache Accumulo
• Apache Cassandra
• Apache Phoenix
• Elasticsearch
• Apache Kafka
• Apache Kudu
• MongoDB
• Redis
24. SQL Support
• Presto's development is guided by the SQL standard
• Most major SQL features are covered
• The full TPC-H & TPC-DS query sets run
25. Security
● User authentication (CLI/ODBC/JDBC)
○ Basic
○ Kerberos / LDAP
● Pluggable user authorization schemes (access control)
● User impersonation (Hive, JDBC connectors)
● Support for kerberized HDFS/Hive metastore
● SSL on the wire
○ client to Presto
○ between Presto nodes
● Sentry and Ranger support
○ column and row level security
26. JDBC & ODBC Connectivity
• Presto provides an open source JDBC driver
https://prestosql.io/download.html
• Commercial JDBC and ODBC drivers available from Starburst
• Do not confuse these drivers with the drivers Presto internally uses to connect to JDBC data sources (e.g. MySQL, SQL Server, etc.)
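For the client-facing driver, a connection is addressed with a URL of the form `jdbc:presto://host:port/catalog/schema`. The helper below is a minimal sketch that only assembles such a string. The function name and the host `presto.example.com` are hypothetical placeholders, and the catalog and schema values are examples.

```python
from typing import Optional


def presto_jdbc_url(host: str, port: int,
                    catalog: Optional[str] = None,
                    schema: Optional[str] = None) -> str:
    """Build a URL of the form jdbc:presto://host:port[/catalog[/schema]]."""
    if schema and not catalog:
        raise ValueError("a schema requires a catalog")
    url = f"jdbc:presto://{host}:{port}"
    if catalog:
        url += f"/{catalog}"
    if schema:
        url += f"/{schema}"
    return url


# Placeholder host; "hive" and "default" are example catalog and schema names.
url = presto_jdbc_url("presto.example.com", 8080, "hive", "default")
```

A BI tool or Java application would pass this URL to the Presto JDBC driver. The JDBC drivers that Presto uses internally to reach sources such as MySQL are configured separately, per catalog.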
27. End-User Tools
Starburst provides enterprise-grade ODBC and JDBC drivers, allowing you to use your favorite tools with Starburst
○ Power BI
○ MicroStrategy
○ Tableau
○ Qlik
○ Looker
○ Periscope
○ DBeaver
○ And more…
28. The Presto Fan Club
* Multiple clusters (10,000+ nodes)
* 300PB in HDFS, MySQL, and Raptor
* 1000s of users, 100s of concurrent queries
29. The Presto Fan Club
* 300+ AWS nodes
* 100+ PB in S3 (Parquet)
* 650+ users with 6K+ queries daily
30. The Presto Fan Club
* 150+ PB HDFS (Parquet/ORC)
* 2,000+ nodes (clusters on prem.)
* 160K+ queries/day over HDFS
31. The Presto Fan Club
* 2,000+ nodes (several clusters on premises and GCP)
* 20K+ queries daily (Parquet)
32. The Presto Fan Club
* 100 Presto VMs (on premises)
* 1K+ HDFS nodes
* ORC data
* Starburst support
33. The Presto Fan Club
* interactive
* 400+ nodes in AWS
* 100K+ queries/day
* 20+ PBs in S3 (Parquet)
34. The Presto Fan Club
* 200+ nodes (on premises)
* HDFS, ObjectStore, and Cassandra
* Starburst support
35. The Presto Fan Club
* 120+ nodes in AWS
* 4PB in S3
* 200+ users
* Starburst support
36. Starburst Overview
Founded 2017
• Founding team includes many of the largest committers to the open source project Presto, working on Presto since 2015
• Former Teradata, Vertica, Hadapt, Netezza, and Ab Initio
Enterprise Presto Offering
• Azure, AWS, GCP, On Premises, Kubernetes
Headquartered in Boston
Customers Globally
37. Key Presto contributions from Starburst
● Mission Control: For easy installation & management of Presto
● Security Integrations: Kerberos, LDAP, Ranger and in-transit encryption
● ANSI SQL: Enhancements to fully support SQL
● ODBC and JDBC drivers: To enable BI tools such as Power BI, Tableau, Qlik, etc.
● Presto Connectors: Teradata, Oracle, Hive Cloud Storage, Snowflake
● Autoscaling Presto: Autoscaling in the cloud (AWS CFT, K8s, …)
● Query Performance: Improved performance in query execution engine
● Cost-Based Query Optimizer: Providing performance boost
38. Key upcoming developments from Starburst
● Consumption Tracking: Understand your consumption and spend on the cloud
● DeltaLake Integration: Read data from Delta Lake
● Presto Insights: Tuning suggestions for Presto cluster and queries
● Okta Support: Integrate with Okta as identity provider (IdP)
● Distributed Caching: Speed up queries on hot datasets
● IAM Passthrough: Leverage IAM roles
● Integrated Apache Ranger: Automatically deploy Ranger in Presto for the security stack
● Kubernetes support: Advanced K8s ecosystem support
56. Presto @ Zalando
A cloud journey for Europe’s leading online fashion retailer
Max Schultze
max.schultze@zalando.de
@mcs1408
Wojciech Biela
wojciech.biela@starburstdata.com
@wbiela