Presto @ Zalando - Big Data Tech Warsaw 2020

Presto @ Zalando
A cloud journey for Europe’s leading online fashion retailer
Max Schultze - max.schultze@zalando.de (@mcs1408)
Wojciech Biela - wojciech.biela@starburstdata.com (@wbiela)
Piotr Findeisen - piotr.findeisen@starburstdata.com (@findepi)
27-02-2020

2
Who are we?
Max Schultze
● Lead Data Engineer
● MSc in Computer Science
● Took part in early development of Apache Flink
● Retired semi-professional Magic: the Gathering player
Wojciech Biela
● Senior Engineering Director
● Starburst Co-founder
● Prev: Engineering lead at Hadapt (interactive SQL-on-Hadoop pioneer)
● Prev: Head of engineering @ Empik.com

3
Who are we?
Piotr Findeisen
● Presto Committer & maintainer
● Starburst Co-founder
● Prev: Presto Engineer at Teradata

4
TABLE OF CONTENTS
Zalando Analytics Cloud Journey
The Evolution of Presto
Advanced Analytical Infrastructure

5
Zalando Analytics Cloud Journey

7
Messaging Bus
Data Lake
Legacy Evolving

8
Zalando’s Data Lake
Ingestion
Storage
Serving

9
Web Tracking
Event Bus
DWH
Data Center
Ingestion
Storage
Serving

10
Web Tracking
Event Bus
DWH
Data Center
Ingestion
Storage
Serving
Metastore

11
Data Catalog
Web Tracking
Event Bus
DWH
Data Center
Ingestion
Storage
Serving
Metastore
Fast Query Layer
Processing Platform

13
What is Presto?
Community-driven open source project
High performance ANSI SQL engine
Separation of compute and storage
No vendor lock-in

14
What is Presto?
Community-driven open source project
Separation of compute and storage
No vendor lock-in
• No Hadoop distro vendor lock-in
• No storage engine vendor lock-in
• No cloud vendor lock-in
• Proven scalability
• High concurrency

15
What is Presto?
Community-driven open source project
Separation of compute and storage
No vendor lock-in

16
What is Presto?
Community-driven open source project
● Proven scalability
● High concurrency
Separation of compute and storage
No vendor lock-in

17
Many Well Known Presto Users

18
Presto Architecture
[Diagram labels]
COORDINATOR: Parser, Optimizer, Scheduler
WORKER (several): Processor
DATA SOURCES: Azure SQL Database, ADLS, Blob Storage, S3

19
Presto Extensibility with Connectors
Presto Coordinator
Metadata SPI: Hive, Cassandra, Kafka, MySQL, Custom
Data Statistics SPI: Hive, Cassandra, Kafka, MySQL, Custom
Data Location SPI: Hive, Cassandra, Kafka, MySQL, Custom
Presto Worker
Data Stream SPI: Hive, Cassandra, Kafka, MySQL, Custom
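
The division of labour between the SPIs on this slide can be sketched roughly as follows. This is a toy Python model, not Presto's actual Java SPI: the names `Connector`, `Split`, `InMemoryConnector` and `scan` are hypothetical and exist only for illustration. The idea it tries to show is that a connector answers three questions: what columns a table has (metadata), how the table breaks into independently readable pieces (data location), and what rows a given piece contains (data stream).

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol, Tuple


@dataclass(frozen=True)
class Split:
    """A unit of work: one slice of a table that can be read independently."""
    table: str
    offset: int


class Connector(Protocol):
    """Hypothetical, simplified connector contract (not the real Presto SPI)."""

    def columns(self, table: str) -> List[str]: ...       # metadata
    def splits(self, table: str) -> List[Split]: ...      # data location
    def rows(self, split: Split) -> Iterator[Tuple]: ...  # data stream


class InMemoryConnector:
    """Toy connector backed by a dict; stands in for Hive, Kafka, MySQL, etc."""

    def __init__(self, tables: Dict[str, dict], chunk_size: int = 2):
        self._tables = tables
        self._chunk_size = chunk_size

    def columns(self, table: str) -> List[str]:
        return list(self._tables[table]["columns"])

    def splits(self, table: str) -> List[Split]:
        row_count = len(self._tables[table]["rows"])
        return [Split(table, i) for i in range(0, row_count, self._chunk_size)]

    def rows(self, split: Split) -> Iterator[Tuple]:
        data = self._tables[split.table]["rows"]
        yield from data[split.offset:split.offset + self._chunk_size]


def scan(connector: Connector, table: str) -> List[Tuple]:
    """Enumerate splits (coordinator-like step), then read each one (worker-like step)."""
    result: List[Tuple] = []
    for split in connector.splits(table):
        result.extend(connector.rows(split))
    return result
```

In a real cluster the splits would be handed to different workers and read in parallel; the loop in `scan` reads them one after another only to keep the sketch short.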

20
Query Execution Performance
• In-memory processing, pipelined execution across nodes (MPP-style)
• Vectorized columnar processing
• Multithreaded execution keeps all CPU cores busy
• Presto is written in highly tuned Java
○ Efficient data structures (minimizes GC)
○ Very careful coding of inner loops
○ Runtime bytecode generation
• Optimized ORC & Parquet readers

21
Apache Hive Connector
• Access data stored in scalable and cost-effective storage
○ HDFS
○ AWS S3
○ Google GCS
○ Azure Blob & ADLS (Gen 1 and 2)
○ S3-compatible (e.g. MinIO)
• Schema information stored in Hive Metastore or AWS Glue Data Catalog
• Uses the “Hive-style” table format
• Partitions and bucketing are recognized and used
• Does not use the Hive runtime to execute queries
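
How recognizing partitions helps can be sketched in a few lines. In the Hive-style layout, partition values are encoded as `key=value` directory names, so an engine can skip whole directories before reading any file. The sketch below is only an illustration of that idea under assumed inputs: the bucket name, table name, file names and the helper names `partition_values` and `prune` are all hypothetical, and only equality filters are handled.

```python
from typing import Dict, List

# Hypothetical file listing for a table partitioned by dt and country.
FILES = [
    "s3://example-bucket/sales/dt=2020-02-26/country=de/part-0.parquet",
    "s3://example-bucket/sales/dt=2020-02-27/country=de/part-0.parquet",
    "s3://example-bucket/sales/dt=2020-02-27/country=pl/part-0.parquet",
]


def partition_values(path: str) -> Dict[str, str]:
    """Extract key=value directory components from a Hive-style path."""
    values: Dict[str, str] = {}
    for component in path.split("/"):
        if "=" in component:
            key, _, value = component.partition("=")
            values[key] = value
    return values


def prune(paths: List[str], wanted: Dict[str, str]) -> List[str]:
    """Keep only files whose partition values match every equality filter."""
    return [
        path
        for path in paths
        if all(partition_values(path).get(k) == v for k, v in wanted.items())
    ]
```

For example, `prune(FILES, {"dt": "2020-02-27"})` keeps only the two files under `dt=2020-02-27`, so the file for the previous day is never opened.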

22
Relational Database Connectors (JDBC based)
• Uses the relational database's JDBC driver so that Presto workers can connect to the data source
• Filtering is pushed down into the database for better performance
• MySQL
• PostgreSQL
• Redshift
• SQL Server
• Google BigQuery
• Oracle
• DB2
• Teradata
• Snowflake
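
Why filter pushdown matters can be illustrated with a small experiment. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a remote relational database; it is not how Presto's JDBC connectors are implemented, and the table, data and function names are hypothetical. Both scans return the same rows, but the pushed-down version transfers fewer rows out of the database.

```python
import sqlite3


def make_source() -> sqlite3.Connection:
    """Create an in-memory database standing in for a remote data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "DE", 10), (2, "PL", 20), (3, "DE", 30), (4, "FR", 40)],
    )
    return conn


def scan_without_pushdown(conn: sqlite3.Connection, country: str):
    """Fetch every row, then filter in the engine. Returns (rows, rows fetched)."""
    fetched = conn.execute(
        "SELECT id, country, amount FROM orders ORDER BY id"
    ).fetchall()
    kept = [row for row in fetched if row[1] == country]
    return kept, len(fetched)


def scan_with_pushdown(conn: sqlite3.Connection, country: str):
    """Send the filter to the database. Returns (rows, rows fetched)."""
    fetched = conn.execute(
        "SELECT id, country, amount FROM orders WHERE country = ? ORDER BY id",
        (country,),
    ).fetchall()
    return fetched, len(fetched)
```

With the sample data, filtering on `"DE"` fetches four rows without pushdown and two rows with it, while the final result is identical.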

23
Non Relational Data Sources
• Apache Accumulo
• Apache Cassandra
• Apache Phoenix
• Elasticsearch
• Apache Kafka
• Apache Kudu
• MongoDB
• Redis

24
SQL Support
• Presto's development is guided by the SQL standard
• Most major SQL features are covered
• The complete TPC-H & TPC-DS query sets run

25
Security
● User authentication (CLI/ODBC/JDBC)
○ Basic
○ Kerberos / LDAP
● Pluggable user authorization schemes (access control)
● User impersonation (Hive, JDBC connectors)
● Support for kerberized HDFS/Hive metastore
● SSL on the wire
○ client to Presto
○ between Presto nodes
● Sentry and Ranger support
○ column and row level security

26
JDBC & ODBC Connectivity
• Presto provides an open source JDBC driver
https://prestosql.io/download.html
• Commercial JDBC and ODBC drivers available from Starburst
• Do not confuse these drivers with the drivers Presto internally uses to connect to JDBC data sources (e.g. MySQL, SQL Server, etc.)

27
End-User Tools
Starburst provides enterprise-grade ODBC and JDBC drivers, allowing you to use your favorite tools with Starburst
○ Power BI
○ MicroStrategy
○ Tableau
○ Qlik
○ Looker
○ Periscope
○ DBeaver
○ And more…

28
The Presto Fan Club
* Multiple clusters (10,000+ nodes)
* 300 PB in HDFS, MySQL, and Raptor
* 1000s of users, 100s of concurrent queries

29
The Presto Fan Club
* 300+ AWS nodes
* 100+ PB in S3 (Parquet)
* 650+ users with 6K+ queries daily

30
The Presto Fan Club
* 150+ PB in HDFS (Parquet/ORC)
* 2,000+ nodes (clusters on premises)
* 160K+ queries/day over HDFS

31
The Presto Fan Club
* 2,000+ nodes (several clusters on premises and GCP)
* 20K+ queries daily (Parquet)

32
The Presto Fan Club
* 100 Presto VMs (on premises)
* 1K+ HDFS nodes
* ORC data
* Starburst support

33
The Presto Fan Club
* interactive
* 400+ nodes in AWS
* 100K+ queries/day
* 20+ PB in S3 (Parquet)

34
The Presto Fan Club
* 200+ nodes (on premises)
* HDFS, ObjectStore, and Cassandra
* Starburst support

35
The Presto Fan Club
* 120+ nodes in AWS
* 4 PB in S3
* 200+ users
* Starburst support

36
Starburst Overview
Founded 2017
• Founding team includes many of the largest committers to the open source Presto project, working on Presto since 2015
• Formerly at Teradata, Vertica, Hadapt, Netezza, and Ab Initio
Enterprise Presto Offering
• Azure, AWS, GCP, On Premises, Kubernetes
Headquartered in Boston
Customers Globally

37
Key Presto contributions from Starburst
Mission Control: For easy installation & management of Presto
Security Integrations: Kerberos, LDAP, Ranger and in-transit encryption
ANSI SQL: Enhancements to fully support SQL
ODBC and JDBC drivers: To enable BI tools such as Power BI, Tableau, Qlik, etc.
Presto Connectors: Teradata, Oracle, Hive Cloud Storage, Snowflake
Autoscaling Presto: Autoscaling in the cloud (AWS CFT, K8s, …)
Query Performance: Improved performance in query execution engine
Cost-Based Query Optimizer: Providing performance boost

38
Key upcoming developments from Starburst
Consumption Tracking: Understand your consumption and spend on the cloud
DeltaLake Integration: Read data from Delta Lake
Presto Insights: Tuning suggestions for Presto cluster and queries
Okta Support: Integrate with the Okta identity provider (IdP)
Distributed Caching: Speed up queries on hot datasets
IAM Passthrough: Leverage IAM roles
Integrated Apache Ranger: Automatically deploy Ranger in Presto for the security stack
Kubernetes support: Advanced K8s ecosystem support

39
Try Starburst
Enterprise-Grade Presto
in the Cloud and On-Premises
Azure, AWS, GCP, On Premises, & Kubernetes
www.starburstdata.com/presto-enterprise

40
Advanced Analytical Infrastructure

44
$$
Analytical Infrastructure

45
Advanced Analytical Infrastructure

46
$$

47

48
Presto Gateway

49
Infrastructure Support
Expedite Learning

50
Expedite Learning
Fine Tuning Infrastructure

51
Expedite Learning
Fine Tuning Infrastructure
New Features

56
Presto @ Zalando
A cloud journey for Europe’s leading online fashion retailer
Max Schultze
max.schultze@zalando.de
@mcs1408
Wojciech Biela
wojciech.biela@starburstdata.com
@wbiela
