SlideShare uma empresa Scribd logo
1 de 33
Baixar para ler offline
Real Time Analytics
at Uber: Bring SQL
into Everything
Zhenxiao Luo
NYC
Uber’s mission is to
ignite opportunity by
setting the world in
motion.
15M
Trips/Day
600+
Cities
75M
Monthly Riders
Data informs every decision at the company
Overview of Uber’s Data Platform
DATA SOURCES
RAW DATA
MODELED TABLES
MINING BUSINESS
INSIGHTS
CONSUMING BUSINESS INSIGHTS
EXPERIMENTATION
DATA SCIENCE
MACHINE
LEARNING
CUSTOM DATA SETS
Dashboarding
Alerting
Monitoring
Data Exploration
Knowledge Bases
Storage
Infrastructure
ETL Frameworks
Data Integrity
Query Engines
Kafka
Uber Data Infrastructure
Schemaless
MySQL,
Postgres
Vertica
Streamio
Raw
Data
Raw
Tables
Sqoop
Reports
Hadoop
Hive Presto Spark
Notebook Ad Hoc Queries
Real Time
Applications
Machine
Learning Jobs
Business
Intelligence Jobs
Cluster
Management
All-Active
Observability
Security
Vertica
Samza
Pinot
Flink
AresDB
Modeled
Tables
Streaming
Warehouse
Real-time
Presto @ Uber-scale
5KWeekly Active Users
160KQueries/day
3Data Centers
2KNodes
700MHDFS files read/day
10PBHDFS files
processed/day
Presto use cases at Uber
Growth Marketing
Data Science
Marketplace
Pricing
Community
Operations
Data Quality
Ad-hoc Querying
The people who rely on us
Technical
Skills
Data Scientists
Software Engineers
ML/AI Researchers
Advanced SQL
Advanced Statistics
Scala/Spark, Python/R
Data Modeling
Inventor Ivan
Marketing Managers
Entry-level Analysts
General Managers
Product Managers
Limited SQL
Spreadsheets
Reliant Rebecca
City Operations
Regional Managers
Intermediate SQL
Spreadsheets
Dashboarding
Monitoring Matt
Operations Managers
Data Analysts
Product Analysts
Advanced SQL
Spreadsheets
Limited Statistics
Limited Python/R
Analyst Anna
Exploratory ML &
model-training
Data Scientists ML ResearchersEngineers
Using ML to ensure data
security and compliance
Advanced data
science &
complex analytics
Data Scientists Ops Analysts Support Agents
Surfacing hidden insights
to empower restaurants
Business process
automation
S&P AnalystsOps Managers Contractors
Using technology to make
transportation safer
What is Presto: Interactive SQL
Engine for Big Data
Interactive query speeds
Horizontally scalable
ANSI SQL
Battle-tested by Facebook, Uber, Linkedin, Twitter, Netflix, Airbnb, etc
Completely open source
Access to petabytes of data in the Hadoop, Elasticsearch, Pinot, etc.
How Presto Works
Why Presto is Fast
● Data in memory during execution
● Pipelining and streaming
● Columnar storage & execution
● Bytecode generation
Resource Management
● Presto has its own resource manager
○ Not on YARN
○ Not on Mesos
● CPU Management
○ Priority queues
○ Short running queries higher priority
● Memory Management
○ Max memory per query per node
○ If query exceeds max memory limit, query fails
○ No OutOfMemory in Presto process
Presto Connectors:
No Need to Copy Data
Uber Contributions
Contributions
New Features
● Geospatial indexing and operations - 10x or more speedup
● Pinot connector enhancements (in-house)
Optimizations
● Elasticsearch connector
● New Parquet reader - 4x speedup
● Nested column pushdowns (project, predicate) - 10x speedup
Security
● Metastore authentication support for Kerberos deployments
● Dispatch Proxy using HTTP redirect for multi-cluster operation
Presto Connector Interface
● ConnectorMetadata
○ Schema, Table, Column
● ConnectorSplitManager
○ Divide data into splits
● ConnectorSplit
○ Split data range
○ Predicate/JsonFunction/Limit pushdown
● ConnectorRecordCursor
○ Transform underlying storage data into Presto internal
page/block
Presto Elasticsearch Connector
Data Model
● each Elasticsearch index is a table partition
● each field of an index is a column
● all Elasticsearch indexes sharing the same prefix
consist a logical table
○ Es-vehicles-sjc1, es-vehicles-dca1, es-vehicles
Describe Table
Query
Optimizations
● Parallel Reads
○ Get all indices and search nodes
○ For each search node, send request for one specific index
● Cap Max Hits
● Predicate Pushdown
● Json Function Pushdown
● Limit Pushdown
● Nested Fields
How many Uber trip
requests did we serve
in Chicago yesterday?
Fetch daily trip count in seconds
SELECT T.base.city_id AS cid,
Count(CASE WHEN T.base.status = 'completed' THEN 1 END) AS
completed_trips,
Count(CASE WHEN T.base.status = 'canceled' THEN 1 END) AS
rider_canceled_trips
FROM trips AS T
WHERE T.datestr = '2019-03-11'
GROUP BY 1
Column Chunk
base.client_uuid
Column Chunk
base.driver_uuid
Column Chunk
base.status
Column Chunk
base.vehicle_id
Column Chunk
base.city_idRow Group
Column Chunk
base.client_uuid
Column Chunk
base.driver_uuid
Column Chunk
base.city_id
Column Chunk
base.vehicle_id
Column Chunk
base.statusRow Group
Parquet
Parquet Footer: File Metadata, Row Group Metadata
Step 1: Read all Parquet nested fields from disk
base.driver_uuid base.client_uuid base.city_id …... base.vehicle_id base.status
base.driver_uuid
base.driver_uuid
base.driver_uuid
base.driver_uuid base.client_uuid base.city_id …... base.vehicle_id base.status
Presto Columnar Engine
Step 2: Transform Parquet rows into Presto columnar blocks
Step 3: Evaluate predicates on columnar blocks
base.client_uuid
base.client_uuid
base.client_uuid
base.city_id
base.city_id
base.city_id
base.vehicle_id
base.vehicle_id
base.vehicle_id
base.status
base.status
base.status
….
Default Apache Parquet Reader
Column Chunk
base.client_uuid
Column Chunk
base.driver_uuid
Column Chunk
base.status
Column Chunk
base.vehicle_id
Column Chunk
base.city_id
Row Group
Column Chunk
base.client_uuid
Column Chunk
base.driver_uuid
Column Chunk
base.city_id
Column Chunk
base.vehicle_id
Column Chunk
base.status
Row Group
Parquet Footer: File Metadata, Row Group Metadata
Step 1: Read ONLY Required nested fields from disk
Presto Columnar Engine
Apache Parquet Reader Optimization
base.driver_uuid
base.driver_uuid
base.driver_uuid
base.city_id
base.city_id
base.city_id
Step 1: Read ONLY Required nested fields from disk
Evaluate predicates on the fly:
Skip reading row group;
predicate: base.city_id = 12
dictionary: base.city_id: {3,
5, 9, 14, 21}
Build columnar blocks only
for predicate matches
Step 2. Build columnar blocks on the fly
base.driver_uuid
base.driver_uuid
base.driver_uuid
Step 3: Evaluate predicates on columnar blocks
Parquet
Results
Looking forward
Federated SQL Layer
Vision
HDFS
VerticaElasticsearch
Apache
Pinot
MySQL
Machines
Reports
Users
Presto RealTime Presto
Proxy layer
Management
Universal
Metadata
Service
Focus areas
Connectors
● Apache Hive, Apache Pinot, Elasticsearch, Apache Cassandra, Vertica, MySQL, etc
● Aggregation / Join pushdown
● Cross-connector optimizations (hybrid connectors)
Real-time
● Real-time mode with low latency pass through
● Query plan / result / data cache
● Time-series joins and stitching
Universal Metadata Service (UMS)
● Logical definitions / physical schemas
● Column stitching and joins
● Table and partition caching
Thank you
Proprietary © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying, recording, or by any
information storage or retrieval systems, without permission in writing from
Uber. This document is intended only for the use of the individual or entity to
whom it is addressed. All recipients of this document are notified that the
information contained herein includes proprietary information of Uber, and
recipient may not make use of, disseminate, or in any way disclose this
document or any of the enclosed information to any person other than
employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.

Mais conteúdo relacionado

Mais procurados

데브시스터즈 데이터 레이크 구축 이야기 : Data Lake architecture case study (박주홍 데이터 분석 및 인프라 팀...
데브시스터즈 데이터 레이크 구축 이야기 : Data Lake architecture case study (박주홍 데이터 분석 및 인프라 팀...데브시스터즈 데이터 레이크 구축 이야기 : Data Lake architecture case study (박주홍 데이터 분석 및 인프라 팀...
데브시스터즈 데이터 레이크 구축 이야기 : Data Lake architecture case study (박주홍 데이터 분석 및 인프라 팀...Amazon Web Services Korea
 
Little Big Data #1. 바닥부터 시작하는 데이터 인프라
Little Big Data #1. 바닥부터 시작하는 데이터 인프라Little Big Data #1. 바닥부터 시작하는 데이터 인프라
Little Big Data #1. 바닥부터 시작하는 데이터 인프라Seongyun Byeon
 
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유Hyojun Jeon
 
[오픈소스컨설팅] 서비스 메쉬(Service mesh)
[오픈소스컨설팅] 서비스 메쉬(Service mesh)[오픈소스컨설팅] 서비스 메쉬(Service mesh)
[오픈소스컨설팅] 서비스 메쉬(Service mesh)Open Source Consulting
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유 (2부)
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유 (2부)[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유 (2부)
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유 (2부)Hyojun Jeon
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginnersNeil Baker
 
검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민종민 김
 
AWS로 사용자 천만 명 서비스 만들기 (윤석찬)- 클라우드 태권 2015
AWS로 사용자 천만 명 서비스 만들기 (윤석찬)- 클라우드 태권 2015 AWS로 사용자 천만 명 서비스 만들기 (윤석찬)- 클라우드 태권 2015
AWS로 사용자 천만 명 서비스 만들기 (윤석찬)- 클라우드 태권 2015 Amazon Web Services Korea
 
mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교Woo Yeong Choi
 
Indexing with MongoDB
Indexing with MongoDBIndexing with MongoDB
Indexing with MongoDBMongoDB
 
Bigquery와 airflow를 이용한 데이터 분석 시스템 구축 v1 나무기술(주) 최유석 20170912
Bigquery와 airflow를 이용한 데이터 분석 시스템 구축 v1  나무기술(주) 최유석 20170912Bigquery와 airflow를 이용한 데이터 분석 시스템 구축 v1  나무기술(주) 최유석 20170912
Bigquery와 airflow를 이용한 데이터 분석 시스템 구축 v1 나무기술(주) 최유석 20170912Yooseok Choi
 
[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기NAVER D2
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
 
quick intro to elastic search
quick intro to elastic search quick intro to elastic search
quick intro to elastic search medcl
 
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편Seongyun Byeon
 
How to build massive service for advance
How to build massive service for advanceHow to build massive service for advance
How to build massive service for advanceDaeMyung Kang
 
Data pipeline and data lake
Data pipeline and data lakeData pipeline and data lake
Data pipeline and data lakeDaeMyung Kang
 
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)Myungjin Lee
 

Mais procurados (20)

데브시스터즈 데이터 레이크 구축 이야기 : Data Lake architecture case study (박주홍 데이터 분석 및 인프라 팀...
데브시스터즈 데이터 레이크 구축 이야기 : Data Lake architecture case study (박주홍 데이터 분석 및 인프라 팀...데브시스터즈 데이터 레이크 구축 이야기 : Data Lake architecture case study (박주홍 데이터 분석 및 인프라 팀...
데브시스터즈 데이터 레이크 구축 이야기 : Data Lake architecture case study (박주홍 데이터 분석 및 인프라 팀...
 
Little Big Data #1. 바닥부터 시작하는 데이터 인프라
Little Big Data #1. 바닥부터 시작하는 데이터 인프라Little Big Data #1. 바닥부터 시작하는 데이터 인프라
Little Big Data #1. 바닥부터 시작하는 데이터 인프라
 
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유
 
[오픈소스컨설팅] 서비스 메쉬(Service mesh)
[오픈소스컨설팅] 서비스 메쉬(Service mesh)[오픈소스컨설팅] 서비스 메쉬(Service mesh)
[오픈소스컨설팅] 서비스 메쉬(Service mesh)
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유 (2부)
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유 (2부)[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유 (2부)
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유 (2부)
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginners
 
검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민
 
AWS로 사용자 천만 명 서비스 만들기 (윤석찬)- 클라우드 태권 2015
AWS로 사용자 천만 명 서비스 만들기 (윤석찬)- 클라우드 태권 2015 AWS로 사용자 천만 명 서비스 만들기 (윤석찬)- 클라우드 태권 2015
AWS로 사용자 천만 명 서비스 만들기 (윤석찬)- 클라우드 태권 2015
 
Reliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at AirbnbReliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at Airbnb
 
mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교
 
Indexing with MongoDB
Indexing with MongoDBIndexing with MongoDB
Indexing with MongoDB
 
Bigquery와 airflow를 이용한 데이터 분석 시스템 구축 v1 나무기술(주) 최유석 20170912
Bigquery와 airflow를 이용한 데이터 분석 시스템 구축 v1  나무기술(주) 최유석 20170912Bigquery와 airflow를 이용한 데이터 분석 시스템 구축 v1  나무기술(주) 최유석 20170912
Bigquery와 airflow를 이용한 데이터 분석 시스템 구축 v1 나무기술(주) 최유석 20170912
 
[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
quick intro to elastic search
quick intro to elastic search quick intro to elastic search
quick intro to elastic search
 
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
 
How to build massive service for advance
How to build massive service for advanceHow to build massive service for advance
How to build massive service for advance
 
Data pipeline and data lake
Data pipeline and data lakeData pipeline and data lake
Data pipeline and data lake
 
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
 

Semelhante a Real time analytics at uber @ strata data 2019

Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Zhenxiao Luo
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsMars Lan
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaData Con LA
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Bostonkbajda
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
 
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilSunita Shrivastava
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine kiran palaka
 
The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019Amit Banerjee
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileRoy Kim
 
EDB Postgres in DBaaS & Container Platforms
EDB Postgres in DBaaS & Container PlatformsEDB Postgres in DBaaS & Container Platforms
EDB Postgres in DBaaS & Container PlatformsAshnikbiz
 
Neo4j Database and Graph Platform Overview
Neo4j Database and Graph Platform OverviewNeo4j Database and Graph Platform Overview
Neo4j Database and Graph Platform OverviewNeo4j
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analyticskgshukla
 
Neo4j Vision and Roadmap
Neo4j Vision and Roadmap Neo4j Vision and Roadmap
Neo4j Vision and Roadmap Neo4j
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSAmazon Web Services
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Amazon Web Services
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016kbajda
 

Semelhante a Real time analytics at uber @ strata data 2019 (20)

Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
 
Presto
PrestoPresto
Presto
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Boston
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch Council
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine
 
The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
 
EDB Postgres in DBaaS & Container Platforms
EDB Postgres in DBaaS & Container PlatformsEDB Postgres in DBaaS & Container Platforms
EDB Postgres in DBaaS & Container Platforms
 
Neo4j Database and Graph Platform Overview
Neo4j Database and Graph Platform OverviewNeo4j Database and Graph Platform Overview
Neo4j Database and Graph Platform Overview
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Neo4j Vision and Roadmap
Neo4j Vision and Roadmap Neo4j Vision and Roadmap
Neo4j Vision and Roadmap
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016
 

Mais de Zhenxiao Luo

Presto Elasticsearch Connector at Presto Summit
Presto Elasticsearch Connector at Presto SummitPresto Elasticsearch Connector at Presto Summit
Presto Elasticsearch Connector at Presto SummitZhenxiao Luo
 
Uber Geo spatial data platform at DataWorks Summit
Uber Geo spatial data platform at DataWorks SummitUber Geo spatial data platform at DataWorks Summit
Uber Geo spatial data platform at DataWorks SummitZhenxiao Luo
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
Presto GeoSpatial @ Strata New York 2017
Presto GeoSpatial @ Strata New York 2017Presto GeoSpatial @ Strata New York 2017
Presto GeoSpatial @ Strata New York 2017Zhenxiao Luo
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Zhenxiao Luo
 
Presto Apache BigData 2017
Presto Apache BigData 2017Presto Apache BigData 2017
Presto Apache BigData 2017Zhenxiao Luo
 
presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15Zhenxiao Luo
 
Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15Zhenxiao Luo
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudZhenxiao Luo
 

Mais de Zhenxiao Luo (10)

Presto Elasticsearch Connector at Presto Summit
Presto Elasticsearch Connector at Presto SummitPresto Elasticsearch Connector at Presto Summit
Presto Elasticsearch Connector at Presto Summit
 
Uber Geo spatial data platform at DataWorks Summit
Uber Geo spatial data platform at DataWorks SummitUber Geo spatial data platform at DataWorks Summit
Uber Geo spatial data platform at DataWorks Summit
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Presto GeoSpatial @ Strata New York 2017
Presto GeoSpatial @ Strata New York 2017Presto GeoSpatial @ Strata New York 2017
Presto GeoSpatial @ Strata New York 2017
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 
Presto Apache BigData 2017
Presto Apache BigData 2017Presto Apache BigData 2017
Presto Apache BigData 2017
 
Presto@Uber
Presto@UberPresto@Uber
Presto@Uber
 
presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15
 
Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
 

Último

Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...SofiyaSharma5
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...tanu pandey
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.soniya singh
 
Challengers I Told Ya ShirtChallengers I Told Ya Shirt
Challengers I Told Ya ShirtChallengers I Told Ya ShirtChallengers I Told Ya ShirtChallengers I Told Ya Shirt
Challengers I Told Ya ShirtChallengers I Told Ya Shirtrahman018755
 
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607dollysharma2066
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxellan12
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With RoomVIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Roomgirls4nights
 
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130  Available With RoomVIP Kolkata Call Girl Alambazar 👉 8250192130  Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Roomdivyansh0kumar0
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Sheetaleventcompany
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...Diya Sharma
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Delhi Call girls
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Servicegwenoracqe6
 
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With RoomVIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Roomishabajaj13
 

Último (20)

Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
Challengers I Told Ya ShirtChallengers I Told Ya Shirt
Challengers I Told Ya ShirtChallengers I Told Ya ShirtChallengers I Told Ya ShirtChallengers I Told Ya Shirt
Challengers I Told Ya ShirtChallengers I Told Ya Shirt
 
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
 
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With RoomVIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
 
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130  Available With RoomVIP Kolkata Call Girl Alambazar 👉 8250192130  Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
 
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
 
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With RoomVIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
 

Real time analytics at uber @ strata data 2019

  • 1. Real Time Analytics at Uber: Bring SQL into Everything Zhenxiao Luo
  • 2. NYC Uber’s mission is to ignite opportunity by setting the world in motion. 15M Trips/Day 600+ Cities 75M Monthly Riders
  • 3. Data informs every decision at the company
  • 4. Overview of Uber’s Data Platform DATA SOURCES RAW DATA MODELED TABLES MINING BUSINESS INSIGHTS CONSUMING BUSINESS INSIGHTS EXPERIMENTATION DATA SCIENCE MACHINE LEARNING CUSTOM DATA SETS Dashboarding Alerting Monitoring Data Exploration Knowledge Bases Storage Infrastructure ETL Frameworks Data Integrity Query Engines
  • 5. Kafka Uber Data Infrastructure Schemaless MySQL, Postgres Vertica Streamio Raw Data Raw Tables Sqoop Reports Hadoop Hive Presto Spark Notebook Ad Hoc Queries Real Time Applications Machine Learning Jobs Business Intelligence Jobs Cluster Management All-Active Observability Security Vertica Samza Pinot Flink AresDB Modeled Tables Streaming Warehouse Real-time
  • 6. Presto @ Uber-scale 5KWeekly Active Users 160KQueries/day 3Data Centers 2KNodes 700MHDFS files read/day 10PBHDFS files processed/day
  • 7. Presto use cases at Uber Growth Marketing Data Science Marketplace Pricing Community Operations Data Quality Ad-hoc Querying
  • 8. The people who rely on us Technical Skills Data Scientists Software Engineers ML/AI Researchers Advanced SQL Advanced Statistics Scala/Spark, Python/R Data Modeling Inventor Ivan Marketing Managers Entry-level Analysts General Managers Product Managers Limited SQL Spreadsheets Reliant Rebecca City Operations Regional Managers Intermediate SQL Spreadsheets Dashboarding Monitoring Matt Operations Managers Data Analysts Product Analysts Advanced SQL Spreadsheets Limited Statistics Limited Python/R Analyst Anna
  • 9. Exploratory ML & model-training Data Scientists ML ResearchersEngineers Using ML to ensure data security and compliance
  • 10. Advanced data science & complex analytics Data Scientists Ops Analysts Support Agents Surfacing hidden insights to empower restaurants
  • 11. Business process automation S&P AnalystsOps Managers Contractors Using technology to make transportation safer
  • 12. What is Presto: Interactive SQL Engine for Big Data Interactive query speeds Horizontally scalable ANSI SQL Battle-tested by Facebook, Uber, Linkedin, Twitter, Netflix, Airbnb, etc Completely open source Access to petabytes of data in the Hadoop, Elasticsearch, Pinot, etc.
  • 14. Why Presto is Fast ● Data in memory during execution ● Pipelining and streaming ● Columnar storage & execution ● Bytecode generation
  • 15. Resource Management ● Presto has its own resource manager ○ Not on YARN ○ Not on Mesos ● CPU Management ○ Priority queues ○ Short running queries higher priority ● Memory Management ○ Max memory per query per node ○ If query exceeds max memory limit, query fails ○ No OutOfMemory in Presto process
  • 18. Contributions New Features ● Geospatial indexing and operations - 10x or more speedup ● Pinot connector enhancements (in-house) Optimizations ● Elasticsearch connector ● New Parquet reader - 4x speedup ● Nested column pushdowns (project, predicate) - 10x speedup Security ● Metastore authentication support for Kerberos deployments ● Dispatch Proxy using HTTP redirect for multi-cluster operation
  • 19. Presto Connector Interface ● ConnectorMetadata ○ Schema, Table, Column ● ConnectorSplitManager ○ Divide data into splits ● ConnectorSplit ○ Split data range ○ Predicate/JsonFunction/Limit pushdown ● ConnectorRecordCursor ○ Transform underlying storage data into Presto internal page/block
  • 21. Data Model ● each Elasticsearch index is a table partition ● each field of an index is a column ● all Elasticsearch indexes sharing the same prefix consist a logical table ○ Es-vehicles-sjc1, es-vehicles-dca1, es-vehicles
  • 23. Query
  • 24. Optimizations ● Parallel Reads ○ Get all indices and search nodes ○ For each search node, send request for one specific index ● Cap Max Hits ● Predicate Pushdown ● Json Function Pushdown ● Limit Pushdown ● Nested Fields
  • 25. How many Uber trip requests did we serve in Chicago yesterday?
  • 26. Fetch daily trip count in seconds SELECT T.base.city_id AS cid, Count(CASE WHEN T.base.status = 'completed' THEN 1 END) AS completed_trips, Count(CASE WHEN T.base.status = 'canceled' THEN 1 END) AS rider_canceled_trips FROM trips AS T WHERE T.datestr = '2019-03-11' GROUP BY 1
  • 27. Column Chunk base.client_uuid Column Chunk base.driver_uuid Column Chunk base.status Column Chunk base.vehicle_id Column Chunk base.city_idRow Group Column Chunk base.client_uuid Column Chunk base.driver_uuid Column Chunk base.city_id Column Chunk base.vehicle_id Column Chunk base.statusRow Group Parquet Parquet Footer: File Metadata, Row Group Metadata Step 1: Read all Parquet nested fields from disk base.driver_uuid base.client_uuid base.city_id …... base.vehicle_id base.status base.driver_uuid base.driver_uuid base.driver_uuid base.driver_uuid base.client_uuid base.city_id …... base.vehicle_id base.status Presto Columnar Engine Step 2: Transform Parquet rows into Presto columnar blocks Step 3: Evaluate predicates on columnar blocks base.client_uuid base.client_uuid base.client_uuid base.city_id base.city_id base.city_id base.vehicle_id base.vehicle_id base.vehicle_id base.status base.status base.status …. Default Apache Parquet Reader
  • 28. Column Chunk base.client_uuid Column Chunk base.driver_uuid Column Chunk base.status Column Chunk base.vehicle_id Column Chunk base.city_id Row Group Column Chunk base.client_uuid Column Chunk base.driver_uuid Column Chunk base.city_id Column Chunk base.vehicle_id Column Chunk base.status Row Group Parquet Footer: File Metadata, Row Group Metadata Step 1: Read ONLY Required nested fields from disk Presto Columnar Engine Apache Parquet Reader Optimization base.driver_uuid base.driver_uuid base.driver_uuid base.city_id base.city_id base.city_id Step 1: Read ONLY Required nested fields from disk Evaluate predicates on the fly: Skip reading row group; predicate: base.city_id = 12 dictionary: base.city_id: {3, 5, 9, 14, 21} Build columnar blocks only for predicate matches Step 2. Build columnar blocks on the fly base.driver_uuid base.driver_uuid base.driver_uuid Step 3: Evaluate predicates on columnar blocks Parquet
  • 31. Federated SQL Layer Vision HDFS VerticaElasticsearch Apache Pinot MySQL Machines Reports Users Presto RealTime Presto Proxy layer Management Universal Metadata Service
  • 32. Focus areas Connectors ● Apache Hive, Apache Pinot, Elasticsearch, Apache Cassandra, Vertica, MySQL, etc ● Aggregation / Join pushdown ● Cross-connector optimizations (hybrid connectors) Real-time ● Real-time mode with low latency pass through ● Query plan / result / data cache ● Time-series joins and stitching Universal Metadata Service (UMS) ● Logical definitions / physical schemas ● Column stitching and joins ● Table and partition caching
  • 33. Thank you Proprietary © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed. All recipients of this document are notified that the information contained herein includes proprietary information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.