Internals of Presto Service
Taro L. Saito, Treasure Data
leo@treasure-data.com
March 11-12th, 2015
Treasure Data Tech Talk #1 at Tokyo
Taro L. Saito @taroleo
•  2007 University of Tokyo. Ph.D.
–  XML DBMS, Transaction Processing
•  Relational-Style XML Query [SIGMOD 2008]
•  ~ 2014 Assistant Professor at University of Tokyo
–  Genome Science Research
•  Distributed Computing, Personal Genome Analysis
•  March 2014 ~ Treasure Data
–  Software Engineer, MPP Team Leader
•  Open source projects at GitHub
–  snappy-java, msgpack-java, sqlite-jdbc
–  sbt-pack, sbt-sonatype, larray
–  silk
•  Distributed workflow engine
2
Treasure Data (architecture diagram)
•  TD API / Web Console
–  Hive: batch query
–  Presto: interactive query
•  td-presto connector
•  PlazmaDB: MessagePack Columnar Storage
What is Presto?
•  A distributed SQL Engine developed by Facebook
–  For interactive analysis on petabyte-scale datasets
•  As a replacement for Hive
–  Nov. 2013: Open sourced at GitHub
•  Presto
–  Written in Java
–  In-memory query layer
–  CPU efficient for ad-hoc analysis
–  Based on ANSI SQL
–  Isolation of query layer and storage access layer
•  A connector provides data access (reading schema and records)
4
Presto: Distributed SQL Engine
5
Diagram labels:
•  Tailored to throughput
•  CPU-intensive. Faster response time
•  Fault tolerant
•  TD Presto has its own query retry mechanism
Treasure Data: Presto as a Service
6
Presto Public Release
Topics
•  Challenges in providing Database as a Service
•  TD Presto Connector
–  Optimizing Scan Performance
–  Multi-tenancy Cluster Management
•  Resource allocation
•  Monitoring
•  Query Tuning
7
Optimizing Scan Performance
•  Fully utilize the network bandwidth from S3
•  TD Presto then becomes CPU-bound
•  Scan pipeline (diagram)
–  TableScanOperator: S3 file list, table schema; pulls records
–  Request Queue: priority queue, max connections limit
–  HeaderReader: header request to S3 / Riak CS; callback to HeaderParser
–  HeaderParser: parses the MPC file header (column block offsets, column names)
–  ColumnBlockReader: column block requests; prepares a MessageUnpacker for each column block
–  S3 read: retry GET request on 500 (internal error), 503 (slow down), 404 (not found): eventual consistency
–  Buffer: buffer size limit; reuse allocated buffers; release(Buffer)
–  MessageUnpacker: decompression, msgpack-java v07
•  MPC1 file layout
–  Header
–  Column Block 0 (column names)
–  Column Block 1, …, Column Block i, …, Column Block m
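The retry rule for S3 reads can be sketched roughly as follows. This is a minimal illustration, not the td-presto connector's code: `HttpError`, `get_with_retry`, and `make_flaky_fetch` are hypothetical names, and the exponential backoff with jitter is an assumption, since the slide only lists which status codes are retried.

```python
import random
import time

# Status codes the slide lists as retryable for S3 GET requests.
RETRYABLE_STATUS = {500, 503, 404}


class HttpError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed GET."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def get_with_retry(fetch, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fetch() until it succeeds, a non-retryable error occurs,
    or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except HttpError as e:
            if e.status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            # Assumed policy: exponential backoff plus random jitter.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))


def make_flaky_fetch(failures, payload=b"column block"):
    """Build a fake fetch() that fails with the given statuses, then succeeds."""
    remaining = list(failures)

    def fetch():
        if remaining:
            raise HttpError(remaining.pop(0))
        return payload

    return fetch
```

For example, `get_with_retry(make_flaky_fetch([503, 404]))` returns the payload on the third attempt, while a 403 is raised immediately because it is not in the retryable set.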
MessageBuffer
•  msgpack-java v06 was the bottleneck
–  Inefficient buffer access
•  v07
–  Fast memory access
•  sun.misc.Unsafe
•  Direct access to heap memory
–  Extracts primitive type values from byte[]
•  cast
•  No boxing
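Python has no counterpart to sun.misc.Unsafe, so the following is only a loose analogue of "extract primitive type value from byte[]": it reads fixed-width big-endian values at a given offset of a byte buffer without slicing or copying it. The helper names `read_int32` and `read_float64` are made up for this sketch; it does not reproduce msgpack-java's MessageBuffer.

```python
import struct


def read_int32(buf, offset):
    """Read a big-endian signed 32-bit integer at the given offset."""
    return struct.unpack_from(">i", buf, offset)[0]


def read_float64(buf, offset):
    """Read a big-endian 64-bit float at the given offset."""
    return struct.unpack_from(">d", buf, offset)[0]


# Two MessagePack values back to back:
# prefix 0xD2 (int 32) + 4 bytes, then prefix 0xCB (float 64) + 8 bytes.
buf = bytes([0xD2]) + struct.pack(">i", 256) + bytes([0xCB]) + struct.pack(">d", 1.5)

# A memoryview exposes the same bytes without copying them.
view = memoryview(buf)
int_value = read_int32(view, 1)      # payload starts after the 1-byte prefix
float_value = read_float64(view, 6)  # 1 + 4 bytes of int, 1 byte of prefix
```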
9
Unsafe memory access performance is comparable to C
•  http://frsyuki.hatenablog.com/entry/2014/03/12/155231
10
Why is ByteBuffer slow?
•  Following good programming practice
–  Define an interface, then implement classes
•  ByteBuffer (an abstract class in the JDK) has HeapByteBuffer and DirectByteBuffer implementations
•  In reality: TypeProfile slows down method access
–  JVM generates a look-up table of method implementations
–  Simply importing one or more classes generates a TypeProfile
•  v07 avoids TypeProfile generation
–  Loads an implementation class through reflection
11
Format Type Detection
•  MessageUnpacker
–  read prefix: 1 byte
–  detect format type
•  switch-case
–  ANTLR generates this type of code
12
Format Type Detection
•  Using a cache-efficient lookup table: 20000x faster
13
2x performance improvement in v07
14
Database As A Service
15
Claremont Report on Database Research
•  Discussion on the future of DBMS
–  Top researchers, vendors and practitioners
–  CACM, Vol. 52 No. 6, 2009
•  Predicts the emergence of Cloud Data Services
–  SQL has an important role
•  limited functionality
•  suited for service providers
–  A difficult example: Spark
•  Needs a secure application container to run arbitrary Scala code
16
Beckman Report on Database Research
•  2013
–  http://beckman.cs.wisc.edu/beckman-report2013.pdf
–  Topics of Big-Data
•  End-to-end service
–  From data collection to knowledge
•  Cloud Service has become popular
–  IaaS, PaaS, SaaS
–  The challenge is to migrate all DBMS functionality into the cloud
17
Big Data Simplified: The Treasure Data Approach
•  Data sources send multi-structured events (register, login, start_event, purchase, etc.)
–  AppServers
–  Mobile SDKs, Web SDK, Embedded SDKs
–  Server-side Agents (Treasure Agent)
•  Infinite & Economical Cloud Data Store
–  App log data
–  Mobile event data
–  Sensor data
–  Telemetry
•  Familiar & Table-oriented: SQL
–  SQL-based Ad-hoc Queries
–  SQL-based Dashboards
–  Results Push to DBs & Data Marts and Other Apps
18
Challenges in Database as a Service
•  Tradeoffs
–  Cost and service level objectives (SLOs)
•  Reference
–  Workload Management for Big Data Analytics. A. Aboulnaga [SIGMOD2013 Tutorial]
19
•  Run each query set on an independent cluster
–  Fast, but expensive ($$$)
•  Run all queries together on the smallest possible cluster
–  Reasonable price, but limited performance guarantee
Shift of Presto Query Usage
•  Initial phase
–  Trial and error with queries
•  Many syntax errors, semantic errors
•  Next phase
–  Scheduled query execution
•  Increased Presto query usage
–  Some customers submit more than 1,000 Presto queries / day
–  Establishing typical query patterns
•  hourly, daily reports
•  query templates
•  Advanced phase: More elaborate data analysis
–  Complex queries
•  via data scientists and data analysts
–  High resource usage
20
Usage Shift: Simple to Complex queries
21
Monitoring Presto Usage with Fluentd
22
Hive
Presto
DataDog
•  Monitoring CPU, memory and network usage
•  Query stats
23
Query Collection in TD
•  SQL query logs
–  query, detailed query plan, elapsed time, processed rows, etc.
•  Presto is used for analyzing the query history
24
Daily/Hourly Query Usage
25
Query Running Time
•  More than 90% of queries finish within 2 min.
–  The expected response time for interactive queries
26
Processed Rows of Queries
27
Performance
•  Processed rows / sec. of a query
28
Collecting Recoverable Error Patterns
•  Presto has no fault tolerance
•  Error types
–  User error
•  Syntax errors
–  SQL syntax, missing function
•  Semantic errors
–  missing tables/columns
–  Insufficient resource
•  Exceeded task memory size
–  Internal failure
•  I/O error
–  S3/Riak CS
•  worker failure
•  etc.
29
TD Presto retries these queries
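The error taxonomy above suggests a simple retry decision: classify the failure, then retry only the class that a retry can actually fix. The sketch below assumes that the "retries these queries" annotation refers to internal failures, which the next slide title also suggests. The error codes are invented for illustration and are not Presto's real error names.

```python
from enum import Enum


class ErrorKind(Enum):
    USER_ERROR = "user error"
    INSUFFICIENT_RESOURCE = "insufficient resource"
    INTERNAL_FAILURE = "internal failure"


# Hypothetical error codes mapped onto the categories from the slide.
ERROR_KINDS = {
    "SYNTAX_ERROR": ErrorKind.USER_ERROR,
    "MISSING_FUNCTION": ErrorKind.USER_ERROR,
    "MISSING_TABLE": ErrorKind.USER_ERROR,
    "MISSING_COLUMN": ErrorKind.USER_ERROR,
    "EXCEEDED_TASK_MEMORY": ErrorKind.INSUFFICIENT_RESOURCE,
    "IO_ERROR": ErrorKind.INTERNAL_FAILURE,
    "WORKER_FAILURE": ErrorKind.INTERNAL_FAILURE,
}


def classify(code):
    """Return the ErrorKind for a code, or None if the code is unknown."""
    return ERROR_KINDS.get(code)


def should_retry(code, attempt, max_retries=3):
    """Retry only internal failures, and only up to max_retries attempts.

    User errors and resource errors would fail again in the same way,
    so they are reported to the user instead. Unknown codes are not retried.
    """
    return classify(code) is ErrorKind.INTERNAL_FAILURE and attempt <= max_retries
```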
Query Retry on Internal Errors
•  More than 99.8% of queries finish without errors
30
Query Retry on Internal Errors (log scale)
•  Queries succeed eventually
31
Multi-tenancy: Resource Allocation
•  Price-plan based resource allocation
•  Parameters
–  The number of worker nodes to use (min-candidates)
–  The number of hash partitions (initial-hash-partitions)
–  The maximum number of running tasks per account
•  If running queries exceed the allowed number of tasks, subsequent queries must wait (queued)
•  Presto: SqlQueryExecution class
–  Controls query execution state: planning -> running -> finished
•  No resource allocation policy
–  Extended TDSqlQueryExection class monitors running tasks and limits resource usage
•  Rewriting SqlQueryExecutionFactory at run-time by using the ASM library
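The per-account task limit can be illustrated with a small sketch. It is not the TD implementation: `AccountTaskLimiter` and its methods are hypothetical names, and the FIFO ordering and the rule that an idle account always admits its next query (so an oversized query cannot block the queue forever) are assumptions.

```python
from collections import deque


class AccountTaskLimiter:
    """Tracks running tasks for one account and queues queries over the limit."""

    def __init__(self, max_tasks):
        self.max_tasks = max_tasks
        self.running = {}      # query_id -> number of tasks
        self.queued = deque()  # (query_id, tasks) in arrival order

    def running_tasks(self):
        return sum(self.running.values())

    def _fits(self, tasks):
        # Assumption: an idle account always admits the next query.
        return not self.running or self.running_tasks() + tasks <= self.max_tasks

    def submit(self, query_id, tasks):
        """Start the query if it fits; otherwise queue it."""
        if not self.queued and self._fits(tasks):
            self.running[query_id] = tasks
            return "running"
        self.queued.append((query_id, tasks))
        return "queued"

    def finish(self, query_id):
        """Release the query's tasks and start queued queries in FIFO order."""
        del self.running[query_id]
        started = []
        while self.queued and self._fits(self.queued[0][1]):
            next_id, tasks = self.queued.popleft()
            self.running[next_id] = tasks
            started.append(next_id)
        return started
```

With a limit of 10 tasks, a 6-task query runs, a second 6-task query waits, and a later 2-task query waits behind it; when the first query finishes, both queued queries start.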
32
Query Queue
•  Presto 0.97
–  Introduces user-wise query queues
•  Can limit the number of concurrent queries per user
•  Problem
–  Running too many queries degrades overall query performance
33
Customer Feedback
•  Feedback:
–  We don’t care if large queries take a long time
–  But interactive queries should run immediately
•  Challenges
–  How do we allocate resources even if preceding queries occupy the customer's share of resources?
–  How do we know whether a submitted query is an interactive one?
34
Admission control is necessary
•  Adjust resource utilization
–  Running Drivers (Splits)
–  MPL (Multi-Programming Level)
35
Challenge: Auto Scaling
•  Setting the cluster size based on the peak usage is expensive
•  But predicting customer usage is difficult
36
Typical Query Patterns [Li Juang]
•  Q: What are typical queries of a customer?
–  Customer feels some queries are slow
–  But we don’t know what to compare with, except scheduled queries
•  Approach: Clustering Customer SQLs
•  TF/IDF measure: TF x IDF vector
–  Split SQL statements into tokens
–  Term frequency (TF) = the number of occurrences of each term in a query
–  Inverse document frequency (IDF) = log (# of queries / # of queries that contain the token)
•  k-means clustering
–  TF/IDF vector
–  Generates clusters of similar queries
•  x-means clustering for deciding number of clusters automatically
–  D. Pelleg [ICML2000]
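The TF x IDF vectors described above could be computed along these lines. The tokenizer regex, the function names, and the cosine similarity helper are assumptions made for this sketch; the clustering step itself (k-means or x-means) is not shown and would take these vectors as input.

```python
import math
import re
from collections import Counter


def tokenize(sql):
    """Split a SQL statement into lowercase identifiers, numbers and symbols."""
    return re.findall(r"[a-z_][a-z0-9_]*|\d+|[^\sa-z0-9_]", sql.lower())


def tf_idf_vectors(queries):
    """Return one sparse {token: TF x IDF} vector per query."""
    docs = [Counter(tokenize(q)) for q in queries]  # TF: term counts per query
    n = len(docs)
    # Document frequency: number of queries that contain each token.
    df = Counter(tok for doc in docs for tok in doc)
    idf = {tok: math.log(n / df[tok]) for tok in df}
    return [{tok: tf * idf[tok] for tok, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


queries = [
    "select count(*) from access where time > 100",
    "select count(*) from access where time > 200",
    "select user, sum(price) from purchase group by user",
]
vectors = tf_idf_vectors(queries)
```

Tokens that appear in every query, such as `select` and `from`, get an IDF of log(1) = 0 and drop out, so the first two queries come out as similar while the third shares no weighted token with them.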
37
Problematic Queries
•  90% of queries finish within 2 min.
–  But the remaining 10% is still large
•  10% of 10,000 queries is 1,000.
•  Long-running queries
•  Hog queries
38
Long Running Queries
•  Typical bottlenecks
–  Cross joins
–  IN (a, b, c, …)
•  semi-join filtering process is slow
–  Complex scan condition
•  pushing down selection
•  but delays column scan
–  Tuple materialization
•  coordinator generates json data
–  Many aggregation columns
•  group by 1, 2, 3, 4, 5, 6, …
–  Full scan
•  Scanning 100 billion rows…
•  Adding more resources does not always make a query faster
•  Storing intermediate data on disk is necessary
39
Diagram labels: slow process; fast; fast; results are buffered (waiting for fetch)
Hog Query
•  Queries consuming a lot of CPU/memory resources
–  Coined in S. Krompass et al. [EDBT2009]
•  Example:
–  select 1 as day, count(…) from … where time <= current_date - interval '1' day
union all
select 2 as day, count(…) from … where time <= current_date - interval '2' day
union all
…
(up to 190 days)
•  More than 1000 query stages.
•  Presto tries to run all of the stages at once.
–  High CPU usage at coordinator
40
Query Rewriting? Plan Optimization?
•  Query rewriting (better)
–  With group by and window functions
–  Not a perfect solution
•  Need to understand the meaning of the query
•  Semantic change is not allowed
–  e.g., we cannot rewrite UNION to UNION ALL
–  UNION includes duplicate elimination
•  Workaround idea
–  Bushy plan -> Deep plan
–  Introduce stage-wise resource assignment
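To see why a group-by plus window-function rewrite can replace the UNION ALL hog query, the sketch below models both forms in plain Python and checks that they give the same counts. It assumes each row is reduced to its age in whole days before the current date; the function names are hypothetical, and the code models only the query semantics, not Presto's execution.

```python
from collections import Counter
from itertools import accumulate


def counts_by_union_all(ages, max_days):
    """One full scan per branch, like each SELECT in the UNION ALL query:
    branch d counts the rows that are at least d days old."""
    return [sum(1 for a in ages if a >= d) for d in range(1, max_days + 1)]


def counts_by_group_and_window(ages, max_days):
    """One scan: group rows by age bucket, then take a running sum."""
    # GROUP BY: rows older than max_days share the last bucket.
    buckets = Counter(min(a, max_days) for a in ages if a >= 1)
    # Window function: cumulative sum from the oldest bucket to the newest.
    oldest_first = [buckets.get(d, 0) for d in range(max_days, 0, -1)]
    running = list(accumulate(oldest_first))
    return running[::-1]  # reorder as day 1, day 2, ..., max_days


ages = [0, 1, 1, 3, 5, 200]
per_branch = counts_by_union_all(ages, 4)
single_scan = counts_by_group_and_window(ages, 4)
```

Both forms return the same counts here, but the first one scans the data once per day (190 times in the slide's example), whereas the second scans it once.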
41
Future Work
•  Reducing Queuing/Response Time
–  Introducing shared queue between customers
•  For utilizing remaining cluster resources
–  Fair-Scheduling: C. Gupta [EDBT2009]
–  Self-tuning DBMS. S. Chaudhuri [VLDB2007]
•  Adjusting Running Query Size (hard)
–  Limiting driver resources to the minimum possible for hog queries
–  Query plan based cost estimation
•  Predicting Query Running Time
–  J. Duggan [SIGMOD2011], A.C. Konig [VLDB2011]
42
Summary: Treasures in Treasure Data
•  Treasures for our customers
–  Data collected by fluentd (td-agent)
–  Query analysis platform
–  Query results - values
•  For Treasure Data
–  SQL query logs
•  Stored in Treasure Data
–  We know how customers use SQL
•  Typical queries and failures
–  We know which parts of a query can be improved
43

Internals of Presto Service

  • 1. Internals of Presto Service Taro L. Saito, Treasure Data leo@treasure-data.com March 11-12th, 2015 Treasure Data Tech Talk #1 at Tokyo
  • 2. Taro L. Saito @taroleo •  2007 University of Tokyo. Ph.D. –  XML DBMS, Transaction Processing •  Relational-Style XML Query [SIGMOD 2008] •  ~ 2014 Assistant Professor at University of Tokyo –  Genome Science Research •  Distributed Computing, Personal Genome Analysis •  March 2014 ~ Treasure Data –  Software Engineer, MPP Team Leader •  Open source projects at GitHub –  snappy-java, msgpack-java, sqlite-jdbc –  sbt-pack, sbt-sonatype, larray –  silk •  Distributed workflow engine 2
  • 3. Hive TD API / Web Console batch query Presto Treasure Data PlazmaDB: MessagePack Columnar Storage td-presto connector Interactive query
  • 4. What is Presto? •  A distributed SQL engine developed by Facebook –  For interactive analysis on petabyte-scale datasets •  As a replacement for Hive –  Nov. 2013: Open sourced on GitHub •  Presto –  Written in Java –  In-memory query layer –  CPU efficient for ad-hoc analysis –  Based on ANSI SQL –  Isolation of query layer and storage access layer •  A connector provides data access (reading schema and records)
  • 5. Presto: Distributed SQL Engine [Figure labels: TD Presto has its own query retry mechanism; Tailored to throughput; CPU-intensive, faster response time; Fault tolerant]
  • 6. Treasure Data: Presto as a Service 6 Presto Public Release
  • 7. Topics •  Challenges in providing Database as a Service •  TD Presto Connector –  Optimizing Scan Performance –  Multi-tenancy Cluster Management •  Resource allocation •  Monitoring •  Query Tuning 7
  • 8. Optimizing Scan Performance •  Fully utilize the network bandwidth from S3 •  TD Presto then becomes CPU-bound [Figure components] •  TableScanOperator: S3 file list, table schema; issues the header request and column block requests •  Request Queue: priority queue, max connections limit •  S3 / RiakCS: S3 reads into buffers •  Buffers: buffer size limit, reuse allocated buffers, release(Buffer) •  MPC1 file: Header, Column Block 0 (column names), Column Block 1, …, Column Block i, …, Column Block m •  HeaderReader: callback to HeaderParser •  HeaderParser: parse MPC file header (column block offsets, column names) •  ColumnBlockReader: prepares a MessageUnpacker per column block (decompression, msgpack-java v07); records are pulled from the unpackers •  Retry GET request on 500 (internal error), 503 (slow down), 404 (not found), eventual consistency
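The retry rule on this slide can be sketched as a small loop. This is only an illustration under assumptions: `get_with_retry`, the `fetch` callback, the attempt limit, and the exponential backoff are hypothetical and are not taken from the TD Presto connector; only the retryable status codes (500, 503, 404) come from the slide.

```python
import time

# Status codes the slide lists as worth retrying.
RETRYABLE_STATUS = {500, 503, 404}


def get_with_retry(fetch, key, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fetch(key) -> (status, body) and retry on retryable statuses.

    fetch is a hypothetical callback standing in for an S3 / RiakCS GET.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(key)
        if status == 200:
            return body
        if status not in RETRYABLE_STATUS or attempt == max_attempts:
            raise IOError(f"GET {key} failed with status {status} "
                          f"after {attempt} attempt(s)")
        # Exponential backoff is an assumption; the slide gives no delay policy.
        sleep(base_delay * 2 ** (attempt - 1))
```

Retrying on 404 only makes sense here because a freshly written object may not be visible yet under eventual consistency; for a key that truly does not exist, the loop gives up after `max_attempts`.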
  • 9. MessageBuffer •  msgpack-java v06 was the bottleneck –  Inefficient buffer access •  v07 •  Fast memory access •  sun.misc.Unsafe •  Direct access to heap memory •  extract primitive type value from byte[] •  cast •  No boxing 9
  • 10. Unsafe memory access performance is comparable to C •  http://frsyuki.hatenablog.com/entry/2014/03/12/155231 10
  • 11. Why is ByteBuffer slow? •  Following good programming practice –  Define an interface, then implement classes •  The ByteBuffer interface has HeapByteBuffer and DirectByteBuffer implementations •  In reality: TypeProfile slows down method access –  The JVM generates a look-up table of method implementations –  Simply importing one or more classes generates a TypeProfile •  v07 avoids TypeProfile generation –  Loads an implementation class through reflection
  • 12. Format Type Detection •  MessageUnpacker –  read prefix: 1 byte –  detect format type •  switch-case –  ANTLR generates this type of code
  • 13. Format Type Detection •  Using cache-efficient lookup table: 20000x faster 13
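A lookup table for format type detection can be sketched as follows. This is a minimal illustration, not the msgpack-java implementation: the function names and format-name strings are assumptions, the byte ranges follow the MessagePack specification, and Python will not reproduce the speedup quoted on the slide. The idea is to run the branching classifier once per possible prefix byte and afterwards replace every detection with a single array index.

```python
# Names for the single-byte prefixes 0xC0..0xDF, in order (MessagePack spec).
_C0_TO_DF = [
    "nil", "(never used)", "false", "true",
    "bin 8", "bin 16", "bin 32", "ext 8", "ext 16", "ext 32",
    "float 32", "float 64", "uint 8", "uint 16", "uint 32", "uint 64",
    "int 8", "int 16", "int 32", "int 64",
    "fixext 1", "fixext 2", "fixext 4", "fixext 8", "fixext 16",
    "str 8", "str 16", "str 32", "array 16", "array 32", "map 16", "map 32",
]


def classify_prefix(b):
    """Branching classifier, comparable to a switch-case over the prefix byte."""
    if b <= 0x7F:
        return "positive fixint"
    if b <= 0x8F:
        return "fixmap"
    if b <= 0x9F:
        return "fixarray"
    if b <= 0xBF:
        return "fixstr"
    if b >= 0xE0:
        return "negative fixint"
    return _C0_TO_DF[b - 0xC0]


# Built once: one entry per possible prefix byte.
FORMAT_TABLE = tuple(classify_prefix(b) for b in range(256))


def format_of(prefix):
    """Detect the format type with one table lookup and no branching."""
    return FORMAT_TABLE[prefix & 0xFF]
```

A 256-entry table is small enough to stay in CPU cache, which is the property the slide credits for the speedup.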
  • 15. Database As A Service 15
  • 16. Claremont Report on Database Research •  Discussion on future of DBMS –  Top researchers, vendors and practitioners. –  CACM, Vol. 52 No. 6, 2009 •  Predicts emergence of Cloud Data Service –  SQL has an important role •  limited functionality •  suited for service provider –  A difficult example: Spark  •  Need a secure application container to run arbitrary Scala code. 16
  • 17. Beckman Report on Database Research •  2013 –  http://beckman.cs.wisc.edu/beckman-report2013.pdf –  Topics of Big-Data •  End-to-end service –  From data collection to knowledge •  Cloud Service has become popular –  IaaS, PaaS, SaaS –  Challenge is to migrate all of the functionalities of DBMS into Cloud 17
  • 18. Results Push Results Push SQL Big Data Simplified: The Treasure Data Approach AppServers Multi-structured Events! •  register! •  login! •  start_event! •  purchase! •  etc! SQL-based Ad-hoc Queries SQL-based Dashboards DBs & Data Marts Other Apps Familiar & Table-oriented Infinite & Economical Cloud Data Store ü  App log data! ü  Mobile event data! ü  Sensor data! ü  Telemetry! Mobile SDKs Web SDK Multi-structured Events Multi-structured Events Treasure Agent Treasure Agent Treasure Agent Treasure Agent Treasure Agent Treasure Agent Treasure Agent Treasure Agent Embedded SDKs Server-side Agents 18
  • 19. Challenges in Database as a Service •  Tradeoffs –  Cost and service level objectives (SLOs) •  Reference –  Workload Management for Big Data Analytics. A. Aboulnaga [SIGMOD2013 Tutorial] 19 Run each query set on an independent cluster Run all queries together on the smallest possible cluster Fast $$$ Limited performance guarantee Reasonable price
  • 20. Shift of Presto Query Usage •  Initial phase –  Trial and error with queries •  Many syntax errors, semantic errors •  Next phase –  Scheduled query execution •  Increased Presto query usage –  Some customers submit more than 1,000 Presto queries / day –  Establishing typical query patterns •  hourly, daily reports •  query templates •  Advanced phase: More elaborate data analysis –  Complex queries •  via data scientists and data analysts –  High resource usage
  • 21. Usage Shift: Simple to Complex queries 21
  • 22. Monitoring Presto Usage with Fluentd 22 Hive Presto
  • 23. DataDog •  Monitoring CPU, memory and network usage •  Query stats 23
  • 24. Query Collection in TD •  SQL query logs –  query, detailed query plan, elapsed time, processed rows, etc. •  Presto is used for analyzing the query history 24
  • 26. Query Running Time •  More than 90% of queries finish within 2 min., the expected response time for interactive queries
  • 27. Processed Rows of Queries 27
  • 28. Performance •  Processed rows / sec. of a query 28
  • 29. Collecting Recoverable Error Patterns •  Presto has no fault tolerance •  Error types –  User error •  Syntax errors –  SQL syntax, missing function •  Semantic errors –  missing tables/columns –  Insufficient resource •  Exceeded task memory size –  Internal failure •  I/O error –  S3/Riak CS •  worker failure •  etc. [Figure annotation: TD Presto retries these queries]
  • 30. Query Retry on Internal Errors •  More than 99.8% of queries finish without errors
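The error taxonomy on this slide can be sketched as a classification table. This is an assumption-laden illustration: the error code strings are hypothetical labels, not Presto error codes, and the sketch assumes that the slide's note about retried queries refers to the internal-failure category (the next slide's title, "Query Retry on Internal Errors", points the same way).

```python
from enum import Enum


class ErrorKind(Enum):
    USER = "user error"
    INSUFFICIENT_RESOURCE = "insufficient resource"
    INTERNAL = "internal failure"


# Hypothetical error labels mapped to the categories named on the slide.
ERROR_KINDS = {
    "SYNTAX_ERROR": ErrorKind.USER,
    "MISSING_FUNCTION": ErrorKind.USER,
    "MISSING_TABLE": ErrorKind.USER,
    "MISSING_COLUMN": ErrorKind.USER,
    "EXCEEDED_TASK_MEMORY": ErrorKind.INSUFFICIENT_RESOURCE,
    "STORAGE_IO_ERROR": ErrorKind.INTERNAL,
    "WORKER_FAILURE": ErrorKind.INTERNAL,
}


def classify(code):
    """Return the ErrorKind for a known label, or None for an unknown one."""
    return ERROR_KINDS.get(code)


def is_retryable(code):
    """Only internal failures are retried; unknown errors are not (assumption)."""
    return classify(code) is ErrorKind.INTERNAL
```

Retrying a user error or a memory-limit failure would fail the same way again, so only failures that may be transient are worth resubmitting.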
  • 31. Query Retry on Internal Errors (log scale) •  Queries succeed eventually 31
  • 32. Multi-tenancy: Resource Allocation •  Price-plan based resource allocation •  Parameters –  The number of worker nodes to use (min-candidates) –  The number of hash partitions (initial-hash-partitions) –  The maximum number of running tasks per account •  If the running queries exceed the allowed number of tasks, subsequent queries must wait (queued) •  Presto: SqlQueryExecution class –  Controls query execution state: planning -> running -> finished •  No resource allocation policy –  Extended TDSqlQueryExection class monitors running tasks and limits resource usage •  Rewriting SqlQueryExecutionFactory at run-time by using the ASM library
  • 33. Query Queue •  Presto 0.97 –  Introduces user-wise query queues •  Can limit the number of concurrent queries per user •  Problem –  Running too many queries degrades overall query performance
  • 34. Customer Feedback •  A feedback: –  We don’t care if large queries take a long time –  But interactive queries should run immediately •  Challenges –  How do we allocate resources even if preceding queries occupy the customer's share of resources? –  How do we know a submitted query is an interactive one?
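The per-account task limit can be sketched as a small admission queue. This is a hypothetical illustration and not the TD extension of SqlQueryExecution: the class name, the FIFO ordering, and the rule that a query larger than the whole limit may run when the account is otherwise idle are all assumptions.

```python
from collections import defaultdict, deque


class AccountTaskLimiter:
    """Queue queries per account once the account's running-task limit is hit."""

    def __init__(self, max_tasks_per_account):
        self.max_tasks = dict(max_tasks_per_account)  # account -> task limit
        self.running = defaultdict(int)               # account -> running tasks
        self.queued = defaultdict(deque)              # account -> waiting queries
        self.tasks_of = {}                            # (account, query) -> tasks

    def _fits(self, account, tasks):
        # An idle account may always start one query, even an oversized one,
        # so that no query waits forever (assumption).
        used = self.running[account]
        return used == 0 or used + tasks <= self.max_tasks[account]

    def _start(self, account, query_id, tasks):
        self.running[account] += tasks
        self.tasks_of[(account, query_id)] = tasks

    def submit(self, account, query_id, tasks):
        """Return True if the query starts now, False if it is queued."""
        if not self.queued[account] and self._fits(account, tasks):
            self._start(account, query_id, tasks)
            return True
        self.queued[account].append((query_id, tasks))
        return False

    def finish(self, account, query_id):
        """Release a query's tasks and start queued queries in FIFO order."""
        self.running[account] -= self.tasks_of.pop((account, query_id))
        started = []
        queue = self.queued[account]
        while queue and self._fits(account, queue[0][1]):
            next_id, tasks = queue.popleft()
            self._start(account, next_id, tasks)
            started.append(next_id)
        return started
```

For example, with a limit of 10 tasks, a second 6-task query waits until the first one finishes. Because each account has its own queue, one account's backlog does not block another account's queries.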
  • 35. Admission control is necessary •  Adjust resource utilization –  Running Drivers (Splits) –  MPL (Multi-Programming Level) 35
  • 36. Challenge: Auto Scaling •  Setting the cluster size based on the peak usage is expensive •  But predicting customer usage is difficult 36
  • 37. Typical Query Patterns [Li Juang] •  Q: What are typical queries of a customer? –  Customer feels some queries are slow –  But we don’t know what to compare with, except scheduled queries •  Approach: Clustering Customer SQLs •  TF/IDF measure: TF x IDF vector –  Split SQL statements into tokens –  Term frequency (TF) = the number of each term in a query –  Inverse document frequency (IDF) = log (# of queries / # of queries that have a token) •  k-means clustering –  TF/IDF vector –  Generates clusters of similar queries •  x-means clustering for deciding number of clusters automatically –  D. Pelleg [ICML2000] 37
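The TF/IDF step can be sketched with the standard library alone. This is a minimal illustration under assumptions: the tokenizer and function names are hypothetical, the sample queries used with it are made up, and the k-means / x-means clustering that would consume these vectors is not shown.

```python
import math
import re
from collections import Counter


def tokenize(sql):
    """Split a SQL statement into lower-cased identifiers, numbers, and symbols."""
    return re.findall(r"[a-z_][a-z0-9_]*|\d+|[^\sa-z0-9_]", sql.lower())


def tfidf_vectors(queries):
    """Return (vocabulary, vectors) with one TF x IDF vector per query."""
    docs = [Counter(tokenize(q)) for q in queries]
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # number of queries that contain each token
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) for t in vocab}
    # TF = occurrences of the token in the query, as defined on the slide.
    return vocab, [[doc[t] * idf[t] for t in vocab] for doc in docs]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Tokens that occur in every query, such as `select` and `from`, get an IDF of log(1) = 0 and drop out, so similarity is driven by the tokens that distinguish one query template from another.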
  • 38. Problematic Queries •  90% of queries finish within 2 min. –  But the remaining 10% is still large •  10% of 10,000 queries is 1,000. •  Long-running queries •  Hog queries
  • 39. Long Running Queries •  Typical bottlenecks –  Cross joins –  IN (a, b, c, …) •  semi-join filtering process is slow –  Complex scan condition •  pushing down selection •  but delays column scan –  Tuple materialization •  coordinator generates json data –  Many aggregation columns •  group by 1, 2, 3, 4, 5, 6, … –  Full scan •  Scanning 100 billion rows… •  Adding more resources does not always make a query faster •  Storing intermediate data to disks is necessary [Figure labels: Results are buffered (waiting for fetch); slow process; fast; fast]
  • 40. Hog Query •  Queries consuming a lot of CPU/memory resources –  Coined in S. Krompass et al. [EDBT2009] •  Example: –  select 1 as day, count(…) from … where time <= current_date - interval 1 day union all select 2 as day, count(…) from … where time <= current_date - interval 2 day union all –  … –  (up to 190 days) •  More than 1000 query stages. •  Presto tries to run all of the stages at once. –  High CPU usage at coordinator 40
  • 41. Query Rewriting? Plan Optimization? •  Query rewriting (better) –  With group by and window functions –  Not a perfect solution •  Need to understand the meaning of the query •  Semantic change is not allowed –  e.g., We cannot rewrite UNION to UNION ALL –  UNION includes duplicate elimination •  Workaround Idea –  Bushy plan -> Deep plan –  Introduce stage-wise resource assignment
  • 42. Future Work •  Reducing Queuing/Response Time –  Introducing a shared queue between customers •  For utilizing remaining cluster resources –  Fair-Scheduling: C. Gupta [EDBT2009] –  Self-tuning DBMS. S. Chaudhuri [VLDB2007] •  Adjusting Running Query Size (hard) –  Limiting driver resources to as little as possible for hog queries –  Query plan based cost estimation •  Predicting Query Running Time –  J. Duggan [SIGMOD2011], A.C. Konig [VLDB2011]
  • 43. Summary: Treasures in Treasure Data •  Treasures for our customers –  Data collected by fluentd (td-agent) –  Query analysis platform –  Query results - values •  For Treasure Data –  SQL query logs •  Stored in Treasure Data –  We know how customers use SQL •  Typical queries and failures –  We know which parts of a query can be improved