T R E A S U R E D A T A
Presto At Treasure Data
Presto Meetup @ Tokyo - June 15, 2017
Taro L. Saito - GitHub:@xerial
Ph.D., Software Engineer at Treasure Data, Inc.
1
Presto Usage at Treasure Data (2017)
Processing 15 Trillion Rows / Day (= 173 Million Rows / sec.)
150,000~ Queries / Day
1,500~ Users
Hosting Presto as a service for 3 years
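The per-second figure follows directly from the daily totals. A back-of-the-envelope check, assuming the rounded numbers on this slide (the per-query average is derived here and is not a figure from the talk):

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

rows_per_day = 15 * 10**12       # 15 trillion rows / day (from the slide)
queries_per_day = 150_000        # approximate, from the slide

rows_per_sec = rows_per_day / SECONDS_PER_DAY
queries_per_sec = queries_per_day / SECONDS_PER_DAY
rows_per_query = rows_per_day / queries_per_day  # derived average, not from the talk

print(f"{rows_per_sec / 1e6:.1f} million rows/sec")
print(f"{queries_per_sec:.2f} queries/sec")
print(f"{rows_per_query / 1e6:.0f} million rows/query on average")
```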
2
Configurations
• Hosted on AWS (us-east), AWS Tokyo, IDCF (Japan)
• Multi-Tenancy Clusters
• PlazmaDB
• Storage: Amazon S3 or RiakCS
• S3 file indexes: PostgreSQL
• Storage format: Columnar Message Pack (MPC)
• MessagePack: self-describing typed format.
• Compact: 10x compression ratio from the original input data (JSON)
• 200GB JVM memory per node
• To support a wide variety of query workloads
• Estimating required memory in advance is difficult
• To avoid the WAITING_FOR_MEMORY state, which blocks the entire query processing
• With a small-memory configuration, major GCs were quite frequent
3
Challenges
• Major Complaint
• Presto is slower than usual
• Only 20% of 150,000 queries are using our scheduling feature
• However, 85% of queries are actually scheduled by user scripts or third-party tools 

• How can we know the expected performance?
• (Implicit) Service Level Objectives (SLOs)
4
Understanding Implicit SLOs
• We usually looked into slow queries to figure out the performance bottlenecks.
• However, analyzing SQL takes a long time
• Because we need to understand the meaning of the data.
• Understanding a hundred lines of SQL is painful
• Created Presto Query Tuning Guides:
• Presto Query FAQs: https://docs.treasuredata.com/articles/presto-query-faq
• Performance Expectations
• Scheduled queries: We can estimate SLOs from historical stats
• Scheduled, but submitted from third-party tools or user scripts
• How do we know the expected performance?
• We need to internalize customer’s knowledge on query performance
5
Our Approach: Data-Driven Improvement
• Bad:
• Collecting stdout/stderr logs of Presto
• Good:
• Collecting logs in a queryable format with Presto
• Collecting Query Event Logs to Treasure Data
• Presto Event Listener -> fluentd -> Treasure Data
• Treasure Data
• Schema-less: the schema can be automatically generated from the data
• As we add new fields to the event, the schema evolves automatically
• We have been collecting every single query log since the beginning of the Presto service
• Diagram: Query Logs -> Store -> Analyze (SQL) -> Improve & Optimize
6
Query Event Logs
• Query Completion
• queryId, user id, session parameters, etc.
• Query stats: running time, total rows, bytes, splits, CPU time, etc.
• SQL statement
• Split Completion
• Running time, Processed rows, bytes, etc.
• S3 GET access count, read bytes
• Table Scan
• Accessed table names, column sets
• Accessed time ranges (e.g., queries looking at data of past 1 hour, 7 days, etc.)
• Filtering conditions (predicate)
7
Clustering Queries with Query Signature
• Finding Implicit SLOs
• Need to classify 85% of scheduled queries
• Extracting Query Signatures
• Simplify complex SQL expressions into a tiny SQL representation
• Reusing ANTLR parser of Presto
• Query Signature Example:
• S[Cnt](J(T1,G(S[Cnt](T2))))
• SELECT count(a),... FROM T1 JOIN (SELECT count(b),... FROM T2 GROUP BY x)
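A minimal sketch of how a signature like the one above could be derived. It assumes the SQL has already been parsed into a small tree; the node classes (`Table`, `Select`, `Join`, `GroupBy`) and the `signature` function are hypothetical stand-ins, not the Presto ANTLR parser or the actual Treasure Data implementation.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Table:
    name: str


@dataclass
class Select:
    aggregates: List[str]   # e.g. ["Cnt"] for count(...)
    source: "Node"


@dataclass
class Join:
    left: "Node"
    right: "Node"


@dataclass
class GroupBy:
    source: "Node"


Node = Union[Table, Select, Join, GroupBy]


def signature(node: Node) -> str:
    """Collapse a query tree into a short string; tables become T1, T2, ..."""
    table_ids = {}  # table name -> T<n>, numbered in order of first appearance

    def walk(n: Node) -> str:
        if isinstance(n, Table):
            return table_ids.setdefault(n.name, f"T{len(table_ids) + 1}")
        if isinstance(n, Select):
            aggs = f"[{','.join(n.aggregates)}]" if n.aggregates else ""
            return f"S{aggs}({walk(n.source)})"
        if isinstance(n, Join):
            return f"J({walk(n.left)},{walk(n.right)})"
        if isinstance(n, GroupBy):
            return f"G({walk(n.source)})"
        raise TypeError(f"unsupported node: {n!r}")

    return walk(node)


# Tree for: SELECT count(a) FROM orders JOIN (SELECT count(b) FROM users GROUP BY x)
query = Select(["Cnt"], Join(Table("orders"),
                             GroupBy(Select(["Cnt"], Table("users")))))
print(signature(query))  # S[Cnt](J(T1,G(S[Cnt](T2))))
```

Because table and column names are dropped, queries that differ only in those details map to the same signature and can be grouped together.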
8
Implicit SLOs
• Collect the historical query running times
• Queries that have the same query signature
• Median absolute deviation (MAD): the median of |running time - median|
• CoV: Coefficient of variation = MAD / median
• If CoV > 1, the query running time tends to vary
• If CoV < 1, the median of historical running times is useful for estimating query running time
• SLO violation:
• If a query is running longer than median + MAD
• The customer feels the query is slower than usual
• However, the query might be processing much more data than usual
• Normalization based on the processed data size is also necessary
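The check above can be sketched in a few lines. This assumes the standard definition of MAD (median of absolute deviations from the median) and uses made-up running times; `slo_check` is a hypothetical helper, and normalization by data size is left out.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(values)|."""
    m = median(values)
    return median([abs(v - m) for v in values])


def slo_check(history, running_time):
    """Compare one running time against the history of the same query signature."""
    m = median(history)
    d = mad(history)
    cov = d / m  # "CoV" as defined on the slide: MAD / median
    return {
        "median": m,
        "mad": d,
        "cov": cov,
        "predictable": cov < 1,               # median is a usable estimate
        "violation": running_time > m + d,    # slower than usual
    }


# Illustrative running times in seconds (one outlier at 120)
history = [60, 62, 58, 65, 61, 59, 120]
print(slo_check(history, 90))
print(slo_check(history, 62))
```

The median and MAD are used here instead of the mean and standard deviation because a single very slow run, like the 120-second one, barely moves them.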
9
Typical Performance Bottlenecks
• Huge Queries
• Frequent S3 access, wide table scans
• Single-node operators
• order by, window function, count(distinct x), processing skewed data, etc.
• Ill-performing worker nodes
• Heavy load on a single worker node
• Insufficient pool memory
• Major/full GCs
• We are using min.error-duration = 2m, but GC pause can be longer
• Too much resource usage
• A single query occupies the entire cluster
• e.g., A query with hundreds of query stages!
10
Split Resource Manager
• Problem: A single query can occupy the entire cluster's resources
• But Presto has limited performance controls
• Only CPU time, memory usage, and concurrent queries (CQ) limits
• No throttling or boosting
• Created Split Resource Manager
• Limiting the max runnable splits for each customer
• Using a custom RemoteTask class, which adds a wait if no splits are available
• => Efficient Use of Multi-Tenancy Cluster
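The idea of capping runnable splits per customer and waiting when none are available can be sketched with a condition variable. `SplitLimiter` and its methods are hypothetical names for illustration only; the real implementation lives inside a custom Presto RemoteTask (Java) and is not shown in the talk.

```python
import threading
from collections import defaultdict


class SplitLimiter:
    """Caps the number of concurrently running splits per customer."""

    def __init__(self, max_splits_per_customer: int):
        self._max = max_splits_per_customer
        self._running = defaultdict(int)   # customer -> running split count
        self._cond = threading.Condition()

    def acquire(self, customer: str, timeout: float = None) -> bool:
        """Wait until the customer is below its limit; False on timeout."""
        with self._cond:
            ok = self._cond.wait_for(
                lambda: self._running[customer] < self._max, timeout)
            if ok:
                self._running[customer] += 1
            return ok

    def release(self, customer: str) -> None:
        """Mark one split as finished and wake up waiting tasks."""
        with self._cond:
            self._running[customer] -= 1
            self._cond.notify_all()


limiter = SplitLimiter(max_splits_per_customer=2)
print(limiter.acquire("customer-a"))                # True
print(limiter.acquire("customer-a"))                # True
print(limiter.acquire("customer-a", timeout=0.01))  # False: limit reached
print(limiter.acquire("customer-b"))                # True: separate budget
```

Raising or lowering the limit for one customer would give the throttling and boosting that the built-in limits lack.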
11
Presto Ops Robot
• Problem: Insufficient memory of a worker
• Queries using that worker node enter WAITING_FOR_MEMORY state
• Report JMX metrics -> fluentd -> DataDog -> Trigger Alert -> Presto Ops Robot
• Presto Ops Robot
• Sending graceful shutdown command (POST SHUTTING_DOWN message to /v1/status)
• or killing memory-consuming queries on the worker node
• Restarting worker JVM process
• At least once a week, to avoid issues caused by running a JVM for a long time
• Resetting any effect caused by unknown bugs
• Useful for cleaning up untracked memory (e.g., ANTLR objects, etc.)
12
S3 Access Performance
• Problem: Slow Table Scan
• S3 GET request has constant latency
• 30ms ~ 50ms latency regardless of the read size (up to 8KB read)
• Retrying requests on 500 (Internal Error) or 503 (Slow Down) responses is also necessary
• Reading the small header part of S3 objects can account for the majority of query processing time
• Columnar format: header + column blocks
• IO Manager:
• Need to send as many S3 GET requests as possible
• 1 split = multiple S3 objects
• Pipelining S3 GET requests and column reads
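A minimal sketch of issuing the GET requests for one split concurrently, with retries on 500/503. The `fetch` callable stands in for a real S3 client call, and `S3Error`, `get_with_retry`, and `read_split` are hypothetical names; the actual IO Manager is not described in more detail in the talk.

```python
import time
from concurrent.futures import ThreadPoolExecutor

RETRYABLE_STATUS = {500, 503}


class S3Error(Exception):
    def __init__(self, status: int):
        super().__init__(f"S3 returned {status}")
        self.status = status


def get_with_retry(fetch, key, max_attempts=3, backoff_sec=0.01):
    """Call fetch(key), retrying on 500/503 with a linear backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(key)
        except S3Error as e:
            if e.status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            time.sleep(backoff_sec * attempt)


def read_split(fetch, keys, parallelism=8):
    """Yield (key, body) in key order while later GETs are still in flight."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() submits every request up front, so the fixed per-request
        # latency overlaps instead of adding up.
        bodies = pool.map(lambda k: get_with_retry(fetch, k), keys)
        for key, body in zip(keys, bodies):
            yield key, body
```

With a fixed 30 ms to 50 ms per request, eight requests in flight take roughly the time of one, which is why one split covers multiple S3 objects.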
13
Presto Stella: Plazma Storage Optimizer
• Problem:
• Some queries read 1 million partitions, so the S3 latency overhead is quite high
• Data from mobile applications often has a wide range of time values.
• Presto Stella Connector
• Using Presto for optimizing physical storage partitions
• Input records: File list on S3
• Table writer stage: Merges fragmented partitions and uploads them to S3
• Commit: Update S3 file indexes on PostgreSQL (in an atomic transaction)
• Performance Improvement
• e.g. 10,000 partitions (30 sec.) -> 20 partitions (1.5 sec.)
• 20x performance improvement
• Use Cases
• Maintain fragmented user-defined partitions
• 1-hour partitioning -> more flexible time range partitioning
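One way to plan such a merge is to group consecutive small files until a target size is reached. This is only a sketch under that assumption: `plan_merges`, the size-based policy, and the file names are hypothetical, and the talk does not say how Stella chooses its groups.

```python
def plan_merges(files, target_bytes):
    """Group consecutive (path, size) entries into merged partitions.

    A group is closed as soon as adding the next file would exceed
    target_bytes. Returns a list of lists of paths.
    """
    groups, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 10,000 fragmented 1 MB files merged toward 500 MB partitions
MB = 1024 * 1024
files = [(f"part-{i:05d}", 1 * MB) for i in range(10_000)]
merged = plan_merges(files, target_bytes=500 * MB)
print(len(files), "->", len(merged), "partitions")
```

Each group would then be written as a single object, and the file index would be switched over to the merged partitions in one transaction.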
14
Transitions of Database Usages
15
New Directions Explored By Presto
• Traditional Database Usage
• Required Database Administrator (DBA)
• DBA designs the schema and queries
• DBA tunes query performance
• After Presto
• Schema is designed by data providers
• 1st-party data (user’s customer data)
• 3rd party data sources
• Analysts or Marketers explore the data with Presto
• Don’t know the schema in advance
• Convenient, low-latency access is necessary
• SQL can be inefficient at first
• As users explore the data, the SQL can become more sophisticated, but not always
16
Prestobase Proxy: Low-Latency Access to Presto
• Needed a more interactive experience with Presto
• Prestobase Proxy: Gateway to Presto Coordinator
• Talks Presto Protocol (/v1/statement/…)
• Written in Scala.
• Runs on Docker
• Based on Finagle (RPC/HTTP server library developed by Twitter)
• Features
• Can work with standard presto clients (e.g., presto-cli, presto-jdbc, presto-odbc, etc.)
• Increased connectivity to BI tools: Tableau, Datorama, ChartIO, Looker, etc.
• Authentication (API key)
• Rewriting nextUri (internal IP address -> external host name)
• BI-tool specific query filters
• etc.
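The nextUri rewriting step can be sketched as follows. Presto query responses carry a `nextUri` field that points at the coordinator; the function name, the example addresses, and the query id below are made up for illustration, and the real proxy is written in Scala on Finagle.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_next_uri(response: dict, external_base: str) -> dict:
    """Return a copy of the response whose nextUri points at the proxy."""
    next_uri = response.get("nextUri")
    if next_uri is None:          # last page of results: nothing to rewrite
        return response
    ext = urlsplit(external_base)
    parts = urlsplit(next_uri)
    rewritten = urlunsplit(
        (ext.scheme, ext.netloc, parts.path, parts.query, parts.fragment))
    return {**response, "nextUri": rewritten}


internal = {
    "id": "20170615_000000_00001_abcde",
    "nextUri": "http://10.0.1.23:8080/v1/statement/20170615_000000_00001_abcde/1",
}
print(rewrite_next_uri(internal, "https://presto.example.com")["nextUri"])
```

Without this step a client outside the network would try to follow a link to an internal IP address that it cannot reach.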
17
Customizing Prestobase Filters
• Prestobase Proxy: Gateway to access Presto
• Adding TD specific binding
• Finagle filters -> Injecting TD Specific filters
• Using Airframe, a dependency injection library for Scala
18
Airframe
• http://wvlet.org/airframe
• Three-step DI in Scala
• Bind
• Design
• Build
• Built-in life cycle manager
• Session start/shutdown
• examples:
• Open/close Presto connection
• Shutting down Presto server
• etc.
• Session
• Manage singletons and binding rules
19
VCR Record/Replay for Testing Presto
• Launching Presto requires a lot of memory (e.g., 2GB or more)
• Often crashes CI service containers (TravisCI, CircleCI, etc.)
• Recording Presto responses (prestobase-vcr)
• with sqlite-jdbc: https://github.com/xerial/sqlite-jdbc
• DB file for each test suite
• Enabled small-memory footprint testing
• Can run many Presto tests in CI
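The record/replay idea can be sketched with Python's built-in `sqlite3` module. The `Vcr` class, the table layout, and keying recordings by the raw request string are assumptions for illustration; prestobase-vcr itself is a Scala module using sqlite-jdbc.

```python
import sqlite3


class Vcr:
    """Records backend responses in SQLite and replays them on later calls."""

    def __init__(self, db_path: str, backend=None):
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cassette ("
            "request TEXT PRIMARY KEY, response TEXT NOT NULL)")
        self._backend = backend   # None means replay-only (e.g. on CI)

    def send(self, request: str) -> str:
        row = self._db.execute(
            "SELECT response FROM cassette WHERE request = ?",
            (request,)).fetchone()
        if row is not None:
            return row[0]                       # replay
        if self._backend is None:
            raise LookupError(f"no recording for: {request}")
        response = self._backend(request)       # record
        with self._db:                          # commits on success
            self._db.execute(
                "INSERT INTO cassette (request, response) VALUES (?, ?)",
                (request, response))
        return response
```

A test suite would record once against a real Presto server into its own DB file, then run in replay-only mode without starting Presto.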
20
Optimizing QueryResults Transfer in Prestobase
• Accept: application/x-msgpack
• HTTP header
• Returning Presto query result rows in MessagePack format
• QueryResults object
• Contains Array<Array<Object>> => MessagePack (compact binary)
• Encoding QueryResults objects using MessagePack/Jackson
• https://github.com/msgpack/msgpack-java
• Presto client doesn’t need to parse the row part
• 1.5x ~ 2.0x performance improvement for streaming query results
21
Prestobase Modules
• prestobase-proxy
• Proxy server to access Presto with authentication
• prestobase-agent
• Agent for running Presto queries and storing their results
• prestobase-vcr
• For recording/replaying Presto responses
• prestobase-codec
• MessagePack codec of Presto query responses
• prestobase-hq (headquarter)
• Presto usage analysis pipelines, SLO monitoring, etc.
• prestobase-conductor
• Multi Presto cluster management tool
• td-prestobase
• Treasure Data specific bindings of prestobase
• TD Authentication, job logging/monitoring
• BI tool specific filters (Tableau, Looker, etc.)
22
Bridging Gaps Between SQL and Programming Language
• Traditional Approach
• O/R mapper: app developers design objects and schema, then generate SQL
• New Approach: SQL First
• Need to manage various SQL results inside a programming language
• prestobase-hq
• Need to manage hundreds of SQL queries and their results
• SLO analysis, query performance analysis, etc.
• But How?
23
sbt-sql: https://github.com/xerial/sbt-sql
• Scala SBT plugin for generating model classes from SQL files
• src/main/sql/presto/*.sql (Presto Queries)
• Using SQL as a function
• Read Presto SQL Results as Objects
• Enabled managing SQL queries in GitHub
• Type-safe data analysis in prestobase-hq
24
Big Challenge: Splitting Huge Queries
• Table Scan Log Analysis
• Revealed that most customers are scanning the same data over and over
• Optimizing SQL is not the major concern.
• Analyzing data has higher priority
• Splitting a huge query into scheduled hourly/daily jobs
• digdag: Open-source workflow engine
• http://digdag.io
• YAML-based task definition
• Scheduling, run Presto queries
• Easy to use
25
Time Range Primitives
• TD_TIME_RANGE(time, '2017-06-15', '2017-06-16', 'PDT')
• Most frequently used UDF, but inconvenient
• Use short descriptions of relative time ranges
• 1d (1 day)
• 7d (7 days)
• 1h (1 hour)
• 1w (1 week)
• 1M (1 month)
• today, yesterday, lastWeek, thisWeek, etc.
• Recent data access
• 1dU (1 day until now) => TD_TIME_RANGE(time, '2017-06-15', null, 'JST') open range
• Splitting ranges
• 1w.splitIntoDays
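A rough sketch of how such short descriptions might be expanded. The exact semantics are not given on the slide, so this assumes every expression is a window that ends at a caller-supplied `end` time; months and names such as today or lastWeek are omitted, and `parse_range` and `split_into_days` are hypothetical helpers.

```python
import re
from datetime import datetime, timedelta

UNITS = {
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "w": timedelta(weeks=1),
}


def parse_range(expr: str, end: datetime):
    """Expand e.g. '7d' into (start, end), a window of that length before end."""
    m = re.fullmatch(r"(\d+)([hdw])", expr)
    if m is None:
        raise ValueError(f"unsupported time range: {expr!r}")
    length = int(m.group(1)) * UNITS[m.group(2)]
    return end - length, end


def split_into_days(start: datetime, end: datetime):
    """Split [start, end) into consecutive chunks of at most one day."""
    chunks, cursor = [], start
    while cursor < end:
        nxt = min(cursor + timedelta(days=1), end)
        chunks.append((cursor, nxt))
        cursor = nxt
    return chunks


start, end = parse_range("1w", datetime(2017, 6, 15))
for s, e in split_into_days(start, end):
    print(s.date(), "->", e.date())
```

Each daily chunk could then be turned into one TD_TIME_RANGE call and run as a separate scheduled job.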
26
MessageFrame (In Design)
• Next-generation Tabular Data Format
• Hybrid layout:
• row-oriented: for streaming. Quick write
• column-oriented: better compression & fast read
• Specification Layers
• Layer-0 (basic specs: Keep it simple stupid)
• Data type: MessagePack
• Compression codec: raw, delta, gzip, (snappy, zstd? etc.)
• Column metadata: min/max/sum values of columns
• Layer-1 (advanced compression)
• Layer-N should be convertible to Layer-0
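Two of the Layer-0 ideas, the delta codec and per-column min/max/sum metadata, can be illustrated on a plain list of integers. This is only a sketch of the general techniques; MessageFrame is still in design and its actual encoding is not specified in the talk.

```python
def delta_encode(values):
    """Keep the first value, then store differences between neighbors."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode by taking a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def column_metadata(values):
    """Summary statistics stored next to a column block."""
    return {"min": min(values), "max": max(values), "sum": sum(values)}


time_column = [1497484800, 1497484801, 1497484803, 1497484810]
encoded = delta_encode(time_column)
print(encoded)                      # [1497484800, 1, 2, 7]
print(delta_decode(encoded) == time_column)
print(column_metadata(time_column))
```

Small deltas pack into far fewer bytes than full timestamps, and the min/max values let a reader skip a block whose range cannot match a filter.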
27
Summary
• Managing Implicit SLOs
• Data-oriented approach: Presto -> Fluentd -> Treasure Data -> Presto
• SQL clustering -> Find a bottleneck -> Optimize it!
• Optimization approaches
• Split usage control, Presto Ops Robot, Stella partition optimizer
• Low-latency access by Prestobase
• Workflow
• On-going Work
• Physical storage optimization (Stella)
• Huge query optimization
• Incremental Processing Support
• DigDag workflow
• MessageFrame
28
https://www.treasuredata.com/company/careers/
29

Mais conteúdo relacionado

Mais procurados

Oracle to Postgres Schema Migration Hustle
Oracle to Postgres Schema Migration HustleOracle to Postgres Schema Migration Hustle
Oracle to Postgres Schema Migration HustleEDB
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기NAVER D2
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDataWorks Summit
 
Reducing Database Pain & Costs with Postgres
Reducing Database Pain & Costs with PostgresReducing Database Pain & Costs with Postgres
Reducing Database Pain & Costs with PostgresEDB
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
PostgreSQL Deep Internal
PostgreSQL Deep InternalPostgreSQL Deep Internal
PostgreSQL Deep InternalEXEM
 
Memoizeの仕組み(第41回PostgreSQLアンカンファレンス@オンライン 発表資料)
Memoizeの仕組み(第41回PostgreSQLアンカンファレンス@オンライン 発表資料)Memoizeの仕組み(第41回PostgreSQLアンカンファレンス@オンライン 発表資料)
Memoizeの仕組み(第41回PostgreSQLアンカンファレンス@オンライン 発表資料)NTT DATA Technology & Innovation
 
PostgreSQLの実行計画を読み解こう(OSC2015 Spring/Tokyo)
PostgreSQLの実行計画を読み解こう(OSC2015 Spring/Tokyo)PostgreSQLの実行計画を読み解こう(OSC2015 Spring/Tokyo)
PostgreSQLの実行計画を読み解こう(OSC2015 Spring/Tokyo)Satoshi Yamada
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesLow Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesTanel Poder
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Flink Forward
 
2023 COSCUP - Whats new in PostgreSQL 16
2023 COSCUP - Whats new in PostgreSQL 162023 COSCUP - Whats new in PostgreSQL 16
2023 COSCUP - Whats new in PostgreSQL 16José Lin
 
PostgreSQLモニタリング機能の現状とこれから(Open Developers Conference 2020 Online 発表資料)
PostgreSQLモニタリング機能の現状とこれから(Open Developers Conference 2020 Online 発表資料)PostgreSQLモニタリング機能の現状とこれから(Open Developers Conference 2020 Online 発表資料)
PostgreSQLモニタリング機能の現状とこれから(Open Developers Conference 2020 Online 発表資料)NTT DATA Technology & Innovation
 
5ステップで始めるPostgreSQLレプリケーション@hbstudy#13
5ステップで始めるPostgreSQLレプリケーション@hbstudy#135ステップで始めるPostgreSQLレプリケーション@hbstudy#13
5ステップで始めるPostgreSQLレプリケーション@hbstudy#13Uptime Technologies LLC (JP)
 

Mais procurados (20)

Oracle to Postgres Schema Migration Hustle
Oracle to Postgres Schema Migration HustleOracle to Postgres Schema Migration Hustle
Oracle to Postgres Schema Migration Hustle
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
 
Vacuum徹底解説
Vacuum徹底解説Vacuum徹底解説
Vacuum徹底解説
 
Reducing Database Pain & Costs with Postgres
Reducing Database Pain & Costs with PostgresReducing Database Pain & Costs with Postgres
Reducing Database Pain & Costs with Postgres
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Presto
PrestoPresto
Presto
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
PostgreSQL Deep Internal
PostgreSQL Deep InternalPostgreSQL Deep Internal
PostgreSQL Deep Internal
 
Memoizeの仕組み(第41回PostgreSQLアンカンファレンス@オンライン 発表資料)
Memoizeの仕組み(第41回PostgreSQLアンカンファレンス@オンライン 発表資料)Memoizeの仕組み(第41回PostgreSQLアンカンファレンス@オンライン 発表資料)
Memoizeの仕組み(第41回PostgreSQLアンカンファレンス@オンライン 発表資料)
 
PostgreSQLの実行計画を読み解こう(OSC2015 Spring/Tokyo)
PostgreSQLの実行計画を読み解こう(OSC2015 Spring/Tokyo)PostgreSQLの実行計画を読み解こう(OSC2015 Spring/Tokyo)
PostgreSQLの実行計画を読み解こう(OSC2015 Spring/Tokyo)
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesLow Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
 
Presto overview
Presto overviewPresto overview
Presto overview
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
 
2023 COSCUP - Whats new in PostgreSQL 16
2023 COSCUP - Whats new in PostgreSQL 162023 COSCUP - Whats new in PostgreSQL 16
2023 COSCUP - Whats new in PostgreSQL 16
 
PostgreSQLの運用・監視にまつわるエトセトラ
PostgreSQLの運用・監視にまつわるエトセトラPostgreSQLの運用・監視にまつわるエトセトラ
PostgreSQLの運用・監視にまつわるエトセトラ
 
PostgreSQLモニタリング機能の現状とこれから(Open Developers Conference 2020 Online 発表資料)
PostgreSQLモニタリング機能の現状とこれから(Open Developers Conference 2020 Online 発表資料)PostgreSQLモニタリング機能の現状とこれから(Open Developers Conference 2020 Online 発表資料)
PostgreSQLモニタリング機能の現状とこれから(Open Developers Conference 2020 Online 発表資料)
 
5ステップで始めるPostgreSQLレプリケーション@hbstudy#13
5ステップで始めるPostgreSQLレプリケーション@hbstudy#135ステップで始めるPostgreSQLレプリケーション@hbstudy#13
5ステップで始めるPostgreSQLレプリケーション@hbstudy#13
 

Semelhante a Presto At Treasure Data

Presto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookPresto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookTreasure Data, Inc.
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1sqlserver.co.il
 
Overview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceOverview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceSATOSHI TAGOMORI
 
SQL Server Wait Types Everyone Should Know
SQL Server Wait Types Everyone Should KnowSQL Server Wait Types Everyone Should Know
SQL Server Wait Types Everyone Should KnowDean Richards
 
(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP PerformanceBIOVIA
 
OSDC 2015: Tudor Golubenco | Application Performance Management with Packetbe...
OSDC 2015: Tudor Golubenco | Application Performance Management with Packetbe...OSDC 2015: Tudor Golubenco | Application Performance Management with Packetbe...
OSDC 2015: Tudor Golubenco | Application Performance Management with Packetbe...NETWAYS
 
QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...javier ramirez
 
(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool ManagementBIOVIA
 
Monitoring MongoDB’s Engines in the Wild
Monitoring MongoDB’s Engines in the WildMonitoring MongoDB’s Engines in the Wild
Monitoring MongoDB’s Engines in the WildTim Vaillancourt
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreDataWorks Summit
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageKai Sasaki
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...javier ramirez
 
Boost the Performance of SharePoint Today!
Boost the Performance of SharePoint Today!Boost the Performance of SharePoint Today!
Boost the Performance of SharePoint Today!Brian Culver
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware ProvisioningMongoDB
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsKeeyong Han
 
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte DataProblems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte DataJignesh Shah
 

Semelhante a Presto At Treasure Data (20)

Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
Presto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookPresto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @Facebook
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1
 
Overview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceOverview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data Service
 
SQL Server Wait Types Everyone Should Know
SQL Server Wait Types Everyone Should KnowSQL Server Wait Types Everyone Should Know
SQL Server Wait Types Everyone Should Know
 
(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance
 
OSDC 2015: Tudor Golubenco | Application Performance Management with Packetbe...
OSDC 2015: Tudor Golubenco | Application Performance Management with Packetbe...OSDC 2015: Tudor Golubenco | Application Performance Management with Packetbe...
OSDC 2015: Tudor Golubenco | Application Performance Management with Packetbe...
 
QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...
 
(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management
 
Monitoring MongoDB’s Engines in the Wild
Monitoring MongoDB’s Engines in the WildMonitoring MongoDB’s Engines in the Wild
Monitoring MongoDB’s Engines in the Wild
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud Storage
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
Fastest Servlets in the West
Fastest Servlets in the WestFastest Servlets in the West
Fastest Servlets in the West
 
Boost the Performance of SharePoint Today!
Boost the Performance of SharePoint Today!Boost the Performance of SharePoint Today!
Boost the Performance of SharePoint Today!
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
 
computer networking
computer networkingcomputer networking
computer networking
 
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte DataProblems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
 

Mais de Taro L. Saito

Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Taro L. Saito
 
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Taro L. Saito
 
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020Taro L. Saito
 
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020Taro L. Saito
 
Airframe Meetup #3: 2019 Updates & AirSpec
Airframe Meetup #3: 2019 Updates & AirSpecAirframe Meetup #3: 2019 Updates & AirSpec
Airframe Meetup #3: 2019 Updates & AirSpecTaro L. Saito
 
Presto At Arm Treasure Data - 2019 Updates
Presto At Arm Treasure Data - 2019 UpdatesPresto At Arm Treasure Data - 2019 Updates
Presto At Arm Treasure Data - 2019 UpdatesTaro L. Saito
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of PrestoTaro L. Saito
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataTaro L. Saito
 
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018Taro L. Saito
 
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17Taro L. Saito
 
Tips For Maintaining OSS Projects
Tips For Maintaining OSS ProjectsTips For Maintaining OSS Projects
Tips For Maintaining OSS ProjectsTaro L. Saito
 
Learning Silicon Valley Culture
Learning Silicon Valley CultureLearning Silicon Valley Culture
Learning Silicon Valley CultureTaro L. Saito
 
Scala at Treasure Data
Scala at Treasure DataScala at Treasure Data
Scala at Treasure DataTaro L. Saito
 
Introduction to Presto at Treasure Data
Introduction to Presto at Treasure DataIntroduction to Presto at Treasure Data
Introduction to Presto at Treasure DataTaro L. Saito
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoTaro L. Saito
 
Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Taro L. Saito
 
Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Taro L. Saito
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringTaro L. Saito
 

Presto At Treasure Data

  • 1. T R E A S U R E D A T A Presto At Treasure Data Presto Meetup @ Tokyo - June 15, 2017 Taro L. Saito - GitHub:@xerial Ph.D., Software Engineer at Treasure Data, Inc. 1
  • 2. Presto Usage at Treasure Data (2017) • Processing 15 Trillion Rows / Day (= 173 Million Rows / sec.) • 150,000~ Queries / Day • 1,500~ Users • Hosting Presto as a service for 3 years 2
  • 3. Configurations • Hosted on AWS (us-east), AWS Tokyo, IDCF (Japan) • Multi-Tenancy Clusters • PlazmaDB • Storage: Amazon S3 or RiakCS • S3 file indexes: PostgreSQL • Storage format: Columnar MessagePack (MPC) • MessagePack: Self-describing typed format • Compact: 10x compression ratio from the original input data (JSON) • 200GB JVM memory per node • To support a variety of query usage • Estimating required memory in advance is difficult • To avoid the WAITING_FOR_MEMORY state, which blocks the entire query processing • In small-memory configurations, major GCs were quite frequent 3
  • 4. Challenges • Major Complaint • Presto is slower than usual • Only 20% of 150,000 queries are using our scheduling feature • However, 85% of queries are actually scheduled by user scripts or third-party tools • How can we know the expected performance? • (Implicit) Service Level Objectives (SLOs) 4
  • 5. Understanding Implicit SLOs • We usually looked into slow queries to figure out the performance bottlenecks. • However, analyzing SQL takes a long time • Because we need to understand the meaning of the data. • Understanding a hundred lines of SQL is painful • Created Presto Query Tuning Guides: • Presto Query FAQs: https://docs.treasuredata.com/articles/presto-query-faq • Performance Expectations • Scheduled queries: We can estimate SLOs from historical stats • Scheduled, but submitted from third-party tools or user scripts • How do we know the expected performance? • We need to internalize the customer's knowledge of query performance 5
  • 6. Our Approach: Data-Driven Improvement • Bad: • Collecting stdout/stderr logs of Presto • Good: • Collecting logs in a queryable format with Presto • Collecting Query Event Logs to Treasure Data • Presto Event Listener -> fluentd -> Treasure Data • Treasure Data • schema-less: Schema can be automatically generated from the data • As we add new fields to the event, the schema evolves automatically • We have been collecting every single query log since the beginning of the Presto service • (Figure labels: Query Logs, Store, Analyze SQL, Improve & Optimize) 6
  • 7. Query Event Logs • Query Completion • queryId, user id, session parameters, etc. • Query stats: running time, total rows, bytes, splits, CPU time, etc. • SQL statement • Split Completion • Running time, processed rows, bytes, etc. • S3 GET access count, read bytes • Table Scan • Accessed table names, column sets • Accessed time ranges (e.g., queries looking at data of the past 1 hour, 7 days, etc.) • Filtering conditions (predicates) 7
  • 8. Clustering Queries with Query Signature • Finding Implicit SLOs • Need to classify the 85% of queries that are scheduled • Extracting Query Signatures • Simplify complex SQL expressions into a tiny SQL representation • Reusing the ANTLR parser of Presto • Query Signature Example: • S[Cnt](J(T1,G(S[Cnt](T2)))) • SELECT count(a),... FROM T1 JOIN (SELECT count(b),... FROM T2 GROUP BY x) 8
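The signature example on this slide can be illustrated with a minimal sketch. It assumes the SQL has already been parsed into a small tree; the `Table`, `Join`, `GroupBy`, and `Select` classes below are hypothetical stand-ins for illustration, not the Presto ANTLR parser's own node types, and the numbering of tables by first appearance is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Table:
    name: str


@dataclass
class Join:
    left: "Node"
    right: "Node"


@dataclass
class GroupBy:
    child: "Node"


@dataclass
class Select:
    child: "Node"
    aggregates: List[str] = field(default_factory=list)  # e.g. ["Cnt"]


Node = Union[Table, Join, GroupBy, Select]


def signature(node: Node, tables: Optional[Dict[str, int]] = None) -> str:
    """Render a compact signature; tables are numbered by first appearance."""
    if tables is None:
        tables = {}
    if isinstance(node, Table):
        index = tables.setdefault(node.name, len(tables) + 1)
        return f"T{index}"
    if isinstance(node, Join):
        left = signature(node.left, tables)
        right = signature(node.right, tables)
        return f"J({left},{right})"
    if isinstance(node, GroupBy):
        return f"G({signature(node.child, tables)})"
    if isinstance(node, Select):
        aggs = f"[{','.join(node.aggregates)}]" if node.aggregates else ""
        return f"S{aggs}({signature(node.child, tables)})"
    raise TypeError(f"unsupported node: {node!r}")


# Tree for: SELECT count(a),... FROM T1 JOIN (SELECT count(b),... FROM T2 GROUP BY x)
query = Select(
    Join(Table("t1"), GroupBy(Select(Table("t2"), ["Cnt"]))),
    ["Cnt"],
)
print(signature(query))
```

Because table names and literal values are dropped, queries that differ only in those details map to the same signature, which is what makes grouping historical running times possible.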
  • 9. Implicit SLOs • Collect the historical query running times • Queries that have the same query signature • Median absolute deviation (MAD): the median of |running time - median| • CoV: Coefficient of variation = MAD / median • If CoV > 1, the query running time tends to vary • If CoV < 1, the median of historical running times is useful for estimating the query running time • SLO violation: • If a query is running longer than median + MAD • Customer feels the query is slower than usual • However, the query might be processing much more data than usual • Normalization based on the processed data size is also necessary 9
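The check described on this slide can be sketched in a few lines. This is an illustrative reading of the slide, assuming MAD is the standard median absolute deviation; the function names and the sample history are made up, and normalization by data size is left out.

```python
from statistics import median


def slo_profile(running_times):
    """Summarize historical running times (seconds) of one query signature."""
    m = median(running_times)
    mad = median([abs(t - m) for t in running_times])
    cov = mad / m if m > 0 else float("inf")
    return {"median": m, "mad": mad, "cov": cov, "threshold": m + mad}


def violates_slo(running_time, profile):
    """Flag a run only when the history is stable enough (CoV < 1)."""
    return profile["cov"] < 1 and running_time > profile["threshold"]


history = [10, 12, 11, 13, 50]  # hypothetical running times in seconds
profile = slo_profile(history)
print(profile)
print(violates_slo(20, profile), violates_slo(12, profile))
```

Using the median rather than the mean keeps a single outlier (the 50-second run here) from inflating the threshold.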
  • 10. Typical Performance Bottlenecks • Huge Queries • Frequent S3 access, wide table scans • Single-node operators • order by, window function, count(distinct x), processing skewed data, etc. • Ill-performing worker nodes • Heavy load on a single worker node • Insufficient pool memory • Major/full GCs • We are using min.error-duration = 2m, but GC pause can be longer • Too much resource usage • A single query occupies the entire cluster • e.g., A query with hundreds of query stages! 10
  • 11. Split Resource Manager • Problem: A single query can occupy the entire cluster resource • But Presto has only limited performance control • Only CPU time, memory usage, and concurrent queries (CQ) limits • No throttling or boosting • Created Split Resource Manager • Limiting the max runnable splits for each customer • Using a custom RemoteTask class, which adds a wait if no splits are available • => Efficient Use of Multi-Tenancy Cluster 11
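The idea of capping runnable splits per customer can be sketched as a small counter guarded by a condition variable. This is not the actual Split Resource Manager or Presto's RemoteTask; the class name, method names, and limit value are assumptions for illustration only.

```python
import threading
from collections import defaultdict


class SplitLimiter:
    """Caps the number of concurrently running splits per customer."""

    def __init__(self, max_splits_per_customer):
        self._max = max_splits_per_customer
        self._running = defaultdict(int)
        self._cond = threading.Condition()

    def acquire(self, customer, timeout=None):
        """Wait until the customer has a free slot; False on timeout."""
        with self._cond:
            has_slot = self._cond.wait_for(
                lambda: self._running[customer] < self._max, timeout
            )
            if has_slot:
                self._running[customer] += 1
            return has_slot

    def release(self, customer):
        """Return a slot and wake up any waiting schedulers."""
        with self._cond:
            self._running[customer] -= 1
            self._cond.notify_all()


limiter = SplitLimiter(max_splits_per_customer=2)
print(limiter.acquire("customer-a"), limiter.acquire("customer-a"))
print(limiter.acquire("customer-a", timeout=0.01))  # limit reached, times out
print(limiter.acquire("customer-b", timeout=0.01))  # other customers unaffected
```

Because the limit is tracked per customer, one tenant reaching its cap delays only that tenant's splits, which matches the multi-tenancy goal stated on the slide.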
  • 12. Presto Ops Robot • Problem: Insufficient memory of a worker • Queries using that worker node enter WAITING_FOR_MEMORY state • Report JMX metrics -> fluentd -> DataDog -> Trigger Alert -> Presto Ops Robot • Presto Ops Robot • Sending graceful shutdown command (POST SHUTTING_DOWN message to /v1/status) • or killing memory-consuming queries on the worker node • Restarting worker JVM process • At least once a week, to avoid issues caused by running a JVM for a long time • Resetting any effect caused by unknown bugs • Useful for cleaning up untracked memory (e.g., ANTLR objects, etc.) 12
  • 13. S3 Access Performance • Problem: Slow Table Scan • S3 GET request has constant latency • 30ms ~ 50ms latency regardless of the read size (up to 8KB read) • Request retry on 500 (unavailable) or 503 (Slowdown) is also necessary • Reading small header part of S3 objects can be the majority of query processing time • Columnar format: header + column blocks • IO Manager: • Need to send as many S3 GET requests as possible • 1 split = multiple S3 objects • Pipelining S3 GET requests and column reads 13
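Since each GET has a roughly constant latency, issuing many requests concurrently and retrying on 500/503 hides most of that cost. The sketch below shows the pattern with a thread pool; `fetch` is a caller-supplied function standing in for a real S3 client call, and `S3Error`, the parallelism, and the retry counts are assumptions, not the IO Manager's actual design.

```python
import time
from concurrent.futures import ThreadPoolExecutor

RETRYABLE_STATUS = {500, 503}


class S3Error(Exception):
    """Hypothetical error type carrying the HTTP status of a failed GET."""

    def __init__(self, status):
        super().__init__(f"S3 request failed with status {status}")
        self.status = status


def get_with_retry(fetch, key, max_attempts=3, backoff_sec=0.0):
    """Call fetch(key), retrying only on retryable status codes."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(key)
        except S3Error as e:
            if e.status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            time.sleep(backoff_sec * attempt)  # simple linear backoff


def fetch_all(fetch, keys, parallelism=16):
    """Issue GETs concurrently; results keep the order of keys."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(lambda k: get_with_retry(fetch, k), keys))


# Usage with a fake fetch that fails once with 503 for one object.
calls = {}


def fake_fetch(key):
    calls[key] = calls.get(key, 0) + 1
    if key == "part-b" and calls[key] == 1:
        raise S3Error(503)
    return key.upper()


print(fetch_all(fake_fetch, ["part-a", "part-b", "part-c"]))
```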
  • 14. Presto Stella: Plazma Storage Optimizer • Problem: • Some queries read 1 million partitions <- S3 latency overhead is quite high • Data from mobile applications often has a wide range of time values. • Presto Stella Connector • Using Presto for optimizing physical storage partitions • Input records: File list on S3 • Table writer stage: Merges fragmented partitions and uploads them to S3 • Commit: Update S3 file indexes on PostgreSQL (in an atomic transaction) • Performance Improvement • e.g. 10,000 partitions (30 sec.) -> 20 partitions (1.5 sec.) • 20x performance improvement • Use Cases • Maintain fragmented user-defined partitions • 1-hour partitioning -> more flexible time range partitioning 14
  • 16. New Directions Explored By Presto • Traditional Database Usage • Required a Database Administrator (DBA) • DBA designs the schema and queries • DBA tunes query performance • After Presto • Schema is designed by data providers • 1st-party data (user's customer data) • 3rd-party data sources • Analysts or Marketers explore the data with Presto • Don't know the schema in advance • Convenient and low-latency access is necessary • SQL can be inefficient at first • While exploring data, SQL can become more sophisticated, but not always 16
  • 17. Prestobase Proxy: Low-Latency Access to Presto • Needed a more interactive experience with Presto • Prestobase Proxy: Gateway to Presto Coordinator • Talks Presto Protocol (/v1/statement/…) • Written in Scala • Runs on Docker • Based on Finagle (RPC/HTTP server library developed by Twitter) • Features • Can work with standard Presto clients (e.g., presto-cli, presto-jdbc, presto-odbc, etc.) • Increased connectivity to BI tools: Tableau, Datorama, ChartIO, Looker, etc. • Authentication (API key) • Rewriting nextUri (internal IP address -> external host name) • BI-tool specific query filters • etc. 17
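The nextUri rewriting feature can be illustrated with a small sketch. Prestobase itself is written in Scala on Finagle; the Python function below only shows the transformation, and the host name, IP address, and response fields other than `nextUri` are invented for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_next_uri(response, external_host, scheme="https"):
    """Return a copy of the response whose nextUri points at the proxy."""
    next_uri = response.get("nextUri")
    if next_uri is None:
        # The final response of a query has no nextUri; pass it through.
        return response
    parts = urlsplit(next_uri)
    rewritten = urlunsplit(
        (scheme, external_host, parts.path, parts.query, parts.fragment)
    )
    return {**response, "nextUri": rewritten}


# Hypothetical coordinator response exposing an internal address.
internal = {
    "id": "20170615_000000_00001_abcde",
    "nextUri": "http://10.0.0.12:8080/v1/statement/20170615_000000_00001_abcde/2",
}
print(rewrite_next_uri(internal, "presto.example.com")["nextUri"])
```

Without this step, a client outside the private network would try to follow a URI it cannot reach, so the rewrite is what lets unmodified clients such as presto-cli work through the proxy.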
  • 18. Customizing Prestobase Filters • Prestobase Proxy: Gateway to access Presto • Adding TD-specific bindings • Finagle filters -> Injecting TD-specific filters • Using Airframe, a dependency injection library for Scala 18
  • 19. Airframe • http://wvlet.org/airframe • Three step DI in Scala • Bind • Design • Build • Built-in life cycle manager • Session start/shutdown • examples: • Open/close Presto connection • Shutting down Presto server • etc. • Session • Manage singletons and binding rules 19
  • 20. VCR Record/Replay for Testing Presto • Launching Presto requires a lot of memory (e.g., 2GB or more) • Often crashes CI service containers (TravisCI, CircleCI, etc.) • Recording Presto responses (prestobase-vcr) • with sqlite-jdbc: https://github.com/xerial/sqlite-jdbc • DB file for each test suite • Enabled small-memory footprint testing • Can run many Presto tests in CI 20
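The record/replay idea can be sketched with the standard `sqlite3` module. prestobase-vcr uses sqlite-jdbc from Scala; the `Vcr` class, its table layout, and the keying of recordings by SQL text below are assumptions chosen to keep the example short.

```python
import json
import sqlite3


class Vcr:
    """Records query responses into SQLite and replays them on later calls."""

    def __init__(self, db_path, run_query=None):
        # run_query is the real (expensive) backend; None means replay-only.
        self._run_query = run_query
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS recordings "
            "(sql TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    def query(self, sql):
        row = self._db.execute(
            "SELECT response FROM recordings WHERE sql = ?", (sql,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # replay
        if self._run_query is None:
            raise KeyError(f"no recording for: {sql}")
        response = self._run_query(sql)  # record
        self._db.execute(
            "INSERT INTO recordings (sql, response) VALUES (?, ?)",
            (sql, json.dumps(response)),
        )
        self._db.commit()
        return response


# Usage: the fake backend stands in for a real Presto server.
backend_calls = []


def fake_presto(sql):
    backend_calls.append(sql)
    return {"columns": ["_col0"], "data": [[1]]}


vcr = Vcr(":memory:", fake_presto)
print(vcr.query("SELECT 1"))  # recorded from the backend
print(vcr.query("SELECT 1"))  # replayed from SQLite
print(len(backend_calls))
```

Pointing `db_path` at one file per test suite, as the slide describes, lets CI run in replay-only mode without launching Presto at all.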
  • 21. Optimizing QueryResults Transfer in Prestobase • Accept: application/x-msgpack • HTTP header • Returning Presto query result rows in MessagePack format • QueryResults object • Contains Array<Array<Object>> => MessagePack (compact binary) • Encoding QueryResults objects using MessagePack/Jackson • https://github.com/msgpack/msgpack-java • Presto client doesn’t need to parse the row part • 1.5x ~ 2.0x performance improvement for streaming query results 21
  • 22. Prestobase Modules • prestobase-proxy • Proxy server to access Presto with authentication • prestobase-agent • Agent for running Presto queries and storing their results • prestobase-vcr • For recording/replaying Presto responses • prestobase-codec • MessagePack codec of Presto query responses • prestobase-hq (headquarter) • Presto usage analysis pipelines, SLO monitoring, etc. • prestobase-conductor • Multi Presto cluster management tool • td-prestobase • Treasure Data specific bindings of prestobase • TD Authentication, job logging/monitoring • BI tool specific filters (Tableau, Looker, etc.) 22
  • 23. Bridging Gaps Between SQL and Programming Language • Traditional Approach • OR-Mapper: app developers design objects and schema, then generate SQL • New Approach: SQL First • Need to manage various SQL results inside the programming language • prestobase-hq • Need to manage hundreds of SQL queries and their results • SLO analysis, query performance analysis, etc. • But how? 23
  • 24. sbt-sql: https://github.com/xerial/sbt-sql • Scala SBT plugin for generating model classes from SQL files • src/main/sql/presto/*.sql (Presto Queries) • Using SQL as a function • Read Presto SQL Results as Objects • Enabled managing SQL queries in GitHub • Type-safe data analysis in prestobase-hq 24
  • 25. Big Challenge: Splitting Huge Queries • Table Scan Log Analysis • Revealed that most customers are scanning the same data over and over • Optimizing SQL is not their major concern • Analyzing data has higher priority • Splitting a huge query into scheduled hourly/daily jobs • digdag: Open-source workflow engine • http://digdag.io • YAML-based task definition • Scheduling, running Presto queries • Easy to use 25
  • 26. Time Range Primitives • TD_TIME_RANGE(time, '2017-06-15', '2017-06-16', 'PDT') • Most frequently used UDF, but inconvenient • Use short descriptions of relative time ranges • 1d (1 day) • 7d (7 days) • 1h (1 hour) • 1w (1 week) • 1M (1 month) • today, yesterday, lastWeek, thisWeek, etc. • Recent data access • 1dU (1 day until now) => TD_TIME_RANGE(time, '2017-06-15', null, 'JST') open range • Splitting ranges • 1w.splitIntoDays 26
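The range-splitting primitive (1w.splitIntoDays) can be sketched as follows. The slide does not define the exact semantics of the short notation, so this example only covers splitting an explicit date range into daily ranges and rendering each as a TD_TIME_RANGE predicate; the function names are hypothetical and the half-open [start, end) interpretation is an assumption.

```python
from datetime import date, timedelta


def split_into_days(start, end):
    """Split [start, end) into consecutive one-day ranges."""
    days = (end - start).days
    return [
        (start + timedelta(days=i), start + timedelta(days=i + 1))
        for i in range(days)
    ]


def td_time_range(start, end, timezone="JST"):
    """Render one range as a TD_TIME_RANGE predicate; end=None is open-ended."""
    end_sql = "null" if end is None else f"'{end.isoformat()}'"
    return f"TD_TIME_RANGE(time, '{start.isoformat()}', {end_sql}, '{timezone}')"


# One week ending on the day of the talk, split into daily predicates.
week = split_into_days(date(2017, 6, 8), date(2017, 6, 15))
for day_start, day_end in week:
    print(td_time_range(day_start, day_end))
```

Each daily predicate can then drive one scheduled job, which is the splitting of huge queries into hourly or daily jobs described on the previous slide.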
  • 27. MessageFrame (In Design) • Next-generation Tabular Data Format • Hybrid layout: • row-oriented: for streaming. Quick write • column-oriented: better compression & fast read • Specification Layers • Layer-0 (basic specs: Keep it simple stupid) • Data type: MessagePack • Compression codec: raw, delta, gzip, (snappy, zstd? etc.) • Column metadata: min/max/sum values of columns • Layer-1 (advanced compression) • Layer-N should be convertible to Layer-0 27
  • 28. Summary • Managing Implicit SLOs • Data-oriented approach: Presto -> Fluentd -> Treasure Data -> Presto • SQL clustering -> Find a bottleneck -> Optimize it! • Optimization approaches • Split usage control, Presto Ops Robot, Stella partition optimizer • Low-latency access by Prestobase • Workflow • On-going Work • Physical storage optimization (Stella) • Huge query optimization • Incremental Processing Support • DigDag workflow • MessageFrame 28 https://www.treasuredata.com/company/careers/
  • 29. T R E A S U R E D A T A 29