1. TREASURE DATA
Presto At Treasure Data
Presto Meetup @ Tokyo - June 15, 2017
Taro L. Saito - GitHub:@xerial
Ph.D., Software Engineer at Treasure Data, Inc.
2. Presto Usage at Treasure Data (2017)
Processing 15 Trillion Rows / Day
(= 173 Million Rows / sec.)
150,000~ Queries / Day
1,500~ Users
Hosting Presto as a service for 3 years
3. Configurations
• Hosted on AWS (us-east), AWS Tokyo, IDCF (Japan)
• Multi-Tenancy Clusters
• PlazmaDB
• Storage: Amazon S3 or RiakCS
• S3 file indexes: PostgreSQL
• Storage format: Columnar Message Pack (MPC)
• MessagePack: a self-describing, typed serialization format
• Compact: 10x compression ratio from the original input data (JSON)
• 200GB JVM memory per node
• To support a wide variety of query workloads
• Estimating required memory in advance is difficult
• To avoid the WAITING_FOR_MEMORY state, which blocks the entire query processing
• In small-memory configurations, major GCs were quite frequent
4. Challenges
• Major Complaint
• Presto is slower than usual
• Only 20% of 150,000 queries are using our scheduling feature
• However, 85% of queries are actually scheduled, via user scripts or third-party tools
• How can we know the expected performance?
• (Implicit) Service Level Objectives (SLOs)
5. Understanding Implicit SLOs
• We usually looked into slow queries to figure out the performance bottlenecks.
• However, analyzing SQL takes a long time
• Because we need to understand the meaning of the data.
• Understanding a hundred lines of SQL is painful
• Created Presto Query Tuning Guides:
• Presto Query FAQs: https://docs.treasuredata.com/articles/presto-query-faq
• Performance expectations
• Scheduled queries: we can estimate SLOs from historical stats
• Scheduled, but submitted from third-party tools or user scripts
• How do we know the expected performance?
• We need to internalize customers' knowledge of query performance
6. Our Approach: Data-Driven Improvement
• Bad:
• Collecting stdout/stderr logs of Presto
• Good:
• Collecting logs in a queryable format with Presto
• Collecting Query Event Logs to Treasure Data
• Presto Event Listener -> fluentd -> Treasure Data
• Treasure Data
• schema-less: Schema can be automatically generated from the data
• As we add new fields to the event, the schema evolves automatically
• We have been collecting every single query log since the beginning of the Presto service
• (Cycle diagram: SQL -> Query Logs -> Store -> Analyze -> Improve & Optimize)
7. Query Event Logs
• Query Completion
• queryId, user id, session parameters, etc.
• Query stats: running time, total rows, bytes, splits, CPU time, etc.
• SQL statement
• Split Completion
• Running time, Processed rows, bytes, etc.
• S3 GET access count, read bytes
• Table Scan
• Accessed tables names, column sets
• Accessed time ranges (e.g., queries looking at data of past 1 hour, 7 days, etc.)
• Filtering conditions (predicate)
8. Clustering Queries with Query Signature
• Finding Implicit SLOs
• Need to classify 85% of scheduled queries
• Extracting Query Signatures
• Simplify complex SQL expressions into a tiny SQL representation
• Reusing Presto's ANTLR parser
• Query Signature Example:
• S[Cnt](J(T1,G(S[Cnt](T2))))
• SELECT count(a),... FROM T1
JOIN (SELECT count(b),... FROM T2 GROUP BY x)
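As a sketch of how such a signature can be derived, the toy walker below recurses over a hand-built AST. It is purely illustrative: the real implementation reuses Presto's ANTLR parser, and the node shapes here are hypothetical.

```python
# Hypothetical query-signature extraction over a toy AST.
# S = SELECT, [Cnt] = count aggregation, J = JOIN, G = GROUP BY, T* = tables.
def signature(node):
    kind = node["type"]
    if kind == "table":
        return node["name"]
    if kind == "select":
        aggs = "[" + ",".join(node["aggs"]) + "]" if node.get("aggs") else ""
        return f"S{aggs}({signature(node['from'])})"
    if kind == "join":
        return "J(" + ",".join(signature(c) for c in node["children"]) + ")"
    if kind == "groupby":
        return f"G({signature(node['child'])})"
    raise ValueError(f"unknown node type: {kind}")

# SELECT count(a),... FROM T1 JOIN (SELECT count(b),... FROM T2 GROUP BY x)
query = {
    "type": "select", "aggs": ["Cnt"],
    "from": {"type": "join", "children": [
        {"type": "table", "name": "T1"},
        {"type": "groupby", "child": {
            "type": "select", "aggs": ["Cnt"],
            "from": {"type": "table", "name": "T2"}}},
    ]},
}
print(signature(query))  # → S[Cnt](J(T1,G(S[Cnt](T2))))
```

Because structurally identical queries collapse to the same signature regardless of literals or column lists, signatures make a natural clustering key.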
9. Implicit SLOs
• Collect the historical query running times
• Queries that have the same query signature
• Median absolute deviation (MAD): the median of |running time - median|
• CoV: coefficient of variation = MAD / median
• If CoV > 1, the query running time tends to vary
• If CoV < 1, the median of the historical running times is useful for estimating the query running time
• SLO violation:
• If query is running longer than median + MAD
• Customer feels query is slower than usual
• However, query might be processing much more data than usual
• Normalization based on the processing data size is also necessary
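The MAD/CoV check above can be sketched as follows; function names and the example history are illustrative, and data-size normalization is omitted.

```python
from statistics import median

def slo_stats(running_times):
    """MAD/CoV over historical running times of one query signature."""
    m = median(running_times)
    mad = median(abs(t - m) for t in running_times)  # median absolute deviation
    cov = mad / m                                    # coefficient of variation
    return m, mad, cov

def violates_slo(running_time, running_times):
    """'Slower than usual': beyond median + MAD of the historical runs."""
    m, mad, _ = slo_stats(running_times)
    return running_time > m + mad

history = [10.0, 11.0, 9.0, 10.0, 12.0, 10.5]  # seconds, hypothetical
print(violates_slo(60.0, history))  # a 60s run clearly violates the implicit SLO
```

A low CoV on `history` (well under 1) is what makes the median a usable baseline in the first place.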
10. Typical Performance Bottlenecks
• Huge Queries
• Frequent S3 access, wide table scans
• Single-node operators
• order by, window function, count(distinct x), processing skewed data, etc.
• Ill-performing worker nodes
• Heavy load on a single worker node
• Insufficient pool memory
• Major/full GCs
• We are using min.error-duration = 2m, but GC pause can be longer
• Too much resource usage
• A single query occupies the entire cluster
• e.g., A query with hundreds of query stages!
11. Split Resource Manager
• Problem: A single query can occupy the entire cluster's resources
• But Presto has only limited performance controls
• Only CPU-time, memory-usage, and concurrent-query (CQ) limits
• No throttling nor boosting
• Created Split Resource Manager
• Limiting the max runnable splits for each customer
• Using a custom RemoteTask class, which adds a wait if no splits are available
• => Efficient Use of Multi-Tenancy Cluster
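The per-customer split limit can be sketched with a semaphore. Class and method names here are hypothetical; the real manager hooks into a custom RemoteTask inside Presto rather than a standalone class.

```python
import threading

class SplitResourceManager:
    """Toy per-customer cap on concurrently running splits."""
    def __init__(self, max_splits_per_customer):
        self._limit = max_splits_per_customer
        self._sems = {}
        self._lock = threading.Lock()

    def acquire(self, customer):
        """Try to claim a split slot; False means the caller must wait."""
        with self._lock:
            sem = self._sems.setdefault(
                customer, threading.Semaphore(self._limit))
        return sem.acquire(blocking=False)

    def release(self, customer):
        """Return a slot when a split finishes."""
        self._sems[customer].release()

mgr = SplitResourceManager(max_splits_per_customer=2)
print(mgr.acquire("customer-a"), mgr.acquire("customer-a"), mgr.acquire("customer-a"))
```

With a cap of 2, the third acquire fails until a running split releases its slot, which is the throttling behavior the slide describes.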
12. Presto Ops Robot
• Problem: Insufficient memory of a worker
• Queries using that worker node enter WAITING_FOR_MEMORY state
• Report JMX metrics -> fluentd -> DataDog -> Trigger Alert -> Presto Ops Robot
• Presto Ops Robot
• Sending graceful shutdown command (POST SHUTTING_DOWN message to /v1/status)
• or kill memory consuming queries in the worker node
• Restarting worker JVM process
• At least once a week, to avoid issues from running the JVM for a long time
• Resetting any effect caused by unknown bugs
• Useful for cleaning up untracked memory (e.g., ANTLR objects, etc.)
13. S3 Access Performance
• Problem: Slow Table Scan
• S3 GET request has constant latency
• 30ms ~ 50ms latency regardless of the read size (up to 8KB read)
• Retrying requests on 500 (InternalError) or 503 (SlowDown) responses is also necessary
• Reading the small header part of S3 objects can dominate query processing time
• Columnar format: header + column blocks
• IO Manager:
• Needs to issue as many concurrent S3 GET requests as possible
• 1 split = multiple S3 objects
• Pipelining S3 GET requests and column reads
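The pipelining idea can be sketched with a thread pool: many GETs are in flight at once, while column blocks are consumed in order as responses arrive. `s3_get` is a hypothetical stand-in for the real S3 client call.

```python
from concurrent.futures import ThreadPoolExecutor

def read_split(object_keys, s3_get, pool_size=32):
    """Issue S3 GETs concurrently; yield blocks in submission order so
    downstream column readers can consume them deterministically."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # map() dispatches all requests up front (pipelining) but yields
        # results in the original key order.
        for block in pool.map(s3_get, object_keys):
            yield block

# Toy stand-in "client" that just uppercases the key instead of fetching.
blocks = list(read_split(["a", "b", "c"], str.upper))
print(blocks)  # → ['A', 'B', 'C']
```

Since each GET costs 30-50ms regardless of size, overlapping the requests rather than serializing them is what keeps small-header reads from dominating the scan.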
14. Presto Stella: Plazma Storage Optimizer
• Problem:
• Some queries read 1 million partitions <- the S3 latency overhead is quite high
• Data from mobile applications often has a wide range of time values
• Presto Stella Connector
• Using Presto for optimizing physical storage partitions
• Input records: File list on S3
• Table writer stage: merges fragmented partitions and uploads them to S3
• Commit: Update S3 file indexes on PostgreSQL (in an atomic transaction)
• Performance Improvement
• e.g. 10,000 partitions (30 sec.) -> 20 partitions (1.5 sec.)
• 20x performance improvement
• Use Cases
• Maintain fragmented user-defined partitions
• 1-hour partitioning -> more flexible time range partitioning
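The merge step can be sketched as a greedy grouping of fragmented partition files into batches of roughly a target size. The function name and parameter are hypothetical; the real optimizer runs as a Presto connector that writes merged partitions back to S3.

```python
def plan_merges(file_sizes, target_bytes):
    """Greedily pack fragmented partition files into merge batches,
    closing a batch when adding the next file would exceed the target."""
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

print(plan_merges([5, 5, 5, 5], target_bytes=10))  # → [[5, 5], [5, 5]]
```

Collapsing 10,000 fragments into a few dozen merged partitions is what turns thousands of per-object S3 latencies into a handful.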
16. New Directions Explored By Presto
• Traditional Database Usage
• Required Database Administrator (DBA)
• DBA designs the schema and queries
• DBA tunes query performance
• After Presto
• Schema is designed by data providers
• 1st-party data (users' own customer data)
• 3rd-party data sources
• Analysts or marketers explore the data with Presto
• They don't know the schema in advance
• Convenient, low-latency access is necessary
• SQL can be inefficient at first
• While exploring the data, the SQL can become more sophisticated, but not always
17. Prestobase Proxy: Low-Latency Access to Presto
• Needed a more interactive Presto experience
• Prestobase Proxy: a gateway to the Presto coordinator
• Talks the Presto protocol (/v1/statement/…)
• Written in Scala
• Runs on Docker
• Based on Finagle (Twitter's RPC/HTTP server library)
• Features
• Can work with standard presto clients (e.g., presto-cli, presto-jdbc, presto-odbc, etc.)
• Increased connectivity to BI tools: Tableau, Datorama, ChartIO, Looker, etc.
• Authentication (API key)
• Rewriting nextUri (internal IP address -> external host name)
• BI-tool specific query filters
• etc.
19. Airframe
• http://wvlet.org/airframe
• Three-step dependency injection (DI) in Scala
• Bind
• Design
• Build
• Built-in life cycle manager
• Session start/shutdown
• examples:
• Open/close Presto connection
• Shutting down Presto server
• etc.
• Session
• Manage singletons and binding rules
20. VCR Record/Replay for Testing Presto
• Launching Presto requires a lot of memory (e.g., 2GB or more)
• Often crashes CI service containers (TravisCI, CircleCI, etc.)
• Recording Presto responses (prestobase-vcr)
• with sqlite-jdbc: https://github.com/xerial/sqlite-jdbc
• DB file for each test suite
• Enabled small-memory footprint testing
• Can run many Presto tests in CI
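The record/replay mechanism can be sketched as a query-keyed cache backed by SQLite. Class and method names are hypothetical, and the real prestobase-vcr records at the Presto HTTP protocol level via sqlite-jdbc; this is only the core idea.

```python
import sqlite3

class VCR:
    """Toy record/replay cache: first run records the response,
    later runs replay it from SQLite without a live Presto."""
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tape "
            "(query TEXT PRIMARY KEY, response TEXT)")

    def fetch(self, query, run_query):
        row = self.db.execute(
            "SELECT response FROM tape WHERE query = ?", (query,)).fetchone()
        if row:
            return row[0]                # replay: no Presto needed
        response = run_query(query)      # record on the first run
        self.db.execute("INSERT INTO tape VALUES (?, ?)", (query, response))
        self.db.commit()
        return response
```

In CI the tape file is checked in per test suite, so tests exercise the client code path with a tiny memory footprint.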
21. Optimizing QueryResults Transfer in Prestobase
• Accept: application/x-msgpack
• HTTP header
• Returning Presto query result rows in MessagePack format
• QueryResults object
• Contains Array<Array<Object>> => MessagePack (compact binary)
• Encoding QueryResults objects using MessagePack/Jackson
• https://github.com/msgpack/msgpack-java
• Presto client doesn’t need to parse the row part
• 1.5x ~ 2.0x performance improvement for streaming query results
22. Prestobase Modules
• prestobase-proxy
• Proxy server to access Presto with authentication
• prestobase-agent
• Agent for running Presto queries and storing their results
• prestobase-vcr
• For recording/replaying Presto responses
• prestobase-codec
• MessagePack codec of Presto query responses
• prestobase-hq (headquarter)
• Presto usage analysis pipelines, SLO monitoring, etc.
• prestobase-conductor
• Multi Presto cluster management tool
• td-prestobase
• Treasure Data specific bindings of prestobase
• TD Authentication, job logging/monitoring
• BI tool specific filters (Tableau, Looker, etc.)
23. Bridging Gaps Between SQL and Programming Language
• Traditional Approach
• OR mapper: app developers design objects and schemas, then generate SQL
• New Approach: SQL First
• Need to manage various SQL results inside the programming language
• prestobase-hq
• Need to manage hundreds of SQLs and their results
• SLO analysis, query performance analysis, etc.
• But How?
24. sbt-sql: https://github.com/xerial/sbt-sql
• Scala SBT plugin for generating model classes from SQL files
• src/main/sql/presto/*.sql (Presto Queries)
• Using SQL as a function
• Read Presto SQL Results as Objects
• Enabled managing SQL queries in GitHub
• Type-safe data analysis in prestobase-hq
25. Big Challenge: Splitting Huge Queries
• Table Scan Log Analysis
• Revealed that most customers are scanning the same data over and over
• Optimizing SQL is not the major concern.
• Analyzing data has higher priority
• Splitting a huge query into scheduled hourly/daily jobs
• digdag: Open-source workflow engine
• http://digdag.io
• YAML-based task definition
• Scheduling, run Presto queries
• Easy to use
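A digdag workflow for such a split might look like the following sketch; the file name, task name, SQL path, and database are made up for illustration.

```yaml
# daily_agg.dig -- hypothetical workflow replacing one huge query
# with a scheduled daily job over each day's partition
timezone: UTC
schedule:
  daily>: 01:00:00

+aggregate:
  td>: queries/daily_agg.sql   # runs the Presto query for one day's data
  database: analytics
```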
26. Time Range Primitives
• TD_TIME_RANGE(time, '2017-06-15', '2017-06-16', 'PDT')
• Most frequently used UDF, but inconvenient
• Use short description of relative time ranges
• 1d (1 day)
• 7d (7 days)
• 1h (1 hour)
• 1w (1 week)
• 1M (1 month)
• today, yesterday, lastWeek, thisWeek, etc.
• Recent data access
• 1dU (1 day until now) => TD_TIME_RANGE(time, '2017-06-15', null, 'JST') open range
• Splitting ranges
• 1w.splitIntoDays
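A minimal expansion of this shorthand into TD_TIME_RANGE calls might look like the sketch below. Month units and the split primitives are omitted, and all names are hypothetical.

```python
from datetime import datetime, timedelta

# Illustrative relative-range expansion: "7d" = past 7 days,
# a trailing "U" ("1dU") = open range until now.
UNITS = {"h": "hours", "d": "days", "w": "weeks"}

def expand(spec, now, tz="JST"):
    open_ended = spec.endswith("U")
    if open_ended:
        spec = spec[:-1]
    n, unit = int(spec[:-1]), spec[-1]
    start = now - timedelta(**{UNITS[unit]: n})
    fmt = "%Y-%m-%d %H:%M:%S"
    end = "null" if open_ended else f"'{now.strftime(fmt)}'"
    return f"TD_TIME_RANGE(time, '{start.strftime(fmt)}', {end}, '{tz}')"

now = datetime(2017, 6, 16)
print(expand("1dU", now))
# → TD_TIME_RANGE(time, '2017-06-15 00:00:00', null, 'JST')
```

Generating the UDF call from a short token keeps user scripts readable while still producing the explicit time predicates that partition pruning needs.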
27. MessageFrame (In Design)
• Next-generation Tabular Data Format
• Hybrid layout:
• row-oriented: for streaming; quick writes
• column-oriented: better compression and fast reads
• Specification Layers
• Layer-0 (basic specs: Keep it simple stupid)
• Data type: MessagePack
• Compression codec: raw, delta, gzip, (snappy, zstd? etc.)
• Column metadata: min/max/sum values of columns
• Layer-1 (advanced compression)
• Layer-N should be convertible to Layer-0
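The hybrid idea can be sketched as buffering rows, then flipping them into column blocks that carry the Layer-0 min/max/sum metadata. This is illustrative only; the real format would serialize each block as MessagePack with a compression codec.

```python
# Toy Layer-0 sketch: transpose a row-oriented buffer into column blocks,
# attaching min/max/sum column metadata (names here are hypothetical).
def rows_to_columns(rows):
    columns = []
    for values in zip(*rows):  # transpose rows into columns
        columns.append({
            "values": list(values),
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
        })
    return columns

blocks = rows_to_columns([(1, 10), (2, 20), (3, 30)])
print(blocks[0])  # → {'values': [1, 2, 3], 'min': 1, 'max': 3, 'sum': 6}
```

The per-column min/max metadata is what lets a reader skip whole blocks whose range cannot match a filter predicate.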