Alluxio Product School Webinar - Boosting Trino Performance.

  1. Boosting Trino Performance Mar 23, 2023
  2. Agenda 01 Overview & Architecture: Trino, HMS and Iceberg Overview 02 SQL Execution: SQL Execution Introduction 03 Accelerating SQL: Best Practices for Accelerating SQL 04 Q&A: Questions & Answers 2
  3. Overview & Architecture 01 3
  4. Trino Overview ● Distributed SQL query engine ○ ANSI SQL on Hive data warehouses, Hudi, Iceberg, Kafka, Druid, etc. ○ Designed to be interactive ○ Access to petabytes of data ● Open source ● Use cases ○ Ad-hoc queries ○ BI tools ○ Dashboards ○ A/B testing ○ ETL 4
  5. Trino Architecture 5
  6. Hive Tables Overview ● Uses the Hive metastore to serve table metadata ○ Backed by an RDBMS (MySQL) ○ Limited scalability 6
  7. Trino Tuning Tips 02 7
  8. Plan Generation 8
  9. Optimization 9
  10. EXPLAIN vs EXPLAIN ANALYZE 10 EXPLAIN: plan structure + cost estimates EXPLAIN ANALYZE: plan structure + cost estimates + actual execution statistics
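For a hypothetical orders table (the table and column names here are illustrative, not from the deck), the two commands can be compared directly:

```sql
-- EXPLAIN: prints the plan structure with cost estimates; the query is not run
EXPLAIN SELECT customer_id, count(*) FROM orders GROUP BY customer_id;

-- EXPLAIN ANALYZE: runs the query, then prints the same plan annotated
-- with actual CPU time, rows, and memory per stage
EXPLAIN ANALYZE SELECT customer_id, count(*) FROM orders GROUP BY customer_id;
```

EXPLAIN ANALYZE costs a full execution of the query, so it is the tool for diagnosing a known-slow query rather than for routine use.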
  11. Plan Generation & Optimization 11
  12. Scheduling of Hive Tables 12
  13. Scan Parquet Files 13
  14. Predicate Pushdown Resource Usage 14
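To illustrate why pushdown saves resources, assume a hypothetical orders table partitioned by a ds column: a plain comparison on the column can be pushed down to partition pruning and the Parquet reader, while wrapping the column in a function cannot.

```sql
-- Pushdown-friendly: the predicate reaches the reader, so only the
-- ds = '2023-03-01' partition is scanned
SELECT count(*) FROM orders WHERE ds = '2023-03-01';

-- Pushdown-hostile: the function call hides the column, so every
-- partition is read and filtered row by row
SELECT count(*) FROM orders WHERE substr(ds, 1, 7) = '2023-03';
```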
  15. S3 GetObject API
  16. Best Practices 03 16
  17. Make Sure There Is Sufficient Memory ○ Set max memory per query ○ Set max memory per node ■ If queries can be sped up, max_concurrent can be reduced and the max memory per node increased while keeping the same throughput ○ Memory allocation deadlock ■ Try query.low-memory-killer.policy 17
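The knobs above live in Trino's config.properties; a minimal sketch (the sizes are placeholders, not recommendations):

```properties
# config.properties: per-query and per-node memory limits
query.max-memory=50GB
query.max-memory-per-node=10GB
# When a node runs out of memory, kill the largest blocked query
# instead of letting concurrent queries deadlock on allocation
query.low-memory-killer.policy=total-reservation-on-blocked-nodes
```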
  18. Optimize File Format & Table Layout ● Consider a column-based data format for your data files. ● ORC may perform better than Parquet (especially with PrestoDB). ● Avoid CSV and JSON formats. ● Use compression (SNAPPY, LZ4, ZSTD, or GZIP). ● Use partitioning: create a partitioned version of a table with a CTAS by adding the partitioned_by clause to CREATE TABLE. ● Use bucketing: add the bucketed_by clause to your CREATE TABLE statement, and also specify bucket_count. 18
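A CTAS combining these layout tips might look like the following sketch for the Hive connector (table and column names are hypothetical; note that partition columns must come last in the column list):

```sql
CREATE TABLE orders_by_day
WITH (
    format = 'ORC',                      -- columnar format
    bucketed_by = ARRAY['customer_id'],  -- bucketing column
    bucket_count = 50,
    partitioned_by = ARRAY['ds']         -- partition column(s), listed last
)
AS
SELECT order_id, customer_id, total_price, ds
FROM orders;
```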
  19. Collect Hive Table Stats ● Collect table statistics to ensure the most efficient query plan is produced, so queries run as fast as possible. ● Use the SQL command ANALYZE <tablename> to do this. Repeat the ANALYZE command for all tables involved in queries on a regular basis, typically when the data has substantially changed (e.g. new data has arrived, or an ETL cycle has completed). 19
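With a hypothetical orders table, stats collection can cover the whole table or, after an incremental load, just the partitions that changed:

```sql
-- Recompute statistics for the whole table
ANALYZE orders;

-- Hive connector: limit collection to a newly arrived partition
ANALYZE orders WITH (partitions = ARRAY[ARRAY['2023-03-01']]);
```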
  20. Join Optimization ● Enable the CBO (it is enabled by default). ● "LARGE LEFT": put the large table on the left side of the join. ● Let Trino do the job when using the default settings: ○ SET SESSION join_distribution_type='AUTOMATIC'; ○ SET SESSION join_reordering_strategy='AUTOMATIC'; 20
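When table statistics are missing and the CBO cannot reorder joins, the "LARGE LEFT" rule has to be applied by hand; a sketch with hypothetical fact and dimension tables:

```sql
-- Large table on the left (probe side), small table on the right
-- (build side, which Trino holds in memory for the hash join)
SELECT d.region, sum(f.total_price)
FROM orders_fact f          -- billions of rows
JOIN region_dim d           -- thousands of rows
  ON f.region_id = d.region_id
GROUP BY d.region;
```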
  21. Dynamic Filtering ● Enable dynamic filtering when one or more joins are in play, especially when a smaller dimension table probes a larger fact table. Dynamic filtering is pushed down to the ORC and Parquet readers and can accelerate queries on both partitioned and non-partitioned tables. It is a join optimization intended to improve the performance of hash joins. Enable it with: ○ SET SESSION enable_dynamic_filtering=TRUE; 21
  22. Trino + Alluxio Multi-Level Cache 22
  23. Configuration Recommendations (JVM) ● Avoid using large clusters ○ <= 400 workers ● Use JDK 11 instead of JDK 8 ○ JDK 11 provides much better performance for both the runtime and GC ● Use G1GC for large heaps (> 10 GB) ● DO NOT over-tune the JVM ○ DO NOT touch MaxNewSize and NewSize ○ DO NOT touch MaxTenuringThreshold ○ DO NOT touch InitiatingHeapOccupancyPercent ● A larger Xmx remediates 95% of GC issues 23
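These recommendations map onto Trino's jvm.config, which takes one bare JVM option per line. A sketch with an illustrative heap size (the -Xmx value is a placeholder; the G1 and OOM flags follow common Trino deployment defaults, and the "do not touch" options above are deliberately absent):

```properties
-server
-Xmx54G
-XX:+UseG1GC
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
```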
  24. Q&A 04 24 Complete the Community Survey for a chance to win an Amazon gift card!
  25. THANK YOU 25