
Performance Optimizations in Apache Impala

Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides the low latency and high concurrency that BI/analytic read-mostly queries on Hadoop require, which batch frameworks such as Hive or Spark do not deliver. Impala is written in C++ and Java. It maintains Hadoop's flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely used file formats (e.g. Parquet, Avro, RCFile).

To reduce latency, such as that incurred by utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par with, or exceeds, that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed to run on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu, and HBase. In this talk, we present Impala's architecture in detail and discuss its integration with different storage engines and the cloud.


Performance Optimizations in Apache Impala

  1. Performance optimizations in Apache Impala - Mostafa Mokhtar (mmokhtar@cloudera.com), @mostafamokhtar7
  2. Outline
     • History and motivation
     • Impala architecture
     • Performance-focused overview of the front-end:
       • Query planning overview
       • Query optimization
       • Metadata and statistics
     • Performance-focused overview of the back-end:
       • Partitioning and sorting for selective scans
       • Code generation using LLVM
       • Streaming aggregation
       • Runtime filters
       • Handling cache misses for joins and aggregations
  3. “Big data” revolution
  4. SQL on Apache Hadoop
     • SQL running on top of HDFS
     • Supported various file formats
     • Converted query operators to MapReduce jobs
     • Ran at scale; fault-tolerant
     • High startup costs and materialization overhead (slow...)
  5. Impala: an open source SQL engine for Apache Hadoop
     • Fast
       • C++ instead of Java
       • Runtime code generation
     • Supports interactive BI and analytical workloads, with queries that take from milliseconds to hours
     • Scalable: runs directly on “big” Hadoop clusters (100+ nodes)
     • Flexible
       • Supports multiple storage engines (HDFS, S3, ADLS, Apache Kudu, ...)
       • Supports multiple file formats (Parquet, Text, Sequence, Avro, ...)
       • Supports both structured and semi-structured data
     • Enterprise-grade: authorization/authentication/lineage/audit
  6. Impala's history
     • First commit in May 2011
     • Public beta in October 2012
     • Over a million downloads since then
     • Most recent released version is 2.10, associated with CDH 5.13
     • November 2017: Apache® Impala™ graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project's community and products have been well-governed under the ASF's meritocratic process and principles
  7. SQL on “big data”: the race is on... [chart: competing SQL-on-Hadoop engines, including Hive on Tez and Hive on Tez + LLAP]
  8. Impala - logical view. [Architecture diagram] Each impalad combines a query compiler, query coordinator, and query executor (FE in Java, BE in C++) plus a local metadata cache. Shared services: the Hive Metastore (table definitions, e.g. DB: db1, Tbl: foo (a int, ...) location 'hdfs://...'; DB: db2, Tbl: blah (b string, ...) location 's3://...'), the HDFS NameNode (file/block metadata, e.g. Dir: hdfs://db1/foo, File: foo1.csv, Blocks: ..., Size: 100B), Sentry (roles and privileges, e.g. Role: admin, Privs: R/W access on ALL; Role: user, Privs: R access on TABLE db1.foo), the StateStore, and the Catalog. Storage layers: HDFS, Kudu, S3/ADLS, HBase. The diagram distinguishes metadata/control, execution, and storage paths.
  9. Impala in action - DDL. [Diagram] A SQL application issues CREATE TABLE db.foo (...) location 'hdfs://...'; over ODBC/JDBC to one impalad; the DDL is applied through the Catalog service and Hive Metastore, and the updated metadata is broadcast to all impalads via the StateStore.
  10. Impala in action - select query. [Diagram] A SQL application sends select * from db.foo; over ODBC/JDBC to one impalad, which compiles and coordinates the query, distributes fragments to the other impalads' executors, reads from the storage layer, and returns results to the client.
  11. Impala release-to-release performance trend
     • Track record of improving release-to-release performance
     • 10x speedup in standard benchmarks over the last 24 months
     • Continued to add functionality without introducing regressions
  12. Query planning: 2-phase cost-based optimizer
     • Phase 1: generate a single-node plan (transformations, join ordering, static partition pruning, runtime filters)
     • Phase 2: convert the single-node plan into a distributed query execution plan (add exchange nodes, decide join strategy)
     • Query fragments (units of work) are parts of the query execution tree; each fragment is executed in one or more impalads (see the sketch below)
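     The fragment decomposition in phase 2 can be pictured as cutting the plan tree at exchange nodes. A minimal sketch with hypothetical types (Impala's actual planner lives in the Java front-end):

       #include <vector>

       // Hypothetical plan-node type, not Impala's planner classes.
       struct PlanNode {
         bool is_exchange = false;
         std::vector<PlanNode*> children;
       };

       // Each exchange node starts a new fragment below it, so the number of
       // fragments is the number of exchanges plus one (the root fragment
       // that runs on the coordinator).
       int CountExchanges(const PlanNode& node) {
         int count = node.is_exchange ? 1 : 0;
         for (const PlanNode* child : node.children) count += CountExchanges(*child);
         return count;
       }
       // fragments = CountExchanges(root) + 1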
  13. Metadata & statistics
     • Metadata
       • Table metadata (HMS) and block-level (HDFS) information are cached to speed up query compilation
       • Cached data is stored in FlatBuffers to save space and avoid excessive GC
       • Metadata loading from HDFS/S3/ADLS uses a thread pool to speed up the operation when needed
     • Statistics
       • Impala uses HyperLogLog (HLL) to compute the number of distinct values (NDV)
       • HLL is much faster than an exact COUNT(DISTINCT) and uses a constant amount of memory, so it is far less memory-intensive for columns with high cardinality
       • HLL size is 1KB per column (see the sketch below)
       • A novel implementation of sampled stats is coming soon
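     For intuition, here is a minimal HyperLogLog sketch (illustrative only; Impala's actual implementation and constants differ, and a 64-bit hash of each column value is assumed to be supplied): the top hash bits pick one of 1024 one-byte registers (1KB total, matching the figure above), and each register keeps the maximum rank of the first set bit seen in the remaining bits.

       #include <cstdint>
       #include <cmath>
       #include <vector>

       class HllSketch {
        public:
         HllSketch() : registers_(kRegisters, 0) {}

         void Update(uint64_t hash) {
           uint32_t idx = static_cast<uint32_t>(hash >> (64 - kIndexBits));  // top bits pick a register
           uint64_t rest = hash << kIndexBits;
           // Rank = 1-based position of the first set bit in the remaining bits.
           uint8_t rank = rest == 0 ? static_cast<uint8_t>(64 - kIndexBits + 1)
                                    : static_cast<uint8_t>(__builtin_clzll(rest) + 1);
           if (rank > registers_[idx]) registers_[idx] = rank;
         }

         double Estimate() const {
           double sum = 0;
           for (uint8_t r : registers_) sum += std::ldexp(1.0, -r);  // 2^-r
           double alpha = 0.7213 / (1.0 + 1.079 / kRegisters);       // standard HLL constant
           return alpha * kRegisters * kRegisters / sum;             // raw estimate, no small-range correction
         }

        private:
         static constexpr int kIndexBits = 10;
         static constexpr int kRegisters = 1 << kIndexBits;  // 1024 one-byte registers = 1KB
         std::vector<uint8_t> registers_;
       };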
  14. Query optimizations based on statistics
     • Order scan predicates by selectivity and cost; mitigate correlated predicates (exponential backoff)
     • Detect the common join pattern of primary key/foreign key joins
     • Compute selectivity of predicates for scans as well as joins
     • Determine the build and probe side for equi-joins
     • Select the join type that minimizes resource utilization: broadcast join or partitioned join
     • Identify joins which can benefit from runtime filters
     • Determine the optimal join order

     Example query:

       SELECT l_shipmode, sum(l_extendedprice)
       FROM orders, lineitem
       WHERE o_orderkey = l_orderkey
         AND l_comment LIKE '%long string%'
         AND l_receiptdate >= '1994-01-01'
         AND l_partkey < 100
         AND o_orderdate < '1993-01-01'
       GROUP BY l_shipmode
       ORDER BY l_shipmode

     Resulting plan:

       02:HASH JOIN [INNER JOIN, BROADCAST]
       |  hash predicates: l_orderkey = o_orderkey
       |  fk/pk conjuncts: l_orderkey = o_orderkey
       |  runtime filters: RF000 <- o_orderkey
       |  tuple-ids=1,0 row-size=113B cardinality=27,381,196
       |
       |--05:EXCHANGE [BROADCAST]
       |  |  hosts=20 per-host-mem=0B
       |  |  tuple-ids=0 row-size=8B cardinality=68,452,805
       |  |
       |  00:SCAN HDFS [tpch_3000_parquet.orders, RANDOM]
       |     partitions=366/2406 files=366 size=28.83GB
       |     predicates: tpch_3000_parquet.orders.o_orderkey < 100
       |     table stats: 4,500,000,000 rows total
       |     column stats: all
       |     tuple-ids=0 row-size=8B cardinality=68,452,805
       |
       01:SCAN HDFS [tpch_3000_parquet.lineitem, RANDOM]
          partitions=2526/2526 files=2526 size=1.36TB
          predicates: l_partkey < 100, l_receiptdate >= '1994-01-01', l_comment LIKE '%long string%'
          runtime filters: RF000 -> l_orderkey
          table stats: 18,000,048,306 rows total
          column stats: all
          tuple-ids=1 row-size=105B cardinality=1,800,004,831
  20. Optimization for metadata-only queries
     • Use metadata to avoid table accesses for partition-key scans, e.g. select min(month), max(year) from functional.alltypes; where month and year are partition keys of the table
     • Enabled by the query option OPTIMIZE_PARTITION_KEY_SCANS
     • Applicable to min(), max(), ndv(), and aggregate functions with the DISTINCT keyword, over partition keys only (a sketch of the evaluation follows)

     Plan without the optimization:

       03:AGGREGATE [FINALIZE]
       |  output: min:merge(month), max:merge(year)
       |
       02:EXCHANGE [UNPARTITIONED]
       |
       01:AGGREGATE
       |  output: min(month), max(year)
       |
       00:SCAN HDFS [functional.alltypes]
          partitions=24/24 files=24 size=478.45KB

     Plan with the optimization:

       01:AGGREGATE [FINALIZE]
       |  output: min(month), max(year)
       |
       00:UNION constant-operands=24
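     The idea, as a sketch with hypothetical types: every partition key value is already in the catalog cache, so min/max over partition keys can be folded into constants without any scan. The 24 constant operands of the UNION in the rewritten plan correspond to the 24 cached partitions this loop walks:

       #include <algorithm>
       #include <utility>
       #include <vector>

       // Hypothetical cached-partition record; Impala keeps this in the catalog.
       struct PartitionMeta { int year; int month; };

       // Answers min(month), max(year) from metadata alone ('parts' assumed non-empty).
       std::pair<int, int> MinMonthMaxYear(const std::vector<PartitionMeta>& parts) {
         int min_month = parts.front().month, max_year = parts.front().year;
         for (const auto& p : parts) {
           min_month = std::min(min_month, p.month);
           max_year = std::max(max_year, p.year);
         }
         return {min_month, max_year};
       }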
  21. Scanner: extract common conjuncts from disjunctions
     • Extract common conjuncts from disjunctions: (a AND b) OR (a AND b) ==> a AND b; (a AND b AND c) OR (c) ==> c (a sketch of the rewrite follows)
     • >100x speedup for TPC-DS Q13 and Q48, and for TPC-H Q19

     Example (TPC-H Q19):

       SELECT sum(l_extendedprice * (1 - l_discount)) AS revenue
       FROM lineitem, part
       WHERE (p_partkey = l_partkey
              AND p_brand = 'Brand#32'
              AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
              AND l_quantity >= 7 AND l_quantity <= 7 + 10
              AND p_size BETWEEN 1 AND 5
              AND l_shipmode IN ('AIR', 'AIR REG')
              AND l_shipinstruct = 'DELIVER IN PERSON')
          OR (p_partkey = l_partkey
              AND p_brand = 'Brand#35'
              AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
              AND l_quantity >= 15 AND l_quantity <= 15 + 10
              AND p_size BETWEEN 1 AND 10
              AND l_shipmode IN ('AIR', 'AIR REG')
              AND l_shipinstruct = 'DELIVER IN PERSON')
          OR (p_partkey = l_partkey
              AND p_brand = 'Brand#24'
              AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
              AND l_quantity >= 26 AND l_quantity <= 26 + 10
              AND p_size BETWEEN 1 AND 15
              AND l_shipmode IN ('AIR', 'AIR REG')
              AND l_shipinstruct = 'DELIVER IN PERSON')

     The extracted common conjuncts show up directly in the scan:

       00:SCAN HDFS [tpch_300_parquet.lineitem, RANDOM]
          partitions=1/1 files=259 size=63.71GB
          predicates: l_shipmode IN ('AIR', 'AIR REG'), l_shipinstruct = 'DELIVER IN PERSON'
          runtime filters: RF000 -> l_partkey
          table stats: 1799989091 rows total
          hosts=7 per-host-mem=528.00MB
          tuple-ids=0 row-size=80B cardinality=240533660
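     A sketch of the rewrite with a hypothetical predicate representation: treat each disjunct as a set of conjuncts, intersect the sets, and hoist the intersection above the OR so it can be pushed to the scan.

       #include <algorithm>
       #include <iterator>
       #include <set>
       #include <string>
       #include <vector>

       // Each disjunct is modeled as a set of conjunct strings, e.g.
       // {"l_shipmode IN ('AIR','AIR REG')", "l_shipinstruct = 'DELIVER IN PERSON'", ...}.
       std::set<std::string> CommonConjuncts(
           const std::vector<std::set<std::string>>& disjuncts) {
         std::set<std::string> common = disjuncts.front();
         for (size_t i = 1; i < disjuncts.size(); ++i) {
           std::set<std::string> tmp;
           std::set_intersection(common.begin(), common.end(),
                                 disjuncts[i].begin(), disjuncts[i].end(),
                                 std::inserter(tmp, tmp.begin()));
           common.swap(tmp);
         }
         // Result is ANDed above the OR: (a AND b) OR (a AND c) ==> a AND ((b) OR (c)).
         return common;
       }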
  22. Query execution
     • Not based on MapReduce
     • Single-threaded, row-based, Volcano-style (iterator-based with batches) query execution engine, with multi-threaded scans (interface sketched below)
     • Utilizes HDFS short-circuit local reads and a multi-threaded I/O subsystem
     • Supports multiple file formats (Parquet, Avro, Text, Sequence, ...)
     • Supports nested types (only in Parquet)
     • Code generation using LLVM
     • No transactions, no indices
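     The Volcano-with-batches model boils down to an interface like this sketch (hypothetical names; Impala's real ExecNode API carries more state):

       // Each operator pulls RowBatches from its child; batching amortizes the
       // per-row virtual-call overhead of the classic one-row-at-a-time model.
       struct RowBatch {
         int num_rows = 0;
         // ... column data ...
       };

       class ExecNode {
        public:
         virtual ~ExecNode() = default;
         // Fills 'out' with up to one batch of rows; sets *eos at end-of-stream.
         virtual void GetNext(RowBatch* out, bool* eos) = 0;
        protected:
         ExecNode* child_ = nullptr;  // e.g. a scan node under an aggregation node
       };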
  23. Partitioning
     • Impala does not have native indexes (e.g. B+-tree), but partitioning provides a form of indexing
     • What it is: physically dividing the data so that queries only need to access a subset
     • The partitioning scheme is expressed through DDL:

       CREATE TABLE Sales (...) PARTITIONED BY (year INT, month INT);
       or
       CREATE TABLE Sales (...) PARTITIONED BY (date_key INT);

     • Pruning is applied automatically when queries contain matching predicates (sketched below):

       SELECT ... FROM Sales WHERE year >= 2012 AND month IN (1, 2, 3)
       SELECT ... FROM Sales JOIN DateDim d USING (date_key) WHERE d.year >= 2012 AND d.month IN (1, 2, 3)
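     Static pruning is essentially a filter over the catalog's partition list before any file is opened. A minimal sketch with hypothetical types:

       #include <string>
       #include <vector>

       struct Partition {
         int year;
         int month;
         std::vector<std::string> files;  // only the kept partitions get I/O scheduled
       };

       // WHERE year >= 2012 AND month IN (1, 2, 3): evaluated against partition
       // keys only, so non-matching partitions are dropped at plan time.
       std::vector<const Partition*> Prune(const std::vector<Partition>& parts) {
         std::vector<const Partition*> kept;
         for (const auto& p : parts) {
           if (p.year >= 2012 && p.month >= 1 && p.month <= 3) kept.push_back(&p);
         }
         return kept;
       }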
  24. Optimizations for selective scans: sorting
     • Sorting data files improves the effectiveness of file statistics (min/max) and compression (e.g., delta encoding); a row-group skipping sketch follows
     • Predicates are evaluated against min/max statistics as well as dictionary-encoded columns; this approximates lazy materialization
     • Sorting can be used on columns which have too many values to qualify for partitioning
     • Create sorted data files by adding the SORT BY clause during table creation:

       CREATE TABLE Sales (...)
       PARTITIONED BY (year INT, month INT)
       SORT BY (day, hour)
       STORED AS PARQUET;

     • The Parquet community is working on extending the format for efficient point lookups
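     What sorted files buy you: Parquet keeps min/max statistics per row group, so a scan whose predicate is on the sort column can skip most row groups, because matching rows cluster together. A sketch with hypothetical stats types:

       #include <cstdint>
       #include <vector>

       struct RowGroupStats { int64_t min_val; int64_t max_val; };

       // Predicate: col BETWEEN lo AND hi. A row group can match only if its
       // [min, max] range overlaps [lo, hi].
       bool MightMatch(const RowGroupStats& s, int64_t lo, int64_t hi) {
         return s.max_val >= lo && s.min_val <= hi;
       }

       std::vector<int> RowGroupsToScan(const std::vector<RowGroupStats>& stats,
                                        int64_t lo, int64_t hi) {
         std::vector<int> keep;
         for (int i = 0; i < static_cast<int>(stats.size()); ++i) {
           if (MightMatch(stats[i], lo, hi)) keep.push_back(i);  // all others are skipped
         }
         return keep;
       }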
  25. Optimizations for selective scans: augment partitioning
     • Business question: find the top 10 customers in terms of revenue who made purchases on Christmas Eve in a given time window
     • SORT BY helps meet query SLAs without over-partitioning the table

       CREATE TABLE Sales (...)
       PARTITIONED BY (year INT, month INT)
       SORT BY (day, hour)
       STORED AS PARQUET;

       SELECT sum(ss_ext_sales_price) AS revenue,
              count(ss_quantity) AS quantity_count,
              customer_name
       FROM Sales
       WHERE year = 2017 AND month = 12 AND day = 24 AND hour BETWEEN 1 AND 4
       GROUP BY customer_name
       ORDER BY revenue DESC
       LIMIT 10;

     Speedup from partitioning + sorting:

       Metric            Speedup
       Elapsed time      3x
       CPU time          5x
       HDFS MBytes read  17x
  26. Optimizations for selective scans: complement partitioning
     • Business question: find interactions for a specific customer in a given time window
     • SORT BY helps meet query SLAs without over-partitioning the table

       CREATE TABLE Sales (...)
       PARTITIONED BY (year INT, month INT)
       SORT BY (customer_id)
       STORED AS PARQUET;

       SELECT * FROM Sales WHERE year = 2016 AND customer_id = 4976004;

     Speedup from partitioning + sorting:

       Metric            Speedup
       Elapsed time      18x
       HDFS MBytes read  22x
  27. LLVM codegen in Impala
     • Impala uses runtime code generation to produce query-specific versions of functions that are critical to performance
     • In particular, code generation is applied to “inner loop” functions
     • Code generation (codegen) lets Impala use query-specific information to do less work:
       • Remove conditionals
       • Propagate constant offsets, pointers, etc.
       • Inline virtual function calls
  28. LLVM codegen in Impala
     • Operations: hash join, aggregation, scans (Parquet, Text, Sequence, Avro), expressions in all operators, union, sort, top-N, runtime filters
     • Data types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING, VARCHAR, DECIMAL
     • Further optimizations:
       • Don't codegen short-running queries (relies on cardinality estimates)
       • Limit the amount of inlining for long and complex expressions (e.g. GROUP BY over 1K columns)
     A minimal LLVM example follows.
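     For a feel of the mechanism, a minimal, self-contained LLVM C++ API example that builds IR for int add1(int x) { return x + 1; }. This is generic LLVM usage, not Impala's codegen classes, and API details vary slightly across LLVM versions:

       #include <llvm/IR/IRBuilder.h>
       #include <llvm/IR/LLVMContext.h>
       #include <llvm/IR/Module.h>
       #include <llvm/IR/Verifier.h>
       #include <llvm/Support/raw_ostream.h>

       int main() {
         llvm::LLVMContext ctx;
         llvm::Module module("demo", ctx);
         llvm::IRBuilder<> builder(ctx);

         // Declare: i32 add1(i32)
         auto* fn_type = llvm::FunctionType::get(builder.getInt32Ty(),
                                                 {builder.getInt32Ty()}, false);
         auto* fn = llvm::Function::Create(fn_type, llvm::Function::ExternalLinkage,
                                           "add1", &module);
         // Body: return x + 1
         builder.SetInsertPoint(llvm::BasicBlock::Create(ctx, "entry", fn));
         llvm::Value* x = &*fn->arg_begin();
         builder.CreateRet(builder.CreateAdd(x, builder.getInt32(1)));

         llvm::verifyFunction(*fn);            // sanity-check the generated IR
         module.print(llvm::outs(), nullptr);  // dump the textual IR
         return 0;
       }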
  29. Codegen for ORDER BY & top-N. Running example:

       select l_extendedprice, l_orderkey from lineitem order by l_orderkey limit 100

       l_orderkey  l_extendedprice  l_shipdate  l_shipmode
       26039617    13515            12/8/1997   FOB
       30525093    16218            12/16/1997  REG AIR
       28809990    7208             10/19/1997  AIR

  30. Codegen for ORDER BY & top-N. Original interpreted comparator:

       int Compare(TupleRow* lhs, TupleRow* rhs) const {
         for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
           void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
           void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
           if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
           if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
           int result = RawValue::Compare(lhs_value, rhs_value,
               sort_cols_lhs_[i]->root()->type());
           if (!is_asc_[i]) result = -result;
           if (result != 0) return result;
           // Otherwise, try the next Expr
         }
         return 0;  // fully equivalent key
       }

     Every iteration calls ExprContext::GetValue(), which switches on the expression type (TYPE_BOOLEAN, TYPE_TINYINT, TYPE_INT, ...), and then RawValue::Compare(), which switches on the column type:

       int RawValue::Compare(const void* v1, const void* v2, const ColumnType& type) {
         switch (type.type) {
           case TYPE_INT:
             i1 = *reinterpret_cast<const int32_t*>(v1);
             i2 = *reinterpret_cast<const int32_t*>(v2);
             return i1 > i2 ? 1 : (i1 < i2 ? -1 : 0);
           case TYPE_BIGINT:
             b1 = *reinterpret_cast<const int64_t*>(v1);
             b2 = *reinterpret_cast<const int64_t*>(v2);
             return b1 > b2 ? 1 : (b1 < b2 ? -1 : 0);
           ...

  34. Codegen for ORDER BY & top-N. Codegen'd comparator for this query:

       int CompareCodgened(TupleRow* lhs, TupleRow* rhs) const {
         int64_t lhs_value = sort_columns[0]->GetBigIntVal(lhs);
         int64_t rhs_value = sort_columns[0]->GetBigIntVal(rhs);
         int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
         if (result != 0) return result;
         return 0;  // fully equivalent key
       }

     • Perfectly unrolls the “for each sort column” loop
     • No switching on input type(s)
     • Removes branching on ASCENDING/DESCENDING and NULLS FIRST/LAST
     • 10x more efficient code
  38. Distributed aggregation in MPP
     • Example: select cust_id, sum(dollars) from sales group by cust_id;
     • Aggregations have two phases: a pre-aggregation phase and a merge phase (Scan -> Preagg -> network exchange -> Merge)
     • The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value, e.g. many sales per customer
  39. Downside of pre-aggregations
     • Example: select distinct * from sales;
     • Pre-aggregations consume memory and CPU cycles
     • Pre-aggregations are not always effective at reducing network traffic, e.g. select distinct over nearly-distinct rows
     • Pre-aggregations can spill to disk under memory pressure; disk I/O is expensive, so it is better to send rows to the merge aggregation than to disk
  40. Streaming pre-aggregations in Impala
     • Example: select distinct * from sales; (Scan -> Stream -> network exchange -> Merge)
     • The reduction factor is dynamically estimated based on the actual data processed
     • The pre-aggregation expands memory usage only if the reduction factor is good; otherwise rows stream through to the merge phase (see the sketch below)
     • Benefits:
       • Certain aggregations with a low reduction factor see speedups of up to 40%
       • Memory consumption can be reduced by 50% or more
       • Streaming pre-aggregations don't spill to disk
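     A toy sketch of the pass-through policy (not Impala's operator; names and the threshold are made up): the pre-aggregation tracks its observed reduction factor, and once its memory budget is reached with a poor reduction, new groups are streamed to the merge phase instead of being inserted, so it never spills.

       #include <cstdint>
       #include <unordered_map>

       // Streaming pre-aggregation for: SELECT key, sum(val) ... GROUP BY key.
       class StreamingPreAgg {
        public:
         // Returns true if the row was absorbed locally; false means "pass
         // through" (ship the row to the merge aggregation unchanged).
         bool Consume(int64_t key, int64_t val) {
           ++rows_in_;
           auto it = groups_.find(key);
           if (it != groups_.end()) { it->second += val; return true; }
           // New group: only grow memory if grouping is actually collapsing rows.
           double reduction = static_cast<double>(rows_in_) / (groups_.size() + 1);
           if (groups_.size() >= max_groups_ && reduction < kMinReduction) return false;
           groups_.emplace(key, val);
           return true;
         }

        private:
         static constexpr double kMinReduction = 2.0;  // illustrative threshold
         size_t max_groups_ = 1 << 20;                 // stand-in for the memory budget
         uint64_t rows_in_ = 0;
         std::unordered_map<int64_t, int64_t> groups_;
       };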
  41. Optimizations for selective joins: runtime filters
     • When runtime filters are useful: to optimize selective equi-joins against large partitioned or unpartitioned tables
     • General idea: avoid unnecessary I/O to read partition data, and avoid unnecessary network transmission by sending only the subset of rows that match the join keys across the network
     • Some predicates can only be computed at runtime
     • Uses a Bloom filter, a probabilistic data structure, to store and query values of the join column(s)
  42. Runtime filters in action. Example query:

       SELECT c_email_address, sum(ss_ext_sales_price) sum_agg
       FROM store_sales, customer, customer_demographics
       WHERE ss_customer_sk = c_customer_sk
         AND cd_demo_sk = c_current_cdemo_sk
         AND cd_gender = 'M'
         AND cd_purchase_estimate = 10000
         AND cd_credit_rating = 'Low Risk'
       GROUP BY c_email_address
       ORDER BY sum_agg DESC

     [Plan diagram] store_sales (43 billion rows) is shuffled into Join #1 with customer (3.8 million rows); customer_demographics (2,400 rows after its predicates) is broadcast into Join #2; the 49 million resulting rows are aggregated. Step by step:
     • Joins #1 and #2 are expensive, since the left side of the joins has 43 billion rows
     • Create a Bloom filter from Join #2 on cd_demo_sk and push it down to the customer table scan
     • Customer rows are reduced by 826x: from 3.8 million to 4,600 rows
     • Create a Bloom filter from Join #1 on c_customer_sk and push it down to the store_sales table scan
     • 877x reduction in rows: 43 billion -> 49 million rows
  48. Optimizations for selective joins: runtime filters
     • Cache-friendly: each key is hashed to a Bloom filter block the size of a cache line or smaller (see the sketch below)
     • Uses the AVX2 instruction set for insert and find operations
     • The planner uses a cost model to decide which joins benefit from runtime filters
     • Bloom filters are sized based on the estimated number of distinct values on the build side (1MB-16MB by default)
     • Ineffective runtime filters are disabled dynamically (reducing CPU overhead) when the Bloom filter has a high false-positive rate or poor filter selectivity
     • Min and max values from the build side are tracked and pushed as new implied predicates to the scan of the probe side (supported for Kudu in CDH 5.14; Parquet support coming soon)
     • Runtime filters are codegen'd, avoiding virtual function calls and branching in the inner loop
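     A scalar sketch of the cache-line-blocked design (the AVX2 version does all eight lanes in one 256-bit operation; the salts below follow the published split-block scheme, but treat the details as illustrative rather than Impala's exact code): each key maps to one 32-byte block and sets one bit per 32-bit lane, so an insert or lookup touches a single cache line.

       #include <cstdint>
       #include <vector>

       class BlockedBloomFilter {
        public:
         explicit BlockedBloomFilter(int log_num_blocks)
             : num_blocks_(1u << log_num_blocks),
               blocks_(num_blocks_ * kLanes, 0) {}

         void Insert(uint32_t hash) {
           uint32_t* block = &blocks_[(hash & (num_blocks_ - 1)) * kLanes];
           for (int i = 0; i < kLanes; ++i) block[i] |= MaskBit(hash, i);
         }

         bool Find(uint32_t hash) const {
           const uint32_t* block = &blocks_[(hash & (num_blocks_ - 1)) * kLanes];
           for (int i = 0; i < kLanes; ++i) {
             if ((block[i] & MaskBit(hash, i)) == 0) return false;  // definitely absent
           }
           return true;  // present, or a false positive
         }

        private:
         static constexpr int kLanes = 8;  // 8 x 32-bit words = one 32-byte block

         // One odd multiplier per lane rehashes the key so each lane picks an
         // independent bit; the top 5 bits select a bit position in [0, 32).
         static uint32_t MaskBit(uint32_t hash, int lane) {
           static constexpr uint32_t kSalt[kLanes] = {
               0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
               0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};
           return 1u << ((hash * kSalt[lane]) >> 27);
         }

         uint32_t num_blocks_;  // power of two, so '& (num_blocks_ - 1)' selects a block
         std::vector<uint32_t> blocks_;
       };

     On the build side each join-key hash is inserted; on the probe side the scan drops any row whose key hash fails Find().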
  49. Joins and aggregations under the hood. [Profile chart] Most of the time is spent in L1 & L2 cache misses!

  51. Old hash table probe pseudo-code:

       int hash_value, probe_value, partition_id;
       for (int i = 0; i < probe_batch_size; i++) {
         // Read probe join key from input batch
         probe_value = probe_batch_->GetRow(i);
         // Compute hash value
         hash_value = Hash(probe_value);
         // Find the matching partition in the hash table
         partition_id = hash_value >> (32 - NUM_PARTITIONING_BITS);
         // Probe the hash table: CACHE-MISS, and a data dependency
         // resulting in poor pipelining
         iterator = hash_tbls_[partition_id]->FindProbeRow(
             ht_ctx, hash_value, probe_value);
       }

  57. Current hash table probe pseudo-code:

       int hash_values[], probe_values[], partition_ids[];
       for (int i = 0; i < probe_batch_size; i++) {
         // Read probe join key from input batch
         probe_values[i] = probe_batch_->GetRow(i);
         // Compute hash value
         hash_values[i] = Hash(probe_values[i]);
         // Find the matching partition in the hash table
         partition_ids[i] = hash_values[i] >> (32 - NUM_PARTITIONING_BITS);
         // Prefetch the bucket for this hash value into the LLC
         hash_tbls_[partition_ids[i]]->Prefetch(hash_values[i]);
       }
       for (int i = 0; i < probe_batch_size; i++) {
         // Probe the hash table: CACHE-HIT
         iterator = hash_tbls_[partition_ids[i]]->FindProbeRow(
             ht_ctx, hash_values[i], probe_values[i]);
       }

     • The loop is broken up into two parts to reduce the data dependency
     • The batch size is 1024 rows, giving enough time to ensure prefetched data is in the LLC
     A runnable sketch of this pattern follows.
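     A self-contained sketch of the batched probe-with-prefetch pattern (not Impala's code; toy hash, open addressing with a power-of-two table, and collision chains ignored for brevity). __builtin_prefetch is a GCC/Clang builtin:

       #include <cstdint>
       #include <vector>

       constexpr int kBatchSize = 1024;

       struct Bucket { uint32_t hash; int64_t key; };

       inline uint32_t HashKey(int64_t key) {  // toy multiplicative hash
         uint64_t h = static_cast<uint64_t>(key) * 0x9e3779b97f4a7c15ULL;
         return static_cast<uint32_t>(h >> 32);
       }

       void ProbeBatch(const std::vector<Bucket>& table,  // size is a power of two
                       const int64_t* keys, int n,        // n <= kBatchSize
                       bool* matched) {
         uint32_t hashes[kBatchSize];
         const uint32_t mask = static_cast<uint32_t>(table.size() - 1);

         // Pass 1: compute hashes and issue software prefetches for the buckets.
         for (int i = 0; i < n; ++i) {
           hashes[i] = HashKey(keys[i]);
           __builtin_prefetch(&table[hashes[i] & mask], /*rw=*/0, /*locality=*/3);
         }
         // Pass 2: probe; the earlier prefetches hide most of the memory latency.
         for (int i = 0; i < n; ++i) {
           const Bucket& b = table[hashes[i] & mask];
           matched[i] = (b.hash == hashes[i] && b.key == keys[i]);
         }
       }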
  61. Joins and aggregations in Impala: lessons learnt
     • Don't underestimate the impact of cache misses
     • Prefetching is useful when done before the data needs to be read from memory: 30-40% speedup for join and aggregation operations
     • Removing data dependencies opens opportunities for improvement
     • Caveats:
       • TLB misses are still an issue
       • Prefetching is only as good as the hash function used; it won't handle chaining
  62. Impala roadmap focus
     • Performance & scalability: Parquet scanner performance, metadata, RPC layer
     • Reliability: node decommission, better resource management
     • Cloud
  63. Thank you
     https://github.com/apache/impala
     http://impala.apache.org/
     Impala: A Modern, Open-Source SQL Engine for Hadoop
