Hive Correlation Optimizer
Yin Huai
Presented at Hadoop Summit 2013 Hive User Group Meetup
1.
Hive Correlation Optimizer
Yin Huai
yhuai@hortonworks.com, huai@cse.ohio-state.edu
Hadoop Summit 2013 Hive User Group Meetup
© Hortonworks Inc. 2011
2.
About me
• Hive contributor
• Summer intern at Hortonworks
• 4th year Ph.D. student at The Ohio State University
• Research interests: query optimizations, file formats, distributed systems, and storage systems
Architecting the Future of Big Data
3.
Outline
• Query planning in Hive
• Correlations in a query (intra-query correlations)
• Case studies
• Automatically exploiting correlations (HIVE-2206: Correlation Optimizer)
4.
Query planning
  SELECT t1.c2, count(*)
  FROM t1 JOIN t2 ON (t1.c1=t2.c1)
  GROUP BY t1.c2
[Diagram: t1, t2 → JOIN (t1.c1=t2.c1) → AGG (calculate count(*) for every group of t1.c2)]
5.
Query planning
  SELECT t1.c2, count(*)
  FROM t1 JOIN t2 ON (t1.c1=t2.c1)
  GROUP BY t1.c2
Evaluate this query in distributed systems.
How to shuffle? Use the key column(s).
[Diagram: t1, t2 → Shuffle on c1 → JOIN → Shuffle on c2 → AGG]
6.
Generating MapReduce jobs
One MapReduce job can shuffle data only once, so the plan with two shuffles becomes two jobs.
[Diagram: Job 1: t1, t2 → Shuffle on c1 → JOIN → tmp. Job 2: tmp → Shuffle on c2 → AGG]
7.
Generating MapReduce jobs
• MapReduce will shuffle data for us; we just need to emit outputs from the Map phase.
• We use ReduceSinkOperator (RS) to emit Map outputs. An RS marks the end of a Map phase.
[Diagram: Job 1 Map: t1 and t2 each end in an RS. Job 1 Reduce: JOIN → tmp. Job 2 Map: tmp ends in an RS. Job 2 Reduce: AGG]
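The two-job plan above can be sketched in plain Python. This is only an illustration of the shuffle-per-job idea, not Hive code: the `shuffle`, `job1_join`, and `job2_agg` helpers and the sample rows are hypothetical, and each "job" performs exactly one shuffle.

```python
from collections import defaultdict


def shuffle(records, key_fn):
    """Group map outputs by key, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec)
    return groups


def job1_join(t1, t2):
    # Map phase: tag each row with its source table and emit it keyed by c1.
    tagged = [("t1", r) for r in t1] + [("t2", r) for r in t2]
    tmp = []
    # Reduce phase: join the t1 and t2 rows inside each c1 group.
    for rows in shuffle(tagged, lambda tr: tr[1]["c1"]).values():
        left = [r for tag, r in rows if tag == "t1"]
        right = [r for tag, r in rows if tag == "t2"]
        tmp.extend(l for l in left for _ in right)
    return tmp  # materialized between the two jobs


def job2_agg(tmp):
    # Map phase: emit keyed by c2. Reduce phase: count rows per group.
    return {c2: len(rows) for c2, rows in shuffle(tmp, lambda r: r["c2"]).items()}


# Hypothetical sample data.
t1 = [{"c1": 1, "c2": "a"}, {"c1": 2, "c2": "a"}, {"c1": 3, "c2": "b"}]
t2 = [{"c1": 1}, {"c1": 1}, {"c1": 3}]
result = job2_agg(job1_join(t1, t2))
print(result)
```

The join needs rows grouped by c1 while the aggregation needs them grouped by c2, which is why this query cannot be evaluated with a single shuffle.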
9.
Intra-query correlations
  SELECT x.c1, count(*)
  FROM t1 x JOIN t1 y ON (x.c1=y.c1)
  GROUP BY x.c1
[Diagram: t1 as x, t1 as y → JOIN (x.c1=y.c1) → AGG (calculate count(*) for every group of x.c1)]
Correlations:
1. Same input tables
2. JOIN and AGG using the same key
10.
Intra-query correlations
  SELECT p.c1, q.c2, q.cnt
  FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p
  JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q
  ON (p.c1=q.c1)
[Diagram: t1 as x, t2 as y → JOIN1 (x.c1=y.c1); t1 as z → AGG1 (calculate count(*) for every group of z.c1); JOIN1, AGG1 → JOIN2 (p.c1=q.c1)]
Correlations:
1. Same input tables (t1)
2. JOIN1 and AGG1 using the same key
3. JOIN2 and all of its parents using the same key
11.
Intra-query correlations
• Defined in "YSmart: Yet Another SQL-to-MapReduce Translator"
  – http://ysmart.cse.ohio-state.edu/
  – http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
• Targets operators that need to shuffle data, and their inputs
• Three kinds of correlations
  – Input correlation (IC): independent operators share the same input tables
  – Transit correlation (TC): independent operators have input correlation and also shuffle the data in the same way (e.g. using the same keys)
  – Job flow correlation (JFC): two dependent operators shuffle the data in the same way
[Diagram, IC: JOIN1 (t1 as x, t2 as y) and AGG1 (t1 as z) share t1. TC: JOIN1 (x.c1=y.c1) and AGG1 (group by z.c1) share t1 and the key. JFC: JOIN (x.c1=y.c1) feeds AGG (group by z.c1) with the same key]
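The three definitions can be read as a small classification rule. The sketch below is one possible encoding, not the YSmart or Hive implementation: the `Op` record, its fields, and the `correlations` function are hypothetical, and shuffle keys are simplified to bare column names (so `x.c1` and `z.c1` both appear as `c1`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    inputs: frozenset  # names of the input tables the operator reads
    key: tuple         # shuffle key columns (aliases already resolved)
    parent: str = ""   # name of the operator that consumes this one's output


def correlations(a, b):
    """Return the kinds of correlation (IC, TC, JFC) between two operators."""
    found = set()
    dependent = a.parent == b.name or b.parent == a.name
    if dependent:
        # Job flow correlation: dependent operators shuffling the same way.
        if a.key == b.key:
            found.add("JFC")
    elif a.inputs & b.inputs:
        # Input correlation: independent operators sharing an input table.
        found.add("IC")
        # Transit correlation: input correlation plus the same shuffle key.
        if a.key == b.key:
            found.add("TC")
    return found


join1 = Op("JOIN1", frozenset({"t1", "t2"}), ("c1",))
agg1 = Op("AGG1", frozenset({"t1"}), ("c1",))
join = Op("JOIN", frozenset({"t1"}), ("c1",), parent="AGG")
agg = Op("AGG", frozenset(), ("c1",))

print(correlations(join1, agg1))  # independent operators
print(correlations(join, agg))    # dependent operators
```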
12.
Correlation-unaware query planning
Hive does not care:
1. If a table has been used multiple times
2. If data really needs to be shuffled
[Diagram: t1, t1 → Shuffle on c1 → JOIN → Shuffle on c1 → AGG, executed as Job 1 (JOIN → tmp) and Job 2 (tmp → AGG)]
Drawbacks:
1. Unnecessary data loading
2. Unnecessary data shuffling
3. Unnecessary data materialization
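To make the drawbacks concrete, here is a toy comparison for the self-join query `SELECT x.c1, count(*) FROM t1 x JOIN t1 y ON (x.c1=y.c1) GROUP BY x.c1`. Both functions and the sample rows are hypothetical; the point is only that, because the join and the aggregation use the same key, one pass over t1 and one grouping by c1 suffice.

```python
from collections import Counter


def two_job_plan(t1):
    # Job 1: read t1 twice, shuffle on c1, materialize the self-join as tmp.
    tmp = [x["c1"] for x in t1 for y in t1 if x["c1"] == y["c1"]]
    # Job 2: reload tmp, shuffle on c1 again, count rows per group.
    return dict(Counter(tmp))


def one_job_plan(t1):
    # One scan of t1 and one shuffle on c1.
    rows_per_key = Counter(row["c1"] for row in t1)
    # A c1 group with n rows produces n * n joined rows, so count(*) = n * n.
    return {c1: n * n for c1, n in rows_per_key.items()}


# Hypothetical sample data.
t1 = [{"c1": 1}, {"c1": 1}, {"c1": 2}]
print(two_job_plan(t1))
print(one_job_plan(t1))
```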
14.
Case studies: TPC-H Q17 (Flattened)
  SELECT sum(l_extendedprice) / 7.0 as avg_yearly
  FROM (SELECT l_partkey, l_quantity, l_extendedprice
        FROM lineitem JOIN part ON (p_partkey=l_partkey)
        WHERE p_brand='Brand#35' AND p_container = 'MED PKG') touter
  JOIN (SELECT l_partkey as lp, 0.2 * avg(l_quantity) as lq
        FROM lineitem
        GROUP BY l_partkey) tinner
  ON (touter.l_partkey = tinner.lp)
  WHERE touter.l_quantity < tinner.lq
15.
Case studies: TPC-H Q17 (Flattened)
[Diagram: lineitem, part → JOIN1; lineitem → AGG1; JOIN1, AGG1 → JOIN2 → AGG2]
• lineitem is used by JOIN1 and AGG1
• JOIN1, AGG1, and JOIN2 share the same key
16.
Case studies: TPC-H Q17 (Flattened)
Without Correlation Optimizer
[Diagram: the same plan (lineitem, part, JOIN1, lineitem, AGG1, JOIN2, AGG2), divided into Job 1, Job 2, Job 3, and Job 4]
17.
Case studies: TPC-H Q17 (Flattened)
[Diagram: without Correlation Optimizer, the plan is divided into Job 1 to Job 4 and lineitem appears twice; with Correlation Optimizer, lineitem appears once and the plan is divided into Job 1 and Job 2]
18.
Case studies: TPC-DS Q95 (Flattened)
  SELECT count(distinct ws1.ws_order_number) as order_count,
         sum(ws1.ws_ext_ship_cost) as total_shipping_cost,
         sum(ws1.ws_net_profit) as total_net_profit
  FROM web_sales ws1
  JOIN customer_address ca ON (ws1.ws_ship_addr_sk = ca.ca_address_sk)
  JOIN web_site s ON (ws1.ws_web_site_sk = s.web_site_sk)
  JOIN date_dim d ON (ws1.ws_ship_date_sk = d.d_date_sk)
  LEFT SEMI JOIN (SELECT ws2.ws_order_number as ws_order_number
                  FROM web_sales ws2 JOIN web_sales ws3
                  ON (ws2.ws_order_number = ws3.ws_order_number)
                  WHERE ws2.ws_warehouse_sk <> ws3.ws_warehouse_sk) ws_wh1
  ON (ws1.ws_order_number = ws_wh1.ws_order_number)
  LEFT SEMI JOIN (SELECT wr_order_number
                  FROM web_returns wr
                  JOIN (SELECT ws4.ws_order_number as ws_order_number
                        FROM web_sales ws4 JOIN web_sales ws5
                        ON (ws4.ws_order_number = ws5.ws_order_number)
                        WHERE ws4.ws_warehouse_sk <> ws5.ws_warehouse_sk) ws_wh2
                  ON (wr.wr_order_number = ws_wh2.ws_order_number)) tmp1
  ON (ws1.ws_order_number = tmp1.wr_order_number)
  WHERE d.d_date >= '2001-05-01' AND
        d.d_date <= '2001-06-30' AND
        ca.ca_state = 'NC' AND
        s.web_company_name = 'pri'
19.
Case studies: TPC-DS Q95 (Flattened)
[Diagram: operator tree for Q95. Nodes: web_sales, customer_address, web_site, date_dim, Map Join, Semi Join, AGG; two JOIN1 nodes, each over web_sales and web_sales; JOIN2 over web_returns and one JOIN1]
20.
Case studies: TPC-DS Q95 (Flattened)
Without Correlation Optimizer
• 6 MapReduce jobs
• Unnecessary data loading (black web_sales nodes)
• Unnecessary data shuffling
[Diagram: the Q95 operator tree divided into Job 1 to Job 6]
21.
Case studies: TPC-DS Q95 (Flattened)
With Correlation Optimizer
• Black web_sales nodes share the same data loading
[Diagram: the Q95 operator tree with a single shared web_sales input feeding both JOIN1 nodes]
22.
Case studies: TPC-DS Q95 (Flattened)
With Correlation Optimizer
• Black web_sales nodes share the same data loading
• 3 MapReduce jobs
[Diagram: the optimized Q95 operator tree divided into Job 1, Job 2, and Job 3]
23.
Case studies: TPC-DS Q95 (Flattened)
Follow-up work
• Evaluate JOIN1 only once without materializing a temporary table
[Diagram: the Q95 operator tree with a single JOIN1 node]
24.
Case studies: TPC-DS Q95 (Flattened)
Follow-up work
• Evaluate JOIN1 only once without materializing a temporary table
• Only use 2 MapReduce jobs
[Diagram: the Q95 operator tree with a single JOIN1 node, divided into Job 1 and Job 2]
26.
Objectives
• Eliminate unnecessary data loading
  – The query planner will be aware of what data will be loaded
  – Do as many things as possible with loaded data
• Eliminate unnecessary data shuffling
  – The query planner will be aware of when data really needs to be shuffled
  – Do as many things as possible before shuffling the data again
27.
ReduceSink Deduplication
• HIVE-2340
• Handles chained job flow correlations
  – e.g. generating a single job for both Group By and Order By
• Cannot handle complex patterns
  – e.g. patterns that involve multiple Joins
• Need a fundamental solution
• Need to exploit shared input tables
[Diagram: t1 → RS1 → AGG1 → RS2 → … becomes t1 → RS1 → AGG1 → …]
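The chained case in the diagram can be illustrated with a toy pass over a linear operator chain. This is a simplified sketch, not the HIVE-2340 logic: the tuple encoding of operators and the `dedup_reduce_sinks` helper are hypothetical, and "shuffles the same way" is reduced to "has an equal key".

```python
def dedup_reduce_sinks(chain):
    """Drop an RS when the previous RS in the chain already used the same key."""
    out = []
    last_key = None
    for kind, detail in chain:
        if kind == "RS":
            if detail == last_key:
                # Data is already partitioned on this key; skip the shuffle.
                continue
            last_key = detail
        out.append((kind, detail))
    return out


def count_jobs(chain):
    # In this toy model every remaining RS starts one MapReduce job.
    return sum(1 for kind, _ in chain if kind == "RS")


# t1 -> RS1 (key c1) -> AGG1 -> RS2 (key c1) -> AGG2
chain = [("TS", "t1"), ("RS", "c1"), ("AGG", "AGG1"), ("RS", "c1"), ("AGG", "AGG2")]
deduped = dedup_reduce_sinks(chain)
print(count_jobs(chain), count_jobs(deduped))
```

A linear scan like this has no way to express a plan where several branches meet at a Join, which is the limitation the slide points out.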
28.
Correlation Optimizer
• 2-phase optimizer
  – Phase 1: Correlation detection
  – Phase 2: Query plan tree transformation
• This work is not just about the optimizer
  – New operators to support the execution of an optimized plan
  – A mechanism to coordinate the operator tree inside the Reduce phase
29.
Correlation detection
  SELECT p.c1, q.c2, q.cnt
  FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p
  JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q
  ON (p.c1=q.c1)
1. Traverse the tree all the way down to find matching keys in ReduceSinkOperators
2. Then, check input tables to find shared data loading opportunities
[Diagram: t1 as x → RS3 (key: x.c1) and t2 as y → RS4 (key: y.c1) → JOIN1 → RS1 (key: p.c1); t1 as z → RS5 (key: z.c1) → AGG1 → RS2 (key: q.c1); RS1, RS2 → JOIN2]
30.
Query plan tree transformation
  SELECT p.c1, q.c2, q.cnt
  FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p
  JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q
  ON (p.c1=q.c1)
[Diagram, before: t1 as x → RS3 (key: x.c1) and t2 as y → RS4 (key: y.c1) → JOIN1 → RS1 (key: p.c1); t1 as z → RS5 (key: z.c1) → AGG1 → RS2 (key: q.c1); RS1, RS2 → JOIN2. After: t1 as x, z and t2 as y feed RS1, RS2, and RS3, followed by JOIN1, AGG1, and JOIN2]
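The detection and transformation steps for this example can be sketched as a bottom-up tree rewrite. This is an illustrative model rather than Hive's optimizer: the `Node` class and helper functions are hypothetical, all keys are assumed to resolve to the same underlying column (written `c1`), and only the key-matching step is modeled, not the sharing of the t1 scan.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    key: str = ""  # shuffle key; only meaningful on ReduceSink (RS) nodes
    children: list = field(default_factory=list)

    @property
    def is_rs(self):
        return self.name.startswith("RS")


def nearest_rs(node):
    """ReduceSinks reachable below node without crossing another RS."""
    found = []
    for child in node.children:
        found.extend([child] if child.is_rs else nearest_rs(child))
    return found


def drop_correlated_rs(node):
    """Bottom-up: splice out an RS whose key matches every RS right beneath it."""
    node.children = [drop_correlated_rs(c) for c in node.children]
    if node.is_rs:
        below = nearest_rs(node)
        if below and all(rs.key == node.key for rs in below):
            return node.children[0]  # input is already partitioned on this key
    return node


def rs_names(node):
    names = [node.name] if node.is_rs else []
    for child in node.children:
        names.extend(rs_names(child))
    return names


plan = Node("JOIN2", children=[
    Node("RS1", "c1", [Node("JOIN1", children=[
        Node("RS3", "c1", [Node("t1 as x")]),
        Node("RS4", "c1", [Node("t2 as y")]),
    ])]),
    Node("RS2", "c1", [Node("AGG1", children=[
        Node("RS5", "c1", [Node("t1 as z")]),
    ])]),
])

plan = drop_correlated_rs(plan)
print(rs_names(plan))
```

In this model only the three ReduceSinks directly above the table scans remain, which is consistent with the three RS nodes shown in the transformed plan (there renumbered RS1 to RS3).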
31.
Thanks