Optimizing Hive Queries
Owen O'Malley
Owen O'Malley gave a talk at Hadoop Summit EU 2013 about optimizing Hive queries.
1.
Optimizing Hive Queries
Owen O’Malley, Founder and Architect
owen@hortonworks.com
@owen_omalley
© Hortonworks Inc. 2013
2.
Who Am I?
• Founder and Architect at Hortonworks
  – Working on Hive, working with customers
  – Formerly Hadoop MapReduce & Security
  – Been working on Hadoop since the beginning
• Apache Hadoop, ASF
  – Hadoop PMC (Original VP)
  – Tez, Ambari, Giraph PMC
  – Mentor for: Accumulo, Kafka, Knox
  – Apache Member
3.
Outline
• Data Layout
• Data Format
• Joins
• Debugging
4.
Data Layout
Location, Location, Location
5.
Fundamental Questions
• What is your primary use case?
  – What kind of queries and filters?
• How do you need to access the data?
  – What information do you need together?
• How much data do you have?
  – What is your year to year growth?
• How do you get the data?
6.
HDFS Characteristics
• Provides Distributed File System
  – Very high aggregate bandwidth
  – Extreme scalability (up to 100 PB)
  – Self-healing storage
  – Relatively simple to administer
• Limitations
  – Can’t modify existing files
  – Single writer for each file
  – Heavy bias for large files (> 100 MB)
7.
Choices for Layout
• Partitions
  – Top level mechanism for pruning
  – Primary unit for updating tables (& schema)
  – Directory per value of specified column
• Bucketing
  – Hashed into a file, good for sampling
  – Controls write parallelism
• Sort order
  – The order the data is written within file
8.
Example Hive Layout
• Directory Structure
  warehouse/$database/$table
• Partitioning
  /part1=$partValue/part2=$partValue
• Bucketing
  /$bucket_$attempt (e.g. 000000_0)
• Sort
  – Each file is sorted within the file
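The directory scheme on this slide can be sketched as a small path builder. This is only an illustration: the function name, the table and column names, and the use of a plain modulo on an integer key to pick the bucket are assumptions, not Hive's actual hashing code.

```python
def bucket_file_path(warehouse, database, table, partitions,
                     bucket_key, num_buckets, attempt=0):
    """Return the file a row would land in under the layout on the slide.

    partitions: ordered list of (column, value) pairs.
    bucket_key: non-negative integer key, assumed here to hash to itself.
    """
    # One directory level per partition column, named column=value.
    part_dirs = "/".join(f"{col}={val}" for col, val in partitions)
    # Assumption: the bucket is the key modulo the bucket count.
    bucket = bucket_key % num_buckets
    # Bucket files are named <six-digit bucket>_<attempt>, e.g. 000000_0.
    return f"{warehouse}/{database}/{table}/{part_dirs}/{bucket:06d}_{attempt}"


path = bucket_file_path("warehouse", "sales_db", "orders",
                        [("ds", "2013-03-20")], bucket_key=42, num_buckets=8)
print(path)
```

With 8 buckets, key 42 falls in bucket 2, so the row lands in the file 000002_0 under the ds=2013-03-20 partition directory.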
9.
Layout Guidelines
• Limit the number of partitions
  – 1,000 partitions is much faster than 10,000
  – Nested partitions are almost always wrong
• Gauge the number of buckets
  – Calculate file size and keep big (200-500MB)
  – Don’t forget number of files (Buckets * Parts)
• Layout related tables the same way
  – Partition
  – Bucket and sort order
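The bucket arithmetic behind these guidelines can be worked through in a few lines. The helper names, the 256 MB target (chosen from inside the 200-500 MB range above) and the data volumes are assumptions for illustration only.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def plan_buckets(partition_bytes, target_file_bytes=256 * MB):
    """Pick a bucket count so each bucket file is close to the target size."""
    return max(1, math.ceil(partition_bytes / target_file_bytes))


def layout_summary(total_bytes, num_partitions, num_buckets):
    """Return (number of files, average file size in bytes)."""
    files = num_partitions * num_buckets  # Buckets * Parts
    return files, total_bytes / files


# Hypothetical table: one partition per day for a year, 4 GB per day.
buckets = plan_buckets(partition_bytes=4 * GB)
files, avg_bytes = layout_summary(total_bytes=365 * 4 * GB,
                                  num_partitions=365, num_buckets=buckets)
print(buckets, files, avg_bytes / MB)
```

Here 4 GB per partition at 256 MB per file gives 16 buckets, and 365 partitions times 16 buckets gives 5,840 files, which is the number the second sub-bullet warns about.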
10.
Normalization
• Most databases suggest normalization
  – Keep information about each thing together
  – Customer, Sales, Returns, Inventory tables
• Has lots of good properties, but…
  – Is typically slow to query
• Often best to denormalize during load
  – Write once, read many times
  – Additionally provides snapshots in time.
11.
Data Format
How is your data stored?
12.
Choice of Format
• Serde
  – How is each record encoded?
• Input/Output (aka File) Format
  – How are the files stored?
• Primary Choices
  – Text
  – Sequence File
  – RCFile
  – ORC (Coming Soon!)
13.
Text Format
• Critical to pick a Serde
  – Default: ^A’s between fields
  – JSON: top level JSON record
  – CSV: commas between fields (on github)
• Slow to read and write
• Can’t split compressed files
  – Leads to huge maps
• Need to read/decompress all fields
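A minimal sketch of reading the default ^A-delimited text layout shows why every field has to be read even when only one is wanted. The sample rows and the helper name are hypothetical.

```python
FIELD_DELIM = "\x01"  # Ctrl-A, the default field separator named on the slide


def read_column(lines, index):
    """Return one column from ^A-delimited rows.

    Every line is scanned and split in full even though only one field
    is kept, which is the cost the last bullet describes.
    """
    return [line.rstrip("\n").split(FIELD_DELIM)[index] for line in lines]


# Hypothetical rows: id, name, amount.
rows = ["1\x01alice\x0142.5\n", "2\x01bob\x0117.0\n"]
names = read_column(rows, 1)
print(names)
```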
14.
Sequence File
• Traditional MapReduce binary file format
  – Stores keys and values as classes
  – Not a good fit for Hive, which has SQL types
  – Hive always stores entire row as value
• Splittable but only by searching file
  – Default block size is 1 MB
• Need to read and decompress all fields
15.
RC (Row Columnar) File
• Columns stored separately
  – Read and decompress only needed ones
  – Better compression
• Columns stored as binary blobs
  – Depends on metastore to supply types
• Larger blocks
  – 4 MB by default
  – Still search file for split boundary
16.
ORC (Optimized Row Columnar)
• Columns stored separately
• Knows types
  – Uses type-specific encoders
  – Stores statistics (min, max, sum, count)
• Has light-weight index
  – Skip over blocks of rows that don’t matter
• Larger blocks
  – 256 MB by default
  – Has an index for block boundaries
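How stored min/max statistics let a reader skip blocks of rows can be sketched as follows. This is a toy model with hypothetical names and a made-up group size; it does not reproduce the real ORC index format.

```python
def build_index(values, rows_per_group):
    """Split a column into row groups and record (min, max) for each."""
    groups = [values[i:i + rows_per_group]
              for i in range(0, len(values), rows_per_group)]
    stats = [(min(g), max(g)) for g in groups]
    return groups, stats


def scan_equal(groups, stats, target):
    """Return matching values and how many row groups were actually read."""
    matches, groups_read = [], 0
    for group, (lo, hi) in zip(groups, stats):
        if target < lo or target > hi:
            continue  # the statistics prove no row in this group can match
        groups_read += 1
        matches.extend(v for v in group if v == target)
    return matches, groups_read


column = list(range(1000))  # sorted data keeps the min/max ranges disjoint
groups, stats = build_index(column, rows_per_group=100)
matches, groups_read = scan_equal(groups, stats, target=250)
print(matches, groups_read, len(groups))
```

Only one of the ten row groups is read in this run. The benefit depends on the data being clustered on the filtered column, which ties back to the sort order choice in the layout section.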
17.
ORC - File Layout
18.
Example File Sizes from TPC-DS
19.
Compression
• Need to pick level of compression
  – None
  – LZO or Snappy: fast but sloppy
    – Best for temporary tables
  – ZLIB: slow and complete
    – Best for long term storage
20.
Joins
Putting the pieces together
21.
Default Assumption
• Hive assumes users are either:
  – Noobies
  – Hive developers
• Default behavior is to always finish
  – Little Engine that Could!
• Experts can override default behaviors
  – Get better performance, but riskier
• We’re working on improving heuristics
22.
Shuffle Join
• Default choice
  – Always works (I’ve sorted a petabyte!)
  – Worst case scenario
• Each process
  – Reads from part of one of the tables
  – Buckets and sorts on join key
  – Sends one bucket to each reduce
• Works every time!
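The steps listed for each process can be simulated in memory. This is a sketch under assumptions: rows are (key, value) tuples, Python's built-in hash stands in for the partitioner, and a dict grouping stands in for the sort a real reducer relies on.

```python
from collections import defaultdict


def shuffle_join(left, right, num_reducers):
    """Inner-join two lists of (key, value) rows by partitioning on the key."""
    # Map side: route every row of both tables to a reducer by key hash,
    # so rows with equal keys always meet in the same reducer.
    partitions = [{"L": [], "R": []} for _ in range(num_reducers)]
    for key, value in left:
        partitions[hash(key) % num_reducers]["L"].append((key, value))
    for key, value in right:
        partitions[hash(key) % num_reducers]["R"].append((key, value))

    # Reduce side: group the right rows by key, then match the left rows.
    joined = []
    for part in partitions:
        right_by_key = defaultdict(list)
        for key, value in part["R"]:
            right_by_key[key].append(value)
        for key, left_value in part["L"]:
            for right_value in right_by_key.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


# Hypothetical tables keyed by customer id.
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
customers = [(1, "Ann"), (2, "Bob"), (4, "Cy")]
result = sorted(shuffle_join(orders, customers, num_reducers=3))
print(result)
```

Both tables are moved in full across the shuffle, which is why this join always works but is also the worst case for cost.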
23.
Map Join
• One table is small (e.g. dimension table)
  – Fits in memory
• Each process
  – Reads small table into memory hash table
  – Streams through part of the big file
  – Joins each record against the hash table
• Very fast, but limited
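A map join is a hash join in which the small table is the build side. The sketch below shows one process handling one split of the big table; the table contents and function name are hypothetical.

```python
def map_join(small_table, big_table_split):
    """Hash-join one split of the big table against an in-memory small table."""
    # Build phase: load the whole small table into a hash table.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)
    # Probe phase: stream the big split once; no shuffle or sort is needed.
    for key, big_value in big_table_split:
        for small_value in lookup.get(key, ()):
            yield key, big_value, small_value


stores = [(10, "Berlin"), (20, "Paris")]               # small dimension table
sales = [(10, 5.0), (20, 7.5), (10, 1.25), (30, 9.0)]  # one split of the fact table
joined_rows = list(map_join(stores, sales))
print(joined_rows)
```

The limit mentioned on the slide shows up in the build phase: the whole small table has to fit in each process's memory.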
24.
Sort Merge Bucket (SMB) Join
• If both tables are:
  – Sorted the same
  – Bucketed the same
  – And joining on the sort/bucket column
• Each process:
  – Reads a bucket from each table
  – Processes the row with the lowest value
• Very efficient if applicable
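The per-bucket merge can be sketched as a classic merge join. The code assumes both tables are already split into the same number of buckets on the join key and that each bucket is sorted by that key; all names and data are illustrative.

```python
def merge_join_bucket(left, right):
    """Merge-join two buckets of (key, value) rows, both sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1  # advance the side holding the lowest key
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on the right side...
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # ...and pair it with every left row carrying the same key.
            while i < len(left) and left[i][0] == lkey:
                out.extend((lkey, left[i][1], right[k][1])
                           for k in range(j, j_end))
                i += 1
            j = j_end
    return out


def smb_join(left_buckets, right_buckets):
    """Join bucket n of one table with bucket n of the other; no shuffle."""
    out = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        out.extend(merge_join_bucket(left_bucket, right_bucket))
    return out


# Two buckets per table (key % 2), each sorted by key.
left_buckets = [[(2, "a"), (4, "b")], [(1, "c"), (1, "d"), (3, "e")]]
right_buckets = [[(2, "x"), (6, "y")], [(1, "z"), (3, "w"), (3, "v")]]
smb_result = smb_join(left_buckets, right_buckets)
print(smb_result)
```

Each bucket pair is read once, front to back, with no hash table and no shuffle.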
25.
Debugging
What could possibly go wrong?
26.
Performance Question
• Which of the following is faster?
  – select count(distinct(Col)) from Tbl
  – select count(*) from (select distinct(Col) from Tbl) t
27.
Count Distinct
28.
Answer
• Surprisingly the second is usually faster
  – In the first case:
    – Maps send each value to the reduce
    – Single reduce counts them all
  – In the second case:
    – Maps split up the values to many reduces
    – Each reduce generates its list
    – Final job counts the size of each list
  – Singleton reduces are almost always BAD
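The difference between the two plans can be simulated by counting how many rows the busiest reducer receives. This is a simplified model with hypothetical names: it ignores any map-side combining and uses Python's built-in hash as the partitioner.

```python
def count_distinct_single_reduce(values):
    """First plan: every value goes to one reducer, which dedups them all."""
    reducer_input = list(values)  # all rows funnel into a single task
    return len(set(reducer_input)), len(reducer_input)


def count_distinct_two_stage(values, num_reducers):
    """Second plan: dedup in parallel, then add up the per-reducer lists."""
    partitions = [set() for _ in range(num_reducers)]
    received = [0] * num_reducers
    for v in values:
        r = hash(v) % num_reducers  # equal values always share a reducer
        received[r] += 1
        partitions[r].add(v)
    # Final job: sum the size of each reducer's distinct list.
    return sum(len(p) for p in partitions), max(received)


values = [i % 1000 for i in range(100_000)]
single_total, single_load = count_distinct_single_reduce(values)
staged_total, staged_max_load = count_distinct_two_stage(values, num_reducers=10)
print(single_total, single_load, staged_total, staged_max_load)
```

Both plans report 1,000 distinct values, but the single reducer handles all 100,000 rows while the busiest of the ten parallel reducers handles 10,000.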
29.
Communication is Good!
• Hive doesn’t tell you what is wrong.
  – Expects you to know!
  – “Lucy, you have some ‘splaining to do!”
• Explain tool provides query plan
  – Filters on input
  – Numbers of jobs
  – Numbers of maps and reduces
  – What the jobs are sorting by
  – What directories are they reading or writing
30.
Blinded by Science
• The explanation tool is confusing.
  – It takes practice to understand.
  – It doesn’t include some critical details like partition pruning.
• Running the query makes things clearer!
  – Pay attention to the details
  – Look at JobConf and job history files
31.
Skew
• Skew is typical in real datasets.
• A user complained that his job was slow
  – He had 100 reduces
  – 98 of them finished fast
  – 2 ran really slow
• The key was a boolean…
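The boolean-key anecdote is easy to reproduce: a key with only two distinct values can reach at most two reducers, however many are configured. The sketch uses Python's built-in hash as a stand-in partitioner, and the row counts are made up.

```python
from collections import Counter


def reducer_loads(keys, num_reducers):
    """Count how many rows each reducer receives under hash partitioning."""
    loads = Counter(hash(k) % num_reducers for k in keys)
    return [loads.get(r, 0) for r in range(num_reducers)]


keys = [i % 3 == 0 for i in range(9000)]  # a boolean key: only True or False
loads = reducer_loads(keys, num_reducers=100)
busy = [n for n in loads if n > 0]
print(len(busy), loads.count(0), busy)
```

Two reducers share all 9,000 rows and the other 98 get nothing, which matches the 98 fast and 2 slow reduces the user saw.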
32.
Root Cause Analysis
• Ambari
  – Apache project building Hadoop installation and management tool
  – Provides metrics (Ganglia & Nagios)
  – Root Cause Analysis
    – Processes MapReduce job logs
    – Displays timing of each part of query plan
33.
Root Cause Analysis Screenshots
34.
Root Cause Analysis Screenshots
35.
Thank You!
Questions & Answers
@owen_omalley
36.
ORCFile - Comparison
                              RC File   Trevni   ORC File
Hive Type Model               N         N        Y
Separate complex columns      N         Y        Y
Splits found quickly          N         Y        Y
Default column group size     4MB       64MB*    250MB
Files per a bucket            1         >1       1
Store min, max, sum, count    N         N        Y
Versioned metadata            N         Y        Y
Run length data encoding      N         N        Y
Store strings in dictionary   N         N        Y
Store row count               N         Y        Y
Skip compressed blocks        N         N        Y
Store internal indexes        N         N        Y