Innovations In Apache Hadoop MapReduce,
Pig and Hive for improving query
performance

gopalv@apache.org
vinodkv@apache.org




© Hortonworks Inc. 2013
Operation Stinger




Performance at any cost




• Scalability
   – Already works great, just don’t break it for performance gains
• Isolation + Security
   – Each user's queries run as that user
• Fault tolerance
   – Keep all of MR’s safety nets to work around bad nodes in clusters
• UDFs
   – Make sure they are “User” defined and not “Admin” defined




First things first
• How far can we push Hive as it exists today?




Benchmark spec
• The TPC-DS benchmark data+query set
• Query 27 (big joins small)
  – For all items sold in stores located in specified states during a given
    year, find the average quantity, average list price, average list sales
    price, average coupon amount for a given gender, marital status,
    education and customer demographic.
• Query 82 (big joins big)
  – List all items and current prices sold through the store channel from
    certain manufacturers in a given price range and consistently had a
    quantity between 100 and 500 on hand in a 60-day period.




TL;DR
• TPC-DS Query 27, Scale=200, 10 EC2 nodes (40 disks)




TL;DR - II
• TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks)




Forget the actual benchmark
• First of all, YMMV
  – Software
  – Hardware
  – Setup
  – Tuning
• Text formats seem to be the staple of all comparisons
  – Really?
  – Everybody’s using it but only for benchmarks!




What did the trick?
• MapReduce?
• HDFS?
• Or is it just Hive?




Optional Advice




RCFile
• Binary RCFiles
• Hive pushes down column projections
• Less I/O, Less CPU
• Smaller files
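The effect of column projection pushdown can be sketched with a toy in-memory table. This is only an illustration of why a columnar layout touches less data, not RCFile's actual on-disk format; the column names and the "values touched" counter are assumptions made for the example.

```python
# Hypothetical toy table; column names are illustrative only.
rows = [
    {"item_id": 1, "price": 9.5, "comment": "x" * 100},
    {"item_id": 2, "price": 3.0, "comment": "y" * 100},
    {"item_id": 3, "price": 7.25, "comment": "z" * 100},
]

def to_columnar(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def scan_row_oriented(rows, wanted):
    """Row layout: every field of every row is touched, then projected."""
    touched = 0
    out = []
    for row in rows:
        touched += len(row)
        out.append(tuple(row[name] for name in wanted))
    return out, touched

def scan_columnar(columns, wanted):
    """Columnar layout: only the projected columns are touched."""
    touched = sum(len(columns[name]) for name in wanted)
    out = list(zip(*(columns[name] for name in wanted)))
    return out, touched

columns = to_columnar(rows)
row_result, row_touched = scan_row_oriented(rows, ["item_id", "price"])
col_result, col_touched = scan_columnar(columns, ["item_id", "price"])
print(row_touched, col_touched)
```

Both scans return the same tuples; the columnar scan simply never reads the wide `comment` column.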




Data organization
• No data system at scale is loaded once & left alone
• Partitions are essential
• Data flows into new partitions every day
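A rough sketch of why partitions matter: a query restricted to a date range only needs the files under matching partitions, and daily loads add a new partition without rewriting old ones. The date-keyed layout, file names and helper functions below are hypothetical, not Hive's implementation.

```python
# Hypothetical partition layout: one entry per day, as in a table
# partitioned by a date column. Names are illustrative only.
partitions = {
    "2013-03-01": ["part-00000", "part-00001"],
    "2013-03-02": ["part-00000"],
    "2013-03-03": ["part-00000", "part-00001", "part-00002"],
}

def files_to_scan(partitions, start_day, end_day):
    """Return only the files in partitions whose key falls in [start_day, end_day]."""
    selected = []
    for day in sorted(partitions):
        if start_day <= day <= end_day:  # ISO dates compare correctly as strings
            selected.extend(f"{day}/{name}" for name in partitions[day])
    return selected

def add_partition(partitions, day, files):
    """New data lands in a new partition; existing partitions are untouched."""
    partitions[day] = list(files)
```

For example, `files_to_scan(partitions, "2013-03-02", "2013-03-02")` lists a single file instead of all six.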




A closer look
• Now revisiting the benchmark and its results




Query 27 - Before




Before




Query 27 - After




After




Query 82 - Before




Query 82 - After




What changed?
• Job Count/Correct plan
• Correct data formats
• Correct data organization
• Correct configuration




What changed? (diagram; only its labels survive)
• Data Formats
• Data Organization
• Query Plan




Is that all?
• NO!
• In Hive
   – Metastore
   – RCFile issues
   – CPU intensive code
• In YARN+MR
   – Parallelism
   – Spin-up times
   – Data locality
• In HDFS
   – Bad disks/deteriorating nodes



In Hive



Hive Metastore
• 1+N Select problem
  – SELECT partitions FROM tables;
  – /* for each needed partition */ SELECT * FROM Partition ..
  – For query 27, this generates > 5000 queries! 4-5 seconds lost on each call!
  – Lazy loading or Include/Join are general solutions
• DataNucleus/ORM issues
  – 100K NPEs: try.. catch.. ignore..
• Metastore DB Schema revisit
  – Denormalize some/all of it?
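The 1+N select pattern and the join-based alternative can be sketched against an in-memory SQLite database. The two-table schema, the table names and the 100-partition count are assumptions for illustration; this is not the real metastore schema or the DataNucleus-generated SQL.

```python
import sqlite3

def build_db():
    """Create a toy two-table schema; names are hypothetical, not the metastore's."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE part (id INTEGER PRIMARY KEY, tbl_id INTEGER, location TEXT)"
    )
    conn.execute("INSERT INTO tbl VALUES (1, 'store_sales')")
    conn.executemany(
        "INSERT INTO part VALUES (?, 1, ?)",
        [(i, f"/warehouse/store_sales/day={i}") for i in range(1, 101)],
    )
    return conn

def load_one_plus_n(conn):
    """1 query for the partition ids, then 1 more query per partition."""
    queries = 1
    ids = [r[0] for r in conn.execute(
        "SELECT id FROM part WHERE tbl_id = 1 ORDER BY id")]
    locations = []
    for part_id in ids:
        queries += 1
        row = conn.execute(
            "SELECT location FROM part WHERE id = ?", (part_id,)).fetchone()
        locations.append(row[0])
    return locations, queries

def load_with_join(conn):
    """A single joined query returns the same data in one round trip."""
    rows = conn.execute(
        "SELECT p.location FROM tbl t JOIN part p ON p.tbl_id = t.id "
        "WHERE t.name = 'store_sales' ORDER BY p.id"
    ).fetchall()
    return [r[0] for r in rows], 1
```

With 100 partitions the first loader issues 101 statements and the second issues one; against a remote database each extra statement is also an extra network round trip.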




RCFile issues
• RCFiles do not split well
   – Row groups and row group boundaries
• Small row groups vs big row groups
   – Sync() vs min split
   – Storage packing
• Run-length information is lost
   – Unnecessary deserialization costs




ORC file format
• A single file as output of each task.
  – Dramatically simplifies integration with Hive
  – Lowers pressure on the NameNode
• Support for the Hive type model
  – Complex types (struct, list, map, union)
  – New types (datetime, decimal)
  – Encoding specific to the column type
• Split files without scanning for markers
• Bound the amount of memory required for
  reading or writing.


CPU intensive code
• Hive query engine processes one row at a time
   – Very inefficient in terms of CPU usage
• Lazy deserialization: layers
• Object inspector calls
• Lots of virtual method calls




Tighten your loops




Vectorization to the rescue
• Process a row batch at a time instead of a single row
• Row batch to consist of column vectors
   – The column vector will consist of array(s) of primitive types as far as
     possible
• Each operator will process the whole column vector at a
  time
• File formats to give out vectorized batches for processing
• Underlying research promises
   – Better instruction pipelines and cache usage
   – Mechanical sympathy
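The batch-of-column-vectors idea can be sketched as below. The `RowBatch` class, the selection list and the operator names are hypothetical and only show the data layout and control flow; Python itself will not exhibit the pipeline and cache benefits that a tight loop over primitive arrays gets on the JVM.

```python
from array import array

class RowBatch:
    """A batch of rows stored as one primitive array per column (hypothetical layout)."""
    def __init__(self, columns):
        self.columns = columns                      # name -> array of primitives
        self.size = len(next(iter(columns.values())))
        self.selected = list(range(self.size))      # indices of rows still alive

def filter_greater_than(batch, col, threshold):
    """Filter operator: one loop over a column vector, narrowing the selection."""
    values = batch.columns[col]
    batch.selected = [i for i in batch.selected if values[i] > threshold]

def multiply_columns(batch, left, right, out):
    """Arithmetic operator: produces a whole output column vector per call."""
    a, b = batch.columns[left], batch.columns[right]
    result = array("d", [0.0]) * batch.size
    for i in batch.selected:
        result[i] = a[i] * b[i]
    batch.columns[out] = result

def run_row_at_a_time(rows, threshold):
    """Baseline: the same filter and multiply applied to one row per iteration."""
    return [r["quantity"] * r["price"] for r in rows if r["quantity"] > threshold]
```

Each operator is called once per batch rather than once per row, so per-row dispatch overhead is spread across the whole batch.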




Vectorization: Prelim results
• Functionality
   – Some arithmetic operators and filters using primitive type columns
   – Have a basic integration benchmark to prove that the whole setup
     works
• Performance
   – Micro benchmark
   – More than 30x improvement in the CPU time
   – Disclaimer:
       – Micro benchmark!
       – Does not include I/O or deserialization costs, or complex and string datatypes




In YARN+MR



Data Locality
• CombineInputFormat
• AM interaction with locality
• Short-circuit reads!
• Delay scheduling
   – Good for throughput
   – Bad for latency
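The throughput-versus-latency trade-off of delay scheduling can be sketched with a small model: a task declines non-local node offers until it has skipped a fixed number of them. The function, its parameters and the node names are assumptions for illustration and do not reflect the YARN scheduler's actual code.

```python
def assign_with_delay(task_local_nodes, offers, max_skips):
    """Walk a sequence of node offers; accept a non-local node only after
    max_skips offers have been declined.

    Returns (node, is_local, offers_declined)."""
    skipped = 0
    for node in offers:
        if node in task_local_nodes:
            return node, True, skipped
        if skipped >= max_skips:
            return node, False, skipped
        skipped += 1
    return None, False, skipped
```

With `max_skips=2` the task waits through two offers and lands on its local node; with `max_skips=0` it starts immediately but reads its data remotely.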




Parallelism
• Can tune it (to some extent)
   – Controlling splits/reducer count
• Hive doesn’t know dynamic cluster status
   – Benchmarks max out clusters, real jobs may or may not
• Hive does not let you control parallelism
   – particularly in case of multiple jobs in a query




Spin up times
• AM startup costs
• Task startup costs
• Multiple waves of map tasks




Apache Tez
• Generic DAG workflow
• Container re-use
• AM pool service




AM Pool Service
• Pre-launches a pool of AMs
• Jobs submitted to these pre-launched AMs
  – Saves 3-5 seconds
• Pre-launched AMs can pre-allocate containers
• Tasks can be started as soon as the job is submitted
  – Saves 2-3 seconds




Container reuse
• Tez MapReduce AM supports Container reuse
• Launched JVMs are re-used between tasks
  – about 4-5 seconds saved in case of multiple waves
• Allows future enhancements
  – re-using task data structures across splits
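A back-of-the-envelope model shows why container reuse helps most when the map phase runs in multiple waves. The formula and every number in the usage note are assumptions for illustration (full waves, a fixed per-JVM startup cost), not measurements from the talk.

```python
import math

def map_phase_seconds(num_tasks, num_containers, task_seconds,
                      jvm_startup_seconds, reuse_containers):
    """Estimate map-phase wall time under a deliberately simple model:
    tasks run in full waves, and each launched JVM pays a fixed startup cost."""
    waves = math.ceil(num_tasks / num_containers)
    if reuse_containers:
        startup = jvm_startup_seconds            # paid once, in the first wave
    else:
        startup = jvm_startup_seconds * waves    # paid again in every wave
    return waves * task_seconds + startup
```

For instance, 100 tasks on 40 containers is three waves; with a hypothetical 10-second task and 2-second JVM startup the model gives 36 seconds without reuse and 32 seconds with it.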




In HDFS



Speculation/bad disks
• No cluster remains at 100% forever
• Bad disks cause latency issues
  – Speculation is one defense, but it is not enough
  – Fault tolerance is a safety net
• Possible solutions:
  – More feedback from HDFS about stale nodes, bad/slow disks
  – Volume scheduling




General guidelines
• Benchmarking
  – Be wary of benchmarks! Including ours!
  – Algebra with X




General guidelines contd.
• Benchmarks: To repeat, YMMV.
• Benchmark *your* use-case.
• Decide your problem size
   – if (smallData) {
         MySQL/Postgres/your smart phone
     } else {
         – Make it work
         – Make it scale
         – Make it faster
     }
• If it is (seems to be) slow, file a bug, spend a little time!
• Replacing systems without understanding them
   – Is an easy way to have an illusion of progress



Related talks
• “Optimizing Hive Queries” by Owen O’Malley
• “What’s New and What’s Next in Apache Hive” by Gunther
  Hagleitner




Credits
• Arun C Murthy
• Bikas Saha
• Gopal Vijayaraghavan
• Hitesh Shah
• Siddharth Seth
• Vinod Kumar Vavilapalli
• Alan Gates
• Ashutosh Chauhan
• Vikram Dixit
• Gunther Hagleitner
• Owen O’Malley
• Jitendranath Pandey
• Yahoo!, Facebook, Twitter, SAP and Microsoft all contributing.

Q&A
• Thanks!





Partners Life - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Innovations in Apache Hadoop MapReduce, Pig and Hive for improving query performance

  • 1. Innovations In Apache Hadoop MapReduce, Pig and Hive for improving query performance gopalv@apache.org vinodkv@apache.org
  • 3. Operation Stinger © Hortonworks Inc. 2013
  • 4. Performance at any cost
  • 5. • Scalability – Already works great, just don’t break it for performance gains • Isolation + Security – Queries between different users run as different users • Fault tolerance – Keep all of MR’s safety nets to work around bad nodes in clusters • UDFs – Make sure they are “User” defined and not “Admin” defined
  • 6. First things first • How far can we push Hive as it exists today?
  • 7. Benchmark spec • The TPC-DS benchmark data+query set • Query 27 (big joins small) – For all items sold in stores located in specified states during a given year, find the average quantity, average list price, average list sales price, average coupon amount for a given gender, marital status, education and customer demographic. • Query 82 (big joins big) – List all items and current prices sold through the store channel from certain manufacturers in a given price range and consistently had a quantity between 100 and 500 on hand in a 60-day period.
  • 8. TL;DR • TPC-DS Query 27, Scale=200, 10 EC2 nodes (40 disks)
  • 9. TL;DR - II • TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks)
  • 10. Forget the actual benchmark • First of all, YMMV – Software – Hardware – Setup – Tuning • Text formats seem to be the staple of all comparisons – Really? – Everybody’s using it but only for benchmarks!
  • 11. What did the trick? • MapReduce? • HDFS? • Or is it just Hive?
  • 12. Optional Advice
  • 13. RCFile • Binary RCFiles • Hive pushes down column projections • Less I/O, Less CPU • Smaller files
  • 14. Data organization • No data system at scale is loaded once & left alone • Partitions are essential • Data flows into new partitions every day
  • 15. A closer look • Now revisiting the benchmark and its results
  • 16. Query 27 - Before
  • 18. Query 27 - After
  • 20. Query 82 - Before
  • 21. Query 82 - After
  • 22. What changed? • Job Count/Correct plan • Correct data formats • Correct data organization • Correct configuration
  • 23. What changed? Data Formats, Data Organization, Query Plan
  • 25. Is that all? • NO! • In Hive – Metastore – RCFile issues – CPU intensive code • In YARN+MR – Parallelism – Spin-up times – Data locality • In HDFS – Bad disks/deteriorating nodes
  • 26. In Hive • NO! • In Hive – Metastore – RCFile issues – CPU intensive code • In YARN+MR – Parallelism – Spin-up times – Data locality • In HDFS – Bad disks/deteriorating nodes
  • 28. Hive Metastore • 1+N Select problem – SELECT partitions FROM tables; – /* for each needed partition */ SELECT * FROM Partition .. – For query 27, this generates > 5000 queries! 4-5 seconds lost on each call! – Lazy loading or Include/Join are general solutions • DataNucleus/ORM issues – 100K NPEs try.. Catch.. Ignore.. • Metastore DB Schema revisit – Denormalize some/all of it?
  • 29. In Hive • NO! • In Hive – Metastore – RCFile issues – CPU intensive code • In YARN+MR – Parallelism – Spin-up times – Data locality • In HDFS – Bad disks/deteriorating nodes
  • 30. RCFile issues • RCFiles do not split well – Row groups and row group boundaries • Small row groups vs big row groups – Sync() vs min split – Storage packing • Run-length information is lost – Unnecessary deserialization costs
  • 31. ORC file format • A single file as output of each task. – Dramatically simplifies integration with Hive – Lowers pressure on the NameNode • Support for the Hive type model – Complex types (struct, list, map, union) – New types (datetime, decimal) – Encoding specific to the column type • Split files without scanning for markers • Bound the amount of memory required for reading or writing.
  • 32. In Hive • NO! • In Hive – Metastore – RCFile issues – CPU intensive code • In YARN+MR – Parallelism – Spin-up times – Data locality • In HDFS – Bad disks/deteriorating nodes
  • 33. CPU intensive code
  • 34. CPU intensive code • Hive query engine processes one row at a time – Very inefficient in terms of CPU usage • Lazy deserialization: layers • Object inspector calls • Lots of virtual method calls
  • 35. Tighten your loops
  • 36. Vectorization to the rescue • Process a row batch at a time instead of a single row • Row batch to consist of column vectors – The column vector will consist of array(s) of primitive types as far as possible • Each operator will process the whole column vector at a time • File formats to give out vectorized batches for processing • Underlying research promises – Better instruction pipelines and cache usage – Mechanical sympathy
  • 37. Vectorization: Prelim results • Functionality – Some arithmetic operators and filters using primitive type columns – Have a basic integration benchmark to prove that the whole setup works • Performance – Micro benchmark – More than 30x improvement in the CPU time – Disclaimer: – Micro benchmark! – Still to include: I/O and deserialization costs, complex and string datatypes
  • 38. In YARN+MR • NO! • In Hive – Metastore – RCFile issues – CPU intensive code • In YARN+MR – Data locality – Parallelism – Spin-up times • In HDFS – Bad disks/deteriorating nodes
  • 40. Data Locality • CombineInputFormat • AM interaction with locality • Short-circuit reads! • Delay scheduling – Good for throughput – Bad for latency
  • 41. In YARN+MR • NO! • In Hive – Metastore – RCFile issues – CPU intensive code • In YARN+MR – Data locality – Parallelism – Spin-up times • In HDFS – Bad disks/deteriorating nodes
  • 42. Parallelism • Can tune it (to some extent) – Controlling splits/reducer count • Hive doesn’t know dynamic cluster status – Benchmarks max out clusters, real jobs may or may not • Hive does not let you control parallelism – particularly in case of multiple jobs in a query
  • 43. In YARN+MR • NO! • In Hive – Metastore – RCFile issues – CPU intensive code • In YARN+MR – Data locality – Parallelism – Spin-up times • In HDFS – Bad disks/deteriorating nodes
  • 44. Spin up times • AM startup costs • Task startup costs • Multiple waves of map tasks
  • 45. Apache Tez • Generic DAG workflow • Container re-use • AM pool service
  • 46. AM Pool Service • Pre-launches a pool of AMs • Jobs submitted to these pre-launched AMs – Saves 3-5 seconds • Pre-launched AMs can pre-allocate containers • Tasks can be started as soon as the job is submitted – Saves 2-3 seconds
  • 47. Container reuse • Tez MapReduce AM supports Container reuse • Launched JVMs are re-used between tasks – about 4-5 seconds saved in case of multiple waves • Allows future enhancements – re-using task data structures across splits
  • 48. In HDFS • NO! • In Hive – Metastore – RCFile issues – CPU intensive code • In YARN+MR – Data locality – Parallelism – Spin-up times • In HDFS – Bad disks/deteriorating nodes
  • 49. Speculation/bad disks • No cluster remains at 100% forever • Bad disks cause latency issues – Speculation is one defense, but it is not enough – Fault tolerance is a safety net • Possible solutions: – More feedback from HDFS about stale nodes, bad/slow disks – Volume scheduling
  • 50. General guidelines • Benchmarking – Be wary of benchmarks! Including ours! – Algebra with X
  • 51. General guidelines contd. • Benchmarks: To repeat, YMMV. • Benchmark *your* use-case. • Decide your problem size – If (smallData) { MySQL/Postgres/Your smart phone } else { – Make it work – Make it scale – Make it faster } • If it is (seems to be) slow, file a bug, spend a little time! • Replacing systems without understanding them – Is an easy way to have an illusion of progress
  • 52. Related talks • “Optimizing Hive Queries” by Owen O’Malley • “What’s New and What’s Next in Apache Hive” by Gunther Hagleitner
  • 53. Credits • Arun C Murthy • Bikas Saha • Gopal Vijayaraghavan • Hitesh Shah • Siddharth Seth • Vinod Kumar Vavilapalli • Alan Gates • Ashutosh Chauhan • Vikram Dixit • Gunther Hagleitner • Owen O’Malley • Jitendranath Pandey • Yahoo!, Facebook, Twitter, SAP and Microsoft all contributing.
  • 54. Q&A • Thanks!
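
The 1+N select problem from slide 28 can be illustrated with a small sketch. It does not use the real Hive metastore or its schema: the `partitions` and `storage` tables, the column names, and the helper functions below are all hypothetical, and an in-memory SQLite database stands in for the metastore DB. The point is only to show how the query count grows with the number of partitions, and how a join (one of the general solutions the slide mentions) collapses it to a single query.

```python
import sqlite3


def build_db(num_partitions):
    """Create an in-memory stand-in for a metastore (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE partitions (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE storage (partition_id INTEGER, location TEXT)")
    for i in range(num_partitions):
        name = f"ds=day{i:04d}"
        conn.execute("INSERT INTO partitions VALUES (?, ?)", (i, name))
        conn.execute("INSERT INTO storage VALUES (?, ?)", (i, f"/warehouse/t/{name}"))
    return conn


def fetch_one_plus_n(conn):
    """1+N pattern: one query for the partition list, then one query per partition."""
    queries = 1
    rows = conn.execute("SELECT id, name FROM partitions ORDER BY id").fetchall()
    result = []
    for part_id, name in rows:
        location = conn.execute(
            "SELECT location FROM storage WHERE partition_id = ?", (part_id,)
        ).fetchone()[0]
        queries += 1
        result.append((name, location))
    return result, queries


def fetch_with_join(conn):
    """Join pattern: the same information fetched in a single round trip."""
    result = conn.execute(
        "SELECT p.name, s.location FROM partitions p "
        "JOIN storage s ON s.partition_id = p.id ORDER BY p.id"
    ).fetchall()
    return result, 1


if __name__ == "__main__":
    conn = build_db(5000)
    slow_rows, slow_queries = fetch_one_plus_n(conn)
    fast_rows, fast_queries = fetch_with_join(conn)
    print("1+N queries:", slow_queries)
    print("join queries:", fast_queries)
    print("same rows:", slow_rows == fast_rows)
```

Against a local in-memory database the per-query cost is negligible; the slide's concern is that each call to a remote metastore carries real latency, so the number of round trips dominates.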

Editor's Notes

  1. Since the time we started this, we’ve seen multiple people benchmark Hive, comparing its text format processors against alternatives
  2. Not MapReduce, not HDFS, just plain Hive
  3. Layers of inspectors that identify column type, de-serialize data and determine appropriate expression routines in the inner loop
  4. I wrote all of the code and Jitendra was just consulting :P
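
Note 3 and slides 34 to 36 contrast row-at-a-time processing with batches of column vectors. The sketch below shows only the structural difference, not Hive's implementation: the column names, the batch size of 1024, and every function here are assumptions made for illustration. Python will not reproduce the CPU gains the slides report; those depend on tight loops over primitive arrays in the JVM.

```python
from array import array

BATCH_SIZE = 1024  # assumed batch size, chosen for illustration only


def row_at_a_time(rows):
    """Row-oriented evaluation: every row pays for field lookups inside the loop."""
    total = 0.0
    for row in rows:
        quantity = row["quantity"]
        price = row["price"]
        if quantity > 10:
            total += quantity * price
    return total


def to_batches(rows, batch_size=BATCH_SIZE):
    """Turn rows into batches of column vectors backed by primitive arrays."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {
            "quantity": array("q", (r["quantity"] for r in chunk)),
            "price": array("d", (r["price"] for r in chunk)),
        }


def filter_gt(column, threshold):
    """Filter operator: one loop over a single column, returning selected indexes."""
    return [i for i, value in enumerate(column) if value > threshold]


def multiply_sum(left, right, selected):
    """Arithmetic operator: multiply two columns at the selected indexes and sum."""
    return sum(left[i] * right[i] for i in selected)


def vectorized(rows):
    """Batch-oriented evaluation: each operator handles a whole column vector."""
    total = 0.0
    for batch in to_batches(rows):
        selected = filter_gt(batch["quantity"], 10)
        total += multiply_sum(batch["quantity"], batch["price"], selected)
    return total


if __name__ == "__main__":
    rows = [{"quantity": i % 20, "price": i * 0.5} for i in range(3000)]
    print(row_at_a_time(rows), vectorized(rows))
```

In the batch version the type of each column is decided once per batch when the arrays are built, so the inner loops touch only primitive values; that is the "tighten your loops" idea from slide 35 in miniature.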