Innovations in Apache Hadoop MapReduce, Pig and Hive for improving query performance

gopalv@apache.org
vinodkv@apache.org




© Hortonworks Inc. 2013
Operation Stinger




Performance at any cost
• Scalability
   – Already works great, just don’t break it for performance gains
• Isolation + Security
   – Queries from different users run under their own user identities
• Fault tolerance
   – Keep all of MR’s safety nets to work around bad nodes in clusters
• UDFs
   – Make sure they are “User” defined and not “Admin” defined




First things first
• How far can we push Hive as it exists today?




Benchmark spec
• The TPC-DS benchmark data+query set
• Query 27 (big joins small)
  – For all items sold in stores located in specified states during a given year,
    find the average quantity, average list price, average list sales price,
    average coupon amount for a given gender, marital status, education
    and customer demographic.
• Query 82 (big joins big)
  – List all items and current prices sold through the store channel from
    certain manufacturers in a given price range and consistently had a
    quantity between 100 and 500 on hand in a 60-day period.




TL;DR - II
• TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks)

  – Chart: Query 82 run time by storage layout (units not labeled on the slide)
     Text                                  3257.692
     RCFile                                2862.669
     Partitioned RCFile                     255.641
     Partitioned RCFile + Optimizations      71.114
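The gap between the four Query 82 bars is easier to judge as ratios. A minimal sketch of that arithmetic, assuming the four plotted values map to the legend entries in the order shown (the slide does not label units, and the helper name `speedups` is made up for illustration):

```python
# Run times read off the Query 82 chart; units are not labeled on the slide.
timings = {
    "Text": 3257.692,
    "RCFile": 2862.669,
    "Partitioned RCFile": 255.641,
    "Partitioned RCFile + Optimizations": 71.114,
}

def speedups(times):
    """Return (name, speedup over previous layout, speedup over the first layout)."""
    names = list(times)
    baseline = times[names[0]]
    rows = []
    for prev, cur in zip(names, names[1:]):
        rows.append((cur, times[prev] / times[cur], baseline / times[cur]))
    return rows

for name, step, total in speedups(timings):
    print(f"{name}: {step:.1f}x over previous, {total:.1f}x over Text")
```

On these numbers, most of the gain comes from partitioning rather than from the switch to RCFile alone.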



Forget the actual benchmark
• First of all, YMMV
  – Software
  – Hardware
  – Setup
  – Tuning
• Text formats seem to be the staple of all comparisons
  – Really?
  – Everybody’s using it but only for benchmarks!




What did the trick?
• MapReduce?
• HDFS?
• Or is it just Hive?




Optional Advice




RCFile
• Binary RCFiles
• Hive pushes down column projections
• Less I/O, Less CPU
• Smaller files
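Why column projection means less I/O can be shown with a toy table. This is only a sketch of the access pattern, not RCFile's on-disk format; the table contents and function names are invented for illustration:

```python
# Toy table; this is not RCFile's on-disk format, only the access pattern.
COLUMNS = ("item", "state", "quantity", "price")
ROWS = [
    ("item-1", "CA", 10, 4.50),
    ("item-2", "NY", 250, 12.00),
    ("item-3", "CA", 75, 7.25),
]

def field_size(value):
    """Size of one field when stored as UTF-8 text."""
    return len(str(value).encode("utf-8"))

def bytes_touched_row_store(rows):
    # Row-oriented layout: every field of every row is read, wanted or not.
    return sum(field_size(v) for row in rows for v in row)

def bytes_touched_column_store(rows, columns, wanted):
    # Column-oriented layout: only the projected columns are read.
    wanted_idx = [columns.index(name) for name in wanted]
    return sum(field_size(row[i]) for row in rows for i in wanted_idx)

full = bytes_touched_row_store(ROWS)
projected = bytes_touched_column_store(ROWS, COLUMNS, ("state", "quantity"))
print(f"row store: {full} bytes, projected columns: {projected} bytes")
```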




Data organization
• No data system at scale is loaded once & left alone
• Partitions are essential
• Data flows into new partitions every day
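Partitions pay off because a query that filters on the partition key only has to touch the matching partitions. A minimal sketch, assuming daily partitions named in Hive's `key=value` directory style; the partition names, sizes and the `prune` helper are hypothetical:

```python
# Hypothetical daily partitions of a table; values are made-up sizes in MB.
PARTITIONS = {
    "ds=2013-01-01": 120,
    "ds=2013-01-02": 130,
    "ds=2013-01-03": 125,
    "ds=2013-01-04": 140,
}

def prune(partitions, first_day, last_day):
    """Keep only partitions whose ds value falls inside [first_day, last_day]."""
    selected = {}
    for name, size_mb in partitions.items():
        day = name.split("=", 1)[1]
        if first_day <= day <= last_day:  # ISO dates compare correctly as strings
            selected[name] = size_mb
    return selected

hit = prune(PARTITIONS, "2013-01-03", "2013-01-04")
print(sorted(hit), sum(hit.values()), "of", sum(PARTITIONS.values()), "MB scanned")
```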




A closer look
• Now revisiting the benchmark and its results




Query 27 - Before
• Chart: time per stage (units not labeled on the slide; axis 0 to 1600)
   Stage-3    16
   Stage-2    17
   Stage-1    49
   Stage-6   355
   Stage-5   512
   Stage-4   553




Query 27 - After
• Chart: time per stage (units not labeled on the slide; axis 0 to 40)
   Stage-9    33
   Stage-10    5




Query 82 - Before
• Chart: time per stage (legend: Start, Time; units not labeled on the slide; axis 0 to 3500)
   Stage-3     17
   Stage-2     17
   Stage-1   2199
   Stage-4   1025




Query 82 - After
• Chart: time per stage (units not labeled on the slide; axis 0 to 80)
   Stage-1    71




What changed?
• Job Count/Correct plan
• Correct data formats
• Correct data organization
• Correct configuration




Is that all?
• NO!
• In Hive
   – Metastore
   – RCFile issues
   – CPU intensive code
• In YARN+MR
   – Parallelism
   – Spin-up times
   – Data locality
• In HDFS
   – Bad disks/deteriorating nodes




Hive Metastore
• 1+N Select problem
  – SELECT partitions FROM tables;
  – /* for each needed partition */ SELECT * FROM Partition ..
  – For Query 27, this generates > 5000 queries! 4-5 seconds lost on each call!
  – Lazy loading or Include/Join are general solutions
• Datanucleus/ORM issues
  – 100K NPEs: try.. catch.. ignore..
• Metastore DB Schema revisit
  – Denormalize some/all of it?
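The 1+N select pattern and the join-based alternative can be sketched against an in-memory SQLite database. The two-table schema below is a made-up stand-in, not the real metastore schema, and the function names are hypothetical; the point is only the number of round trips:

```python
import sqlite3

def make_db(n_partitions):
    """Build a toy catalog: one row per partition plus one location row each."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE partitions (id INTEGER PRIMARY KEY, tbl TEXT, name TEXT)")
    db.execute("CREATE TABLE locations (part_id INTEGER, path TEXT)")
    for i in range(n_partitions):
        db.execute("INSERT INTO partitions VALUES (?, ?, ?)",
                   (i, "store_sales", f"ds={i}"))
        db.execute("INSERT INTO locations VALUES (?, ?)",
                   (i, f"/warehouse/store_sales/ds={i}"))
    return db

def fetch_one_plus_n(db, tbl):
    """One query for the partition list, then one more query per partition."""
    ids = [r[0] for r in db.execute(
        "SELECT id FROM partitions WHERE tbl = ? ORDER BY id", (tbl,))]
    queries = 1
    paths = []
    for pid in ids:
        row = db.execute(
            "SELECT path FROM locations WHERE part_id = ?", (pid,)).fetchone()
        paths.append(row[0])
        queries += 1
    return paths, queries

def fetch_joined(db, tbl):
    """Same result with a single joined query."""
    rows = db.execute(
        "SELECT l.path FROM partitions p JOIN locations l ON l.part_id = p.id "
        "WHERE p.tbl = ? ORDER BY p.id", (tbl,)).fetchall()
    return [r[0] for r in rows], 1

db = make_db(5)
slow_paths, slow_queries = fetch_one_plus_n(db, "store_sales")
fast_paths, fast_queries = fetch_joined(db, "store_sales")
print(slow_queries, "queries vs", fast_queries)
```

With thousands of partitions, the per-partition queries dominate, which is the effect the slide describes for Query 27.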




RCFile issues
• RCFiles do not split well
   – Row groups and row group boundaries
• Small row groups vs big row groups
   – Sync() vs min split
   – Storage packing
• Run-length information is lost
   – Unnecessary deserialization costs




ORC file format
• A single file as output of each task.
  – Dramatically simplifies integration with Hive
  – Lowers pressure on the NameNode
• Support for the Hive type model
  – Complex types (struct, list, map, union)
  – New types (datetime, decimal)
  – Encoding specific to the column type
• Split files without scanning for markers
• Bound the amount of memory required for
  reading or writing.


CPU intensive code
• Hive query engine processes one row at a time
   – Very inefficient in terms of CPU usage
• Lazy deserialization: layers
• Object inspector calls
• Lots of virtual method calls




Tighten your loops




Vectorization to the rescue
• Process a row batch at a time instead of a single row
• Row batch to consist of column vectors
   – The column vector will consist of array(s) of primitive types as far as
     possible
• Each operator will process the whole column vector at a
  time
• File formats to give out vectorized batches for processing
• Underlying research promises
   – Better instruction pipelines and cache usage
   – Mechanical sympathy
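The shape of a vectorized operator pipeline can be sketched as follows. This is an illustration of the data layout (row batches made of primitive column vectors) and of operators that loop over a whole column, not Hive's implementation; the batch size, column names and function names are assumptions, and plain Python will not reproduce the CPU gains:

```python
from array import array

BATCH_SIZE = 1024  # hypothetical batch size

def row_at_a_time(rows):
    """Baseline: filter and aggregate one (quantity, price) row at a time."""
    total = 0.0
    for quantity, price in rows:
        if quantity > 100:
            total += price
    return total

def make_batches(rows, batch_size=BATCH_SIZE):
    """Yield row batches as a pair of primitive column vectors."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield (array("q", (r[0] for r in chunk)),   # quantity column, 64-bit ints
               array("d", (r[1] for r in chunk)))   # price column, doubles

def filter_gt(quantity_col, threshold):
    """Filter operator: one tight loop over a column, returns selected indexes."""
    return [i for i, q in enumerate(quantity_col) if q > threshold]

def sum_selected(price_col, selected):
    """Aggregate operator: sums only the positions the filter kept."""
    return sum(price_col[i] for i in selected)

def vectorized(rows):
    total = 0.0
    for quantity_col, price_col in make_batches(rows):
        selected = filter_gt(quantity_col, 100)
        total += sum_selected(price_col, selected)
    return total

rows = [(i, float(i)) for i in range(3000)]
print(row_at_a_time(rows), vectorized(rows))
```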




Vectorization: Prelim results
• Functionality
   – Some arithmetic operators and filters using primitive type columns
   – Have a basic integration benchmark to prove that the whole setup
     works
• Performance
   – Micro benchmark
   – More than 30x improvement in the CPU time
   – Disclaimer:
       – Micro benchmark!
       – Does not yet include I/O or deserialization costs, or complex and string datatypes




Data Locality
• CombineInputFormat
• AM interaction with locality
• Short-circuit reads!
• Delay scheduling
   – Good for throughput
   – Bad for latency




Parallelism
• Can tune it (to some extent)
   – Controlling splits/reducer count
• Hive doesn’t know dynamic cluster status
   – Benchmarks max out clusters, real jobs may or may not
• Hive does not let you control parallelism
   – particularly in case of multiple jobs in a query
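The reducer-count knob can be approximated in a few lines. Hive exposes `mapred.reduce.tasks`, `hive.exec.reducers.bytes.per.reducer` and `hive.exec.reducers.max`; the function below is a simplified sketch of how those settings interact, not Hive's actual code, and the example numbers are arbitrary:

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers, fixed=None):
    """Approximate reducer count for one job.

    fixed             stands in for mapred.reduce.tasks (a constant count)
    bytes_per_reducer stands in for hive.exec.reducers.bytes.per.reducer
    max_reducers      stands in for hive.exec.reducers.max
    """
    if fixed is not None and fixed > 0:
        return fixed
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))

# 10 GB of input at 1 GB per reducer, capped at 999 reducers.
print(estimate_reducers(10 * 10**9, 10**9, 999))
```

Note that the estimate depends only on input size and static settings, which matches the slide's point that cluster load at run time is not taken into account.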




Spin up times
• AM startup costs
• Task startup costs
• Multiple waves of map tasks




Apache Tez
• Generic DAG workflow
• Container re-use
• AM pool service




AM Pool Service
• Pre-launches a pool of AMs
• Jobs submitted to these pre-launched AMs
  – Saves 3-5 seconds
• Pre-launched AMs can pre-allocate containers
• Tasks can be started as soon as the job is submitted
  – Saves 2-3 seconds




Container reuse
• Tez MapReduce AM supports Container reuse
• Launched JVMs are re-used between tasks
  – about 4-5 seconds saved in case of multiple waves
• Allows future enhancements
  – re-using task data structures across splits
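The per-job savings quoted for the AM pool (3-5 s and 2-3 s) and for container reuse (4-5 s with multiple waves) add up quickly for multi-job queries. A rough worked estimate, assuming the savings are additive, apply once per job, and that the reuse saving applies once to any job with more than one wave (none of which the slides state explicitly); the function name is made up:

```python
# Savings in seconds as (low, high), taken from the slides.
AM_PRELAUNCH = (3, 5)        # job submitted to a pre-launched AM
PREALLOCATED = (2, 3)        # tasks start on pre-allocated containers
CONTAINER_REUSE = (4, 5)     # JVM reuse, only when a job runs multiple waves

def estimated_savings(waves_per_job):
    """Return (low, high) seconds saved for a query made of several jobs.

    waves_per_job is a list with the number of map-task waves in each job.
    Assumption: savings are additive and counted once per job.
    """
    low = high = 0
    for waves in waves_per_job:
        low += AM_PRELAUNCH[0] + PREALLOCATED[0]
        high += AM_PRELAUNCH[1] + PREALLOCATED[1]
        if waves > 1:
            low += CONTAINER_REUSE[0]
            high += CONTAINER_REUSE[1]
    return low, high

# A hypothetical two-job query: the first job needs three waves, the second one.
print(estimated_savings([3, 1]))
```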




Speculation/bad disks
• No cluster remains at 100% forever
• Bad disks cause latency issues
  – Speculation is one defense, but it is not enough
  – Fault tolerance is a safety net
• Possible solutions:
  – More feedback from HDFS about stale nodes, bad/slow disks
  – Volume scheduling




General guidelines
• Benchmarking
  – Be wary of benchmarks! Including ours!
  – Algebra with X




General guidelines contd.
• Benchmarks: To repeat, YMMV.
• Benchmark *your* use-case.
• Decide your problem size
   – If (smallData) {
         Mysql/Postgres/Your smart phone
    } else {
         – Make it work
         – Make it scale
         – Make it faster
     }
• If it is (seems to be) slow, file a bug, spend a little time!
• Replacing systems without understanding them
   – Is an easy way to have an illusion of progress



Related talks
• “Optimizing Hive Queries” by Owen O’Malley
• “What’s New and What’s Next in Apache Hive” by Gunther
  Hagleitner




Credits
• Arun C Murthy
• Bikas Saha
• Gopal Vijayaraghavan
• Hitesh Shah
• Siddharth Seth
• Vinod Kumar Vavilapalli
• Alan Gates
• Ashutosh Chauhan
• Vikram Dixit
• Gunther Hagleitner
• Owen O’Malley
• Jitendranath Pandey
• Yahoo!, Facebook, Twitter, SAP and Microsoft all contributing.

Q&A
• Thanks!





Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Innovations in Apache Hadoop MapReduce, Pig and Hive for Improving Query Performance

  • 1. Innovations In Apache Hadoop MapReduce, Pig and Hive for improving query performance gopalv@apache.org vinodkv@apache.org Page 1
  • 3. Operation Stinger © Hortonworks Inc. 2013 Page 3
  • 4. Performance at any cost © Hortonworks Inc. 2013
  • 5. Performance at any cost • Scalability – Already works great, just don’t break it for performance gains • Isolation + Security – Queries between different users run as different users • Fault tolerance – Keep all of MR’s safety nets to work around bad nodes in clusters • UDFs – Make sure they are “User” defined and not “Admin” defined © Hortonworks Inc. 2013
  • 6. First things first • How far can we push Hive as it exists today? © Hortonworks Inc. 2013
  • 7. Benchmark spec • The TPC-DS benchmark data+query set • Query 27 (big joins small) – For all items sold in stores located in specified states during a given year, find the average quantity, average list price, average list sales price, average coupon amount for a given gender, marital status, education and customer demographic. • Query 82 (big joins big) – List all items and current prices sold through the store channel from certain manufacturers in a given price range and consistently had a quantity between 100 and 500 on hand in a 60-day period. © Hortonworks Inc. 2013
  • 9. TL;DR - II • TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks) • [Bar chart for Query 82, bar labels in legend order: Text 3257.692; RCFile 2862.669; Partitioned RCFile 255.641; Partitioned RCFile + Optimizations 71.114] © Hortonworks Inc. 2013
  • 10. Forget the actual benchmark • First of all, YMMV – Software – Hardware – Setup – Tuning • Text formats seem to be the staple of all comparisons – Really? – Everybody’s using it but only for benchmarks! © Hortonworks Inc. 2013
  • 11. What did the trick? • Mapreduce? • HDFS? • Or is it just Hive? © Hortonworks Inc. 2013
  • 12. Optional Advice © Hortonworks Inc. 2013
  • 13. RCFile • Binary RCFiles • Hive pushes down column projections • Less I/O, Less CPU • Smaller files © Hortonworks Inc. 2013
  • 14. Data organization • No data system at scale is loaded once & left alone • Partitions are essential • Data flows into new partitions every day © Hortonworks Inc. 2013
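A minimal sketch of why partitioning pays off, assuming a hypothetical table stored as one directory per day. The `ds=YYYY-MM-DD` naming and the file lists are illustrative and not taken from the talk. A query that filters on the partition key only has to list and read the matching directories.

```python
# Hypothetical layout: one directory per daily partition, mapped to its data files.
partitions = {
    "ds=2013-06-24": ["part-00000", "part-00001"],
    "ds=2013-06-25": ["part-00000"],
    "ds=2013-06-26": ["part-00000", "part-00001", "part-00002"],
}


def prune(partitions, wanted_days):
    """Return only the files of partitions whose key matches the filter.

    Files in every other partition are never opened, which is where the
    I/O saving comes from.
    """
    return [
        f"{part}/{name}"
        for part, files in sorted(partitions.items())
        if part.split("=", 1)[1] in wanted_days
        for name in files
    ]


files_to_scan = prune(partitions, {"2013-06-26"})
```

New data arriving each day lands in a new directory, so older partitions stay untouched and a query over recent data scans only a small part of the table.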
  • 15. A closer look • Now revisiting the benchmark and its results © Hortonworks Inc. 2013
  • 16. Query 27 - Before [Stage timeline chart, values as labeled: Stage-3 16, Stage-2 17, Stage-1 49, Stage-6 355, Stage-5 512, Stage-4 553] © Hortonworks Inc. 2013
  • 18. Query 27 - After [Stage timeline chart, values as labeled: Stage-9 33, Stage-10 5] © Hortonworks Inc. 2013
  • 20. Query 82 - Before [Stage timeline chart, values as labeled: Stage-3 17, Stage-2 17, Stage-1 2199, Stage-4 1025] © Hortonworks Inc. 2013
  • 21. Query 82 - After [Stage timeline chart, value as labeled: Stage-1 71] © Hortonworks Inc. 2013
  • 22. What changed? • Job Count/Correct plan • Correct data formats • Correct data organization • Correct configuration © Hortonworks Inc. 2013
  • 25. Is that all? • NO! • In Hive – Metastore – RCFile issues – CPU intensive code • In YARN+MR – Parallelism – Spin-up times – Data locality • In HDFS – Bad disks/deteriorating nodes © Hortonworks Inc. 2013
  • 26. In Hive • NO! • In Hive – Metastore – RCFile issues – CPU intensive code • In YARN+MR – Parallelism – Spin-up times – Data locality • In HDFS – Bad disks/deteriorating nodes © Hortonworks Inc. 2013
  • 28. Hive Metastore • 1+N Select problem – SELECT partitions FROM tables; – /* for each needed partition */ SELECT * FROM Partition .. – For query 27, this generates > 5000 queries! 4-5 seconds lost on each call! – Lazy loading or Include/Join are general solutions • Datanucleus/ORM issues – 100K NPEs: try.. catch.. ignore.. • Metastore DB schema revisit – Denormalize some/all of it? © Hortonworks Inc. 2013
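The 1+N pattern described on this slide can be sketched with an in-memory SQLite database. The `tbls` and `parts` tables below are a hypothetical, heavily simplified schema and not the real metastore schema. The sketch only shows how the query count grows with the number of partitions in one approach and stays at one in the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbls (tbl_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE parts (
        part_id INTEGER PRIMARY KEY, tbl_id INTEGER,
        part_name TEXT, location TEXT
    );
""")
conn.execute("INSERT INTO tbls VALUES (1, 'store_sales')")
conn.executemany(
    "INSERT INTO parts VALUES (?, 1, ?, ?)",
    [(i, f"ds={i}", f"/warehouse/store_sales/ds={i}") for i in range(1, 6)],
)


def fetch_one_plus_n(conn, table):
    """One query for the partition ids, then one more query per partition."""
    ids = [row[0] for row in conn.execute(
        "SELECT p.part_id FROM parts p JOIN tbls t ON p.tbl_id = t.tbl_id "
        "WHERE t.name = ? ORDER BY p.part_id", (table,))]
    queries = 1
    rows = []
    for part_id in ids:
        rows.append(conn.execute(
            "SELECT part_name, location FROM parts WHERE part_id = ?",
            (part_id,)).fetchone())
        queries += 1
    return rows, queries


def fetch_joined(conn, table):
    """A single join returns the same rows in one round trip."""
    rows = conn.execute(
        "SELECT p.part_name, p.location FROM parts p "
        "JOIN tbls t ON p.tbl_id = t.tbl_id "
        "WHERE t.name = ? ORDER BY p.part_id", (table,)).fetchall()
    return rows, 1


rows_slow, queries_slow = fetch_one_plus_n(conn, "store_sales")
rows_fast, queries_fast = fetch_joined(conn, "store_sales")
```

With 5 partitions the first approach issues 6 queries and the second issues 1. With thousands of partitions, the per-query round trip dominates, which matches the more than 5000 queries the slide reports for query 27.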
  • 30. RCFile issues • RCFiles do not split well – Row groups and row group boundaries • Small row groups vs big row groups – Sync() vs min split – Storage packing • Run-length information is lost – Unnecessary deserialization costs © Hortonworks Inc. 2013
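One reading of the "run-length information is lost" bullet: if a column is stored as runs of repeated values, a reader that expands every run back into individual values before processing does work that could be skipped. The sketch below is a generic run-length encoding example under that assumption. It is not RCFile's actual on-disk encoding.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def sum_over_runs(runs):
    """Aggregate directly on the runs, without expanding them into rows."""
    return sum(value * length for value, length in runs)


column = [7, 7, 7, 7, 3, 3, 9, 9, 9]
runs = rle_encode(column)      # three runs instead of nine values
total = sum_over_runs(runs)    # same answer as sum(column)
```

An operator that sees the runs touches three entries here instead of nine. Once the reader has materialized one object per row, that shortcut is no longer available.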
  • 31. ORC file format • A single file as output of each task. – Dramatically simplifies integration with Hive – Lowers pressure on the NameNode • Support for the Hive type model – Complex types (struct, list, map, union) – New types (datetime, decimal) – Encoding specific to the column type • Split files without scanning for markers • Bound the amount of memory required for reading or writing. © Hortonworks Inc. 2013
  • 33. CPU intensive code © Hortonworks Inc. 2013
  • 34. CPU intensive code • Hive query engine processes one row at a time – Very inefficient in terms of CPU usage • Lazy deserialization: layers • Object inspector calls • Lots of virtual method calls © Hortonworks Inc. 2013
  • 35. Tighten your loops © Hortonworks Inc. 2013
  • 36. Vectorization to the rescue • Process a row batch at a time instead of a single row • Row batch to consist of column vectors – The column vector will consist of array(s) of primitive types as far as possible • Each operator will process the whole column vector at a time • File formats to give out vectorized batches for processing • Underlying research promises – Better instruction pipelines and cache usage – Mechanical sympathy © Hortonworks Inc. 2013
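The row-batch layout described above can be sketched as follows. The class and function names are hypothetical and are not Hive's actual classes. Pure Python will not show the CPU gains the slide refers to, so the sketch only illustrates the data layout: one primitive array per column, and operators that loop over a whole column at a time.

```python
from array import array


class RowBatch:
    """A batch of rows stored as one primitive array per column."""

    def __init__(self, **columns):
        self.columns = columns
        self.size = len(next(iter(columns.values())))
        # Indices of the rows that are still live after filtering.
        self.selected = list(range(self.size))


def filter_greater_than(batch, column, threshold):
    """Filter operator: one tight loop over a single column vector."""
    col = batch.columns[column]
    batch.selected = [i for i in batch.selected if col[i] > threshold]


def add_columns(batch, left, right, out):
    """Arithmetic operator: writes a new column vector for the live rows."""
    a, b = batch.columns[left], batch.columns[right]
    result = array("d", [0.0]) * batch.size
    for i in batch.selected:
        result[i] = a[i] + b[i]
    batch.columns[out] = result


batch = RowBatch(
    quantity=array("l", [5, 120, 300, 80]),
    price=array("d", [1.0, 2.5, 4.0, 3.0]),
    tax=array("d", [0.125, 0.25, 0.5, 0.375]),
)
filter_greater_than(batch, "quantity", 100)
add_columns(batch, "price", "tax", "total")
totals = [batch.columns["total"][i] for i in batch.selected]
```

Each operator is called once per batch rather than once per row, so per-row costs such as virtual method calls and object inspection are paid once per batch instead.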
  • 37. Vectorization: Prelim results • Functionality – Some arithmetic operators and filters using primitive type columns – Have a basic integration benchmark to prove that the whole setup works • Performance – Micro benchmark – More than 30x improvement in the CPU time – Disclaimer: – Micro benchmark! – Does not include I/O or deserialization costs, or complex and string datatypes © Hortonworks Inc. 2013
  • 38. In YARN+MR • NO! • In Hive – Metastore – RCFile issues – CPU intensive code • In YARN+MR – Data locality – Parallelism – Spin-up times • In HDFS – Bad disks/deteriorating nodes © Hortonworks Inc. 2013
  • 40. Data Locality • CombineInputFormat • AM interaction with locality • Short-circuit reads! • Delay scheduling – Good for throughput – Bad for latency © Hortonworks Inc. 2013
  • 42. Parallelism • Can tune it (to some extent) – Controlling splits/reducer count • Hive doesn’t know dynamic cluster status – Benchmarks max out clusters, real jobs may or may not • Hive does not let you control parallelism – particularly in case of multiple jobs in a query © Hortonworks Inc. 2013
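As a rough illustration of "controlling splits/reducer count": Hive sizes the reduce phase from the input volume using the `hive.exec.reducers.bytes.per.reducer` and `hive.exec.reducers.max` settings, unless `mapred.reduce.tasks` forces a fixed number. The function below is a simplified model of that estimate, with illustrative parameter values, and not Hive's actual code.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Simplified model: one reducer per chunk of input, capped and at least 1."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


# 10 GB of input at 1 GB per reducer asks for 10 reducers.
small_job = estimate_reducers(10 * 10**9, 10**9, 999)
# 5 TB of input hits the cap instead of asking for 5000 reducers.
large_job = estimate_reducers(5 * 10**12, 10**9, 999)
```

The estimate depends only on data size and static settings. It takes no account of how busy the cluster is at that moment, which is the limitation the slide points at.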
  • 44. Spin up times • AM startup costs • Task startup costs • Multiple waves of map tasks © Hortonworks Inc. 2013
  • 45. Apache Tez • Generic DAG workflow • Container re-use • AM pool service © Hortonworks Inc. 2013
  • 46. AM Pool Service • Pre-launches a pool of AMs • Jobs submitted to these pre-launched AMs – Saves 3-5 seconds • Pre-launched AMs can pre-allocate containers • Tasks can be started as soon as the job is submitted – Saves 2-3 seconds © Hortonworks Inc. 2013
  • 47. Container reuse • Tez MapReduce AM supports Container reuse • Launched JVMs are re-used between tasks – about 4-5 seconds saved in case of multiple waves • Allows future enhancements – re-using task data structures across splits © Hortonworks Inc. 2013
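A back-of-the-envelope model of why container reuse matters when a job runs in several waves. It assumes, as a simplification not stated in the talk, that without reuse every wave pays one JVM start-up on the critical path, and that with reuse only the first wave does. The task count, slot count and 4-second start-up below are illustrative inputs.

```python
import math


def map_waves(num_tasks, num_slots):
    """Waves needed when only num_slots tasks can run at the same time."""
    return math.ceil(num_tasks / num_slots)


def startup_overhead(num_tasks, num_slots, jvm_start_s, reuse):
    """Seconds of JVM start-up on the critical path under the model above."""
    waves = map_waves(num_tasks, num_slots)
    launches = 1 if reuse else waves
    return launches * jvm_start_s


waves = map_waves(1000, 400)
without_reuse = startup_overhead(1000, 400, 4.0, reuse=False)
with_reuse = startup_overhead(1000, 400, 4.0, reuse=True)
```

Under these assumptions a three-wave job spends 12 seconds on start-up without reuse and 4 seconds with it. For queries that finish in tens of seconds, that difference is a noticeable share of total latency.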
  • 48. In HDFS • NO! • In Hive – Metastore – RCFile issues – CPU intensive code • In YARN+MR – Data locality – Parallelism – Spin-up times • In HDFS – Bad disks/deteriorating nodes © Hortonworks Inc. 2013
  • 49. Speculation/bad disks • No cluster remains at 100% forever • Bad disks cause latency issues – Speculation is one defense, but it is not enough – Fault tolerance is a safety net • Possible solutions: – More feedback from HDFS about stale nodes, bad/slow disks – Volume scheduling © Hortonworks Inc. 2013
  • 50. General guidelines • Benchmarking – Be wary of benchmarks! Including ours! – Algebra with X © Hortonworks Inc. 2013
  • 51. General guidelines contd. • Benchmarks: To repeat, YMMV. • Benchmark *your* use-case. • Decide your problem size – If (smallData) { MySQL/Postgres/Your smart phone } else { – Make it work – Make it scale – Make it faster } • If it is (or seems to be) slow, file a bug, spend a little time! • Replacing systems without understanding them – Is an easy way to have an illusion of progress © Hortonworks Inc. 2013
  • 52. Related talks • “Optimizing Hive Queries” by Owen O’Malley • “What’s New and What’s Next in Apache Hive” by Gunther Hagleitner © Hortonworks Inc. 2013
  • 53. Credits • Arun C Murthy • Bikas Saha • Gopal Vijayaraghavan • Hitesh Shah • Siddharth Seth • Vinod Kumar Vavilapalli • Alan Gates • Ashutosh Chauhan • Vikram Dixit • Gunther Hagleitner • Owen O’Malley • Jitendra Nath Pandey • Yahoo!, Facebook, Twitter, SAP and Microsoft all contributing. © Hortonworks Inc. 2013
  • 54. Q&A • Thanks! © Hortonworks Inc. 2013

Editor's Notes

  1. Since we started this, we’ve seen multiple people benchmark Hive by comparing its text-format processors against alternatives
  2. Not MapReduce, not HDFS, just plain Hive
  3. Layers of inspectors that identify column type, de-serialize data and determine appropriate expression routines in the inner loop
  4. I wrote all of the code and Jitendra was just consulting :P