Cheat Sheet - Hive Performance Tuning


Some of my notes on performance tuning Apache Hive


Query Performance Tuning - Apache Hive

Execution Engine
1. Use SparkSQL Thrift Server (significantly faster than Hive on Tez).
2. Execution engine: MR < Tez < Spark.
3. Use pre-warmed containers (Hive will keep a few containers always running). Good for small jobs.
4. Keep containers alive after job completion (at least for some time).

Storage Format
5. Use compression for stored data. GZip and BZip2 compression is detected automatically by Hive (no need to specify any special InputFormat or SerDe).
6. Enable compression for intermediate processing as well.
7. Use Parquet or ORC files with Bloom filters.
8. Instead of storing in raw text, use a standardized data format like Avro or Parquet.

Indexes & Statistics
9. Always keep the statistics refreshed (they are used by the cost-based optimizer). This gives a significant performance improvement for join and aggregate queries. Do this at table, partition and column level. Enable CBO (hive.cbo.enable).
10. Use aggregate indexes (for commonly used aggregation hierarchies). This is an undocumented feature; basically it's an aggregate table.
11. Use materialized tables/views.

Miscellaneous
12. Use HBase or some other NoSQL store to hold lookup values/metadata.
13. Use the map-side join hint (distributes small tables): /*+ MAPJOIN(b) */
14. Increase the replication factor for selected tables.
15. Use bucketing: a map-side join is possible for large tables if both are partitioned and bucketed in the same way.
16. In case of joins, specify the largest table at the end.

Partitioning (saves disk and avoids brute-force data traversal)
17. Use on fields in the WHERE clause.
18. Wide partitions are better than deep partitions.
19. Partition on low-cardinality fields.

Data Modeling
20. De-normalize datasets (keep Dimension tables and Fact-Join-Dimension tables).
21. Use curated zones for de-dupes and latest-copy maintenance.
22. Nested datasets perform better than joins.
23. For hierarchical data, use nested sets.
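Several of the tips above (2, 6, 7, 9, 13, 15, 16, 17) can be combined in one HiveQL session. The following is only a rough sketch, not part of the original notes: the tables sales and stores, their columns, the bucket count and the date literal are all hypothetical, and the Tez engine is assumed to be installed. Recent Hive versions may ignore the MAPJOIN hint by default (hive.ignore.mapjoin.hint) and convert small-table joins automatically instead.

```sql
-- Hypothetical schema: sales is a large fact table, stores a small lookup table.

-- Tip 2: pick the execution engine for this session (assumes Tez is available).
SET hive.execution.engine=tez;

-- Tip 6: compress intermediate data passed between stages.
SET hive.exec.compress.intermediate=true;

-- Tip 9: enable the cost-based optimizer and let it read column statistics.
SET hive.cbo.enable=true;
SET hive.stats.fetch.column.stats=true;

-- Small lookup table (hypothetical).
CREATE TABLE stores (
  store_id INT,
  region   STRING
)
STORED AS ORC;

-- Tips 7, 15, 17: ORC with a Bloom filter on the join key, partitioned on a
-- field used in WHERE clauses, bucketed on the join key.
CREATE TABLE sales (
  sale_id  BIGINT,
  store_id INT,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (store_id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.bloom.filter.columns'='store_id'
);

-- Tip 9: refresh table/partition-level and column-level statistics.
ANALYZE TABLE sales PARTITION (sale_date) COMPUTE STATISTICS;
ANALYZE TABLE sales PARTITION (sale_date) COMPUTE STATISTICS FOR COLUMNS;

-- Tips 13, 16, 17: hint the small table for a map-side join, list the
-- largest table last, and filter on the partition column.
SELECT /*+ MAPJOIN(st) */ st.region, SUM(s.amount) AS total_amount
FROM stores st
JOIN sales s ON s.store_id = st.store_id
WHERE s.sale_date = '2018-01-01'
GROUP BY st.region;
```

Filtering on sale_date lets Hive read only the matching partition directory, which is the "avoid brute-force data traversal" benefit named in the Partitioning section.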