Apache Hive for modern
DBAs
Luís Marques
About me
Oracle ACE
Data and Linux geek
Long-time open source
supporter
Works for @redgluept as
Data Architect
@drune
Big Data Thinking Strategy
●Think small
●Think big
●Don’t think at all (hype is here)
What is Apache Hive?
●Open source, TB/PB-scale data warehousing
framework based on Hadoop
●The first and most complete SQL-on-Hadoop engine
●SQL:2003 and SQL:2011 compatible
●Data stored in several formats
●Several execution engines available
●Interactive query support (in-memory cache)
Apache Hive - Before you ask
●Datawarehouse/OLAP activities (data mining, data
exploration, batch processing, ETL, etc) - “The
heavy lifting of data”
●Low-cost scaling, built with extensibility in mind
●Use it for large datasets (gigabyte/terabyte scale)
●Don’t use Hive for any OLTP activities
●ACID exists, not recommended yet
The reason behind Hive
I had written, as part of working with the Feed team - what became - a rather complicated MR
job to rank friends by mutual friends.
In doing so I had pretty much used every Hadoop trick in the bag (partitioners, separate
map and reduce sorting keys, comparators, in-memory hash tables and so on) and realized how
hard it was to write an optimal MR job (particularly on large data sets).
Assembling data into complex data structures was also painful.
I really wanted to see these types of operators exposed in a high level declarative form so
that the average user would never have to go through this. Fortunately - our team had
Oracle veterans well versed in the art of SQL.
Joydeep Sen Sarma (Facebook)
The reason behind Hive
Instead of complex MR jobs
You have declarative language...
Apache Hive versions & branches
master: Version 2.x
●New code
●New features
●Hadoop 2.x supported
branch-1: Version 1.x
●Stable
●Backwards compatibility
●Critical bugs
●Stable features
●Hadoop 1.x and 2.x supported
Data Model (data units & types)
●Supports primitive column types (integers,
numbers, strings, date time and booleans)
●Supports complex types: Structs, Maps and
Arrays
●Concept of databases, tables, partitions and
buckets
●SerDe: the serializer/deserializer API is used to
move data in and out of tables
Data Model (partitions & bucketing)
● Partitioning: distributes load horizontally, improves performance and
organizes data
PARTITIONED BY (flightName STRING, AircraftName STRING)
/employees/flightName=ABC/AircraftName=XYZ
● Buckets (clusters): decompose data sets into more manageable parts, help
with map-side joins, and allow correct sampling within the same bucket
“Records with the same flightID will always be stored in the same bucket.
Assuming the number of flightID is much greater than the number of buckets, each
bucket will have many flightIDs”
CLUSTERED BY (flightID) INTO XX BUCKETS;
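A minimal sketch of the two ideas on this slide, written in Python rather than HiveQL so that it runs without a cluster. The function names and the bucket count of 32 are hypothetical. The bucket formula (a non-negative hash modulo the bucket count, with an integer key hashing to itself) is a simplifying assumption; Hive's real hash depends on the column type and the Hive version.

```python
def partition_path(table_dir, partition_values):
    """Build the directory of one partition as key=value path segments."""
    segments = [f"{key}={value}" for key, value in partition_values]
    return "/".join([table_dir.rstrip("/")] + segments)


def bucket_for(flight_id, num_buckets):
    """Pick a bucket from a non-negative hash of the clustering column.

    Assumption: an integer key hashes to itself. Real Hive hashing
    depends on the column type and the Hive version.
    """
    return (flight_id & 0x7FFFFFFF) % num_buckets


path = partition_path("/employees", [("flightName", "ABC"), ("AircraftName", "XYZ")])
print(path)  # /employees/flightName=ABC/AircraftName=XYZ

# Rows with the same flightID always land in the same bucket, and
# different flightIDs can share one bucket (613 and 645 both map to 5).
for flight_id in (613, 613, 614, 645):
    print(flight_id, "-> bucket", bucket_for(flight_id, 32))
```

Because the bucket depends only on the key, two tables bucketed the same way on the join key can be joined bucket by bucket, which is what makes map-side joins possible.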
Data Model (complex data types)
●Array: ordered collection of fields, all of the same type.
Example: array(1, 2)
●Map: unordered key-value pairs. Keys are primitives, values are any type.
Example: map('a', 1, 'b', 2)
●Struct: a collection of named fields.
Example: struct('a', 10, 2.5)
Data model
HiveQL
●HiveQL is an SQL-like query language for Hive
●Supports DDL and DML
●Supports multi-table inserts
●Possible to write custom map-reduce scripts
●Supports UDFs, UDAFs and UDTFs
DDL (some examples)
HIVE> CREATE DATABASE/SCHEMA, TABLE, VIEW, INDEX
HIVE> DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
HIVE> TRUNCATE TABLE
HIVE> ALTER DATABASE/SCHEMA, TABLE, VIEW
HIVE> SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, VIEWS,
PARTITIONS, FUNCTIONS
HIVE> DESCRIBE DATABASE/SCHEMA, table_name, view_name
File formats
● Parquet: compressed, efficient columnar data
representation available to any project in the Hadoop
ecosystem
● ORC: made for Hive, supports the Hive type model, columnar
storage, block compression, predicate pushdown, ACID*,
etc.
● Avro: uses JSON for defining data types and protocols, and
serializes data in a compact binary format
● Compressed file formats (LZO, *.GZIP)
● Plain text files
● Any other data with a defined format can be
read (CSV, JSON, XML, etc.)
ORC
●Stored as columns and compressed = smaller disk
reads
●ORC has a built-in index with min/max values and
other aggregates (e.g. sum) = skip entire
blocks to speed up reads
●ORC implements predicate pushdown and bloom
filters
●ORC scales
●You should use it :-)
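The block-skipping idea can be sketched in a few lines of Python. This is an illustration of the principle only, not ORC's file layout or API: the `Block` class, the `scan_equals` function and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One block of column values; min/max stand in for the built-in index."""
    values: list

    @property
    def min_value(self):
        return min(self.values)

    @property
    def max_value(self):
        return max(self.values)


def scan_equals(blocks, wanted):
    """Return matching values and how many blocks were actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        # The predicate cannot match outside [min, max], so skip the block.
        if wanted < block.min_value or wanted > block.max_value:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if v == wanted)
    return matches, blocks_read


blocks = [Block([100, 250, 399]), Block([400, 613, 799]), Block([800, 950, 999])]
matches, blocks_read = scan_equals(blocks, 613)
print(matches, blocks_read)  # [613] 1
```

Only one of the three blocks is read for `flightnum = 613`. The benefit is largest when the data is sorted or clustered on the filtered column, because the min/max ranges of the blocks then overlap less.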
Indexing
● Not recommended, because ORC makes them redundant
● ORC has built-in indexes which allow the format to skip
blocks of data during reads
● Hive indexes are implemented as tables
● Compact indexes and bitmap indexes supported
● Tables that provide information about which data is in
which blocks and are used to skip data (like ORC already
does)
● Not supported on Tez engine - ignored
● Indexes in Hive are not like indexes in other databases.
File formats & Indexing
Hive Architecture
●Clients: Hive Web Interface, Hive CLI (beeline, hive), Hive JDBC/ODBC
●Thrift Server (HiveServer2)
●Driver: Compiler (Parser, Semantic Analyser, Logical Plan Generator, Query Plan
Generator), Optimizer, Executor
●Metastore client and Metastore RDBMS
●Execution engines: MapReduce, Tez, Spark
●Resource management: YARN
●Storage: HDFS, HBase, Azure Storage, Amazon S3
Metastore
● Typically stored in an RDBMS (MySQL, SQL Server,
PostgreSQL, Derby*) - ACID and concurrency on metadata
queries
● Contains: metadata for databases, tables and partitions
● Provides two features: data discovery and data abstraction
● Data abstraction: information about data formats,
extractors and loaders is provided at table creation and reused
afterwards (e.g. like dictionary tables in Oracle)
● Data discovery: discover relevant and specific data, and allow
other tools to use the metadata to explore data (e.g. SparkSQL)
See it in action
Execution engines
● 3 execution engines are available:
○ MapReduce (mr)
○ Tez
○ Spark
MR: The original, most stable and most reliable engine; batch oriented, disk-
based and parallel (like traditional Hadoop MR jobs).
Tez: High-performance batch and interactive data processing. Stable
99% of the time. The one that you should use. Default on HDP.
Spark: Uses Apache Spark (in-memory computing platform). High
performance (like Tez), not used in production (yet), good progress.
MapReduce vs Tez/Spark
MapReduce:
● One pair of map and reduce does one level of aggregation over the
data. Complex computations typically require multiple such steps.
Tez/Spark:
● DAG (Directed Acyclic Graph)
● The graph does not have cycles because the fault tolerance
mechanism used by Tez is re-execution of failed tasks
● The limitations of MapReduce in Hadoop were a key motivation for
introducing DAGs
● Pipelining consecutive map steps into one
● Enforce concurrency and serialization between MapReduce jobs
Tez & DAGs
DAG Definition:
● Data processing is expressed in the form of a directed acyclic graph
(DAG)
Two main components:
● vertices - represent the processing of data
○ user logic, which analyses and modifies the data, sits in the vertices
● edges - represent the movement of data between processing steps
○ Defines routing of data between tasks (One-To-One, Broadcast,
Scatter-Gather)
○ Defines when a consumer task is scheduled (Sequential,
Concurrent)
○ Defines the lifetime/reliability of a task output
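To make the vertex/edge picture concrete, here is a small Python sketch that schedules a hypothetical query plan as a DAG. It uses the standard library's `graphlib` (Python 3.9+), not the Tez API; the vertex names are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: two scans feed a join, which feeds an aggregation.
# Each key is a vertex; the set lists the upstream vertices it consumes from.
dag = {
    "scan_flights": set(),
    "scan_aircraft": set(),
    "join": {"scan_flights", "scan_aircraft"},
    "aggregate": {"join"},
}

sorter = TopologicalSorter(dag)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # vertices whose inputs are all done
    waves.append(ready)                 # these could run concurrently
    sorter.done(*ready)

print(waves)
# [['scan_aircraft', 'scan_flights'], ['join'], ['aggregate']]
```

The two scans have no edge between them, so they can run concurrently, while the join has to wait for both. With no cycles, a failed vertex can be re-executed from its upstream outputs without ever depending on its own result.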
Hive Cost Based Optimizer - Why
● Distributed SQL query processing in Hadoop differs from a conventional
relational query engine in how it handles intermediate
result sets
● Query processing requires sorting and reassembling of intermediate
result set - shuffling
● Most of the existing optimizations in Hive are about minimizing
shuffling cost and logical optimizations like filter push down,
projection pruning and partition pruning
● Join reordering and join algorithm selection become possible with a cost-based optimizer.
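The three logical optimizations named above can be illustrated with a toy table in Python. This is a sketch under stated assumptions: the directory names, the rows and the `query` function are hypothetical, and a real engine applies these steps inside its operator tree rather than in a loop like this.

```python
# Hypothetical partition directories for a table partitioned by flightName.
partitions = {
    "/flights/flightName=ABC": [
        {"flightnum": 613, "delay": 12},
        {"flightnum": 700, "delay": 0},
    ],
    "/flights/flightName=DEF": [{"flightnum": 613, "delay": 45}],
    "/flights/flightName=GHI": [{"flightnum": 101, "delay": 3}],
}


def query(partitions, flight_name, min_delay):
    """SELECT delay FROM flights WHERE flightName = ? AND delay >= ?"""
    wanted_dir = f"flightName={flight_name}"
    # Partition pruning: directories of other partition values are never opened.
    scanned = [path for path in partitions if path.endswith("/" + wanted_dir)]
    result = []
    for path in scanned:
        for row in partitions[path]:
            # Filter pushdown: the predicate is applied while scanning.
            if row["delay"] >= min_delay:
                # Projection pruning: keep only the column the query asks for.
                result.append(row["delay"])
    return scanned, result


scanned, result = query(partitions, "ABC", 10)
print(scanned)  # ['/flights/flightName=ABC']
print(result)   # [12]
```

All three steps shrink the data before any shuffle takes place, which is why they pay off even without table statistics.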
Hive CBO - What to get
● Based on a project called Apache Calcite (https://calcite.apache.org/)
● What a cost-based optimizer can decide:
○ How to order joins (join reordering)
○ Which algorithm to use for a join
○ Whether an intermediate result should be persisted or recomputed on
failure
○ Degree of parallelism at any operator (number of mappers and
reducers)
○ Semi-join selection
○ (other optimizer tricks, like histograms)
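Join reordering can be sketched as a search over join orders with a simple cost model. Everything below is an assumption made for illustration: the row counts, the selectivities and the cost function (sum of intermediate result sizes for a left-deep plan) are invented, and this is not how Calcite models cost.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical statistics; a real optimizer would take them from table stats.
row_counts = {"flights": 1_000_000, "aircraft": 5_000, "airports": 3_000}
# Assumed fraction of the cross product that survives each join predicate.
selectivity = {
    frozenset({"flights", "aircraft"}): Fraction(1, 50_000),
    frozenset({"flights", "airports"}): Fraction(1, 3_000),
    frozenset({"aircraft", "airports"}): Fraction(1, 1),  # no predicate
}


def plan_cost(order):
    """Sum of intermediate result sizes for a left-deep join in this order."""
    joined = [order[0]]
    rows = row_counts[order[0]]
    cost = 0
    for table in order[1:]:
        rows *= row_counts[table]
        for other in joined:
            rows *= selectivity[frozenset({table, other})]
        joined.append(table)
        cost += rows
    return cost


best = min(permutations(row_counts), key=plan_cost)
worst = max(permutations(row_counts), key=plan_cost)
print(best, plan_cost(best))    # ('flights', 'aircraft', 'airports') 200000
print(worst, plan_cost(worst))  # ('aircraft', 'airports', 'flights') 15100000
```

Every order returns the same final result, but joining the two small tables first forces a cross product, so that plan produces about 75 times more intermediate rows in this toy model. A rule-based optimizer without statistics cannot tell these plans apart.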
Execution Engines
Hive - The present-future
● Tez and Spark head to head on performance and stability
● LLAP (Live Long and Process) - Hive interactive queries
● ACID
Hive next big thing: LLAP
● Sub-second queries (interactive queries)
● In-memory caching layer with async I/O
● Fast concurrent execution
● Move from disk oriented to memory oriented execution (trend)
● Disks are connected to the CPU via the network - data locality is not relevant
Thank you
Questions?
@drune
https://www.linkedin.com/in/lcmarques/
luis.marques@redglue.eu
@redgluept
www.redglue.eu

Editor's notes

  1. SQL:2011 - Seventh revision of the ISO (1987) and ANSI (1986) standard for the SQL database query language
  2. 2007 - 15TB 2009 - 2PB
  3. SQL:2011 - Seventh revision of the ISO (1987) and ANSI (1986) standard for the SQL database query language
  4. https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-UnderstandingHiveBranches Release and feature branches were not added to the slide, as they might be too complex
  5. https://cwiki.apache.org/confluence/display/Hive/Tutorial
  6. https://cwiki.apache.org/confluence/display/Hive/Tutorial
  7. https://cwiki.apache.org/confluence/display/Hive/Tutorial
  8. Demo: describe tables and datatypes; create database.
      Tables (stored as text files, external, ORC, partitioned ORC): describe, describe extended.
      Complex types example: describe the array table, then
      select flightName, AircraftColors[1] from flights.flightperfarray; -- Arrays
      Partitions layout:
      hdfs dfs -ls /apps/hive/warehouse/flights.db/flightperfpartorc
  9. https://cwiki.apache.org/confluence/display/Hive/Tutorial
  10. Predicate pushdown: running operations that filter or cut down data as close to the beginning of your map-reduce pipeline as possible. A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.
  11. https://cwiki.apache.org/confluence/display/Hive/IndexDev
  12. Show create table for the different formats (ORC and PLAINTEXT). Create an index on a table (not supported in Tez): set hive.execution.engine=mr; create index idxFlightNum on table flightperfall(flightnum) AS 'COMPACT' WITH DEFERRED REBUILD; alter index idxFlightNum ON flightperfall rebuild; show formatted index on flightperfall; explain select * from flightperfall where flightnum=613 limit 1; set hive.optimize.index.filter.compact.minsize=10; explain select * from flightperfall where flightnum=613 limit 1; set hive.optimize.index.filter.compact.minsize=5368709120; Execution times. Show operator tree with index and without index. ORC vs CSV query time: select * from flightperfall_orc where flightnum=613 limit 1;
  13. Describe - components. HiveCLI - management tools. Ambari - Apache Ambari is a tool for provisioning, managing, and monitoring Apache Hadoop clusters. HiveServer2 - HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. It is based on Apache Thrift RPC. It is an improved version of HiveServer and supports multi-client concurrency and authentication, with better support for open API clients like JDBC and ODBC. Driver - Manages the life cycle of a HiveQL statement during compilation, optimization and execution. Compiler - The component that parses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore: Parser - Transforms a query string into a parse tree representation; Semantic Analyser - Transforms the parse tree into an internal query representation (column names are verified and expansions like * are performed), and does type-checking, any implicit type conversions and partition checking; Logical Plan Generator - Converts the internal query representation into a logical plan, which consists of a tree of operators. This step also includes the optimizer, which transforms the plan to improve performance; Query Plan Generator - Converts the logical plan into a series of map-reduce tasks (or DAG stages). Optimizer - As of 2011, it was rule-based and performed column pruning and predicate pushdown. Now it is cost-based, as in an RDBMS. Executor engine (Processing) - The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes them. Metastore - The component that stores all the structure information of the various tables and partitions in the warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding HDFS files where the data is stored. https://cwiki.apache.org/confluence/display/Hive/Design
  14. ssh root@127.0.0.1 -p 2222 (sandbox) Test CLI (beeline and hive cmd) Beeline: !connect jdbc:hive2://localhost:10000 ; Show Ambari - Identify the Hive metastore (MySQL database) - mysql -u root -p ; password: hadoop ; show databases; use hive; select * from DBS; select * from TBLS; Identify execution engines: SET hive.execution.engine; Identify whether CBO is active: set hive.cbo.enable; set hive.compute.query.using.stats; set hive.stats.fetch.column.stats; set hive.stats.fetch.partition.stats; explain select * from sample_07, sample_08 where sample_07.code = sample_08.code and sample_07.salary > 1000; Conditions for CBO, for example: statistics on the table, columns or other (too few joins). Show a database, a table and a file stored in HDFS.
  15. Tez - Hindi for “speed”. Example: if jobs A and B are independent of each other, but job C needs the results from A and B to complete, Tez will execute A and B in any order and forward the results to C.
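
The A/B/C example in note 15 can be sketched as a small dependency-driven scheduler. This is not the Tez API; it is a toy illustration using Python's standard `graphlib` module (Python 3.9+), and the task functions and the `run_dag` helper are hypothetical names chosen for the example.

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, tasks):
    """Run tasks in dependency order, passing upstream results downstream.

    dependencies maps a task name to the list of tasks it waits for.
    tasks maps a task name to a callable taking the upstream results.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    results = {}
    order = []
    while sorter.is_active():
        # Every task returned here has all of its inputs available,
        # so these tasks could run in any order (or in parallel).
        for name in sorter.get_ready():
            inputs = [results[dep] for dep in dependencies.get(name, ())]
            results[name] = tasks[name](*inputs)
            order.append(name)
            sorter.done(name)
    return order, results


# C needs the results of A and B; A and B are independent of each other.
dependencies = {"C": ["A", "B"]}
tasks = {
    "A": lambda: 2,
    "B": lambda: 3,
    "C": lambda a, b: a + b,
}

order, results = run_dag(dependencies, tasks)
print(order, results["C"])
```

The point of the sketch is that C is only scheduled once both of its inputs exist, while nothing forces an order between A and B.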
  16. Data movement - One-To-One: Data from the ith producer task routes to the ith consumer task. Broadcast: Data from a producer task routes to all consumer tasks. Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards. Scheduling - Sequential: Consumer task may be scheduled after a producer task completes. Concurrent: Consumer task must be co-scheduled with a producer task.
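
The three data movement patterns in note 16 can be sketched as plain routing functions. This is an illustration only, not Tez code: the function names are hypothetical, each producer's output is modelled as a list of integer records, and scatter-gather shards records by `key(record) % num_consumers`, which is an assumed partitioning rule.

```python
def one_to_one(producer_outputs):
    """The i-th producer's records go to the i-th consumer."""
    return [list(records) for records in producer_outputs]


def broadcast(producer_outputs, num_consumers):
    """Every consumer receives the records of every producer."""
    everything = [r for records in producer_outputs for r in records]
    return [list(everything) for _ in range(num_consumers)]


def scatter_gather(producer_outputs, num_consumers, key=lambda record: record):
    """Producers shard their records; each consumer gathers one shard."""
    consumers = [[] for _ in range(num_consumers)]
    for records in producer_outputs:
        for record in records:
            consumers[key(record) % num_consumers].append(record)
    return consumers


producers = [[1, 2], [3, 4]]
print(one_to_one(producers))
print(broadcast(producers, 2))
print(scatter_gather(producers, 2))
```

Scatter-gather is the pattern behind a classic shuffle, while broadcast corresponds to sending a small table to every task of a join. The scheduling types (sequential and concurrent) are not modelled here.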
  17. In Hive most of the optimizations are not based on the cost of query execution. Most of the optimizations do not rearrange the operator tree except for filter push down and operator merging.
  18. In Hive most of the optimizations are not based on the cost of query execution. Most of the optimizations do not rearrange the operator tree except for filter push down and operator merging. http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-14-2.pdf
  19. Query: SELECT year, month, origin, dest, distance FROM flights.flightperfall_orc where flightnum in (select max(flightnum) from flights.flightperfpartorc where year=2008); MR: 41.1 seconds; Tez: 3.761 seconds. Show Tez View (via Ambari). analyze table customer COMPUTE STATISTICS; analyze table customer COMPUTE STATISTICS for columns; use foodmart; explain select * from sales_fact_dec_1998 sf, customer c, product p, store ss where sf.customer_id = c.customer_id and p.product_id = sf.product_id and ss.store_id = sf.store_id and sf.customer_id > 100 and ss.store_id = 5