2. whoami
• Developer with SQL Server team since 2001
• Apache contributor
• Hive
• Hadoop core (security)
• stackoverflow user 105929
• @rusanu
3. What is Hive
• Data warehouse for querying and managing large datasets
• A query engine that uses Hadoop MapReduce for execution
• A SQL abstraction for creating MapReduce algorithms
• SQL interface to HDFS data
• Developed at Facebook
VLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce Framework
• ASF top-level project since September 2010
4. How does Hive work
• SQL submitted via CLI or HiveServer2
• Metadata describing tables stored in an RDBMS (the metastore)
• Driver compiles/optimizes the execution plan
• Plan and execution engine submitted to Hadoop as a job
• MR invokes the Hive execution engine, which executes the plan
[Architecture diagram: Shell/ODBC/JDBC/Beeswax clients → CLI and Hiveserver2 → Driver (compiles, optimizes) with Metastore (RDBMS, HCatalog) → Hadoop Job Tracker → MapReduce tasks reading splits from HDFS]
5. Hive Query execution
• Compilation/optimization results in a tree of operators, e.g.:
• FetchOperator: scans source data (the input split)
• SelectOperator: projects column values, computes
• GroupByOperator: aggregate functions (SUM, COUNT etc)
• JoinOperator: joins
• The plan forms a DAG of MR jobs
• The plan tree is serialized (Kryo)
• Hive Driver dispatches jobs
• Multiple stages can result in multiple jobs
• Task execution picks up the plan and starts iterating it
• MR emits values (rows) into the topmost operator (Fetch)
• Rows propagate down the tree
• ReduceSinkOperator emits map output for shuffle
• Each operator implements both a map side and a reduce side algorithm
• Executes the one appropriate for the current task
• MR does the shuffle, many operators rely on it as part of their algorithm
• E.g. SortOperator, GroupByOperator
• Multi-stage queries create intermediate output and the driver submits new job to continue next stage
• TEZ execution: map-reduce-reduce, usually eliminates multiple stages (more later)
• Vectorized execution mode emits batches of rows (1024 rows)
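The stage/operator structure above can be inspected with EXPLAIN. A minimal sketch, assuming a hypothetical employees table (the exact plan text varies by Hive version):

```sql
-- EXPLAIN prints the operator tree and the MR stages of the plan
EXPLAIN
SELECT dept, COUNT(*) AS cnt
FROM employees
GROUP BY dept;
-- Typical output: a Stage-1 Map Reduce stage whose map operator tree is
-- TableScan -> Select -> Group By -> Reduce Output Operator, and whose
-- reduce operator tree is Group By -> File Output Operator
```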
6. Hive features
• Data types:
• Numeric: tinyint, smallint, int, bigint, float, double, decimal(precision, scale)
• Date/Time: timestamp, date
• Character types: string, char(size), varchar(size)
• Misc. types: boolean, binary
• Complex types: ARRAY<type>, MAP<type, type>, STRUCT<name:type, name:type>, UNIONTYPE<type, type, type>
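To illustrate, a table declaration combining several of these types (table and column names are made up for the example):

```sql
CREATE TABLE page_views (
  user_id    BIGINT,
  view_time  TIMESTAMP,
  referrers  ARRAY<STRING>,
  properties MAP<STRING, STRING>,
  geo        STRUCT<city:STRING, country:STRING>
);
```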
• Storage formats: text, sequencefile, ORC, Parquet, RC, arbitrary SerDe
• Data load: INSERT, LOAD, external tables, dynamic partitioning
• Bucketized tables
• JOIN optimizations: MapJoin, SMB join
• ACID DML (INSERT/UPDATE/DELETE)
• Only supported for ORC storage for now
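A sketch tying several of these together: a partitioned, bucketed ORC table filled via dynamic partitioning (table names and bucket count are illustrative; ACID DML additionally requires the transactional table properties and transaction manager to be configured):

```sql
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE sales (id BIGINT, amount DECIMAL(10,2))
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC;

-- dt comes last in the SELECT list and drives the dynamic partitions
INSERT INTO TABLE sales PARTITION (dt)
SELECT id, amount, dt FROM staging_sales;
```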
• Columnar storage, vectorized execution
• Cost based optimizer (new)
• HiveQL: SQL dialect, drives toward ANSI-92 compliance
• Subqueries, joins, common table expressions
• Lateral views (CROSS APPLY)
• SELECT … FROM table LATERAL VIEW explode(column)
• Windowing and analytical functions
• LEAD, LAG, FIRST_VALUE, LAST_VALUE
• RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK, NTILE
• OVER clause for aggregates
• PARTITION BY, ORDER BY
• WINDOW specification
• SELECT SUM(a) OVER (PARTITION BY b ORDER BY c ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
• WINDOW clause
• SELECT SUM(b) OVER w FROM t WINDOW w AS (PARTITION BY b ORDER BY c ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
• GROUPING SETS, CUBE, ROLLUP
• XPath, custom UDF
• TRANSFORM: arbitrary map_script, reduce_script
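As an illustration of TRANSFORM, which streams rows through an external script over stdin/stdout (the script name and columns are hypothetical):

```sql
ADD FILE my_mapper.py;

SELECT TRANSFORM (user_id, url)
       USING 'python my_mapper.py'
       AS (user_id, domain)
FROM access_log;
```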
7. Hive engines
• MapReduce
• Default, widely available
• Complex queries require stages -> stop-and-go
• Always on-disk shuffle
• TEZ
• Generalized MRR (DAG)
• Pipelining
• Memory shuffle
• JOINs (bipartite)
• Custom sort
• Hive can optimize plans for TEZ
• Recommended engine
• SPARK
• HIVE-7292
• In development
• Not to be confused with Shark or Spark SQL
8. Hive pros and cons
+ Capable of handling PB scale
+ Decent performance
+ Fairly advanced SQL features
+ Integrates with the Hadoop ecosystem
+ Shares the data (HDFS)
+ Leverages existing clusters
+ High availability
+ Disaster recoverability
+ Partitioning, clustering, bucketing
+ Positive momentum, active development
+ No licensing costs
- Not good for ad-hoc queries due to high latency (job submit time)
• Topic is actively pursued by Hive/Tez/YARN
- ANSI SQL gaps
- Poor toolset (hive CLI, ODBC drivers)
- In-house RDBMS operating expertise does not translate to Hive
- Outperformed by (costly) high-end proprietary solutions
9. How to use Hive
+ As part of the ETL that transforms large data ingress (click-stream, mobile uploads, access logs etc.) into queryable form
+ Alternative to Pig for those who favor SQL
+ Run one-off queries to analyze large unstructured data sets
+ Power of SQL to get insight into 'collect everything' data
+ DW/BI
+ Can also be part of the ETL that loads the DW
+ Deploy on TEZ, not on M/R
+ Use ORC or Parquet as the storage format
+ Use recent releases (Hive 0.14 or later)
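Switching the engine is a session-level setting; a minimal sketch (the ORC default is optional, STORED AS ORC per table works as well):

```sql
-- Run on Tez instead of classic MapReduce
SET hive.execution.engine=tez;
-- Make ORC the default format for new tables
SET hive.default.fileformat=ORC;
```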
10. When to avoid Hive
- Replace RDBMS/OLTP
- Ad-hoc BI (latency still too high, will improve soon)
- When the dataset is small
• 512 GB of RAM is cheap
• If it fits in memory, it's not Big Data
- When data changes frequently
- If you have an infinite budget
11. Some alternatives to Hive
• Columnar storage in a traditional RDBMS
• MySQL: ICE (Infobright), InfiniDB
• PostgreSQL: cstore_fdw (Citus)
• SQL Server columnstore
• Amazon Redshift
• Azure SQL Database v12
• Impala
• Presto
• Spark SQL
• Note that Impala and Spark SQL can share Hive’s metastore
12. Links
• Hive Language Manual:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• Join strategies in Hive:
https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf
• Hive on Tez:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez
• Hive on Spark:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark