2. whoami
• Developer with SQL Server team since 2001
• Apache contributor
• Hive
• Hadoop core (security)
• stackoverflow user 105929
• @rusanu
3. What is Hive
• Data warehouse for querying and managing large datasets
• A query engine that uses Hadoop MapReduce for execution
• A SQL abstraction for creating MapReduce algorithms
• SQL interface to HDFS data
• Developed at Facebook
VLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce Framework
• ASF top-level project since September 2010
4. How does Hive work
• SQL submitted via CLI or HiveServer2
• Metadata describing tables stored in an RDBMS (the metastore)
• Driver compiles/optimizes the execution plan
• Plan and execution engine submitted to Hadoop as a job
• MR invokes the Hive execution engine, which executes the plan
[Architecture diagram: Shell/ODBC/JDBC/Beeswax clients → CLI and Hiveserver2 → Driver (compiles, optimizes) with Metastore (RDBMS, HCatalog) → Hadoop Job Tracker → MapReduce tasks reading splits from HDFS]
5. Hive Query execution
• Compilation/optimization results in a tree of operators, e.g.:
• FetchOperator: scans source data (the input split)
• SelectOperator: projects column values, computes
• GroupByOperator: aggregate functions (SUM, COUNT etc)
• JoinOperator: joins
• The plan forms a DAG of MR jobs
• The plan tree is serialized (Kryo)
• Hive Driver dispatches jobs
• Multiple stages can result in multiple jobs
• Task execution picks up the plan and starts iterating it
• MR emits values (rows) into the topmost operator (Fetch)
• Rows propagate down the tree
• ReduceSinkOperator emits map output for shuffle
• Each operator implements both a map side and a reduce side algorithm
• Executes the one appropriate for the current task
• MR does the shuffle, many operators rely on it as part of their algorithm
• E.g. SortOperator, GroupByOperator
• Multi-stage queries create intermediate output and the driver submits new job to continue next stage
• TEZ execution: map-reduce-reduce, usually eliminates multiple stages (more later)
• Vectorized execution mode emits batches of rows (1024 rows)
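The stage/operator structure above can be inspected with EXPLAIN. A minimal sketch, assuming a hypothetical employees table (the exact plan text varies by Hive version):

```sql
-- EXPLAIN prints the operator tree and the MR stages of the plan
EXPLAIN
SELECT dept, COUNT(*) AS cnt
FROM employees
GROUP BY dept;
-- Typical output: a Stage-1 Map Reduce stage whose map operator tree is
-- TableScan -> Select -> Group By -> Reduce Output Operator, and whose
-- reduce operator tree is Group By -> File Output Operator
```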
6. Hive features
• Data types:
• Numeric: tinyint, smallint, int, bigint, float, double, decimal(precision, scale)
• Date/Time: timestamp, date
• Character types: string, char(size), varchar(size)
• Misc. types: boolean, binary
• Complex types: ARRAY<type>, MAP<type, type>, STRUCT<name:type, name:type>, UNIONTYPE<type, type, type>
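To illustrate, a table declaration combining several of these types (table and column names are made up for the example):

```sql
CREATE TABLE page_views (
  user_id    BIGINT,
  view_time  TIMESTAMP,
  referrers  ARRAY<STRING>,
  properties MAP<STRING, STRING>,
  geo        STRUCT<city:STRING, country:STRING>
);
```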
• Storage formats: text, sequencefile, ORC, Parquet, RC, arbitrary SerDe
• Data load: INSERT, LOAD, external tables, dynamic partitioning
• Bucketized tables
• JOIN optimizations: MapJoin, SMB join
• ACID DML (INSERT/UPDATE/DELETE)
• Only supported for ORC storage for now
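A sketch tying several of these together: a partitioned, bucketed ORC table filled via dynamic partitioning (table names and bucket count are illustrative; ACID DML additionally requires the transactional table properties and transaction manager to be configured):

```sql
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE sales (id BIGINT, amount DECIMAL(10,2))
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC;

-- dt comes last in the SELECT list and drives the dynamic partitions
INSERT INTO TABLE sales PARTITION (dt)
SELECT id, amount, dt FROM staging_sales;
```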
• Columnar storage, vectorized execution
• Cost based optimizer (new)
• HiveQL: SQL dialect, drives toward ANSI-92 compliance
• Subqueries, joins, common table expressions
• Lateral views (CROSS APPLY)
• SELECT … FROM table LATERAL VIEW explode(column)
• Windowing and analytical functions
• LEAD, LAG, FIRST_VALUE, LAST_VALUE
• RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK, NTILE
• OVER clause for aggregates
• PARTITION BY, ORDER BY
• WINDOW specification
• SELECT SUM(a) OVER (PARTITION BY b ORDER BY c ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
• WINDOW clause
• SELECT SUM(b) OVER w FROM t WINDOW w AS (PARTITION BY b ORDER BY c ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
• GROUPING SETS, CUBE, ROLLUP
• XPath, custom UDF
• TRANSFORM: arbitrary map_script, reduce_script
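As an illustration of TRANSFORM, which streams rows through an external script over stdin/stdout (the script name and columns are hypothetical):

```sql
ADD FILE my_mapper.py;

SELECT TRANSFORM (user_id, url)
       USING 'python my_mapper.py'
       AS (user_id, domain)
FROM access_log;
```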
7. Hive engines
• MapReduce
• Default, widely available
• Complex queries require stages -> stop-and-go
• Always on-disk shuffle
• TEZ
• Generalized MRR (DAG)
• Pipelining
• Memory shuffle
• JOINs (bipartite)
• Custom sort
• Hive can optimize plans for TEZ
• Recommended engine
• SPARK
• HIVE-7292
• In development
• Not to be confused with Shark or Spark SQL
8. Hive pros and cons
+ Capable of handling PB scale
+ Decent performance
+ Fairly advanced SQL features
+ Integrates with the Hadoop ecosystem
+ Shares the data (HDFS)
+ Leverages existing clusters
+ High availability
+ Disaster recoverability
+ Partitioning, clustering, bucketing
+ Positive momentum, active development
+ No licensing costs
- Not good for ad-hoc queries due to high latency (job submit time)
• Topic is actively pursued by Hive/Tez/YARN
- ANSI SQL gaps
- Poor toolset (hive CLI, ODBC drivers)
- In-house RDBMS operating expertise does not translate to Hive
- Outperformed by (costly) high-end proprietary solutions
9. How to use Hive
+ As part of the ETL that transforms large data ingress (click-stream, mobile uploads, access logs etc.) into queryable form
+ Alternative to Pig for those who favor SQL
+ Run one-off queries to analyze large unstructured data sets
+ Power of SQL to get insight into 'collect everything' data
+ DW/BI
+ Can also be part of the ETL that loads the DW
+ Deploy on TEZ, not on M/R
+ Use ORC or Parquet as the storage format
+ Use recent releases (Hive 0.14 or later)
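Switching the engine is a session-level setting; a minimal sketch (the ORC default is optional, STORED AS ORC per table works as well):

```sql
-- Run on Tez instead of classic MapReduce
SET hive.execution.engine=tez;
-- Make ORC the default format for new tables
SET hive.default.fileformat=ORC;
```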
10. When to avoid Hive
- Replace RDBMS/OLTP
- Ad-hoc BI (latency still too high, will improve soon)
- When the dataset is small
• 512 GB of RAM is cheap
• If it fits in memory, it's not Big Data
- When data changes frequently
- If you have an infinite budget
11. Some alternatives to Hive
• Columnar storage in a traditional RDBMS
• MySQL: ICE (Infobright), InfiniDB
• PostgreSQL: cstore_fdw (Citus)
• SQL Server columnstore
• Amazon Redshift
• Azure SQL Database v12
• Impala
• Presto
• Spark SQL
• Note that Impala and Spark SQL can share Hive’s metastore
12. Links
• Hive Language Manual:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• Join strategies in Hive:
https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf
• Hive on Tez:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez
• Hive on Spark:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark