This presentation is one of my talks at the "Global Big Data Conference" held at the end of January 2014. It is aimed at giving the audience an overview of Hive and hands-on experience with the Hive Query Language. The overview covers the need for Hive, Hive architecture, Hive components, the Hive Query Language, and more.
1. Hive
BALA KRISHNA G
Global Big Data Bootcamp – Jan 2014
(http://globalbigdataconference.com)
Global Big Data Conference - 2014
2. My introduction
Senior Software and Research Engineer
Big data trainer
More than 1.5 years of experience with Hadoop and Storm
Worked at various large companies (Sun/Oracle, IBM, etc.)
www.linkedin.com/in/gbalakrishna/
bala.gsbk@outlook.com
Speaker : Bala
3. Agenda
Class structure
– 1 hour lecture and 1 ½ hour lab
Lecture
– Need for Hive
– Hive history
– Hive powered by
– What is Hive?
– Hive Architecture
– Hive Query Life cycle
– Hive Query Language (HiveQL)
Lab:
– Extensive hands-on experience with Hive
– Derive various insights from a real-world dataset using Hive
4. Need for Hive
Do I need to learn Java?
Don’t worry! I am here to rescue you.
5. Need for Hive contd.,
In general, one MR job does not suffice to derive BI (Business Intelligence)
It often requires a series of complex MR jobs chained together (advanced data processing)
[Figure: six chained MapReduce jobs, MR 1 to MR 6. Legend: MR = MapReduce; mapper task; reducer task.]
6. Need for Hive contd.,
About 20 lines of Hive code can do the work of ~200 lines of Java code
Lowers the development time significantly (~16 times)
[Figure: two bar charts comparing Hadoop and Pig: lines of code (axis 0 to 300) and development time in minutes (axis 0 to 300).]
7. Need for Hive contd.,
You focus only on the “WHAT” part of your data analysis
The “HOW” part is taken care of by the framework
8. Hive powered by
Used for processing large amounts of user data and central to meeting company reporting needs
Data analytics and data cleaning
Ad hoc queries, reporting, and analytics
And many more…
https://cwiki.apache.org/confluence/display/Hive/PoweredBy
9. What is Hive?
Data warehouse built on top of Hadoop
Provides an SQL-like interface to analyze data
An open-source project under Apache
Works on the high-throughput, high-latency principle (same as Hadoop)
Ability to plug in custom MapReduce programs
Mainly targeted at structured data
Hides MapReduce program complexities from the end user
11. Metastore
Stores table metadata such as database location, owner, creation time, access attributes, table schema, etc.
Comprises two components: 1) service 2) data storage
[Figure: three metastore configurations]
– Embedded metastore: the driver and the metastore service run inside the Hive service, backed by Derby
– Local metastore: the driver and the metastore service run inside the Hive service, backed by MySQL
– Remote metastore: the driver runs inside the Hive service and connects to a separate metastore server, backed by MySQL
12. Hive Query Life cycle Insight
13. Hive Query Life cycle contd.,
[Figure: Hive query life cycle in 14 numbered steps across the Hive interface, driver, compiler (parser, semantic analyzer, logical plan generator, optimizer, physical plan generator), metastore, execution engine, and Hadoop MapReduce.]
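One way to see what the compiler stages produce is Hive's EXPLAIN statement. The following is a minimal sketch, assuming the sample table used later in this deck already exists; the exact output format varies between Hive releases.

```sql
-- Show the plan Hive compiles for a query without running it.
-- The output lists the stage dependencies and the operators in each stage.
EXPLAIN
SELECT state, COUNT(*) FROM sample GROUP BY state;

-- EXTENDED adds more detail to the same plan.
EXPLAIN EXTENDED
SELECT state, COUNT(*) FROM sample GROUP BY state;
```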
14. Data Models
Database: Namespace that holds tables
Table: Container of the actual data
Example table "sample" with columns: Id | Name | Age | Sex | State
In the Hive warehouse, the table is stored as a folder:
/user/$USER/warehouse/sample
15. Data Models contd.,
Partition: Horizontal slice of a table by a partition key
Let's say the sample table is partitioned by the State column
[Figure: the sample table (Id, Name, Age, Sex, State) split row-wise into Partition 1 and Partition 2]
Stored as subfolders under the sample directory, one per partition value:
/user/$USER/warehouse/sample/State=AL/
/user/$USER/warehouse/sample/State=NC/
/user/$USER/warehouse/sample/State=GA/
/user/$USER/warehouse/sample/State=ND/
16. Data Models contd.,
Bucket: Divides each partition into further chunks by another column (useful for sampling)
Let's say the sample table is partitioned by the ‘State’ column and clustered by the ‘Age’ column into 2 buckets
In the warehouse, the data is stored as
/user/$USER/warehouse/sample/State=AL/part-00000
/user/$USER/warehouse/sample/State=AL/part-00001
/user/$USER/warehouse/sample/State=GA/part-00000
/user/$USER/warehouse/sample/State=GA/part-00001
.
.
/user/$USER/warehouse/sample/State=ND/part-00000
/user/$USER/warehouse/sample/State=ND/part-00001
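The partition and bucket layout described above could be declared roughly as follows. This is a sketch, not the deck's own DDL: the table name sample_bucketed is hypothetical, and the partition column is declared only in PARTITIONED BY because Hive does not allow it to be repeated in the regular column list.

```sql
-- Hypothetical table: partitioned by state, each partition split into 2 buckets by age.
CREATE TABLE sample_bucketed (
  id   INT,
  name STRING,
  age  INT,
  sex  STRING
)
PARTITIONED BY (state STRING)
CLUSTERED BY (age) INTO 2 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Buckets make sampling cheap: read one of the two buckets of a single partition.
SELECT *
FROM sample_bucketed TABLESAMPLE (BUCKET 1 OUT OF 2 ON age) s
WHERE state = 'AL';
```

Filtering on the partition column (state) lets Hive read only the matching subfolder, while TABLESAMPLE picks one bucket inside it.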
17. Data Loading Techniques
Managed table: A table managed by the Hive warehouse
– 1) Copy a file from the local file system into the Hive warehouse
Local FS file → copy → Hive warehouse (in HDFS)
– 2) Move a file that is already in HDFS into the Hive warehouse (Hive moves the file rather than copying it)
HDFS file → move → Hive warehouse
18. Data Loading Techniques contd.,
External table: A table that is only referenced by the Hive warehouse
– 3) Manage the file directly in HDFS without copying it into the Hive warehouse
HDFS file ← referenced by ← Hive warehouse
19. Data Loading Techniques contd.,
Explain when to use an external table and when to use a managed table
20. Question - 01
In which scenario would you use Hive?
1. Completely unstructured nasty data
2. Structured data
3. Any kind of data
4. None of the above
21. Question – 01 answer
2. Hive is mainly used to analyze structured data. Typically, Hive runs on data generated by a MapReduce job or Pig.
22. Question - 02
Which option is not correct about Metastore?
1. It stores the table location
2. It has information about number of partitions and number of buckets
3. It can give you time at which the table is created
4. It stores the actual data
23. Question – 02 answer
4. Metastore stores only the metadata.
Actual data is stored in HDFS.
24. Question – 03 (last question)
What is incorrect about Hive?
1. Hive internally generates MapReduce jobs to serve your query
2. Hive runs on top of HDFS
3. Hive is proprietary software
4. Hive supports multiple interfaces to interact with
25. Question – 03 answer
3. Hive is open source, not proprietary software. The Hive community is growing very rapidly.
26. Hive Query Language (Hive QL)
Data types – provides types for variables
DDL – provides a way to define databases, tables, etc.,
DML – provides a way to modify content
Query statements – provides a way to retrieve the content
27. Data types
Primitive Types
Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)
Booleans: BOOLEAN (TRUE or FALSE)
Floating point numbers: FLOAT (4 bytes), DOUBLE (8 bytes)
String: STRING (sequence of characters)
Usage
variable_name <Data Type>
ex: name STRING
28. Data types contd.,
Complex Types
ARRAY: collection of multiple values of the same data type
Usage: name ARRAY<type>
ex: marks ARRAY<INT>
STRUCT: collection of multiple values of different data types
Usage: name STRUCT<field1:type1, field2:type2, field3:type3, …>
ex: record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>
MAP: collection of (key, value) pairs
Usage: name MAP<key, value>
ex: score MAP<STRING, INT>
29. Data types contd.,
In a MAP, the key must be a primitive type
Referencing complex types
Previous example:
– marks ARRAY<INT>
– record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>
– score MAP<STRING, INT>
SELECT marks[0], record.name, score['joe']
A complex type inside a complex type is allowed
– array inside a struct (as seen before)
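The three complex types can be put together in one table as a quick check of the referencing syntax. This is a sketch only: the table name students is hypothetical, and the struct fields are written in Hive's name:type form.

```sql
-- Hypothetical table using the column definitions from the example above.
CREATE TABLE students (
  marks  ARRAY<INT>,
  record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>,
  score  MAP<STRING, INT>
);

SELECT
  marks[0],          -- first element of the array (indexes start at 0)
  record.name,       -- struct field, referenced with a dot
  record.marks[0],   -- array nested inside the struct
  score['joe'],      -- map value looked up by key
  size(marks)        -- built-in function: number of elements in the array
FROM students;
```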
30. DDL
CREATE TABLE sample(id INT, name STRING, age INT, sex STRING)   -- schema
COMMENT 'This is a sample table'   -- comments for readability
PARTITIONED BY (state STRING)   -- partition data by state column (declared only here; Hive rejects a partition column repeated in the column list)
ROW FORMAT DELIMITED   -- rows are delimited by '\n'
FIELDS TERMINATED BY ','   -- fields are terminated by ','
STORED AS TEXTFILE;   -- store file as a text file
The table is created in the warehouse directory and is completely managed by Hive
A specific row format and file format can be expressed with a custom SerDe
31. SerDe
SerDe stands for Serializer and Deserializer
Deserializer (read path): HDFS file → InputFileFormat → <key, value> → Deserializer → Row
Serializer (write path): Row → Serializer → <key, value> → OutputFileFormat → HDFS file
32. DDL contd.,
CREATE EXTERNAL TABLE external_sample(id INT, name STRING,
age INT, sex STRING, state STRING)
LOCATION '/user/department/sample';
The table is not created in the warehouse directory; it is only referenced by Hive
The referenced data stays in HDFS under /user/department/sample
33. DDL contd.,
DROP TABLE sample;
Since the sample table is managed by Hive, this deletes the entire data along with the metadata
DROP TABLE external_sample;
Since the external_sample table is *not* managed by Hive, this deletes only the metadata, leaving the actual data untouched
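Before removing a table it can be worth checking whether Hive treats it as managed or external. A minimal sketch, assuming both tables from the previous slides exist; note that Hive's statement for removing a table is DROP TABLE.

```sql
-- The "Table Type" field in the output reads MANAGED_TABLE or EXTERNAL_TABLE.
DESCRIBE FORMATTED sample;
DESCRIBE FORMATTED external_sample;

-- Managed table: metadata and the data in the warehouse folder are both removed.
DROP TABLE IF EXISTS sample;

-- External table: only the metadata is removed; the files at LOCATION stay in HDFS.
DROP TABLE IF EXISTS external_sample;
```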
34. DML
Load data into a managed table from the local file system
LOAD DATA LOCAL INPATH '/home/hive/sample.txt' INTO TABLE sample;
The file '/home/hive/sample.txt' is in the local file system
It is copied into the Hive warehouse folder
Load data into a managed table from HDFS
LOAD DATA INPATH '/user/hive/sample.txt' INTO TABLE sample;
The file '/user/hive/sample.txt' is in HDFS
It is moved (not copied) into the Hive warehouse folder
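Because the sample table was declared as partitioned by state, a load usually names the target partition. The sketch below assumes a hypothetical local file, /home/hive/sample_AL.txt, that holds only the rows for one state.

```sql
-- Load one file into one partition; OVERWRITE replaces what the partition already holds.
LOAD DATA LOCAL INPATH '/home/hive/sample_AL.txt'
OVERWRITE INTO TABLE sample PARTITION (state = 'AL');

-- List the partitions Hive now knows about for this table.
SHOW PARTITIONS sample;
```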
35. DML contd.,
Insert results into a new table
INSERT OVERWRITE TABLE newsample
SELECT * FROM sample;
The newsample table must be created beforehand
The query results are loaded into newsample, overwriting its existing content
Create a new table with automatically derived schema
CREATE TABLE newsample
AS SELECT * FROM sample;
Creates the newsample table with an automatically derived schema
The query results are populated into it
36. Query statements
To list available databases
SHOW DATABASES;
To use a particular database
USE <databasename>;
To list all tables available in a database
SHOW TABLES;
37. Query statements contd.,
select
SELECT * FROM sample;
Aggregation functions
SELECT COUNT(DISTINCT state) FROM sample;
Group by, Sort by, Order by
SELECT COUNT(*) FROM sample GROUP BY state;
SELECT * FROM sample SORT BY id DESC;
FROM sample SELECT * ORDER BY id ASC;
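The two sorting clauses behave differently, which the queries above do not show on their own. A short sketch on the same sample table: ORDER BY produces one total order across all rows, SORT BY only orders rows within each reducer, and DISTRIBUTE BY controls which reducer a row goes to.

```sql
-- Include the grouping column so each count can be matched to its state.
SELECT state, COUNT(*) AS people
FROM sample
GROUP BY state;

-- Total order over the whole result.
SELECT * FROM sample ORDER BY id ASC;

-- Rows with the same state go to the same reducer and are sorted within it;
-- the combined output is not globally sorted.
SELECT * FROM sample
DISTRIBUTE BY state
SORT BY state, id DESC;
```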
38. Query statements contd.,
Joins
SELECT s.* , o.*
FROM sample s
JOIN orders o
ON (s.id = o.id)
Left and right outer joins are also supported
Multiple joins in one query are accepted
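The other join forms mentioned above look roughly like this. A sketch reusing the sample and orders tables from the slide; the payments table and the column alias are hypothetical.

```sql
-- Left outer join: every row of sample is kept, with NULLs where no order matches.
SELECT s.id, s.name, o.id AS order_key
FROM sample s
LEFT OUTER JOIN orders o ON (s.id = o.id);

-- Multiple joins in one query (payments is a hypothetical third table).
SELECT s.name, o.id, p.id
FROM sample s
JOIN orders o ON (s.id = o.id)
JOIN payments p ON (o.id = p.id);
```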
39. Custom Functions
UDF:
– User-defined function
– Complex/additional logic can be expressed
– Operates row by row
UDAF:
– User-defined aggregate function
– Custom aggregate function logic can be written
– Operates on the groups produced by the GROUP BY clause
UDTF:
– User-defined table-generating function
– Takes a single input row and produces multiple output rows (a table)
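Hive's built-in functions fall into the same three families, so they make a convenient illustration. This is a sketch: the students table with a marks ARRAY<INT> column, the JAR path, and the function and class names in the last part are all hypothetical.

```sql
-- UDF (row by row): upper() is a built-in scalar function.
SELECT upper(name) FROM sample;

-- UDAF (per group): avg() is a built-in aggregate function.
SELECT state, avg(age) FROM sample GROUP BY state;

-- UDTF (one row in, many rows out): explode() is a built-in table-generating function.
SELECT explode(marks) AS mark FROM students;

-- Registering a custom UDF packaged in a JAR (all names here are placeholders).
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
SELECT my_lower(name) FROM sample;
```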
40. Hive Limitations
Not suitable for unstructured data
Well suited to OLAP systems (analysis), not to online transaction processing
Representing machine learning algorithms can be a challenging task
Performance trade-off compared with hand-written MR programs in various scenarios
– The gap is narrowing from release to release
41. Important practical tips
Hive logs: /tmp/$USER/hive.log
To list the available functions: SHOW FUNCTIONS
To get help on a specific function: DESCRIBE FUNCTION <function_name>
Explain the config files in the /usr/lib/hive/conf folder
– hive-site.xml, hive-default.xml, (or) specify custom file using –f option ?
Setting parameters in the Hive session (SET)
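The tips above translate into a handful of statements that can be typed at the Hive prompt. A minimal sketch; upper is just one built-in function chosen as an example, and hive.cli.print.header is one session parameter chosen for illustration.

```sql
-- List every function available in the session.
SHOW FUNCTIONS;

-- Short and extended help for one function.
DESCRIBE FUNCTION upper;
DESCRIBE FUNCTION EXTENDED upper;

-- Set a parameter for the current session only, then print its current value.
SET hive.cli.print.header=true;
SET hive.cli.print.header;
```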
42. References
Hadoop: The Definitive Guide -Tom White
https://cwiki.apache.org/confluence/display/Hive/Home
http://www.sfbayacm.org/wp/wpcontent/uploads/2010/01/sig_2010_v21.pdf
Venner, Jason (2009). Pro Hadoop
http://hortonworks.com/big-data-insights/how-facebook-uses-hadoopand-hive/
46. Schema on Read (?)
[To do] where to put this slide?
Explain what is schema on read
Explain what is schema on write
Advantages of using schema on read
– Faster load time
– Impacts query time