Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
Scaling API-first – The story of a global engineering organization
Apache Hive
1.
2. Motivation
Analysis of Data made by both engineering
and non-engineering people.
The data are growing fast. In 2007, the
volume was 15TB and it grew up to 200TB in
2010.
Current RDBMS can NOT handle it.
Current solution are not available, not
scalable, Expensive and Proprietary.
2
3. Map/Reduce -Apache Hadoop
MapReduce is a programing model and an
associated implementation introduced by
Goolge in 2004.
Apache Hadoop is a software framework
inspired by Google's MapReduce.
3
4. Motivation (cont.)
Hadoop supports data-intensive distributed
applications.
However...
–Map-reduce hard to program (users know
sql/bash/python).
–No schema.
4
5. What is HIVE?
A data warehouse infrastructure built on top
of Hadoop for providing data
summarization, query, and analysis.
–ETL.
–Structure.
–Access to different storage.
–Query execution via MapReduce.
Key Building Principles:
–SQL is a familiar language
–Extensibility –
Types, Functions, Formats, Scripts
–Performance 5
6. Hive Applications
Log processing
Text mining
Document indexing
Customer-facing business intelligence
(e.g., Google Analytics)
Predictive modeling, hypothesis testing
6
9. Type System
Primitive types
–Integers:TINYINT, SMALLINT, INT, BIGINT.
–Boolean: BOOLEAN.
–Floating point numbers: FLOAT, DOUBLE .
–String: STRING.
Complex types
–Structs: {a INT; b INT}.
–Maps: M['group'].
–Arrays: ['a', 'b', 'c'], A[1] returns 'b'.
9
10. Physical Layout
Warehouse directory in HDFS
– e.g., /user/hive/warehouse
Table row data stored in subdirectories of
warehouse
Partitions form subdirectories of table
directories
Actual data stored in flat files
– Control char-delimited text, or SequenceFiles
– With custom SerDe, can use arbitrary format
1
0
12. Examples – DDL Operations
CREATE TABLE sample (foo INT, bar STRING)
PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;
1
2
13. Examples – DML Operations
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO
TABLE sample PARTITION (ds='2012-02-24');
LOAD DATA INPATH '/user/falvariz/hive/sample.txt'
OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
1
3
14. Examples – DML Operations
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO
TABLE sample PARTITION (ds='2012-02-24');
LOAD DATA INPATH '/user/falvariz/hive/sample.txt'
OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
1
4
15. SELECTS and FILTERS
SELECT foo FROM sample WHERE ds='2012-02-24';
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM
sample WHERE ds='2012-02-24';
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out'
SELECT * FROM sample;
1
5
16. Aggregations and Groups
SELECT MAX(foo) FROM sample;
SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
FROM sample s INSERT OVERWRITE TABLE bar SELECT
s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar;
1
6
23. Installing Hive
From a Release Tarball:
$ wget http://archive.apache.org/dist/
hadoop/hive/hive-0.5.0/hive-0.5.0-
bin.tar.gz
$ tar xvzf hive-0.5.0-bin.tar.gz
$ cd hive-0.5.0-bin
$ export HIVE_HOME=$PWD
$ export PATH=$HIVE_HOME/bin:$PATH
2
3
24. Hive Dependencies
Java 1.6
Hadoop >= 0.17-0.20
Hive *MUST* be able to find Hadoop:
– $HADOOP_HOME=<hadoop-install-dir>
– Add $HADOOP_HOME/bin to $PATH
Hive needs r/w access to /tmp and
/user/hive/warehouse on HDFS:
$ hadoop fs –mkdir /tmp
$ hadoop fs –mkdir /user/hive/warehouse
$ hadoop fs –chmod g+w /tmp
$ hadoop fs –chmod g+w /user/hive/warehouse
2
4
25. Hive Configuration
Default
configuration in
$HIVE_HOME/conf/hive-default.xml
– DO NOT TOUCH THIS FILE!
Re(Define) properties in
$HIVE_HOME/conf/hive-site.xml
Use $HIVE_CONF_DIR to specify alternate conf dir
location
You can override Hadoop configuration properties in
Hive’s configuration,
e.g:– mapred.reduce.tasks=1
2
5
27. Hive CLI Commands
Start a terminal and run
$ hive
Should see a prompt like:
hive>
Set a Hive or Hadoop conf prop:
– hive> set propkey=value;
List all properties and values:
– hive> set –v;
Add a resource to the DCache:
– hive> add [ARCHIVE|FILE|JAR] filename;
2
7
29. Conclusion
A easy way to process large scale data.
Support SQL-based queries.
Provide more user defined interfaces to extend
Programmability.
Files in HDFS are immutable. Tipically:
–Log processing: Daily Report, User Activity
Measurement
–Data/Text mining: Machine learning (Training Data)
–Business intelligence: Advertising Delivery,Spam
Detection
2
9