Apache Hive

Motivation
Analysis of Data made by both engineering
and non-engineering people.
The data are growing fast. In 2007, the
volume was 15TB and it grew up to 200TB in
2010.
Current RDBMS can NOT handle it.
Current solution are not available, not
scalable, Expensive and Proprietary.

2

Map/Reduce -Apache Hadoop
MapReduce is a programing model and an
associated implementation introduced by
Goolge in 2004.
Apache Hadoop is a software framework
inspired by Google's MapReduce.

3

Motivation (cont.)
Hadoop supports data-intensive distributed
applications.
However...
–Map-reduce hard to program (users know
sql/bash/python).
–No schema.

4

What is HIVE?
A data warehouse infrastructure built on top
of Hadoop for providing data
summarization, query, and analysis.
–ETL.
–Structure.
–Access to different storage.
–Query execution via MapReduce.
Key Building Principles:
–SQL is a familiar language
–Extensibility –
Types, Functions, Formats, Scripts
–Performance 5

Hive Applications
Log processing
Text mining
Document indexing
Customer-facing business intelligence
(e.g., Google Analytics)
Predictive modeling, hypothesis testing

6

Hive Architecture

7

Data Units
Databases.
Tables.
Partitions.
Buckets (or Clusters).

8

Type System
Primitive types
–Integers:TINYINT, SMALLINT, INT, BIGINT.
–Boolean: BOOLEAN.
–Floating point numbers: FLOAT, DOUBLE .
–String: STRING.
Complex types
–Structs: {a INT; b INT}.
–Maps: M['group'].
–Arrays: ['a', 'b', 'c'], A[1] returns 'b'.

9

Physical Layout
Warehouse directory in HDFS
– e.g., /user/hive/warehouse
Table row data stored in subdirectories of
warehouse
Partitions form subdirectories of table
directories
Actual data stored in flat files
– Control char-delimited text, or SequenceFiles
– With custom SerDe, can use arbitrary format

1
0

Examples – DDL Operations

CREATE TABLE sample (foo INT, bar STRING)
PARTITIONED BY (ds STRING);

SHOW TABLES '.*s';

DESCRIBE sample;

ALTER TABLE sample ADD COLUMNS (new_col INT);

DROP TABLE sample;

1
2

Examples – DML Operations

LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO
TABLE sample PARTITION (ds='2012-02-24');

LOAD DATA INPATH '/user/falvariz/hive/sample.txt'
OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

1
3

Examples – DML Operations
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO
TABLE sample PARTITION (ds='2012-02-24');

LOAD DATA INPATH '/user/falvariz/hive/sample.txt'
OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

1
4

SELECTS and FILTERS
SELECT foo FROM sample WHERE ds='2012-02-24';

INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM
sample WHERE ds='2012-02-24';

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out'
SELECT * FROM sample;

1
5

Aggregations and Groups
SELECT MAX(foo) FROM sample;

SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;

FROM sample s INSERT OVERWRITE TABLE bar SELECT
s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar;

1
6

Extension mechanisms

1
8

Built-in Functions
Mathematical: round, floor, ceil, rand, exp...

Collection:
size, map_keys, map_values, array_contains.

Type Conversion: cast.

Date: from_unixtime, to_date, year, datediff...

Conditional: if, case, coalesce.

String: length, reverse, upper, trim...

2
0

Install and Config Hive

2
2

Installing Hive
From a Release Tarball:
$ wget http://archive.apache.org/dist/
hadoop/hive/hive-0.5.0/hive-0.5.0-
bin.tar.gz
$ tar xvzf hive-0.5.0-bin.tar.gz
$ cd hive-0.5.0-bin
$ export HIVE_HOME=$PWD
$ export PATH=$HIVE_HOME/bin:$PATH

2
3

Hive Dependencies
 Java 1.6
Hadoop >= 0.17-0.20
Hive *MUST* be able to find Hadoop:
– $HADOOP_HOME=<hadoop-install-dir>
– Add $HADOOP_HOME/bin to $PATH
Hive needs r/w access to /tmp and
/user/hive/warehouse on HDFS:
$ hadoop fs –mkdir /tmp
$ hadoop fs –mkdir /user/hive/warehouse
$ hadoop fs –chmod g+w /tmp
$ hadoop fs –chmod g+w /user/hive/warehouse

2
4

Hive Configuration
Default
configuration in
$HIVE_HOME/conf/hive-default.xml
– DO NOT TOUCH THIS FILE!
Re(Define) properties in
$HIVE_HOME/conf/hive-site.xml
 Use $HIVE_CONF_DIR to specify alternate conf dir
location
You can override Hadoop configuration properties in
Hive’s configuration,
e.g:– mapred.reduce.tasks=1

2
5

Hive CLI Commands
Start a terminal and run
$ hive
Should see a prompt like:
hive>
Set a Hive or Hadoop conf prop:
– hive> set propkey=value;
 List all properties and values:
– hive> set –v;
 Add a resource to the DCache:
– hive> add [ARCHIVE|FILE|JAR] filename;

2
7

Hive CLI Commands
List tables:
– hive> show tables;
Describe a table:
– hive> describe <tablename>;
More information:
– hive> describe extended <tablename>;
List Functions:
– hive> show functions;
More information:
– hive> describe function<functionname>;

2
8

Conclusion
A easy way to process large scale data.
Support SQL-based queries.
Provide more user defined interfaces to extend
Programmability.
Files in HDFS are immutable. Tipically:
–Log processing: Daily Report, User Activity
Measurement
–Data/Text mining: Machine learning (Training Data)
–Business intelligence: Advertising Delivery,Spam
Detection

2
9

Apache Hive

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (17)

Semelhante a Apache Hive

Semelhante a Apache Hive (20)

Último

Último (20)

Apache Hive