Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
2. | ©2012, Cognizant
HIVE
Data Warehousing Solution built on top of Hadoop
Provides SQL-like query language named HiveQL
– Minimal learning curve for people with SQL expertise
– Data analysts are target audience
Early Hive development work started at Facebook in 2007
Today, Facebook counts 29% of its employees (and growing!)
as Hive users.
https://www.facebook.com/note.php?note_id=114588058858
Today Hive is a top-level Apache project (it began as a Hadoop subproject)
– http://hive.apache.org
3. | 2012 Cognizant Technology Solutions
Hive Provides
• Ability to bring structure to various data formats
• Simple interface for ad hoc querying, analyzing, and summarizing large amounts of data
• Access to files on various data stores such as HDFS and HBase
4. | ©2012, Cognizant4
Hive
Hive does NOT provide low latency or realtime queries.
Even querying small amounts of data may take minutes.
Designed for scalability and ease-of-use rather than low latency
responses
5. | ©2012, Cognizant5
Hive
Translates HiveQL statements into a set of MapReduce Jobs
which are then executed on a Hadoop Cluster.
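One way to observe this translation is Hive's EXPLAIN statement, which prints the plan of stages for a query instead of running it. The sketch below assumes a table named employees2 with name and salary columns, as queried on the "between operator" slide; the exact plan text varies by Hive version.

```sql
-- Print the execution plan (the stages Hive would run) without executing the query.
-- Assumption: a table employees2(name, salary) exists.
EXPLAIN
SELECT name, salary
FROM employees2
WHERE salary BETWEEN 80000 AND 100000;
```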
6. | ©2012, Cognizant6
Hive Metastore
To support features like schemas and data partitioning, Hive keeps its metadata in a relational database
• Packaged with Derby, a lightweight embedded SQL database
• The default Derby-based metastore is good for evaluation and testing
• The schema is not shared between users, because each user has their own instance of embedded Derby
• The Derby data is stored in the metastore_db directory, which resides in the directory that hive was started from
• Can easily switch to another SQL installation such as MySQL
7. | ©2012, Cognizant7
Metastore Deployment Modes : Embedded Mode
Default metastore deployment mode for CDH.
Both the database and the metastore service run embedded in
the main HiveServer process
Both are started for you when you start the HiveServer process.
Supports only one active user at a time and is not certified for
production use.
8. | ©2012, Cognizant8
Metastore Deployment Modes : Local Mode
The Hive metastore service runs in the same process as the main HiveServer process.
The metastore database runs in a separate process, and can be on a separate host.
The embedded metastore service communicates with the metastore database over JDBC.
11. | ©2012, Cognizant11
Hive Interface Options
• Command Line Interface (CLI)
– Used exclusively in these slides
• Hive Web Interface
– https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface
• Java Database Connectivity (JDBC)
– https://cwiki.apache.org/confluence/display/Hive/HiveClient
• Beeline for HiveServer2 (new in CDH4)
– http://sqlline.sourceforge.net/#manual
12. | ©2012, Cognizant12
Data Types
Launch the Hive Command Line Interface (CLI):
[cts318692@aster4 ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.2.1.jar!/hive-log4j.properties
Hive history file=/tmp/cts318692/hive_job_log_cts318692_201308071622_2005272769.txt
hive>
The "Hive history file" line gives the location of the session’s log file.
hive> !cat data/user-posts.txt;
user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394
hive>
Local shell commands can be executed from within the CLI by placing
the command between ! and ;
13. | ©2012, Cognizant13
Data Types
Numeric Types
TINYINT
SMALLINT
INT
BIGINT
FLOAT
DOUBLE
DECIMAL (Note: Only available starting with Hive 0.11.0)
Date/Time Types
TIMESTAMP (Note: Only available starting with Hive 0.8.0)
DATE (Note: Only available starting with Hive 0.12.0)
Misc Types
BOOLEAN
STRING
BINARY (Note: Only available starting with Hive 0.8.0)
15. | ©2012, Cognizant15
Check physical storage of hive
[cts318692@aster4 ~]$ hive -S -e "set" | grep warehouse
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=true
This is the location where hive stores
its data.
16. | ©2012, Cognizant16
Creating DataBase
hive> CREATE DATABASE IF NOT EXISTS som COMMENT 'my database'
> LOCATION '/user/cts318692/someshwar/hivestore/'
> WITH DBPROPERTIES ('creator'='someshwar kale','date'='2013-06-08');
OK
Time taken: 0.046 seconds
• IF NOT EXISTS: used to suppress the warning raised when the database already exists
• som: the database name; Hive opens the default database when you open a new session
• LOCATION: physical storage for the som database; you can use it to override the default ‘/user/hive/warehouse’ location for the new directory
• WITH DBPROPERTIES: database properties
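A short follow-up sketch for inspecting and switching to a database, assuming the som database from the statement above already exists:

```sql
-- List all databases known to the metastore.
SHOW DATABASES;

-- Show the comment, location and DBPROPERTIES recorded for som.
DESCRIBE DATABASE EXTENDED som;

-- Make som the current database for the rest of the session.
USE som;
```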
18. | ©2012, Cognizant18
Creating Table
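Delimiter definitions in the CREATE TABLE statement:
• Separator for the elements of complex data types (maps, arrays, structures)
• Separator between a map key and its value, e.g. ‘key’ ^C ‘value’ (\003 = Ctrl-C = ^C)
• Column separator definition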
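The CREATE TABLE statement from this slide did not survive in the transcript. The following is a minimal sketch of a table that uses all three kinds of delimiter; the table name employees_demo and its columns are hypothetical, and the delimiter values shown are Hive's defaults (^A, ^B, ^C).

```sql
-- Hypothetical table; the names are illustrative and not from the original slide.
CREATE TABLE IF NOT EXISTS employees_demo (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'            -- column separator (^A)
  COLLECTION ITEMS TERMINATED BY '\002'  -- separator for array/struct elements and map entries (^B)
  MAP KEYS TERMINATED BY '\003'          -- separator between a map key and its value (^C)
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```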
21. | ©2012, Cognizant21
CREATE TABLE ... LIKE
If you omit the EXTERNAL keyword and the original table is
external, the new table will also be external.
If you omit EXTERNAL and the original table is managed,
the new table will also be managed. However, if you include
the EXTERNAL keyword and the original table is managed,
the new table will be external. Even in this scenario, the
LOCATION clause will still be optional.
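A minimal sketch of both variants, assuming an existing table named posts; the new table names and the HDFS path are hypothetical.

```sql
-- Copies only the schema of posts (no data). Without EXTERNAL, the new table
-- follows the original: managed if posts is managed, external if posts is external.
CREATE TABLE IF NOT EXISTS posts_copy LIKE posts;

-- Makes the copy external even if posts is managed. LOCATION is optional;
-- the path below is a placeholder.
CREATE EXTERNAL TABLE IF NOT EXISTS posts_ext LIKE posts
LOCATION '/user/hypothetical/posts_ext';
```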
24. | ©2012, Cognizant
Dropping DataBase and Table
By default, Hive won’t permit you to drop a database if it contains tables.
You can either drop the tables first or append the CASCADE keyword to the
command, which will cause Hive to drop the tables in the database first.
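For illustration, the two approaches could look like this; posts_copy is a hypothetical table name, and som is the database created earlier in these slides.

```sql
-- Option 1: drop the tables one by one, then the (now empty) database.
DROP TABLE IF EXISTS posts_copy;
DROP DATABASE IF EXISTS som;

-- Option 2: let Hive drop any tables in the database first.
DROP DATABASE IF EXISTS som CASCADE;
```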
25. | ©2012, Cognizant
Partitions
To increase performance, Hive has the capability to partition data
– The values of the partition column divide a table into segments
– Entire partitions can be ignored at query time
– Similar to relational databases’ indexes, but not as granular
Partitions have to be properly created by users
– When inserting data, a partition must be specified
At query time, whenever appropriate, Hive will automatically filter
out partitions
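A minimal sketch of a partitioned table definition. The table name posts_by_country and its columns are hypothetical; the column layout loosely follows the comma-separated user-posts.txt sample shown earlier.

```sql
-- Hypothetical table: the partition column (country) is declared in
-- PARTITIONED BY and is not repeated in the regular column list.
CREATE TABLE IF NOT EXISTS posts_by_country (
  user_name STRING,
  post      STRING,
  post_time BIGINT
)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```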
28. | ©2012, Cognizant
Loading data to table
LOAD DATA LOCAL ... copies the local data to the final location in the
distributed filesystem, while LOAD DATA ... (i.e., without LOCAL) moves
the data to the final location.
The PARTITION clause is necessary if the table into which we are loading
the data is partitioned. This is known as static partitioning, because we
provide the partition value in the query.
Partitions are physically stored under separate directories.
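A sketch of a static-partition load. It assumes a hypothetical table posts_by_country that is partitioned by a country STRING column, and a hypothetical local file data/user-posts-US.txt.

```sql
-- LOCAL: the file is copied from the local filesystem into the table's
-- directory; the partition value is supplied explicitly (static partitioning).
LOAD DATA LOCAL INPATH 'data/user-posts-US.txt'
OVERWRITE INTO TABLE posts_by_country
PARTITION (country = 'US');

-- List the partitions Hive now knows about for this table.
SHOW PARTITIONS posts_by_country;
```

Each partition then appears as its own subdirectory (here, one named country=US) under the table's directory.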
29. | ©2012, Cognizant
Schema Violations
hive> LOAD DATA LOCAL INPATH
> 'data/user-posts-inconsistentFormat.txt'
> OVERWRITE INTO TABLE posts;
OK
Time taken: 0.612 seconds
hive> select * from posts;
OK
user1 Funny Story 1343182026191
user2 Cool Deal NULL
user4 Interesting Post 1343182154633
user5 Yet Another Blog 13431839394
Time taken: 0.136 seconds
NULL is set for any value that
violates the predefined schema
31. | ©2012, Cognizant
Contd…
There is no difference in query syntax
• When the partition column is specified in the WHERE clause,
entire directories/partitions can be ignored
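As an illustration, assuming a hypothetical table posts_by_country partitioned by a country column, a query that filters on the partition column reads like any other query:

```sql
-- The partition column is used like an ordinary column. Because the filter
-- is on country, Hive only needs to read the country=US directory.
SELECT user_name, post
FROM posts_by_country
WHERE country = 'US';
```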
32. | ©2012, Cognizant
Bucketing
• Breaks data into a set of buckets based on a hash
function of a "bucket column"
– Capability to execute queries on a random subset of the data
• Hive doesn’t automatically enforce bucketing
– The user is required to specify the number of buckets by setting the number of
reducers
hive> set mapred.reduce.tasks = 256;
OR
hive> set hive.enforce.bucketing = true;
Either manually set the number of reducers to the number of
buckets, or use ‘hive.enforce.bucketing’, which
will set it on your behalf.
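The steps above can be sketched end to end as follows. The table names posts_bucketed and posts_staging and their columns are hypothetical; the bucket count of 256 matches the reducer setting shown on the slide.

```sql
-- Hypothetical bucketed table: rows are assigned to one of 256 buckets
-- by hashing the bucket column user_name.
CREATE TABLE IF NOT EXISTS posts_bucketed (
  user_name STRING,
  post      STRING,
  post_time BIGINT
)
CLUSTERED BY (user_name) INTO 256 BUCKETS;

-- Ask Hive to match the number of reducers to the number of buckets.
SET hive.enforce.bucketing = true;

-- Populate the buckets from a hypothetical, unbucketed staging table.
INSERT OVERWRITE TABLE posts_bucketed
SELECT user_name, post, post_time
FROM posts_staging;

-- Query a sample of the data: only the first of the 256 buckets.
SELECT *
FROM posts_bucketed TABLESAMPLE (BUCKET 1 OUT OF 256 ON user_name);
```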
46. | ©2012, Cognizant
Table generating functions
Only a single expression in the
SELECT clause is supported with
UDTFs.
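A small illustration of this restriction, assuming a hypothetical table employees_demo with a name STRING column and a subordinates ARRAY<STRING> column:

```sql
-- Supported: explode() is the only expression in the SELECT list.
SELECT explode(subordinates) AS subordinate
FROM employees_demo;

-- Not supported: another column next to the UDTF in the same SELECT list.
-- SELECT name, explode(subordinates) AS subordinate FROM employees_demo;
```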
52. | ©2012, Cognizant
Points to remember
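Only equality joins are allowed.
More than 2 tables can be joined in the same query, e.g.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2)
is a valid join.
Hive converts a join over multiple tables into a single map/reduce job if, for
every table, the same column is used in the join clauses, e.g.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
is converted into a single map/reduce job because only the key1 column from b
is involved in the join. On the other hand,
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
is converted into two map/reduce jobs because the key1 column from b
is used in the first join condition and the key2 column from b is used in the
second one.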
53. | ©2012, Cognizant
ORDER BY and SORT BY
ORDER BY uses a single reducer to sort the data, which may take
an unacceptably long time to execute for larger data sets.
Hive adds an alternative, SORT BY, that orders the data only
within each reducer, thereby performing a local ordering, where
each reducer’s output will be sorted.
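A side-by-side sketch of the two clauses, assuming the employees2 table (name, salary) that is queried on the "between operator" slide:

```sql
-- Total ordering: all rows pass through a single reducer.
SELECT name, salary
FROM employees2
ORDER BY salary DESC;

-- Local ordering: each reducer sorts only its own rows, so the combined
-- output is not globally sorted when more than one reducer runs.
SELECT name, salary
FROM employees2
SORT BY salary DESC;
```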
55. | ©2012, Cognizant
UNION ALL and Nested select
Each subquery of the union query must produce the
same number of columns, and for each column, its
type must match all the column types in the same
position.
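A minimal sketch of a UNION ALL wrapped in a nested SELECT, which is the form that Hive releases before 0.13.0 require. It assumes the employees2 table (name, salary) used elsewhere in these slides; the alias u is arbitrary.

```sql
-- Both branches produce the same number of columns with matching types.
SELECT u.name, u.salary
FROM (
  SELECT name, salary FROM employees2 WHERE salary < 80000
  UNION ALL
  SELECT name, salary FROM employees2 WHERE salary > 100000
) u;
```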
57. | ©2012, Cognizant
Lateral view
Lateral view is used in conjunction with user-defined table
generating functions such as explode().
A lateral view first applies the UDTF to each row of the base table and
then joins the resulting output rows to the input rows to form a virtual
table having the supplied table alias.
Syntax:
LATERAL VIEW udtf(expression) tableAlias AS columnAlias
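Applied to a concrete case, the syntax could look like the sketch below. It assumes a hypothetical table employees_demo with a name STRING column and a subordinates ARRAY<STRING> column; subs and sub are arbitrary aliases.

```sql
-- explode() turns each array element into its own row; the lateral view
-- joins those rows back to the employee they came from.
SELECT name, sub
FROM employees_demo
LATERAL VIEW explode(subordinates) subs AS sub;
```

An employee with three subordinates yields three output rows, each carrying the same name.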
60. | ©2012, Cognizant
UDF
Hive uses reflection to find methods named evaluate whose parameters
match the arguments used in the HiveQL function call.
Hive can work with both the Hadoop Writables and the Java
primitives, but it’s recommended to work with the Writables since
they can be reused.
Input arguments type and return type must be same.
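Once a UDF class is compiled and packaged, it is typically registered from HiveQL before use. The sketch below uses placeholder names throughout: the JAR path, the function name my_lower and the class com.example.hive.udf.MyLower are all hypothetical, and employees2 is the table used on the "between operator" slide.

```sql
-- Make the JAR containing the UDF class available to the session.
ADD JAR /tmp/hypothetical-udfs.jar;

-- Bind a function name to the class; the binding lasts for this session only.
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';

-- Hive picks the evaluate() method whose parameters match this call.
SELECT my_lower(name) FROM employees2;
```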
63. | ©2012, Cognizant
between operator
hive> select name,salary from employees2 where salary between
80000 and 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
....
OK
John Doe 100000.0
John Doe 100000.0
Mary Smith 80000.0
Mary Smith 80000.0
Time taken: 14.39 seconds
Both values (lower and upper) are inclusive.
64. | ©2012, Cognizant
HiveServer2
As of CDH4.1, you can deploy HiveServer2, an improved version of
HiveServer that supports a new Thrift API tailored for JDBC and
ODBC clients, Kerberos authentication, and multi-client concurrency.
There is also a new CLI for HiveServer2 named BeeLine.
HiveServer2
Connection URL: jdbc:hive2://<host>:<port>
Driver Class: org.apache.hive.jdbc.HiveDriver
HiveServer1
Connection URL: jdbc:hive://<host>:<port>
Driver Class: org.apache.hadoop.hive.jdbc.HiveDriver
65. | ©2012, Cognizant
BEELINE
$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000>