At a recent Big Data Warehousing Meetup in NYC, Caserta Concepts partnered with Datameer to explore big data analytics techniques. The presentation included a comparison of Hive and Pig. For more information on our services or this presentation, please visit www.casertaconcepts.com or contact us at info (at) casertaconcepts.com.
1. Big Data Warehousing Meetup
Today’s Topic: Exploring Big Data Analytics Techniques with Datameer
Sponsored By:
2. WELCOME!
Joe Caserta
Founder & President, Caserta Concepts
3. Agenda
7:00 Networking: Grab a slice of pizza and a drink...
7:15 Welcome: Joe Caserta, President, Caserta Concepts; Author, Data Warehouse ETL Toolkit. About the Meetup and about Caserta Concepts
7:30 Pig and Hive: Elliott Cordo, Principal Consultant, Caserta Concepts. Walkthrough of these powerful native Hadoop tools
7:50 Datameer: Adam Gugliciello, Solutions Engineer, Datameer
8:10 - 9:00 More Networking: Tell us what you’re up to…
4. About BDW Meetup
• Big Data is a complex, rapidly changing landscape
• We want to share our stories and hear about yours
• Great networking opportunity for like-minded data nerds
• Opportunities to collaborate on exciting projects
• Next BDW Meetup: April 22.
• Topic: Intro to NoSQL Databases
5. About Caserta Concepts
Focused Expertise
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data Ecosystems
Industries Served
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education
Founded in 2001
• President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
7. Expertise & Offerings
Strategic Roadmap/Assessment/Consulting
Big Data Analytics
Data Warehousing/ETL/Data Integration
BI/Visualization/Analytics
Master Data Management
10. ANALYZING DATA: PIG AND HIVE
Elliott Cordo
Principal Consultant, Caserta Concepts
11. Big Data Analysis
• Let’s review some tools for analyzing and processing Big Data
• We will go over some simple use cases and point out what is interesting about them
• Develop a point of view on what each one is well suited for
12. Big Data Analysis – Map Reduce?
Distributed programming framework – Divide and Conquer!
• Master divides work into digestible chunks and distributes them to worker nodes -> MAP
• Work from the nodes is then collected by the master and combined to form an answer -> REDUCE
Powerful tool for solving interesting computational problems at scale
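The divide-and-conquer idea can be sketched in a few lines of plain Python. This is only an illustration running in one process, not Hadoop code: the function names, the `chunks` input, and the occupation values are all made up for the example. The `shuffle` step, which groups values by key between the two phases, is included so the reduce has something to combine.

```python
from collections import defaultdict

def map_phase(chunk):
    """Worker step: emit a (key, 1) pair for every record in one chunk."""
    return [(occupation, 1) for occupation in chunk]

def shuffle(mapped_chunks):
    """Group the emitted values by key before they are reduced."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values collected for each key into one answer."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input, already divided into chunks for three workers.
chunks = [["student", "writer"], ["student", "doctor"], ["student"]]
counts = reduce_phase(shuffle(map_phase(c) for c in chunks))
print(counts)
```

Even this toy count needs three hand-written functions, which is the productivity problem the higher-level tools address.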
13. HELP
• We are doing low-level language coding to perform low-level operations
• For productivity we need higher-level tools!
• We will get help from a few animals!
[Diagram: nodes N1 N2 N3 N4 N5; Hadoop Distributed File System (HDFS)]
14. HIVE
• The Hadoop “Data Warehouse”
• HiveQL is a SQL-like interface that lets you layer a “relational-db like” structure on top of non-relational or unstructured data
• Flat files, JSON, web logs
• HBase, Cassandra, other NoSQL stores like MongoDB
• Thanks to ODBC/JDBC drivers, some conventional BI tools can interact with Hive
• Ability to integrate custom programming, mappers, reducers
15. HIVE
But don’t get too excited!
• Hive is not a database, especially in terms of optimizations.
• SQL is translated into Map Reduce jobs; expect even simple queries to take around a minute or more. (Start query, go get coffee.)
• But now that expectations have been set, it’s still a very useful tool
16. HIVE DDL – Create and load a table
Looks like a typical table declaration, except we are specifying the ingested file format:
hive> create table user_movie_ratings(
    > user_id int,
    > movie_id int,
    > rating int,
    > time_unix_ts string)
    > row format delimited
    > fields terminated by '\t'
    > stored as textfile;
OK
Time taken: 0.395 seconds
hive> load data inpath '/user/hive/staging/data/u.data' overwrite into table user_movie_ratings;
Loading data to table default.user_movie_ratings
Deleted hdfs://localhost:54310/user/hive/warehouse/user_movie_ratings
Table default.user_movie_ratings stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 1979173, raw_data_size: 0]
OK
Time taken: 0.474 seconds
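The row format clause tells Hive how to interpret the text file when it is read. The sketch below mimics that idea in plain Python, assuming a tab-delimited file with the four columns declared above. The `UserMovieRating` type, the `read_ratings` helper, and the two sample rows are hypothetical and exist only for this illustration.

```python
import csv
import io
from typing import NamedTuple

class UserMovieRating(NamedTuple):
    """Mirrors the columns declared for user_movie_ratings."""
    user_id: int
    movie_id: int
    rating: int
    time_unix_ts: str

def read_ratings(text_file):
    """Split each line on tabs and cast the fields to the declared types."""
    for fields in csv.reader(text_file, delimiter="\t"):
        user_id, movie_id, rating, ts = fields
        yield UserMovieRating(int(user_id), int(movie_id), int(rating), ts)

# Hypothetical sample standing in for the staged data file.
sample = io.StringIO("1\t10\t5\t881250949\n2\t10\t3\t891717742\n")
rows = list(read_ratings(sample))
print(rows[0])
```

The file itself stays plain text; the structure lives only in the declaration used to read it.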
17. HIVE DDL – Create an external table
This time we don’t want Hive to own this data’s lifecycle:
hive> create external table user (
    > user_id int,
    > age int,
    > gender string,
    > occupation string,
    > postal_code int )
    > row format delimited fields terminated by '|'
    > location '/user/hive/staging/user';
OK
Time taken: 0.096 seconds
18. HIVE – YAY SQL!
hive> select occupation, count(1)
> from user_movie_ratings m
> join user u on u.user_id=m.user_id
> group by occupation;
Total MapReduce jobs = 2
Launching Job 1 out of 2
...
Total MapReduce CPU Time Spent: 47 seconds 170 msec
OK
administrator 7479
artist 2308
doctor 540
educator 9442
engineer 8175
entertainment 2095
….
retired 1609
salesman 856
scientist 2058
student 21957
technician 3506
writer 5536
Time taken: 110.331 seconds
Hmmm..
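For comparison, here is what the query computes, written as a small in-memory Python sketch. The sample rows are invented, and the split into two stages is an assumption: Hive reports two MapReduce jobs, plausibly one for the join and one for the aggregation, but the output above does not say which job does what.

```python
from collections import Counter

# Hypothetical sample rows; the real tables are far larger.
users = {1: "student", 2: "writer", 3: "student"}          # user_id -> occupation
ratings = [(1, 10), (1, 11), (2, 10), (3, 12), (4, 10)]    # (user_id, movie_id)

# Stage 1: inner join on user_id (ratings with no matching user are dropped).
joined = [
    (users[user_id], movie_id)
    for user_id, movie_id in ratings
    if user_id in users
]

# Stage 2: group by occupation and count the rows in each group.
counts = Counter(occupation for occupation, _ in joined)

for occupation, n in sorted(counts.items()):
    print(occupation, n)
```

The logic is trivial at this size; the two minutes Hive spends go to launching and coordinating distributed jobs, not to the computation itself.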
19. PIG
• Powerful high-level programming language
• SQL-ish, small learning curve for SQL and procedural programmers
• Excellent for data transformation, ETL
• Not meant to be an ad hoc query tool; happy doing grunt work
• Plenty of supported file formats and databases, plus the ability to create custom UDFs
20. PIG Example
grunt> lens_users = load '/user/movie_lens/u.user' using PigStorage('|') as (user_id:int, age:int, gender:chararray, occupation:chararray, postal_code:int);
grunt> lens_data = load '/user/movie_lens/u.data' using PigStorage('\t') as (user_id:int, movie_id:int, rating:int, time_unix_ts:chararray);
grunt> joined = join lens_users by user_id, lens_data by user_id;
grunt> grouped = group joined by (occupation);
grunt> results = FOREACH grouped GENERATE COUNT_STAR(joined),*;
grunt> store results into '/user/movie_lens_user_summary';
Interesting: we apply our aggregate functions after grouping.
21. PIG - Results
Grouping in Pig is a fair deviation from SQL: the original elements are preserved in a bag.
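A rough Python analogy of that behavior, using made-up rows: grouping produces one record per key that still carries every original row, and the aggregate is then computed over that collection. The list standing in for a Pig bag and the sample data are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical joined rows: (user_id, occupation, movie_id, rating)
joined = [
    (1, "student", 10, 5),
    (1, "student", 11, 3),
    (2, "writer", 10, 4),
]

# GROUP joined BY occupation: each group keeps its original rows in a "bag".
grouped = defaultdict(list)
for row in joined:
    grouped[row[1]].append(row)

# FOREACH grouped GENERATE COUNT_STAR(joined), *: aggregate over each bag
# and carry the group key and the bag itself along in the output.
results = [(len(bag), occupation, bag) for occupation, bag in grouped.items()]
for record in results:
    print(record)
```

In SQL the grouped rows are gone once the aggregate is computed; here they remain available for further processing.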
22. Summary
Hive:
• Helpful for ETL
• Very good for ad hoc analysis; not necessarily suited for front-end users, but definitely helpful for data analysts
• Directly leverages SQL expertise!!
PIG:
• Great for ETL
• Powerful transformation and processing capabilities
• SQL-like, but different in many ways; it will take some time to master.