1. Dipping Your Toes into the Big Data Pool
Orlando CodeCamp 2014
John Ternent
VP Application Development
TravelClick
2. About Me
20+ years as a consultant, software engineer, architect, and tech executive.
Mostly data-focused: RDBMS, object databases, and big data/NoSQL/analytics/data science.
Presently leading development efforts for TravelClick
Channel Management team.
Twitter : @jaternent
3. Poll : Big Data
How many people are comfortable with the definition?
How many people are “doing” Big Data?
4. Big Data in the Media
The Three (now Four) V’s of Big Data:
Volume (Scale)
Variety (Forms)
Velocity (Streaming)
Veracity (Uncertainty)
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
5. A New Definition
Big Data is about a tool set and an approach that allow for non-linear scalability of solutions to data problems.
“It depends on how capital your B and D are in Big Data…”
What is Big Data to you?
6. The Big Data Ecosystem
[Diagram: the ecosystem as a pipeline from data sources through data storage, data manipulation, data management, and data analysis]
• Data Sources : Sqoop, Flume
• Data Storage : HDFS, HBase
• Data Manipulation : Pig, MapReduce
• Data Management : Zookeeper, Avro, Oozie
• Data Analysis : Hive, Mahout, Impala
8. Great, but What IS Hadoop?
An open-source implementation of Google’s MapReduce framework
Distributed processing on commodity hardware
Distributed file system with high failure tolerance
Can support activity directly on top of the distributed file system (MapReduce jobs, Impala, Hive queries, etc.)
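To make that concrete, here is a minimal word count in Pig Latin (a sketch, not from the original slides; the paths are hypothetical). Pig compiles these few lines into MapReduce jobs that run in parallel across the cluster:
lines = LOAD '/data/sample-text/*' USING TextLoader AS (line:chararray);
-- TOKENIZE splits each line into a bag of words; FLATTEN turns the bag into rows
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- the GROUP triggers the distributed shuffle; COUNT runs in the reduce phase
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/data/wordcount-output';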
11. Example : Log File Processing
A = LOAD '/Users/jternent/Documents/logs/api*' USING TextLoader AS (line:chararray);
-- parse each Apache-style access log line into twelve typed fields
B = FOREACH A GENERATE FLATTEN(
    (tuple(chararray, chararray, chararray, chararray, chararray, int, int, chararray, chararray, int, int, int))
    REGEX_EXTRACT_ALL(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\d+) (\\d+) (\\d+)'))
    AS (forwarded_ip:chararray, rem_log:chararray, rem_user:chararray, ts:chararray,
    req_url:chararray, result:int, resp_size:int, referrer:chararray, user_agent:chararray,
    svc_time:int, rec_bytes:int, resp_bytes:int);
B1 = FILTER B BY ts IS NOT NULL;                        -- drop lines the regex did not match
B2 = FILTER B1 BY req_url MATCHES '.*(fetch|update).*'; -- keep only fetch/update requests
B3 = FOREACH B2 GENERATE *, REGEX_EXTRACT(req_url, '^\\w+ /(\\S+)[?]* \\S+', 1) AS req;
C = FOREACH B3 GENERATE forwarded_ip,
    GetMonth(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) AS month,
    GetDay(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) AS day,
    GetHour(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) AS hour,
    req, result, svc_time;
D = GROUP C BY (month, day, hour, req, result);
E = FOREACH D GENERATE FLATTEN(group), MAX(C.svc_time) AS max, MIN(C.svc_time) AS min, COUNT(C) AS count;
STORE E INTO '/Users/jternent/Documents/logs/ezy-logs-output' USING PigStorage();
12. Another Real-World Example
2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,"eventTime":"Aug 10, 2013 4:03:50 AM","hotelId":8186,"channelId":9173,"submissionId":1376121011,"sessionId":null,"documentId":"9173SS8186_13761210111434582378cds.txt","queueName":"expedia-dx","roomCount":1,"submissionDayCount":1,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":2,"submissionStatusCode":0}
2013-08-10T04:03:53-04:00 INFO (6): {"eventType":2,"eventTime":"Aug 10, 2013 4:03:53 AM","hotelId":8525,"channelId":50091,"submissionId":1376116653,"sessionId":null,"documentId":"50091SS8525_13761166531434520293cds.txt","queueName":"expedia-dx","roomCount":5,"submissionDayCount":2,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":1,"submissionStatusCode":null}
Roughly 100 million of these per week: about 25MB zipped per server per day (15 servers right now), 750MB uncompressed.
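A hedged first pass in Pig over lines like these (the path, aliases, and chosen fields are illustrative, not from the talk): REGEX_EXTRACT splits the syslog-style prefix from the JSON payload, and targeted regexes (or a JSON UDF) can then pull out individual fields.
raw = LOAD '/logs/submissions*' USING TextLoader AS (line:chararray);
-- group 1 is the timestamp prefix, group 4 is the JSON payload
events = FOREACH raw GENERATE
    REGEX_EXTRACT(line, '^(\\S+) (\\w+) \\((\\d+)\\): (\\{.*\\})$', 1) AS ts:chararray,
    REGEX_EXTRACT(line, '^(\\S+) (\\w+) \\((\\d+)\\): (\\{.*\\})$', 4) AS payload:chararray;
-- pick individual JSON fields out of the payload with targeted regexes
typed = FOREACH events GENERATE ts,
    (int) REGEX_EXTRACT(payload, '"eventType":(\\d+)', 1) AS event_type,
    (int) REGEX_EXTRACT(payload, '"hotelId":(\\d+)', 1) AS hotel_id;
DUMP typed;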
13. Pig Example - Pros and Cons
Pros:
No need to ETL into a database; everything runs straight off the file system
Same development for one file as 10,000 files
Horizontally scalable
UDFs allow fine-grained control
Flexible
Cons:
Language can be difficult to work with
Every MapReduce query scans ALL of the data to produce an answer (compare to an indexed search)
14. Unstructured and Semi-Structured Data
Big Data tools can help with the analysis of data that would be more challenging in a relational database:
Twitter feeds (Natural Language Processing)
Social network analysis
Big Data approaches to search are making search tools more accessible and useful than ever
ElasticSearch
16. Analytics with Big Data
Apache Mahout
Machine learning on Hadoop
Recommendation
Classification
Clustering
RHadoop
An R implementation of MapReduce over HDFS
Tableau
Visualization on HDFS/Hive
Main point : you don’t have to roll your own for everything; many tools now use HDFS natively
17. Return to SQL
Many SQL dialects have been (or are being) ported to Hadoop
Hive : create tables via DDL on top of HDFS structures
CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^]*) ([^]*) ([^]*) (-|[^]*]) ([^
"]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^
"]*|".*"))?"
)
STORED AS TEXTFILE;
SELECT host, COUNT(*)
FROM apachelog
GROUP BY host;
18. Cloudera Impala
Moves SQL processing onto each distributed node
Written for performance
Distribution and reduction of the query are handled by the Impala engine
19. Big Data Tradeoffs
Time tradeoff – loading/building/indexing vs. runtime
ACID properties – different distribution models may compromise one or more of these properties
Be aware of what tradeoffs you’re making
TANSTAAFL (“there ain’t no such thing as a free lunch”) – massive scalability on commodity hardware, but at what price?
Tool sophistication
20. NoSQL – “Not Only SQL”
Sacrificing ACID properties for different scalability benefits.
Key/Value Store : SimpleDB, Riak, Redis
Column Family Store : Cassandra, HBase
Document Database : CouchDB, MongoDB
Graph Database : Neo4J
General properties
High horizontal scalability
Fast access
Simple data structures
Caching
21. Getting Started
Play in the sandbox – Hadoop/Hive/Pig local mode or AWS
Randy Zwitch has a great tutorial on this :
http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/
Using airline data (see the sketch after this list) :
http://stat-computing.org/dataexpo/2009/the-data.html
Kaggle competitions (data science)
Lots of big data sets are available; look for machine learning repositories
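As a starting point, here is a minimal local-mode sketch over the airline data (the file name and the column layout are assumptions based on the 1987-2008 on-time CSVs); run it with pig -x local:
-- average arrival delay by carrier; only the first 15 columns are declared,
-- and Pig truncates the trailing fields it is not told about
flights = LOAD '2008.csv' USING PigStorage(',')
    AS (year:int, month:int, day:int, dow:int,
        dep_time:chararray, crs_dep:chararray, arr_time:chararray, crs_arr:chararray,
        carrier:chararray, flight_num:chararray, tail_num:chararray,
        elapsed:chararray, crs_elapsed:chararray, air_time:chararray,
        arr_delay:int);
clean = FILTER flights BY arr_delay IS NOT NULL;  -- drops the header row and NA values
by_carrier = GROUP clean BY carrier;
avg_delay = FOREACH by_carrier GENERATE group AS carrier, AVG(clean.arr_delay) AS avg_arr_delay;
DUMP avg_delay;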
23. MOOCs
Unprecedented access to very high-quality online courses, including
Udacity : Data Science Track
Intro to Data Science
Data Wrangling with MongoDB
Intro to Hadoop and MapReduce
Coursera :
Machine Learning course
Data Science Certificate Track (R, Python)
Waikato University : Weka
25. Outro
We live in exciting times!
Confluence of data, processing power, and algorithmic sophistication.
More data is available to make better decisions more easily than at any other time in human history.