2. About Me
david.engfer@gmail.com
@engfer
Meetup organizer for DFWBigData.org
> Hadoop, Cassandra, and all other things
BigData and NoSQL
> Join up!
Sr. Consultant @
> Rapidly growing national IT consulting firm
focused on career development while operating within a
local-office project model
3. What is Hadoop?
0 “framework for running [distributed] applications on
large clusters built of commodity hardware”
–from the Hadoop Wiki
0 Originally created by Doug Cutting
> Named the project after his son’s toy elephant
0 The name “Hadoop” has now evolved
to cover a family of products, but at its
core, it’s essentially just the
MapReduce programming paradigm
+ a distributed file system
7. History
Google white papers:
> Google File System • 2003
> MapReduce • 2004
> BigTable • 2006
White papers + growing pains at scale led to solutions
8. History
White papers → Hadoop Core (c. 2005)
> Google File System • 2003
> MapReduce • 2004
> BigTable • 2006
9. Hadoop Distributed
File System (HDFS)
0 OSS implementation of Google File System (bit.ly/ihXkof)
0 Master/slave architecture
0 Designed to run on commodity hardware
0 Hardware failures assumed in design
0 Fault-tolerant via replication
0 Semi-POSIX compliance; relaxed for performance
0 Unix-like permissions; ties into host’s users & groups
10. Hadoop Distributed
File System (HDFS)
0 Written in Java
0 Optimized for larger files
0 Focus on streaming data (high-throughput > low-latency)
0 Rack-aware
0 Only *nix for production env.
0 Web consoles for stats
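The replication and rack-awareness points above can be sketched roughly. Assuming a default-style placement policy (first replica on the writer's node, second and third on a different rack), a toy Python version might look like this — illustration only, not HDFS's actual Java implementation, and the node/rack names are hypothetical:

```python
import random

def place_replicas(writer_node, topology, replication=3):
    """Sketch of HDFS-style rack-aware placement: replica 1 on the
    writer's node, replicas 2 and 3 on two nodes in another rack.
    `topology` maps rack name -> list of node names (assumed layout)."""
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    first = writer_node
    # pick a rack other than the writer's for the remaining replicas
    other_racks = [r for r in topology if r != rack_of[first]]
    second_rack = random.choice(other_racks)
    second = random.choice(topology[second_rack])
    third = random.choice([n for n in topology[second_rack] if n != second])
    return [first, second, third][:replication]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("dn1", topology)
```

Losing one rack then still leaves at least one replica reachable, which is the point of the policy.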
11. HDFS Client APIs
0 “Shell-like” commands (hadoop dfs [cmd])
> cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp,
du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal,
mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
0 Native Java API
0 APIs for other languages (http://bit.ly/fLgCJC)
> C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa,
Smalltalk, and OCaml
12. Other HDFS Admin Tools
0 hadoop dfsadmin [opts]
> Basic admin utilities for the DFS cluster
> Change file-level replication factors, set quotas, upgrade,
safemode, reporting, etc
0 hadoop fsck [opts]
> Runs distributed file system checking and fixing utility
0 hadoop balancer
> Utility that rebalances block storage across the nodes
13. HDFS Node Types
Master: NameNode
0 Single node responsible for:
> Filesystem metadata operations on the cluster
> Replication and locations of file blocks
0 SPOF =( (see backups below)
CheckpointNode or BackupNode
0 Nodes responsible for:
> NameNode backup mechanisms
Slaves: DataNodes
0 Nodes responsible for:
> Storage of file blocks
> Serving actual file data to clients
14. HDFS Architecture
Clients issue FS/namespace/meta ops to the NameNode
NameNode → BackupNode (namespace backups)
NameNode ↔ DataNodes (heartbeats, balancing, replication, etc.)
DataNodes serve file data to clients and write blocks to local disk
25. Fault Tolerance?
When a DataNode is lost, its blocks are auto-replicated on the
remaining nodes to satisfy the replication factor
28. Fault Tolerance?
NameNode loss = FAIL (requires manual intervention)
Not an EPIC fail, because you have the BackupNode to replay
any FS operations
**automatic failover is in the works
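The re-replication behavior described in these frames can be sketched with hypothetical block and node names (the real logic lives inside the NameNode):

```python
def rereplicate(block_locations, dead_node, live_nodes, factor=3):
    """Sketch: when a DataNode dies, find blocks that fell below the
    replication factor and schedule new copies on surviving nodes.
    block_locations: block id -> set of nodes holding a replica."""
    for block, nodes in block_locations.items():
        nodes.discard(dead_node)                      # node is gone
        # pick replacement targets that don't already hold the block
        candidates = [n for n in live_nodes if n not in nodes]
        while len(nodes) < factor and candidates:
            nodes.add(candidates.pop())
    return block_locations

blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn4", "dn5"}}
healed = rereplicate(blocks, dead_node="dn2",
                     live_nodes=["dn1", "dn3", "dn4", "dn5"])
```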
29. Live horizontal scaling and rebalancing
The NameNode detects that a new DataNode has been added
to the cluster
30. Live horizontal scaling and rebalancing
Blocks are re-balanced and re-distributed across the
DataNodes
33. Live horizontal scaling and rebalancing
Once the replication factor is satisfied, extra replicas are
removed
35. Other HDFS Utils
0 HDFS RAID (http://bit.ly/fqnzs5)
> Uses distributed RAID instead of replication
(useful at petabyte scale)
0 Flume/Scribe/Chukwa
> Log collection and aggregation frameworks that support
streaming log data to HDFS
> Flume = Cloudera (http://bit.ly/gX8LeO)
> Scribe = Facebook (http://bit.ly/dIh3If)
36. MapReduce
0 Distributed programming paradigm and framework that is
the OSS implementation of Google’s MapReduce
(http://bit.ly/gXZbsk)
0 Modeled on the map() and reduce() operations from
functional programming
> Distributed across as many nodes as you would like
0 2-phase process:
> map( ): sub-divide & conquer
> reduce( ): combine & reduce cardinality
37. MapReduce ABC’s
0 Essentially, it’s…
1. Take a large problem and divide it into sub-problems
2. Perform the same function on all sub-problems
3. Combine the output from all sub-problems
0 Ex: Searching
1. Take a large problem and divide it into sub-problems
# Different groups of rows in DB; different parts of files; 1 user from a list of
users; etc.
2. Perform the same function on all sub-problems
# Search for a key in the given partition of data for the sub-problem; count
words; etc.
3. Combine the output from all sub-problems
# Combine the results into a result-set and return to the client
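The three steps above can be sketched in plain Python as a single-process word-count simulation (this is the idea only, not the Hadoop API):

```python
from itertools import groupby

def map_phase(chunk):
    # step 2: run the same function on every sub-problem -- emit (word, 1)
    for word in chunk.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # step 3: combine the output for one key
    return (word, sum(counts))

chunks = ["the quick brown fox", "the lazy dog", "the fox"]  # step 1: sub-problems
pairs = sorted(kv for c in chunks for kv in map_phase(c))    # shuffle/sort by key
result = dict(reduce_phase(k, [v for _, v in g])
              for k, g in groupby(pairs, key=lambda kv: kv[0]))
```

In real Hadoop the chunks would be HDFS input splits and the sort/group step happens in the framework's shuffle phase.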
38. M/R Facts
0 M/R is excellent for problems where the “sub-problems”
are not interdependent
> For example, the output of one “mapper” should not depend
on the output or communication with another “mapper”
0 The reduce phase does not begin execution until all
mappers have finished
0 Failed map and reduce tasks get auto-restarted
0 Rack/HDFS-aware
41. Hadoop’s MapReduce
0 MapReduce tasks are submitted as a “job”
> Jobs can be assigned to a specified “queue” of jobs
# By default, jobs are submitted to the “default” queue
> Job submission is controlled by ACLs for each queue
0 Rack-aware and HDFS-aware
> The JobTracker communicates with the HDFS NameNode
and schedules map/reduce operations using input data
locality on HDFS DataNodes
42. M/R Nodes
Master: JobTracker
0 Single node responsible for:
> Coordinating all M/R tasks & events
> Managing job queues and scheduling
> Maintaining and controlling TaskTrackers
> Moving/restarting map/reduce tasks if needed
0 SPOF =(
> Uses “checkpointing” to combat this
Slaves: TaskTrackers
0 Worker nodes responsible for:
> Executing individual map and reduce tasks as assigned
by the JobTracker (in a separate JVM)
43. Conceptual Overview
The JobTracker controls and heartbeats the TaskTracker nodes
TaskTrackers store temporary data on HDFS
44. Job Submission
M/R clients submit jobs to the JobTracker, where jobs get
queued
map()’s are assigned to TaskTrackers
(HDFS DataNode locality aware)
Mappers are spawned in a separate JVM and execute;
mappers store their results on HDFS (temporary data)
45. Job Submission
The reduce phase then begins: reducers are assigned to
TaskTrackers and read the temporary (mapper) data from HDFS
46. MapReduce Tips
0 Keys and values can be any type of object
> Can specify custom data splitters, partitioners, combiners,
InputFormat’s, and OutputFormat’s
0 Use ToolRunner.run(Tool) to run your Java jobs…
> Will use GenericOptionsParser and DistributedCache so that the
-files, -libjars, & -archives options are available to distribute
your mappers, reducers, and any other utilities
> Without this, your mappers, reducers, and other utilities will
not be propagated and added to the classpath of the other
nodes (ClassNotFoundException)
48. Other M/R Utils
0 $HADOOP_HOME/contrib/*
> PriorityScheduler & FairScheduler
> HOD (Hadoop On Demand)
# Uses TORQUE resource manager to dynamically allocate, use,
and destroy MapReduce clusters on an as-needed basis
# Great for development and testing
> Hadoop Streaming (next slide...)
0 Amazon’s Elastic MapReduce (EMR)
> Essentially production HOD for EC2 data/clusters
49. Hadoop Streaming
0 Allows you to write MapReduce jobs in languages other than
Java by running any command line process
> Input data is partitioned and given to the standard input (STDIN) of
the command line mappers and reducers specified
> Output (STDOUT) from the command line mappers and reducers
get combined into the M/R pipeline
0 Can specify custom partitioners and combiners
0 Can specify files & archives to propagate to all nodes and
unpack on local file system (-archives & -file)
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
-D mapred.job.name="Foo bar" \
-input "/foo/bar/input.txt" \
-mapper splitz.py \
-reducer /bin/wc \
-output "/foo/baz/out" \
-archives 'hdfs://hadoop1/foo/bar/cachedir.jar' \
-file ~/scripts/splitz.py
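`splitz.py` in the command above is a hypothetical mapper. A minimal streaming-style mapper/reducer pair could look like the sketch below — in a real job each function would read STDIN and write STDOUT line by line; here they take and return iterables so the pipeline can be simulated in-process:

```python
def mapper(lines):
    # streaming mapper contract: read raw input lines,
    # emit tab-separated "key\tvalue" pairs
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # streaming reducers see mapper output already sorted by key
    current, total = None, 0
    for line in lines:
        key, value = line.rsplit("\t", 1)
        if current is not None and key != current:
            yield f"{current}\t{total}"   # key changed: emit the count
            total = 0
        current = key
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

# simulate the framework's shuffle/sort between the two phases
mapped = sorted(mapper(["hello world", "hello hadoop"]))
counts = list(reducer(mapped))
```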
50. Pig
0 Framework and language (Pig Latin) for creating and
submitting Hadoop MapReduce jobs
0 Common data operations (not supported by POJO-M/R)
like join, group, filter, sort, select, etc. are provided
0 Don’t need to know Java
0 Removes boilerplate aspect from M/R
> ~200 lines in Java → ~15 lines in Pig!
0 Relational qualities (reads and feels SQL-ish)
51. Pig
0 Fact from Wiki: 40% of Yahoo’s M/R jobs are in Pig
0 Interactive shell (grunt) exists
0 User Defined Functions (UDFs)
> Allow you to specify Java code where the logic may be too
complex for Pig Latin
> UDFs can be part of almost every operation in Pig Latin
> Great for loading and storing custom formats as well as
transforming data
52. Pig Relational Operations
COGROUP, CROSS, DISTINCT, FILTER, FOREACH, GROUP,
JOIN, LIMIT, LOAD, MAPREDUCE, ORDER BY, SAMPLE,
SPLIT, STORE, STREAM, UNION
most of these are pretty self-explanatory
53. Example Pig Script
Taken from the Pig tutorial on the Pig wiki: The Temporal Query Phrase Popularity script processes a
search query log file from the Excite search engine and compares the frequency of occurrence of
search phrases across two time periods separated by twelve hours.
01: REGISTER ./tutorial.jar;
02: raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
03: clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
04: clean2 = FOREACH clean1 GENERATE user, time,
org.apache.pig.tutorial.ToLower(query) as query;
05: houred = FOREACH clean2 GENERATE user,
org.apache.pig.tutorial.ExtractHour(time) as hour, query;
06: ngramed1 = FOREACH houred GENERATE user, hour,
flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
07: ngramed2 = DISTINCT ngramed1;
08: hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
09: hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0),
COUNT($1) AS count;
10: hour_frequency3 = FOREACH hour_frequency2 GENERATE $0 as ngram,
$1 as hour, $2 as count;
11: hour00 = FILTER hour_frequency2 BY hour eq '00';
12: hour12 = FILTER hour_frequency3 BY hour eq '12';
13: same = JOIN hour00 BY $0, hour12 BY $0;
14: same1 = FOREACH same GENERATE hour_frequency2::hour00::group::ngram as
ngram, $2 as count00, $5 as count12;
15: STORE same1 INTO '/tmp/tutorial-join-results' USING PigStorage();
54. Example Pig Script
Same script as the previous slide, annotated: the
org.apache.pig.tutorial.* calls on lines 03–06 are UDFs.
Now... imagine this equivalent in Java...
55. ZooKeeper
0 Centralized coordination service for use by distributed
applications
> Configuration, naming, synchronization (locks), ownership (master
election), etc.
ZooKeeper Service: one elected Leader among several Servers;
many Clients, each connected to one Server
0 Important system guarantees:
> Sequential consistency (great for locking)
> Atomicity – all or nothing at all
> Data consistency – all clients view same system state regardless of
the server it connects to
56. ZooKeeper
0 Hierarchical namespace of “znodes” (like directories)
0 Operations:
> create a node at a location in the tree
> delete a node
> exists - tests if a node exists at a location
> get data from a node
> set data on a node
> get children from a node
> sync - waits for data to be propagated
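The operations above can be modeled with a toy in-memory znode tree — purely illustrative; real clients use a ZooKeeper client library, and this sketch ignores watches, ephemeral/sequential nodes, and the replicated consensus that makes ZooKeeper useful:

```python
class ZNodeTree:
    """Toy hierarchical namespace of znodes (path -> data bytes)."""
    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        assert parent in self.nodes, "parent znode must exist"
        self.nodes[path] = data

    def exists(self, path):
        return path in self.nodes

    def get_data(self, path):
        return self.nodes[path]

    def set_data(self, path, data):
        self.nodes[path] = data

    def get_children(self, path):
        # children are znodes whose parent path equals `path`
        return sorted(p.rsplit("/", 1)[1] for p in self.nodes
                      if p != "/" and (p.rsplit("/", 1)[0] or "/") == path)

    def delete(self, path):
        assert not self.get_children(path), "znode must have no children"
        del self.nodes[path]

zk = ZNodeTree()
zk.create("/app")
zk.create("/app/lock-1", b"owner=worker-a")   # e.g. a lock/ownership marker
```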
57. HBase
0 Sparse, non-relational, column-oriented distributed database
built on top of Hadoop Core (HDFS + MapReduce)
0 Modeled after Google’s BigTable (http://bit.ly/fQ1NMA)
0 NoSQL = “Not Only SQL”... not “SQL is terrible”
0 HBase also has:
> Strong consistency model
> In-memory operation
> LZO compression (optional)
> Live migrations
> MapReduce support for querying
58. What HBase Is…
0 Good at fast/streaming writes
0 Fault tolerant
0 Good at linear horizontal scalability
0 Very efficient at managing billions of rows and millions of
columns
0 Good at keeping row history
0 Good at auto-balancing
0 A complement to a SQL DB/warehouse
0 Great with non-normalized data
59. What HBase Is NOT…
0 Made for table joins
0 Made for splitting into normalized tables (see previous)
0 A complete replacement for a SQL relational database
0 A complete replacement for a SQL data warehouse
0 Great for storing small amounts of data
0 Great for storing gobs of large binary data
0 The best way to do OLTP
0 The best way to do live ad hoc querying of any column
0 A replacement for a proper caching mechanism
0 ACID compliant (http://bit.ly/hhFXCS)
60. HBase Facts
0 Written in Java
0 Uses ZooKeeper to store metadata and -ROOT- region
0 Column-oriented store = flexible schema
> Can alter the schema simply by adding the column name and
data on insert (“put”)
> No schema migrations!
0 Every column has a timestamp associated with it
> Same column with most recent timestamp wins
0 Can export metrics for use with Ganglia, or as JMX
0 hbase hbck
> Check for errors and fix them (like HDFS fsck)
61. HBase Client APIs
0 jRuby interactive shell (hbase shell)
> DDL/DML commands
> Admin commands
> Cluster commands
0 Java API (http://bit.ly/ij0MgF)
0 REST API
> Provided using Stargate
0 API for other languages (http://bit.ly/fLgCJC)
62. Column-Oriented?
0 Traditional RDBMS use row-oriented storage, which stores
entire rows sequentially on disk:
[Row 1 – Cols 1-3] [Row 2 – Cols 1-3] [Row 3 – Cols 1-3]
0 Whereas column-oriented storage stores the columns of
each row (or column-families) sequentially on disk:
[Row 1 – Col 1] [Row 2 – Col 1] [Row 3 – Col 1]
[Row 1 – Col 2] [Row 2 – Col 2] [Row 3 – Col 2]
[Row 1 – Col 3] [Row 3 – Col 3]
Where’s Row 2 – Col 3? Not needed: because columns are stored
sequentially, rows have a flexible schema!
63. Think of HBase Tables As…
0 More like JSON, and less like spreadsheets:
{
  "1" : {                                  // row id
    "A" : { v: "x", ts: 4282 },            // columns
    "B" : { v: "z", ts: 4282 }
  },
  "aaaaa" : {
    "A" : { v: "y", ts: 4282 }
  },
  "xyz" : {
    "address" : {                          // column families allow grouping
      "line1" : { v: "hello", ts: 4282 },  //   of columns (faster retrieval)
      "line2" : { v: "there", ts: 4282 },  // recent TS = default col value
      "line2" : { v: "there", ts: 1234 }   // old TS
    },
    "fooo" : { v: "wow!", ts: 4282 }       // flexible schema
  },
  "zzzzz" : {
    "A" : { v: "woot", ts: 4282 },         // value & timestamp (TS)
    "B" : { v: "1337", ts: 4282 }
  }
}
Modified from http://bit.ly/hbGWIG
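The "most recent timestamp wins" rule behind those versioned cells can be sketched as follows — an illustrative structure with made-up values, not HBase's client API:

```python
# row -> (family, column) -> list of timestamped versions
table = {
    "xyz": {
        ("address", "line2"): [
            {"v": "there", "ts": 4282},   # newest version
            {"v": "here",  "ts": 1234},   # older version kept as history
        ],
    },
}

def get(row, family, column):
    # a plain read returns the version with the newest timestamp
    versions = table[row][(family, column)]
    return max(versions, key=lambda cell: cell["ts"])["v"]

def get_as_of(row, family, column, ts):
    # "time-travel" read: newest version at or before `ts`
    versions = [c for c in table[row][(family, column)] if c["ts"] <= ts]
    return max(versions, key=lambda cell: cell["ts"])["v"]
```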
64. HBase Overview
The Master server keeps track of the metadata for
RegionServers and their Regions and stores it in ZooKeeper
The HBase client communicates with the ZooKeeper cluster
only to get Region information; no data is sent through the
Master
The actual row “data” (bytes) is sent directly to and from the
RegionServers
Therefore, neither the Master server nor the ZooKeeper cluster
serves as a data bottleneck
Pretty diagrams from Lars George:
http://goo.gl/wRLJP & http://goo.gl/6ehnV
65. HBase Overview
All HBase data (HLog and HFiles) is stored on HDFS
HDFS breaks files into 64MB chunks and replicates the chunks
N times (3 by default) onto “actual” disk (giving HBase its fault
tolerance)
Pretty diagrams from Lars George: http://goo.gl/wRLJP
66. Understanding HBase
Tables are split into groups of ~100 rows (configurable) called
Regions
Regions are assigned to particular RegionServers by the Master
server. The Master only contains region-location metadata and
no “real” row data.
Pretty diagrams from Lars George:
http://goo.gl/wRLJP & http://goo.gl/6ehnV
67. Writing to HBase
1) The HBase client gets the assigned RegionServers (and
regions) from the Master server for the particular keys (rows)
in question and sends commands/data
2) The transaction is written to the write-ahead-log on HDFS
(disk) first
3) The same data is written to the in-memory store for the
assigned region (row group)
4) The in-memory store is periodically flushed to HDFS (disk)
when its size reaches a threshold
Pretty diagrams from Lars George:
http://goo.gl/wRLJP & http://goo.gl/6ehnV
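Steps 2–4 of that write path can be sketched as follows. All names here are illustrative (the real structures are HBase's HLog and memstore), and the flush trigger is simplified to a count instead of a byte size:

```python
class RegionWriter:
    """Sketch of the write path: WAL first, memstore second,
    periodic flush to immutable files on 'disk'."""
    def __init__(self, flush_threshold=3):
        self.wal = []          # write-ahead log (would live on HDFS)
        self.memstore = {}     # in-memory store for this region
        self.flushed = []      # immutable files flushed to HDFS
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))      # 2) durable log entry first
        self.memstore[row] = value         # 3) then the in-memory store
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                   # 4) flush once past threshold

    def flush(self):
        self.flushed.append(dict(self.memstore))
        self.memstore.clear()

w = RegionWriter()
for row in ("r1", "r2", "r3", "r4"):
    w.put(row, b"data")
```

Because every put hits the WAL before the memstore, a crash loses no acknowledged writes: the log can be replayed to rebuild the in-memory state.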
68. HBase Scalability
Additional RegionServers can be added to the live system.
The Master server will then rebalance the cluster to migrate
Regions onto the new RegionServers.
Moreover, additional HDFS DataNodes can be added to give
more disk space to the HDFS cluster.
Pretty diagrams from Lars George:
http://goo.gl/wRLJP & http://goo.gl/6ehnV
70. Hive
0 Data warehouse infrastructure on top of Hadoop Core
> Stores data on HDFS
> Allows you to add custom MapReduce plugins
0 HiveQL
> SQL-like language pretty close to ANSI SQL
# Supports joins
> JDBC driver exists
0 Has interactive shell (like MySQL & PostgreSQL) to run
interactive queries
71. Hive
0 When running a HiveQL query/script, in the background Hive
creates and runs a series of MapReduce jobs
> BigData means it can take a long time to run queries
0 Therefore, it’s good for offline BigETL, but not a good
replacement for an OLTP/OLAP data warehouse (like Oracle)
0 Learn more from the wiki: http://bit.ly/epauio

> SHOW TABLES;
> CREATE TABLE rating (
    userid INT,
    movieid INT,
    rating INT,
    unixtime STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;
> DESCRIBE rating;
72. Other useful utilities around
Hadoop
0 Sqoop (http://bit.ly/eRfVEJ)
> Load SQL data from a table into HDFS or Hive
> Generates Java classes to interact with the loaded data
0 Oozie (http://bit.ly/eNLi3B)
> Orchestrates complex workflows around multiple MapReduce jobs
0 Mahout (http://bit.ly/hCXRjL)
> Algorithm library for collaborative filtering, clustering, classifiers,
and machine learning
0 Cascading (http://bit.ly/gyZNiI)
> Data query abstraction layer similar to Pig
> Java API that sits on top of MapReduce framework
> Since it’s a Java API you can use it with any program that uses a JVM
language: Groovy, Scala, Clojure, jRuby, jython, etc.
73. What about support?
0 Community, wikis, forums, IRC
0 Cloudera provides enterprise support
> Offerings:
# Cloudera Enterprise
# Support, professional services, training, management apps
> Cloudera Distribution of Hadoop (CDH)
# Tested and hardened version of Hadoop products plus some
other goodies (oozie, flume, hue, sqoop, whirr)
~ Separate codebase, but patches are made to and from the Apache versions
# Packages: debian, redhat, EC2, VM
If you want to try Hadoop, CDH is probably the way to go.
I recommend this instead of downloading each project individually.
75. Where the heck can I use this stuff?
0 The hardest part is finding the right use-cases to apply Hadoop
(and any NoSQL system)
> SQL databases are great for data that fits on one machine
> Lots of tooling support for SQL; not as much for Hadoop (yet)
0 A few questions to think about:
> How much data are you processing?
> Are you throwing away valuable data due to space?
> Are you processing data where steps aren’t interdependent?
0 Log storage, log processing, utility data, research data,
biological data, medical records, events, mail, tweets, market
data, financial data