3. Commodore Amiga 500 (1990)
Memory: 512K
Atari ST (1985)
Memory: 512K
Macintosh (1984)
Memory: 128K
Wait a sec… Are we in the 80’s?!
4. What are the volumes of data that we are seeing today?
● 30 billion pieces of content were added to Facebook this past month by more than 600 million users
● 2.7 billion likes are made daily on and off of the Facebook site
● More than 2.5 billion videos were watched on YouTube… yesterday!
● 1.2 million deliveries per second
● 35 billion searches were performed last month on Twitter
5. What does the future look like?
● Worldwide IP traffic will quadruple by 2015.
● Nearly 3 billion people will be online pushing the data
created and shared to nearly 8 zettabytes.
○ 1 Zettabyte = 1024 Exabytes = 1024^2 Petabytes = 1024^3 Terabytes = 1024^4 Gigabytes = 1024^5 Megabytes = 1024^6 Kilobytes
○ 8 ZB = 9,223,372,036,854,775,808 KB
● Two-thirds of surveyed businesses in North America said
big data will become a concern for them within the
next five years.
6. Houston,
We have a problem!!!
A new IDC study says the market for big data technology and services will grow from $3.2 billion in
2010 to $16.9 billion in 2015. That’s a compound annual growth rate of about 40%!
7. What is Big Data?
“When your data sets become so large that
you have to start innovating to collect,
store, organize, analyze and share”
8. From WWW to VVV
● Volume
○ data volumes are becoming unmanageable
● Variety
○ data complexity is growing
○ more types of data are captured than previously
● Velocity
○ some data is arriving so rapidly that it must either be processed instantly, or lost
○ this is a whole subfield called “stream processing”
9.
10. Sources of Data
Computer Generated
● Application server logs (websites, games)
● Sensor data (weather, water, smart grids)
● Images/videos (traffic surveillance, security cameras)
Human Generated
● Twitter/Facebook
● Blogs/Reviews/Emails
● Images/Videos
● Social graphs: Facebook, LinkedIn
11. Types of Data
● Relational Data (Tables/Transactions/Legacy Data)
● Text Data (Web)
● Semi-structured Data (XML)
● Graph Data: Social Networks, Semantic Web (RDF), …
● Streaming Data
12. What to do with these data?
● Aggregation and Statistics
○ Data warehouse and OLAP
● Indexing, Searching, and Querying
○ Keyword based search
○ Pattern matching (XML/RDF)
● Knowledge discovery
○ Machine Learning
○ Data Mining
○ Statistical Modeling
15. Hadoop - inspired by Google
● Apache Hadoop project
○ inspired by Google MapReduce implementation
and Google File System papers
● Open-source, flexible and available architecture for
large scale computation and data processing on a
network of commodity hardware
● Open Source Software + Commodity Hardware
○ IT Cost Reduction
16. Hadoop Concepts
● Distribute the data as it is initially stored in the system
● Bring the processing to the data
● Users can focus on developing applications
17. Hadoop Versions
● Hadoop version 1 (HDFS + MapReduce)
○ hadoop-1.2.X
● Hadoop Version 2 (HDFS + MR2 + YARN)
○ hadoop-2.2.X
○ hadoop-0.23.X
■ same as 2.2.X but missing NN HA (NameNode High Availability)
18. Enterprise Hadoop
● Cloudera
○ Oldest company providing enterprise Hadoop
○ CDH
○ Cloudera Manager
● Hortonworks
○ Spun off from the Yahoo! Hadoop team
○ Biggest contributor to Hadoop
○ HDP (Hortonworks Data Platform)
● MapR
19. Hadoop Components
● Two core components
○ Hadoop Distributed Filesystem (HDFS)
○ MapReduce software framework
● Components around Hadoop
○ Often referred to as the ‘Hadoop Ecosystem’
○ Pig, Hive, HBase, Flume, Oozie, Sqoop
20. Hadoop Components: HDFS
● HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
● Two roles:
○ NameNode (NN): records metadata
○ DataNode (DN): stores data
22. HDFS Structure
HDFS has a master/slave architecture for the filesystem
structure, it has two main layers:
● Namespace, which consists of directories, files and
blocks. It supports the file system operations.
● Block storage service, which offers Block
Management and Storage:
○ The Block Management service, provided by the NN, supports block-related operations, maintains block locations and manages block replicas.
○ The Storage service, provided by the DNs, allows read/write access to blocks on the local storage of the node.
24. File System Read Operations
1. The client contacts the NameNode indicating the file it wants to read
2. The client identity is validated and checked against the owner and permissions of the file
3. The NameNode responds with the list of DataNodes that host replicas of the blocks of the file
4. The client contacts the DataNodes based on the topology provided by the NameNode and requests the transfer of the desired block
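All of these steps happen behind a single client call; as a minimal sketch (the file path below is an assumed example, not from the slides):
$ hadoop fs -cat /user/ahmed/access.log | head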
25. File System Write Operations
1. The client asks the NameNode to choose DataNodes to host replicas of the first block of the file
2. The NameNode grants permission to the client and responds with a list of DataNodes for the block
3. The client organizes a node-to-node pipeline and sends the data
4. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block
5. The NameNode responds with a new list of DataNodes, which is likely to be different
6. The client organizes a new pipeline and sends the further blocks of the file
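Again, the whole write pipeline is hidden behind one client call; a minimal sketch (local file name and HDFS target path are assumed examples):
$ hadoop fs -put access.log /user/ahmed/access.log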
26. Hadoop Components: MapReduce
● Programming model for processing and generating large data sets
● The computation takes some input data, which gets mapped using code written by the user
● The mapped data then gets reduced using another piece of user-written code
● It works like a pipeline:
$ cat file | grep something | sort | uniq -c
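(Loosely, grep plays the role of the user’s map code, sort of the framework’s shuffle/sort phase, and uniq -c of the reduce code; the analogy is only approximate.)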
27. MapReduce Features
● Automatic parallelization and distribution
● Automatic re-execution on failure
● Locality optimizations
● Abstracts the “housekeeping” away from the developer
○ Developers concentrate on writing MapReduce functions
[Diagram: JobTracker, TaskTracker, MapReduce History Server]
28. MapReduce Features
● TaskTracker is a node in the cluster that accepts tasks (Map, Shuffle and Reduce operations) from a JobTracker
● JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster
● History Server allows the user to get status on finished applications; currently it only supports MapReduce and provides information on finished jobs
[Diagram: JobTracker, TaskTracker, MapReduce History Server]
29.
30. Hadoop Components: YARN (MR2)
The new architecture, introduced in hadoop-0.23.x and hadoop-2.x, divides the two major functions of the JobTracker, resource management and job life-cycle management, into separate components.
32. YARN Components
● ResourceManager (RM) is the ultimate authority that arbitrates resources among all the applications in the system.
● NodeManager (NM) is the per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.
● ApplicationsManager (ASM) is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
33. Hadoop Ecosystem
● Hadoop has become the
kernel of the distributed
operating system for Big
Data
● No one uses the kernel
alone
● A collection of projects
at Apache
34. Hadoop Components: HBase
● Low-latency, distributed, non-relational database built on top of HDFS
● Inspired by Google’s Bigtable
● Data is stored in semi-columnar format, partitioned by rows into regions
● A single table can typically accommodate hundreds of terabytes
35. Hadoop Components: Sqoop
● Exchanges data with relational databases
● Short for “SQL to Hadoop”
● Performs bidirectional transfer between Hadoop and almost any database with a JDBC driver
● Includes native connectors for MySQL and PostgreSQL
● Free connectors exist for Teradata, Netezza, SQL Server and Oracle
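For a feel of the shape of a Sqoop job, a minimal import sketch (the JDBC URL, table name and target directory are assumed examples, not from the slides):
$ sqoop import --connect jdbc:mysql://dbhost/shop --table orders --target-dir /user/ahmed/orders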
36. Hadoop Components: Flume
● Streaming data collection and aggregation system designed to transport massive volumes of data into systems such as Hadoop
● Simplifies reliable streaming data delivery from a variety of sources including RPC services, log4j appenders and syslog
● Data can be routed, load-balanced, replicated to multiple destinations and aggregated from thousands of hosts
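As a rough sketch of how an agent is wired together, a minimal Flume NG properties file (the agent/source/channel/sink names, the port and the HDFS path are all assumptions):
# syslog-over-TCP source -> in-memory channel -> HDFS sink
agent1.sources = syslog1
agent1.channels = mem1
agent1.sinks = hdfs1
agent1.sources.syslog1.type = syslogtcp
agent1.sources.syslog1.host = 0.0.0.0
agent1.sources.syslog1.port = 5140
agent1.sources.syslog1.channels = mem1
agent1.channels.mem1.type = memory
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = /flume/syslog
agent1.sinks.hdfs1.channel = mem1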
37. Hadoop Components: Pig
● Created to simplify the authoring of MapReduce jobs, so there is no need to write Java code
● Users write data processing jobs in a high-level scripting language, from which Pig builds an execution plan and executes a series of MapReduce jobs
● Developers can extend its set of built-in operations by writing user-defined functions in Java
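For a flavor of that scripting language (Pig Latin), a minimal sketch of a page-hit count (the input path and field name are assumed examples):
logs = LOAD '/user/ahmed/actions' AS (action:chararray);
grpd = GROUP logs BY action;
counts = FOREACH grpd GENERATE group, COUNT(logs);
DUMP counts;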
38. Hadoop Components: Hive
● Creates a relational database-style abstraction that allows the developer to write a dialect of SQL
● Hive’s dialect of SQL is called HiveQL and implements only a subset of the common standards
● Hive works by defining a table-like schema over an existing set of files in HDFS
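For a flavor of HiveQL, a minimal sketch over tab-separated files (the table name, columns and location are assumed examples):
CREATE EXTERNAL TABLE page_hits (action STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/ahmed/page_hits';
SELECT action, SUM(hits) FROM page_hits GROUP BY action;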
39. Hadoop Components: Oozie
● Workflow engine and scheduler built specifically for large-scale job orchestration on Hadoop
● Workflows can be triggered by time or by events such as data arriving in a directory
● Major flexibility (start, stop, suspend and re-run jobs)
40. Hadoop Components: Hue
● Hadoop User Experience
● Apache-licensed open source project
● Hue is a web UI for Hadoop
● Platform for building custom applications with a nice UI library
42. Hadoop Components: ZooKeeper
● Centralized service for maintaining:
○ Configuration information
○ Distributed synchronization
● Designed to store coordination data:
○ Status information
○ Configuration
○ Location information
● Used to implement reliable messaging and redundant services
46. OS Selection and Preparation
● Deployment layout
○ Hadoop home
○ DataNode data directories
○ NameNode directories
● Software
○ Java, cron, ntp, ssh, rsync, postfix/sendmail
● Hostnames, DNS and Identification
● Users, Groups, and Privileges
47. Network Design
[Diagram: a 2-tier tree network (Core switches → top-of-rack (TOR) switches → hosts) compared with a 3-tier tree network (Core → Aggregation → TOR switches → hosts)]
49. How Streaming Works
The mapper and the reducer read their input from stdin (line by line) and emit their output to stdout.
● Each mapper task launches the executable as a separate process
● It converts its inputs into lines and feeds the lines to the stdin of the process
● The mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair
● Each reducer task launches the executable as a separate process
● It converts its input key/value pairs into lines and feeds the lines to the stdin of the process
● The reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair
51. What do we want to do?
We want to extract how many hits each page receives, so filtering the log lines above should yield:
'/': 5
'/about': 2
'/feedback': 1
52. The Mapper
#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
    chomp;
    # split the log line on '-' into ip, date and the request string
    my ($ip, $date, $action) = split('-', $_);
    # keep only the requested page path
    $action =~ s/^ "Get (.*)"$/$1/;
    # emit: <page path> <TAB> 1
    print "$action\t1\n";
}
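The mapper can be tested on its own before involving Hadoop, e.g. (using the same ‘log’ file as in slide 54):
$ perl Mapper.pl < log | head
Each emitted line is a page path, a tab and the count 1.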
53. The Reducer
#!/usr/bin/perl
use strict;
use warnings;

my %actions;
while (<>) {
    chomp;
    # each input line is: <page path> <TAB> <count>
    my ($action, $count) = split("\t", $_);
    # sum the counts per page path
    if (exists $actions{$action}) {
        $actions{$action} = $actions{$action} + $count;
    } else {
        $actions{$action} = $count;
    }
}
# print the totals, sorted by page path
foreach my $c (sort { $a cmp $b } keys %actions) {
    print "'$c': $actions{$c}\n";
}
54. The Output
Now redirecting ‘log’ into Mapper.pl and piping the output to Reducer.pl yields:
$ perl Mapper.pl < log | perl Reducer.pl
'/': 5
'/about': 2
'/feedback': 1
55. Running over Hadoop
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /home/ahmed/Mapper.pl \
    -reducer /home/ahmed/Reducer.pl
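If Mapper.pl and Reducer.pl are not already present on every node, a common variant ships them with the job via the streaming -file option (same paths as above):
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper Mapper.pl \
    -reducer Reducer.pl \
    -file /home/ahmed/Mapper.pl \
    -file /home/ahmed/Reducer.pl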