CHAPTER-1
HADOOP ECOSYSTEM 2.X.X
As Big Data evolves rapidly, its processing frameworks are evolving just as quickly. Hadoop has
progressed from the restricted, batch-oriented MapReduce processing model of Hadoop 1.0 to the
specialized and interactive processing models of Hadoop 2.0. With Hadoop 2.0, organizations can
build data-crunching methodologies within Hadoop that the architectural limitations of Hadoop
1.0 did not allow. This chapter gives readers an insight into Hadoop 2.0 (YARN) and helps them
understand the need to switch from Hadoop 1.0 to Hadoop 2.0.
Evolution of Hadoop 2.0 (YARN) - Swiss Army Knife of Big Data
Hadoop was introduced in 2005 to support distributed processing of large-scale data workloads
on clusters through the MapReduce processing engine, and it has been substantially reworked
since then. The result is a more advanced Hadoop framework that supports not only MapReduce
but also various other distributed processing models.
Large web companies such as Google, Yahoo and Facebook that adopted Apache Hadoop had to
depend on the combination of Hadoop HDFS with the resource management environment and the
MapReduce programming model. Together these technologies enabled users to store, manage and
process huge amounts of structured, semi-structured or unstructured data within Hadoop
clusters. Nevertheless, the Hadoop-MapReduce pairing had certain intrinsic drawbacks. For
instance, because of the batch-processing arrangement of MapReduce, Google and other users of
Hadoop 1.0 had trouble keeping up with the flood of information they were collecting online.
Figure: - 1
What Is Hadoop?
To get started, let’s look at a simple definition of the tool that the utilities discussed below
support. Apache Hadoop is a framework that allows for the distributed processing of large data
sets across clusters of commodity computers using a simple programming model. It is an
open-source data management system with scale-out storage and distributed processing. It is
designed with big data in mind and is ideal for large amounts of information.
The Hadoop Ecosystem
The Hadoop Ecosystem consists of tools for data analysis, moving large amounts of unstructured
and structured data, data processing, querying data, storing data, and other similar data-oriented
processes. These utilities each serve a unique purpose and are geared toward different tasks
performed through Hadoop or different user roles interacting with it.
Data Storage
HDFS (Hadoop Distributed File System) is the key storage component of Hadoop. HDFS is used to
store and access huge files and is based on a client/server architecture. It also distributes
and stores the data across the nodes of a Hadoop cluster.
HBase (Hadoop Database) is a column-oriented database built on top of HDFS. Being a file
system, HDFS lacks random read and write capability. This is where HBase steps in, providing
fast record lookups in large tables.
Data Processing
MapReduce is a framework for processing data in parallel across a cluster. For large data sets
it can save a great deal of time. As an illustration, if it takes a conventional relational
database around 20 hours to process a large data set, it might take MapReduce on a suitably
sized cluster only around three minutes to get everything done.
YARN (Yet Another Resource Negotiator) is a resource manager. It is often described as the
second generation of MapReduce and is a critical advancement over Hadoop 1. YARN plays the role
of an operating system for the cluster: its job is to manage and monitor workloads, serve
multiple clients and enforce security controls. In addition, YARN supports new processing
models that MapReduce does not.
Data Access
Hive provides a SQL-like query language (HiveQL) on top of Hadoop. It was created to help
people who are familiar with traditional databases and SQL leverage Hadoop and MapReduce.
Pig is used to analyze large data sets. It is made up of two components: first, a platform that
executes Pig programs; second, a simple yet powerful scripting language called Pig Latin, which
is used to write those programs.
Mahout provides a library of popular machine learning algorithms, written in Java, that
supports collaborative filtering, clustering, and classification.
Avro is a data serialization system. It uses JSON to define data types and protocols in support
of data-driven applications. Avro integrates simply with many different languages, so that
Hadoop applications can be written in languages other than Java (e.g. Python, C++).
Sqoop (SQL + Hadoop = Sqoop) is a command line interface application, which helps transfer
data between Hadoop and relational databases (e.g. MySQL or Oracle) or mainframes.
Data Management
Oozie is a workflow scheduler for Hadoop. Oozie streamlines the process of creating workflows
and coordinating jobs across Hadoop components such as MapReduce, Pig, Sqoop and Hive. Its main
responsibilities are, first, to define a sequence of actions to be executed and, second, to
place triggers for those actions.
Chukwa is a framework built on top of HDFS and MapReduce. Its purpose is to provide a dynamic
and powerful data collection system. Chukwa can monitor and analyze the collected data and
present the results, so that the most can be made of that data.
Like Chukwa, Flume is a scalable and reliable system for collecting and moving cluster logs
from various sources to a centralized store. However, there are some differences. In Flume,
chunks of data are transferred from node to node in a store-and-forward manner, while in
Chukwa the agent on each machine must determine what data to send.
ZooKeeper is a coordination service for distributed systems. It provides a very simple
programming interface and helps reduce management complexity by providing services such as
configuration, distributed synchronization, naming and group services.
CHAPTER-2
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
The Hadoop file system was developed using a distributed file system design and runs on
commodity hardware. Unlike many other distributed systems, HDFS is highly fault-tolerant and is
designed to be deployed on low-cost hardware.
HDFS holds very large amounts of data and provides easy access to it. To store such huge data,
files are spread across multiple machines. They are stored redundantly to protect the system
against data loss in case of failure. HDFS also makes the data available to applications for
parallel processing.
Features of HDFS
It is suitable for the distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the namenode and datanode help users to easily check the status of
the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
Namenode
The namenode is commodity hardware that runs the GNU/Linux operating system and the namenode
software, which is itself designed to run on commodity hardware. The system hosting the
namenode acts as the master server and performs the following tasks:
Manages the file system namespace.
Regulates clients’ access to files.
Executes file system operations such as renaming, closing, and opening files and
directories.
Figure: - 2
Datanode
The datanode is commodity hardware running the GNU/Linux operating system and the datanode
software. For every node (commodity hardware/system) in a cluster, there is a datanode.
These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file system, as per client requests.
They also perform operations such as block creation, deletion, and replication according
to the instructions of the namenode.
Block
Generally, user data is stored in the files of HDFS. Each file is divided into one or more
segments, which are stored on individual datanodes. These file segments are called blocks. In
other words, a block is the minimum amount of data that HDFS can read or write. The default
block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x), and it can be changed as needed in
the HDFS configuration.
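The block arithmetic can be illustrated with a minimal Python sketch. It is not HDFS code: the function names are hypothetical, and the constants assume the Hadoop 2.x default block size of 128 MB and the default replication factor of 3, both of which are configurable per cluster.

```python
# Assumed defaults (configurable in a real cluster): 128 MB blocks, 3 replicas.
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB
REPLICATION = 3


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes in bytes of the blocks a file of file_size bytes is cut into."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the bytes that are left over.
        sizes.append(remainder)
    return sizes


def raw_storage(file_size, replication=REPLICATION):
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication
```

Under these assumptions a 300 MB file is cut into three blocks of 128 MB, 128 MB and 44 MB, and it occupies 900 MB of raw disk space across the cluster.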
Goals of HDFS
Fault detection and recovery : Since HDFS runs on a large number of commodity hardware
components, component failure is frequent. Therefore HDFS should have mechanisms
for quick and automatic fault detection and recovery.
Huge datasets : HDFS should scale to hundreds of nodes per cluster to manage
applications having huge datasets.
Hardware at data : A requested task can be done efficiently when the computation
takes place near the data. Especially where huge datasets are involved, this reduces
network traffic and increases throughput.
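The "hardware at data" goal can be sketched as a scheduling preference. The following Python fragment is a deliberately simplified illustration, not the scheduler Hadoop actually uses (which also weighs rack locality and queueing policy); the function name `pick_node` and the node names are hypothetical.

```python
def pick_node(block_replicas, free_nodes):
    """Choose where to run a task that reads one block.

    block_replicas: nodes that store a replica of the block, in preference order.
    free_nodes: set of nodes that currently have capacity for a task.
    Returns (node, is_local); is_local is True when no network copy is needed.
    """
    for node in block_replicas:
        if node in free_nodes:
            return node, True  # computation runs where the data already is
    if not free_nodes:
        raise RuntimeError("no free node available")
    # Fall back to any free node; the block must then travel over the network.
    return sorted(free_nodes)[0], False
```

For a block replicated on `dn2` and `dn3` with `dn1` and `dn3` free, the sketch returns `dn3` as a local placement; if only `dn1` and `dn4` are free, it falls back to `dn1` and the read becomes remote.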
CHAPTER-3
MAPREDUCE
MapReduce is a framework with which we can write applications that process huge amounts of
data in parallel on large clusters of commodity hardware in a reliable manner.
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. The map task
takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The reduce task takes the output from a map as its
input and combines those data tuples into a smaller set of tuples. As the sequence of the
name MapReduce implies, the reduce task is always performed after the map task.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and
the reduce stage.
o Map stage : The mapper’s job is to process the input data. Generally the input data
is in the form of a file or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.
o Reduce stage : This stage is the combination of the shuffle stage and
the reduce stage. The reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which is stored in
HDFS.
During a MapReduce job, Hadoop sends the map and reduce tasks to the appropriate
servers in the cluster.
The framework manages all the details of data passing, such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with the data on local disks, which reduces
network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
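The three stages can be imitated on a single machine to make the data flow concrete. The following is a minimal Python sketch using word counting as the job; it does not use the Hadoop API, all function names are hypothetical, and the line index stands in for the byte offset that Hadoop would pass as the input key.

```python
from collections import defaultdict


def map_fn(_line_no, line):
    """Map stage: turn one input line into (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle stage: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items(), key=lambda item: item[0])


def reduce_fn(word, counts):
    """Reduce stage: collapse the list of values for one key."""
    yield word, sum(counts)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    intermediate = [
        pair for line_no, line in enumerate(lines) for pair in map_fn(line_no, line)
    ]
    return [
        result
        for key, values in shuffle(intermediate)
        for result in reduce_fn(key, values)
    ]
```

Calling `run_job(["big data", "big cluster"])` yields the pairs `("big", 2)`, `("cluster", 1)` and `("data", 1)`. In a real cluster the map and reduce calls run on different nodes and the shuffle moves data between them over the network.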
Figure: - 3
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on <key, value> pairs; that is, the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.
The key and value classes must be serializable by the framework and hence need to implement
the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. Input and output types of
a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
          Input               Output
Map       <k1, v1>            list (<k2, v2>)
Reduce    <k2, list(v2)>      list (<k3, v3>)
Figure: - 4
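The type flow in the table can be written down as function signatures. The sketch below uses Python type hints rather than the Hadoop Writable classes, so it only mirrors the shape of the contract; the names `Mapper`, `Reducer`, `run`, `parse_sale` and `total`, and the sales records, are hypothetical. The sort on the intermediate key shows why that key type must be comparable.

```python
from itertools import groupby
from operator import itemgetter
from typing import Callable, Iterable, List, Tuple, TypeVar

K1, V1 = TypeVar("K1"), TypeVar("V1")
K2, V2 = TypeVar("K2"), TypeVar("V2")
K3, V3 = TypeVar("K3"), TypeVar("V3")

# map:    <k1, v1>       -> list(<k2, v2>)
Mapper = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce: <k2, list(v2)> -> list(<k3, v3>)
Reducer = Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]]


def run(records, mapper, reducer):
    """Apply mapper, sort and group by k2, then apply reducer to each group."""
    intermediate = [pair for k1, v1 in records for pair in mapper(k1, v1)]
    intermediate.sort(key=itemgetter(0))  # only possible if k2 values are comparable
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(k2, [v2 for _, v2 in group]))
    return output


def parse_sale(_line_no: int, line: str) -> Iterable[Tuple[str, int]]:
    """k1 = line number, v1 = text line; emits k2 = customer, v2 = amount."""
    customer, amount = line.split(",")
    yield customer, int(amount)


def total(customer: str, amounts: List[int]) -> Iterable[Tuple[str, int]]:
    """k2 = customer, list(v2) = amounts; emits k3 = customer, v3 = total."""
    yield customer, sum(amounts)
```

With the invented records `[(0, "alice,30"), (1, "bob,10"), (2, "alice,25")]`, `run(records, parse_sale, total)` returns `[("alice", 55), ("bob", 10)]`: the input key is an integer, while the intermediate and output keys are strings.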
Terminology
PayLoad - Applications implement the Map and the Reduce functions, and form the
core of the job.
Mapper - Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where data is presented in advance before any processing takes place.
MasterNode - Node where JobTracker runs and which accepts job requests from clients.
SlaveNode - Node where Map and Reduce program runs.
JobTracker - Schedules jobs and tracks the jobs assigned to the Task Tracker.
Task Tracker - Tracks the task and reports status to JobTracker.
Job - An execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
CHAPTER-4
PROJECT
Aim:- Temperature Data Analysis.
The National Climatic Data Center (NCDC) is responsible for preserving, monitoring, assessing,
and providing public access to weather data. A log file is created to store all this
information. This file includes various types of climate data, such as temperature, wind flow
and its direction, and information related to cyclones and weather changes; the temperature of
each day is also noted.
Through this project we analyze the temperature variation over whole months and years. With
the help of the MapReduce technique we can calculate the highest and lowest temperature, or
the hottest or coolest day, of a month or year.
After going through the wordcount MapReduce guide, you now have the basic idea of how a
MapReduce program works. So, let us look at a more complex MapReduce program on a weather
dataset. Here I am using one of the 2015 datasets for Austin, Texas. We will run analytics on
the dataset and classify each day as a hot day or a cold day depending on the temperature
recorded by NCDC.
NCDC gives us all the weather data we need for this MapReduce project.
The dataset which we will be using looks like the snapshot below.
Figure: - 5
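The logic of the analysis can be prototyped without a cluster. The Python sketch below is not the project's Java program and does not parse the real NCDC layout: it assumes a simplified, hypothetical record format of `YYYYMMDD,max_temp,min_temp` per line, and the hot and cold thresholds are illustrative values chosen for this example only.

```python
from collections import defaultdict

# Hypothetical thresholds in degrees Celsius, chosen only for illustration.
HOT_ABOVE = 35.0
COLD_BELOW = 10.0


def classify(t_max, t_min):
    """Label a day as hot, cold or mild from its maximum and minimum temperature."""
    if t_max > HOT_ABOVE:
        return "hot"
    if t_min < COLD_BELOW:
        return "cold"
    return "mild"


def map_record(line):
    """Map: emit (month, reading) for one assumed 'YYYYMMDD,max,min' record."""
    date, t_max, t_min = line.strip().split(",")
    yield date[:6], (date, float(t_max), float(t_min))


def reduce_month(month, readings):
    """Reduce: find the hottest and coldest day among one month's readings."""
    hottest = max(readings, key=lambda r: r[1])
    coldest = min(readings, key=lambda r: r[2])
    return month, {
        "hottest_day": hottest[0], "max": hottest[1],
        "coldest_day": coldest[0], "min": coldest[2],
    }


def analyze(lines):
    """Group readings by month (the shuffle) and reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for month, reading in map_record(line):
            groups[month].append(reading)
    return dict(reduce_month(m, groups[m]) for m in sorted(groups))
```

Given the invented sample lines `20150101,12.0,3.5` and `20150102,15.5,6.0`, `analyze` reports 20150102 as the hottest day of month 201501 (15.5) and 20150101 as the coldest (3.5), and `classify(12.0, 3.5)` labels the first day as cold.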
Step 1:- Import the project into the Eclipse IDE.
Step 2:- Once the project has no errors, export it as a jar file, as we did in the wordcount
MapReduce guide. Right-click on the project and click Export, then select JAR file.
Figure: - 6
Step 3:- Before running the MapReduce program to check what it does, make sure that your
cluster is up and all the Hadoop daemons are running.
Figure: - 10
Step 4:- Put the input file on HDFS
Command :- hdfs dfs -put download/inputfile.txt
Figure: - 11
Step 5:- Run the jar file
Command :- hadoop jar temp.jar /wathear-data.txt /output
Figure: - 12