1. BIG DATA & HADOOP
Presented by :
Roushan Kumar Sinha
B.Tech (I.T-2nd yr.)
• What is big data.
• Types of big data.
• So what is the problem.
• So what do we do.
• Three characteristics of big data.
• Google’s solution.
• What is Hadoop.
• HDFS architecture.
• Map reduce.
3. What is big data?
• Lots of data (terabytes, petabytes, or exabytes).
• Big data is the term for a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management
tools or traditional data processing applications.
• The challenges include capture, storage, search, sharing, transfer,
analysis and visualization.
• Systems and enterprises generate huge amounts of data, from terabytes to
petabytes of information.
A single jet engine can generate
10+ terabytes of data in 30 minutes of a
6. So What Is The Problem?
• The disk transfer rate is about 100 MB/sec.
• A standard disk is 1 terabyte.
• Time to read the entire disk = 10,000 sec, or about 3 hrs!
• An increase in processing speed may not be as helpful because:
• Network bandwidth is now more of a limiting factor.
• The physical limits of processor chips have been reached.
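The read-time figure above follows from simple arithmetic. A minimal sketch, assuming decimal units (1 TB = 1,000,000 MB); the count of 100 disks read in parallel is a hypothetical value chosen for illustration and does not come from the slides:

```python
# Assumption: decimal units, so 1 TB = 1,000,000 MB.
DISK_SIZE_MB = 1_000_000      # a 1 terabyte disk
TRANSFER_RATE_MB_S = 100      # about 100 MB/sec


def read_time_seconds(size_mb, rate_mb_s, disks=1):
    """Time to read size_mb when the data is spread evenly
    over `disks` disks that are read in parallel."""
    return size_mb / (rate_mb_s * disks)


single = read_time_seconds(DISK_SIZE_MB, TRANSFER_RATE_MB_S)
# Hypothetical: the same data spread over 100 disks.
parallel = read_time_seconds(DISK_SIZE_MB, TRANSFER_RATE_MB_S, disks=100)

print(f"one disk: {single:.0f} s (~{single / 3600:.1f} h)")
print(f"100 disks in parallel: {parallel:.0f} s")
```

The point of the second figure is only that spreading the data across many disks shortens the read time roughly in proportion to the number of disks, ignoring network and coordination overhead.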
9. GOOGLE’S Solution
• Google solved this problem using an algorithm called MapReduce. This
algorithm divides the task into small parts, assigns those parts to many
computers connected over the network, and collects the results to form the
final result dataset.
• The diagram above shows various commodity hardware, which could be
single-CPU machines or servers with higher capacity.
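The divide, assign and collect idea can be sketched in a few lines. This is only an illustration under stated assumptions: threads on one machine stand in for the networked computers, and the function names (`split`, `partial_sum`, `distributed_sum`) are hypothetical, not part of any Google or Hadoop API.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide data into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum(chunk):
    """The small task each worker performs on its own chunk."""
    return sum(chunk)


def distributed_sum(data, workers=4):
    chunks = split(data, workers)
    # Threads stand in for the networked computers in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Collect the partial results to form the final result.
    return sum(partials)


print(distributed_sum(list(range(1, 101))))
```

Each worker sees only its own chunk, so the chunks can be processed independently; only the small partial results have to travel back to be combined.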
10. What is Hadoop?
• Apache Hadoop is a framework that allows for the distributed
processing of large data sets across clusters of commodity computers
using a simple programming model.
• It is an open source data management with scale out storage &
12. Hadoop Distributed File System (HDFS)
•The Hadoop File System was developed using a distributed
file system design. It runs on commodity hardware.
Unlike other distributed systems, HDFS is highly
fault-tolerant and designed using low-cost hardware.
•HDFS holds very large amounts of data and provides
easier access. To store such huge data, the files are
stored across multiple machines. These files are stored
in a redundant fashion to rescue the system from possible
data losses in case of failure. HDFS also makes
applications available for parallel processing.
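Redundant storage across machines can be illustrated with a toy model. This is a sketch under stated assumptions, not how HDFS is implemented: placement here is simple round-robin, whereas real HDFS placement is decided by the namenode and takes the cluster layout into account. The replication factor of 3 matches the usual HDFS default, and the node names and helper functions are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical machine names
placement = place_blocks(num_blocks=6, nodes=nodes)
survivors = readable_blocks(placement, failed={"node2", "node3"})
print(len(survivors), "of", len(placement), "blocks still readable")
```

With three copies of every block, a block becomes unreadable only when all three machines holding it fail at once, which is why the loss of one or two machines in this model costs no data.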
13. Features of HDFS
•It is suitable for distributed storage and processing.
•Hadoop provides a command interface to interact with
•The built-in servers of the namenode and datanode help
users to easily check the status of the cluster.
•Streaming access to file system data.
•HDFS provides file permissions and authentication.
15. Goals of HDFS
•Fault detection and recovery : Since HDFS includes a
large number of commodity hardware components, failure of
components is frequent. Therefore HDFS should have
mechanisms for quick and automatic fault detection and recovery.
•Huge datasets : HDFS should have hundreds of nodes
per cluster to manage applications that have huge datasets.
•Hardware at data : A requested task can be done
efficiently when the computation takes place near the
data. Especially where huge datasets are involved, it
reduces the network traffic and increases the
17. How MapReduce works
• A MapReduce job splits the data set into independent chunks, which are
processed by the map tasks in a completely parallel manner.
• The framework sorts the outputs of the maps, which are then input to the
reduce tasks.
• Typically both the input and the output of the job are stored in a file
system. The framework takes care of scheduling tasks, monitoring them,
and re-executing the failed tasks.
• Hadoop runs the job by dividing it into tasks, of which there are two types:
map tasks and reduce tasks.
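The flow described above (split into chunks, map, sort, reduce) can be sketched with a word count. This is plain Python on one machine, not the Hadoop API: the function names are hypothetical, and the scheduling, monitoring and re-execution that the framework provides are left out.

```python
from collections import defaultdict


def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(mapped_outputs):
    """Group the map outputs by key and sort the keys,
    which is the step that sits between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


def run_job(chunks):
    # Each chunk is independent, so the map tasks could run in parallel.
    mapped = [map_task(chunk) for chunk in chunks]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_task(key, values) for key, values in grouped)


chunks = ["big data is big", "hadoop stores big data"]
print(run_job(chunks))
```

Because no map task needs the output of another, a failed map task can simply be run again on its chunk without affecting the rest of the job.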