1. GLOBAL INSTITUTE OF TECHNOLOGY
BIG DATA and HADOOP
Miss. Nishi Sharma
• Apache Hadoop is an open-source software framework
for distributed storage and distributed processing of Big
Data on clusters of commodity Hardware.
• Processes the data using a simple programming model.
• Hadoop Distributed File System (HDFS) splits files into
large blocks (default 64MB or 128MB) and distributes
the blocks amongst the nodes in the cluster.
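The block-splitting arithmetic can be sketched as follows. This is a minimal illustration of the idea, not the real HDFS client; 128 MB is the default block size in Hadoop 2.x.

```python
# Sketch of how HDFS splits a file into fixed-size blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size, in bytes

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the (start, end) byte ranges HDFS would store as separate blocks."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        end = min(offset + block_size, file_size_bytes)
        blocks.append((offset, end))
        offset = end
    return blocks

# A 300 MB file yields two full 128 MB blocks plus one 44 MB block;
# each block is then distributed (and replicated) across DataNodes.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))  # 3
```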
3. Origin of Apache Hadoop
•The Apache Hadoop project originated from Google's
white paper series on Bigtable and MapReduce.
•Later, Yahoo! and many other contributors
implemented Google's white papers.
•Doug Cutting, Hadoop’s creator, named the
framework after his child's stuffed toy elephant.
4. Keyword Behind Hadoop Is Big Data
1. Big Data is the term for collections of datasets so large and
complex that they are difficult to process using traditional data-processing tools.
2. Lots of data, in terabytes or petabytes.
7. Big Data Challenges That Hadoop Resolves
Big data brings with it two fundamental challenges: how
to store and work with voluminous data sizes, and, more
important, how to understand data and turn it into a
competitive advantage.
Hadoop fills a gap in the market by effectively storing
and providing computational capabilities over substantial
amounts of data.
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using simple
programming models.
It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to deliver
high availability, the library itself is designed to detect and handle failures at
the application layer.
Companies using Hadoop:
and lots more...
10. Hadoop Distributed File System(HDFS)
Hadoop Distributed File System (HDFS) is a distributed filesystem
designed to hold very large volumes of data. It is a block-structured file
system:
•Individual files are broken into blocks of fixed size.
•These blocks are stored across a cluster of one or more machines with
data storage capacity.
•Individual machines in the cluster are referred to as DataNodes.
11. Components of HDFS
• Name Node
1. Master of the system.
2. Maintains and manages the blocks which are present on the
DataNodes.
• Data Node
1. Slaves deployed on each machine, providing the actual
storage.
2. Responsible for serving read and write requests for the
clients.
• Backup Node
This is responsible for performing periodic checkpoints.
13. Map Reduce
• MapReduce is a programming model.
• Programs written in this functional style are
automatically parallelized and executed on a large
cluster of commodity machines.
• MapReduce is an associated implementation for
processing and generating large data sets.
• The role of the programmer is to define map and
reduce functions, where the map function outputs
key/value tuples, which are processed by reduce
functions to produce the final output.
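The programmer's role described above can be illustrated with the classic word-count example. This is a minimal sketch in plain Python (not Hadoop's Java API): `map_fn` emits key/value tuples, and `reduce_fn` merges the values for each key.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_key, line):
    """Map: emit a (word, 1) tuple for every word in an input line."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for the same word."""
    yield (word, sum(counts))

# Simulate the framework: run map over all records, sort by key,
# then run reduce over each group of values sharing a key.
lines = ["big data", "big hadoop"]
pairs = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]
pairs.sort(key=itemgetter(0))
result = {}
for word, group in groupby(pairs, key=itemgetter(0)):
    for w, total in reduce_fn(word, (count for _, count in group)):
        result[w] = total
print(result)  # {'big': 2, 'data': 1, 'hadoop': 1}
```

In a real Hadoop job, only the two functions are written by the programmer; the sorting and grouping between them is done by the framework.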
14. Map Reduce Procedure
MAP: a map function that processes a key/value pair to generate a set of
intermediate key/value pairs.
REDUCE: a reduce function that merges all intermediate values
associated with the same intermediate key.
15. Components of Map Reduce
• JobTracker
It is the service in Hadoop which sends map and reduce
tasks to specific nodes in the cluster.
• TaskTracker
TaskTrackers are the slaves deployed on each
machine. They are responsible for running the map
and reduce tasks as instructed by the JobTracker.
17. Map Reduce Working
• A Map-Reduce job usually splits the input data-set into independent
chunks which are processed by the map tasks in a completely parallel
manner.
• The framework sorts the outputs of the maps, which are then input to
the reduce tasks.
• Typically both the input and the output of the job are stored in a file-
system. The framework takes care of scheduling tasks, monitoring
them and re-executes the failed tasks.
• A MapReduce job is a unit of work that the client wants to be
performed: it consists of the input data, the MapReduce program, and
configuration information. Hadoop runs the job by dividing it into
tasks, of which there are two types: map tasks and reduce tasks.
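The job anatomy above (split, map, framework-managed sort/shuffle, reduce) can be sketched end to end. This is a toy single-process simulation of the framework's responsibilities, not Hadoop itself; real map tasks run in parallel across the cluster and failed tasks are re-executed.

```python
from collections import defaultdict

def run_job(input_records, map_fn, reduce_fn, num_splits=2):
    # 1. Split the input data-set into independent chunks.
    chunks = [input_records[i::num_splits] for i in range(num_splits)]
    # 2. Map tasks process each chunk (in parallel, in real Hadoop).
    intermediate = []
    for chunk in chunks:
        for record in chunk:
            intermediate.extend(map_fn(record))
    # 3. The framework groups the map outputs by key (sort/shuffle phase).
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # 4. Reduce tasks merge all values associated with each key.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

out = run_job(
    ["big data", "big hadoop"],
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
print(out)
```

The client supplies only the input data, the map/reduce program, and configuration; everything between steps 2 and 4 is the framework's job.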
21. Future scope
• Apache Hadoop's MapReduce and HDFS components were originally
derived from Google's MapReduce and Google File
System (GFS) papers, respectively. From the above description we can
understand the future need for Big Data, so Hadoop can be a strong option
for the maintenance and efficient processing of large data.
• This technology has a bright future because the need for
data grows day by day, and security is also a major concern. Nowadays,
many multinational organizations prefer Hadoop over traditional systems.
• Major companies like Facebook, Amazon, Yahoo! & LinkedIn, etc.
are adopting Hadoop, and in future many more names may join the list.
• Hence, Hadoop is an appropriate approach for
handling data in a smart way, and its future is bright.