1. GLOBAL INSTITUTE OF TECHNOLOGY
A
SEMINAR PROJECT
ON
BIG DATA AND HADOOP
Submitted By:
Anju Shekhawat
Submitted To:
Miss Nishi Sharma
2. Introduction
• Apache Hadoop is an open-source software framework
for distributed storage and distributed processing of Big
Data on clusters of commodity hardware.
• It processes data using a simple programming model.
• The Hadoop Distributed File System (HDFS) splits files into
large blocks (64 MB or 128 MB by default) and distributes
the blocks among the nodes in the cluster.
3. Origin of Apache Hadoop
• The Apache Hadoop project originated from Google's
white papers on Bigtable, MapReduce, and the Google
File System (GFS).
• Yahoo! and many other contributors later implemented
the ideas described in Google's white papers.
• Doug Cutting, Hadoop's creator, named the
framework after his child's stuffed toy elephant.
4. The Keyword Behind Hadoop: Big Data
1. Big Data is the term for collections of datasets so large and
complex that they are difficult to process using traditional
data-processing applications.
2. Volumes of data measured in terabytes or petabytes.
6. Types of Data
Unstructured Data: PDF, Word, text, and email body
data.
Semi-Structured Data: XML file data.
Structured Data: RDBMS data.
7. Big Data Challenges Hadoop Resolves
Big data brings with it two fundamental challenges: how
to store and work with voluminous data sizes, and, more
importantly, how to understand data and turn it into a
competitive advantage.
Hadoop fills a gap in the market by effectively storing
and providing computational capabilities over substantial
amounts of data.
8. Hadoop??
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using simple
programming models.
It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to deliver
high availability, the framework itself is designed to detect and handle failures at
the application layer.
Companies using Hadoop:
1. Google
2. Yahoo
3. Amazon
4. Facebook
5. Twitter
6. IBM
7. Rackspace
and lots more...
10. Hadoop Distributed File System(HDFS)
Hadoop Distributed File System (HDFS) is a distributed filesystem
designed to hold very large volumes of data. It is a block-structured file
system in which:
• Individual files are broken into blocks of fixed size.
• These blocks are stored across a cluster of one or more machines with
data storage capacity.
• Individual machines in the cluster are referred to as DataNodes (see the
API sketch below).
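As a minimal sketch (not part of the original slides), the Java snippet below shows how a client could put a small file into HDFS through the org.apache.hadoop.fs.FileSystem API and then ask the NameNode which blocks make up the file and which DataNodes hold them. The NameNode address hdfs://namenode:9000 and the path /demo/sample.txt are hypothetical examples.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);

        // Write a small file into HDFS; large files would be split into blocks.
        Path file = new Path("/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Ask the NameNode which blocks make up the file
        // and which DataNodes hold each block.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}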
11. Components of HDFS
• Name Node
1. Master of the system.
2. Maintains and manages the metadata for the blocks present on the
DataNodes.
• Data Node
1. Slaves deployed on each machine; they provide the
actual storage.
2. Responsible for serving read and write requests from
clients.
• Backup Node
This is responsible for performing periodic checkpoints.
13. Map Reduce
• MapReduce is a programming model and an associated
implementation for processing and generating large data sets.
• Programs written in this functional style are
automatically parallelized and executed on a large
cluster of commodity machines.
• The role of the programmer is to define map and
reduce functions: the map function outputs
key/value tuples, which are processed by reduce
functions to produce the final output (see the
word-count sketch below).
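A minimal word-count sketch of the two functions the programmer defines, written against the org.apache.hadoop.mapreduce API; the class names TokenizerMapper and SumReducer are illustrative, not from the original slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit the pair (word, 1) for every word in an input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all counts received for the same word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}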
14. Map Reduce Procedure
• MAP: a map function that processes a key/value pair to
generate a set of intermediate key/value pairs.
• REDUCE: a reduce function that merges all intermediate
values associated with the same intermediate key.
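As a concrete illustration using the word-count sketch above: the map phase turns the input lines "to be" and "or not to be" into the intermediate pairs (to,1), (be,1), (or,1), (not,1), (to,1), (be,1); the framework groups them by key into (to,[1,1]), (be,[1,1]), (or,[1]), (not,[1]); and the reduce phase merges each value list into the final output (to,2), (be,2), (or,1), (not,1).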
15. Components of Map Reduce
• JobTracker
The service in Hadoop that sends MapReduce
tasks to specific nodes in the cluster.
• TaskTracker
TaskTrackers are the slaves deployed on each
machine; they run the map and reduce
tasks as instructed by the JobTracker.
17. Map Reduce Working
• A MapReduce job usually splits the input data-set into independent
chunks which are processed by the map tasks in a completely parallel
manner.
• The framework sorts the outputs of the maps, which are then input to
the reduce tasks.
• Typically both the input and the output of the job are stored in a
filesystem. The framework takes care of scheduling tasks, monitoring
them, and re-executing failed tasks.
• A MapReduce job is a unit of work that the client wants to be
performed: it consists of the input data, the MapReduce program, and
configuration information. Hadoop runs the job by dividing it into
tasks, of which there are two types: map tasks and reduce tasks (see
the driver sketch below).
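A minimal driver sketch of how such a job could be assembled and submitted by the client; it reuses the hypothetical WordCount classes from the earlier sketch, and the input/output paths /demo/input and /demo/output are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // The job bundles the MapReduce program (mapper/reducer classes),
        // the input data, and configuration information.
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class);   // optional local aggregation
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));

        // The framework splits the input, schedules map and reduce tasks,
        // and re-executes failed tasks; the client simply waits for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}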
19. Hadoop hosted in the cloud
Amazon Elastic MapReduce
Hadoop on Microsoft Azure
Hadoop on OpenStack instances
Hadoop on Google Cloud Platform
Hadoop on Cloudera
20. Features of Hadoop
• Scalable
• Cost effective
• Flexible
• Reliable & Fault Tolerant
21. Future scope
• Apache Hadoop's MapReduce and HDFS components were originally
derived from Google's MapReduce and Google File
System (GFS) papers, respectively. From the description above we can
see the growing need for Big Data in the future, so Hadoop can be a strong
choice for maintaining and efficiently processing large data.
• This technology has a bright future because the need for data
grows day by day and security issues are also a major concern. Nowadays
many multinational organizations prefer Hadoop over
RDBMS.
• Major companies like Facebook, Amazon, Yahoo!, LinkedIn, etc.
are adopting Hadoop, and in the future many more names may join the
list.
• Hence Hadoop is an appropriate approach for
handling data in a smart way, and its future is bright.