Presented by:
Swati Chaturvedi
B.Tech, 3rd Year
Department of Information Technology
Introduction
Big Data:
• Big data is a term used to describe the voluminous amount of unstructured and semi-structured data a company creates.
• Data that would take too much time and cost too much money to load into a relational database for analysis.
• Big data doesn't refer to any specific quantity; the term is often used when speaking about petabytes and exabytes of data.
• The New York Stock Exchange generates about
one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos,
taking up one petabyte of storage. 
• Ancestry.com, the genealogy site, stores around 2.5
petabytes of data.
• The Internet Archive stores around 2 petabytes of
data, and is growing at a rate of 20 terabytes per
month. 
So What Is The Problem?
 The transfer speed of a disk is around 100 MB/s
 A standard disk holds 1 terabyte
 Time to read the entire disk = 1 TB ÷ 100 MB/s = 10,000 seconds, or about 3 hours!
 Faster processors may not help much, because:
• Network bandwidth is now more of a limiting factor
• Physical limits of processor chips have been reached
So What do We Do?
• The obvious solution is to use multiple processors to solve the same problem, by fragmenting it into pieces.
• Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read all the data in under two minutes (see the sketch below).
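A back-of-envelope check of these numbers (a sketch using the figures from the slides; decimal units assumed):

```java
public class ReadTimeEstimate {
    public static void main(String[] args) {
        long diskBytes = 1_000_000_000_000L;      // 1 TB disk
        long bytesPerSecond = 100_000_000L;       // ~100 MB/s transfer speed
        long oneDiskSeconds = diskBytes / bytesPerSecond;  // 10,000 s, about 3 hours
        long hundredDiskSeconds = oneDiskSeconds / 100;    // 100 s, under 2 minutes
        System.out.println("One disk: " + oneDiskSeconds + " s; "
                + "100 disks in parallel: " + hundredDiskSeconds + " s");
    }
}
```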
Distributed Computing
The key issues involved in this solution:
 Hardware failure
 Combining the data after analysis
 Network-associated problems
Hadoop To The Rescue!
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware.
A common way of avoiding data loss is replication: redundant copies of the data are kept by the system so that in the event of failure, another copy is available. The Hadoop Distributed Filesystem (HDFS) takes care of this problem.
The second problem is solved by a simple programming model: MapReduce. Hadoop is the popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets.
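As an illustration (not on the original slides), the replication factor is exposed through HDFS's dfs.replication property; a minimal client-side sketch:

```java
import org.apache.hadoop.conf.Configuration;

public class ReplicationConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.replication controls how many redundant copies HDFS keeps of
        // each block; the usual default is 3.
        conf.setInt("dfs.replication", 3);
        System.out.println("Replication factor: " + conf.getInt("dfs.replication", 3));
    }
}
```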
What Else is Hadoop?
A reliable shared storage and analysis system.
There are other subprojects of Hadoop that provide
complementary services, or build on the core to add
higher-level abstractions.
The project includes these modules:
 Hadoop Common: The common utilities that support
the other Hadoop modules.
 Hadoop Distributed File System (HDFS): A distributed
file system that provides high-throughput access to
application data.
 Hadoop YARN: A framework for job scheduling and
cluster resource management.
 Hadoop MapReduce: A YARN-based system for
parallel processing of large data sets.
Other Hadoop-related projects at Apache include:
 Ambari
 Avro
 Cassandra
 Chukwa
 HBase
 ZooKeeper
 Hive
 Mahout
 Spark
 Tez
Hadoop Approach to Distributed Computing
 A theoretical 1000-CPU machine would cost a very large amount of money, far more than 1,000 single-CPU machines.
 Hadoop ties these smaller, more reasonably priced machines together into a single cost-effective compute cluster.
 Hadoop provides a simplified programming model which allows the user to quickly write and test distributed systems; its efficient, automatic distribution of data and work across machines in turn exploits the underlying parallelism of the CPU cores.
 MapReduce is a programming model and an associated implementation for processing and generating large data sets.
 Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
The Programming Model Of MapReduce
 Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
 The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. The word-count sketch below illustrates both functions.
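A minimal word-count sketch in Java, adapted from the standard Hadoop MapReduce tutorial: Map emits a (word, 1) pair for every word, and Reduce sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: one input line -> a (word, 1) pair per token
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```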
How MapReduce Works
 A MapReduce job usually splits the input dataset into independent chunks which are processed by the map tasks in a completely parallel manner.
 The framework sorts the outputs of the maps, which are then input to the reduce tasks.
 Typically both the input and the output of the job are stored in a filesystem. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
 A MapReduce job is a unit of work that the client wants performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks (a driver sketch follows below).
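A driver sketch (again following the standard Hadoop tutorial; input and output paths are taken from the command line) that packages the input data, program, and configuration into a job:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        // Hadoop divides the job into map and reduce tasks and runs them.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```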
Fault Tolerance
 Two types of nodes control the job execution process: a jobtracker and a number of tasktrackers.
 The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
 Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.
 If a task fails, the jobtracker can reschedule it on a different tasktracker (see the retry sketch below).
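As a hedged illustration (the property names below come from the newer MapReduce API, not from these slides), the number of attempts before a task is declared failed is configurable:

```java
import org.apache.hadoop.conf.Configuration;

public class RetryLimits {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // A failed task is rescheduled on another node up to this many
        // attempts before the whole job is marked as failed (default: 4).
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);
    }
}
```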
Input to Reduce Tasks
MapReduce data flow with a single reduce task
MapReduce data flow with multiple reduce tasks
MapReduce data flow with no reduce tasks
HADOOP DISTRIBUTED FILESYSTEM
 Filesystems that manage the storage across a network of machines are called distributed filesystems.
 Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
 HDFS is designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.
Design of HDFS
 Very large files
Files that are hundreds of megabytes, gigabytes, or terabytes in
size. There are Hadoop clusters running today that store petabytes
of data.
 Streaming data access
HDFS is built around the idea that the most efficient data
processing pattern is a write-once, read-many-times pattern.
A dataset is typically generated or copied from source, then
various analyses are performed on that dataset over time. Each
analysis will involve a large proportion of the dataset, so the time
to read the whole dataset is more important than the latency in
reading the first record.
 Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember that HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.
 Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are
always made at the end of the file. There is no support for multiple
writers, or for modifications at arbitrary offsets in the file.
(These might be supported in the future, but they are likely to be
relatively inefficient.)
Namenodes and Datanodes
 An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers).
 The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.
 Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing. (A small client sketch follows below.)
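A minimal client sketch using Hadoop's FileSystem API (the namenode URI and file path here are made-up placeholders): the client asks the namenode where to write, and the bytes themselves land on the datanodes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster; the namenode serves all metadata requests.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/user/example/hello.txt");
        // The file's blocks are stored on datanodes, replicated for safety.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello, HDFS");
        }
        System.out.println("Wrote " + fs.getFileStatus(file).getLen() + " bytes");
    }
}
```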
 Without the namenode, the filesystem
cannot be used. In fact, if the machine
running the namenode were obliterated, all
the files on the filesystem would be lost
since there would be no way of knowing how
to reconstruct the files from the blocks on
the datanodes.
 It is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this:
1. The first is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems.
2. The second is to run a secondary namenode. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing.
Conclusions
 Hadoop is a data storage and analysis platform for large volumes of data.
 It is a scalable, fault-tolerant, distributed data storage and processing system.
 Hadoop will sit alongside, not replace, your existing RDBMS.
 Hadoop has many tools to ease data analysis.