Hadoop

Presented by:
Swati Chaturvedi
B.Tech, 3rd Year
Department of Information Technology
Introduction
Big Data:
•Big data is a term used to describe the voluminous amount of unstructured and semi-structured data a company creates.
•Data that would take too much time and cost too much money to load into a relational database for analysis.
•Big data doesn't refer to any specific quantity; the term is often used when speaking about petabytes and exabytes of data.
• The New York Stock Exchange generates about
one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos,
taking up one petabyte of storage. 
• Ancestry.com, the genealogy site, stores around 2.5
petabytes of data.
• The Internet Archive stores around 2 petabytes of
data, and is growing at a rate of 20 terabytes per
month. 
So What Is The Problem?
 The transfer speed of a typical disk is around 100 MB/s
 A standard disk holds 1 terabyte
 Time to read the entire disk = 10,000 seconds, or about 3 hours!
 Increasing processing speed may not help much, because
• Network bandwidth is now more of a limiting factor
• Physical limits of processor chips have been reached
So What do We Do?
•The obvious solution is to use multiple processors to solve the same problem by fragmenting it into pieces.
•Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
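
A quick back-of-the-envelope check of those figures, as a minimal Java sketch (the numbers are the illustrative ones from the slides, not measurements):

public class ReadTime {
  public static void main(String[] args) {
    double diskBytes = 1e12;                  // 1 TB disk
    double bytesPerSec = 100e6;               // 100 MB/s transfer speed
    double oneDisk = diskBytes / bytesPerSec; // single disk, sequential read
    double parallel = oneDisk / 100;          // 100 drives reading in parallel
    System.out.printf("one disk:   %.0f s (~%.1f hours)%n", oneDisk, oneDisk / 3600);
    System.out.printf("100 drives: %.0f s (~%.1f minutes)%n", parallel, parallel / 60);
  }
}

Running it prints roughly 10,000 seconds (about 2.8 hours) for one disk and 100 seconds (under two minutes) for 100 drives, matching the figures above.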
Distributed Computing
The key issues involved in this solution:
Hardware failure
Combining the data after analysis
Network-associated problems
Hadoop To The Rescue!
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware.
A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that, in the event of failure, another copy is available. The Hadoop Distributed Filesystem (HDFS) takes care of this problem.
The second problem is solved by a simple programming model: MapReduce. Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets.
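
In HDFS the replication factor is just configuration. As a hedged sketch, this hdfs-site.xml fragment sets the standard dfs.replication property (3 is the usual default), telling HDFS to keep three copies of every block:

<property>
  <name>dfs.replication</name>
  <value>3</value>  <!-- number of redundant copies kept of each block -->
</property>

With three replicas spread across machines (and racks, where possible), losing a single disk or node does not lose data.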
What Else is Hadoop?
A reliable shared storage and analysis system.
There are other subprojects of Hadoop that provide
complementary services, or build on the core to add
higher-level abstractions.
The project includes these modules:
 Hadoop Common: The common utilities that support
the other Hadoop modules.
 Hadoop Distributed File System (HDFS): A distributed
file system that provides high-throughput access to
application data.
 Hadoop YARN: A framework for job scheduling and
cluster resource management.
 Hadoop MapReduce: A YARN-based system for
parallel processing of large data sets.
Other Hadoop-related projects at Apache include:
Ambari
Avro
Cassandra
Chukwa
 HBase
ZooKeeper
Hive
Mahout
Spark
Tez
Hadoop Approach to Distributed Computing
 A theoretical 1,000-CPU machine would cost a very large amount of money, far more than 1,000 single-CPU machines.
 Hadoop ties these smaller, more reasonably priced machines together into a single cost-effective compute cluster.
 Hadoop provides a simplified programming model which allows the user to quickly write and test distributed systems; its efficient, automatic distribution of data and work across machines in turn exploits the underlying parallelism of the CPU cores.
 MapReduce is a programming model.
 Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
 MapReduce is also an associated implementation for processing and generating large data sets.
The Programming Model Of MapReduce
 The Map function, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
 The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values.
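
The classic illustration of this model is word count. Below is a minimal, hedged sketch of the user-written Map and Reduce functions against the standard org.apache.hadoop.mapreduce API (the class names are mine; error handling is omitted). Map emits (word, 1) for every word it sees; Reduce sums the counts for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: the input pair is (byte offset, line of text); the output is (word, 1).
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);  // emit an intermediate key/value pair
      }
    }
  }
}

// Reduce: receives (word, [1, 1, ...]) and merges the values into one sum.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    context.write(word, new IntWritable(sum));  // a smaller set of values
  }
}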
How MapReduce Works
 A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in a completely parallel manner.
 The framework sorts the outputs of the maps, which are then input to the reduce tasks.
 Typically both the input and the output of the job are stored in a filesystem. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
 A MapReduce job is a unit of work that the client wants performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.
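
Those three ingredients — input data, program, configuration — come together in a small driver class. This is a hedged sketch using the standard Job API; the word-count classes are the ones sketched earlier, and the input and output paths are supplied on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);   // the MapReduce program
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // the input data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);  // submit and wait
  }
}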
Fault Tolerance
 There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers.
 The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
 Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.
 If a task fails, the jobtracker can reschedule it on a different tasktracker.
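
How many times a failed task is retried before the whole job is declared failed is configurable. As a hedged sketch, these are the classic (MRv1, jobtracker-era) property names in mapred-site.xml; 4 attempts is the usual default:

<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>  <!-- retry a failed map task up to 4 times -->
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>  <!-- likewise for reduce tasks -->
</property>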
Input to Reduce Tasks
[Diagram: MapReduce data flow with a single reduce task]
[Diagram: MapReduce data flow with multiple reduce tasks]
[Diagram: MapReduce data flow with no reduce tasks]
HADOOP DISTRIBUTED FILESYSTEM
 Filesystems that manage the storage across a network of machines are called distributed filesystems.
 Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
 HDFS is designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.
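
Day-to-day interaction with HDFS usually happens through its shell. A few common operations, as a sketch (the paths are hypothetical):

hdfs dfs -mkdir -p /user/swati/input       # create a directory in HDFS
hdfs dfs -put logs.txt /user/swati/input   # copy a local file into HDFS
hdfs dfs -ls /user/swati/input             # list the directory
hdfs dfs -cat /user/swati/input/logs.txt   # stream a file's contents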
Design of HDFS
 Very large files
Files that are hundreds of megabytes, gigabytes, or terabytes in
size. There are Hadoop clusters running today that store petabytes
of data.
 Streaming data access
HDFS is built around the idea that the most efficient data
processing pattern is a write-once, read-many-times pattern.
A dataset is typically generated or copied from source, then
various analyses are performed on that dataset over time. Each
analysis will involve a large proportion of the dataset, so the time
to read the whole dataset is more important than the latency in
reading the first record.
 Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember that HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.
 Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are
always made at the end of the file. There is no support for multiple
writers, or for modifications at arbitrary offsets in the file.
(These might be supported in the future, but they are likely to be
relatively inefficient.)
Namenodes and Datanodes
 An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers).
 The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.
 Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
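
Clients see none of this division of labor directly; they go through the FileSystem API, which asks the namenode for metadata and streams the actual bytes from datanodes. A minimal, hedged read sketch (the namenode URI and file path are hypothetical):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The namenode resolves the path to block locations...
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    // ...and the bytes themselves stream from the datanodes holding the blocks.
    try (InputStream in = fs.open(new Path("/user/swati/input/logs.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // copy the file to stdout
    }
  }
}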
 Without the namenode, the filesystem
cannot be used. In fact, if the machine
running the namenode were obliterated, all
the files on the filesystem would be lost
since there would be no way of knowing how
to reconstruct the files from the blocks on
the datanodes.
 It is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this:
1. The first is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems (see the configuration sketch after this list).
2. The second is to run a secondary namenode. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing.
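
For mechanism 1, the standard knob is dfs.namenode.name.dir, which takes a comma-separated list of directories; the namenode writes its metadata to every one of them. A hedged hdfs-site.xml sketch (both paths are hypothetical; one is a local disk, the other an NFS mount):

<property>
  <name>dfs.namenode.name.dir</name>
  <!-- metadata is written to each listed directory -->
  <value>/disk1/hdfs/name,/remote/nfs/hdfs/name</value>
</property>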
Conclusions
 Hadoop is a data storage and analysis platform for large volumes of data.
 It is a scalable, fault-tolerant, distributed data storage and processing system.
 Hadoop will sit alongside, not replace, your existing RDBMS.
 Hadoop has many tools to ease data analysis.