BIRLA VISHVAKARMA
MAHAVIDHYALAYA
TOPIC :SUBMITTED TO : MISSES BIJAL DALWADI
SUBMITTED BY : NIKITA SURE(140080116025)
JIMMY CHOPDA(140080116013)
VRUTI TANKARIA(140080116057)
MESHWA PATEL(140080116035)
DATABASE MANAGEMENT SYSTEM
PART 1
What is Hadoop?
History of Hadoop
Why Hadoop?
Where is Hadoop used?
What is Hadoop?
Apache Hadoop is an open-source
software framework written in Java for
distributed storage and distributed
processing of very large data sets on
computer clusters built from commodity
hardware.
Hadoop was created by Doug
Cutting and Mike Cafarella in 2005.
Cutting, who was working at Yahoo!
at that time, named it after his
son’s toy elephant.
It was originally developed to
support distribution for the Nutch
search engine project.
Its latest release is version 2.7.1,
released on July 6, 2015.
Doug Cutting
Hadoop - Why?
The complexity of modern analytics needs is
outstripping the available computing power
of legacy systems. With its distributed
processing, Hadoop can handle large
volumes of structured and unstructured data
more efficiently than the traditional
enterprise data warehouse. Because Hadoop
is open source and can run on commodity
hardware, the initial cost savings are
dramatic and continue to grow as your
organizational data grows.
Smart meters are deployed in homes worldwide to help consumers and utility
companies better manage the use of water, electricity, and gas. Historically,
meter readers would walk from house to house recording meter readouts and
reporting them to the utility company for billing purposes. Because of the labor
costs, many utilities switched from monthly readings to quarterly. This delayed
revenue and made it impossible to analyze residential usage in any detail.
Consider a fictional company called CostCutter Utilities that serves 10 million
households. Once a quarter, they gathered 10 million readings to produce
utility bills. With government regulation and the price of oil skyrocketing,
CostCutter started deploying smart meters so they could get hourly readings of
electricity usage. They now collect 21.6 billion sensor readings per quarter
from the smart meters. Analysis of the meter data over months and years can
be correlated with energy-saving campaigns, weather patterns, and local
events, providing savings insights both for consumers and CostCutter Utilities.
When consumers are offered a billing plan that has cheaper electricity from 8
p.m. to 5 a.m., they demand five-minute intervals in their smart meter reports
so they can identify high-use activity in their homes. At five-minute intervals,
the smart meters are collecting more than 100 billion meter readings every 90
days, and CostCutter Utilities now has a big data problem. Their data volume
exceeds their ability to process it with existing software and hardware. So
CostCutter Utilities turns to Hadoop to handle the incoming meter readings.
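The reading counts in this example can be checked with a few lines of arithmetic. The sketch below assumes a 90-day quarter and one reading per household per interval; the function and constant names are illustrative only.

```python
HOUSEHOLDS = 10_000_000   # households served by the fictional CostCutter Utilities
DAYS_PER_QUARTER = 90     # assumption: a quarter is treated as 90 days


def readings_per_quarter(interval_minutes: int) -> int:
    """Total meter readings per quarter at the given reading interval."""
    readings_per_day = (24 * 60) // interval_minutes
    return HOUSEHOLDS * readings_per_day * DAYS_PER_QUARTER


hourly = readings_per_quarter(60)
five_minute = readings_per_quarter(5)

print(f"hourly readings per quarter:   {hourly:,}")       # 21,600,000,000
print(f"5-minute readings per quarter: {five_minute:,}")  # 259,200,000,000
```

Hourly readings give the 21.6 billion figure quoted above, and five-minute readings give roughly 259 billion, which is consistent with "more than 100 billion" every 90 days.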
The base Apache Hadoop framework is composed
of the following modules:
1. Hadoop Common – contains libraries and
utilities needed by other Hadoop modules
2. Hadoop Distributed File System (HDFS) – a
distributed file system that stores data on
commodity machines
3. Hadoop YARN – a resource-management
platform responsible for managing computing
resources in clusters and using them for
scheduling users' applications
4. Hadoop MapReduce – a programming model for
large-scale data processing
PART 2
• MapReduce
• HDFS
• Introduction to Hadoop architecture
What is MapReduce?
• MapReduce is a processing technique and
a programming model for distributed
computing, based on Java.
• The MapReduce algorithm contains two
important tasks, namely map and reduce.
• Map takes a set of data and converts it
into another set of data, where individual
elements are broken down into tuples
(key/value pairs).
• Reduce takes the output of a map as its
input and combines those tuples into a
smaller set of tuples.
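The map and reduce tasks described above can be illustrated with the classic word-count example. This is a single-process sketch in plain Python, not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the grouping step stands in for the shuffle that Hadoop performs between the two tasks.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: break one input line into (word, 1) tuples."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single tuple."""
    return (key, sum(values))


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the input, and each reduce call receives all values for its key from every mapper.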
What is HDFS?
• HDFS is a Java-based file system that provides
scalable and reliable data storage, and it was
designed to span large clusters of commodity
servers. HDFS has demonstrated production
scalability of up to 200 PB of storage and a single
cluster of 4500 servers, supporting close to a billion
files and blocks.
• Hadoop can work directly with any mountable
distributed file system, but the most common file
system used by Hadoop is HDFS. It is a fault-tolerant
distributed file system designed for commodity
hardware. It is well suited to large data sets
because of its high-throughput access to
application data.
HADOOP ARCHITECTURE
• "Hadoop employs a master/slave
architecture for both distributed storage
and distributed computation". In the
distributed storage, the NameNode is the
master and the DataNodes are the slaves.
In the distributed computation, the
JobTracker is the master and the
TaskTrackers are the slaves.
This describes Hadoop 1; in Hadoop 2, YARN's
ResourceManager and NodeManagers take over
these roles.
MASTERS
1. NAMENODE
• The NameNode is the heart of an HDFS file
system. It keeps the directory tree of all files in
the file system, and tracks where across the
cluster the file data is kept. It does not store the
data of these files itself.
• When the NameNode goes down, the file system
goes offline.
2. JOBTRACKER
• The JobTracker is the service within Hadoop that
farms out MapReduce tasks to specific nodes in
the cluster, ideally the nodes that have the data,
or at least are in the same rack.
• The JobTracker is a single point of failure for the
Hadoop MapReduce service. If it goes down, all
running jobs are halted.
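The placement preference described for the JobTracker (a node that holds the data, otherwise a node in the same rack) can be sketched as follows. This is a simplified illustration under stated assumptions, not Hadoop's actual scheduler: `pick_node`, `rack_of` and the node and rack names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def pick_node(block_hosts: Set[str],
              free_nodes: List[str],
              rack_of: Dict[str, str]) -> Optional[str]:
    """Choose a node for a task, preferring data locality.

    block_hosts: nodes that store a replica of the task's input block.
    free_nodes:  nodes that currently have a free task slot.
    rack_of:     hypothetical mapping from node name to rack name.
    """
    # 1. Data-local: a free node that already holds the block.
    for node in free_nodes:
        if node in block_hosts:
            return node

    # 2. Rack-local: a free node in the same rack as some replica.
    data_racks = {rack_of[host] for host in block_hosts if host in rack_of}
    for node in free_nodes:
        if rack_of.get(node) in data_racks:
            return node

    # 3. Otherwise any free node, or None if the cluster is fully busy.
    return free_nodes[0] if free_nodes else None


rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}

print(pick_node({"n1", "n3"}, ["n2", "n3"], rack_of))  # n3 (holds the data)
print(pick_node({"n1"}, ["n3", "n2"], rack_of))        # n2 (same rack as n1)
print(pick_node({"n1"}, ["n3", "n4"], rack_of))        # n3 (no local option)
```

Moving the computation to the data in this way avoids shipping large blocks across the network.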
SLAVES
1. DATANODE
• A DataNode stores data in the Hadoop
file system. A functional file system
has more than one DataNode, with data
replicated across them.
2. TASKTRACKER
• A TaskTracker is a node in the cluster that
accepts tasks (Map, Reduce and Shuffle
operations) from a JobTracker.
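The division of labour between the NameNode (metadata only) and the DataNodes (the actual bytes, replicated) can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the `TinyCluster` class, the 4-byte block size, and the round-robin placement are not how HDFS is implemented (real HDFS uses much larger blocks and rack-aware placement).

```python
from typing import AbstractSet, Dict, List


class TinyCluster:
    """Toy model: NameNode-style metadata plus DataNode-style block stores."""

    def __init__(self, datanodes: List[str], block_size: int = 4,
                 replication: int = 3) -> None:
        self.block_size = block_size
        self.replication = min(replication, len(datanodes))
        # "DataNodes": each one maps block id -> bytes.
        self.datanodes: Dict[str, Dict[str, bytes]] = {n: {} for n in datanodes}
        # "NameNode" metadata: file -> block ids, block id -> replica holders.
        self.namespace: Dict[str, List[str]] = {}
        self.block_locations: Dict[str, List[str]] = {}
        self._next = 0  # round-robin cursor (a simplification)

    def write(self, path: str, data: bytes) -> None:
        names = list(self.datanodes)
        block_ids: List[str] = []
        for offset in range(0, len(data), self.block_size):
            block_id = f"{path}#blk{len(block_ids)}"
            chunk = data[offset:offset + self.block_size]
            targets = [names[(self._next + i) % len(names)]
                       for i in range(self.replication)]
            self._next += 1
            for target in targets:
                self.datanodes[target][block_id] = chunk
            self.block_locations[block_id] = targets
            block_ids.append(block_id)
        self.namespace[path] = block_ids

    def read(self, path: str, failed: AbstractSet[str] = frozenset()) -> bytes:
        parts: List[bytes] = []
        for block_id in self.namespace[path]:
            live = [n for n in self.block_locations[block_id] if n not in failed]
            if not live:
                raise IOError(f"all replicas of {block_id} are unavailable")
            parts.append(self.datanodes[live[0]][block_id])
        return b"".join(parts)


cluster = TinyCluster(["dn1", "dn2", "dn3", "dn4"])
cluster.write("/meters.csv", b"0123456789")
print(cluster.read("/meters.csv", failed={"dn1", "dn2"}))  # b'0123456789'
```

With three replicas per block, the file stays readable after two DataNodes fail; the metadata dictionaries, by contrast, exist in only one place, which mirrors why losing the NameNode takes the file system offline.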
PART 3
• How the Hadoop architecture works
• Reading files from HDFS
• Writing files to HDFS
PART 4
• Advantages of Hadoop
• Disadvantages of Hadoop
• Where is it used?
• Subprojects of Hadoop
Hadoop-Related Subprojects
• Pig
◦ High-level language for data analysis
• HBase
◦ Table storage for semi-structured data
• ZooKeeper
◦ Coordinating distributed applications
• Hive
◦ SQL-like query language and metastore
• Mahout
◦ Machine learning
etc.
Hadoop advantages
1. Scalable
Hadoop is a highly scalable storage platform, because it can
store and distribute very large data sets across hundreds of
inexpensive servers that operate in parallel.
2. Cost effective
Hadoop is designed as a scale-out architecture that can
affordably store all of a company’s data for later use. The cost
savings are staggering: instead of costing thousands to tens of
thousands of pounds per terabyte, Hadoop offers computing
and storage capabilities for hundreds of pounds per terabyte.
3. Flexible
Hadoop can be used for a wide variety of purposes, such as
log processing, recommendation systems, data
warehousing, marketing campaign analysis and fraud
detection.
4. Fast
Hadoop’s storage method is based on a distributed
file system that basically ‘maps’ data wherever it is located
on a cluster. The tools for data processing are often on the
same servers where the data is located, resulting in much
faster data processing. If you’re dealing with large volumes
of unstructured data, Hadoop can efficiently process
terabytes of data in minutes, and petabytes in hours.
5. Resilient to failure
A key advantage of using Hadoop is its fault tolerance.
When data is sent to an individual node, that data is also
replicated to other nodes in the cluster, which means that
in the event of failure, there is another copy available for
use.
Problems
• Coding MapReduce jobs is tedious.
• Want to change some data? In a database, you
issue an SQL UPDATE and the change is in.
Hadoop does not do this: HDFS files are meant
to be written once and read many times.
• Hadoop stores data in files, and does not
index them. If you want to find something,
you have to run a MapReduce job going
through all the data.
• Hadoop works best where the data is
too big for a conventional database.
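The cost of having no index can be seen in a small comparison. The sketch below uses plain Python lists and dictionaries as stand-ins (it is not Hadoop or database code, and the record layout and function names are hypothetical): without an index, every lookup has to touch every record, whereas an index built once answers later lookups directly.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int]  # hypothetical layout: (meter_id, reading)


def scan(records: Iterable[Record], meter_id: str) -> List[int]:
    """No index: every record must be examined for each query."""
    return [reading for key, reading in records if key == meter_id]


def build_index(records: Iterable[Record]) -> Dict[str, List[int]]:
    """One pass over the data to build a key -> readings lookup table."""
    index: Dict[str, List[int]] = {}
    for key, reading in records:
        index.setdefault(key, []).append(reading)
    return index


records = [("m1", 5), ("m2", 7), ("m1", 6)]

print(scan(records, "m1"))   # [5, 6], found by reading all three records

index = build_index(records)
print(index["m1"])           # [5, 6], found without rescanning the data
```

A full scan is acceptable when a job needs most of the data anyway, which is the batch-processing case Hadoop is designed for.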
What do we want?
• Guaranteed data processing
• Fault tolerance
• No intermediate message brokers
• Higher-level abstraction than message
passing
• “Just works”
Who uses Hadoop?
Hadoop is in use at many organizations that
handle big data, including:
• Amazon/A9
• Facebook
• Google
• IBM
• New York Times
• PowerSet
• Yahoo!
ANY QUESTIONS?

Hadoop info
