HADOOP OVERVIEW &
ARCHITECTURE
BY
CHANDINI SANS
CONTENTS
1. Why Hadoop?
2. Importance of Hadoop
3. What’s in Hadoop?
4. Apache Hadoop ecosystem
5. Hadoop architecture
6. Hadoop MapReduce
7. HDFS
8. Advantages of Hadoop
COST PER GIGABYTE
STORAGE TRENDS
ISSUES WITH LARGE DATA
• Map Parallelism: Chunking input data
• Reduce Parallelism: Grouping related
data
• Dealing with failures & load imbalance
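As a rough illustration of the first two points, the sketch below splits an input list into fixed-size chunks that independent workers can process, then groups the intermediate results by key. The helper names (`chunk`, `count_lengths`, `group_by_key`), the sample words, and the chunk size are assumptions made for this example; none of them are Hadoop APIs.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def chunk(records, size):
    """Split the input into consecutive chunks of at most `size` records."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_lengths(chunk_of_words):
    """Map-side work on one chunk: emit (word length, 1) pairs."""
    return [(len(word), 1) for word in chunk_of_words]


def group_by_key(pairs):
    """Reduce-side preparation: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


words = ["pig", "hive", "hbase", "yarn", "hdfs", "oozie", "sqoop"]
chunks = chunk(words, 3)

# Each chunk is independent, so separate workers can handle the chunks.
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = [pair for result in pool.map(count_lengths, chunks) for pair in result]

grouped = group_by_key(mapped)
totals = {length: sum(ones) for length, ones in grouped.items()}
```

Here `totals` maps each word length to the number of words of that length. Failure handling and load balancing, the third point, are deliberately left out of this sketch.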
• Doug Cutting and Mike Cafarella developed an
open-source project called Hadoop in 2005,
and Doug named it after his son's toy elephant.
• Hadoop has become one of the most talked about
technologies.
• Why? One of the top reasons is its ability to handle
huge amounts of data – any kind of data – quickly.
With volumes and varieties of data growing each
day, especially from social media and automated
sensors, that’s a key consideration for most
organizations. 
• Hadoop is an open-source software framework
for storing and processing big data in a
distributed fashion on large clusters of
commodity hardware.
• Essentially, it accomplishes two tasks:
- massive data storage
- faster processing
• Hadoop is an Apache open-source framework
written in Java that allows distributed
processing of large datasets across clusters of
computers using simple programming models.
Hadoop is designed to scale up from a single
server to thousands of machines, each offering
local computation and storage.
WHO USES HADOOP?
WHY IS HADOOP IMPORTANT?
• Low cost : The open-source framework is free and uses
commodity hardware to store large quantities of data.
• Computing power : Its distributed computing model
can quickly process very large volumes of data.
• Scalability : You can easily grow your system simply by
adding more nodes.
• Storage flexibility : You can store as much data as you
want and decide how to use it later.
• Inherent data protection and self-healing
capabilities : Data and application processing are protected
against hardware failure.
WHAT’S IN HADOOP?
• HDFS – the Java-based distributed file system that can
store all kinds of data without prior organization.
• MapReduce – a software programming model for
processing large sets of data in parallel.
• YARN – a resource management framework for
scheduling and handling resource requests from distributed
applications.
COMPONENTS THAT HAVE ACHIEVED TOP-LEVEL
APACHE PROJECT STATUS
• Pig – a platform for manipulating data stored in HDFS. It
consists of a compiler for MapReduce programs and a
high-level language called Pig Latin.
• Hive – a data warehouse system with an SQL-like query
language that presents data in the form of tables. Hive
programming is similar to database programming. (It was
initially developed by Facebook.)
• HBase – a non-relational, distributed database that runs
on top of Hadoop. HBase tables can serve as input and
output for MapReduce jobs.
• ZooKeeper – an application that coordinates distributed
processes.
• Ambari – a web interface for managing, configuring
and testing Hadoop services and components.
• Flume – software that collects, aggregates and moves
large amounts of streaming data into HDFS.
• Sqoop – a connection and transfer mechanism that
moves data between Hadoop and relational databases.
• Oozie – a Hadoop job scheduler.
HADOOP ARCHITECTURE
• The Hadoop framework includes the following four modules:
• Hadoop Common : These are Java libraries and
utilities required by other Hadoop modules. These
libraries provide filesystem and OS-level abstractions
and contain the necessary Java files and scripts
required to start Hadoop.
• Hadoop YARN : This is a framework for job
scheduling and cluster resource management.
• Hadoop Distributed File System (HDFS) : A
distributed file system that provides high-throughput
access to application data.
• Hadoop MapReduce : This is a YARN-based system
for parallel processing of large data sets.
COMPONENTS OF HADOOP
FRAMEWORK:
HADOOP MAP REDUCE
WHAT IS MAP REDUCE?
• Hadoop runs applications using the MapReduce
algorithm, where the data is processed
in parallel on different CPU nodes.
• A MapReduce program executes in three stages:
the map stage, the shuffle stage, and the reduce
stage.
STAGES OF MAP REDUCE
• Map stage : The mapper's job is to process the input data.
The input, a file or directory stored in the Hadoop file
system (HDFS), is passed to the mapper function line by
line. The mapper processes the data and creates several
small chunks of data.
• Reduce stage : This stage is the combination of
the Shuffle stage and the Reduce stage. The reducer's
job is to process the data that comes from the mapper.
After processing, it produces a new set of output, which is
stored in HDFS.
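The stages above can be sketched in a few lines of plain Python. This is a single-process word count that only mimics the data flow; it is not the Hadoop API, and the function names (`mapper`, `shuffle`, `reducer`, `word_count`) and the sample lines are assumptions chosen for the example.

```python
from collections import defaultdict


def mapper(line):
    """Map stage: turn one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle stage: group values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reducer(key, values):
    """Reduce stage: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # The input is fed to the mapper line by line, as described above.
    intermediate = [pair for line in lines for pair in mapper(line)]
    return [reducer(key, values) for key, values in shuffle(intermediate)]


lines = ["Hadoop stores data", "Hadoop processes data in parallel"]
result = word_count(lines)
```

On a real cluster the mapper calls run on many nodes at once and the shuffle moves data across the network. Here everything happens in one list comprehension so the flow is easy to follow.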
MAP REDUCE
MAP REDUCE
ARCHITECTURE
THINK MAP REDUCE
• Record = (Key, Value)
• Key : Comparable, Serializable
• Value : Serializable
• Input, Map, Shuffle, Reduce, Output
MAP
• Input: (Key1, Value1)
• Output: List(Key2, Value2)
• Projections, Filtering, Transformation
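The three kinds of map function listed above might look like this in plain Python. Each function takes one (Key1, Value1) record and returns a list of (Key2, Value2) pairs, which may be empty. The record layout (`user`, `city`, `amount`) and all function names are hypothetical and serve only to illustrate the signature.

```python
def project(key, value):
    """Projection: keep only the fields of interest."""
    return [(value["user"], value["city"])]


def keep_large(key, value):
    """Filtering: emit the record only if it passes a test, else emit nothing."""
    return [(value["user"], value["amount"])] if value["amount"] >= 100 else []


def to_cents(key, value):
    """Transformation: emit a derived value under a new key."""
    return [(value["city"], value["amount"] * 100)]


def run_map(map_fn, records):
    """Apply a map function to every record and flatten the emitted lists."""
    return [pair for key, value in records for pair in map_fn(key, value)]


# Key1 is a record offset, Value1 is a parsed row (both invented for the example).
records = [
    (0, {"user": "ana", "city": "Lisbon", "amount": 250}),
    (1, {"user": "raj", "city": "Pune", "amount": 40}),
]

projected = run_map(project, records)
filtered = run_map(keep_large, records)
transformed = run_map(to_cents, records)
```

Returning a list rather than a single pair is what lets one mapper drop a record (empty list) or emit several pairs for one input.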
• Data is organized into files and
directories
• Files are divided into uniform-sized
blocks (default 128 MB) and distributed
across cluster nodes
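A short worked example of the block arithmetic, assuming the 128 MB default quoted above. The function name `split_into_blocks` is made up for this sketch. Note that the last block of a file holds only the leftover bytes, so it is usually smaller than the rest.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # default block size quoted in the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of `file_size` bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the final block holds only the leftover bytes
    return sizes


blocks = split_into_blocks(300 * MB)
sizes_in_mb = [size // MB for size in blocks]
```

For a 300 MB file, `sizes_in_mb` comes out as two full 128 MB blocks plus one 44 MB block.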
HDFS
• Blocks are replicated to handle hardware
failure
• Replication for performance and fault
tolerance (Rack-Aware placement)
• HDFS keeps checksums of data for
corruption detection and recovery
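A minimal sketch of how checksums and replicas work together, under simplifying assumptions: one checksum per block, Python's `zlib.crc32` as a stand-in for the checksum HDFS actually stores, and a plain dictionary in place of datanodes. The node names and the `read_block` helper are invented for the example.

```python
import zlib


def checksum(data: bytes) -> int:
    """CRC-32 of a block's bytes (stand-in for the stored checksum)."""
    return zlib.crc32(data)


def read_block(replicas, expected):
    """Return data from the first replica whose checksum matches `expected`.

    `replicas` maps a node name to the bytes stored there. Nodes holding a
    corrupt copy are collected so they could be repaired from a healthy one.
    """
    corrupt = []
    for node, data in replicas.items():
        if checksum(data) == expected:
            return data, corrupt
        corrupt.append(node)
    raise IOError("all replicas are corrupt")


original = b"block-0 contents"
expected = checksum(original)
replicas = {
    "node-a": b"block-0 c0ntents",  # simulated corruption on one replica
    "node-b": original,
    "node-c": original,
}
data, corrupt = read_block(replicas, expected)
```

The read succeeds from `node-b` even though `node-a` is damaged, which is the point of keeping several replicas alongside a checksum.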
FEATURES OF HDFS
• It is suitable for the distributed storage and
processing.
• Hadoop provides a command interface to
interact with HDFS.
• The built-in servers of name node and data
node help users to easily check the status of
cluster.
• Streaming access to file system data.
• HDFS provides file permissions and
authentication.
HDFS ARCHITECTURE
• Namenode is software that can run on commodity
hardware. The system running the namenode acts as the
master server and performs the following tasks:
- Manages the file system namespace.
- Regulates clients' access to files.
- Executes file system operations such as renaming,
closing, and opening files and directories.
• Datanodes manage the data storage of the system. They:
- perform read-write operations on the file system, as per
client request.
- perform operations such as block creation, deletion, and
replication
• Block : User data is stored in HDFS files. Each file is
divided into one or more segments, which are stored on
individual datanodes. These file segments are called blocks.
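To make the division of labour concrete, here is a toy model of the namenode's bookkeeping. It holds metadata only (which blocks make up a file, and which datanodes hold each block) and never touches file contents. The class, its methods, the block-id format, and the round-robin placement are all assumptions for illustration; real HDFS placement is rack-aware, as noted earlier.

```python
class NameNode:
    """Toy master: keeps only metadata (file -> blocks -> datanodes)."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = min(replication, len(self.datanodes))
        self.namespace = {}   # path -> list of block ids
        self.locations = {}   # block id -> list of datanode names
        self._next_block = 0

    def create(self, path, num_blocks):
        """Register a file and assign datanodes to each of its blocks."""
        block_ids = []
        for _ in range(num_blocks):
            block_id = f"blk_{self._next_block}"
            # Simplified round-robin placement (an assumption of this sketch).
            start = self._next_block % len(self.datanodes)
            self.locations[block_id] = [
                self.datanodes[(start + i) % len(self.datanodes)]
                for i in range(self.replication)
            ]
            block_ids.append(block_id)
            self._next_block += 1
        self.namespace[path] = block_ids
        return block_ids

    def rename(self, old, new):
        """A namespace operation: no block data moves, only metadata changes."""
        self.namespace[new] = self.namespace.pop(old)

    def locate(self, path):
        """Tell a client which datanodes to contact for each block."""
        return [(b, self.locations[b]) for b in self.namespace[path]]


nn = NameNode(["dn1", "dn2", "dn3", "dn4"])
nn.create("/logs/a.txt", num_blocks=2)
nn.rename("/logs/a.txt", "/logs/b.txt")
located = nn.locate("/logs/b.txt")
```

A client would use the result of `locate` to read each block directly from one of the listed datanodes. The rename shows why namespace operations are cheap: only the master's dictionary changes.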
MASTER-SLAVE
ARCHITECTURE
GOALS OF HDFS
• Fault detection and recovery :
Since HDFS runs on a large number of commodity hardware
components, component failure is frequent. Therefore
HDFS should have mechanisms for quick and automatic
fault detection and recovery.
• Huge datasets :
HDFS should have hundreds of nodes per cluster to
manage applications with huge datasets.
• Hardware at data :
A requested task can be done efficiently when the
computation takes place near the data. Especially where
huge datasets are involved, this reduces network traffic
and increases throughput.
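The last goal, moving computation to the data, can be sketched as a scheduling preference. The function below is a hypothetical simplification, not Hadoop's scheduler: it prefers a free node that already stores the block and falls back to any free node only when no local one is available.

```python
def pick_node(block_locations, free_nodes):
    """Choose where to run a task that reads one block.

    Prefers a free node that already stores the block (data-local).
    Otherwise falls back to any free node, which means the block has to
    travel over the network before the task can run.
    """
    for node in block_locations:
        if node in free_nodes:
            return node, "data-local"
    if free_nodes:
        return sorted(free_nodes)[0], "remote"
    return None, "wait"


local = pick_node(["dn2", "dn3"], {"dn1", "dn3"})
remote = pick_node(["dn2"], {"dn1", "dn4"})
busy = pick_node(["dn2"], set())
```

With replicated blocks there are several candidate nodes per task, which raises the chance that at least one of them is free.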
ADVANTAGES OF HADOOP
• The Hadoop framework allows the user to quickly write and
test distributed systems.
• The Hadoop library itself detects and handles failures at the
application layer.
• Servers can be added to or removed from the cluster
dynamically, and Hadoop continues to operate without
interruption.
• Apart from being open source, it is compatible with all
platforms since it is Java-based.
Thank
You…!!!