A Tour of Apache Hadoop, Its Components, Flavors, and More
Contents
1. Hadoop Tutorial
2. What is Hadoop?
3. Why Hadoop?
4. Hadoop Architecture
5. Hadoop Components
   5.1. HDFS – Storage Layer
   5.2. MapReduce – Processing Layer
   5.3. YARN – Resource Management Layer
6. Hadoop Daemons
7. How Hadoop Works
8. Hadoop Flavors
9. Hadoop Ecosystem Components
10. Conclusion
1. Hadoop Tutorial
Apache Hadoop is an open-source, scalable, and fault-tolerant framework written in Java. It efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is not only a storage system but also a platform for large-scale data storage and processing.
In this Hadoop tutorial, we will learn about the Hadoop architecture, the Hadoop daemons, and the different flavors of Hadoop. Finally, we will cover the Hadoop components: HDFS, MapReduce, YARN, etc.
2. What is Hadoop?
Hadoop is an open-source project from the ASF (Apache Software Foundation). Being open source means it is freely available, and we can even change its source code to suit our requirements: if certain functionality does not fulfill your need, you can change it. Much of the Hadoop code has been contributed by companies such as Yahoo, IBM, Facebook, and Cloudera.
It provides an efficient framework for running jobs on the multiple nodes of a cluster, i.e., a group of systems connected via a LAN. Apache Hadoop processes data in a distributed fashion, working on multiple machines simultaneously.
Hadoop took its inspiration from Google, which published papers on the technologies it was using: the MapReduce programming model and the Google File System (GFS). Hadoop was originally written by Doug Cutting and his team for the Nutch search engine project, and it soon became a top-level Apache project thanks to its huge popularity.
Apache Hadoop is an open-source framework written in Java. The basic Hadoop programming language is Java, but this does not mean you can code only in Java: you can also write jobs in C, C++, Perl, Python, Ruby, etc. Still, it is usually better to code in Java, as that gives you lower-level control over the code.
Hadoop efficiently processes large volumes of data on a cluster of commodity hardware, as it was developed for processing huge volumes of data. Commodity hardware means low-end, inexpensive machines, which makes Hadoop very economical to run.
Hadoop can be set up on a single machine (pseudo-distributed mode), but it shows its real power with a cluster of machines. We can scale it to thousands of nodes on the fly, i.e., without any downtime, so there is no need to take the system down to add more nodes to the cluster. Follow this guide to learn about Hadoop installation on a multi-node cluster.
Hadoop consists of three key parts:
• Hadoop Distributed File System (HDFS) – the storage layer of Hadoop and one of the most reliable storage systems available.
• MapReduce – the data processing layer of Hadoop: a distributed processing framework that processes data at very high speed.
• YARN (Yet Another Resource Negotiator) – the resource management layer of Hadoop. YARN manages the resources of the cluster.
3. Why Hadoop?
Let us now understand why Hadoop is so popular and why it is said to capture more than 90% of the big data market.
Apache Hadoop is not only a storage system but also a platform for data storage and processing. It is scalable (we can add more nodes on the fly) and fault-tolerant (even if a node goes down, its data is processed by another node).
The following characteristics of Hadoop make it a unique platform:
• Flexibility to store and mine any type of data, whether structured, semi-structured, or unstructured; it is not bound to a single schema.
• It excels at processing data of a complex nature. Its scale-out architecture divides workloads across many nodes, and its flexible file system eliminates ETL bottlenecks.
• It scales economically: as discussed, it can be deployed on commodity hardware, and its open-source nature guards against vendor lock-in.
4. Hadoop Architecture
After understanding what Apache Hadoop is, let us now look at the Hadoop architecture in detail.
Hadoop works in a master-slave fashion. There are a few master nodes and n slave nodes, where n can run into the thousands. The master manages, maintains, and monitors the slaves, while the slaves are the actual worker nodes. In the Hadoop architecture, the master should be deployed on good hardware, not just commodity hardware, as it is the centerpiece of the Hadoop cluster.
The master stores the metadata (data about data), while the slaves are the nodes that store the actual data, distributed across the cluster. A client connects to the master node to perform any task. Now, in this Hadoop tutorial, we will discuss the different components of Hadoop in detail.
5. Hadoop Components
There are three core Apache Hadoop components. In this Hadoop tutorial, you will learn what HDFS, MapReduce, and YARN are. Let us discuss them one by one:
5.1. HDFS – Storage Layer
Hadoop HDFS, the Hadoop Distributed File System, provides storage in Hadoop in a distributed fashion.
In the Hadoop architecture, a daemon called the NameNode runs on the master node for HDFS, and a daemon called the DataNode runs on every slave; hence, the slaves are also called DataNodes. The NameNode stores the metadata and manages the DataNodes, while the DataNodes store the data and do the actual work.
HDFS is a highly fault-tolerant, distributed, reliable, and scalable file system for data storage. First follow this guide to learn more about the features of HDFS, then proceed further with the Hadoop tutorial.
HDFS is designed to handle huge volumes of data; the expected file sizes range from gigabytes to terabytes. A file is split into blocks (128 MB by default) that are stored in a distributed manner across multiple machines. Each block is replicated according to the replication factor (3 by default), which is how HDFS handles the failure of a node in the cluster.
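To make this concrete, here is a minimal sketch (not part of the original tutorial) of writing and reading a file through Hadoop's Java FileSystem API. The path /user/demo/hello.txt is purely illustrative, and the client is assumed to pick up the cluster address (fs.defaultFS) from a core-site.xml on its classpath:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file. For large files, HDFS would split the data
        // into 128 MB blocks and replicate each block (factor 3 by default).
        Path path = new Path("/user/demo/hello.txt"); // illustrative path
        try (FSDataOutputStream out = fs.create(path)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through the same API.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```

Whether the file fits in one block or spans many, the client API is the same; HDFS handles the splitting and replication transparently.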
5.2. MapReduce – Processing Layer
Now it's time to understand one of the most important pillars of Hadoop: MapReduce. MapReduce is a programming model designed to process large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce is the heart of Hadoop: it moves the computation close to the data, since moving huge volumes of data around would be very costly. It allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.
Hence, MapReduce is a framework for the distributed processing of huge data sets over a cluster of nodes. Because the data is stored in a distributed manner in HDFS, MapReduce can process it in a distributed manner as well.
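As an illustration of the programming model, below is the classic word count sketched with Hadoop's Java MapReduce API (a standard textbook example, not code from this tutorial; the class names are our own). The map phase runs in parallel over input splits and emits (word, 1) pairs; the reduce phase receives all the pairs for a given word and sums them:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: runs in parallel, one task per input split (block),
    // and emits a (word, 1) pair for every word it sees.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework groups all pairs with the same word
    // and hands them to one reduce call, which sums the counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```

Note how neither class knows which node it runs on: the framework handles scheduling, the shuffle between map and reduce, and retries on failure.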
5.3. YARN – Resource Management Layer
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. In a multi-node cluster, it becomes very complex to manage, allocate, and release resources (CPU, memory, disk). Hadoop YARN manages these resources quite efficiently and allocates them on request from any application.
For YARN, the ResourceManager daemon runs on the master node, and a NodeManager daemon runs on each of the slave nodes.
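As a rough sketch of what this looks like from an application's point of view (assuming a Hadoop 2.8+ client library and a ResourceManager reachable via the yarn-site.xml on the classpath), YARN's Java client API can list the running NodeManagers and the resources each one offers:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterResources {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // One report per running NodeManager (i.e., per slave node),
        // showing the memory and virtual cores it offers to applications.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.printf("%s: %d MB, %d vcores%n",
                    node.getNodeId(),
                    node.getCapability().getMemorySize(), // long since 2.8
                    node.getCapability().getVirtualCores());
        }
        yarn.stop();
    }
}
```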
Learn the differences between the two resource managers in YARN vs. Apache Mesos. The next topic in this Hadoop tutorial is a very important one: the Hadoop daemons.
6. Hadoop Daemons
Daemons are processes that run in the background. There are mainly four daemons that run for Hadoop.
Hadoop daemons:
• NameNode – runs on the master node for HDFS.
• DataNode – runs on the slave nodes for HDFS.
• ResourceManager – runs on the master node for YARN.
• NodeManager – runs on the slave nodes for YARN.
These four daemons must run for Hadoop to be functional. Apart from these, there can also be a Secondary NameNode, a Standby NameNode, a Job HistoryServer, etc.
7. How Hadoop Works
So far we have studied the Hadoop introduction and the Hadoop architecture in great detail. Now let us summarize how Apache Hadoop works, step by step (a code sketch follows this list):
i) Input data is broken into blocks of 128 MB (by default), and the blocks are distributed to different nodes.
ii) Once all the blocks of the file are stored on DataNodes, the user can process the data.
iii) The master then schedules the program (submitted by the user) on the individual nodes.
iv) Once all the nodes have processed the data, the output is written back to HDFS.
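Tying these steps together, here is a minimal, illustrative driver for the WordCount classes sketched earlier (again an assumption-laden example, not code from this tutorial). The input already sits on HDFS as blocks, YARN schedules one map task per block close to the data, and the final output is written back to HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Step iii: the user submits this program; the cluster schedules
        // the map tasks on the nodes that already hold the input blocks.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Steps i-ii: the input already sits on HDFS, split into blocks.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Step iv: the final output is written back to HDFS.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A typical invocation would be something like hadoop jar wordcount.jar WordCountDriver /user/demo/input /user/demo/output (the paths here are hypothetical).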
8. Hadoop Flavors
This section of the Hadoop tutorial covers the various flavors of Hadoop.
Apache – the vanilla flavor; the actual code resides in the Apache repositories.
Hortonworks – a popular distribution in the industry.
Cloudera – the most popular distribution in the industry.
MapR – has rewritten HDFS, and its file system is claimed to be faster than the others.
IBM – its proprietary distribution is known as BigInsights.
All flavors are almost the same, and if you know one, you can easily work with the other flavors as well.
9. Hadoop Ecosystem Components
In this section, we will cover the Hadoop ecosystem components. Let us see which components form the Hadoop ecosystem:
Hadoop HDFS – the distributed storage layer of Hadoop.
Hadoop YARN – the resource management layer, introduced in Hadoop 2.x.
Hadoop MapReduce – the distributed processing layer of Hadoop.
HBase – a column-oriented database that runs on top of HDFS. It is a NoSQL database that does not understand structured queries and is well suited to sparse data sets.
Hive – Apache Hive is a data warehousing infrastructure built on Hadoop; it enables easy data summarization using SQL-like queries.
Pig – a high-level scripting language. Pig enables writing complex data processing jobs without Java programming.
Flume – a reliable system for efficiently collecting large amounts of data from many different sources in real time.
Sqoop – a tool designed to transfer huge volumes of data between Hadoop and relational databases (RDBMS).
Oozie – a Java web application used to schedule Apache Hadoop jobs. It combines multiple jobs sequentially into one logical unit of work.
ZooKeeper – a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services.
Mahout – a library of scalable machine learning algorithms implemented on top of Apache Hadoop using the MapReduce paradigm.
Refer to the Hadoop Ecosystem Components tutorial for a detailed study of all the ecosystem components of Hadoop.
10. Conclusion
To conclude this Hadoop tutorial: Apache Hadoop is the most popular and powerful big data tool. Hadoop stores and processes huge amounts of data in a distributed manner on a cluster of nodes. It provides a highly reliable storage layer (HDFS), a batch processing engine (MapReduce), and a resource management layer (YARN).
This conclusion is not the end but a foundation for learning Apache Hadoop. Here are the next steps once you are through with this Apache Hadoop tutorial:
1. Internal Working of Hadoop
2. Hadoop Distributed File System
3. MapReduce Tutorial
4. YARN - Tutorial
5. Install Hadoop 2 on Ubuntu