This Hadoop tutorial will help you understand the different tools in the Hadoop ecosystem. This Hadoop video will take you through an overview of the important tools of the Hadoop ecosystem, including Hadoop HDFS, Hadoop Pig, Hadoop YARN, Hadoop Hive, Apache Spark, Mahout, Apache Kafka, Storm, Sqoop, Apache Ranger, and Oozie, and will also discuss the architecture of these tools. It will cover the different tasks of Hadoop such as data storage, data processing, cluster resource management, data ingestion, machine learning, streaming, and more. Now, let us get started and understand each of these tools in detail.
The following topics are explained in this Hadoop ecosystem presentation:
1. What is the Hadoop ecosystem?
1. Pig (Scripting)
2. Hive (SQL queries)
3. Apache Spark (Real-time data analysis)
4. Mahout (Machine learning)
5. Apache Ambari (Management and monitoring)
6. Kafka & Storm
7. Apache Ranger & Apache Knox (Security)
8. Oozie (Workflow system)
9. Hadoop MapReduce (Data processing)
10. Hadoop YARN (Cluster resource management)
11. Hadoop HDFS (Data storage)
12. Sqoop & Flume (Data collection and ingestion)
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schemas, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, its architecture, sources, sinks, channels, and configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand Resilient Distributed Datasets (RDDs) in detail
12. Implement and build Spark applications
13. Learn Spark SQL, including creating, transforming, and querying DataFrames
14. Understand the common use-cases of Spark and the various interactive algorithms
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training.
2. Hadoop Ecosystem
[Diagram: the ecosystem tools grouped by task: data storage, cluster resource management, data processing, data collection and ingestion, scripting, SQL queries, real-time data analysis, machine learning, management and monitoring, streaming, security, and workflow systems]
3. Hadoop Ecosystem (recap of the ecosystem diagram from slide 2)
4. HDFS
HDFS stands for Hadoop Distributed File System. It stores different formats of data across various machines and has 2 major components: the NameNode (master) and the DataNodes (slaves). HDFS splits the data into multiple blocks, 128 MB each by default; a 300 MB file, for example, is stored as two 128 MB blocks and one 44 MB block.
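The block arithmetic in that example is easy to verify with a short, self-contained Python sketch (the 300 MB file and 128 MB default block size come from the slide; the function itself is only illustrative, not HDFS code):

# Illustrative only: how HDFS-style splitting divides a file into blocks.
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the block sizes a file of the given size occupies."""
    blocks = [block_size_mb] * (file_size_mb // block_size_mb)
    if file_size_mb % block_size_mb:
        blocks.append(file_size_mb % block_size_mb)  # final partial block
    return blocks

print(split_into_blocks(300))  # -> [128, 128, 44], matching the slide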
5. Hadoop Ecosystem (recap of the ecosystem diagram from slide 2)
6. YARN
YARN stands for Yet Another Resource Negotiator. It has 2 major components: the ResourceManager (master), which allocates memory and other resources to different applications, and the NodeManager (slave), which handles the nodes of the cluster.
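For intuition about that master/slave split, here is a toy Python sketch (not YARN's API; the node capacities and the request are made up) of a ResourceManager-style scheduler granting containers from whichever node still has capacity:

# Toy model: the master tracks each NodeManager's free resources and
# grants application requests as "containers" on a node that can fit them.
nodes = {
    "node1": {"memory_mb": 8192, "vcores": 4},
    "node2": {"memory_mb": 4096, "vcores": 2},
}

def allocate(request):
    """Grant a container on the first node with enough free resources."""
    for name, free in nodes.items():
        if free["memory_mb"] >= request["memory_mb"] and free["vcores"] >= request["vcores"]:
            free["memory_mb"] -= request["memory_mb"]
            free["vcores"] -= request["vcores"]
            return {"node": name, **request}  # the granted container
    return None  # no capacity anywhere: the request waits in the queue

print(allocate({"memory_mb": 2048, "vcores": 1}))  # granted on node1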
7. Hadoop Ecosystem (recap of the ecosystem diagram from slide 2)
8. MapReduce
MapReduce processes large volumes of data in parallel across a distributed cluster.
[Diagram: big data is split across several Map() tasks; their output is shuffled and sorted, then combined by Reduce() tasks into the final output]
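Hadoop Streaming (a standard Hadoop facility) lets the Map() and Reduce() stages above be written as plain stdin/stdout scripts. Below is a minimal word-count sketch in Python; the file names are our own, and the streaming jar location varies by installation.

# mapper.py: emit a <word, 1> pair, tab-separated, for every word seen.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py: Hadoop's shuffle-and-sort delivers mapper output grouped by
# key, so consecutive lines with the same word can simply be summed.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

A typical launch is hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out; the shuffle-and-sort stage between the two scripts is provided by the framework itself.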
9. Hadoop Ecosystem (recap of the ecosystem diagram from slide 2)
10. Sqoop
Sqoop is used to transfer data between Hadoop and external datastores such as relational databases and enterprise data warehouses. It imports data from external datastores into HDFS, Hive, and HBase.
[Diagram: a Sqoop command on the Hadoop side launches map tasks that move data between HDFS/HBase/Hive and external stores such as relational databases, enterprise data warehouses, and document-based systems]
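A hedged sketch of such an import, driven from Python with the standard sqoop CLI (the flags are stock Sqoop options; the JDBC URL, credentials, and table name are hypothetical):

# Launch a Sqoop import that copies a relational table into HDFS using
# parallel map tasks, as in the diagram above.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",  # hypothetical source
    "--username", "etl_user",
    "--password-file", "/user/etl/.pw",  # keep credentials off the command line
    "--table", "orders",                 # source table to import
    "--target-dir", "/data/orders",      # HDFS destination directory
    "--num-mappers", "4",                # parallel map tasks doing the copy
], check=True)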
11. Flume
Flume is a distributed service for collecting, aggregating, and moving large amounts of log data. It ingests unstructured and semi-structured data into HDFS, including online streaming data from social media, log files, and web servers.
[Diagram: web server/cloud/social media data flows through a Flume agent's source, channel, and sink into HDFS]
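A minimal, hypothetical agent definition makes the source -> channel -> sink wiring concrete. The property keys below are standard Flume configuration; the agent name, log path, and HDFS path are assumptions, and the file is written from Python purely for illustration.

# Write a Flume agent config that tails a web-server log (source), buffers
# events in memory (channel), and delivers them to HDFS (sink).
flume_conf = """
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/nginx/access.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /data/weblogs/%Y-%m-%d
agent1.sinks.sink1.channel = ch1
"""
with open("agent1.conf", "w") as f:
    f.write(flume_conf)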
12. Hadoop Ecosystem (recap of the ecosystem diagram from slide 2)
13. Pig
Pig is used to analyze data in Hadoop. It provides a high-level data processing language to perform numerous operations on the data, and it provides a platform for building data flows for ETL. Roughly 10 lines of Pig Latin script correspond to around 200 lines of MapReduce code.
Apache Pig has two parts: Pig Latin, the language for scripting, and the Pig Latin compiler, which converts Pig Latin code to executable code.
[Diagram: Pig Latin scripts, entered through the Grunt shell or Pig Server, pass through a parser, optimizer, compiler, and execution engine, and run as MapReduce jobs over HDFS]
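As a taste of that brevity, here is a short, hypothetical Pig Latin script (standard Pig syntax; the paths and fields are made up) run from Python with the stock pig command; the handful of lines below would otherwise be a substantial MapReduce program:

# Count HTTP 5xx errors per URL with Pig Latin, then run it via the CLI.
import subprocess

script = """
logs   = LOAD '/data/weblogs' USING PigStorage('\\t')
         AS (ip:chararray, url:chararray, status:int);
errors = FILTER logs BY status >= 500;  -- keep server errors only
by_url = GROUP errors BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(errors) AS hits;
STORE counts INTO '/data/error_counts';
"""
with open("errors.pig", "w") as f:
    f.write(script)
subprocess.run(["pig", "errors.pig"], check=True)  # compiles to MapReduce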
14. Hive
Hive facilitates reading, writing, and managing large datasets residing in distributed storage using HQL (Hive Query Language), an SQL-like language. It has 2 major components: the Hive command line and the JDBC/ODBC driver. Hive also provides User Defined Functions (UDFs) for data mining, document indexing, log processing, etc.
[Diagram: queries arrive via JDBC/ODBC, the Hive Thrift Server, the Hive Web Interface, or the CLI; the driver (compiler, optimizer, executor) turns them into jobs coordinated by the JobTracker and NameNode and executed on DataNodes with TaskTrackers]
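A hedged sketch with the third-party PyHive client (a real library; the host, database, and table here are assumptions, and 10000 is HiveServer2's usual port):

# Create a partitioned table and query it over HiveServer2.
from pyhive import hive

conn = hive.Connection(host="hive.example.com", port=10000, database="default")
cur = conn.cursor()

# HQL looks like SQL; Hive compiles it into distributed jobs behind the scenes.
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        ip STRING, url STRING, ts TIMESTAMP
    ) PARTITIONED BY (dt STRING)
""")
cur.execute("SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url LIMIT 10")
for url, hits in cur.fetchall():
    print(url, hits)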
15. Hadoop Ecosystem (recap of the ecosystem diagram from slide 2)
16. Spark
Spark is an open-source distributed computing engine, written in Scala, for processing and analyzing huge volumes of real-time data. It provides in-memory computation of data and can run up to 100x faster than MapReduce. It is used to process and analyze real-time streaming data such as stock market and banking data.
[Diagram: a driver program's SparkContext talks to a cluster manager, which schedules tasks onto executors, each with its own cache, on the worker nodes]
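A minimal PySpark sketch (standard Spark API; the input path is hypothetical) showing the in-memory caching behind that speed claim:

# Word count on an HDFS file, with the result cached in executor memory.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())                 # keep partitions in memory

print(counts.take(10))                   # first action triggers computation
print(counts.count())                    # reuses the cached partitions
spark.stop()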
17. Hadoop Ecosystem (recap of the ecosystem diagram from slide 2)
18. Mahout
Mahout is used to create scalable and distributed machine learning algorithms; the Mahout environment is what machine learning applications are built on. It has a library that contains in-built algorithms for collaborative filtering, classification, and clustering.
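Mahout itself runs on the JVM, so the following is only a plain-Python illustration of the collaborative-filtering idea named above (score unseen items by how similarly other users rate), with made-up data rather than Mahout's API:

# Recommend items liked by users whose ratings agree with the target user's.
ratings = {
    "alice": {"item1": 5, "item2": 3},
    "bob":   {"item1": 4, "item2": 3, "item3": 5},
    "carol": {"item2": 1, "item3": 4},
}

def similarity(u, v):
    """Higher when two users' ratings on shared items agree more."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    return 1.0 / (1.0 + sum(abs(ratings[u][i] - ratings[v][i]) for i in shared))

def recommend(user):
    """Score items the user has not rated, weighted by rater similarity."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        w = similarity(user, other)
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + w * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # -> ['item3']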
19. Hadoop Ecosystem (recap of the ecosystem diagram from slide 2)
20. Ambari
Ambari is an open-source tool responsible for keeping track of running applications and their statuses. It manages, monitors, and provisions Hadoop clusters, and provides a central management service to start, stop, and configure Hadoop services.
[Diagram: the Ambari server, with its database and the Ambari Web UI, talks to an agent running on each host server]
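Ambari exposes that central management service over a REST API; here is a hedged sketch of querying it (the API is real, but the host, cluster name, and credentials are assumptions; 8080 is Ambari's default port):

# List each service in the cluster and its current state via Ambari's API.
import requests

base = "http://ambari.example.com:8080/api/v1"
resp = requests.get(
    f"{base}/clusters/mycluster/services",
    params={"fields": "ServiceInfo/state"},  # ask for the state field too
    auth=("admin", "admin"),                 # default dev credentials
)
resp.raise_for_status()
for item in resp.json()["items"]:
    info = item["ServiceInfo"]
    print(info["service_name"], info.get("state"))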
21. Hadoop Ecosystem (recap of the ecosystem diagram from slide 2)
22. Kafka
Kafka is a distributed streaming platform, written in Scala and Java, used to store and process streams of records. It builds real-time streaming data pipelines that reliably move data between applications, and real-time streaming applications that transform data into streams. Kafka uses a messaging system for transferring data from one application to another: a sender publishes messages to a message queue, from which a receiver consumes them.
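A hedged sketch with the third-party kafka-python client (a real library; the broker address and topic are assumptions, and 9092 is Kafka's default port):

# Publish one record to a topic, then read it back from the queue.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", b'{"user": "alice", "page": "/home"}')  # async send
producer.flush()                       # block until the record is delivered

consumer = KafkaConsumer("clicks",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for record in consumer:
    print(record.topic, record.offset, record.value)
    break                              # just read the one record back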
23. Storm
Storm is a processing engine, written in Clojure, that processes real-time streaming data at very high speed: it can process over a million records per second on a single node. It is integrated with Hadoop to harness higher throughput.
24. Hadoop Ecosystem (recap of the ecosystem diagram from slide 2)
25. Ranger
Ranger is a framework to enable, monitor, and manage data security across the Hadoop platform. It:
1. Provides centralized security administration to manage all security-related tasks
2. Standardizes authorization across all Hadoop components
3. Offers enhanced support for different authorization methods, such as role-based access control and attribute-based access control
26. Knox
Knox is an application gateway for interacting with the REST APIs and UIs of Hadoop deployments. Knox delivers 3 groups of user-facing services:
1. Proxying services: provide access to Hadoop by proxying HTTP requests
2. Authentication services: authentication for REST API access and WebSSO flow for user interfaces
3. Client services: client development can be done with scripting through a DSL or using the Knox Shell classes
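A hedged sketch of that proxying: a standard WebHDFS call routed through the gateway instead of hitting the NameNode directly (the host, topology name "default", and credentials are assumptions; 8443 is Knox's usual port):

# List an HDFS directory via Knox, which authenticates and forwards the call.
import requests

resp = requests.get(
    "https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp",
    params={"op": "LISTSTATUS"},       # standard WebHDFS operation
    auth=("guest", "guest-password"),  # Knox authenticates the caller
    verify=False,                      # demo only: skip TLS verification
)
print(resp.json())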
27. Hadoop Ecosystem (recap of the ecosystem diagram from slide 2)
28. Oozie
Oozie is a workflow scheduler system to manage Hadoop jobs. It consists of 2 parts:
1. Workflow engine: runs workflows defined as Directed Acyclic Graphs (DAGs), which specify a sequence of actions to be executed
2. Coordinator engine: runs workflow jobs triggered by time and data availability
[Diagram: a workflow begins at a start node and runs a MapReduce program (action node); on success it notifies the client (email action node) and ends (successful completion), while on error it notifies the client of the error (email action node) and is killed (unsuccessful termination)]
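The DAG in that diagram maps directly onto Oozie's workflow XML. Below is a hedged, minimal sketch of such a definition (standard workflow-app vocabulary; the names, properties, and schema version are assumptions, the email notification nodes are omitted, and the file is written from Python purely for illustration):

# A start node runs a MapReduce action; success ends the workflow, while
# an error transitions to a kill node (unsuccessful termination).
workflow_xml = """
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="mr-step"/>
    <action name="mr-step">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>MapReduce step failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
"""
with open("workflow.xml", "w") as f:
    f.write(workflow_xml)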