The Hadoop Cluster Administration course at Edureka starts with the fundamental concepts of Apache Hadoop and Hadoop clusters. It covers how to deploy, manage, monitor, and secure a Hadoop cluster, and how to configure backup options and diagnose and recover from node failures. The course also covers HBase administration. There are many challenging, practical, and focused hands-on exercises for the learners, so software professionals new to Hadoop can quickly learn cluster administration through technical sessions and hands-on labs. By the end of this six-week Hadoop Cluster Administration training, you will be prepared to understand and solve real-world problems that you may come across while working on a Hadoop cluster.
2. How It Works…
LIVE classes
Class recordings
Module wise Quizzes and Practical Assignments
24x7 on-demand technical support
Deployment of different clusters
Online certification exam
Lifetime access to the Learning Management System
www.edureka.in/hadoop-admin
3. Course Topics
Week 1
– Understanding Big Data
– Hadoop Components
– Introduction to Hadoop 2.0
Week 2
– Hadoop 2.0
– Hadoop Configuration
– Hadoop Cluster Architecture
Week 3
– Different Hadoop Server Roles
– Data Processing Flow
– Cluster Network Configuration
Week 4
– Job Scheduling
– Fair Scheduler
– Monitoring a Hadoop Cluster
Week 5
– Securing Your Hadoop Cluster
– Kerberos and HDFS Federation
– Backup and Recovery
Week 6
– Oozie and Hive Administration
– HBase Architecture
– HBase Administration
4. Topics for Today
What is Big Data?
Limitations of the existing solutions
Solving the problem with Hadoop
Introduction to Hadoop
Hadoop Eco-System
Hadoop Core Components
MapReduce software framework
Hadoop Cluster Administrator: Roles and Responsibilities
Introduction to Hadoop 2.0
5. What Is Big Data?
Lots of data (terabytes or petabytes).
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.
The NYSE generates about one terabyte of new trade data per day, used to perform stock-trading analytics to determine trends for optimal trades.
6. IBM’s Definition
IBM’s definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Characteristics of Big Data:
– Volume: 12 terabytes of tweets created each day
– Velocity: scrutinizing 5 million trade events created each day to identify potential fraud
– Variety: sensor data, audio, video, click streams, log files and more
7. Data Volume Is Growing Exponentially
Estimated global data volume:
– 2011: 1.8 ZB
– 2015: 7.9 ZB
The world's information doubles every two years.
Over the next 10 years:
– The number of servers worldwide will grow by 10x
– The amount of information managed by enterprise data centers will grow by 50x
– The number of “files” enterprise data centers handle will grow by 75x
Source: http://www.emc.com/leadership/programs/digital-universe.htm, based on the 2011 IDC Digital Universe Study
8. What Big Companies Have To Say…
McKinsey:
– “Analyzing Big Data sets will become a key basis for competition.”
– “Leaders in every sector will have to grapple with the implications of Big Data.”
Gartner:
– “Big Data analytics is rapidly emerging as the preferred solution to disruptive business and technology trends.”
– “Enterprises should not delay implementation of Big Data analytics.”
Forrester Research:
– “Use Hadoop to gain a competitive advantage over more risk-averse enterprises.”
– “Prioritize Big Data projects that might benefit from Hadoop.”
9. Some Of the Hadoop Users
10. Hadoop Users – In Detail
http://wiki.apache.org/hadoop/PoweredBy
11. What Is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
13. Hadoop History
2002 – Doug Cutting & Mike Cafarella start working on Nutch
2003 – Google publishes the GFS paper
2004 – Google publishes the MapReduce paper
2005 – Doug Cutting adds DFS & MapReduce support to Nutch
2006 – Yahoo! hires Cutting; Hadoop spins out of Nutch
2007 – NY Times converts 4 TB of image archives over 100 EC2 instances
2008 – Fastest sort of a TB (3.5 minutes over 910 nodes); Cloudera founded; Facebook launches Hive: SQL support for Hadoop
2009 – Fastest sort of a TB (62 seconds over 1,460 nodes); a PB sorted in 16.25 hours over 3,658 nodes; Doug Cutting joins Cloudera; Hadoop Summit 2009, 750 attendees
14. Hadoop 1.x Eco-System
– Apache Oozie (workflow)
– Hive (DW system) and Pig Latin (data analysis)
– Mahout (machine learning)
– MapReduce framework
– HBase
– HDFS (Hadoop Distributed File System)
– Flume (import of unstructured or semi-structured data) and Sqoop (import or export of structured data)
15. Hadoop 1.x Core Components
Hadoop is a system for large-scale data processing. It has two main components:
HDFS – Hadoop Distributed File System (storage)
– Distributed across “nodes”
– Natively redundant
– NameNode tracks block locations
MapReduce (processing)
– Splits a task across processors “near” the data and assembles the results
– Self-healing, high-bandwidth clustered storage
– JobTracker manages the TaskTrackers
Additional administration tools:
– Filesystem utilities
– Job scheduling and monitoring
– Web UI
16. Hadoop 1.x Core Components (Contd.)
MapReduce engine: a single JobTracker coordinates multiple TaskTrackers, one per worker node.
HDFS cluster: the NameNode, running on an admin node, manages multiple DataNodes.
17. Name Node and Data Nodes
NameNode:
– the master of the system
– maintains and manages the blocks that are present on the DataNodes
DataNodes:
– slaves deployed on each machine, providing the actual storage
– responsible for serving read and write requests from the clients
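As a rough mental model (toy Python, not Hadoop code), the division of labour looks like this: the NameNode holds only metadata about which DataNodes store which block, while the DataNodes hold the actual bytes and serve reads. All class and method names here are illustrative:

```python
class DataNode:
    """Toy slave: stores actual block data and serves reads."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]

class NameNode:
    """Toy master: stores only metadata, never the data itself."""
    def __init__(self):
        self.block_map = {}       # block_id -> list of DataNodes holding it

    def place_block(self, block_id, datanodes, data, replication=3):
        # Pick `replication` DataNodes and record where the block lives.
        chosen = datanodes[:replication]
        for dn in chosen:
            dn.write(block_id, data)
        self.block_map[block_id] = chosen

    def locate(self, block_id):
        return self.block_map[block_id]

dns = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode()
nn.place_block("blk_1", dns, b"hello", replication=3)
# A client asks the NameNode where the block is, then reads from a DataNode:
replicas = nn.locate("blk_1")
print([dn.name for dn in replicas], replicas[0].read("blk_1"))
```

The key point the sketch makes: clients never fetch data through the master; they ask it for locations and then talk to the slaves directly.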
18. Secondary Name Node
Secondary NameNode:
– Not a hot standby for the NameNode
– Connects to the NameNode every hour (by default) to pull a copy of the metadata
– Performs housekeeping and keeps a backup of the NameNode metadata
– The saved metadata can be used to rebuild a failed NameNode
Note: in Hadoop 1.x the NameNode is a single point of failure.
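The hourly checkpoint interval is configurable. As a sketch for Hadoop 1.x, the relevant properties are fs.checkpoint.period and fs.checkpoint.size (the values shown are the defaults; exact placement of the properties can vary by distribution):

```xml
<!-- core-site.xml (Hadoop 1.x) -->
<configuration>
  <!-- Seconds between Secondary NameNode checkpoints (3600 = 1 hour) -->
  <property>
    <name>fs.checkpoint.period</name>
    <value>3600</value>
  </property>
  <!-- Force a checkpoint early once the edit log reaches this size (bytes) -->
  <property>
    <name>fs.checkpoint.size</name>
    <value>67108864</value>
  </property>
</configuration>
```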
19. What Is MapReduce?
MapReduce is a programming model:
– It is neither platform- nor language-specific
– Record-oriented data processing (keys and values)
– Tasks are distributed across multiple nodes
– Where possible, each node processes data stored on that node
– Consists of two phases: Map and Reduce
20. What Is MapReduce? (Contd.)
The process can be considered similar to a Unix pipeline:
cat /my/log | grep '.html' | sort | uniq -c > /my/outfile
Here cat and grep correspond to the Map phase, sort to the shuffle/sort step, and uniq -c to the Reduce phase.
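The pipeline analogy can be sketched in plain Python: a toy word count that runs a map phase, a sort/shuffle step, and a reduce phase in-process. The function names are illustrative, not part of Hadoop's API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Emit a (key, value) record per word, like a Hadoop Mapper.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Group sorted records by key and sum the values, like a Reducer.
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

lines = ["the cat sat", "the cat ran"]
mapped = map_phase(lines)
shuffled = sorted(mapped)             # the sort/shuffle step between phases
result = dict(reduce_phase(shuffled))
print(result)                         # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```

The sort between the two phases is essential: reduce only works because all records with the same key arrive together, which is exactly what the shuffle guarantees in a real cluster.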
21. Hadoop 1.x – In Summary
A client interacts with two layers:
– HDFS: a NameNode (with a Secondary NameNode for checkpointing) manages DataNodes, which store the data blocks.
– MapReduce: a JobTracker manages TaskTrackers, each of which runs Map and Reduce tasks on the nodes holding the data blocks.
23. Hadoop Cluster Administrator
Roles and Responsibilities
Deploying the cluster
Performance and availability of the cluster
Job scheduling and Management
Upgrades
Backup and Recovery
Monitoring the cluster
Troubleshooting
24. Hadoop 1.0 Vs. Hadoop 2.0
Property            | Hadoop 1.x               | Hadoop 2.x
NameNodes           | 1                        | Many
High Availability   | Not present              | Highly Available
Processing Control  | JobTracker, TaskTracker  | ResourceManager, NodeManager, App Master
25. MRv1 Vs. MRv2
Hadoop 1.0: MapReduce (data processing) runs under the JobTracker, directly on top of HDFS (data storage).
– Problems with resource utilization
– Slots are reserved only for Map and Reduce tasks
Hadoop 2.0: YARN (cluster resource management, with a Scheduler and an Applications Manager (AsM)) sits between HDFS (data storage) and the processing frameworks (MapReduce and others).
– Provides a cluster-level ResourceManager
– Application-level resource management through per-node NodeManagers
– Provides slots for jobs other than Map and Reduce
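As a hedged illustration of the YARN side, a minimal yarn-site.xml for a small cluster might look like the following; the host name is a placeholder, and while these property names are the standard YARN ones, required values vary by distribution:

```xml
<!-- yarn-site.xml (Hadoop 2.x) -->
<configuration>
  <!-- Where NodeManagers and clients find the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-host.example.com</value>
  </property>
  <!-- Auxiliary service needed so MapReduce jobs can shuffle data -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```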
26. Hadoop 2.0 - Architecture
HDFS (high availability):
– An Active NameNode and a Standby NameNode; all namespace edits are logged to shared NFS storage, with a single writer enforced by fencing.
– The Standby NameNode reads the shared edit logs and applies them to its own namespace.
– DataNodes store the blocks; clients interact with the Active NameNode.
YARN:
– A ResourceManager coordinates NodeManagers running alongside the DataNodes.
– Each NodeManager hosts Containers; per-application App Masters run in containers and manage their application's tasks.
27. Assignments
Attempt the following Assignments using the documents present in the LMS:
Apache Hadoop 1.0 Installation on Ubuntu in Pseudo-Distributed Mode
Execute Linux Basic Commands
Execute HDFS Hands On
Cloudera CDH3 and CDH4 Quick VM installation on your local machine
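For the pseudo-distributed installation assignment, the minimal Hadoop 1.x configuration typically looks like the following sketch; the port and the localhost URI are the commonly used defaults, not mandatory values:

```xml
<!-- core-site.xml -->
<configuration>
  <!-- URI clients use to reach the NameNode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <!-- Single node, so store only one copy of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```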
Any production cluster larger than 20-30 nodes requires a full-time admin. This admin is responsible for:
– the performance and availability of the cluster, the data it contains, and the jobs that run there
– deployment, upgrades, troubleshooting, configuration, tuning, job management, installing tools, architecting processes, monitoring, backups, recovery, and more
Virtually no organization runs a production Hadoop cluster without a full-time admin. The fact that Cloudera offers a Hadoop Administrator certification and that O’Reilly sells a book called “Hadoop Operations” demonstrates the importance of the Hadoop Cluster Administrator role in industry.