The Hadoop Cluster Administration course at Edureka starts with the fundamental concepts of Apache Hadoop and Hadoop clusters. It covers the topics needed to deploy, manage, monitor, and secure a Hadoop cluster. You will learn to configure backup options and to diagnose and recover from node failures in a Hadoop cluster. The course also covers HBase administration. There will be many challenging, practical, and focused hands-on exercises for learners. Software professionals new to Hadoop can quickly learn cluster administration through technical sessions and hands-on labs. By the end of this six-week Hadoop Cluster Administration training, you will be prepared to understand and solve real-world problems that you may come across while working on a Hadoop cluster.
Hadoop 1.0 Vs. Hadoop 2.0
Property             Hadoop 1.x                 Hadoop 2.x
NameNodes            1                          Many
High Availability    Not present                Highly Available
Processing Control   JobTracker, TaskTracker    ResourceManager, NodeManager, App Master
Hadoop 2.0 HDFS Federation
http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/Federation.html
[Diagram: HDFS Federation. In Hadoop 1.x a single Namenode handles both the Namespace (NS) and Block Management, with the Datanodes providing the block storage underneath. With federation in Hadoop 2.x, multiple independent Namenodes (NN-1 ... NN-k ... NN-n) each manage their own namespace (NS1 ... NSk ... NSn) and their own block pool (Pool 1 ... Pool k ... Pool n), while the Datanodes (Datanode 1 ... Datanode m) form common storage shared by all the Namenodes.]
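To make the diagram concrete, here is a minimal sketch of how two federated namespaces could be declared in hdfs-site.xml; the nameservice IDs (ns1, ns2) and hostnames are illustrative assumptions, not values from the course:

  <configuration>
    <!-- one nameservice per independent Namenode -->
    <property>
      <name>dfs.nameservices</name>
      <value>ns1,ns2</value>
    </property>
    <!-- RPC address of the Namenode serving each namespace -->
    <property>
      <name>dfs.namenode.rpc-address.ns1</name>
      <value>namenode1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns2</name>
      <value>namenode2.example.com:8020</value>
    </property>
  </configuration>

Every Datanode reads this file and registers with all of the Namenodes listed, which is what makes the Datanodes the common block storage shown in the diagram.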
Hadoop 2.0 HDFS NameNode High Availability
[Diagram: NameNode High Availability. An Active Name Node and a Standby Name Node share edit logs held on NFS storage. All namespace edits are logged to the shared NFS storage by a single writer (enforced by fencing); the Standby Name Node reads the edit logs and applies them to its own namespace. Data Nodes are configured with the location of both Name Nodes, and send block location information and heartbeats to both. With a Standby in place, the Hadoop 1.x Secondary Name Node is no longer required.]
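A minimal sketch of the NFS-based HA settings in hdfs-site.xml, assuming a nameservice named mycluster with Name Nodes nn1 and nn2; the hostnames and the NFS path are illustrative assumptions:

  <configuration>
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <!-- the Active/Standby pair -->
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>namenode1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>namenode2.example.com:8020</value>
    </property>
    <!-- shared NFS directory holding the edit logs (single writer) -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>file:///mnt/nfs/hadoop-ha/edits</value>
    </property>
    <!-- fencing method that cuts off the old Active during failover -->
    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>sshfence</value>
    </property>
  </configuration>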
Hadoop 2.0
[Diagram: Hadoop 2.0 cluster architecture. A Client talks to HDFS and to YARN. On the HDFS side, an Active NameNode and a Standby NameNode share edit logs on NFS storage; all namespace edits are logged to the shared storage by a single writer (fencing), and the Standby reads the edit logs and applies them to its own namespace. On the YARN side, a Resource Manager coordinates the Node Managers running on the slave nodes; each slave node runs a Data Node and a Node Manager, within which Containers and App Masters execute.]
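For the YARN half of the diagram, a minimal yarn-site.xml sketch; the ResourceManager hostname is a placeholder, and 8032/8031 are the usual default ports:

  <configuration>
    <!-- where clients submit applications -->
    <property>
      <name>yarn.resourcemanager.address</name>
      <value>resourcemanager.example.com:8032</value>
    </property>
    <!-- where Node Managers register and send heartbeats -->
    <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>resourcemanager.example.com:8031</value>
    </property>
  </configuration>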
Hadoop 2.0 Configuration Files
Configuration Filename        Description
hadoop-env.sh, yarn-env.sh    Settings for the Hadoop daemons' process environment.
core-site.xml                 Configuration settings for Hadoop Core, such as I/O settings common to both HDFS and YARN.
hdfs-site.xml                 Configuration settings for the HDFS daemons: the NameNode and the DataNodes.
yarn-site.xml                 Configuration settings for the ResourceManager and NodeManager.
mapred-site.xml               Configuration settings for MapReduce applications.
slaves                        A list of machines (one per line) that each run a DataNode and a NodeManager.
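To make the table concrete, two small sketches follow: a mapred-site.xml entry telling MapReduce applications to run on YARN, and a slaves file; the worker hostnames are illustrative assumptions:

  mapred-site.xml:
  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>     <!-- run MapReduce applications on YARN -->
    </property>
  </configuration>

  slaves (one line per machine running a DataNode and a NodeManager):
  worker-01
  worker-02
  worker-03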
Deprecated Properties
Deprecated Property Name    New Property Name
dfs.data.dir                dfs.datanode.data.dir
dfs.http.address            dfs.namenode.http-address
fs.default.name             fs.defaultFS

The core functionality and usage of these configuration files are the same in Hadoop 2.0 as in 1.0, but many new properties have been added and many have been deprecated.
For example:
'fs.default.name' has been deprecated and replaced with 'fs.defaultFS' in core-site.xml for Hadoop 2.0 (YARN).
'dfs.nameservices' has been added to hdfs-site.xml to enable NameNode High Availability.
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
In the Hadoop 2.x (CDH4) release, you can use either the old or the new property names.
The old property names are now deprecated, but still work!
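As an illustration, here is how the old and new forms of this property would look in core-site.xml; the NameNode host and port are placeholder values:

  <configuration>
    <!-- Hadoop 1.x name, deprecated in 2.x but still accepted: -->
    <!--
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>
    -->
    <!-- Hadoop 2.x name: -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>
  </configuration>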
Runtime Environment
hadoop-env.sh and yarn-env.sh offer a way to provide custom parameters for each of the servers.
They are sourced by the Hadoop daemons' start/stop scripts.
Examples of environment variables that you can specify:
  JAVA_HOME
  HADOOP_DATANODE_HEAPSIZE
  YARN_HEAPSIZE
[Diagram: hadoop-env.sh and yarn-env.sh configure the JVM environment in which the daemons and the Map and Reduce tasks run.]
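A minimal sketch of what these two files might contain, using the variables listed on the slide; the JDK path and heap sizes are illustrative assumptions:

  # hadoop-env.sh
  export JAVA_HOME=/usr/java/jdk1.6.0_31      # location of the JDK (placeholder path)
  export HADOOP_DATANODE_HEAPSIZE=1024        # DataNode JVM heap in MB (variable named on the slide)

  # yarn-env.sh
  export YARN_HEAPSIZE=1024                   # heap for the YARN daemons, in MB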
Hadoop Cluster: A Typical Use Case (Hadoop 1.0)
Name Node
  RAM: 64 GB
  Hard disk: 1 TB
  Processor: Xeon with 8 cores
  Ethernet: 3 x 10 GB/s
  OS: 32-bit CentOS

Secondary Name Node
  RAM: 32 GB
  Hard disk: 1 TB
  Processor: Xeon with 4 cores
  Ethernet: 3 x 10 GB/s
  OS: 32-bit CentOS

Data Node (each)
  RAM: 16 GB
  Hard disk: 6 x 2 TB
  Processor: Xeon with 2 cores
  Ethernet: 3 x 10 GB/s
  OS: 32-bit CentOS
Hadoop Cluster: Thinking About The Problem
Single Machine
  Great for testing and developing.
  Not a practical implementation for large amounts of data.

Hadoop Cluster
  Small Cluster
    Initially four or six nodes.
    As the volume of data grows, more nodes can easily be added.
  Large Cluster
    Ways of deciding when the cluster needs to grow:
      Increasing amount of computation power needed.
      Increasing amount of data which needs to be stored.
      Increasing amount of memory needed to process tasks.
Plan your Hadoop Cluster: Hardware

Master Hardware
  Namenode requirements: RAM to fit metadata; modest but dedicated disk.
  Secondary Namenode: almost identical to the Namenode.
  Resource Manager: retains job data and is memory hungry; memory requirements can grow independent of cluster size.

Slave Hardware
  Storage
  Computation

Cluster Sizing
  Usage pattern and workloads: IO-bound or CPU-bound.
  Consider requirements for additional components such as HBase.
Plan your Hadoop Cluster: Software

Operating System
  Linux is the only production-quality option today.
  A significant number of clusters run on RHEL.

Java
  The JDK is the most critical piece of software; Java 1.6.x.
  List of tested JVMs: http://wiki.apache.org/hadoop/HadoopJavaVersions

Operating System utilities
  ssh, cron, rsync, ntp
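A quick per-node check of these prerequisites might look like the following sketch; the exact service and package names can differ between Linux distributions:

  java -version                               # confirm a tested JDK (e.g. 1.6.x) is installed
  ssh -o BatchMode=yes localhost hostname     # passwordless ssh should work without prompting
  service ntpd status                         # clocks on all nodes must stay synchronized
  which rsync                                 # rsync is used to push configuration to the nodes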
Choose a Distribution and Version of Hadoop
Popular Hadoop Distributions
Apache Hadoop
  Complex cluster setup
  Manual install and integration of Hadoop ecosystem components such as Pig, Hive, HBase, etc.
  No commercial support
  Good for a first try

Cloudera
  Established distribution with many reference deployments
  Powerful tools for deployment, management and monitoring, such as Cloudera Manager
Popular Hadoop Distributions (contd.)

HortonWorks
  Only distribution without any modification to Apache Hadoop
  HCatalog for metadata
  Stinger for Hive

MapR
  Supports a native Unix filesystem
  HA features such as snapshots, mirroring and stateful failover

Amazon Elastic MapReduce (EMR)
  Hosted solution
  Only Pig and Hive are available as of now
Assignments – Status
Attempt the following Assignments using the documents present in the LMS:
Install single-node Apache Hadoop 2.0 on a virtual machine using VMware Player or VirtualBox.
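After the installation, a few standard commands can confirm the daemons are healthy (a sketch, assuming the Hadoop binaries are on the PATH):

  jps                             # should list NameNode, DataNode, ResourceManager and NodeManager
  hdfs dfsadmin -report           # HDFS capacity and the registered DataNode(s)
  hadoop fs -mkdir /smoke-test    # simple write/read check against HDFS
  hadoop fs -ls /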
In order to scale the name service horizontally, federation uses multiple independent Namenodes/namespaces. The Namenodes are federated, that is, the Namenodes are independent and don’t require coordination with each other. The datanodes are used as common storage for blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handle commands from the Namenodes.
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

In order for the Standby node to keep its state synchronized with the Active node, the current implementation requires that the two nodes both have access to a directory on a shared storage device (e.g. an NFS mount from a NAS). This restriction will likely be relaxed in future versions. When any namespace modification is performed by the Active node, it durably logs a record of the modification to an edit log file stored in the shared directory. The Standby node is constantly watching this directory for edits, and as it sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the shared storage before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.

It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the administrator must configure at least one fencing method for the shared storage. During a failover, if it cannot be verified that the previous Active node has relinquished its Active state, the fencing process is responsible for cutting off the previous Active's access to the shared edits storage. This prevents it from making any further edits to the namespace, allowing the new Active to safely proceed with failover.

More: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
Apache Hadoop NextGen MapReduce (YARN)

MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have, what we call, MapReduce 2.0 (MRv2) or YARN. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

The ResourceManager and per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
These screenshots are from a single-node CDH4 cluster, so any configuration that has not been completed here can be found in conf.empty.
You can still use fs.default.name
No Hadoop 2.0 in production. Below is the configuration of a live production cluster (CDH3):

For NameNode: RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 GB/s, OS: 32-bit CentOS.
For DataNode: RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 cores, Ethernet: 3 x 10 GB/s, OS: 32-bit CentOS.
For Secondary NameNode: RAM: 32 GB, Hard disk: 1 TB, Processor: Xeon with 4 cores, Ethernet: 3 x 10 GB/s, OS: 32-bit CentOS.

The above configuration deals with about 10-15 TB of data per customer on average, and the company has 3-4 customers who were using this functionality; we found that it was serving us well. This environment also performs some very complex queries by slicing and dicing the data.
To choose the right hardware, software and configuration for your production Hadoop cluster, you need to do your homework on the Hadoop distribution vendor (for example Cloudera or Apache Hadoop), the usage or workload pattern, the cluster size, and additional Hadoop ecosystem components such as HBase (in addition to the size of the data). Based on this analysis you need to decide the hardware for each server node.
Recommended reference guide: Hadoop Operations by Eric Sammer, http://www.amazon.com/Hadoop-Operations-Eric-Sammer/dp/1449327052.