2. Some Numbers......
# 1k => 1000 bytes
# 1kb => 1024 bytes
# 1m => 1000000 bytes
# 1mb => 1024*1024 bytes
# 1g => 1000000000 bytes # 1gb => 1024*1024*1024 bytes
# 1T => 1000000000000 bytes #Tb =>1024*1024*1024*1024 bytes
# 1Petabytes , Exabytes, zettabytes... etc
Max data in memory (RAM): 64GB
Max data per computer (disk): 24TB
Data processed by Google every month: 400PB… in 2007
Average job size: 180GB
Time: 180GB of data would take to read sequentially off a single disk drive: approximately 45
minutes
15/12/13
3. Data Access Speed is the Bottleneck
We can process data very quickly, but we can only read/write
it very slowly
Solution: parallel reads
– 1 HDD = 75MB/sec
– 1,000 HDDs = 75GB/sec
– Far more acceptable
15/12/13
4. Moving to a Cluster of Machines
* In the late 1990s, Google decided to design its architecture using
clusters of low-cost machines
– Rather than fewer, more powerful machines
* Creating an architecture around low-cost, unreliable hardware
presents a number of challenges
15/12/13
5. System Requirements
* System should support partial failure
* System should support data recoverability
* System should be consistent
* System should be scalable
15/12/13
6. Hadoop's Origins
Google created an architecture which answers these (and other)
requirements
Released two White Papers
1. 2003: Description of the Google File System (GFS)
– A method for storing data in a distributed, reliable fashion
2. 2004: Description of distributed MapReduce
– A method for processing data in a parallel fashion
15/12/13
9. HDFS Features
* Operates ‘on top of’ an existing filesystem
* Files are stored as ‘blocks’
– Much larger than for most filesystems
– Default is 64MB
* Provides reliability through replication
– Each block is replicated across multiple DataNodes
– Default replication factor is 3
* Single NameNode daemon stores metadata and co-ordinates access
– Provides simple, centralized management
* Blocks are stored on slave nodes
– Running the DataNode daemon
15/12/13
12. The NameNode
The NameNode stores all metadata
– Information about file locations in HDFS
– Information about file ownership and permissions
– Names of the individual blocks
– Locations of the blocks
Metadata is stored on disk and read when the NameNode
daemon starts up
– Filename is fsimage
When changes to the metadata are required, these are made in
RAM
– Changes are also written to a log file on disk called edits
– Full details later
13. The NameNode: Memory Allocation
When the NameNode is running, all meta data is held in RAM
for fast response
Each ‘item’ consumes 150-200 bytes of RAM
Items:
– Filename, permissions, etc.
– Block information for each block
14. The NameNode: Memory Allocation
Why HDFS prefers fewer, larger files:
– Consider 1GB of data, HDFS block size 128MB
– Stored as 1 x 1GB file
– Name: 1 item
– Blocks: 8 x 3 = 24 items
– Total items: 25
– Stored as 1000 x 1MB files
– Names: 1000 items
– Blocks: 1000 x 3 = 3000 items
– Total items: 4000
15. The Slave Nodes
Actual contents of the files are stored as blocks on the slave nodes
Blocks are simply files on the slave nodes’ underlying filesystem
– Named blk_xxxxxxx
– Nothing on the slave node provides information about what
underlying file the block is a part of
– That information is only stored in the NameNode’s metadata
Each block is stored on multiple different nodes for redundancy
– Default is three replicas
Each slave node runs a DataNode daemon
– Controls access to the blocks
– Communicates with the NameNode
17. The Secondary NameNode:
The Secondary NameNode is not a failover NameNode!
– It performs memory-intensive administrative functions for the
NameNode
– NameNode keeps information about files and blocks (the
metadata) in memory
– NameNode writes metadata changes to an editlog
– Secondary NameNode periodically combines a prior
filesystem snapshot and editlog into a new snapshot
– New snapshot is transmitted back to the NameNode
Secondary NameNode should run on a separate machine in a
large installation
– It requires as much RAM as the NameNode
19. Anatomy of a File Write
1. Client connects to the NameNode
2. NameNode places an entry for the file in its metadata, returns the block
name and list of DataNodes to the client
3. Client connects to the first DataNode and starts sending data
4. As data is received by the first DataNode, it connects to the second and
starts sending data
5. Second DataNode similarly connects to the third
6. ack packets from the pipeline are sent back to the client
7. Client reports to the NameNode when the block is written
21. Anatomy of a File Read
Client connects to the NameNode
NameNode returns the name and locations of the first few blocks of
the file
– Block locations are returned closest-first.
Client connects to the first of the DataNodes, and reads the block
If the DataNode fails during the read, the client will seamlessly
connect to the next one in the list to read the block
22. The NameNode Is Not A Bottleneck
Note: the data never travels via the NameNode
– For writes
– For reads
– During re-replication
23. Dealing With Data Corruption
As the DataNode is reading the block, it also calculates the checksum.
‘Live’ checksum is compared to the checksum created when the block
was stored.
If they differ, the client reads from the next DataNode in the list
– The NameNode is informed that a corrupted version of the block
has been found.
– The NameNode will then re-replicate that block elsewhere.
The DataNode verifies the checksums for blocks on a regular basis to
avoid ‘bit rot’
– Default is every three weeks after the block was created
24. Data Reliability and Recovery
DataNodes send heartbeats to the NameNode
– Every three seconds
After a period without any heartbeats, a DataNode is assumed
to be lost
– NameNode determines which blocks were on the lost node.
– NameNode finds other DataNodes with copies of these blocks.
– These DataNodes are instructed to copy the
blocks to other nodes.
– Three-fold replication is actively maintained.
26. Hadoop is ‘Rack-aware’
Hadoop understands the concept of ‘rack awareness’
– The idea of where nodes are located, relative to one another
– Helps the JT to assign tasks to nodes closest to the data
– Helps the NN determine the ‘closest’ block to a client during reads
– In reality, this should perhaps be described as being ‘switchaware’
HDFS replicates data blocks on nodes on different racks
– Provides extra data security in case of catastrophic hardware
failure
Rack-awareness is determined by a user-defined script
29. HDFS File Permissions
Files in HDFS have an owner, a group, and permissions
– Very similar to Unix file permissions
HDFS permissions are designed to stop good people doing
foolish things
30. What Is MapReduce?
MapReduce is a method for distributing a task across multiple nodes
Each node processes data stored on that node
Consists of two developer-created phases
– Map
– Reduce
In between Map and Reduce is the shuffle and sort
– Sends data from the Mappers to the Reducers
32. MapReduce: Basic Concepts
Each Mapper processes a single input split from HDFS
Hadoop passes the developer’s Map code one record at a time
Each record has a key and a value
Intermediate data is written by the Mapper to local disk
During the shuffle and sort phase, all the values associated
with the same intermediate key are transferred to the same
Reducer
– The developer specifies the number of Reducers
Reducer is passed each key and a list of all its values
– Keys are passed in sorted order
Output from the Reducers is written to HDFS
36. Some MapReduce Terminology
* A user runs a client program on a client computer
* The client program submits a job to Hadoop
– The job consists of a mapper, a reducer, and a list of inputs
* The job is sent to the JobTracker
* Each Slave Node runs a process called the TaskTracker
* The JobTracker instructs TaskTrackers to run and monitor tasks
– A Map or Reduce over a piece of data is a single task
* A task attempt is an instance of a task running on a slave node
– Task attempts can fail, in which case they will be restarted (more later)
– There will be at least as many task attempts as there are tasks which need
be performed
15/12/13
to
37. Aside: The Job Submission Process
When a job is submitted, the following happens:
– The client requests and receives a new unique Job ID from the JobTracker (includes
JobTracker start time and a sequence number)
– The client calculates the input splits for the job
– How the input data will be split up between Mappers
– The client turns the job configuration information into an XML file
– The client places the XML file and the job jar into a temporary
directory in HDFS (the Job ID is included in the path)
– The client contacts the JobTracker with the location of the XML
and jar files, and the list of input splits
– The JobTracker takes over the job from this point on
15/12/13
39. MapReduce Failure Recovery
Task processes send heartbeats to the TaskTracker
TaskTrackers send heartbeats to the JobTracker
Any task that fails to report in 10 minutes is assumed to have failed
– Its JVM is killed by the TaskTracker
Any task that throws an exception is said to have failed
Failed tasks are reported to the JobTracker by the TaskTracker
The JobTracker reschedules any failed tasks
– It tries to avoid rescheduling the task on the same TaskTracker where it previously
failed
If a task fails four times, the whole job fails
15/12/13
40. MapReduce Failure Recovery
Any TaskTracker that fails to report in 10 minutes is assumed to have crashed
– All tasks on the node are restarted elsewhere
– Any TaskTracker reporting a high number of failed tasks is
blacklisted, to prevent the node from blocking the entire job
– There is also a ‘global blacklist’, for TaskTrackers which fail on
multiple jobs.
The JobTracker manages the state of each job
– Partial results of failed tasks are ignored
15/12/13
41. The Apache Hadoop Project
Hadoop is a ‘top-level’ Apache project
– Created and managed under the auspices of the Apache Software Foundation
Several other projects exist that rely on some or all of Hadoop
– Typically either both HDFS and MapReduce, or just HDFS
Ecosystem projects are often also top-level Apache projects
– Some are ‘Apache incubator’ projects
– Some are not managed by the Apache Software Foundation
Ecosystem projects include Hive, Pig, Sqoop, Flume, HBase,Oozie, …
15/12/13
42. Hive
Hive is a high-level abstraction on top of MapReduce
– Initially created by a team at Facebook
– Avoids having to write Java MapReduce code
– Data in HDFS is queried using a language very similar to SQL
– Known as HiveQL
HiveQL queries are turned into MapReduce jobs by the Hive
interpreter
– ‘Tables’ are just directories of files stored in HDFS
– A Hive Metastore contains information on how to map a file to a table
structure
15/12/13
43. Planning Your Hadoop Cluster
* What issues to consider when planning your Hadoop cluster
1. What types of hardware are typically used for Hadoop
nodes
2. How to optimally configure your network topology
3. How to select the right operating system and Hadoop
distribution
44. Cluster Growth Based on Storage Capacity
• Basing your cluster growth on storage capacity is often a
good method to use
• Example:
– Data grows by approximately 1TB per week
– HDFS set up to replicate each block three times
– Therefore, 3TB of extra storage space required per week
– Plus some overhead – say, 30%
– Assuming machines with 4 x 1TB hard drives, this equates to a
new machine required each week
– Alternatively: Two years of data – 100TB – will require
approximately 100 machines
45. Classifying Nodes
• Nodes can be classified as either ‘slave nodes’ or ‘master
nodes’
• Slave node runs DataNode plus TaskTracker daemons
• Master node runs either a NameNode daemon, a Secondary
NameNode Daemon, or a JobTracker daemon
– On smaller clusters, NameNode and JobTracker are often run
on the same machine
– Sometimes even Secondary NameNode is on the same
machine as the NameNode and JobTracker
– Important that at least one copy of the NameNode’s metadata is
stored on a separate machine (see later)
46. Slave Nodes: Recommended Configuration
• Typical ‘base’ configuration for a slave Node
– 4 x 1TB or 2TB hard drives, in a JBOD* configuration
– Do not use RAID! (See later)
– 2 x Quad-core CPUs
– 24-32GB RAM
– Gigabit Ethernet
• Multiples of (1 hard drive + 2 cores + 6-8GB RAM) tend to
work well for many types of applications
– Especially those that are I/O bound
47. Slave Nodes: More Details (CPU)
• Quad-core CPUs are now standard
• Hex-core CPUs are becoming more prevalent
– But are more expensive
• Hyper-threading should be enabled
• Hadoop nodes are seldom CPU-bound
– They are typically disk- and network-I/O bound
– Therefore, top-of-the-range CPUs are usually not necessary
48. Slave Nodes: More Details (RAM)
• Slave node configuration specifies the maximum number of Map
and Reduce tasks that can run simultaneously on that node
• Each Map or Reduce task will take 1GB to 2GB of RAM
• Slave nodes should not be using virtual memory
• Ensure you have enough RAM to run all tasks, plus overhead
for the DataNode and TaskTracker daemons, plus the operating
system
• Rule of thumb:
Total number of tasks = 1.5 x number of processor cores
-- This is a starting point, and should not be taken as a definitive
setting for all clusters
49. Slave Nodes: More Details (Disk)
• In general, more spindles (disks) is better
• In practice, we see anywhere from four to 12 disks per node
• Use 3.5" disks
– Faster, cheaper, higher capacity than 2.5" disks
• 7,200 RPM SATA drives are fine
– No need to buy 15,000 RPM drives
• 8 x 1.5TB drives is likely to be better than 6 x 2TB drives
– Different tasks are more likely to be accessing different disks
• A good practical maximum is 24TB per slave node
– More than that will result in massive network traffic if a node dies
and block re-replication must take place
50. Slave Nodes: Why Not RAID?
• Slave Nodes do not benefit from using RAID* storage
– HDFS provides built-in redundancy by replicating blocks across
multiple nodes
– RAID striping (RAID 0) is actually slower than the JBOD
configuration used by HDFS
– RAID 0 read and write operations are limited by the speed of
the slowest disk in the RAID array
– Disk operations on JBOD are independent, so the average
speed is greater than that of the slowest disk
– One test by Yahoo showed JBOD performing between 10%
and 30% faster than RAID 0, depending on the operations being
performed
51. What About Virtualization?
• Virtualization is usually not worth considering
– Multiple virtual nodes per machine hurts performance
– Hadoop runs optimally when it can use all the disks at once
What About Blade Servers?
Blade servers are not recommended
– Failure of a blade chassis results in many nodes being
unavailable
– Individual blades usually have very limited hard disk capacity
– Network interconnection between the chassis and top-of-rack
switch can become a bottleneck
52. Master Nodes: Single Points of Failure
• Slave nodes are expected to fail at some point
– This is an assumption built into Hadoop
– NameNode will automatically re-replicate blocks that were on
the failed node to other nodes in the cluster, retaining the 3x
replication requirement
– JobTracker will automatically re-assign tasks that were running
on failed nodes
• Master nodes are single points of failure
– If the NameNode goes down, the cluster is inaccessible
– If the JobTracker goes down, no jobs can run on the cluster
– All currently running jobs will fail
• Spend more money on your master nodes!
53. Master Node Hardware Recommendations
• Carrier-class hardware
– Not commodity hardware
• Dual power supplies
• Dual Ethernet cards
– Bonded to provide failover
• RAIDed hard drives
• At least 32GB of RAM
54. General Network Considerations
• Hadoop is very bandwidth-intensive!
– Often, all nodes are communicating with each other at the same
time
• Use dedicated switches for your Hadoop cluster
• Nodes are connected to a top-of-rack switch
• Nodes should be connected at a minimum speed of 1Gb/sec
• For clusters where large amounts of intermediate data is
generated, consider 10Gb/sec connections
– Expensive
– Alternative: bond two 1Gb/sec connections to each node
55. General Network Considerations (cont’d)
• Racks are interconnected via core switches
• Core switches should connect to top-of-rack switches at
10Gb/ sec or faster
• Beware of over-subscription in top-of-rack and core
switches
• Consider bonded Ethernet to mitigate against failure
• Consider redundant top-of-rack and core switches
56. Operating System Recommendations
• Choose an OS you’re comfortable administering
• CentOS: geared towards servers rather than individual
workstations
– Conservative about package versions
– Very widely used in production
• RedHat Enterprise Linux (RHEL): RedHat-supported analog
to CentOS
– Includes support contracts, for a price
• In production, we often see a mixture of RHEL and CentOS
machines
– Often RHEL on master nodes, CentOS on slaves
57. Configuring The System
• Do not use Linux’s LVM (Logical Volume Manager) to make
all your disks appear as a single volume
• – As with RAID 0, this limits speed to that of the slowest disk
• Check the machines’ BIOS* settings
– BIOS settings may not be configured for optimal performance
– For example, if you have SATA drives make sure IDE emulation
is not enabled
• Test disk I/O speed with hdparm -t
– Example:
hdparm -t /dev/sda1
– You should see speeds of 70MB/sec or more
Anything less is an indication of possible problems
58. Configuring The System
Hadoop has no specific disk partitioning requirements
– Use whatever partitioning system makes sense to you
Mount disks with the noatime option
Common directory structure for data mount points:
/data/<n>/dfs/nn
/data/<n>/dfs/dn
/data/<n>/dfs/snn
/data/<n>/mapred/local
Reduce the swappiness of the system
– Set vm.swappiness to 0 or 5 in /etc/sysctl.conf
15/12/13
59. Filesystem Considerations
• Cloudera recommends the ext3 and ext4 filesystems
– ext4 is now becoming more commonly used
• XFS provides some performance benefit during kickstart
– It formats in 0 seconds, vs several minutes for each disk with
ext3
• XFS has some performance issues
– Slow deletes in some versions
– Some performance improvements are available; see e.g.,
http://everything2.com/index.pl?node_id=1479435
– Some versions had problems when a machine runs out of
memory
60. Operating System Parameters
• Increase the nofile ulimit for the mapred and hdfs users to
at least 32K
– Setting is in /etc/security/limits.conf
• Disable IPv6
• Disable SELinux
• Install and configure the ntp daemon
– Ensures the time on all nodes is synchronized
– Important for HBase
– Useful when using logs to debug problems
61. Java Virtual Machine (JVM) Recommendations
• Always use the official Oracle JDK (http://java.com/)
– Hadoop is complex software, and often exposes bugs in other
JDK implementations
• Version 1.6 is required
– Avoid 1.6.0u18
– This version had significant bugs
• Hadoop is not yet production-tested with Java 7 (1.7)
• Recommendation: don’t upgrade to a new version as soon
as it is released
– Wait until it has been tested for some time
63. For easy installation
Cloudera has released Cloudera Manager (CM), a tool for easy
deployment and configuration of Hadoop clusters
The free version, Cloudera Manager Free Edition, can manage
up to 50 nodes
– The version supplied with Cloudera Enterprise supports an unlimited
number of nodes
66. Hadoop's Configuration Files
Each machine in the Hadoop cluster has its own set of
configuration files
Configuration files all reside in Hadoop’s conf directory
– Typically /etc/hadoop/conf
Primary configuration files are written in XML
69. hdfs-site.xml
The single most important configuration value on your entire cluster, set on
the NameNode:
* Loss of the NameNode’s metadata will result in the effective loss of all the
data on the cluster
– Although the blocks will remain, there is no way of reconstructing the
original files without the metadata
* This must be at least two disks (or a RAID volume) on the NameNode,
plus an NFS mount elsewhere on the network
– Failure to set this correctly will result in eventual loss of your cluster’s
data
71. Additional Configuration Files
There are several more configuration files in /etc/hadoop/conf
– hadoop-env.sh: environment variables for Hadoop daemons
– HDFS and MapReduce include/exclude files
* Controls who can connect to the NameNode and JobTracker
– masters, slaves: hostname lists for ssh control
– hadoop-policy.xml: Access control policies
– log4j.properties: logging (covered later in the course)
– fair-scheduler.xml: Scheduler (covered later in the course)
– hadoop-metrics.properties: Monitoring (covered later in
the course)
72. Environment Setup: hadoop-env.sh
HADOOP_HEAPSIZE
– Controls the heap size for Hadoop daemons
– Default 1GB
– Comment this out, and set the heap for individual daemons
HADOOP_NAMENODE_OPTS
– Java options for the NameNode
– At least 4GB: -Xmx4g
HADOOP_JOBTRACKER_OPTS
– Java options for the JobTracker
– At least 4GB: -Xmx4g
HADOOP_DATANODE_OPTS, HADOOP_TASKTRACKER_OPTS
– Set to 1GB each: -Xmx1g
15/12/13
73. Host 'include' and 'exclude' Files
Optionally, specify dfs.hosts in hdfs-site.xml to point to a file
listing hosts which are allowed to connect to the NameNode
and act as DataNodes
– Similarly, mapred.hosts points to a file which lists hosts allowedto
connect as TaskTrackers
Both files are optional
– If omitted, any host may connect and act as a DataNode/
TaskTracker
– This is a possible security/data integrity issue
NameNode can be forced to reread the dfs.hosts file with
hadoop dfsadmin -refreshNodes
– No such command for the JobTracker, which has to be restarted
to re-read the mapred.hosts file, so many System Administrators only
create a dfs.hosts file
76. Displaying All Jobs
• To display all jobs including completed jobs, use
# hadoop job -list all
77. Killing a Job
• It is important to note that once a user has submitted a job,
they can not stop it just by hitting CTRL-C on their terminal
– This stops job output appearing on the user’s console
– The job is still running on the cluster!
78. Killing a Job
To kill a job use hadoop job -kill <job_id>
15/12/13