19. Hadoop Distributed File System
Data Model:
• Data is organized into files and directories
• Files are divided into uniformly-sized blocks and
distributed across cluster nodes
• Blocks are replicated to handle hardware failure
• Filesystem keeps checksums of data for corruption
detection and recovery
• Read requests are served from the closest available replica where possible
• Not strictly POSIX-compliant
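The data model above can be sketched in a few lines. This is a toy illustration, not HDFS code: the block size, the replication factor of 3, the round-robin placement policy, the node names, and the helper names (`split_into_blocks`, `place_replicas`, `read_block`) are all assumptions made for the example, and CRC32 stands in for whatever checksum the filesystem actually uses.

```python
import zlib

BLOCK_SIZE = 8    # bytes; tiny for illustration (real blocks are far larger)
REPLICATION = 3   # assumed replication factor


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str],
                   replication: int = REPLICATION) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes (hypothetical round-robin policy)."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def read_block(replicas: dict[str, bytes], expected_crc: int,
               preferred: list[str]) -> bytes:
    """Try replicas in order of preference; skip any whose checksum does not match."""
    for node in preferred:
        block = replicas.get(node)
        if block is not None and zlib.crc32(block) == expected_crc:
            return block
    raise IOError("no healthy replica available")


data = b"hello distributed file system"
blocks = split_into_blocks(data)
checksums = [zlib.crc32(b) for b in blocks]
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c", "node-d"])

# Simulate block 0 stored on its three nodes, with the closest copy corrupted
# by a single flipped bit (an error CRC32 is guaranteed to detect).
stored = {node: blocks[0] for node in placement[0]}
closest = placement[0][0]
stored[closest] = bytes([blocks[0][0] ^ 0x01]) + blocks[0][1:]

# The read skips the corrupted copy and falls back to the next replica.
recovered = read_block(stored, checksums[0], preferred=placement[0])
```

The read path prefers the first replica in the list but falls back to another copy when the checksum does not match, which is the role replication and checksums play together in the bullets above.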
32. Map-Reduce Programming Model
• Programming model for processing lists of key/value pairs
• Map function: processes input key/value pairs and produces a set of
intermediate key/value pairs.
• Reduce function: merges all intermediate values associated with the same
intermediate key and produces output key/value pairs.
[Diagram: Input (k1, v1) → Map → Intermediate Output List(K2, V2) → Sort or Group by K2 → (K2, List(V2)) → Reduce → Output (K2, List(V3))]
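A word count is a common way to make the Map and Reduce signatures concrete. The sketch below runs entirely in memory in plain Python and does not use the Hadoop API; `map_fn`, `reduce_fn`, and `run_map_reduce` are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Iterator


def map_fn(key: int, value: str) -> Iterator[tuple[str, int]]:
    """Map: (k1, v1) = (line number, line text) -> pairs (k2, v2) = (word, 1)."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key: str, values: list[int]) -> Iterator[tuple[str, int]]:
    """Reduce: (k2, list(v2)) -> (k2, v3), here the total count for a word."""
    yield key, sum(values)


def run_map_reduce(records, mapper, reducer):
    # Map phase: produce the intermediate list of (k2, v2) pairs.
    intermediate = [pair for k1, v1 in records for pair in mapper(k1, v1)]
    # Group phase: collect every v2 under its k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: one call per distinct k2, visited in sorted key order.
    return [out for k2 in sorted(groups) for out in reducer(k2, groups[k2])]


lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "the fox")]
counts = run_map_reduce(lines, map_fn, reduce_fn)
# counts == [("brown", 1), ("dog", 1), ("fox", 2), ("lazy", 1), ("quick", 1), ("the", 3)]
```

Only `map_fn` and `reduce_fn` are specific to the problem; the grouping step in the middle is generic, which is why a framework can take it over.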
33. Application Writer Specifies:
• Map and Reduce classes
• Input data on HDFS
• Input/Output format classes (optional)
Workflow:
• Input phase generates a number of logical FileSplits from input files
• One Map task is created per logical file split
• Each Map task loads Map class and executes map function to transform
input kv-pairs into a new set of kv-pairs
• Record reader class, supplied as part of InputFormat, reads an input record
as a k-v pair
• Map output is stored on local disk in sorted partitions, one per
Reduce task
• One invocation of map function per k-v pair from an associated input
split
• Each Reduce task fetches map output (from its associated partition) as
soon as map task finishes its processing
• Map outputs are merged
• One invocation of reduce function per distinct key and its associated
list of values
• Output k-v pairs are stored on HDFS, one file per reduce task
• Framework handles task scheduling and recovery.
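The workflow steps can be traced with a small single-process simulation. It is a sketch under stated assumptions, not the framework's implementation: two reduce tasks, splits defined by line count, a CRC32-based partitioner (chosen only to keep the example deterministic), and in-memory lists standing in for local disk and HDFS. All function names are hypothetical.

```python
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # assumed number of reduce tasks


def make_splits(lines, lines_per_split):
    """Input phase: cut the input into logical splits of (offset, line) records."""
    records = list(enumerate(lines))
    return [records[i:i + lines_per_split]
            for i in range(0, len(records), lines_per_split)]


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick the reduce task responsible for a key (assumed partitioning rule)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(split, map_fn):
    """One map task: call map_fn once per record, then keep one sorted partition per reducer."""
    partitions = [[] for _ in range(NUM_REDUCERS)]
    for key, value in split:
        for k2, v2 in map_fn(key, value):
            partitions[partition(k2)].append((k2, v2))
    return [sorted(p, key=itemgetter(0)) for p in partitions]


def run_reduce_task(reducer_id, map_outputs, reduce_fn):
    """One reduce task: fetch its partition from every map task, merge, reduce per key."""
    fetched = [pair for output in map_outputs for pair in output[reducer_id]]
    merged = sorted(fetched, key=itemgetter(0))
    result = []
    for k2, group in groupby(merged, key=itemgetter(0)):
        result.extend(reduce_fn(k2, [v for _, v in group]))
    return result


def word_map(_, line):
    return [(word, 1) for word in line.split()]


def word_reduce(word, counts):
    return [(word, sum(counts))]


lines = ["a b a", "b c", "a c c", "d"]
splits = make_splits(lines, lines_per_split=2)  # 2 splits -> 2 map tasks
map_outputs = [run_map_task(s, word_map) for s in splits]
# One output "file" per reduce task, named after the part files in the diagram.
output = {f"part-{r}": run_reduce_task(r, map_outputs, word_reduce)
          for r in range(NUM_REDUCERS)}
```

Every distinct key lands in exactly one part file because the partitioner is a pure function of the key. Task scheduling and recovery, which the framework handles, are left out of the sketch.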
[Diagram: Parallel Execution Model for Map-Reduce. The input HDFS file is divided into Input Splits 0, 1, and 2, which are processed by Map 0, Map 1, and Map 2. Each map task writes sorted partitions covering the key ranges K1..m and Km+1…N. In the shuffle, Reduce 0 and Reduce 1 each fetch the partitions for their key range, merge and sort them, and write Output Part-0 and Output Part-1.]
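Because each map task already writes its partitions in sorted order, the Merge & Sort step on the reduce side can be a streaming k-way merge instead of a full sort. The sketch below uses Python's `heapq.merge` to show the idea; the sample data and variable names are made up for illustration, and this is not a description of how the framework implements the merge.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Sorted partitions fetched by one reduce task, one from each map task (made-up data).
from_map_0 = [("apple", 1), ("cat", 2)]
from_map_1 = [("apple", 3), ("bird", 1)]
from_map_2 = [("bird", 4), ("cat", 1), ("cat", 5)]

# heapq.merge streams the already-sorted runs into one sorted sequence
# without re-sorting them or loading everything at once.
merged = heapq.merge(from_map_0, from_map_1, from_map_2, key=itemgetter(0))

# Group adjacent pairs by key to build (K2, List(V2)) for the reduce function.
grouped = [(key, [value for _, value in pairs])
           for key, pairs in groupby(merged, key=itemgetter(0))]
```

Grouping works with a single pass here only because the merged stream is sorted by key, so all pairs for one key arrive next to each other.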