NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed File System
1. NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed File System
Hieu Hanh Le, Satoshi Hikida and Haruo Yokota
Tokyo Institute of Technology
Appeared in DASFAA 2013
The 18th International Conference on Database Systems for Advanced Applications (Wuhan, China)
2. Agenda
Background
Research Motivation
Goal and Approach
Proposals
Experimental Evaluation
Conclusion
3. Background
Hadoop Distributed File System (HDFS) is widely used as data storage for applications in the Cloud
Commercial off-the-shelf-based system
Supports the MapReduce framework
Good scalability
Uses a huge number of DataNodes to store the huge amounts of data requested by data-intensive applications
This increases the power consumption of the storage system
Power-aware file systems are moving towards power-proportional design
4. [Background]
Power-proportional Storage System
A system should consume energy in proportion to the amount of work performed [Barroso and Hölzle, 2007]
The system's operation is set to multiple gears, each containing a different number of DataNodes
Made possible by data placement methods
[Figure: In High Gear, all four nodes are active and store blocks D1–D4; in Low Gear, only a subset of nodes stays active, and blocks D1 and D4 are migrated onto the remaining active nodes.]
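The gear idea above can be sketched in a few lines. This is a minimal illustration with hypothetical names and a round-robin migration policy, not the data placement method of any of the cited systems: blocks whose node is powered down in the lower gear must be migrated to a node that stays active.

```python
# Hypothetical block placement: one block per node, as in the slide's figure.
placement = {"D1": "Node1", "D2": "Node2", "D3": "Node3", "D4": "Node4"}

def shift_down(placement, active_nodes):
    """Return a placement in which every block lives on an active node."""
    new_placement = {}
    targets = sorted(active_nodes)
    for i, (block, node) in enumerate(sorted(placement.items())):
        if node in active_nodes:
            new_placement[block] = node  # block already on an active node
        else:
            # migrate to one of the still-active nodes (round-robin here)
            new_placement[block] = targets[i % len(targets)]
    return new_placement

low_gear = shift_down(placement, {"Node2", "Node3"})
print(low_gear)  # D1 and D4 have migrated; D2 and D3 stay put
```

Shifting back up is the situation the rest of the talk addresses: any blocks updated while in the low gear must be re-transferred to their intended nodes.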
5. Research Motivation
Gear-shifting is vital in a power-proportional system
To guarantee higher performance, the system must reflect data that was updated while in a lower gear
The updated data is re-transferred according to the data placement
The gear-shifting process in current HDFS-based methods [Rabbit, Sierra] is inefficient
Bottleneck in metadata access
High communication cost among nodes
Rabbit: Robust and Flexible Power-proportional Storage, ACM SOCC 2010
Sierra: Practical Power-proportionality for Data Center Storage, ACM EuroSys 2011
6. Gear-shifting in current HDFS-based methods [1/10] (e.g., Rabbit, Sierra)
[Figure: A NameNode and four DataNodes. The dataset D = {D1, D2, D3, D4} is written while in Low Gear; Gear Up then brings the remaining DataNodes back online.]
15. Gear-shifting in current HDFS-based methods [10/10] (e.g., Rabbit, Sierra)
[Figure: Gear Up from Low Gear to High Gear for dataset D = {D1, D2, D3, D4}:
1. Access metadata at the NameNode to identify the updated blocks (congestion at the single NameNode)
2. Transfer the updated blocks
2.1 Command issuance from the NameNode
2.2 Block transfer, performed sequentially (1 block/connection): inefficiency]
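The inefficiency of the centralized scheme can be made concrete with a toy cost model. This is an illustrative sketch, not measurements from the paper: every updated block costs one metadata lookup at the single NameNode, one command issuance, and one dedicated transfer connection, so all three terms grow linearly with the number of updated blocks and all metadata traffic converges on one node.

```python
# Toy cost model for the centralized gear-up scheme (illustrative only):
# one lookup, one command, and one connection per updated block.
def centralized_gear_up_cost(updated_blocks):
    n = len(updated_blocks)
    return {
        "metadata_lookups": n,  # all hit the single NameNode
        "commands": n,          # one command issuance per block
        "connections": n,       # sequential transfer, 1 block/connection
    }

print(centralized_gear_up_cost(["D1", "D4"]))
# {'metadata_lookups': 2, 'commands': 2, 'connections': 2}
```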
16. Goal and Approach
Goal
Propose a novel architecture for efficient gear-shifting in a power-proportional HDFS
Approach
Utilize distributed metadata management (MDM)
Eliminates the bottleneck of centralized MDM
Couple NameNode and DataNode (NDCouplingHDFS)
Localizes the range of updated blocks maintained by metadata management
Reduces the communication cost among nodes
Enable multiple-block transfers to improve efficiency in HDFS
17. [Proposals]
Distributed MDM
Distribute the MDM over multiple nodes to decentralize the load during gear shifts
Requires a distributed MDM that is update conscious
The MDM is transferred when the system shifts gears
Low cost for search/insert/delete operations
A distributed-hash-table-based method is inefficient: the hash function must be applied for each transferred file
A range-based method is efficient: for a range of files, all the metadata can be transferred within a limited number of structure traversals
Two range-based methods are applied:
Each node statically maintains a separate subnamespace (Static Directory Partition, SDP)
A parallel index technique with good concurrency control (Fat-Btree) [*]
[*] A Concurrency Control Protocol for Parallel B-tree structure without
latch-coupling for explosively growing digital content, EDBT 2008
18. [Proposals]
NDCouplingHDFS with Distributed MDM
Each node maintains a subnamespace of the whole namespace of the system
The mapping information [Node, Range] is managed by the Distributed MDM
[Figure: Four NDCouplingHDFS nodes, each running a Distributed MDM and a Data Management component, with ranges ND1: [1, 10], ND2: [11, 20], ND3: [21, 30], ND4: [31, ~]. A request for file 25 is (1) sent to any node, (2) forwarded to the responsible node, (3) served there, and (4) the results are returned.]
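The [Node, Range] routing on this slide can be sketched with a sorted list of range boundaries. This is a minimal illustration (class and method names are hypothetical, not from the paper); it uses the exact ranges in the figure, so a request for file 25 routes to ND3:

```python
import bisect

class RangeMDM:
    """Toy range-partitioned metadata lookup (illustrative sketch)."""
    def __init__(self, boundaries, nodes):
        # boundaries[i] is the first key owned by nodes[i + 1]
        self.boundaries = boundaries
        self.nodes = nodes

    def lookup(self, key):
        # binary search over range boundaries, O(log n) per request
        return self.nodes[bisect.bisect_right(self.boundaries, key)]

# Ranges from the slide: ND1: [1,10], ND2: [11,20], ND3: [21,30], ND4: [31,~]
mdm = RangeMDM([11, 21, 31], ["ND1", "ND2", "ND3", "ND4"])
print(mdm.lookup(25))  # ND3, which owns [21, 30]
```

A hash-based scheme would need to recompute the placement of every file individually, whereas here the metadata for a whole contiguous range moves by updating one boundary entry.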
24. [Proposals]
Efficient Gear-shifting [6/6]
[Figure: Four coupled nodes, each running a Distributed MDM and a Data Management component; two nodes have just been reactivated. A WOL log of <File, Temp Node, Intended Node> entries records where each updated file temporarily resides. Gear-up steps on each reactivated node: 1. Transfer updated metadata; 2. Command issuance; 3. Transfer blocks; 4. Update metadata.]
The process is distributed over multiple nodes: parallelism
The command issuance from the Distributed MDM to Data Management is performed locally: reduced network cost
Updated blocks are transferred in batches (multiple blocks per connection): efficient block transfer
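The benefit of batching is easy to quantify with a toy calculation (illustrative numbers, not from the experiments): sequential transfer opens one connection per block, while batch transfer groups up to a fixed number of blocks per connection.

```python
def connections_needed(num_blocks, blocks_per_connection):
    """Connections required to move num_blocks (ceiling division)."""
    return -(-num_blocks // blocks_per_connection)

updated = 1000
print(connections_needed(updated, 1))    # sequential: 1000 connections
print(connections_needed(updated, 100))  # batched: 10 connections
```

With a per-connection setup cost, the batched scheme amortizes that cost over many blocks, which is the "efficient block transfer" effect on the slide.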
25. Experimental Evaluation
Experiment 1
Verify the effectiveness of the proposals in the gear-shifting process by comparison with normal HDFS
Updated block reflection is the major cost
Coupling architecture, batch block transfer
Experiment 2
Evaluate the effectiveness of the distributed index technique in NDCouplingHDFS
SDP and Fat-Btree, varying the number of nodes
26. [Experiment 1]
Validity of NDCouplingHDFS in Gear-shifting
Compare the execution time of updated data reflection in NDCouplingHDFS with that of normal HDFS under five configurations
Combinations of architecture, distributed MDM (SDP, Fat-Btree), command issuance, and block transfer
Environment:
# Gears: 2
# Active nodes at Low Gear: 8
# Active nodes at High Gear: 16
# Files: 16,000 (1 MB each)
HDFS version: 0.20.2
Maximum number of transferred blocks: 100
Heartbeat interval: 1 s
27. [Experiment 1]
Experimental Results
[Chart: execution time [s] and number of communication connections (command issuance) for the five configurations, annotated with reductions of 41% and 46%.]

Configuration      | Normal HDFS | SSS        | SBS        | SBB      | FBB
Architecture       | HDFS        | Coupling   | Coupling   | Coupling | Coupling
MDM                | Central     | SDP        | SDP        | SDP      | Fat-Btree
Command issuance   | Sequential  | Sequential | Batch      | Batch    | Batch
Block transference | Sequential  | Sequential | Sequential | Batch    | Batch

The coupling architecture and batch block transfer most strongly affected performance
28. [Experiment 2]
Scalability of metadata operations
Evaluate SDP vs. Fat-Btree
Vary the number of files and the number of nodes
Environment:
Machines: 1, 2, 4, 8
CPU: TM8600 1.0 GHz
Memory: 4 GB DRAM
NIC: 1000 Mb/s
OS: Linux 3.0, 64-bit
Java: JDK 1.7.0
Fat-Btree fanout: 16
Concurrency control: LCFB [Yoshihara, 2007]
Workload: 3,000 files, 1 MB each
29. [Experiment 2]
Experimental Results
Fat-Btree gained better scalability as the number of nodes increases
The read throughput scaled well thanks to lower search cost and better concurrency control
The gain in write throughput is limited by the synchronization cost of updating the tree structure
[Charts: read throughput [operations/s] and write throughput [operations/s] for SDP vs. Fat-Btree at 1, 2, 4, and 8 nodes; a transaction opens/creates metadata and reads/writes files.]
30. Conclusion
Proposed NDCouplingHDFS for efficient gear-shifting in a power-proportional HDFS
Significantly reduced the execution time of reflecting updated data, by up to 46% compared with normal HDFS
Coupling architecture and batch block transfer
Improved the I/O performance by applying a distributed index technique to NDCouplingHDFS
NDCouplingHDFS
Maintains support for MapReduce
Expected to achieve real power-proportionality, including the power consumption of metadata management
31. NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed File System
Thank you for your attention!