André Oriani
Islene Calciolari Garcia
Institute of Computing – University of Campinas, Brazil
FROM BACKUP TO HOT STANDBY:
HIGH AVAILABILITY FOR HDFS
AGENDA
• Motivation;
• Architecture of HDFS 0.21;
• Implementation of Hot Standby Node;
• Experiments and Results;
• High Availability features on HDFS 2.0.0-alpha;
• Conclusions and Future Work.
MOTIVATION
CLUSTER-BASED PARALLEL FILE SYSTEMS
Master-Slaves Architecture:
• Metadata Server – serves clients, manages the namespace.
• Storage Servers – store the data.
Centralized system → the Metadata Server is a SPOF
Importance of a Hot Standby for the metadata server of HDFS
Cold start of a 2000-node HDFS cluster with 21PB and 150 million files
takes ~ 45min [8].
HADOOP DISTRIBUTED FILE SYSTEM
(HDFS) 0.21
[Architecture diagram: Client, NameNode, Backup Node, and three DataNodes.]
DATANODES
Storage Nodes
• Files are split into equal-sized blocks.
• Blocks are replicated to DataNodes.
• Send status messages to NameNode:
• Heartbeats;
• Block-Reports;
• Block-Received.
[Diagram: DataNodes send status messages to the NameNode.]
NAMENODE
[Diagram: Clients send requests to the NameNode and receive metadata; DataNodes send heartbeats, block-reports, and block-received messages and receive commands.]
Metadata Server
• Manages the file system tree.
• Handles metadata requests.
• Controls access and leases.
• Manages Blocks:
• Allocation;
• Location;
• Replication levels.
NAMENODE’S STATE
[Diagram: NameNode state (File System Tree, Block Management, Leases) with a LOG used for journaling.]
Journaling
• All state is kept in RAM for
better performance.
• Changes to namespace
are recorded to log.
• Lease and Block
information is volatile.
BACKUP NODE
[Diagram: the NameNode streams journal entries to the Backup Node, its checkpoint helper, and receives checkpoints back.]
• NameNode streams changes to Backup Node.
• Efficient checkpoint strategy: apply changes to its own state.
• Checkpoint Backup’s state == Checkpoint NameNode’s state.
A HOT STANDBY FOR HDFS 0.21
• Backup Node: Opportunity
• Already replicates namespace state.
• NameNode’s subclass → can process client requests.
• Evolving the Backup Node into Hot Standby Node:
1. Handle the missing state components:
1. Replica locations
2. Leases
2. Detect NameNode’s Failure.
3. Switch the Hot Standby Node to active state (failover).
4. Disseminate current metadata server information.
MISSING STATE: REPLICA LOCATIONS
Reuses the AvatarNodes’ strategy:
• DataNodes send their status messages (heartbeats, block-reports, block-received) to the Hot Standby Node too.
• No rigid sync:
• DataNode failures are relatively common. Stable clusters: 2-3 failures per day in 1,000 nodes [3].
• The Hot Standby Node is kept in safe mode. It becomes the authority once it is active.
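To make the message-duplication idea concrete, here is a rough Java sketch under stated assumptions: the MetadataServerProtocol interface and its methods are hypothetical stand-ins for the DataNode-internal RPC code, not actual Hadoop classes. It illustrates that the DataNode reports to both servers, and a failure to reach the Hot Standby is tolerated rather than synchronized on.

```java
import java.util.List;

public class DualReportSketch {

    // Hypothetical stand-in for the DataNode -> metadata-server RPC interface.
    interface MetadataServerProtocol {
        void blockReceived(String dataNodeId, List<Long> blockIds) throws Exception;
    }

    private final MetadataServerProtocol nameNode;
    private final MetadataServerProtocol hotStandby;

    DualReportSketch(MetadataServerProtocol nameNode, MetadataServerProtocol hotStandby) {
        this.nameNode = nameNode;
        this.hotStandby = hotStandby;
    }

    void reportNewBlocks(String dataNodeId, List<Long> blockIds) throws Exception {
        // The active NameNode must receive the report; an error here is a real failure.
        nameNode.blockReceived(dataNodeId, blockIds);

        // The Hot Standby receives the same report, but with no rigid synchronization:
        // if it is unreachable, the DataNode simply carries on and reports again later.
        try {
            hotStandby.blockReceived(dataNodeId, blockIds);
        } catch (Exception e) {
            System.err.println("Hot Standby unreachable, will report again later: " + e.getMessage());
        }
    }
}
```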
MISSING STATE: LEASES
Not replicated:
• Blocks are only recorded to the log when the file is closed.
• So any write in progress is lost if the NameNode fails.
• Restarting the write will create a new lease.
[Sequence diagram: Client → NameNode: open(file), getAdditionalBlock() ×3, close(file); NameNode → LOG: open(file), complete(file,blocks), addLease(file,client).]
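As a concrete view of the write path that produces these log records, here is a minimal client-side sketch using the public Hadoop FileSystem API; the NameNode address, port, and file path are placeholders, and the comments map each call to the operations in the sequence above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address; normally this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);       // client connects to the NameNode
        Path file = new Path("/tmp/example.txt");   // hypothetical path

        // create() corresponds to open(file): the NameNode grants this client a lease.
        try (FSDataOutputStream out = fs.create(file)) {
            // Whenever a new block is needed, the client asks the NameNode for one
            // (the getAdditionalBlock() calls in the diagram).
            out.writeUTF("hello HDFS");
        } // close() corresponds to close(file): complete(file, blocks) is finally logged.

        fs.close();
    }
}
```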
FAILURE DETECTION
Uses ZooKeeper:
• Highly available distributed coordination service.
• Keeps a tree of znodes replicated among the ensemble.
• All operations are atomic.
• Leaf nodes may be ephemeral.
• Clients can watch znodes.
[Diagram: the NameNode and the Hot Standby Node register under a Namenodes group znode in the ZooKeeper ensemble; the group znode stores the active’s IP, which the Client reads.]
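As an illustration of how ephemeral znodes and watches support this scheme, here is a minimal ZooKeeper sketch; the znode paths (/namenodes and the member names), the ensemble address, and the session timeout are assumptions for illustration, not the paper's actual values.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class FailureDetectionSketch {

    // Called by a metadata server (NameNode or Hot Standby) when it starts.
    static void joinGroup(ZooKeeper zk, String name, String address) throws Exception {
        byte[] addr = address.getBytes(StandardCharsets.UTF_8);
        if (zk.exists("/namenodes", false) == null) {
            // Group znode; its data holds the active server's network address.
            zk.create("/namenodes", addr, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // Ephemeral member znode: removed automatically when this server's session expires.
        zk.create("/namenodes/" + name, addr, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Called by the Hot Standby to be told when the active NameNode's znode disappears.
    static void watchActive(ZooKeeper zk, String activeName, Runnable onFailure) throws Exception {
        zk.exists("/namenodes/" + activeName, event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                onFailure.run();   // session expired -> start the failover procedure
            }
        });
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical ensemble address; 120 s mirrors the 2 min timeout used in the experiments.
        ZooKeeper zk = new ZooKeeper("zk1:2181", 120_000, event -> { });
        joinGroup(zk, "standby", "standby-host:8020");
        watchActive(zk, "active", () -> System.out.println("NameNode failed - take over"));
        Thread.sleep(Long.MAX_VALUE);  // keep the session alive in this toy example
    }
}
```

Because the member znode is ephemeral, no explicit "I am dead" message is needed: the znode simply disappears when the session times out, and the watcher fires.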
FAILOVER
[Diagram: the NameNode fails; its ephemeral znode under the Namenodes group disappears from the ZooKeeper ensemble, and the Hot Standby Node is notified.]

FAILOVER (CONT.)
Switch the Hot Standby Node to active:
1. Stop checkpointing;
2. Close all open files;
3. Restart lease management;
4. Leave safe mode;
5. Update the group znode to its network address.
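A hedged sketch of how the watcher callback could drive these five steps; the HotStandbyNode interface and its methods are hypothetical names standing in for HDFS-internal operations, and only the ZooKeeper setData call is a real API.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class FailoverSketch {

    // Hypothetical interface standing in for the internal Hot Standby Node operations.
    interface HotStandbyNode {
        void stopCheckpointing();
        void closeAllOpenFiles();
        void restartLeaseManagement();
        void leaveSafeMode();
    }

    // Runs on the Hot Standby when ZooKeeper reports that the active's znode was deleted.
    static void becomeActive(ZooKeeper zk, HotStandbyNode standby, String myAddress) throws Exception {
        standby.stopCheckpointing();      // 1. hypothetical: stop acting as checkpoint helper
        standby.closeAllOpenFiles();      // 2. hypothetical: discard in-progress writes
        standby.restartLeaseManagement(); // 3. hypothetical: start granting and expiring leases
        standby.leaveSafeMode();          // 4. hypothetical: begin issuing commands to DataNodes

        // 5. Publish the new active address in the group znode so clients can find it.
        zk.setData("/namenodes", myAddress.getBytes(StandardCharsets.UTF_8), -1);
    }
}
```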
EXPERIMENTS
• Test environment: Amazon EC2
• 1 Zookeeper server;
• 1 NameNode;
• 1 Backup Node or Hot Standby Node;
• 20 DataNodes;
• 20 Clients running the test programs;
• All small instances.
• Performance tests - Comparison against HDFS 0.21.
• Failover tests - the NameNode is shut down when the block count exceeds 2K.
• Two test scenarios
• Metadata : Each client creates 200 files of one block each;
• I/O : Each client creates a single file of 200 blocks;
• 5 samples per scenario.
• Source code, tests scripts and raw data:
• Available at https://sites.google.com/site/hadoopfs/experiments
RESULTS OVERVIEW
Lines of Code:
1373 lines, only 0.18% of the original HDFS 0.21 code base.
Performance:
• NameNode
• Failover Manager overhead: increase of 16% in CPU time and
12% in heap memory compared to HDFS 0.21.
• DataNodes
• No considerable change on network traffic. Extra messages are
less than 0.43% of total flow out of DataNode.
• Substantial overhead only on I/O scenario: 17% in CPU and 6%
in heap memory compared to HDFS 0.21 in the same scenario.
RESULTS OVERVIEW(CONT.)
• Big Block-Received Message Problem:
• The Hot Standby only learns (through the log) the blocks of a file when it is closed.
• Hot Standby returns non-recognized blocks, so DataNodes can retry them on
next block-received message.
• In the I/O scenario files have 200 blocks → many pending blocks are retried until
the files are closed → larger block-received messages and responses → more
processing and memory.
• Hot Standby Node
• CPU time 3.3 times and heap memory 1.9 times higher in I/O scenario
compared to metadata scenario.
[Diagram: the DataNode sends block-received messages containing new + old blocks to the Hot Standby Node, which responds with the non-recognized blocks.]
PERFORMANCE RESULTS
THROUGHPUT AT CLIENTS
Metadata scenario (MB/s): Write: 6.24 (HDFS 0.21) vs. 5.45 (Hot Standby); Read: 25.8 vs. 25.32.
I/O scenario (MB/s): Write: 4.72 (HDFS 0.21) vs. 4.86 (Hot Standby); Read: 13.89 vs. 12.39.
FAILOVER RESULTS
Metadata vs. I/O scenario, timeline: NameNode failure → Hot Standby is notified of the failure → first request processed.
ZooKeeper session timeout: 2 min.
Total failover time: (1.62±0.23) min in the metadata scenario; (2.31±0.46) min in the I/O scenario.
Share of the failover spent between notification and the first processed request: 0.24% (metadata) vs. 22% (I/O).
HDFS 2.0.0-ALPHA
[Diagram: Client, DataNode, active NameNode, Standby NameNode, and Shared Storage; the active NameNode writes the log to the shared storage and the Standby NameNode reads it.]
Released in May 2012
• DataNodes send
messages to both nodes.
• Transactional log is
written to Highly Available
Shared Storage.
• Standby keeps reading
log from storage.
• Blocks are logged as
they are allocated.
• Manual Failover with IO
fencing.
CONCLUSIONS AND
FUTURE WORK
• We built a high availability solution for HDFS that is capable of
delivering good throughput with low overhead to existing
components.
• The solution has a reasonable reaction time to failures and
works well in Elastic Computing environments.
• Impact on the code base was small and no components
external to the Hadoop Project were required.
CONCLUSIONS AND
FUTURE WORK (CONT.)
• Our results showed that we can improve performance if we handle new blocks
better. If the Hot Standby becomes aware of which blocks compose a file before
it is closed, we will be able to continue in-progress writes. We also plan to
support reconfiguration.
• High Availability on HDFS is still an open problem:
• Multiple failure support;
• Integration with BookKeeper;
• Using HDFS itself to store the logs.
ACKNOWLEDGMENTS
• Rodrigo Schmidt
• Alumnus of University of Campinas (Unicamp);
• Facebook Engineer.
• Motorola Mobility
REFERENCES
1. K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in Mass
Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 3-7 2010, pp. 1 –10.
2. “Facebook has the world’s largest Hadoop cluster!” http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html, last access on September 13th, 2012.
3. K. Shvachko, “HDFS Scalability: The Limits to Growth,” ;login: The Usenix Magazine, vol. 35, no. 2, pp.
6–16, April 2010.
4. “Apache Hadoop,” http://hadoop.apache.org/, last access on September 13th, 2012.
5. “HBase,” http://hbase.apache.org/, last access on September 13th, 2012.
6. B. Bockelman, “Using Hadoop as a grid storage element,” Journal of Physics: Conference Series, vol.
180, no. 1, p. 012047, 2009. [Online].Available: http://stacks.iop.org/1742-6596/180/i=1/a=012047
7. D. Borthakur, “HDFS High Availability,” http://hadoopblog.blogspot.com/2009/11/hdfs-high-availability.html, last access on September 13th, 2012.
8. D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D.
Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer, “Apache Hadoop Goes Realtime at Facebook,” in
Proceedings of the 2011 international conference on Management of data, ser. SIGMOD ’11. New York,
NY, USA: ACM, 2011, pp. 1071– 1080. [Online]. Available: http://doi.acm.org/10.1145/1989323.1989438
REFERENCES
9. “Apache Zookeeper,” http://hadoop.apache.org/zookeeper/, last access on September 13th,
2012.
10. “HDFS architecture,” http://hadoop.apache.org/common/docs/current/hdfs_design.html, last
access on September 13th, 2012.
11. “Streaming Edits to a Standby Name-Node,” http://issues.apache.org/jira/browse/HADOOP-4539, last access on September 13th, 2012.
12. “Hot Standby for NameNode,” http://issues.apache.org/jira/browse/HDFS-976, last access on
September 13th, 2012.
13. D. Borthakur, “Hadoop AvatarNode High Availability,” http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html, last access on September 13th, 2012.
14. “Amazon EC2,” http://aws.amazon.com/ec2/, last access on September 13th, 2012.
15. “High Availability Framework for HDFS NN,” https://issues.apache.org/jira/browse/HDFS-1623, last access on September 13th, 2012.
16. “Automatic failover support for NN HA,” http://issues.apache.org/jira/browse/HDFS-3042, last access on September 13th, 2012.
Q&A
BACKUP SLIDES
PERFORMANCE RESULTS
DATANODE’S ETHERNET FLOW
[Bar charts: DataNode Ethernet data flow (GB), received and sent, HDFS 0.21 vs. Hot Standby, for the metadata and I/O scenarios (y-axis 35 to 40.5 GB).]
PERFORMANCE RESULTS
CPU TIME
[Bar charts: CPU time (s) for the DataNode, NameNode, and Backup/Hot Standby processes, HDFS 0.21 vs. Hot Standby, for the metadata and I/O scenarios (y-axis 0 to 500 s).]
PERFORMANCE RESULTS
HEAP USAGE
[Bar charts: Java heap usage (MB) for the DataNode, NameNode, and Backup/Hot Standby processes, HDFS 0.21 vs. Hot Standby, for the metadata and I/O scenarios (y-axis 0 to 25 MB).]
PERFORMANCE RESULTS
NAMENODE RPC DATAFLOW
[Bar charts: NameNode RPC data flow (MB), received and sent, HDFS 0.21 vs. Hot Standby; y-axis 0 to 800 MB in the metadata scenario and 0 to 18 MB in the I/O scenario.]
FINDING THE ACTIVE SERVER
• The group znode holds the IP address of the active metadata server.
• Clients query ZooKeeper for it and register a watcher to be notified of changes.
[Diagram: the Client queries the Namenodes group znode (which stores the active’s IP) and receives a notification through its watcher when it changes.]
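To illustrate the client side of this lookup, here is a minimal ZooKeeper sketch, again assuming a /namenodes group znode and a placeholder ensemble address: the client reads the active server's address from the group znode and installs a watcher so it is told when the address changes after a failover.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ActiveLookupSketch {

    static String findActive(ZooKeeper zk) throws Exception {
        Stat stat = new Stat();
        // Read the active server's address and watch the znode for the next change.
        byte[] data = zk.getData("/namenodes", event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                System.out.println("Active metadata server changed - re-resolve before the next request");
            }
        }, stat);
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 30_000, event -> { });  // hypothetical ensemble
        System.out.println("Active metadata server: " + findActive(zk));
    }
}
```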
Editor's Notes
  1. Good morning Everyone. I am André Oriani from University of Campinas, Brazil. I’m going to present our work : From Backup to Hot Standby: High Availability for HDFS.
2. So here's today's agenda: I'm gonna talk briefly about cluster-based parallel file systems, then the architecture of HDFS 0.21 and our implementation of a Hot Standby, give an overview of our experiments and results, and finish talking about the high availability features introduced in HDFS 2.0.0-alpha.
3. Cluster-based parallel file systems generally adopt a master-slaves architecture. The master, the metadata server, manages the namespace, while the slaves, the storage nodes, store the data. Although this architecture is simple to implement and maintain, as in any centralized system the master, the metadata server, is a single point of failure. One widely used specimen of such file systems is the Hadoop Distributed File System. And what is the importance of a hot standby for HDFS? Well, a cold start for a big cluster such as Facebook's can take about 45 minutes, making it not viable for 24/7 applications.
4. This slide shows the architecture and the data flows of HDFS 0.21. I'm gonna describe each node.
5. Like in any parallel file system, files are split into equal-sized blocks that are stored in the DataNodes, the storage nodes. In particular, a block is replicated to 3 DataNodes for reliability. They constantly communicate their status to the metadata server, the NameNode, through messages: heartbeats, which are not only used to tell that the DataNode is still alive, but also to tell about its load and free space; and block-reports, which list the healthy blocks the DataNode can offer. And when the DataNode receives a set of new blocks, it sends a block-received message. From those messages, the NameNode can build its global view of the cluster, knowing where each block replica is.
6. As I said, the NameNode is the metadata server and thus it is responsible for servicing the clients, handling metadata requests and controlling access to files. HDFS offers POSIX-like permissions and leases. A client can only write to a file if it has a lease for it. The NameNode also manages block allocation, location and replication status. To accomplish that work, the NameNode can send commands to DataNodes in the responses to the heartbeat messages.
7. The NameNode keeps all its state in main memory for performance reasons. In order to have some resilience, the NameNode employs journaling: every change to the file system tree is recorded to a log file. Information about leases and blocks is not recorded to the log because it is ephemeral data. So it is lost if the NameNode fails.
8. If no action were taken, the NameNode would end up with a big transactional log, which would seriously impact its startup time. So it counts on the Backup Node, its checkpoint helper, to compact the log in the form of a serialized version of the file system tree. The Backup Node employs an efficient checkpoint strategy: the NameNode streams all changes to it, so it can apply them to its own state. So to generate a checkpoint for the NameNode it only needs to checkpoint itself.
9. We found the Backup Node to be a great opportunity for implementing a hot standby for the NameNode. It already does some state replication and, because it is a subclass of NameNode, it can potentially handle client requests. In fact, turning it into a hot standby node was a long-term goal for the Backup Node when it was created. To evolve the Backup Node into a Hot Standby, we had to handle the missing NameNode state components, create an automatic failover mechanism, and provide means to disseminate the information about the current active metadata server.
10. To replicate the information about blocks, we reused a technique developed in another high-availability solution, the AvatarNodes from Facebook. The technique consists in modifying the DataNodes to also send the status messages to the Hot Standby. The Hot Standby Node is kept in safe mode so it does not issue commands to DataNodes that would conflict with the NameNode's. There is no rigid synchronization among the duplicated messages because DataNodes fail at considerable rates and the nodes are made to handle that. And once it becomes active, the Hot Standby will become the file system authority, so only its view will matter.
11. Regarding leases, we decided not to replicate them. The reason is that the blocks that compose a file are only recorded to the transactional log when the file is closed. So if the NameNode fails while a file was being written, all the blocks of the file are lost. So the client will need to restart the write, requiring a new lease. Thus the previous lease is not going to be used and therefore it doesn't need to be replicated. This behavior is somewhat tolerable by applications: MapReduce will retry any failed job and HBase will only commit a transaction when the file is flushed and synced.
12. In order to detect the NameNode's failure we use ZooKeeper. ZooKeeper is a subproject of Hadoop that provides a highly available distributed coordination service. It keeps a tree of znodes replicated among the servers of the ensemble. One interesting feature of ZooKeeper is that znodes can be ephemeral: if the session of the client that created one expires, the znode is removed. So you can implement liveness detection using this principle, and you can also register to be notified of such events. So both the NameNode and the Hot Standby create an ephemeral znode for themselves under a znode that represents the group. The NameNode writes its network address to the group znode.
13. When the NameNode fails, its session with ZooKeeper will eventually expire, so its znode gets removed. The Hot Standby is notified about that and starts the failover procedure.
14. The Hot Standby will stop checkpointing, close all open files, restart lease management, and leave safe mode so it can control the DataNodes. And it writes its network address to the znode of the group.
15. We did experiments in order to determine the overhead implied by our solution over HDFS 0.21 and the total failover time. We ran both tests on two scenarios: one more oriented towards metadata operations and another towards I/O operations. For each scenario we ran the test five times. The tests were executed on Amazon EC2 using 43 small instances. Source code, test scripts and raw data are available at the address denoted in the slide.
16. As time is short I am gonna give an overview of the results. The complete implementation took less than fourteen hundred lines, thus it is easy for others to understand and maintain. Regarding the performance overhead, the NameNode should not be impacted since it was not changed by the solution, but its process also hosts the Failover Manager, so we saw an increase of 16% in CPU time and 12% in heap memory compared to HDFS 0.21. For the DataNodes there was no considerable change in the network traffic: the extra messages sent to the Hot Standby got diluted in the I/O flow created by clients reading and writing files. We only observed substantial overhead in the I/O scenario, an increase of 17% in CPU time and 6% in heap compared to HDFS 0.21's DataNodes in the same scenario. That is caused by a growth in the block-received messages. Remember the Hot Standby only becomes aware of the blocks that compose a certain file when the file is closed. If the Hot Standby Node receives a block-received message for a block it does not know about, it will return that block, so the DataNode can retry it in the next block-received message. The trouble is that in the I/O scenario the files are 200 blocks long, and the Hot Standby Node will only recognize the blocks of a file once all 200 blocks have been written. So the block-received messages become long, taking a lot of processing and memory from the DataNodes and the Hot Standby. And because there is just one Hot Standby, it is the most affected node.
18. Despite those problems, the data throughput is still good. We consider the data throughput the most important metric because it measures how much work can be done on behalf of clients. On average, the data throughput was never more than 2 MB/s below the throughput achieved by HDFS 0.21, in both scenarios, for reads and writes.
19. Failover time: we used a timeout of 2 minutes because it was a safe value to avoid false positives in the virtualized environment of Amazon EC2. We are considering the failover from the time the NameNode fails until the Hot Standby processes its first request. In both cases the failover took less than 3 minutes. The time from the start of the Hot Standby's transition until the first request, that is, the part we can impact with our implementation, took only 0.24% of the total failover in the metadata scenario. However, in the I/O scenario that jumps to 22% because of the problem with block-received messages I just mentioned: the Hot Standby Node is so busy processing blocks that the transition takes longer. Once it is finished, things get worse, because the Hot Standby not only has to process the block-received messages but also has to instruct DataNodes to remove the blocks of all in-progress writes, so the first request is delayed for a long time, although clients could react almost instantaneously.
20. HDFS 2.0.0-alpha, released in May of this year, has introduced some high availability features. It also uses the technique of modifying the DataNodes to send their messages to the hot standby as well. But instead of streaming the changes, the active metadata server keeps the logs in shared storage, and the standby keeps reading the log from the storage in order to update itself. So the high availability issue is transferred from the file system to the shared storage, which is an external component and needs to be highly available. It logs blocks as they are allocated, avoiding the problem we have. Currently it only supports manual failover, so it is targeted at maintenance and upgrades, but automatic failover is very likely to be in the next release. It employs I/O fencing mechanisms to prevent both NameNodes from writing to the shared storage.
21. How can the client determine which is the current metadata server? Remember the NameNode writes its address to the group znode when it starts, and the Hot Standby writes its address when it finishes the failover. So the group znode always keeps the address of the active metadata server up to date. Clients just need to query ZooKeeper for that znode and register to be notified of changes.