A brief overview of fault-tolerance mechanisms across Big Data systems such as the Google File System (GFS), Amazon Dynamo, Bigtable, Hadoop MapReduce, and Facebook’s Cassandra, along with a description of an existing fault-tolerant model.
3. +
Introduction
Cloud computing is everywhere.
Advantages
Cost Efficient
Unlimited storage
Seamless access
Importance of Fault Tolerance
Mass outage at Amazon Web Services
A zone was down for an entire day!
Time critical systems
Rocket on a mission
Bank applications
4. +
Fault tolerant mechanisms in Distributed Systems
Google File System (GFS)
Focused on storage
Replication mechanism
Replicas placed on different machines across different racks, N = 3
Shadow masters support the primary master
Provide read-only access when the primary master is down
Checksums for data reliability
CRC
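The checksum idea can be sketched as follows. This is a minimal illustration, not GFS code: the function names are hypothetical, and the 64 KB block size with one CRC32 value per block is chosen here for illustration.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # block size assumed for this sketch


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Compute one CRC32 checksum per fixed-size block of data."""
    return [
        zlib.crc32(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data: bytes, expected: list[int],
                        block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indices of blocks whose checksum no longer matches."""
    actual = block_checksums(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

A replica that finds a mismatching block on read can refuse to serve it and fetch a clean copy from another replica.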
Amazon Dynamo
Focused on High Availability
Uses vector clocks
For semantic reconciliation
Hinted hand-off
Merkle trees
To detect and repair inconsistencies between replicas
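How vector clocks separate an ordinary update from a conflict that needs reconciliation can be sketched as below. The function names and the dictionary representation (node name to counter) are assumptions of this sketch, not Dynamo's actual interface.

```python
def increment(clock: dict[str, int], node: str) -> dict[str, int]:
    """Return a copy of the clock with the given node's counter advanced."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock


def descends(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Classify two versions as equal, ordered, or conflicting."""
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "a_newer"
    if b_ge_a:
        return "b_newer"
    return "conflict"  # concurrent writes: the client must reconcile
```

When `compare` reports a conflict, neither version can be discarded automatically, so both are returned to the application for reconciliation.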
5. +
Fault tolerant mechanisms in Distributed Systems (continued)
Facebook’s Cassandra
Accrual failure detection combined with a gossip-based protocol
Described by its authors as the first of its kind in a gossip setting
Probabilistic failure rate estimator
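The probabilistic estimate can be illustrated with a small sketch. It assumes heartbeat intervals follow an exponential distribution, so the suspicion level is phi = elapsed / (mean_interval * ln 10); the class name, the window size, and this distribution choice are assumptions of the sketch rather than details taken from the slides.

```python
import math
from collections import deque


class AccrualFailureDetector:
    """Tracks heartbeat arrivals for one peer and reports a suspicion level."""

    def __init__(self, window: int = 1000):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: higher means the peer is more likely to be down."""
        if not self.intervals:
            return 0.0  # not enough history to judge
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # -log10 of P(no heartbeat for `elapsed`) under an exponential model
        return elapsed / (mean * math.log(10))
```

Instead of a binary up/down verdict, the caller compares `phi` against a threshold of its own choosing, trading detection speed against false suspicions.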
ZooKeeper
Group of machines acting as servers
One leader; the other servers follow it and stay in sync with it
High availability
Bigtable
Works on top of GFS
Chubby service – metadata storage
Heart of Bigtable
Primary co-ordinator of Bigtable
Data persistence
6. +
Fault tolerant mechanisms in Distributed Systems (continued)
MapReduce
Classic master-slave configuration
Example: Hadoop
Re-execution of the whole task from scratch
If the task terminates midway
Operational even if some workers fail
Efficient load balancing
HDFS
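Master-driven re-execution can be sketched as follows. This is a simplified, sequential illustration: `run_task`, the worker names, and the retry limit are hypothetical, and a real master would schedule tasks in parallel and detect failures through heartbeats.

```python
from typing import Callable, Iterable


def run_with_reexecution(tasks: Iterable[str],
                         run_task: Callable[[str, str], object],
                         workers: list[str],
                         max_attempts: int = 3) -> dict[str, object]:
    """Run each task, restarting it from scratch on another worker on failure."""
    results: dict[str, object] = {}
    for n, task in enumerate(tasks):
        for attempt in range(max_attempts):
            # Rotate through workers so a retry lands on a different machine.
            worker = workers[(n + attempt) % len(workers)]
            try:
                results[task] = run_task(task, worker)
                break
            except Exception:
                continue  # worker failed midway: re-execute elsewhere
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return results
```

Because a failed task is simply run again in full, the job as a whole keeps making progress while some workers are down.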
7. +
Existing Fault tolerant model for Cloud Computing
Proposed by Anjali Meshram, A.S Sambare, S.D Zade
Input is passed to all VMs
Accepter
Tests the result of the algorithm run on every VM
Timer
Monitors the time constraint for each VM
Reliability Assessor (RA)
Starts with reliability of 100% for every VM
Recalculated from the time each VM takes to produce its result
Decision Maker
Selects the output of the node with the highest reliability
Raises a failure if a node’s reliability falls below the minimum; the node is then removed
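The interplay of the modules above can be sketched roughly as below. The slides do not give the reliability formula, so the fixed `step` adjustment, the `minimum` threshold value, and raising a failure only when no acceptable result remains are all assumptions of this sketch; the names are hypothetical as well.

```python
from dataclasses import dataclass


@dataclass
class VMResult:
    vm: str
    output: object
    passed_acceptance: bool  # verdict of the accepter
    elapsed: float           # seconds, as measured by the timer


def decide(reliability: dict[str, float], results: list[VMResult],
           time_limit: float, step: float = 0.1,
           minimum: float = 0.5) -> object:
    """Update per-VM reliability in place and return the chosen output."""
    candidates = []
    for r in results:
        ok = r.passed_acceptance and r.elapsed <= time_limit
        current = reliability.get(r.vm, 1.0)  # every VM starts at 100%
        updated = min(1.0, current + step) if ok else current - step
        reliability[r.vm] = updated
        if updated < minimum:
            del reliability[r.vm]  # node is removed from the pool
        elif ok:
            candidates.append(r)
    if not candidates:
        raise RuntimeError("no VM produced an acceptable result in time")
    # Decision maker: pick the output of the most reliable VM.
    return max(candidates, key=lambda r: reliability[r.vm]).output
```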
9. +
Features that can be combined to create a new Fault Tolerant Model
Master Node
Co-ordinator
Built on Zookeeper service
Each job is carried out on three different nodes
Accrual Fault Detectors
Probabilistic failure value
Measured from ping responses from the master
Decision Maker
Selects the majority vote to produce the final output
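The majority-vote decision over the three replicas of a job can be sketched as follows. The function name is hypothetical, and the sketch assumes outputs are hashable values that can be compared for equality.

```python
from collections import Counter


def majority_vote(outputs: list) -> object:
    """Return the output produced by more than half of the nodes."""
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among node outputs")
    return value
```

With three nodes per job, a single faulty or slow node cannot change the final output, since the other two still form a majority.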
10. +
Future Work
Develop a better and more robust fault-tolerant model using the features described in the earlier slides.