This document describes a distributed storage system called UniversalDistributedStorage. It discusses distributed computing principles such as data hashing, replication, and leader election. UniversalDistributedStorage uses consistent hashing to distribute data across servers and replicates data for fault tolerance. It elects leaders using ring-based election algorithms and synchronizes data asynchronously across multiple masters. The system aims to provide distributed transactions, data independence, fault tolerance, and transparency.
2. CONTENTS
1. What is a distributed-computing system?
2. Principles of a distributed database/storage system
3. Distributed storage system paradigm
4. Canonical problems in distributed systems
5. Common solutions for canonical problems in distributed systems
6. UniversalDistributedStorage
7. Appendix
3. 1. WHAT IS A DISTRIBUTED-COMPUTING SYSTEM?
Distributed computing is the process of solving a computational problem using a distributed system.
A distributed system is a computing system in which a number of components on multiple computers cooperate by communicating over a network to achieve a common goal.
4. DISTRIBUTED DATABASE/STORAGE SYSTEM
In a distributed database system, the database is stored on several computers.
A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network.
5. DISTRIBUTED SYSTEM ADVANTAGES
Advantages
Avoids bottlenecks & single points of failure
More scalability
More availability
Routing models
Client routing: the client sends its request directly to the appropriate server to read/write data
Server routing: a server forwards the client's request to the appropriate server and sends the result back to the client
* The two models above can be combined in one system
6. DISTRIBUTED STORAGE SYSTEM
Store the data {1,2,3,4,6,7,8} on 1 server:
{1,2,3,4,6,7,8}
Or store it across 3 distributed servers:
{1,2,3} {4,6} {7,8}
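The split above can be sketched in a few lines. A minimal, illustrative partitioning scheme (key mod N) is used here; the slide's figure shows range-based partitions instead, so the exact assignment differs:

```python
# Partition one server's data across three servers.
# The modulo scheme below is illustrative only; the deck's figure
# uses range-based partitions ({1,2,3}, {4,6}, {7,8}).
data = [1, 2, 3, 4, 6, 7, 8]

servers = {0: [], 1: [], 2: []}
for item in data:
    servers[item % 3].append(item)  # pick a server by key mod N

print(servers)  # -> {0: [3, 6], 1: [1, 4, 7], 2: [2, 8]}
```

Plain modulo works for a fixed server count, but moving to N+1 servers remaps almost every key; this is the problem consistent hashing (next sections) solves.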
7. 2. PRINCIPLES OF A DISTRIBUTED DATABASE/STORAGE SYSTEM
Shard the data by key and store each shard on the appropriate server using a Distributed Hash Table (DHT)
The DHT's hash function must suit consistent hashing:
Uniform distribution of generated values
Consistency
Jenkins and Murmur are good choices; others such as MD5 and SHA are slower
8. 3. DISTRIBUTED STORAGE SYSTEM PARADIGM
Data Hashing/Addressing
Determine which server each piece of data is stored on
Data Replication
Store data on multiple server nodes for more availability and fault tolerance
9. DISTRIBUTED STORAGE SYSTEM ARCHITECTURE
Data Hashing/Addressing
Use the DHT to map each server (by server name) to a number, placing it on a circle called the key space
Use the DHT to address each data key k and find the server storing it by successor(k) = ceiling(addressing(k))
successor(k): the server that stores k
[Figure: hash ring (key space starting at 0) with server1, server2, and server3 placed on it]
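The successor(k) lookup above can be sketched as follows. This is a minimal sketch, not the system's actual code: MD5 stands in for Jenkins/Murmur, the ring size and server names are made up, and a sorted list with binary search plays the role of the ring:

```python
import hashlib
from bisect import bisect_left

# Sketch of successor(k) = ceiling(addressing(k)) on a hash ring.
# MD5 stands in for Jenkins/Murmur; server names are hypothetical.
RING_SIZE = 2 ** 32

def addressing(name: str) -> int:
    """Hash a server name or data key onto the ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % RING_SIZE

servers = ["server1", "server2", "server3"]
ring = sorted((addressing(s), s) for s in servers)

def successor(key: str) -> str:
    """Return the first server at or after the key's ring position."""
    pos = addressing(key)
    idx = bisect_left(ring, (pos, ""))
    return ring[idx % len(ring)][1]  # wrap around past the last server

print(successor("photo:42"))
```

Because both servers and keys are hashed onto the same circle, adding or removing one server only remaps the keys between it and its predecessor, not the whole key space.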
10. DISTRIBUTED STORAGE SYSTEM ARCHITECTURE
Addressing – Virtual nodes
Each server node is given multiple node ids so that load is distributed more evenly across the ring
Server1: n1, n4, n6
Server2: n2, n7
Server3: n3, n5, n8
[Figure: hash ring with the virtual nodes n1–n8 of server1, server2, and server3 spread around the key space]
12. 4. CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS
Distributed transactions: the ACID (Atomicity, Consistency, Isolation, Durability) requirements
Distributed data independence
Fault tolerance
Transparency
13. 5. COMMON SOLUTIONS FOR CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS
Atomicity and consistency with the Two-Phase Commit protocol
Distributed data independence with a consistent hashing algorithm
Fault tolerance with leader election, multi-master and data replication
Transparency with server routing: the client sees the distributed system as a single server
14. TWO-PHASE COMMIT PROTOCOL
What is this?
Two-phase commit is a transaction protocol designed for the complications that arise with distributed resource managers.
Two-phase commit technology is used for hotel and airline reservations, stock market transactions, banking applications, and credit card systems.
With a two-phase commit protocol, the distributed transaction manager employs a coordinator to manage the individual resource managers. The commit process proceeds as follows:
15. TWO-PHASE COMMIT PROTOCOL
Phase 1: Obtaining a Decision
Step 1: The coordinator Ci asks all participants to prepare to commit transaction T.
Ci adds the record <prepare T> to the log and forces the log to stable storage (a log is a file that maintains a record of all changes to the database)
Ci sends prepare T messages to all sites where T executed
16. TWO-PHASE COMMIT PROTOCOL
Phase 1: Making a Decision
Step 2: Upon receiving the message, the transaction manager at each site determines whether it can commit the transaction
If not:
add a record <no T> to the log and send an abort T message to Ci
If the transaction can be committed, then:
1) add the record <ready T> to the log
2) force all records for T to stable storage
3) send a ready T message to Ci
17. TWO-PHASE COMMIT PROTOCOL
Phase 2: Recording the Decision
Step 1: T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted.
Step 2: The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage. Once the record is in stable storage, it cannot be revoked (even if failures occur)
Step 3: The coordinator sends a message to each participant informing it of the decision (commit or abort)
Step 4: Participants take the appropriate action locally.
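The steps above can be condensed into a toy coordinator. This is a sketch of the protocol's control flow only: the participants, their vote logic, and the in-memory "logs" are hypothetical stand-ins, and real 2PC would force each log record to stable storage and exchange messages over the network:

```python
# Toy two-phase commit coordinator following the steps above.
# Participants and their vote logic are hypothetical stand-ins.

def two_phase_commit(coordinator_log, participants):
    # Phase 1: coordinator logs <prepare T> and asks everyone to prepare.
    coordinator_log.append("<prepare T>")
    votes = []
    for p in participants:
        vote = p["can_commit"]          # each site decides if it can commit
        p["log"].append("<ready T>" if vote else "<no T>")
        votes.append(vote)

    # Phase 2: commit only if every participant voted ready.
    decision = "<commit T>" if all(votes) else "<abort T>"
    coordinator_log.append(decision)    # logged before telling anyone
    for p in participants:
        p["log"].append(decision)       # each site acts on the decision
    return decision

log = []
sites = [{"can_commit": True, "log": []}, {"can_commit": True, "log": []}]
print(two_phase_commit(log, sites))     # -> <commit T>
```

A single "no" vote (or an unreachable participant, treated as "no") aborts the whole transaction, which is exactly the availability cost discussed on the next slide.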
19. TWO-PHASE COMMIT PROTOCOL
Costs and Limitations
If one database server is unavailable, none of the servers gets the updates.
This is correctable through network tuning and by correctly building the data distribution with database optimization techniques.
20. LEADER ELECTION
Some leader election algorithms that can be used: LCR (LeLann–Chang–Roberts), Peterson, HS (Hirschberg–Sinclair)
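As an illustration of one of these, here is a small synchronous simulation of LCR on a unidirectional ring: each node starts by sending its own id to its neighbor, forwards ids larger than its own, drops smaller ones, and becomes leader when its own id comes back around. The ids are arbitrary example values; in practice they might be server addresses:

```python
# Synchronous simulation of LCR (LeLann-Chang-Roberts) election
# on a unidirectional ring. Node ids are illustrative.

def lcr_elect(ids):
    n = len(ids)
    pending = list(ids)              # the message each node holds this round
    leader = None
    while leader is None:
        nxt = [None] * n
        for i in range(n):
            msg = pending[i]
            if msg is None:
                continue
            j = (i + 1) % n          # send to the next node on the ring
            if msg == ids[j]:
                leader = ids[j]      # own id came back: j is the leader
            elif msg > ids[j]:
                nxt[j] = msg         # forward larger ids, drop smaller
        pending = nxt
    return leader

print(lcr_elect([3, 7, 1, 5]))       # the largest id wins -> 7
```

LCR sends O(n^2) messages in the worst case; HS improves this to O(n log n), which is why it is often preferred on larger rings.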
24. MULTI MASTER
Solution, 2 candidate models:
Two-phase commit (always consistent)
Asynchronously synchronize data among multiple nodes
Stays active even if some nodes die
Faster than 2PC
25. MULTI MASTER
Asynchronous data synchronization
Data is stored on the main master (called the sub-leader), then posted to a queue to be synchronized to the other masters.
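A minimal sketch of this write path, assuming in-memory dicts stand in for the masters and a background thread for the replication worker (a real system would use a durable queue and network replication):

```python
import queue
import threading

# Sketch of asynchronous multi-master sync: writes go to the sub-leader,
# then a background thread drains a queue to replicate to other masters.
# The in-memory "stores" are hypothetical stand-ins for real nodes.

sub_leader = {}
other_masters = [{}, {}]
sync_queue = queue.Queue()

def write(key, value):
    sub_leader[key] = value        # acknowledged to the client immediately
    sync_queue.put((key, value))   # replicated later, asynchronously

def replicator():
    while True:
        item = sync_queue.get()
        if item is None:           # sentinel: stop the worker
            break
        key, value = item
        for master in other_masters:
            master[key] = value

worker = threading.Thread(target=replicator)
worker.start()
write("user:1", "alice")
write("user:2", "bob")
sync_queue.put(None)               # flush and stop for the demo
worker.join()
print(other_masters)
```

The client sees a fast acknowledgement from the sub-leader while the other masters catch up; the trade-off is a replication-lag window during which reads from other masters can be stale.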
38. APPENDIX - 001
How to join/leave server(s)
[Figure: (1) a server sends a join/leave request; (2) the receiving server forwards it to the leader server; (3) the leader processes the join/leave; (4) the leader broadcasts the result to Server A, Server B, and Server C]
39. APPENDIX - 002
How to move data when server(s) join/leave
Determine which data must be moved
Move the data asynchronously in a background thread, and control the speed of the move
40. APPENDIX - 003
How to detect that the leader or a sub-leader has died
Easily detected by polling the connection
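A sketch of such polling, where the hypothetical `ping` callable stands in for a real network check (e.g. opening a TCP connection to the leader), and a node is only declared dead after several consecutive failures to avoid false alarms from a single dropped packet:

```python
import time

# Failure detection by polling: declare the leader dead only after
# several missed checks. `ping` is a hypothetical stand-in for a real
# network probe such as a TCP connect to the leader.

def is_alive(ping, retries=3, interval=0.01):
    """Poll the leader; give up only after `retries` failed attempts."""
    for _ in range(retries):
        if ping():
            return True
        time.sleep(interval)       # back off before the next attempt
    return False

print(is_alive(lambda: True))      # leader answers -> True
print(is_alive(lambda: False))     # leader never answers -> False
```

When `is_alive` returns False, the survivors would trigger a new leader election as described in the leader election section.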
41. APPENDIX - 004
How to create multiple virtual nodes for one server
Easily generate multiple virtual nodes for one server by hashing the server name
Ex:
to make 200 virtual nodes for the server ‘photoTokyo’:
use the hash values of photoTokyo1, photoTokyo2, …, photoTokyo200
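This can be sketched directly; as elsewhere, MD5 stands in for Jenkins/Murmur and the 32-bit ring size is an assumption:

```python
import hashlib

# Derive many virtual node positions for one server by hashing
# serverName1 .. serverNameN onto the ring (Appendix 004's scheme).
# MD5 and the 32-bit ring size are illustrative choices.
RING_SIZE = 2 ** 32

def virtual_nodes(server_name, count=200):
    """Return the ring positions of server_name's virtual nodes."""
    return {
        int(hashlib.md5(f"{server_name}{i}".encode()).hexdigest(), 16) % RING_SIZE
        for i in range(1, count + 1)
    }

nodes = virtual_nodes("photoTokyo")   # photoTokyo1 .. photoTokyo200
print(len(nodes))
```

Because the 200 positions scatter uniformly around the ring, each physical server ends up owning many small arcs of the key space instead of one large one, which evens out the load.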
42. APPENDIX - 005
For moving data fast
Use a Bloom filter to detect whether the hash value of a data key exists
Use a storage for all the data keys held by the local server
43. APPENDIX - 006
How to avoid network tuning problems
Use a client connection pool with a screening strategy; this avoids many hanging connections when making remote calls over the network between two servers
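A sketch of the pooling part of this idea, reusing a bounded set of connections instead of opening one per remote call; `FakeConnection` is a hypothetical stand-in for a real socket, and the deck's "screening strategy" (vetting connections before use) is not modeled here:

```python
import queue

# Client-side connection pool: reuse a bounded set of connections
# instead of opening one per remote call, so hanging connections
# cannot pile up between two servers.
# `FakeConnection` is a hypothetical stand-in for a real socket.

class FakeConnection:
    def call(self, request):
        return f"response to {request}"

class ConnectionPool:
    def __init__(self, size=4):
        self.pool = queue.Queue()
        for _ in range(size):
            self.pool.put(FakeConnection())

    def call_remote(self, request):
        conn = self.pool.get()         # blocks if all connections are busy
        try:
            return conn.call(request)
        finally:
            self.pool.put(conn)        # always return it to the pool

pool = ConnectionPool()
print(pool.call_remote("read key=1"))  # -> response to read key=1
```

Bounding the pool size caps how many in-flight calls can exist between two servers, so a slow peer makes callers wait at the pool rather than exhausting sockets.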