This document describes a distributed storage system called UniversalDistributedStorage. It discusses distributed computing principles such as data hashing, replication, and leader election. UniversalDistributedStorage uses consistent hashing to distribute data across servers and replicates data for fault tolerance. It elects leaders using ring-based election algorithms and synchronizes data asynchronously across multiple masters. The system aims to provide distributed transactions, data independence, fault tolerance, and transparency.
2. CONTENTS
1. What is a distributed-computing system?
2. Principles of a distributed database/storage system
3. Distributed storage system paradigm
4. Canonical problems in distributed systems
5. Common solutions for canonical problems in distributed systems
6. UniversalDistributedStorage
7. Appendix
3. 1. WHAT IS A DISTRIBUTED-COMPUTING SYSTEM?
Distributed computing is the process of solving a computational problem using a distributed system.
A distributed system is a computing system in which a number of components on multiple computers cooperate by communicating over a network to achieve a common goal.
4. DISTRIBUTED DATABASE/STORAGE SYSTEM
In a distributed database system, the database is stored on several computers.
A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network.
5. DISTRIBUTED SYSTEM ADVANTAGES
Advantages
Avoids bottlenecks & single points of failure
More scalability
More availability
Routing models
Client routing: the client sends its request directly to the appropriate server to read/write data
Server routing: a server forwards the client's request to the appropriate server and sends the result back to the client
* The two models above can be combined in one system
6. DISTRIBUTED STORAGE SYSTEM
Store the data {1,2,3,4,6,7,8} on 1 server:
{1,2,3,4,6,7,8}
Or store it across 3 distributed servers:
{1,2,3} {4,6} {7,8}
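The split above can be sketched in a few lines. A minimal, illustrative partitioning scheme (key mod N) is used here; the slide's figure shows range-based partitions instead, so the exact assignment differs:

```python
# Partition one server's data across three servers.
# The modulo scheme below is illustrative only; the deck's figure
# uses range-based partitions ({1,2,3}, {4,6}, {7,8}).
data = [1, 2, 3, 4, 6, 7, 8]

servers = {0: [], 1: [], 2: []}
for item in data:
    servers[item % 3].append(item)  # pick a server by key mod N

print(servers)  # -> {0: [3, 6], 1: [1, 4, 7], 2: [2, 8]}
```

Plain modulo works for a fixed server count, but moving to N+1 servers remaps almost every key; this is the problem consistent hashing (next sections) solves.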
7. 2. PRINCIPLES OF A DISTRIBUTED DATABASE/STORAGE SYSTEM
Shard the data by key and store each shard on the appropriate server using a Distributed Hash Table (DHT)
The DHT's hash function must suit consistent hashing:
Uniform distribution of generated values
Consistency
Jenkins and Murmur are good choices; others such as MD5 and SHA are slower
8. 3. DISTRIBUTED STORAGE SYSTEM PARADIGM
Data Hashing/Addressing
Determine which server each piece of data is stored on
Data Replication
Store data on multiple server nodes for more availability and fault tolerance
9. DISTRIBUTED STORAGE SYSTEM ARCHITECTURE
Data Hashing/Addressing
Use the DHT to map each server (by server name) to a number, placing it on a circle called the key space
Use the DHT to address each data key k and find the server storing it by successor(k) = ceiling(addressing(k))
successor(k): the server that stores k
[Figure: hash ring (key space starting at 0) with server1, server2, and server3 placed on it]
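The successor(k) lookup above can be sketched as follows. This is a minimal sketch, not the system's actual code: MD5 stands in for Jenkins/Murmur, the ring size and server names are made up, and a sorted list with binary search plays the role of the ring:

```python
import hashlib
from bisect import bisect_left

# Sketch of successor(k) = ceiling(addressing(k)) on a hash ring.
# MD5 stands in for Jenkins/Murmur; server names are hypothetical.
RING_SIZE = 2 ** 32

def addressing(name: str) -> int:
    """Hash a server name or data key onto the ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % RING_SIZE

servers = ["server1", "server2", "server3"]
ring = sorted((addressing(s), s) for s in servers)

def successor(key: str) -> str:
    """Return the first server at or after the key's ring position."""
    pos = addressing(key)
    idx = bisect_left(ring, (pos, ""))
    return ring[idx % len(ring)][1]  # wrap around past the last server

print(successor("photo:42"))
```

Because both servers and keys are hashed onto the same circle, adding or removing one server only remaps the keys between it and its predecessor, not the whole key space.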
10. DISTRIBUTED STORAGE SYSTEM ARCHITECTURE
Addressing – Virtual nodes
Each server node is given multiple node ids so that load is distributed more evenly across the ring
Server1: n1, n4, n6
Server2: n2, n7
Server3: n3, n5, n8
[Figure: hash ring with the virtual nodes n1–n8 of server1, server2, and server3 spread around the key space]
12. 4. CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS
Distributed transactions: the ACID (Atomicity, Consistency, Isolation, Durability) requirements
Distributed data independence
Fault tolerance
Transparency
13. 5. COMMON SOLUTIONS FOR CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS
Atomicity and consistency with the Two-Phase Commit protocol
Distributed data independence with a consistent hashing algorithm
Fault tolerance with leader election, multi-master and data replication
Transparency with server routing: the client sees the distributed system as a single server
14. TWO-PHASE COMMIT PROTOCOL
What is this?
Two-phase commit is a transaction protocol designed for the complications that arise with distributed resource managers.
Two-phase commit technology is used for hotel and airline reservations, stock market transactions, banking applications, and credit card systems.
With a two-phase commit protocol, the distributed transaction manager employs a coordinator to manage the individual resource managers. The commit process proceeds as follows:
15. TWO-PHASE COMMIT PROTOCOL
Phase 1: Obtaining a Decision
Step 1: The coordinator Ci asks all participants to prepare to commit transaction T.
Ci adds the record <prepare T> to the log and forces the log to stable storage (a log is a file that maintains a record of all changes to the database)
Ci sends prepare T messages to all sites where T executed
16. TWO-PHASE COMMIT PROTOCOL
Phase 1: Making a Decision
Step 2: Upon receiving the message, the transaction manager at each site determines whether it can commit the transaction
If not:
add a record <no T> to the log and send an abort T message to Ci
If the transaction can be committed, then:
1) add the record <ready T> to the log
2) force all records for T to stable storage
3) send a ready T message to Ci
17. TWO-PHASE COMMIT PROTOCOL
Phase 2: Recording the Decision
Step 1: T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted.
Step 2: The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage. Once the record is in stable storage, it cannot be revoked (even if failures occur)
Step 3: The coordinator sends a message to each participant informing it of the decision (commit or abort)
Step 4: Participants take the appropriate action locally.
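The steps above can be condensed into a toy coordinator. This is a sketch of the protocol's control flow only: the participants, their vote logic, and the in-memory "logs" are hypothetical stand-ins, and real 2PC would force each log record to stable storage and exchange messages over the network:

```python
# Toy two-phase commit coordinator following the steps above.
# Participants and their vote logic are hypothetical stand-ins.

def two_phase_commit(coordinator_log, participants):
    # Phase 1: coordinator logs <prepare T> and asks everyone to prepare.
    coordinator_log.append("<prepare T>")
    votes = []
    for p in participants:
        vote = p["can_commit"]          # each site decides if it can commit
        p["log"].append("<ready T>" if vote else "<no T>")
        votes.append(vote)

    # Phase 2: commit only if every participant voted ready.
    decision = "<commit T>" if all(votes) else "<abort T>"
    coordinator_log.append(decision)    # logged before telling anyone
    for p in participants:
        p["log"].append(decision)       # each site acts on the decision
    return decision

log = []
sites = [{"can_commit": True, "log": []}, {"can_commit": True, "log": []}]
print(two_phase_commit(log, sites))     # -> <commit T>
```

A single "no" vote (or an unreachable participant, treated as "no") aborts the whole transaction, which is exactly the availability cost discussed on the next slide.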
19. TWO-PHASE COMMIT PROTOCOL
Costs and Limitations
If one database server is unavailable, none of the servers gets the updates.
This is correctable through network tuning and by correctly building the data distribution with database optimization techniques.
20. LEADER ELECTION
Some leader election algorithms that can be used: LCR (LeLann–Chang–Roberts), Peterson, HS (Hirschberg–Sinclair)
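As an illustration of one of these, here is a small synchronous simulation of LCR on a unidirectional ring: each node starts by sending its own id to its neighbor, forwards ids larger than its own, drops smaller ones, and becomes leader when its own id comes back around. The ids are arbitrary example values; in practice they might be server addresses:

```python
# Synchronous simulation of LCR (LeLann-Chang-Roberts) election
# on a unidirectional ring. Node ids are illustrative.

def lcr_elect(ids):
    n = len(ids)
    pending = list(ids)              # the message each node holds this round
    leader = None
    while leader is None:
        nxt = [None] * n
        for i in range(n):
            msg = pending[i]
            if msg is None:
                continue
            j = (i + 1) % n          # send to the next node on the ring
            if msg == ids[j]:
                leader = ids[j]      # own id came back: j is the leader
            elif msg > ids[j]:
                nxt[j] = msg         # forward larger ids, drop smaller
        pending = nxt
    return leader

print(lcr_elect([3, 7, 1, 5]))       # the largest id wins -> 7
```

LCR sends O(n^2) messages in the worst case; HS improves this to O(n log n), which is why it is often preferred on larger rings.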
24. MULTI MASTER
Solution, 2 candidate models:
Two-phase commit (always consistent)
Asynchronously synchronize data among multiple nodes
Stays active even if some nodes die
Faster than 2PC
25. MULTI MASTER
Asynchronous data synchronization
Data is stored on the main master (called the sub-leader), then posted to a queue to be synchronized to the other masters.
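A minimal sketch of this write path, assuming in-memory dicts stand in for the masters and a background thread for the replication worker (a real system would use a durable queue and network replication):

```python
import queue
import threading

# Sketch of asynchronous multi-master sync: writes go to the sub-leader,
# then a background thread drains a queue to replicate to other masters.
# The in-memory "stores" are hypothetical stand-ins for real nodes.

sub_leader = {}
other_masters = [{}, {}]
sync_queue = queue.Queue()

def write(key, value):
    sub_leader[key] = value        # acknowledged to the client immediately
    sync_queue.put((key, value))   # replicated later, asynchronously

def replicator():
    while True:
        item = sync_queue.get()
        if item is None:           # sentinel: stop the worker
            break
        key, value = item
        for master in other_masters:
            master[key] = value

worker = threading.Thread(target=replicator)
worker.start()
write("user:1", "alice")
write("user:2", "bob")
sync_queue.put(None)               # flush and stop for the demo
worker.join()
print(other_masters)
```

The client sees a fast acknowledgement from the sub-leader while the other masters catch up; the trade-off is a replication-lag window during which reads from other masters can be stale.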
38. APPENDIX - 001
How to join/leave server(s)
[Figure: (1) a server sends a join/leave request; (2) the receiving server forwards it to the leader server; (3) the leader processes the join/leave; (4) the leader broadcasts the result to Server A, Server B, and Server C]
39. APPENDIX - 002
How to move data when server(s) join/leave
Determine which data must be moved
Move the data asynchronously in a background thread, and control the speed of the move
40. APPENDIX - 003
How to detect that the leader or a sub-leader has died
Easily detected by polling the connection
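A sketch of such polling, where the hypothetical `ping` callable stands in for a real network check (e.g. opening a TCP connection to the leader), and a node is only declared dead after several consecutive failures to avoid false alarms from a single dropped packet:

```python
import time

# Failure detection by polling: declare the leader dead only after
# several missed checks. `ping` is a hypothetical stand-in for a real
# network probe such as a TCP connect to the leader.

def is_alive(ping, retries=3, interval=0.01):
    """Poll the leader; give up only after `retries` failed attempts."""
    for _ in range(retries):
        if ping():
            return True
        time.sleep(interval)       # back off before the next attempt
    return False

print(is_alive(lambda: True))      # leader answers -> True
print(is_alive(lambda: False))     # leader never answers -> False
```

When `is_alive` returns False, the survivors would trigger a new leader election as described in the leader election section.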
41. APPENDIX - 004
How to create multiple virtual nodes for one server
Easily generate multiple virtual nodes for one server by hashing the server name
Ex:
to make 200 virtual nodes for the server ‘photoTokyo’:
use the hash values of photoTokyo1, photoTokyo2, …, photoTokyo200
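This can be sketched directly; as elsewhere, MD5 stands in for Jenkins/Murmur and the 32-bit ring size is an assumption:

```python
import hashlib

# Derive many virtual node positions for one server by hashing
# serverName1 .. serverNameN onto the ring (Appendix 004's scheme).
# MD5 and the 32-bit ring size are illustrative choices.
RING_SIZE = 2 ** 32

def virtual_nodes(server_name, count=200):
    """Return the ring positions of server_name's virtual nodes."""
    return {
        int(hashlib.md5(f"{server_name}{i}".encode()).hexdigest(), 16) % RING_SIZE
        for i in range(1, count + 1)
    }

nodes = virtual_nodes("photoTokyo")   # photoTokyo1 .. photoTokyo200
print(len(nodes))
```

Because the 200 positions scatter uniformly around the ring, each physical server ends up owning many small arcs of the key space instead of one large one, which evens out the load.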
42. APPENDIX - 005
For moving data fast
Use a Bloom filter to detect whether the hash value of a data key exists
Use a storage for all the data keys held by the local server
43. APPENDIX - 006
How to avoid network tuning problems
Use a client connection pool with a screening strategy; this avoids many hanging connections when making remote calls over the network between two servers
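A sketch of the pooling part of this idea, reusing a bounded set of connections instead of opening one per remote call; `FakeConnection` is a hypothetical stand-in for a real socket, and the deck's "screening strategy" (vetting connections before use) is not modeled here:

```python
import queue

# Client-side connection pool: reuse a bounded set of connections
# instead of opening one per remote call, so hanging connections
# cannot pile up between two servers.
# `FakeConnection` is a hypothetical stand-in for a real socket.

class FakeConnection:
    def call(self, request):
        return f"response to {request}"

class ConnectionPool:
    def __init__(self, size=4):
        self.pool = queue.Queue()
        for _ in range(size):
            self.pool.put(FakeConnection())

    def call_remote(self, request):
        conn = self.pool.get()         # blocks if all connections are busy
        try:
            return conn.call(request)
        finally:
            self.pool.put(conn)        # always return it to the pool

pool = ConnectionPool()
print(pool.call_remote("read key=1"))  # -> response to read key=1
```

Bounding the pool size caps how many in-flight calls can exist between two servers, so a slow peer makes callers wait at the pool rather than exhausting sockets.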