Paxos
Building Reliable System
2015-07-02 @drdrxp
Background
Several processes do one thing.
The only problem in distributed systems is
achieving consensus.
Paxos: the core of distributed systems.
Agenda
1. Problem
2. Replication is not enough
3. Paxos Algorithm
4. Paxos Optimization
Problem
Required:
Durability: 99.99999999%
Availability: 99.99%
What we have:
Hard drive: 4% annual failure rate
Server downtime: 0.1% or more
Packet loss between IDCs: 5% ~ 30%
Solution(Maybe)
Multiple Replicas
No data loss as long as fewer than n of the n replicas are lost
Probability of losing all replicas:
1 replica: ~ 0.63%
2 replicas: ~ 0.00395%
3 replicas: ~ 0.000025%
n replicas: x^n /* x = failure rate of a single replica */
Durability with n replicas: 1 - x^n
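The arithmetic behind these figures can be sketched as follows. This is a minimal illustration that assumes replicas fail independently and takes x = 0.63% from the single-replica figure; the function names are hypothetical.

```python
def loss_probability(x: float, n: int) -> float:
    """Probability that all n replicas are lost, assuming independent failures."""
    return x ** n


def durability(x: float, n: int) -> float:
    """Durability with n replicas: 1 - x^n."""
    return 1.0 - loss_probability(x, n)


def report(x: float = 0.0063) -> None:
    # x = 0.63% is an assumption read off the 1-replica figure on the slide.
    for n in (1, 2, 3):
        print(f"{n} replica(s): loss ~ {loss_probability(x, n):.7%}")
```

Each extra replica multiplies the loss probability by x, which is why a third replica matters so much more than the raw replica count suggests.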
Solution.
How to replicate data?
Besides number of replicas:
Availability
Atomicity
Consistency
...
Fundamental Replication Algorithms
Master-Slave Async
Master-Slave Sync
Master-Slave Semi-Sync
Quorum Write and Read
Master-Slave Async
The MySQL way.
1. Master received write op.
2. Master wrote on disk.
3. Master responded ‘OK’.
4. Master replicated to slaves.
If the master's disk fails before replication
→ Data loss.
[Figure: timeline of Client, Master, Slave.1 and Slave.2; a disk failure on the master occurs before the write reaches the slaves.]
Master-Slave Sync
1. Master received write op.
2. Master replicated log to slaves.
3. Slave may block...
4. Client won’t receive ‘OK’ until
all slaves respond.
One unreachable node
halts the entire system.
Pro: No data loss.
Con: Low availability.
[Figure: timeline of Client, Master, Slave.1 and Slave.2.]
Master-Slave Semi-Sync
1. Master received write op.
2. Master replicated log to slaves.
3. Slave may block...
4. Client receives ‘OK’ if [1,n)
slaves respond.
Pro: High durability.
Pro: High availability.
Con: No single slave is guaranteed to have all the data
→ We need Quorum Write
[Figure: timeline of Client, Master, Slave.1 and Slave.2.]
Quorum Write and Read
Dynamo / Cassandra
Write to W >= N/2+1 nodes.
No master required.
Read from R >= N/2+1 nodes.
W + R > N
Tolerates up to (N-1)/2 failed nodes.
[Figure: timeline of Client, Node.1, Node.2 and Node.3.]
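Why W + R > N matters can be checked by brute force. The sketch below enumerates every possible write set and read set for a small cluster and tests whether they always share a node; the function names are illustrative, not from the slides.

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest quorum size that guarantees overlap: n/2 + 1 (integer division)."""
    return n // 2 + 1


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every write set of size w shares a node with every read set of size r."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )
```

With N = 3 and W = R = 2, every read set contains at least one node that saw the latest write. With W = 1 and R = 2, a read can miss the write entirely.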
Quorum Write and Read. Last-Win
The last write wins.
Writes are totally ordered by
timestamp.
[Figure: timeline of Client, Node.1, Node.2 and Node.3.]
Pro: High durability.
Pro: High availability.
Pro: Data completeness is guaranteed.
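A minimal sketch of last-write-wins on top of quorum reads and writes. It assumes timestamps are unique and totally ordered across writers, which the slides take for granted; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Versioned:
    value: str
    ts: int  # assumed: unique, totally ordered write timestamp


class Node:
    def __init__(self) -> None:
        self.stored: Optional[Versioned] = None

    def write(self, item: Versioned) -> None:
        # Keep the item only if it is newer than what this node already holds.
        if self.stored is None or item.ts > self.stored.ts:
            self.stored = item


def quorum_write(nodes: List[Node], item: Versioned) -> None:
    for node in nodes:
        node.write(item)


def quorum_read(nodes: List[Node]) -> Optional[Versioned]:
    """Return the reply with the greatest timestamp, or None if nothing is stored."""
    replies = [n.stored for n in nodes if n.stored is not None]
    return max(replies, key=lambda item: item.ts, default=None)
```

Because any read quorum overlaps any write quorum, the reader always sees at least one copy of the newest write and picks it by timestamp.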
Is it enough?
Quorum Write and Read..
Quorum Write and Read...
W + R > N
Consistency:
Eventual
Transactionality:
Non-Atomic-Update
Dirty-Read
Lost-Update
http://en.wikipedia.org/wiki/Concurrency_control
An Imaginary Storage Service
● A storage system with 3 nodes(processes).
● Policy: Quorum RW.
● It stores only one variable “i”.
● “i” has multiple versions: i1, i2, i3…
● Commands:
get /* read latest “i” */
set <n> /* assign <n> to “i” */
inc <n> /* increment “i” by <n> */
It illustrates the deficiencies of Quorum RW
and how Paxos solves them.
An Imaginary Storage Service.
"set" → Quorum Write.
"inc" → the simplest transactional operation:
1. Read latest “i” with Quorum Read: i1
2. Let i2 = i1 + n
3. set i2
[Figure: client X runs "inc 1" against three nodes: get i1=2, i2 = i1 + 1, set i2=3 → OK. Node contents are drawn as a value with its version as a subscript, e.g. 2₁, 0₀, 3₂.]
An Imaginary Storage Service..
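[Figure: clients X and Y run "inc" concurrently against three nodes. X: get i1=2, i2 = i1 + 1. Y: get i1=2, i2 = i1 + 2. Node contents are drawn as a value with its version as a subscript: 2₁, 0₀, 3₂, 5₃.]
We expect X to be able to get i3=5.
This requires Y's write of i2 to fail after X wrote i2:
it must fail, or the existing value will be overwritten.
Y should then run Quorum Read and Quorum Write again.
How do we make Y's write fail?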
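The lost update can be reproduced with a few lines. This is a simplified sketch: every node starts at version 1 with value 2 (i1 = 2), a write blindly overwrites whatever the node holds, and the quorum choices for X and Y are assumptions made for the example.

```python
class Cluster:
    def __init__(self, n: int = 3) -> None:
        # Each node stores (version, value); all nodes start with i1 = 2.
        self.nodes = [(1, 2)] * n

    def get(self, quorum):
        """Quorum Read: return the (version, value) pair with the highest version."""
        return max((self.nodes[k] for k in quorum), key=lambda pair: pair[0])

    def set(self, quorum, version, value) -> None:
        """Quorum Write: overwrite unconditionally on every node in the quorum."""
        for k in quorum:
            self.nodes[k] = (version, value)


def concurrent_inc():
    c = Cluster()
    ver_x, val_x = c.get([0, 1])          # X reads i1 = 2
    ver_y, val_y = c.get([1, 2])          # Y reads i1 = 2
    c.set([0, 1], ver_x + 1, val_x + 1)   # X: set i2 = 3
    c.set([1, 2], ver_y + 1, val_y + 2)   # Y: set i2 = 4, overwriting X on node 1
    return c.get([1, 2])
```

After both "inc" operations the latest value is 4 at version 2, not the expected 5 at version 3: X's increment is lost because two writes to i2 both succeeded.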
An Imaginary Storage Service...
In order to correctly get i3 after 2 “inc” operations:
There can be only ONE successful “write” operation
for a given version of “i” (in our case: i2).
Generalization:
A value (one version of a variable) must not be
modified once it is determined (the client received
“OK” and believes it is stored).
How to define “determined”?
How to avoid changing a “determined” value?
Determine a Value
[Figure: X asks a quorum "Any value set?", gets "No", and writes X to two of the three nodes (X, X, -). Y then asks "Any value set?", gets "Yes", and gives up.]
Solution: Before writing a value, run a Quorum Read
round to check whether a value already exists (or may exist).
Determine a Value.
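[Figure: X and Y both ask a quorum "Any value set?" before either has written, and both get "No". Both then write, and the nodes end up holding a mix of X and Y.]
But both X and Y would believe that no value is set,
so X and Y would both start writing at the same time.
→ Lost Update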
Determine a Value..
[Figure: X asks nodes 1 and 2 "Any value set?" and gets "No"; this Quorum Read+Write makes them remember X as the last reader. Y then asks nodes 2 and 3 and gets "No"; they remember Y as the last reader. X's write then lands on only one node (X, -, -).]
Solution improved: remember who did the last read, and
deny writes from previous readers.
After X's read, nodes 1 and 2 will only accept
requests from X.
After Y's read, nodes 2 and 3 will only accept
requests from Y.
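The "remember the last reader" policy can be sketched as below. The `Node` class and the interleaving in `interleaved_writers` are illustrative assumptions; node indices 0, 1, 2 stand for nodes 1, 2, 3 on the slide.

```python
class Node:
    def __init__(self) -> None:
        self.last_reader = None
        self.value = None

    def read(self, reader: str):
        """Quorum Read+Write: return the stored value and remember who read last."""
        self.last_reader = reader
        return self.value

    def write(self, writer: str, value: str) -> bool:
        """Accept the write only if the writer is still this node's last reader."""
        if writer != self.last_reader:
            return False
        self.value = value
        return True


def interleaved_writers():
    nodes = [Node() for _ in range(3)]
    x_sees = [nodes[k].read("X") for k in (0, 1)]   # X: "Any value set?" -> No
    y_sees = [nodes[k].read("Y") for k in (1, 2)]   # Y: "Any value set?" -> No
    x_acks = sum(nodes[k].write("X", "x") for k in (0, 1))
    y_acks = sum(nodes[k].write("Y", "y") for k in (1, 2))
    return x_sees, y_sees, x_acks, y_acks
```

X collects only one acknowledgement (the middle node now answers to Y), which is less than a quorum of 2, so X fails while Y succeeds. The first node still holds the undetermined "x", which is the situation the vrnd bookkeeping in Paxos deals with.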
Determine a Value...
By applying this policy, a value (each version of “i” in our
case) can be stored safely and consistently.
Leslie Lamport wrote a paper about this policy:
Paxos
What is Paxos
● A reliable storage system based on Quorum RW.
● Each Paxos instance stores only 1 value.
● 2 round trips (phase 1 and phase 2) are required to determine 1 value.
● A value can’t be modified after it is determined.
● “Determined” means accepted by a
quorum (> n/2) of Acceptors.
● Immediate consistency.
Paxos
Classic Paxos
2 round trips per instance.
Multi Paxos
~1 round trip per instance.
Fast Paxos
1 round trip per instance (without conflict).
2 round trips per instance (with conflict).
Paxos: Precondition
Storage must be reliable:
No Data loss
/* Or it falls back to Byzantine Paxos */
Tolerate:
Message loss
Messages delivered out of order
Paxos: Concepts
Proposer: a process that starts a Paxos round to write something.
Acceptor: a process that receives and stores messages.
Quorum (of Acceptors): n/2+1 Acceptors.
Round: consists of 2 phases: Phase-1 and Phase-2.
Round Number (rnd):
ID of a round.
Monotonically increasing; last wins; universally unique.
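The slides do not say how a round number is made both increasing and unique. One common construction, shown here purely as an assumption, pairs a counter with a proposer ID and compares the pair lexicographically.

```python
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class Rnd:
    counter: int      # raised whenever a proposer starts a new round
    proposer_id: int  # hypothetical unique ID; breaks ties between proposers


def next_rnd(greatest_seen: Rnd, proposer_id: int) -> Rnd:
    """Return a round number greater than any this proposer has seen so far."""
    return Rnd(greatest_seen.counter + 1, proposer_id)
```

Two proposers can never produce the same `Rnd`, and the greater one always wins the comparison, which matches the "last wins" requirement.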
Paxos: Concepts.
Last Round Number (last_rnd):
The greatest rnd an Acceptor has ever seen.
It identifies the Proposer from which the Acceptor will
accept a write request.
Value (v): the value an Acceptor has accepted.
Value round number (vrnd):
The round in which an Acceptor accepted v.
Value determined:
A value is determined once it is accepted by a quorum of Acceptors.
Illustration of Acceptor
In the following slides, each Acceptor has 3 attributes
saved on it: last_rnd, v and vrnd.
They are drawn as “last_rnd,v” with vrnd as a subscript; for example “5,x₃” means last_rnd=5, v=x, vrnd=3.
Paxos: Classic - phase 1
[Figure: Proposer X sends phase 1 with rnd=1 to Acceptors 1 and 2 (of Acceptors 1, 2, 3). Each replies last_rnd=0, v=nil, vrnd=0. Acceptor state afterwards: (1, 1, -).]
When an Acceptor receives a phase-1 request from a Proposer:
● Refuse the request if its rnd < last_rnd.
● Otherwise save the rnd from the phase-1 request as last_rnd.
● From now on, accept only phase-2 requests carrying this
last_rnd.
● Respond with the last_rnd, v and vrnd it previously
held.
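The Acceptor's phase-1 rules can be sketched as a small handler. The `Acceptor` class, the `on_phase1` name and the tuple-shaped reply are illustrative choices, not part of the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    last_rnd: int = 0
    v: Optional[str] = None
    vrnd: int = 0

    def on_phase1(self, rnd: int) -> Tuple[bool, int, Optional[str], int]:
        """Handle a phase-1 request; reply (ok, last_rnd, v, vrnd)."""
        if rnd < self.last_rnd:
            # Stale round: refuse, but report last_rnd so the Proposer can catch up.
            return (False, self.last_rnd, self.v, self.vrnd)
        # Reply with the state held before this request, then remember rnd.
        reply = (True, self.last_rnd, self.v, self.vrnd)
        self.last_rnd = rnd
        return reply
```

For a fresh Acceptor, `on_phase1(1)` replies with last_rnd=0, v=nil, vrnd=0 as in the figure, and the Acceptor's last_rnd becomes 1.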
Paxos: Classic - phase 1.
[Figure: Proposer X sends phase 1 with rnd=1; Acceptors 1 and 2 reply last_rnd=0, v=nil, vrnd=0. Acceptor state: (1, 1, -).]
When the Proposer receives replies from Acceptors:
● If fewer than n/2+1 (a quorum of) replies are received: fail this round.
● If any reply has last_rnd > rnd: discard this round.
● If any reply carries a non-nil v: choose the v with the greatest vrnd.
● Otherwise: choose the v that the Proposer wants to write.
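The Proposer's decision after phase 1 can be written as one pure function. This is a sketch: `choose_value` and the reply shape `(last_rnd, v, vrnd)` are assumptions for illustration, and the Proposer's own value is assumed to be non-nil.

```python
from typing import List, Optional, Tuple

Reply = Tuple[int, Optional[str], int]  # (last_rnd, v, vrnd)


def choose_value(rnd: int, replies: List[Reply], my_value: str, n: int) -> Optional[str]:
    """Return the value to send in phase 2, or None if this round must be abandoned."""
    quorum = n // 2 + 1
    if len(replies) < quorum:
        return None  # not enough Acceptors answered
    if any(last_rnd > rnd for last_rnd, _, _ in replies):
        return None  # some Acceptor has already seen a greater round
    accepted = [(vrnd, v) for _, v, vrnd in replies if v is not None]
    if accepted:
        # A value may already be determined: adopt the one with the greatest vrnd.
        return max(accepted, key=lambda pair: pair[0])[1]
    return my_value  # no value seen anywhere in the quorum: free to write our own
```

Adopting the value with the greatest vrnd is what keeps a possibly determined value from being overwritten.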
Paxos: Classic - phase 2
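[Figure: Proposer X sends phase 2 with v="x", rnd=1 to Acceptors 1 and 2, and both reply Accepted. Acceptor state goes from (1, 1, -) to (1,x₁; 1,x₁; -): v=x, vrnd=1.]
Proposer:
Send a phase-2 request carrying rnd and the v chosen in the previous step to the
Acceptors.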
Paxos: Classic - phase 2.
[Figure: Proposer X sends phase 2 with v="x", rnd=1; Acceptors 1 and 2 reply Accepted and store 1,x₁.]
Acceptor:
● Accept the request only if its rnd equals the Acceptor's last_rnd.
last_rnd == rnd guarantees that no other Proposer has
touched this Acceptor since this Proposer's phase 1.
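The Acceptor's phase-2 rule is a single comparison. A minimal sketch, with the `Acceptor` class and `on_phase2` name chosen for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    last_rnd: int = 0
    v: Optional[str] = None
    vrnd: int = 0

    def on_phase2(self, rnd: int, v: str) -> bool:
        """Accept v only if rnd is the round this Acceptor last saw in phase 1."""
        if rnd != self.last_rnd:
            return False  # another Proposer has run phase 1 here in the meantime
        self.v = v
        self.vrnd = rnd
        return True
```

An Acceptor that saw rnd=1 in phase 1 accepts `on_phase2(1, "x")` and stores v=x, vrnd=1. One whose last_rnd has moved on to 2 refuses the same request and keeps its state unchanged.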
Paxos: Case 1: Classic, no Conflict
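[Figure: Proposer X and Acceptors 1, 2, 3. Phase 1: X sends rnd=1; Acceptors 1 and 2 reply last_rnd=0, v=nil, vrnd=0, and the state becomes (1, 1, -). Phase 2: X sends v="x", rnd=1; both reply Accepted, and the state becomes (1,x₁; 1,x₁; -), so v=x, vrnd=1.]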
Paxos: Case 2.1: Resolve Conflict
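[Figure: timeline with Proposers X and Y and three Acceptors.
round=1: X runs phase 1 with rnd=1; Acceptor state (1, 1, -).
round=2: Y runs phase 1 with rnd=2; the Acceptors it reaches reply "OK, forget X".
Phase 2: X sends v="x", rnd=1 and fails; Y sends v="y", rnd=2 and gets OK.
Final Acceptor state: (2,y₂; 1,x₁; 2,y₂).]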
Paxos: Case 2.2: Respect Existed v
[Figure: round=3. The Acceptors start as (2,y₂; 1,x₁; 2,y₂). Phase 1: X sends rnd=3 and receives v="y", vrnd=2 and v="x", vrnd=1, so it chooses "y". Phase 2: X sends v="y" with rnd=3 and gets OK; the Acceptors that accept store 3,y₃.]
v=“y” must be chosen by Proposer X because “y” may
be a determined value and must not be overwritten.
However, without checking the 3rd Acceptor, X cannot
know whether “y” is actually determined (accepted by a
quorum).
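Cases 2.1 and 2.2 can be replayed end to end with a small in-memory simulation. This is a sketch under assumptions: three Acceptors, a quorum of 2, no message loss, and hypothetical helper names `run_phase1` and `run_phase2`. Which Acceptors each Proposer talks to is chosen to match the states shown in the figures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    last_rnd: int = 0
    v: Optional[str] = None
    vrnd: int = 0

    def phase1(self, rnd):
        if rnd < self.last_rnd:
            return None                      # refused
        reply = (self.v, self.vrnd)
        self.last_rnd = rnd
        return reply

    def phase2(self, rnd, v):
        if rnd != self.last_rnd:
            return False
        self.v, self.vrnd = v, rnd
        return True


def run_phase1(acceptors, rnd, my_value):
    """Return the value to propose, or None if any contacted Acceptor refused."""
    replies = [a.phase1(rnd) for a in acceptors]
    if any(r is None for r in replies):
        return None
    v, _ = max(replies, key=lambda r: r[1])  # reply with the greatest vrnd
    return v if v is not None else my_value


def run_phase2(acceptors, rnd, v, quorum=2):
    return sum(a.phase2(rnd, v) for a in acceptors) >= quorum


def replay():
    a1, a2, a3 = Acceptor(), Acceptor(), Acceptor()
    vx = run_phase1([a1, a2], 1, "x")        # X, rnd=1
    vy = run_phase1([a1, a3], 2, "y")        # Y, rnd=2: a1 forgets X
    x_ok = run_phase2([a1, a2], 1, vx)       # only a2 accepts: X fails
    y_ok = run_phase2([a1, a3], 2, vy)       # Y succeeds
    v3 = run_phase1([a1, a2], 3, "x")        # X retries with rnd=3 and must adopt "y"
    x3_ok = run_phase2([a1, a2], 3, v3)
    state = [(a.last_rnd, a.v, a.vrnd) for a in (a1, a2, a3)]
    return x_ok, y_ok, v3, x3_ok, state
```

In this replay X's first attempt fails, Y determines "y", and X's retry with rnd=3 ends up re-writing "y" rather than its own "x".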
Paxos........
Learner:
● Acceptors send a phase-3 message to Learners to inform
them that a value has been determined.
● Most of the time a Proposer can also be a Learner.
Livelock:
Proposers keep raising their rnd and overwriting each other’s
last_rnd on the Acceptors, so that no phase 2 can ever
succeed.
Multi Paxos
Combine multiple phase-1 requests into one
message.
Send each phase-2 request separately.
Applications:
Chubby, ZooKeeper, Megastore, Spanner
Fast Paxos
● Proposers send phase-2 without sending phase-1.
● rnd in a Fast Paxos phase-2 is 0.
rnd=0 because it must be lower than any Classic rnd,
so the instance can fall back to Classic Paxos safely.
● An Acceptor accepts a fast phase-2 request only when its v=nil.
● If a conflict happens, the Proposer should fall back to Classic
Paxos with a rnd > 0.
Is Fast Paxos as cheap as Classic Paxos?
Fast Paxos Quorum
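[Figure: five Acceptors, all initially empty. X sends fast phase 2 (rnd=0) and gets OK; the Acceptors that accept store 0,x₀. Y sends fast phase 2 (rnd=0) and fails with 2/5; the Acceptors that accept store 0,y₀. Acceptors that Y cannot see are marked “?”.]
If the Fast Paxos quorum were n/2+1 = 3:
When Y finds the conflict and falls back to Classic Paxos,
there is no way for Y to know whether x₀ or y₀
is a determined value.
Solution: an undetermined value must not be able to occupy half of a classic quorum of n/2+1 Acceptors:
→ Fast quorum > n*¾;
→ A value is determined in a Fast Round only if it is accepted by n*¾+1 Acceptors.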
Fast Paxos Quorum.
Fast Paxos Quorum = n*¾+1
Availability becomes lower because Fast Paxos requires
more Acceptors to work.
Fast Paxos requires at least 5 Acceptors in order to tolerate
one failed Acceptor.
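The quorum sizes on this slide can be tabulated with integer arithmetic. The sketch follows the slide's own formula (fast quorum = n*¾+1, rounded down); the function names are hypothetical.

```python
def classic_quorum(n: int) -> int:
    """Classic Paxos quorum: n/2 + 1."""
    return n // 2 + 1


def fast_quorum(n: int) -> int:
    """Fast Paxos quorum as given on the slide: n*3/4 + 1 (integer division)."""
    return n * 3 // 4 + 1


def tolerated_failures_fast(n: int) -> int:
    """How many Acceptors may be down while a fast round can still succeed."""
    return n - fast_quorum(n)
```

With n = 5 the fast quorum is 4, so one Acceptor may fail. For every n below 5 the formula demands all Acceptors, which is where the "at least 5 Acceptors" figure comes from.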
Fast Paxos ⅘: Y has a Conflict
[Figure: five Acceptors, fast quorum 4 of 5. X sends fast phase 2 (rnd=0) and gets OK; four Acceptors store 0,x₀. Y sends fast phase 2 (rnd=0) and fails with 1/5; one Acceptor stores 0,y₀. Y falls back to classic rnd=2: phase 1 returns OK with "x", and in phase 2 Y writes "x" (2,x₂).]
Y saw two x₀ on 3 Acceptors.
Y must choose x₀ because x₀ might be a determined value.
y₀ cannot be determined: even if the other two
untouched Acceptors both hold y₀, there are not enough y₀
to form a fast quorum (more than 5*¾).
Fast Paxos ⅘: X Y conflicts
[Figure: five Acceptors, fast quorum 4 of 5. X and Y both send fast phase 2 with rnd=0; three Acceptors store 0,x₀ and two store 0,y₀, so both see a conflict. X falls back with classic rnd=1 and Y with classic rnd=2. In phase 1, X gets OK and sees only "x"; Y gets OK and chooses "y". Y's phase 2 succeeds and the Acceptors store 2,y₂; X fails in phase 2.]
Note
In phase-2, it is also correct if the Acceptor accepts
requests with rnd >= last_rnd
Q&A
Thanks
drdr.xp@gmail.com
http://drmingdrmer.github.io
weibo.com: @drdrxp

Mais conteúdo relacionado

Mais procurados

3.6 &amp; 7. pumping lemma for cfl &amp; problems based on pl
3.6 &amp; 7. pumping lemma for cfl &amp; problems based on pl3.6 &amp; 7. pumping lemma for cfl &amp; problems based on pl
3.6 &amp; 7. pumping lemma for cfl &amp; problems based on plSampath Kumar S
 
process management
 process management process management
process managementAshish Kumar
 
8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating SystemsDr Sandeep Kumar Poonia
 
RPC: Remote procedure call
RPC: Remote procedure callRPC: Remote procedure call
RPC: Remote procedure callSunita Sahu
 
Dining Philosopher's Problem
Dining Philosopher's ProblemDining Philosopher's Problem
Dining Philosopher's ProblemYash Mittal
 
Moore and mealy machines
Moore and mealy machinesMoore and mealy machines
Moore and mealy machineslavishka_anuj
 
Physical and Logical Clocks
Physical and Logical ClocksPhysical and Logical Clocks
Physical and Logical ClocksDilum Bandara
 
Distributed Transactions(flat and nested) and Atomic Commit Protocols
Distributed Transactions(flat and nested) and Atomic Commit ProtocolsDistributed Transactions(flat and nested) and Atomic Commit Protocols
Distributed Transactions(flat and nested) and Atomic Commit ProtocolsSachin Chauhan
 
String Matching Finite Automata & KMP Algorithm.
String Matching Finite Automata & KMP Algorithm.String Matching Finite Automata & KMP Algorithm.
String Matching Finite Automata & KMP Algorithm.Malek Sumaiya
 

Mais procurados (20)

Cs8591 Computer Networks
Cs8591 Computer NetworksCs8591 Computer Networks
Cs8591 Computer Networks
 
24 Multithreaded Algorithms
24 Multithreaded Algorithms24 Multithreaded Algorithms
24 Multithreaded Algorithms
 
3.6 &amp; 7. pumping lemma for cfl &amp; problems based on pl
3.6 &amp; 7. pumping lemma for cfl &amp; problems based on pl3.6 &amp; 7. pumping lemma for cfl &amp; problems based on pl
3.6 &amp; 7. pumping lemma for cfl &amp; problems based on pl
 
P vs NP
P vs NP P vs NP
P vs NP
 
Process Management-Process Migration
Process Management-Process MigrationProcess Management-Process Migration
Process Management-Process Migration
 
cpu scheduling in os
cpu scheduling in oscpu scheduling in os
cpu scheduling in os
 
Cap Theorem
Cap TheoremCap Theorem
Cap Theorem
 
NFA to DFA
NFA to DFANFA to DFA
NFA to DFA
 
process management
 process management process management
process management
 
8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems
 
RPC: Remote procedure call
RPC: Remote procedure callRPC: Remote procedure call
RPC: Remote procedure call
 
Dining Philosopher's Problem
Dining Philosopher's ProblemDining Philosopher's Problem
Dining Philosopher's Problem
 
Moore and mealy machines
Moore and mealy machinesMoore and mealy machines
Moore and mealy machines
 
Greedy method by Dr. B. J. Mohite
Greedy method by Dr. B. J. MohiteGreedy method by Dr. B. J. Mohite
Greedy method by Dr. B. J. Mohite
 
Agreement protocol
Agreement protocolAgreement protocol
Agreement protocol
 
Lower bound
Lower boundLower bound
Lower bound
 
Physical and Logical Clocks
Physical and Logical ClocksPhysical and Logical Clocks
Physical and Logical Clocks
 
Distributed Transactions(flat and nested) and Atomic Commit Protocols
Distributed Transactions(flat and nested) and Atomic Commit ProtocolsDistributed Transactions(flat and nested) and Atomic Commit Protocols
Distributed Transactions(flat and nested) and Atomic Commit Protocols
 
String Matching Finite Automata & KMP Algorithm.
String Matching Finite Automata & KMP Algorithm.String Matching Finite Automata & KMP Algorithm.
String Matching Finite Automata & KMP Algorithm.
 
Replication in Distributed Systems
Replication in Distributed SystemsReplication in Distributed Systems
Replication in Distributed Systems
 

Semelhante a Paxos building-reliable-system

Neural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learningNeural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learningTapas Majumdar
 
Lab: Foundation of Concurrent and Distributed Systems
Lab: Foundation of Concurrent and Distributed SystemsLab: Foundation of Concurrent and Distributed Systems
Lab: Foundation of Concurrent and Distributed SystemsRuochun Tzeng
 
Peer-to-Peer Streaming Based on Network Coding Decreases Packet Jitter
Peer-to-Peer Streaming Based on Network Coding Decreases Packet JitterPeer-to-Peer Streaming Based on Network Coding Decreases Packet Jitter
Peer-to-Peer Streaming Based on Network Coding Decreases Packet JitterAlpen-Adria-Universität
 
Константин Серебряный, Google, - Как мы охотимся на гонки (data races) или «н...
Константин Серебряный, Google, - Как мы охотимся на гонки (data races) или «н...Константин Серебряный, Google, - Как мы охотимся на гонки (data races) или «н...
Константин Серебряный, Google, - Как мы охотимся на гонки (data races) или «н...Media Gorod
 
Computer network (8)
Computer network (8)Computer network (8)
Computer network (8)NYversity
 
TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Compu...
TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Compu...TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Compu...
TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Compu...Kinson Chan
 
How Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and SaferHow Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and SaferScyllaDB
 
The world of loss function
The world of loss functionThe world of loss function
The world of loss function홍배 김
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Universitat Politècnica de Catalunya
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking VN
 
When Ruby Meets Java - The Power of Torquebox
When Ruby Meets Java - The Power of TorqueboxWhen Ruby Meets Java - The Power of Torquebox
When Ruby Meets Java - The Power of Torqueboxrockyjaiswal
 
Slides for a talk on UML Semantics in Nuremberg in 2005
Slides for a talk on UML Semantics in Nuremberg in 2005Slides for a talk on UML Semantics in Nuremberg in 2005
Slides for a talk on UML Semantics in Nuremberg in 2005Alin Stefanescu
 
Introduction to Neural Network
Introduction to Neural NetworkIntroduction to Neural Network
Introduction to Neural NetworkOmer Korech
 
MySQL 5.6 Global Transaction IDs - Use case: (session) consistency
MySQL 5.6 Global Transaction IDs - Use case: (session) consistencyMySQL 5.6 Global Transaction IDs - Use case: (session) consistency
MySQL 5.6 Global Transaction IDs - Use case: (session) consistencyUlf Wendel
 
5.MLP(Multi-Layer Perceptron)
5.MLP(Multi-Layer Perceptron) 5.MLP(Multi-Layer Perceptron)
5.MLP(Multi-Layer Perceptron) 艾鍗科技
 
Fast dynamic analysis, Kostya Serebryany
Fast dynamic analysis, Kostya SerebryanyFast dynamic analysis, Kostya Serebryany
Fast dynamic analysis, Kostya Serebryanyyaevents
 
Константин Серебряный "Быстрый динамичекский анализ программ на примере поиск...
Константин Серебряный "Быстрый динамичекский анализ программ на примере поиск...Константин Серебряный "Быстрый динамичекский анализ программ на примере поиск...
Константин Серебряный "Быстрый динамичекский анализ программ на примере поиск...Yandex
 
Interval Type-2 fuzzy decision making
Interval Type-2 fuzzy decision makingInterval Type-2 fuzzy decision making
Interval Type-2 fuzzy decision makingBob John
 

Semelhante a Paxos building-reliable-system (20)

Neural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learningNeural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learning
 
Lab: Foundation of Concurrent and Distributed Systems
Lab: Foundation of Concurrent and Distributed SystemsLab: Foundation of Concurrent and Distributed Systems
Lab: Foundation of Concurrent and Distributed Systems
 
Peer-to-Peer Streaming Based on Network Coding Decreases Packet Jitter
Peer-to-Peer Streaming Based on Network Coding Decreases Packet JitterPeer-to-Peer Streaming Based on Network Coding Decreases Packet Jitter
Peer-to-Peer Streaming Based on Network Coding Decreases Packet Jitter
 
Константин Серебряный, Google, - Как мы охотимся на гонки (data races) или «н...
Константин Серебряный, Google, - Как мы охотимся на гонки (data races) или «н...Константин Серебряный, Google, - Как мы охотимся на гонки (data races) или «н...
Константин Серебряный, Google, - Как мы охотимся на гонки (data races) или «н...
 
Computer network (8)
Computer network (8)Computer network (8)
Computer network (8)
 
Multicore Processors
Multicore ProcessorsMulticore Processors
Multicore Processors
 
RabbitMQ in Sprayer
RabbitMQ in SprayerRabbitMQ in Sprayer
RabbitMQ in Sprayer
 
TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Compu...
TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Compu...TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Compu...
TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Compu...
 
How Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and SaferHow Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and Safer
 
The world of loss function
The world of loss functionThe world of loss function
The world of loss function
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
 
When Ruby Meets Java - The Power of Torquebox
When Ruby Meets Java - The Power of TorqueboxWhen Ruby Meets Java - The Power of Torquebox
When Ruby Meets Java - The Power of Torquebox
 
Slides for a talk on UML Semantics in Nuremberg in 2005
Slides for a talk on UML Semantics in Nuremberg in 2005Slides for a talk on UML Semantics in Nuremberg in 2005
Slides for a talk on UML Semantics in Nuremberg in 2005
 
Introduction to Neural Network
Introduction to Neural NetworkIntroduction to Neural Network
Introduction to Neural Network
 
MySQL 5.6 Global Transaction IDs - Use case: (session) consistency
MySQL 5.6 Global Transaction IDs - Use case: (session) consistencyMySQL 5.6 Global Transaction IDs - Use case: (session) consistency
MySQL 5.6 Global Transaction IDs - Use case: (session) consistency
 
5.MLP(Multi-Layer Perceptron)
5.MLP(Multi-Layer Perceptron) 5.MLP(Multi-Layer Perceptron)
5.MLP(Multi-Layer Perceptron)
 
Fast dynamic analysis, Kostya Serebryany
Fast dynamic analysis, Kostya SerebryanyFast dynamic analysis, Kostya Serebryany
Fast dynamic analysis, Kostya Serebryany
 
Константин Серебряный "Быстрый динамичекский анализ программ на примере поиск...
Константин Серебряный "Быстрый динамичекский анализ программ на примере поиск...Константин Серебряный "Быстрый динамичекский анализ программ на примере поиск...
Константин Серебряный "Быстрый динамичекский анализ программ на примере поиск...
 
Interval Type-2 fuzzy decision making
Interval Type-2 fuzzy decision makingInterval Type-2 fuzzy decision making
Interval Type-2 fuzzy decision making
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 

Último (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems

Paxos building-reliable-system

  • 2. Background Several processes do one thing. The only problem in distributed system is achieving consensus. Paxos: the core of distributed system.
  • 3. Agenda 1. Problem 2. Replication is not enough 3. Paxos Algorithm 4. Paxos Optimization
  • 4. Problem Required: Durability: 99.99999999% Availability: 99.99% What we have: Hard Drive: 4% of Annual failure rate Server Down Time: 0.1% or longer Packet loss between IDC: 5% ~ 30%
  • 5. Solution(Maybe) Multiple Replicas No data loss if x(x<n) replicas lost Durability: 1 replica: ~ 0.63% 2 replicas: ~ 0.00395% 3 replicas: < 0.000001% n replicas: = 1 - x^n /* x = failure rate of single replica */
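The formula on this slide can be sketched in a few lines. This is a minimal illustration, assuming replicas fail independently; the function names are hypothetical, and reading the slide's 0.63% as the single-replica loss probability `x` is an assumption.

```python
def loss_probability(x: float, n: int) -> float:
    """Probability that all n replicas are lost, assuming independent failures."""
    return x ** n


def durability(x: float, n: int) -> float:
    """Durability with n replicas: 1 - x^n, as on the slide."""
    return 1.0 - loss_probability(x, n)


# Assumed single-replica loss probability (0.63% read from the slide).
x = 0.0063
for n in (1, 2, 3):
    print(f"{n} replica(s): loss={loss_probability(x, n):.3e} "
          f"durability={durability(x, n):.10f}")
```

Each extra replica multiplies the loss probability by `x`, which is why a few replicas already move durability by several orders of magnitude.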
  • 6. Solution. How to replicate data? Besides number of replicas: Availability Atomicity Consistency ...
  • 7. Fundamental Replication Algorithms Master-Slave Async Master-Slave Sync Master-Slave Semi-Sync Quorum Write and Read
  • 8. Master-Slave Async The MySQL Way. 1. Master received write op. 2. Master wrote on disk. 3. Master responded ‘OK’. 4. Master replicated to slaves. If the disk fails before replication → Data loss. Time MasterClient Slave.1 Slave.2 Disk Failure
  • 9. Master-Slave Sync 1. Master received write op. 2. Master replicated log to slaves. 3. Slave may block... 4. Client won’t receive ‘OK’ until all slaves respond. One unreachable node halts the entire system. : No data loss. : But low availability. Time MasterClient Slave.1 Slave.2
  • 10. Master-Slave Semi-Sync 1. Master received write op. 2. Master replicated log to slaves. 3. Slave may block... 4. Client receives ‘OK’ if [1,n) slaves respond. : High durability. : High availability. : No slave has all data → We need Quorum Write Time MasterClient Slave.1 Slave.2
  • 11. Quorum Write and Read Dynamo / Cassandra Write to W >=N/2+1 nodes. No master required. Read R >=N/2+1 nodes. W + R > N Tolerate up to (N-1)/2 failed nodes. Time Node.1Client Node.2 Node.3
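Why W + R > N matters can be checked by brute force. The sketch below is illustrative only (the helper names are not from the slides): it enumerates every possible write set and read set and tests whether they always share at least one node.

```python
from itertools import combinations


def quorum_size(n: int) -> int:
    """Majority quorum, N/2+1 with integer division."""
    return n // 2 + 1


def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every W-node write set intersects every R-node read set."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


n = 3
w = r = quorum_size(n)
print(n, w, r, always_overlap(n, w, r))
```

When W + R <= N (for example N=4, W=2, R=2) a read set can miss the latest write entirely, which is the case the quorum rule excludes.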
  • 12. Quorum Write and Read. Last-Win The last write wins. Totally ordered based on timestamp. Time Node.1Client Node.2 Node.3
  • 13. : High durability. : High availability. : Data completeness is guaranteed. Is it enough? Quorum Write and Read..
  • 14. Quorum Write and Read... W + R > N Consistency: Eventual Transactionality: Non-Atomic-Update Dirty-Read Lost-Update http://en.wikipedia.org/wiki/Concurrency_control
  • 15. An Imaginary Storage Service ● A storage system with 3 nodes(processes). ● Policy: Quorum RW. ● It stores only one variable “i”. ● “i” has multiple versions: i1, i2, i3… ● Commands: get /* read latest “i” */ set <n> /* assign <n> to “i” */ inc <n> /* increment “i” by <n> */ It shows us the deficiency of Quorum RW and how paxos solves these problems.
  • 16. An Imaginary Storage Service. "set" → Quorum Write. "inc" → the simplest transactional operation: 1. Read latest “i” with Quorum Read: i1 2. Let i2 = i1 + n 3. set i2 X set i2=3 X get i 21 21 00 32 21 32 X get i1=2 i2 = i1 + 1 32 21 32
  • 17. set i2=3 OK set i2=4 An Imaginary Storage Service.. X X get i 21 21 00 32 21 32 53 21 53 X get i1=2 i2 = i1 + 1 We expect X to be able to get i3=5 This requires Y to “fail” after X wrote i2. How do we do that? Y get i1=2 Y i2 = i1 + 2 32 21 32 Y should run Quorum Read and Quorum Write again... Must Fail. Otherwise the existing value will be overwritten.
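The lost update described on this slide can be reproduced with a toy model. This is a sketch under stated assumptions: the `Node` class, the helper functions, and the choice of which nodes each client contacts are all hypothetical, and writes are blind overwrites, as in plain Quorum RW.

```python
class Node:
    """Toy storage node holding versions of the single variable "i"."""

    def __init__(self):
        self.versions = {}  # version number -> value

    def read_latest(self):
        if not self.versions:
            return (0, 0)  # (version, value) when nothing is stored yet
        ver = max(self.versions)
        return (ver, self.versions[ver])

    def write(self, ver, value):
        # Blind overwrite: nothing protects an already written version.
        self.versions[ver] = value


def quorum_read(nodes):
    """Return the (version, value) pair with the highest version."""
    return max(node.read_latest() for node in nodes)


def quorum_write(nodes, ver, value):
    for node in nodes:
        node.write(ver, value)


nodes = [Node(), Node(), Node()]
quorum_write(nodes[:2], 1, 2)                  # i1 = 2 stored on a quorum

x_ver, x_val = quorum_read(nodes[:2])          # X reads i1 = 2
y_ver, y_val = quorum_read(nodes[1:])          # Y reads i1 = 2 as well

quorum_write(nodes[:2], x_ver + 1, x_val + 1)  # X: inc 1 -> set i2 = 3
quorum_write(nodes[1:], y_ver + 1, y_val + 2)  # Y: inc 2 -> set i2 = 4

final = quorum_read(nodes)
print(final)
```

Both increments report success, yet the latest value is 4 at version 2 rather than 5 at version 3: X's update is lost because Y's write to i2 was not rejected.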
  • 18. An Imaginary Storage Service... In order to correctly get i3 after 2 “inc” operations: There can only be ONE successful “write” operation to a certain version of “i”(in our case: i2). Generalization: One value(one version of a variable) should not be modified any more after it is determined(client received “OK” and believes it is stored). How to define “determined”? How to avoid changing a “determined” value?
  • 19. Determine a Value X Y Any value set? X No XX - --- Any value set? --- Y Yes, Y gives up X XX - XX - Solution: Before writing a value, run a Quorum Read round to check if such a value exists (or may exist).
  • 20. Determine a Value. X Y Any value set? X No YYX Y XX - --- Any value set? --- Y No X But both X and Y would believe there is no value set. X and Y both will start to write at the same time. Lost Update
  • 21. Determine a Value.. X Any value set? X No YYX Y --- --- X Y--- Any value set? Quorum Read+Write: Remember X is the last reader --- Y No Quorum Read+Write: Remember Y is the last reader X -- Solution improved: Remember who did the last read And deny write from previous readers. now node 1 and 2 will only accept request from X. now node 2 and 3 will only accept request from Y.
  • 22. Determine a Value... By applying this policy, a value(each version of “i” in our case) can be stored safely and consistently. Leslie Lamport wrote a paper on this policy.
  • 23. Paxos
  • 24. What is Paxos ● A reliable storage: based on Quorum RW. ● Each paxos instance stores only 1 value. ● 2 rounds are required to determine 1 value. ● A value can’t be modified after it is determined. ● “Determined” means being accepted by a quorum(>n/2). ● Immediate Consistency.
  • 25. Paxos Classic Paxos 2 rounds per instance. Multi Paxos ~1 round per instance. Fast Paxos 1 round per instance ( without conflict ). 2 rounds per instance ( with conflict ).
  • 26. Paxos: Precondition Storage must be reliable: No Data loss /* Or it falls back to Byzantine Paxos */ Tolerate: Message loss Message in random order
  • 27. Proposer: process that starts a paxos round to write sth. Acceptor: process that receives and stores messages. Quorum( of acceptors ) : n/2+1 Acceptors. Round: Including 2 phases: Phase-1 & Phase-2 Round Number (rnd): ID of a round. Monotonically increasing; Last-Win; Universally unique; Paxos: Concepts
  • 28. Last Round Number (last_rnd): Greatest rnd an Acceptor has ever seen; To identify the proposer from which an acceptor would accept write request; Value (v): the value an Acceptor accepted. Value round number (vrnd): At which round an Acceptor accepted the v. Value determined: The value accepted by a quorum of acceptors. Paxos: Concepts.
  • 29. Illustration of Acceptor 5,x3 last_rnd v vrnd In following slides, an Acceptor would have 3 attributes saved on it: last_rnd, v and vrnd:
  • 30. Paxos: Classic - phase 1 X rnd=1 X last_rnd=0, v=nil, vrnd=0 last_rnd=0, v=nil, vrnd=0..Phase 1 1,1, - --- Proposer X Acceptor 1,2,3 When an Acceptor receives a request from a Proposer: ● Refuse requests whose rnd < last_rnd. ● Save the rnd from phase-1 request into its last_rnd. ● From now on it only accepts phase-2 requests with this last_rnd. ● Respond with last_rnd, v and vrnd it has previously accepted.
  • 31. Paxos: Classic - phase 1. X rnd=1 X Phase 1 1,1, - --- Proposer X Acceptor 1,2,3 When the Proposer receives replies from Acceptors: ● If a last_rnd > rnd found: Discard this round. ● Choose v with the greatest vrnd if there is non-nil v. ● Otherwise, choose the v that Proposer wants to write. ● If fewer than a quorum (n/2+1) of responses are received, fail this round. last_rnd=0, v=nil, vrnd=0 last_rnd=0, v=nil, vrnd=0..
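The proposer-side rules on this slide can be sketched as a pure function. It is an illustration, not a reference implementation: the name `choose_value`, the reply tuple layout `(last_rnd, v, vrnd)`, and the use of `None` to signal a failed round are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# One acceptor reply: (last_rnd, v, vrnd); v is None when nothing was accepted.
Reply = Tuple[int, Optional[str], int]


def choose_value(rnd: int, replies: List[Reply], n: int,
                 own_value: str) -> Optional[str]:
    """Return the value to send in phase 2, or None if the round must fail."""
    if len(replies) < n // 2 + 1:
        return None  # no quorum of responses: fail this round
    if any(last_rnd > rnd for last_rnd, _, _ in replies):
        return None  # an acceptor has seen a newer round: discard this round
    accepted = [(vrnd, v) for _, v, vrnd in replies if v is not None]
    if accepted:
        # An existing value may already be determined: reuse the one
        # with the greatest vrnd instead of the proposer's own value.
        return max(accepted, key=lambda pair: pair[0])[1]
    return own_value


print(choose_value(1, [(1, None, 0), (1, None, 0)], n=3, own_value="x"))
print(choose_value(3, [(3, "y", 2), (3, "x", 1)], n=3, own_value="z"))
```

The second call shows the important case: the proposer wanted to write "z" but must carry forward "y", because "y" has the greatest vrnd among the replies.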
  • 32. Paxos: Classic - phase 2 X v="x", rnd=1 X AcceptedPhase 2 1,1, - 1,x1 1,x1 - Proposer X Acceptor 1,2,3 v=x, vrnd=1 Proposer: Send phase-2 with v chosen from previous step to Acceptors
  • 33. Paxos: Classic - phase 2. X v="x", rnd=1 X AcceptedPhase 2 1,1, - 1,x1 1,x1 - Proposer X Acceptor 1,2,3 v=x, vrnd=1 Acceptor: ● Accept requests with rnd that equals its last_rnd. last_rnd==rnd guarantees that no other Proposer has touched this Acceptor since phase 1.
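The acceptor rules for phase 1 and phase 2 can be sketched as a small class. This is a minimal in-memory model under assumptions: the method names, the reply tuple, and the scenario with two proposers X and Y are illustrative, and persistence and networking are left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    last_rnd: int = 0          # greatest rnd seen in a phase-1 request
    v: Optional[str] = None    # accepted value, None if nothing accepted
    vrnd: int = 0              # round in which v was accepted

    def phase1(self, rnd: int) -> Tuple[bool, int, Optional[str], int]:
        """Reply with (ok, last_rnd, v, vrnd)."""
        if rnd < self.last_rnd:
            return (False, self.last_rnd, self.v, self.vrnd)  # refuse old round
        self.last_rnd = rnd    # remember the latest reader
        return (True, self.last_rnd, self.v, self.vrnd)

    def phase2(self, rnd: int, value: str) -> bool:
        """Accept only if rnd equals the remembered last_rnd."""
        if rnd != self.last_rnd:
            return False
        self.v, self.vrnd = value, rnd
        return True


a1, a2, a3 = Acceptor(), Acceptor(), Acceptor()

# Proposer X runs phase 1 with rnd=1 on a1 and a2.
for acc in (a1, a2):
    acc.phase1(1)
# Proposer Y runs phase 1 with rnd=2 on a2 and a3; a2 now forgets X.
for acc in (a2, a3):
    acc.phase1(2)

x_acks = [acc.phase2(1, "x") for acc in (a1, a2)]
y_acks = [acc.phase2(2, "y") for acc in (a2, a3)]
print(x_acks, y_acks)
```

X is accepted by only one of three acceptors, so its round fails, while Y reaches a quorum of two and "y" becomes the determined value.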
  • 34. Paxos: Case 1: Classic, no Conflict X rnd=1 X last_rnd=0, v=nil, vrnd=0 X v="x", rnd=1 X Accepted Phase 1 Phase 2 1,1, - --- 1,1, - 1,x1 1,x1 - Proposer X Acceptor 1,2,3 v=x, vrnd=1
  • 35. Paxos: Case 2.1: Resolve Conflict X Y rnd=1 X Phase 1 for X rnd=2 OK, forget X Phase 1 for Y Y X Y v="x", rnd=1 Fail v="y",rnd=2 OK Phase 2 Y round=1 round=2 Time 2,y2 1,x1 2,y2 2,1,x1 2, 2,1,x1 2, 2,1, 2, 1,1, - 1,1, - ---
  • 36. Paxos: Case 2.2: Respect Existing v X rnd=3 X v="y",vrnd=2; v="x",vrnd=1; choose 'y' Phase 1 X v="y",vrnd=3 Phase 2 round=3 2,y2 1,x1 2,y2 3,y2 3,x1 2,y2 3,y2 3,x1 2,y2 X OK 3,y3 3,y3 3,y3 v=“y” must be chosen by Proposer X because “y” may be a determined value and should not be overwritten, although without checking the 3rd acceptor we do not know whether “y” is actually determined (accepted by a quorum).
  • 37. Paxos........ Learner: ● Acceptors send a phase-3 message to the Learner to inform it that a value has been determined. ● Most of the time a Proposer can also be a Learner. Livelock: Proposers continually raise their rnd and overwrite each other’s last_rnd on Acceptors, thus no phase-2 can be done successfully.
  • 38. Multi Paxos Combine multiple phase-1 requests into one message. Send each phase-2 request separately. Applications: chubby zookeeper megastore spanner
  • 39. Fast Paxos ● Proposers send phase-2 without sending phase-1. ● rnd in a Fast Paxos phase-2 is 0. rnd=0 because rnd must be lower than any Classic rnd. So it can fall back to Classic Paxos safely. ● Acceptor accepts Fast-phase-2 only when v=nil ● If conflict happened, Proposer should fall back to Classic Paxos with a rnd > 0. Is Fast Paxos as cheap as Classic Paxos?
  • 40. Fast Paxos Quorum --- - - 0,x0 -0,x0 0,x0 0,y0 0,x0 X fast rnd=0 X phase 2 OK Y fast rnd=0 phase 2 2/5; Fails - 0,y0? ? If Quorum of Fast Paxos is n/2+1 = 3: When Y found conflict and fell back to Classic Paxos: No way for Y to know if x0 or y0 is a determined value. Solution: An undetermined value must not occupy half of the n/2+1 Acceptors: → Fast quorum > n*¾; → A value is determined in Fast Round if it is accepted by n*¾+1 Acceptors.
  • 41. Fast Paxos Quorum. Fast Paxos Quorum = n*¾ Availability becomes lower because Fast Paxos requires more Acceptors to work. Fast Paxos requires at least 5 Acceptors in order to tolerate one failed Acceptor.
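The quorum arithmetic behind this claim can be checked directly. The sketch assumes the fast quorum is the smallest integer strictly greater than n*¾, which is how the preceding slide's "n*¾+1" is read here; the function names are hypothetical.

```python
def classic_quorum(n: int) -> int:
    """Classic Paxos quorum: n/2+1 with integer division."""
    return n // 2 + 1


def fast_quorum(n: int) -> int:
    """Assumed fast quorum: smallest integer strictly greater than 3n/4."""
    return (3 * n) // 4 + 1


def tolerated_failures(n: int, quorum: int) -> int:
    """Acceptors that may be down while a quorum is still reachable."""
    return n - quorum


for n in range(3, 8):
    print(n,
          tolerated_failures(n, classic_quorum(n)),
          tolerated_failures(n, fast_quorum(n)))

# Smallest cluster in which a fast round survives one failed acceptor.
smallest = next(n for n in range(1, 20)
                if tolerated_failures(n, fast_quorum(n)) >= 1)
print(smallest)
```

Under this reading, clusters of 3 or 4 acceptors cannot lose any acceptor and still complete a fast round, whereas Classic Paxos already tolerates one failure with 3.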
  • 42. Fast Paxos ⅘: Y has a Conflict --- - - 0,x0 -0,x0 0,x0 0,x0 0,y0 0,x0 0,x0 0,x0 0,x0 2,y0 0,x0 0,x0 2,x0 2,x0 2,x2 0,x0 0,x0 2,x2 2,x2 X fast rnd=0 X phase 2 OK Y fast rnd=0 phase 2 1/5; Fail Y classic rnd=2 phase 1 OK, "x" Y phase 2 OK, writes "x" Y saw two x0 on 3 Acceptors. Y must choose x0 because x0 might be a determined value. y0 can not be determined because even if the other two untouched acceptors both have y0 , there are not enough(5*¾ ) y0 to form a quorum.
  • 43. Fast Paxos ⅘: X Y conflicts --- - - 0,x0 0,x0 0,x0 0,y0 0,y0 1,x0 1,x0 1,x0 0,y0 0,y0 1,x0 1,x0 2,y0 2,y0 2,x0 X fast rnd=0 X phase 2 Conflict Y fast rnd=0 phase 2 Y Conflict 0,x0 0,x0 0,x0 0,y0 0,y0X classic rnd=1 phase 1 Y classic rnd=2 phase 1 X OK, only "x" Y OK, choose "y" Y phase 2 2,y2 2,y2 2,y2 2,y2 2,y2X fail in phase 2
  • 44. Note In phase-2, it is also correct if the Acceptor accepts requests with rnd >= last_rnd