These slides explain why Paxos is the only correct way to solve consensus problems in a distributed system.
They use several diagrams to show how Paxos is derived, step by step, from a naive replication algorithm into an immediately consistent replication algorithm.
It starts with master-slave replication.
Then we refine it into quorum read/write (quorum-rw) by adding a consistency constraint.
And then we refine quorum-rw into Paxos by adding an atomicity constraint.
5. Solution(Maybe)
Multiple Replicas
No data loss if x (x < n) replicas are lost
Durability (the percentages below are the probability of losing every replica):
1 replica: ~ 0.63%
2 replicas: ~ 0.00395%
3 replicas: < 0.000001%
n replicas: durability = 1 - x^n /* x = failure rate of a single replica */
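The formula above can be checked with a few lines of Python. This is only a sketch: it assumes replicas fail independently, and the function names and the sample rate x = 0.0063 (the 0.63% single-replica figure) are illustrative choices, not part of the slides.

```python
def loss_probability(x: float, n: int) -> float:
    """Probability that all n replicas are lost, assuming independent failures."""
    return x ** n


def durability(x: float, n: int) -> float:
    """Durability as defined on the slide: 1 - x^n."""
    return 1.0 - loss_probability(x, n)


if __name__ == "__main__":
    x = 0.0063  # assumed failure rate of a single replica (0.63%)
    for n in (1, 2, 3):
        print(f"{n} replica(s): loss = {loss_probability(x, n):.3e}, "
              f"durability = {durability(x, n):.10f}")
```

Each extra replica multiplies the loss probability by x, which is why durability improves so quickly.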
8. Master-Slave Async
The MySQL way.
1. Master received write op.
2. Master wrote on disk.
3. Master responded ‘OK’.
4. Master replicated to slaves.
If the disk fails before replication
→ Data loss.
[Diagram: timeline of Client, Master, Slave.1 and Slave.2, with a disk failure before replication]
9. Master-Slave Sync
1. Master received write op.
2. Master replicated log to slaves.
3. A slave may block...
4. Client won’t receive ‘OK’ until all slaves respond.
One unreachable node halts the entire system.
No data loss.
But low availability.
[Diagram: timeline of Client, Master, Slave.1 and Slave.2]
10. Master-Slave Semi-Sync
1. Master received write op.
2. Master replicated log to slaves.
3. A slave may block...
4. Client receives ‘OK’ if [1,n) slaves respond.
High durability.
High availability.
But no single slave is guaranteed to have all data
→ We need Quorum Write
[Diagram: timeline of Client, Master, Slave.1 and Slave.2]
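The three master-slave variants differ only in how many slave acknowledgements the master waits for before answering ‘OK’. The sketch below makes that explicit. The function name and the mode strings are hypothetical, and semi-sync is simplified to "at least one slave responded".

```python
def client_gets_ok(mode: str, acks: int, n_slaves: int) -> bool:
    """Return True if the master may answer 'OK' after `acks` slave responses."""
    if mode == "async":
        return True               # master answers right after its own disk write
    if mode == "sync":
        return acks == n_slaves   # every slave must respond
    if mode == "semi-sync":
        return acks >= 1          # at least one slave responded (simplified)
    raise ValueError(f"unknown mode: {mode}")
```

With 2 slaves and one of them unreachable, `client_gets_ok("sync", 1, 2)` is False (the system halts), while `client_gets_ok("semi-sync", 1, 2)` is True.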
11. Quorum Write and Read
Dynamo / Cassandra
Write to W >= N/2+1 nodes.
No master required.
Read from R >= N/2+1 nodes.
W + R > N
Tolerate up to (N-1)/2 failed nodes.
[Diagram: timeline of Client, Node.1, Node.2 and Node.3]
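Why W + R > N matters: it forces every write quorum to share at least one node with every read quorum, so a read always touches a node that saw the latest write. The brute-force check below is a sketch for small N. All function names are illustrative.

```python
from itertools import combinations


def quorum_size(n: int) -> int:
    """Majority quorum: N/2+1 with integer division."""
    return n // 2 + 1


def max_failed_nodes(n: int) -> int:
    """Number of failed nodes that still leaves a majority alive: (N-1)/2."""
    return (n - 1) // 2


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of w nodes intersects every set of r nodes."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))
```

For N = 3, `quorums_always_overlap(3, 2, 2)` holds, while W = 1, R = 2 (W + R = N) does not: a write to node 0 only can be missed by a read of nodes 1 and 2.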
12. Quorum Write and Read. Last-Win
The last write wins.
Writes are totally ordered by timestamp.
[Diagram: timeline of Client, Node.1, Node.2 and Node.3]
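A minimal sketch of last-write-wins on top of Quorum RW. The `Node` class, the integer timestamps and the function names are assumptions made for illustration. A real system such as Cassandra uses client or coordinator timestamps and has more machinery around it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    ts: int = 0                  # timestamp of the stored write (0 = nothing stored)
    value: Optional[int] = None


def quorum_write(nodes: List[Node], value: int, ts: int) -> None:
    """Write to the given quorum; each node keeps only the newest timestamp."""
    for node in nodes:
        if ts > node.ts:
            node.ts, node.value = ts, value


def quorum_read(nodes: List[Node]) -> Optional[int]:
    """Read the given quorum and return the value with the greatest timestamp."""
    newest = max(nodes, key=lambda node: node.ts)
    return newest.value
```

If a first write (ts=1) lands on nodes 1 and 2 and a second write (ts=2) lands on nodes 2 and 3, any read of two nodes includes at least one copy with ts=2 and therefore returns the second value.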
13. Quorum Write and Read..
High durability.
High availability.
Data completeness is guaranteed.
Is it enough?
14. Quorum Write and Read...
W + R > N
Consistency:
Eventual
Transactionality:
Non-Atomic-Update
Dirty-Read
Lost-Update
http://en.wikipedia.org/wiki/Concurrency_control
15. An Imaginary Storage Service
● A storage system with 3 nodes (processes).
● Policy: Quorum RW.
● It stores only one variable “i”.
● “i” has multiple versions: i1, i2, i3…
● Commands:
get /* read latest “i” */
set <n> /* assign <n> to “i” */
inc <n> /* increment “i” by <n> */
It shows us the deficiencies of Quorum RW
and how Paxos solves them.
16. An Imaginary Storage Service.
"set" → Quorum Write.
"inc" → the simplest transactional operation:
1. Read latest “i” with Quorum Read: i1
2. Let i2 = i1 + n
3. set i2
[Diagram: X runs “inc 1” on nodes holding (i1=2, i1=2, i0=0). X: get i returns i1=2; X computes i2 = i1 + 1; X: set i2=3. The nodes then hold (i2=3, i1=2, i2=3).]
17. An Imaginary Storage Service..
[Diagram: X and Y run “inc” concurrently on nodes holding (i1=2, i1=2, i0=0).
X: get i1=2, i2 = i1 + 1, set i2=3 → OK. The nodes then hold (i2=3, i1=2, i2=3).
Y: get i1=2, i2 = i1 + 2, set i2=4 → must fail, or the existing value will be overwritten.
Y should run Quorum Read and Quorum Write again; the nodes then hold (i3=5, i1=2, i3=5).]
We expect X to be able to get i3=5.
This requires Y to “fail” after X wrote i2. How do we do that?
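The lost update can be reproduced in a few lines. This is a sketch of the imaginary storage service under plain Quorum RW. `QuorumStore`, its fixed quorums and `run_two_incs` are illustrative names, and the two clients are interleaved by hand rather than by real concurrency.

```python
class QuorumStore:
    """Three nodes, each holding (version, value) of the single variable i."""

    def __init__(self):
        # i1=2 on two nodes, i0=0 on the third
        self.nodes = [(1, 2), (1, 2), (0, 0)]

    def get(self, quorum=(0, 1)):
        # Quorum Read: the latest (version, value) among the nodes read
        return max(self.nodes[k] for k in quorum)

    def set(self, version, value, quorum=(0, 2)):
        # Quorum Write: blind overwrite, no conflict check
        for k in quorum:
            self.nodes[k] = (version, value)


def run_two_incs():
    store = QuorumStore()
    ver_x, val_x = store.get()          # X reads i1=2
    ver_y, val_y = store.get()          # Y reads i1=2 as well
    store.set(ver_x + 1, val_x + 1)     # X: inc 1 → writes i2=3
    store.set(ver_y + 1, val_y + 2)     # Y: inc 2 → writes i2=4, overwriting X
    return store.get()                  # latest (version, value)
```

`run_two_incs()` ends with version 2 holding 4 instead of a version 3 holding 5: the increment by X is lost because both clients wrote the same version.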
18. An Imaginary Storage Service...
In order to correctly get i3 after 2 “inc” operations:
There can be only ONE successful “write” operation
to a given version of “i” (in our case: i2).
Generalization:
One value (one version of a variable) must not be
modified any more after it is determined (the client received
“OK” and believes it is stored).
How to define “determined”?
How to avoid changing a “determined” value?
19. Determine a Value
[Diagram: X asks a quorum “Any value set?”, gets “No”, and writes its value to two nodes. Y then asks “Any value set?”, gets “Yes”, and gives up.]
Solution: Before writing a value, run a Quorum Read
round to check whether such a value exists (or may exist).
20. Determine a Value.
[Diagram: X and Y both ask “Any value set?” before either of them has written. Both get “No”, and both write.]
But both X and Y would believe there is no value set.
X and Y will both start to write at the same time.
→ Lost Update
21. Determine a Value..
[Diagram: X asks nodes 1 and 2 “Any value set?” and gets “No”; with Quorum Read+Write, both nodes remember X as the last reader. Y then asks nodes 2 and 3 and gets “No”; both nodes remember Y as the last reader. X's write is then accepted by node 1 only, while Y's write is accepted by nodes 2 and 3.]
Solution improved: Remember who did the last read, and
deny writes from previous readers.
After X's read, nodes 1 and 2 will only accept
requests from X.
After Y's read, nodes 2 and 3 will only accept
requests from Y.
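The “remember the last reader” policy can be sketched as follows. The `Node` class and `scenario` function are illustrative, and the interleaving is the one from the diagram: X reads nodes 1 and 2, then Y reads nodes 2 and 3, then both try to write.

```python
class Node:
    def __init__(self):
        self.last_reader = None
        self.value = None

    def read(self, reader):
        # Quorum Read+Write: the read also records who read last
        self.last_reader = reader
        return self.value

    def write(self, writer, value):
        if writer != self.last_reader:
            return False          # deny writes from previous readers
        self.value = value
        return True


def scenario():
    n1, n2, n3 = Node(), Node(), Node()
    for node in (n1, n2):
        node.read("X")            # X: "Any value set?" → No
    for node in (n2, n3):
        node.read("Y")            # Y: "Any value set?" → No; node 2 forgets X
    x_acks = sum(node.write("X", "x") for node in (n1, n2))
    y_acks = sum(node.write("Y", "y") for node in (n2, n3))
    return x_acks, y_acks, [n1.value, n2.value, n3.value]
```

X collects only 1 acknowledgement (less than a quorum of 2) and fails, while Y collects 2 and succeeds, so only one of the two writers can win.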
22. Determine a Value...
By applying this policy, a value (each version of “i” in our
case) can be stored safely and consistently.
Leslie Lamport wrote a paper about this policy.
24. What is Paxos
● A reliable storage: based on Quorum RW.
● Each Paxos instance stores only 1 value.
● 2 round trips (phase 1 and phase 2) are required to determine 1 value.
● A value can’t be modified after it is determined.
● “Determined” means being accepted by a quorum (> n/2).
● Immediate Consistency.
25. Paxos
Classic Paxos
2 rounds per instance.
Multi Paxos
~1 round per instance.
Fast Paxos
1 round per instance ( without conflict ).
2 rounds per instance ( with conflict ).
26. Paxos: Precondition
Storage must be reliable:
No data loss
/* Or it falls back to Byzantine Paxos */
Tolerate:
Message loss
Messages arriving in arbitrary order
27. Paxos: Concepts
Proposer: a process that starts a Paxos round to write something.
Acceptor: a process that receives and stores messages.
Quorum (of Acceptors): n/2+1 Acceptors.
Round: includes 2 phases: Phase-1 and Phase-2.
Round Number (rnd):
ID of a round.
Monotonically increasing; last-win; universally unique.
28. Paxos: Concepts.
Last Round Number (last_rnd):
The greatest rnd an Acceptor has ever seen;
it identifies the Proposer from which the Acceptor would
accept a write request.
Value (v): the value an Acceptor accepted.
Value round number (vrnd):
The round in which an Acceptor accepted v.
Value determined:
A value that has been accepted by a quorum of Acceptors.
30. Paxos: Classic - phase 1
[Diagram: Phase 1. Proposer X sends rnd=1 to Acceptors 1 and 2. Each replies last_rnd=0, v=nil, vrnd=0 and records last_rnd=1. Acceptor 3 is untouched.]
When an Acceptor receives a phase-1 request from a Proposer:
● Refuse requests whose rnd < last_rnd.
● Save the rnd from the phase-1 request into its last_rnd.
● From now on it only accepts phase-2 requests with this
last_rnd.
● Respond with its previous last_rnd, and the v and vrnd it has
previously accepted.
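The Acceptor's phase-1 rules can be sketched as a small class. This is an illustration, not a full implementation: the class layout and the method name `on_phase1` are assumptions, a refusal is modelled as returning None, and persistence and networking are left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    last_rnd: int = 0            # greatest rnd seen so far
    v: Optional[str] = None      # accepted value, None stands for nil
    vrnd: int = 0                # round in which v was accepted

    def on_phase1(self, rnd: int) -> Optional[Tuple[int, Optional[str], int]]:
        if rnd < self.last_rnd:
            return None                               # refuse stale rounds
        reply = (self.last_rnd, self.v, self.vrnd)    # state before this request
        self.last_rnd = rnd                           # remember the newest round
        return reply
```

A fresh Acceptor answers a request with rnd=1 with (0, None, 0), matching the reply last_rnd=0, v=nil, vrnd=0 in the diagram, and afterwards refuses any request with rnd < 1.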
31. Paxos: Classic - phase 1.
[Diagram: Phase 1. Proposer X (rnd=1) receives last_rnd=0, v=nil, vrnd=0 from Acceptors 1 and 2.]
When the Proposer receives replies from Acceptors:
● If a reply carries last_rnd > rnd: discard this round.
● If any reply has a non-nil v: choose the v with the greatest vrnd.
● Otherwise: choose the v that the Proposer wants to write.
● If fewer than a quorum (n/2+1) of responses are received: fail this round.
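The Proposer's handling of phase-1 replies is a pure function of the replies, which makes it easy to sketch. The `Reply` tuple layout and the name `choose_value` are assumptions. The quorum is taken as n/2+1, as defined on the Concepts slide, and None means “fail or discard this round”.

```python
from typing import List, Optional, Tuple

Reply = Tuple[int, Optional[str], int]   # (last_rnd, v, vrnd) from one Acceptor


def choose_value(replies: List[Reply], rnd: int, n: int,
                 my_value: str) -> Optional[str]:
    """Return the value to send in phase 2, or None if the round must stop."""
    if len(replies) < n // 2 + 1:
        return None                      # not enough responses: fail this round
    if any(last_rnd > rnd for last_rnd, _, _ in replies):
        return None                      # a newer round exists: discard this round
    accepted = [(vrnd, v) for _, v, vrnd in replies if v is not None]
    if accepted:
        # an existing value must be respected: take the one with the greatest vrnd
        return max(accepted, key=lambda item: item[0])[1]
    return my_value                      # nothing accepted yet: free to write our own
```

With replies (2, "y", 2) and (1, "x", 1) and rnd=3, the function returns "y", whatever value the Proposer originally wanted to write.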
32. Paxos: Classic - phase 2
[Diagram: Phase 2. Proposer X sends v="x", rnd=1 to Acceptors 1 and 2. Both reply Accepted and store v=x, vrnd=1. Acceptor 3 is untouched.]
Proposer:
Send a phase-2 request with the v chosen in the previous step to the
Acceptors.
33. Paxos: Classic - phase 2.
[Diagram: Phase 2. Proposer X sends v="x", rnd=1 to Acceptors 1 and 2. Both reply Accepted and store v=x, vrnd=1. Acceptor 3 is untouched.]
Acceptor:
● Accept requests whose rnd equals its last_rnd.
last_rnd == rnd guarantees that no other Proposer has
touched this Acceptor in the meantime.
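The Acceptor's phase-2 rule is a single comparison. The sketch below uses a hypothetical `on_phase2` method on a minimal Acceptor. It implements the strict rule stated above (rnd must equal last_rnd).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    last_rnd: int = 0            # round recorded during phase 1
    v: Optional[str] = None      # accepted value, None stands for nil
    vrnd: int = 0                # round in which v was accepted

    def on_phase2(self, rnd: int, value: str) -> bool:
        if rnd != self.last_rnd:
            return False                 # another Proposer has touched this Acceptor
        self.v, self.vrnd = value, rnd   # accept: remember the value and its round
        return True
```

An Acceptor whose last_rnd was raised to 2 by another Proposer rejects a phase-2 request with rnd=1 and keeps its state unchanged.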
34. Paxos: Case 1: Classic, no Conflict
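[Diagram: Proposer X and Acceptors 1, 2, 3. Phase 1: X sends rnd=1; Acceptors 1 and 2 reply last_rnd=0, v=nil, vrnd=0 and record last_rnd=1. Phase 2: X sends v="x", rnd=1; Acceptors 1 and 2 reply Accepted and store v=x, vrnd=1. Acceptor 3 is untouched throughout.]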
35. Paxos: Case 2.1: Resolve Conflict
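[Diagram: timeline of round=1 and round=2. Phase 1 for X: X sends rnd=1 and two Acceptors record last_rnd=1. Phase 1 for Y: Y sends rnd=2; two Acceptors reply “OK, forget X” and record last_rnd=2. Phase 2: X sends v="x", rnd=1 and fails, because only one Acceptor still has last_rnd=1; Y sends v="y", rnd=2 and gets OK. Final state: two Acceptors hold v=y, vrnd=2 and one holds v=x, vrnd=1.]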
36. Paxos: Case 2.2: Respect Existed v
[Diagram: round=3. Starting state: two Acceptors hold v=y, vrnd=2 and one holds v=x, vrnd=1. Phase 1: X sends rnd=3 and receives v="y", vrnd=2 and v="x", vrnd=1, so it chooses "y". Phase 2: X sends v="y", rnd=3 and gets OK; the Acceptors end with v=y, vrnd=3.]
v=“y” must be chosen by Proposer X because “y” may
be a determined value and should not be overwritten.
This holds even though, without checking the 3rd Acceptor,
we do not know whether “y” is actually determined
(accepted by a quorum).
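Cases 2.1 and 2.2 can be replayed end to end with a small in-memory simulation. This is a sketch under simplifying assumptions: messages are plain function calls, each Proposer talks to exactly two of the three Acceptors, and all names (`phase1`, `phase2`, `pick`, `run_cases`) are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    last_rnd: int = 0
    v: Optional[str] = None
    vrnd: int = 0


def phase1(acceptors, rnd):
    """Send phase 1 to the given Acceptors; return their (v, vrnd) replies."""
    replies = []
    for a in acceptors:
        if rnd >= a.last_rnd:            # refuse only rnd < last_rnd
            replies.append((a.v, a.vrnd))
            a.last_rnd = rnd
    return replies


def phase2(acceptors, rnd, value):
    """Send phase 2; return how many Acceptors accepted."""
    acks = 0
    for a in acceptors:
        if rnd == a.last_rnd:
            a.v, a.vrnd = value, rnd
            acks += 1
    return acks


def pick(replies, my_value):
    """Respect an existing value with the greatest vrnd, else use our own."""
    accepted = [r for r in replies if r[0] is not None]
    return max(accepted, key=lambda r: r[1])[0] if accepted else my_value


def run_cases():
    a1, a2, a3 = Acceptor(), Acceptor(), Acceptor()
    # Case 2.1: X (rnd=1) and Y (rnd=2) conflict on Acceptor 2
    rx = phase1([a1, a2], rnd=1)
    ry = phase1([a2, a3], rnd=2)                  # Acceptor 2 forgets X
    x_acks = phase2([a1, a2], 1, pick(rx, "x"))   # only Acceptor 1 accepts
    y_acks = phase2([a2, a3], 2, pick(ry, "y"))   # both accept
    # Case 2.2: X retries with rnd=3 and must respect the existing "y"
    chosen = pick(phase1([a1, a2], rnd=3), "x")
    x_acks_retry = phase2([a1, a2], 3, chosen)
    return x_acks, y_acks, chosen, x_acks_retry
```

In this run X first gets 1 acceptance (no quorum), Y gets 2, and on its retry X chooses "y" rather than "x" and has it accepted by 2 Acceptors.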
37. Paxos........
Learner:
● Acceptors send a phase-3 message to the Learner to inform
it that a value has been determined.
● Most of the time a Proposer can also be a Learner.
Livelock:
Proposers continually raise their rnd and overwrite each other's
last_rnd on Acceptors, so no phase-2 can complete
successfully.
38. Multi Paxos
Combine multiple phase-1 requests into one
message.
Send each phase-2 request separately.
Applications:
Chubby, ZooKeeper, Megastore, Spanner
39. Fast Paxos
● Proposers send phase-2 without sending phase-1.
● rnd in a Fast Paxos phase-2 is 0.
rnd=0 because it must be lower than any Classic rnd,
so that it can fall back to Classic Paxos safely.
● An Acceptor accepts a Fast phase-2 request only when its v=nil.
● If a conflict happens, the Proposer should fall back to Classic
Paxos with a rnd > 0.
Is Fast Paxos as cheap as Classic Paxos?
40. Fast Paxos Quorum
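[Diagram: 5 Acceptors, all initially empty. X sends fast phase 2 (rnd=0) and gets OK: 3 Acceptors store x0 (x accepted at vrnd=0). Y sends fast phase 2 (rnd=0) and fails: only 2 of 5 Acceptors store y0. When Y later reads a quorum of 3 Acceptors, it sees both x0 and y0 and cannot see what the other 2 Acceptors hold.]
If the Quorum of Fast Paxos is n/2+1 = 3:
When Y finds the conflict and falls back to Classic Paxos:
There is no way for Y to know whether x0 or y0 is a determined value.
Solution: An undetermined value must not occupy half of the n/2+1 Acceptors:
→ Fast quorum > n*¾;
→ A value is determined in a Fast Round if it is accepted by n*¾+1 Acceptors.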
41. Fast Paxos Quorum.
Fast Paxos Quorum > n*¾
Availability becomes lower because Fast Paxos requires
more Acceptors to work.
Fast Paxos requires at least 5 Acceptors in order to tolerate
one failed Acceptor.
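The quorum arithmetic can be checked directly. The sketch below follows the rule as stated on these slides (fast quorum = n*¾+1 with integer division, that is, strictly more than n*¾). The function names are illustrative.

```python
def classic_quorum(n: int) -> int:
    """Classic Paxos quorum: n/2+1."""
    return n // 2 + 1


def fast_quorum(n: int) -> int:
    """Fast round quorum as stated on the slides: n*3/4 + 1 (integer division)."""
    return (3 * n) // 4 + 1


def fast_failures_tolerated(n: int) -> int:
    """How many Acceptors may fail while a fast quorum is still reachable."""
    return n - fast_quorum(n)


def min_acceptors_for(failures: int) -> int:
    """Smallest n whose fast quorum survives the given number of failures."""
    n = 1
    while fast_failures_tolerated(n) < failures:
        n += 1
    return n
```

For n = 5 the classic quorum is 3 but the fast quorum is 4, and `min_acceptors_for(1)` gives 5, in line with the statement that at least 5 Acceptors are needed to tolerate one failure.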
42. Fast Paxos ⅘: Y has a Conflict
[Diagram: 5 Acceptors. X sends fast phase 2 (rnd=0) and gets OK: 4 Acceptors store x0. Y sends fast phase 2 (rnd=0) and fails: only 1 of 5 Acceptors stores y0. Y falls back to Classic Paxos with rnd=2: in phase 1 it reads 3 Acceptors and gets OK, "x"; in phase 2 it writes "x", which is stored as x2.]
Y saw two x0 on 3 Acceptors.
Y must choose x0 because x0 might be a determined value.
y0 cannot be determined, because even if the other two
untouched Acceptors both hold y0, there are not enough y0
(more than 5*¾ are needed) to form a quorum.
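The reasoning Y applies can be written as a small counting rule: a value might be determined only if the copies Y saw, plus all Acceptors Y did not see, could still add up to a fast quorum. The function name and argument layout below are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def possibly_determined(seen: List[str], n: int, fast_q: int) -> List[str]:
    """Values that may have reached a fast quorum, given what one reader saw.

    seen:   the fast-round value held by each Acceptor that was read
    n:      total number of Acceptors
    fast_q: number of Acceptors needed to determine a value in a fast round
    """
    unseen = n - len(seen)               # worst case: all unseen hold the same value
    counts = Counter(seen)
    return sorted(v for v, c in counts.items() if c + unseen >= fast_q)
```

With n = 5 and a fast quorum of 4, seeing ["x", "x", "y"] leaves only "x" as possibly determined, so Y must choose it. With a fast quorum of only 3, seeing ["x", "y", "y"] leaves both "x" and "y" possible, which is the ambiguity that motivates the larger fast quorum.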
43. Fast Paxos ⅘: X Y conflicts
[Diagram: 5 Acceptors. X and Y both send fast phase 2 (rnd=0): 3 Acceptors store x0 and 2 store y0, so both X and Y see a conflict. X falls back with classic rnd=1 and in phase 1 gets OK, seeing only "x". Y falls back with classic rnd=2 and in phase 1 gets OK and chooses "y". Y's phase 2 succeeds and the Acceptors store y2; X fails in phase 2.]
44. Note
In phase-2, it is also correct if an Acceptor accepts
requests with rnd >= last_rnd