Making Time with Vector Clocks
Matthew Sackman
matthew@goshawkdb.io
https://goshawkdb.io/
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/22vmyrg.

Matthew Sackman discusses dependencies between transactions, how to capture these with Vector Clocks, how to treat Vector Clocks as a CRDT, and how GoshawkDB uses all these ideas and more to achieve a scalable distributed data store with strong serializability, general transactions and fault tolerance. Filmed at qconlondon.com.

Matthew Sackman has been building and studying concurrent and distributed systems for over ten years. He built the first prototype of RabbitMQ, and contributed heavily to the design and implementation of many of its core features. He worked on the initial implementation of the Weave router in 2014. In 2015, he built GoshawkDB: a distributed, transactional, fault-tolerant object store.


GoshawkDB: Making Time with Vector Clocks

Making Time with Vector Clocks
Matthew Sackman
matthew@goshawkdb.io
https://goshawkdb.io/
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/goshawkdb
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
1. Have you played with a NoSQL or NewSQL store?
2. Have you deployed a NoSQL or NewSQL store?
3. Have you studied and know their semantics?
Traditional Database Semantics
ACID
• Atomic: an operation (transaction) either succeeds or aborts completely - no partial successes
• Consistent: constraints like uniqueness, foreign keys, etc. are honoured
• Isolated: the degree to which operations in one transaction can observe actions of concurrent transactions
• Durable: flushed to disk before the client can find out the result
Traditional Database Semantics
Default isolation levels
• PostgreSQL: Read Committed
• Oracle 11g: Read Committed
• MS SQL Server: Read Committed
• MySQL InnoDB: Repeatable Read
Isolation Levels
Snapshot Isolation
as per Wikipedia
“Snapshot isolation is a guarantee that all reads made in a transaction will see a consistent snapshot of the database and the transaction itself will successfully commit only if no updates it has made conflict with any concurrent updates made since that snapshot.”
Snapshot isolation is called “serializable” mode in Oracle.
Snapshot Isolation
trouble up mill
x, y := 0, 0

func t1() {
    if x == 0 {
        y = 1
    }
}

func t2() {
    if y == 0 {
        x = 1
    }
}

• Serialized: t1 then t2: x:0, y:1
              t2 then t1: x:1, y:0
• Snapshot Isolation: t1 || t2: x:1, y:1 (Write Skew)
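
A minimal sketch (not from the talk, and not GoshawkDB code) of why snapshot isolation permits the write skew above: both transactions read the same snapshot, their write sets are disjoint, so both commit, yielding x:1, y:1, a result no serial order can produce. The `state` type, `apply` helper and map-based write sets are illustration-only names.

```go
package main

import "fmt"

type state struct{ x, y int }

// t1 and t2 expressed as functions from a snapshot to a write set.
func t1(snap state) map[string]int {
	if snap.x == 0 {
		return map[string]int{"y": 1}
	}
	return nil
}

func t2(snap state) map[string]int {
	if snap.y == 0 {
		return map[string]int{"x": 1}
	}
	return nil
}

// apply installs a write set on top of a state.
func apply(s state, writes map[string]int) state {
	if v, ok := writes["x"]; ok {
		s.x = v
	}
	if v, ok := writes["y"]; ok {
		s.y = v
	}
	return s
}

func main() {
	start := state{0, 0}

	// Serialized: t1 then t2.
	s := apply(start, t1(start))
	s = apply(s, t2(s))
	fmt.Println("t1 then t2:", s) // {0 1}

	// Serialized: t2 then t1.
	s = apply(start, t2(start))
	s = apply(s, t1(s))
	fmt.Println("t2 then t1:", s) // {1 0}

	// Snapshot isolation: both read the same snapshot; the write
	// sets do not conflict, so both commit.
	w1, w2 := t1(start), t2(start)
	s = apply(apply(start, w1), w2)
	fmt.Println("snapshot isolation:", s) // {1 1} - write skew
}
```
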
Desired Features
• General purpose transactions
• Strong serializability
• Distribution
• Automatic sharding
• Horizontal scalability
• ...
Isolation Levels
CAP
Possibility of Partitions ⇒ ¬(Consistency ∧ Availability)
Achieving Consistency
Colours Indicate Connected Nodes
Learnings 1
• Strong serializability requires Consistency, so must sacrifice Availability
• To achieve Consistency, only accept operations if connected to majority
• If cluster size is 2F + 1 then we can withstand no more than F failures
Modifying Values
Colours Indicate Extent of Txn Write
Learnings 2
• Strong serializability requires Consistency, so must sacrifice Availability
• To achieve Consistency, only accept operations if connected to majority
• If cluster size is 2F + 1 then we can withstand no more than F failures
• Writes must go to F + 1 nodes
Reading Values
Colours Indicate Extent of Txn Write
Learnings 3
• Strong serializability requires Consistency, so must sacrifice Availability
• To achieve Consistency, only accept operations if connected to majority
• If cluster size is 2F + 1 then we can withstand no more than F failures
• Writes must go to F + 1 nodes
• Reads must read from F + 1 nodes and be able to order results
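
A small sketch (illustration only, not from the talk) of the quorum arithmetic behind Learnings 1-3: in a cluster of 2F + 1 nodes, any write quorum of F + 1 nodes and any read quorum of F + 1 nodes must share at least one node, so reads can still observe the latest committed write even with F nodes failed.

```go
package main

import "fmt"

func main() {
	for _, f := range []int{1, 2, 3} {
		cluster := 2*f + 1
		writeQ := f + 1
		readQ := f + 1
		// Guaranteed overlap between any write quorum and any read
		// quorum, even when they are chosen adversarially.
		overlap := writeQ + readQ - cluster
		fmt.Printf("F=%d cluster=%d writeQuorum=%d readQuorum=%d overlap>=%d\n",
			f, cluster, writeQ, readQ, overlap)
	}
}
```
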
Txn Processing in Distributed Databases
1. Client submits txn
2. Node(s) vote on txn
3. Node(s) reach consensus on txn outcome
4. Client is informed of outcome
The most important thing is that all nodes agree on the order of transactions (focus for the rest of this talk!)
Leader Based Ordering
• Only leader votes on whether txn commits or aborts
• Therefore leader must know everything
• If leader fails, a new leader will be elected from remaining nodes
• Therefore all nodes must know everything
• Fine for small clusters, but scaling issues when clusters get big
Client Clock Based Ordering
• Nodes receive txns and must vote on txn outcome and then consensus must be reached (not shown)
• Clients are responsible for applying an increasing clock value to txns
• If a client’s clock races then it can prevent other clients from getting txns submitted
• So must be very careful to try and keep clocks running at the same rate
• No possibility to reorder transactions at all to maximise commits
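
A hedged sketch of the client-clock scheme described above, written as a generic Lamport-style logical clock rather than any particular store's protocol: each client stamps its txns with a monotonically increasing value and advances its clock past anything it observes. The `Client`, `Txn`, `Submit` and `Observe` names are invented for illustration.

```go
package main

import "fmt"

type Client struct {
	id    string
	clock uint64
}

type Txn struct {
	client string
	clock  uint64
}

// Submit stamps a new transaction with the client's next clock value.
func (c *Client) Submit() Txn {
	c.clock++
	return Txn{client: c.id, clock: c.clock}
}

// Observe advances the local clock past a value seen elsewhere, so
// clients stay roughly in step.
func (c *Client) Observe(other uint64) {
	if other > c.clock {
		c.clock = other
	}
}

// Nodes order txns by (clock, client id); a client whose clock races
// ahead pushes every other client's txns behind its own.
func Less(a, b Txn) bool {
	if a.clock != b.clock {
		return a.clock < b.clock
	}
	return a.client < b.client
}

func main() {
	alice, bob := &Client{id: "alice"}, &Client{id: "bob"}
	t1 := alice.Submit()
	bob.Observe(t1.clock)
	t2 := bob.Submit()
	fmt.Println(Less(t1, t2)) // true: t1 is ordered before t2
}
```
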
Syntax
V1 < V2 ⇔ ∀x ∈ dom(V1 ∪ V2) : V1[x] ≤ V2[x]
          ∧ ∃y ∈ dom(V1 ∪ V2) : V1[y] < V2[y]
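
A small sketch of the ordering defined on the Syntax slide, assuming (my assumptions, not stated on the slide) that a vector clock is a map from element id to counter and that a missing element reads as 0: V1 < V2 iff V1 is ≤ V2 at every element of either domain and strictly less at at least one.

```go
package main

import "fmt"

type VectorClock map[string]uint64

// Less reports whether v1 < v2 under the slide's definition.
func Less(v1, v2 VectorClock) bool {
	strictly := false
	// Walk the union of the two domains.
	for x := range union(v1, v2) {
		a, b := v1[x], v2[x] // missing elements read as 0
		if a > b {
			return false
		}
		if a < b {
			strictly = true
		}
	}
	return strictly
}

// union returns the combined domain of two clocks.
func union(v1, v2 VectorClock) map[string]struct{} {
	dom := make(map[string]struct{})
	for x := range v1 {
		dom[x] = struct{}{}
	}
	for x := range v2 {
		dom[x] = struct{}{}
	}
	return dom
}

func main() {
	a := VectorClock{"n1": 1, "n2": 2}
	b := VectorClock{"n1": 1, "n2": 3}
	c := VectorClock{"n1": 2, "n2": 1}
	fmt.Println(Less(a, b)) // true
	fmt.Println(Less(a, c)) // false: a and c are concurrent
	fmt.Println(Less(b, c)) // false
}
```
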
Simple Transaction
Two Writes
Three Writes
The Dumb Approach Doesn’t Work
• Changing state when receiving a txn seems to be a very bad idea
• Maybe only change state when receiving the outcome of a vote
• And don’t vote on txns until we know it’s safe to do so
New Ideas
• Divide time into frames. First half of frame is reads, second half writes.
• Within a frame, we don’t care about order of reads,
  • but all reads must come after writes of previous frame,
  • all writes must come after reads of this frame,
  • all writes must be totally ordered within the frame - must know which write comes last.
Frames & Dependencies
Calculating the Write Clock from Reads
• Merge all read clocks together
• Add 1 to result for every object that was written by txns in our frame’s reads
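
A hedged sketch of the write-clock calculation described above (illustration only; the `VectorClock`, `Merge` and `WriteClock` names, and treating clock elements as object ids, are my assumptions): merge the read clocks element-wise by taking the maximum, then bump the entry of every written object.

```go
package main

import "fmt"

type VectorClock map[string]uint64

// Merge returns the element-wise maximum of the given clocks.
func Merge(clocks ...VectorClock) VectorClock {
	out := make(VectorClock)
	for _, c := range clocks {
		for elem, v := range c {
			if v > out[elem] {
				out[elem] = v
			}
		}
	}
	return out
}

// WriteClock merges the read clocks and adds 1 for every written object.
func WriteClock(readClocks []VectorClock, written []string) VectorClock {
	wc := Merge(readClocks...)
	for _, obj := range written {
		wc[obj]++
	}
	return wc
}

func main() {
	r1 := VectorClock{"objA": 3, "objB": 1}
	r2 := VectorClock{"objB": 2, "objC": 5}
	fmt.Println(WriteClock([]VectorClock{r1, r2}, []string{"objA", "objC"}))
	// map[objA:4 objB:2 objC:6]
}
```
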
Calculating the Frame Winner & Next Frame’s Read Clock
• Partition write results by local clock elem, and within that by txn id
• Each clock inherits missing clock elems from above
• Then sort each partition first by clock (now all same length), then by txn id
• Next frame starts with winner’s clock, +1 for all writes
• Guarantees no concurrent vector clocks (proof in progress!)
• Many details elided! (deadlock freedom, etc)
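
The slide's partitioning and inheritance steps are elided here; as a very rough sketch of just the final sort-and-pick step, and assuming (my assumptions, not the slide's) that equal-length clocks are compared lexicographically and that the greatest candidate wins the frame:

```go
package main

import (
	"fmt"
	"sort"
)

type Candidate struct {
	TxnId string
	Clock []uint64 // all candidates carry clocks of the same length by now
}

// less orders candidates by clock, then by txn id as a tie-break.
func less(a, b Candidate) bool {
	for i := range a.Clock {
		if a.Clock[i] != b.Clock[i] {
			return a.Clock[i] < b.Clock[i]
		}
	}
	return a.TxnId < b.TxnId
}

func main() {
	cs := []Candidate{
		{"txn-b", []uint64{2, 1}},
		{"txn-a", []uint64{2, 1}},
		{"txn-c", []uint64{1, 3}},
	}
	sort.Slice(cs, func(i, j int) bool { return less(cs[i], cs[j]) })
	fmt.Println("frame winner:", cs[len(cs)-1].TxnId) // txn-b
}
```
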
Transitive Vector Clocks
Shrinking Vector Clocks
• Hardest part of Paxos is garbage collection
• Need additional messages to determine when Paxos instances can be deleted
• We can use these to also express: You will never see any of these vector clock elems again
• Therefore we can remove matching elems from vector clocks!
• Many more details elided!
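
A minimal sketch of the pruning step described above (illustration only; `Shrink` and the `neverAgain` set are invented names): once the garbage-collection messages promise that certain clock elements will never be referenced again, those entries can simply be dropped from every clock.

```go
package main

import "fmt"

type VectorClock map[string]uint64

// Shrink removes every element the cluster has promised never to
// reference again.
func Shrink(vc VectorClock, neverAgain map[string]bool) {
	for elem := range vc {
		if neverAgain[elem] {
			delete(vc, elem)
		}
	}
}

func main() {
	vc := VectorClock{"objA": 4, "objB": 2, "objC": 6}
	Shrink(vc, map[string]bool{"objB": true})
	fmt.Println(vc) // map[objA:4 objC:6]
}
```
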
Conclusions
Vector clocks capture dependencies and causal relationships between transactions
Plus we always add transactions into the youngest frame
Which gets us Strong Serializability
Conclusions
No leader, so no potential bottleneck
No wall clocks, so no issues with clock skews
Can separate F from cluster size, which gets us horizontal scalability
References (1)
• Interval Tree Clocks - Paulo Sérgio Almeida, Carlos Baquero, Victor Fonte
• Highly available transactions: Virtues and limitations - Bailis et al
• Coordination avoidance in database systems - Bailis et al
• k-dependency vectors: A scalable causality-tracking protocol - Baldoni, Melideo
• Multiversion concurrency control: theory and algorithms - Bernstein, Goodman
• Serializable isolation for snapshot databases - Cahill, Röhm, Fekete
• Paxos made live: an engineering perspective - Chandra, Griesemer, Redstone
References (2)
• Consensus on transaction commit - Gray, Lamport
• Spanner: Google’s globally distributed database - Corbett et al
• Faster generation of shorthand universal cycles for permutations - Holroyd, Ruskey, Williams
• s-Overlap Cycles for Permutations - Horan, Hurlbert
• Universal cycles of k-subsets and k-permutations - Jackson
• Zab: High-performance broadcast for primary-backup systems - Junqueira, Reed, Serafini
• Time, clocks, and the ordering of events in a distributed system - Lamport
References (3)
• The part-time parliament - Lamport
• Paxos made simple - Lamport
• Consistency, Availability, and Convergence - Mahajan, Alvisi, Dahlin
• Notes on Theory of Distributed Systems - Aspnes
• In search of an understandable consensus algorithm - Ongaro, Ousterhout
• Perfect Consistent Hashing - Sackman
• The case for determinism in database systems - Thomson, Abadi
Distributed databases are FUN!
https://goshawkdb.io/
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/goshawkdb
