MySQL Group Replication is a new 'synchronous', multi-master, auto-everything replication plugin for MySQL introduced with MySQL 5.7. It is the perfect tool for small MySQL clusters of 3-20 machines to gain high availability and high performance. It offers high availability because the failure of a replica does not stop the cluster. Failed nodes can rejoin the cluster and new nodes can be added in a fully automatic way - no DBA intervention required. It offers high performance because multiple masters process writes, not just one as with MySQL Replication. Running applications on it is simple: no read-write splitting, no fiddling with eventual consistency and stale data. The cluster offers strong consistency (generalized snapshot isolation).
It is based on Group Communication principles, hence the name.
2. The speaker says...
MySQL 5.7 introduces a new kind of replication: MySQL
Group Replication. At the time of writing (10/2014)
MySQL Group Replication is available as a preview release
on labs.mysql.com. In common user terms it features
(virtually) synchronous, multi-master, auto-everything
replication.
3. Proper wording...
An eager update everywhere system based
on the database state machine approach
atop a group communication system
offering virtual synchrony and
reliable total ordering messaging.
MySQL Group Replication offers
generalized snapshot isolation.
6. The speaker says...
The technical description given for MySQL Group
Replication may sound confusing because it has elements
from the distributed systems and database systems theory.
Between around 1996 and 2006 the two research communities
jointly formulated the replication method implemented by
MySQL Group Replication.
As a web developer or MySQL DBA you are not expected to
know distributed systems theory inside out. Yet to
understand the properties of MySQL Group Replication and
to get the most out of it, we'll have to touch on some of the concepts.
Let's see first how the new stuff compares to the existing.
7. Goals of distributed databases
Availability
• Cluster as a whole unaffected by loss of nodes
Scalability
• Geographic distribution
• Scale size in terms of users and data
• Database specific: read and/or write load
Distribution Transparency
• Access, Location, Migration, Relocation (while in use)
• Replication
• Concurrency, Failure
8. The speaker says...
MySQL Group Replication is about building a distributed
database. To catalog it and compare it with the existing
MySQL solutions in this area, we can ask what the goals of
distributed databases are. The goals lead to some criteria
that are used to give a first, brief overview.
Goal: a distributed database cluster strives for maximum
availability and scalability while maintaining distribution
transparency.
Criteria: availability, scalability, distribution transparency.
9. MySQL clustering cheat sheet
Availability
• MySQL Replication: primary = SPoF, no auto failover
• MySQL Cluster: shared nothing, auto failover
• MySQL Fabric: SPoF monitored, auto failover
Scalability
• MySQL Replication: reads
• MySQL Cluster: partial replication, node limit
• MySQL Fabric: partial replication, no node limit
Scale on WAN
• MySQL Replication: asynchronous
• MySQL Cluster: synchronous (WAN option)
• MySQL Fabric: asynchronous (depends)
Distribution Transparency
• MySQL Replication: R/W splitting
• MySQL Cluster: SQL: yes (low level: no)
• MySQL Fabric: special clients, no distributed queries
10. The speaker says...
Already today MySQL has three solutions to build a
distributed MySQL cluster: MySQL Replication, MySQL
Cluster and MySQL Fabric. Each system has different
optimizations, none can achieve all the goals of a distributed
cluster at once. Some goals are orthogonal.
Take MySQL Cluster. MySQL Cluster is a shared nothing
system. Data storage is redundant, nodes fail independently.
Transparent sharding (partial replication) ensures read and
write scalability until the maximum number of nodes is
reached. Great for clients: any SQL node runs any SQL,
synchronous updates become visible immediately
everywhere. But, it won't scale on slow WAN connections.
11. How Group Replication fits in
(Slide columns: Repl., Cluster, Group Repl., Fabric. Cells are given for Cluster and Group Repl.)
Availability
• Cluster: shared nothing, auto failover
• Group Repl.: shared nothing, auto failover/join
Scalability
• Cluster: partial replication, node limit
• Group Repl.: full replication, read and some write scalability
Scale on WAN
• Cluster: synchronous (WAN option)
• Group Repl.: (virtually) synchronous
Distribution Transparency
• Cluster: SQL: yes (low level: no)
• Group Repl.: all nodes run all SQL
12. The speaker says...
MySQL Group Replication has many of the desirable
properties of MySQL Cluster. It's strong on availability and
client friendly due to the distribution transparency. No
complex client or application logic is required to use the
cluster. So, how do the two differ?
Unlike MySQL Cluster, MySQL Group Replication supports
the InnoDB storage engine. InnoDB is the dominant storage
engine for web applications. This makes MySQL Group
Replication a very attractive choice for small clusters (3-7
nodes) running Drupal, WordPress, … in LAN settings! Also,
Group Replication is not synchronous in the strict technical
sense. For practical purposes it is.
13. Group Replication (vs. Cluster)
Availability
• Nodes fail independently
• Cluster continues operation in case of node failures
Scalability
• Geographic distribution: n/a, needs fast messaging
• All nodes accept writes, mild write scalability
• All nodes accept reads, full read scalability
Distribution Transparency
• Full replication: all nodes have all the data
• Fail stop model: developer freed from worrying about consistency
14. The speaker says...
Another major difference between MySQL Cluster and
MySQL Group Replication is the use of partial replication
versus full replication. MySQL Cluster has transparent
sharding (partial replication) built in. On the inside, on the
level of so-called MySQL Cluster data nodes, not every node
has all the data. Writes don't add work to all nodes of the
cluster but only a subset of them. Partial replication is the
only known solution to write scalability. With MySQL Group
Replication all nodes have all the data. Writes can be
executed concurrently on different nodes but each write
must be coordinated with every other node.
… time to dig deeper >:).
16. A developer's categorization...
Where are transactions run? Primary Copy or Update Everywhere
When does synchronization happen? Eager or Lazy
Eager, Primary Copy
• (MySQL semi-synch Replication)
Eager, Update Everywhere
• MySQL Cluster
• MySQL Group Replication
• 3rd party: Galera
Lazy, Primary Copy
• MySQL Replication/Fabric
• 3rd party: Tungsten
Lazy, Update Everywhere
• MySQL Cluster Replication
17. The speaker says...
I've described MySQL Group Replication as "an eager
update everywhere system". The term comes from a
categorization of different database replication systems by
two questions:
- where can transactions be run?
- when are transactions synchronized between nodes?
The answers to these questions tell a developer which
challenges to expect. The answers determine which
additional tasks an application must handle when it is run on
a cluster instead of a single server.
19. The speaker says...
When you try to scale an application running it on a lazy
(asynchronous) replication cluster instead of a single server
you will soon have users complaining about outdated and
"incorrect" data. Depending on which node the application
connects to after a write, a user may or may not see their own
updates. This can neither happen on a single server system
nor on an eager (synchronous) replication cluster. Lazy
replication causes extra work for the developer.
BTW, have a look at PECL/mysqlnd_ms. It abstracts the
problem of consistency for you. Things like read-your-writes
boil down to a single function call.
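The stale read described above can be reproduced with a toy model. This is a minimal sketch, not MySQL or PECL/mysqlnd_ms code: the class `LazyPrimaryCopyCluster` and its methods are hypothetical names made up for illustration, and replication is triggered by hand to make the delay visible.

```python
class LazyPrimaryCopyCluster:
    """Toy model (hypothetical): one primary, replicas updated lazily."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.backlog = []  # changes committed on the primary, not yet replicated

    def write(self, key, value):
        # Writes always go to the primary and return immediately.
        self.primary[key] = value
        self.backlog.append((key, value))

    def read(self, key, node=None):
        # node=None reads from the primary, otherwise from replica number `node`.
        store = self.primary if node is None else self.replicas[node]
        return store.get(key)

    def replicate(self):
        # Stands in for the asynchronous fetch-and-apply done by the replicas.
        for key, value in self.backlog:
            for replica in self.replicas:
                replica[key] = value
        self.backlog.clear()


cluster = LazyPrimaryCopyCluster(replica_count=2)
cluster.write("profile", "new")
stale = cluster.read("profile", node=0)   # replica has not caught up yet
own = cluster.read("profile")             # the primary has the update
cluster.replicate()
fresh = cluster.read("profile", node=0)   # now the replica has it too
print(stale, own, fresh)
```

An application that wants read-your-writes on such a cluster has to send the read to the primary, or wait until the replica has caught up. That is the extra work lazy replication pushes onto the developer.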
21. The speaker says...
Judging from the developer perspective only, primary copy is
an undesired replication solution. In a primary copy system
only one node accepts writes. The other nodes copy the
updates performed on the primary. Because of the read-write
splitting, the replication system does not need to
coordinate conflicting operations. Great for the replication
system author, bad for the developer. As a developer you
must ensure that all write operations are directed to the
primary node... Again, have a look at PECL/mysqlnd_ms.
MySQL Replication follows this approach. Worse, MySQL
Replication is a lazy primary copy system.
23. The speaker says...
From a developer perspective an eager update anywhere
system, like MySQL Group Replication, is indistinguishable
from a single node. The only extra work it brings you is load
balancing, but that is the case with any cluster. An eager
update anywhere cluster improves distribution transparency
and removes the risk of reading stale data. Transparency
and flexibility are improved because any transaction can be
directed to any replica. (Sometimes synchronization
happens as part of the commit, thus strong consistency can
be achieved.) Fault tolerance is better than with Primary
Copy. There is no single point of failure, such as a single primary,
that can cause a total outage of the cluster. Nodes may fail
individually without bringing the cluster down immediately.
25. The speaker says...
In the mid-1990s two observations made the database and
distributed systems theory communities wonder whether they
could develop a joint replication approach.
First, Gray et al. (database community) showed that the
common two-phase locking has an expected deadlock rate
that grows with the third power of the number of replicas.
Second, Schiper and Raynal noted that transactions have
common properties with group communication principles
(distributed systems) such as ordering, agreement/'all-or-nothing'
and even durability.
26. Three building blocks
State machine replication
• … trivial to understand
Atomic Broadcast
• … database meets distributed systems community
• … OMG, how easy state machine replication is to implement!
Deferred Update Database Replication
• … database meets distributed systems community
• … how we gain high availability and high performance
• … what those MySQL Replication team blogs talk about ;-)
27. The speaker says...
Finally, in 1999 Pedone, Guerraoui and Schiper published
the paper "The Database State Machine Approach". The
paper combines two well known building blocks for
replication with a messaging primitive common in the
distributed systems world: atomic broadcast.
MySQL Group Replication differs slightly from this 1999
version. It follows a later refinement from 2005 more closely
and adds a bit of ease of use. However, by the end of this
chapter you will have learned how MySQL Cluster and MySQL
Group Replication differ beyond InnoDB support and built-in
sharding.
28. State machine replication
[Figure: the input "Set A = 1" goes to three replicas; each replica produces the output A = 1.]
29. The speaker says...
The first building block is trivial: a state machine. A state
machine takes some input and produces some output.
Assume your state machines are deterministic. Then, if you
have a set of replicas all running the same state machine
and they all get the same input, they all will produce the
same output. As an aside: state machine replication is also
known as active replication. Active means that every replica
executes all the operations, which adds compute load to
every replica. With passive replication, also called primary-backup
replication, one replica (primary) executes the
operations and forwards the results to the others. Passive
replication depends on the availability of the primary and
possibly suffers from limited network bandwidth.
31. The speaker says...
Here's more trivia about the state machine replication
approach. There are two requirements for it to work. Quite
obviously, every replica has to receive all input to come to
the same output. And the precondition for receiving input is
that the replica is still alive.
In academic words the requirement is: agreement. Every
non-faulty replica receives every request. Non-faulty replicas
must agree on the input.
32. Requirement: Order
Operations: 1) Set A = 1, 2) Set B = 1, 3) Set B = A * 2
• Replica with input 1, 2, 3: A = 1, B = 2
• Replica with input 1, 3, 2: A = 1, B = 1
• Replica with input 3, 1, 2: A = 1, B = 1
33. The speaker says...
The second trivial requirement for state machine replication
is ordering. To produce the same output any two state
machines must execute the very same input – including the
ordering of input operations. The academic wording goes: if
a replica processes requests r1 before r2, then no replica
processes request r2 before r1. Note that if operations
commute, some reordering may still lead to correct output.
The sequence A = 1, B = 1, B = A * 2 and the sequence B =
1, A = 1, B = A * 2 produce the same output.
(Unrelated here: the database scaling talk touches the fancy
commutative replicated data types Riak offers... hot!)
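The ordering requirement is easy to try out. The following is a minimal sketch that replays the three operations from the slide in every possible order. It assumes that A and B start at 0, which the slide does not state, and the names `OPERATIONS` and `run_replica` are made up for this example.

```python
from itertools import permutations

# The three operations from the slide, as functions from state to new state.
OPERATIONS = {
    1: lambda s: {**s, "A": 1},            # 1) Set A = 1
    2: lambda s: {**s, "B": 1},            # 2) Set B = 1
    3: lambda s: {**s, "B": s["A"] * 2},   # 3) Set B = A * 2
}


def run_replica(order):
    """Apply the operations in the given order on a fresh replica."""
    state = {"A": 0, "B": 0}  # assumed initial state
    for op in order:
        state = OPERATIONS[op](state)
    return state


for order in permutations((1, 2, 3)):
    print(order, run_replica(order))
```

The orders 1, 2, 3 and 2, 1, 3 end in the same state because the first two operations commute. Every order that moves operation 3 forward ends in a different state.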
34. Atomic Broadcast
Distributed systems messaging abstraction
• Meets all replicated state machine requirements
Agreement
• If a site delivers a message m then every site delivers m
Order
• No two sites deliver any two messages in different orders
Termination
• If a site broadcasts message m and does not fail, then every
site eventually delivers m
• We need this in asynchronous environments
35. The speaker says...
State machine replication is the first building block for
understanding the database state machine approach. The
second building block is a messaging abstraction from the
distributed systems world called atomic broadcast. Atomic
broadcast provides all the properties required for state
machine replication: agreement and ordering. It adds a
property needed for communication in an asynchronous
system, such as a system communicating via network
messages: termination.
All in all, this greatly simplifies state machine replication and
contributes to a simple, layered design.
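To make the order property tangible, here is a minimal sketch of one of the simplest ways to get total order: a fixed sequencer. It is an illustration only. It ignores failures, agreement and termination, it is not how Corosync orders messages, and the classes `Sequencer` and `Site` are hypothetical.

```python
class Sequencer:
    """Stamps every message with the next sequence number."""

    def __init__(self):
        self.counter = 0

    def order(self, message):
        stamped = (self.counter, message)
        self.counter += 1
        return stamped


class Site:
    """Delivers messages strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}      # received but not yet deliverable
        self.delivered = []

    def receive(self, seq, message):
        self.pending[seq] = message
        # Deliver as long as the next expected number is available.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


sequencer = Sequencer()
messages = ["set A = 1", "set B = 1", "set B = A * 2"]
stamped = [sequencer.order(m) for m in messages]

site_a, site_b = Site(), Site()
for seq, message in stamped:             # network keeps the order
    site_a.receive(seq, message)
for seq, message in reversed(stamped):   # network reverses the order
    site_b.receive(seq, message)

print(site_a.delivered == site_b.delivered)
```

Both sites deliver in the same order although the messages arrive in different orders. This is the difference between receiving a message and delivering it.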
36. Delivery, durability, group
[Figure: a client, "Mr. X" and replicas, some of them forming a group. Caption: send first, possibly delivered second.]
37. The speaker says...
The atomic broadcast properties given are literally copied
from the original paper describing the database state
machine replication approach. There are two things in it not
explained yet. First, atomic broadcast defines properties in
terms of message delivery. The delivery property not only
ensures total ordering despite slow transport but also covers
message loss (MySQL desires uniform agreement here,
which is more than Corosync offers) and even the crash and
recovery of processors (durability)! A recovering processor
must first deliver outstanding messages before it continues.
Second, note that atomic broadcast introduces the notion of
a group. Only (correct) members of a group can exchange
messages.
39. The speaker says...
We are almost there. The third building block to the
database state machine replication is deferred update
database replication. The slide shows a generic functional
model used by Pedone and Schiper in 2010 to illustrate their
choice of deferred update. The argument goes that deferred
update combines the best of the two most prominent object
replication techniques: active and passive replication. Only
the combination of the best from the two will give both high
availability and high performance.
Translation: MySQL Group Replication can, in theory,
have higher overall throughput than MySQL Replication. Do
you love the theory ;-) ? As a DBA you should.
40. Active Replication (SM)
[Figure: two clients and three replicas.]
• Client sends op to all
• Requests get ordered
• Execution
• All reply to client
41. The speaker says...
In an active replication system, a pure state machine
replication system, the client operations are forwarded to all
replicas and each replica individually executes the operation.
The two challenges are to ensure that all replicas execute
requests in the same order and that all replicas reach the same
decision. Recall that we are talking about multi-threaded
database servers here.
A downside is that every replica has to execute the
operation. If the operation is expensive in terms of CPU, this
can be a waste of CPU time.
42. Passive Replication
[Figure: two clients, one primary and two backups.]
• Client sends op to primary
• Only primary executes
• Primary forwards changes
• Primary replies to client
43. The speaker says...
The alternative is passive replication or primary-backup
replication. Here, the client talks to only one server, the
primary. Only the primary server executes client operations.
After computation of the result, the primary forwards the
changes to the backups, which apply them.
The problem here is that the primary determines the
system's throughput. None of the backups can contribute its
computing power to the overall system throughput.
44. Multi-primary (pass.) replication
What we want...
• … for performance: more than one primary
• … for scalability: no distributed locking
• … and of course: transactions
• Two-staged transaction protocol
[Figure: a client and three primaries; stage one is transaction processing, stage two is transaction termination.]
45. The speaker says...
Multi-primary (passive) replication has all the ingredients
desired.
Transaction processing is two staged. First, a client picks
any replica to execute a transaction. This replica becomes
the primary of the transaction. The transaction executes
locally, the stage is called transaction processing. In the
second stage, during transaction termination, the primaries
jointly decide whether the transaction can commit or must
abort.
Because updates are not immediately applied, database
folks call this deferred update – our last building block.
46. Deferred Update DB Replication
Deterministic certification
• Reads execute locally, Updates get certified
• Certification ensures transaction serializability
• Replicas decide independently about certification result
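[Figure: a read is handled by one primary; a write is executed on one primary, which sends Rs/Ws/U (readset, writeset, updates) to the other primaries.]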
47. The speaker says...
One property of transactions is isolation. Isolation is also
known as serializability: the concurrent execution of
transactions should be equivalent to a serial execution of the
same transactions. In a deferred update system, read
transactions are processed and terminated on one replica
and serialized locally.
Updates must be certified. After the transaction processing
the readset, writeset and updates are sent to all other
replicas. The servers then decide in a deterministic
procedure whether (one-copy) serializability holds, that is,
whether the transaction commits. Because it's a deterministic
procedure, the servers can certify transactions independently!
48. Options for termination
Atomic Broadcast based
• … this is what is used, by MySQL, by DBSM
Optimization: Reordering (on top of Atomic Broadcast)
• … in theory it means fewer transaction aborts
Optimization limit: Generic Broadcast based
• … this has issues, which make it nasty
Atomic Commit based
• … more transaction aborts than atomic broadcast
49. The speaker says...
There are several ways of implementing the termination
protocol and the certification. There are two truly distinct
choices: atomic broadcast and atomic commit. Atomic
commit causes more transaction aborts than atomic
broadcast. So, it's out and atomic broadcast remains.
Atomic broadcast can – in theory – be further optimized
towards fewer transaction aborts using reordering. For
practical purposes, this is about where the optimizations
end. A weaker (and possibly faster) generic broadcast
causes problems in the transactional model. For databases,
it could be an over-optimization.
50. Generic certification test
Transactions have a state
• Executing, Committing, Committed, Aborted
Reads are handled locally
Updates are sent to all replicas
• Readset and writeset are forwarded
On each replica: search for 'conflicting' transactions
• Can be serialized with all previous transactions? Commit!
• Commit? Abort local transactions that overlap with the update
51. The speaker says...
No matter what termination procedure is used, the basic
procedure for certification in the deferred update model is
always the same. Updates/writes need certification. The
data read and the data written by a transaction are forwarded
to all other replicas.
Every replica searches for potentially 'conflicting'
transactions, the details depend on the termination
procedure. A transaction is decided to commit if it does not
violate serializability with all previous transactions. Any local
transaction currently running and conflicting with the update
is aborted.
52. Database State Machine
Deferred Update Database Replication as a state
machine
• Atomic Broadcast based termination
[Figure: layered architecture. MySQL with plugin services and transaction hooks; as a plugin, MySQL Group Replication with capture, apply and recover; below that the replication protocol incl. termination protocol/certifier; at the bottom the group communication system.]
53. The speaker says...
The Database State Machine Approach combines all the bits
and pieces. Let's do a bottom-up summary. Atomic
broadcast not only frees the database developer from
bothering with networking APIs, it also solves the nasty bits of
communicating in an asynchronous network. It provides
properties that meet the requirements of the state machine
replication. A deterministic state machine is what one needs
to implement the termination protocol within deferred update
replication. Deferred update replication does not use
distributed locking which Gray proved problematic and it
combines the best of active and passive replication. Side
effects: simple replication protocol, layered code.
54. The termination algorithm
Updates are sent to all replicas
• Readset and writeset are forwarded
Step 1 - On each replica: certify
• Is there any committed transaction that conflicts?
(In the original paper: check for write-read conflicts between
the committing transaction and committed transactions. Does
the committing transaction's readset overlap with any committed
transaction's writeset? Works slightly differently in MySQL.)
Step 2 – On each replica: commitment
• Apply transactions decided to commit
• Handle concurrent local transactions: remote wins
55. The speaker says...
The termination process has two logical steps, just like the
general one presented earlier. The very details of how
exactly two transactions are checked for conflicts in the first
step don't matter here. MySQL Group Replication is using a
refinement of the algorithm tailored to its own needs. As a
developer all you need to know is: a remote transaction
always wins no matter how expensive local transactions are.
And, keep conflicting writes on one replica. It's faster.
The puzzling bit on the slide is the rule to check a
committing transaction against any committed transaction for
conflicts. Any !? Not any... only concurrent.
56. What's concurrent?
Any other transaction that precedes the current one
• Recall: total ordering
• Recall: asynchronous, delay between broadcast and delivery
[Figure: five replicas on a timeline with "Broadcast" and "Delivery" markers; transactions 1 and 2 are broadcast and then delivered in total order.]
57. The speaker says...
The definition of what concurrent means is a bit tricky. It's
defined through a negation and that's confusing at first
glance but becomes, hopefully, clear on the next slide.
Concurrent to a transaction is any other transaction that
does precede it. If we know the order of all transactions (in
the entire cluster), then we can tell which transactions precede
one another.
Atomic broadcast ensures total order on delivery. Some
implementations decide on ordering when sending and that
number (logical clock) could be used. Any logical clock
works.
58. Certify against all previous?
[Figure: five replicas on a timeline; transactions 2, 3 and 4 in total order.]
• Broadcast: transaction 4 is based on all previous up to 2
• Certification when 4 is delivered: check conflicts with trx > 2 and trx < 4
59. The speaker says...
The slide has an example of how to find any other transaction
that precedes one. When a transaction enters the
committing state and is broadcast, the broadcast includes
the logical time (= total order number on the slide) of the
latest transaction committed on the replica.
Eventually the transaction is delivered on all sites. Upon
delivery the certification considers all transactions that
happened after the logical time of the transaction to be
certified. All those transactions precede the one to be
certified, they executed concurrently at different replicas. We
don't have to look further in the past. Further in the past is
stuff that's been decided on already.
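The certification window can be sketched in a few lines. This is a simplified model of the readset/writeset test described for the original paper, not the algorithm MySQL Group Replication uses. The names `Transaction`, `Certifier`, `based_on` and `run` are hypothetical, and every delivered transaction is assumed to get the next total order number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    name: str
    based_on: int         # total order number of the latest transaction
                          # committed on the origin replica at broadcast time
    readset: frozenset
    writeset: frozenset


class Certifier:
    """Deterministic certification, run independently on every replica."""

    def __init__(self):
        self.position = 0     # total order number of the last delivery
        self.committed = []   # (total order number, writeset) of commits

    def deliver(self, trx):
        self.position += 1
        # Only transactions committed after trx.based_on are concurrent.
        for order, writeset in self.committed:
            if order > trx.based_on and trx.readset & writeset:
                return False  # read something a concurrent commit overwrote
        self.committed.append((self.position, trx.writeset))
        return True


def run(deliveries):
    certifier = Certifier()
    return [certifier.deliver(trx) for trx in deliveries]


t1 = Transaction("t1", 0, frozenset({"x"}), frozenset({"x"}))
t2 = Transaction("t2", 0, frozenset({"x"}), frozenset({"x"}))
t3 = Transaction("t3", 1, frozenset({"x"}), frozenset({"y"}))

# Atomic broadcast delivers the same sequence everywhere, so two
# replicas reach the same decisions without talking to each other.
replica_a = run([t1, t2, t3])
replica_b = run([t1, t2, t3])
print(replica_a, replica_b)
```

In this run t2 aborts because it is concurrent with t1 and reads what t1 wrote. t3 commits because it already saw t1 when it was broadcast, so t1 lies outside its window.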
61. The speaker says...
Good news! The algorithm used by MySQL Group
Replication is different and simpler. For correctness, the
precedes relation is still relevant. But it comes for free...
62. A developer's view on commit
[Figure: five replicas on a timeline. Client: BEGIN, execute, COMMIT, result. Replicas: transaction based on t(3) is certified as number 4 and applied.]
63. The speaker says...
We are not done with the theory yet but let's do some slides
that take the developer's perspective. Assuming you have to
scale a PHP application, assuming a small cluster of a
handful of MySQL servers is enough and assuming these
servers are co-located on racks, then MySQL Group
Replication is your best possible choice.
Did you get this from the theory? Replication is
'synchronous'. On commit you wait only for the server you
are connected to. Once your transaction is broadcast, you
are done. You don't wait for the other servers to execute the
transaction. With uniform atomic broadcast, once your
transaction is broadcast, it cannot get lost. (That's why I
torture you with theory.)
64. MySQL Replication
[Figure: timeline with a master and slaves. Client: BEGIN, execute, COMMIT, OK. Master: bin log etc. Slaves: fetch, apply.]
65. The speaker says...
If your network is slow, or if Mother Earth, the speed of light
and network message round-trip time add too much to
your transaction execution time, then asynchronous MySQL
Replication is a better choice.
In MySQL Replication the master (primary) never waits for
the network. Not even to broadcast updates. Slaves
asynchronously pull changes. Besides pushing work onto the
developer, this approach has the downside that a hardware
crash on the master can cause transaction loss. Slaves may
or may not have pulled the latest data.
66. MySQL Semi-sync Replication
[Figure: timeline with a master and two slaves. Client: BEGIN, execute, COMMIT, OK. Master: bin log, wait for first ACK. Slaves: fetch, apply.]
67. The speaker says...
In the times of MySQL 5.0 the MySQL Community
suggested that to avoid transaction loss the master should
wait for one slave to acknowledge it has fetched the update
from the master. The fact that it's fetched does not mean
that it's been applied. The update may not be visible to
clients yet.
There is a back and forth about whether database replication should be
asynchronous or not. It depends on your needs.
Back to theory after this break.
69. Virtual Synchrony
Groups and views
• A turbo-charged version of Atomic Broadcast
[Figure: processes P1 to P4. Messages M1 and M2 are exchanged in G1 = {P1, P2, P3}; after a view change (VC), M3 and M4 are exchanged in G2 = {P1, P2, P3, P4}.]
70. The speaker says...
Good news! Virtual Synchrony and Atomic Broadcast are nearly
the same. Our Atomic Broadcast definition assumes a static
group. Adding group members, removing members or
detecting failed ones is not covered.
Virtual Synchrony handles all these membership changes.
Whenever an existing group agrees on changes, a new view
is installed through a view change (VC) event.
(The term 'virtual': it's not synchronous. There is a delay
between sending and delivery, and we don't want to wait for
short message delays. Yet, the system appears to be
synchronous to most real-life observers.)
71. Virtual Synchrony
View changes act as a message barrier
• That's a case causing trouble in Two-Phase Commit
[Figure: processes P1 to P4. Message M5 is sent in G2 = {P1, P2, P3, P4}; after a view change (VC) to G3 = {P1, P2, P3}, messages M6, M7 and M8 follow.]
72. The speaker says...
View changes are message barriers. If the group members
suspect a member to have failed they install a new view.
Maybe the former member was not dead but just too slow to
respond, or disconnected for a brief period. False alarm. The
former member then tries to broadcast some updates.
Virtual Synchrony ensures that the updates will not be seen
by the remaining members. Furthermore the former member
will realize that it was excluded.
Some GCS implementing virtual synchrony even provide
abstractions that ensure a joining member learns all updates
it missed (state transfer) before it rejoins.
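The barrier effect of a view change can be sketched as follows. This is a toy model, not a GCS API: the class `Member`, the numeric view ids and the message tuples are assumptions made for illustration, and the flushing of in-flight messages that a real implementation performs before installing a view is left out.

```python
class Member:
    """Group member that only delivers messages sent in its current view."""

    def __init__(self, name, view_id, members):
        self.name = name
        self.view_id = view_id
        self.members = set(members)
        self.delivered = []

    def install_view(self, view_id, members):
        # View change (VC): from now on only the new view counts.
        self.view_id = view_id
        self.members = set(members)

    def receive(self, sender, view_id, payload):
        # Messages from an older view or from a non-member are dropped.
        if view_id != self.view_id or sender not in self.members:
            return False
        self.delivered.append(payload)
        return True


# G2 = {P1, P2, P3, P4}, seen from P1.
p1 = Member("P1", view_id=2, members={"P1", "P2", "P3", "P4"})
m5_ok = p1.receive("P4", 2, "M5")      # P4 is still a member

# P4 is suspected to have failed: view change to G3 = {P1, P2, P3}.
p1.install_view(3, {"P1", "P2", "P3"})

# False alarm: P4 was only slow and still broadcasts in its old view.
m6_ok = p1.receive("P4", 2, "M6")
print(m5_ok, m6_ok, p1.delivered)
```

The late update from P4 never becomes visible to the remaining members. P4 has to notice that it was excluded and rejoin.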
73. Auto-everything: failover
MySQL Group Replication has a pluggable GCS API
• Split brain handling? Depends on GCS and/or GCS config
• Default GCS is Corosync
[Figure: a group of MySQL servers.]
74. The speaker says...
Good news! The Virtual Synchrony group membership
advantages are fully exposed to the user level: node failures
are detected and handled automatically. PECL/mysqlnd_ms
can help you with the client side. It's a minor tweak to have it
automatically learn about the remaining MySQL servers. Expect
an update release soon.
MySQL Group Replication works with any Group
Communication system that can be accessed from C and
implements Virtual Synchrony. The default choice is
Corosync. Split brain handling is GCS dependent. MySQL
follows view change notifications of the GCS.
75. Auto-everything: joining
Elastic cluster grows and shrinks on demand
• State transfer done via asynch replication channel
[Figure: a group of MySQL servers; a donor performs a state transfer to the joiner.]
76. The speaker says...
Good news! When adding a server you don't fiddle with the
very details. You start the server, tell it to join the cluster and
wait for it to catch up. The server picks a donor, begins
fetching updates using much of the existing MySQL
Replication code infrastructure and that's it.
78. Deferred Update tweak
Transaction read set does not need to be broadcast
• Readset is hard to extract and can be huge
• Weaker serializability level than 1SR
• Sufficient for InnoDB default isolation
[Figure: read primary, write primary and two further primaries; the write primary sends V/Ws/U to the others.]
79. The speaker says...
Good news! This is the last bit of theory. The original Database
State Machine proposal was followed by a simpler to
implement proposal in 2005. If the cluster's serialization level
is marginally lowered to snapshot isolation, certification becomes
easier. Generalized snapshot isolation can be achieved
without having to broadcast the readset of transactions.
Recording the readset of a transaction is difficult in most
existing databases. Also, readsets can be huge.
Snapshot isolation is an isolation level for multi-version
concurrency control. MVCC? InnoDB! Somehow... Whatever,
this is the base algorithm for MySQL Group Replication
termination.
80. Snapshot Isolation
Concurrent and write conflict? First committer wins!
• Reads use snapshot from the beginning of the transaction
[Figure: T1 and T2 run concurrently on version 1 and both change x (conflict).]
• T1 (first committer): BEGIN(v1), W(v1, x=1), COMMIT!, x:v2=1
• T2: BEGIN(v1), W(v1, x=2), …, …, COMMIT?
81. The speaker says...
In snapshot isolation, transactions take a snapshot when
they begin. All reads return data from this snapshot.
Although any other concurrent transaction may update the
underlying data while the transaction still runs, the change is
invisible, the transaction runs in isolation. If two concurrent
transactions change the same data item they conflict. In
case of conflicts, the first committer wins.
MVCC requires that, as part of the update of a data item, its
version is incremented. Future transactions will base their
snapshot on the new version.
82. The actual termination protocol
[Figure: Write(v2, x=1) is delivered to five replicas. Each replica certifies it against its certification index (object: latest version), here x: 1 and y: 13. Result: OK.]
83. The speaker says...
Every replica checks the version of a write during
certification. It compares the version number of the written
data item with the latest it knows of. If the version is higher than or
equal to the one found in the replica's certification index,
the write is accepted. A lower number indicates that
someone has already updated the data item before.
Because the first committer must win, a write showing a lower
version number than the one in the certification index must abort.
(The certification index fills over time and is truncated
periodically by MySQL. MySQL reports the size through
Performance Schema tables.)
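The version check can be sketched as a toy certification index. This follows the slides, not the MySQL source code. It assumes that a write carries the version of the data item it was based on and that a commit raises the item's version by one, and the class `VersionCertifier` is a hypothetical name.

```python
class VersionCertifier:
    """Toy certification index: data item -> latest certified version."""

    def __init__(self, index=None):
        self.index = dict(index or {})

    def certify(self, writes):
        """writes maps each data item to the version the write is based on."""
        # First committer wins: any outdated item aborts the transaction.
        for item, version in writes.items():
            if version < self.index.get(item, 0):
                return False
        # Accepted: the committed write produces a new version per item.
        for item, version in writes.items():
            self.index[item] = version + 1
        return True


certifier = VersionCertifier({"x": 1, "y": 13})

t1 = certifier.certify({"x": 1})   # T1: W(v1, x=1), first committer
t2 = certifier.certify({"x": 1})   # T2: W(v1, x=2), based on the old version
t3 = certifier.certify({"y": 13})  # unrelated item, no conflict
print(t1, t2, t3, certifier.index)
```

Because every replica sees the same writes in the same order, every replica ends up with the same index and the same commit decisions.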
85. It's a preview – there are limits
General
• InnoDB only
• Corosync lacks uniform agreement
• No rules to prevent split-brain (it's a preview, you're allowed to
fool yourself if you misconfigure the GCS!)
Isolation level
• Primary Key based
• Foreign Keys and Unique Keys not supported yet
No concurrent DDL
88. Network messages – pffft!
MySQL super hero at Facebook
@markcallaghan Sep 30
For MySQL sync replication, when all commits originate from 1 master is
there 1 network round trip or 2? http://mysqlhighavailability.com/mysql-group-replication-hello-world …
@Ulf_Wendel
@markcallaghan AFAIK, on the logical level, there should be one. Some
of your questions might depend on the GCS used. The GCS is
pluggable
@markcallaghan
@Ulf_Wendel @h_ingo Henrik tells me it is "certification based" so I
remain confused
89. GCS != MySQL Semi-sync
It's many round trips, how many depends on GCS
• Default GCS is Corosync, Corosync is Totem Ring
• Corosync uses a privilege-based approach for total ordering
• Many options: fixed sequencer, moving sequencer, ...
• Where you run your updates only impacts collision rate
[Figure: three MySQL servers, each paired with a Corosync instance.]
90. The speaker says...
No Mark, MySQL Group Replication cannot be understood
as a replacement for MySQL Semi-sync Replication. The
question about network round trips is hard to answer. Atomic
Broadcast and Virtual Synchrony stack many subprotocols
together. Let's consider a stable group, no network failure,
Totem. Totem orders messages using a token that circulates
along a virtual ring of all members. Whoever has the token,
has the privilege to broadcast. Others wait for the token to
appear. Atomic Broadcast gives us all or nothing messaging.
It takes at least another full round on the ring to be sure the
broadcast has been received by all. How many round trips
is that? Welcome to distributed systems...
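The token idea can be sketched in a few lines. This is a heavily simplified model of a privilege-based ordering scheme, not the Totem protocol as implemented by Corosync: token loss, retransmission, flow control and membership are ignored, and the function `totem_round` is a hypothetical name.

```python
from collections import deque


def totem_round(ring, outboxes, start_seq=0):
    """Pass the token once around the ring.

    Only the token holder may broadcast. The token carries the next
    sequence number, so stamping messages gives a total order.
    Returns the ordered messages and the sequence number left on the token.
    """
    ordered = []
    seq = start_seq
    for member in ring:                        # token visits members in ring order
        queue = outboxes.get(member, deque())
        while queue:                           # holder broadcasts what it has queued
            ordered.append((seq, member, queue.popleft()))
            seq += 1
    return ordered, seq


ring = ["A", "B", "C"]
outboxes = {
    "C": deque(["w1"]),          # queued first, but C gets the token last
    "A": deque(["w2", "w3"]),
}
ordered, next_seq = totem_round(ring, outboxes)
print(ordered, next_seq)
```

A write queued on C has to wait for the token to pass A and B first, and the sender needs further token rotation before it knows that everybody has the message. Counting round trips therefore depends on the ring size and on where the token happens to be.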