73. Percona XtraDB Cluster: State Transfer
Full data (SST): new node; node disconnected for a long time
Incremental (IST): node disconnected for a short time
74. Percona XtraDB Cluster: Snapshot State Transfer
Mysqldump: small databases
Rsync: donor disconnected for copy time; faster
XtraBackup: donor disconnected for short time; slower
84. Percona XtraDB Cluster
It looks so easy. Why did you not implement it earlier?
It is not easy: computer science of group communication and distributed transactions.
Credits to Codership Oy
90. Percona XtraDB Cluster: XtraDB Cluster vs. MySQL Cluster
Compared on: easy to migrate; easy to use; cloud / EC2; changes in an application; write scaling; 99.999%
Today I want to focus on High Availability questions. I personally define the current MySQL era as the era of High Availability. While there are many materials on how to set up and tune a single server, HA for MySQL is still at an early but rising stage. You may see increasing interest in third-party software and scripts such as Continuent, MHA, MMM, Flipper, Percona Replication Manager, etc.
So what is High Availability? Availability refers to the ability of the user community to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is said to be unavailable. (Wikipedia) The word “High” refers to some pre-arranged level of performance over a long period of time.
A usual approach to providing availability is to have redundant systems. This is especially useful for services like a web server or an application server.
Basically, we duplicate resources.
And the failover procedure is very simple. When one system is down, we redirect user requests to the second system.
This applies to more than two systems: by having multiple servers for redundancy we can manage the probability of failure. If a single system is down with probability P and failures are independent, then two servers are both down with probability P^2, and X servers with probability P^X.
Or graphically, we have a rapidly decreasing function: the more servers you have, the lower the probability of failure.
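To put rough numbers on the redundancy argument, here is a minimal sketch. It assumes that servers fail independently, so the chance that all of them are down at the same moment is the product of the individual probabilities. The function name and the 1% figure are illustrative assumptions, not numbers from the talk.

```python
def probability_all_down(p_single: float, servers: int) -> float:
    """Probability that every one of `servers` redundant systems is down,
    assuming each is down independently with probability `p_single`."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if servers < 1:
        raise ValueError("servers must be at least 1")
    return p_single ** servers


if __name__ == "__main__":
    # Illustrative value only: each server is down 1% of the time.
    for n in (1, 2, 3):
        print(n, "server(s):", probability_all_down(0.01, n))
```

Real failures are often correlated (same rack, same power feed, same bug), so this is an optimistic bound rather than a prediction.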
Was that easy? Not if we deal with databases.
Databases are more complicated. There are two parts: a system (server) and data.
And besides system redundancy we need to provide data redundancy.
To provide availability for a database, we need to provide availability for two components: service + data.
This is where replication comes into play. Replication is a process of sharing events between resources. When one of the systems updates data, it informs the other systems.
And MySQL Replication is of course the best known in the MySQL world.
When we speak about Availability based on MySQL replication, I have this picture in my mind. This system works, but it also has many life-support elements, and if you do something wrong, your system fails. Well, yes, I exaggerate a little, but I am doing it to draw your attention. There are a lot of MySQL replication setups that serve HA purposes, and they are doing a great job. What is wrong with it, I will explain in just a couple of minutes. MySQL Replication is a great tool, no question. It is simple and easy to understand. All this made MySQL and MySQL Replication very popular, and MySQL Replication is one of the reasons why MySQL is the most popular open source database. The lack of good replication was the biggest issue for PostgreSQL users.
So what’s wrong with MySQL Replication? It is as simple as “a”:
“a” as in asynchronous.
MySQL replication is asynchronous.
Asynchronous means that there is no confirmation or guarantee of when an event will be applied on the replicated system. By its nature it assumes that there is a delay between a slave and a master, and the delay can be microseconds or hours.
In simple terms: asynchronous replication does not guarantee that your data is the same on all systems at any given point in time.
Synchronous communication is different. It assumes a confirmation process: the first system waits for confirmation from the others.
Wait, is that not how DRBD works?
Yes, DRBD. However, with DRBD we have a different problem. While the data is always available, the system is not. DRBD only works in active-passive mode. Failover time depends on how fast you can wake up your system, and it cannot be zero.
At this step we come to clustering solutions.
And namely our solution: Percona XtraDB Cluster. As usual, our software is free and open source. You are free to install, use, uninstall, copy and distribute it, provide commercial support, whatever you want.
In XtraDB Cluster the nodes are connected through group communication and exchange events synchronously.
Well, in fact the process is not fully synchronous, but “virtually synchronous”. I will not go into details now, as it is fairly complicated; a good read is at the following link: http://en.wikipedia.org/wiki/Virtual_synchrony
I tried to simplify the process, and this is the simplest one I could come up with. Important points: the network interaction happens when you issue the COMMIT statement. At this time the node performs network communication and certification of the transaction. For intercontinental communication between nodes the network round trip can be significant (e.g. 0.18 sec for nodes in the Amazon California and Amazon Europe zones). It is also important to note that a slave can still have a short period of time (i.e. less than 1 sec) when it is out of sync with the master. The difference comes from the fact that applying an event on the slave may take longer than the COMMIT on the master.
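To see what a 0.18 sec round trip means in practice, here is a back-of-the-envelope sketch. It assumes each connection commits one transaction at a time, that the round trip is the only cost of a COMMIT, and that connections do not slow each other down. The function name is hypothetical and the model is deliberately crude.

```python
def max_commits_per_second(round_trip_s: float, connections: int = 1) -> float:
    """Upper bound on commit rate when every COMMIT waits for one network
    round trip and each connection commits sequentially."""
    if round_trip_s <= 0:
        raise ValueError("round_trip_s must be positive")
    if connections < 1:
        raise ValueError("connections must be at least 1")
    return connections / round_trip_s


if __name__ == "__main__":
    # 0.18 s is the California-to-Europe example from the talk.
    print(round(max_commits_per_second(0.18), 2), "commits/s on one connection")
    print(round(max_commits_per_second(0.18, connections=10), 2), "commits/s on ten")
```

Under these assumptions a single connection is limited to a handful of commits per second, while many parallel connections can still reach a useful total rate.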
So there is a list of benefits that XtraDB Cluster provides. First is synchronous replication, the importance of which I have already covered.
Second, multi-master replication.
With regular MySQL replication, writes to several servers are possible, but you are asking for big trouble. This comes from the fact I already mentioned: because replication is asynchronous, the second server may be behind, and by updating it we may update stale data.
With XtraDB Cluster it is different: you can update any server in the cluster.
The third benefit is parallel replication.
It is a well-known limitation of MySQL that the slave is single-threaded, and this is an additional reason why a slave may fall behind a master. On modern servers with 32 cores, the master can perform many more updates per second than a slave with a single thread can handle.
The fourth benefit is data consistency.
In XtraDB Cluster we guarantee that data is equal on all nodes. A transaction is either committed on all nodes or not committed at all.
And fifth, automatic node provisioning.
When a new node joins the cluster, it automatically copies data from an existing node.
Finishing with the benefits, I want to give some attention to the CAP theorem. You have probably heard about it, especially how it is applied to different NoSQL distributed systems. I will try to use the CAP theorem to explain the difference between MySQL Replication and XtraDB Cluster.
I will simplify the theorem to the following statement: in a distributed system you can have only two of the following three: data consistency, node availability, partition tolerance.
I understand that this still sounds somewhat fuzzy, so let me show the following example. Imagine that in a system with three nodes we have a network failure to one of the nodes.
In this case MySQL Replication will give you access to all nodes. Even though the node is disconnected from the other nodes, you are still able to connect to MySQL locally or through an available connection and extract and even change data. However, as you understand, the node will not receive updates from the cluster. Data consistency is compromised.
With XtraDB Cluster, we guarantee data consistency, but as a downside we can’t allow access to a node that is disconnected from the cluster.
As a consequence, the minimal recommended configuration is 3 nodes. Why? Let’s see what we have with 2 nodes:
In a 2-node configuration, in case of a link failure, we have a situation with a special name: “split brain”. We can’t decide which node should accept queries and which should not. By default, if that happens, both nodes will refuse queries.
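The 3-node recommendation can be illustrated with a simple majority rule. This is a minimal sketch, assuming a node keeps serving queries only while it can see a strict majority of the cluster (counting itself). The function is hypothetical and is not the actual quorum calculation inside the Galera library.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if a group of `reachable_nodes` (including the node itself)
    forms a strict majority of a cluster of `cluster_size` nodes."""
    if cluster_size < 1 or not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("invalid node counts")
    return 2 * reachable_nodes > cluster_size


if __name__ == "__main__":
    # Three nodes, one cut off: the pair keeps working, the single node stops.
    print("3 nodes, group of 2:", has_quorum(2, 3))
    print("3 nodes, group of 1:", has_quorum(1, 3))
    # Two nodes, link failure: neither side has a majority (split brain).
    print("2 nodes, group of 1:", has_quorum(1, 2))
```

With two nodes, a link failure leaves each side with exactly half of the cluster, so under this rule neither side may continue, which matches the default behaviour described above.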
There is, however, a special option that allows you to run this setup, but you take the responsibility on yourself. You need to make sure that you have one and only one master, that is, a server that executes update queries.
Once again, back to our theorem. It shows a principal difference between MySQL Replication and XtraDB Cluster. MySQL Replication gives you access to all systems. XtraDB Cluster prioritizes data consistency.
This applies to all external software and scripts that are based on MySQL replication. As long as MySQL replication is asynchronous, it does not guarantee data consistency.
At this stage I want to give a few more details about XtraDB Cluster.
Percona XtraDB Cluster is good old Percona Server + special replication patches + the Galera library. The patches and the Galera library are developed by the Finnish company Codership Oy.
The fact that XtraDB Cluster is based on Percona Server, which is compatible with MySQL, means that XtraDB Cluster is compatible with MySQL setups. You use the same InnoDB storage engine, and the database server behaves the same way: queries have the same execution plans, and you use the same configuration and the same optimization techniques.
It also takes minimal effort to migrate from an existing working system to one running XtraDB Cluster. It is not much harder than upgrading from MySQL to Percona Server. If you have done that, you know it is quite easy: you just replace the old binaries with new ones. For XtraDB Cluster you will also need to make a couple of changes to the configuration file.
Also importantly, there is no lock-in, which for me is quite an important factor. If you do not like this solution for some reason, there is always an easy way to return to the previous setup.
This all sounds so good, so is this a perfect solution? I want to answer “Yes”, but you won’t believe me. Of course there are limitations.
It is a new product and a new solution, and there are limitations. Some of them will be resolved later, and we are already working on them.
The first limitation is that only InnoDB tables are supported. Changes to MyISAM tables are not replicated, so you need to make sure you have only InnoDB tables when you test XtraDB Cluster.
The second limitation, or incompatibility for applications, is that XtraDB Cluster introduces optimistic locking. This locking does not apply in all cases, only to transactions that run on different servers. Let me explain what this means for you.
First, traditional InnoDB locking, for transactions on the same server. When two transactions try to update the same row, the second transaction waits until the first one COMMITs or ROLLBACKs.
In XtraDB Cluster, when running transactions on different servers (multi-master), we can get an error on the COMMIT statement. Once again, this applies to different servers; if we run on the same server, we have traditional InnoDB locking. On different servers, however, we get so-called “optimistic locking”. Two transactions run without locks, as they assume there are no conflicts, and only later, at the COMMIT stage, do the transactions communicate with each other. Again, we have this model because all communication happens at the COMMIT stage. So if at this stage transaction 2 finds out that it updated a row that was also updated by another transaction, then transaction 2 performs a ROLLBACK and returns an ERROR to the client. This is not something that usually happens in traditional applications based on MySQL, and your application may not be ready for it. It is fair to say that many applications and frameworks do not handle errors on the COMMIT query. This may require changes in the application logic if you expect to run transactions on different nodes. But this could also be the ONLY significant change you need.
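One common way for an application to cope with an error on COMMIT is to re-run the whole transaction. The sketch below assumes the transaction is wrapped in a callable that is safe to run again from the start. `CommitConflict` is a hypothetical stand-in for whatever exception your database driver raises in this situation, and `run_with_retry` is an illustrative helper, not part of any library.

```python
import time


class CommitConflict(Exception):
    """Hypothetical stand-in for the driver error raised when COMMIT
    fails because another node changed the same rows."""


def run_with_retry(transaction, max_attempts: int = 3, backoff_s: float = 0.0):
    """Run `transaction()` and re-run it from scratch if COMMIT fails.

    `transaction` must redo all of its reads and writes on every call,
    because the failed attempt was rolled back by the server.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except CommitConflict:
            if attempt == max_attempts:
                raise  # give up and let the caller decide
            time.sleep(backoff_s * attempt)  # simple linear backoff


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_transaction():
        # Simulated transaction: the first two COMMITs hit a conflict.
        calls["count"] += 1
        if calls["count"] < 3:
            raise CommitConflict("row updated on another node")
        return "committed"

    print(run_with_retry(flaky_transaction), "after", calls["count"], "attempts")
```

Retrying is only safe when the transaction has no side effects outside the database, such as sending an e-mail before COMMIT.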
OK, next limitation. Write performance is limited by the weakest node you have. This is the price we pay for data consistency. If one of the nodes suddenly becomes slow (for various reasons, e.g. a disk failure in a RAID), write queries are equally slow across the whole cluster. Let me show why.
When a user runs an update on one of the servers, this write event is communicated to all nodes. The user gets the confirmation after the server gets the confirmation from every node. If one of the nodes is slow, the whole cluster is slow.
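The effect of one slow node can be shown with a toy model. It assumes the user-visible write latency is simply the time until the last node has confirmed, and it ignores everything else. The function name and the millisecond values are made up for illustration.

```python
def write_latency_ms(node_confirm_times_ms):
    """Latency seen by the user when the write is confirmed only after
    every node has confirmed: the slowest node decides."""
    times = list(node_confirm_times_ms)
    if not times:
        raise ValueError("at least one node is required")
    return max(times)


if __name__ == "__main__":
    healthy = [2, 3, 2]        # illustrative confirmation times in ms
    degraded = [2, 3, 250]     # one node with a failing disk
    print("healthy cluster:", write_latency_ms(healthy), "ms")
    print("one slow node:  ", write_latency_ms(degraded), "ms")
```

Two fast nodes cannot compensate for the third: the maximum, not the average, is what the user waits for.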
Now let’s talk about write-intensive applications. By write-intensive I mean a very high rate of updates/inserts/deletes per second. If you have this case, there will be some limit on how much data you can have in the cluster. This limitation is not physical, there is nothing hardcoded; it is rather logical. Let me explain why.
I will also explain how the cluster handles the JOIN process. Let’s assume a new node wants to join an existing cluster. As we already discussed, it has to have a full copy of the data, the same data as the other nodes in the cluster.
So what happens: the cluster allocates one node, which gets the status DONOR, and the JOINER copies the whole dataset from the DONOR. You understand that, for example, for 200GB of data it may take time to copy it over the network. Meanwhile the DONOR also goes OUT OF CLUSTER. This may be for a short or a long period of time, depending on which copying method you choose. I will show the different copying methods later.
Now, when the data copying is finished, we have two nodes that were disconnected from the cluster, and they need to apply the events that happened while they were disconnected. If you have a big database and an intensive rate of changes, this may take a long time. And while events are applied to the DONOR and JOINER, new events may still happen in the cluster. You understand that these two are trying to catch up, but the cluster keeps generating new events. In the worst case these two outsiders may never catch up.
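The catch-up problem is simple arithmetic. The sketch below assumes constant rates: the cluster produces `write_rate` events per second, and a node that was away can apply `apply_rate` events per second. All names and numbers are illustrative assumptions.

```python
from typing import Optional


def catch_up_seconds(copy_seconds: float, write_rate: float,
                     apply_rate: float) -> Optional[float]:
    """Time needed to catch up after being away for `copy_seconds`.

    Returns None when the node can never catch up, that is when it
    applies events no faster than the cluster generates them.
    """
    if copy_seconds < 0 or write_rate < 0 or apply_rate < 0:
        raise ValueError("arguments must be non-negative")
    if apply_rate <= write_rate:
        return None
    backlog = write_rate * copy_seconds          # events missed while copying
    return backlog / (apply_rate - write_rate)   # backlog shrinks at the rate difference


if __name__ == "__main__":
    # One hour of copying, 1000 events/s generated, 1500 events/s applied.
    print(catch_up_seconds(3600, 1000, 1500), "seconds to catch up")
    # Same cluster, but the node applies events only as fast as they arrive.
    print(catch_up_seconds(3600, 1000, 1000))
```

The larger the dataset, the longer the copy and the bigger the backlog, which is why the data size limit for write-intensive clusters is logical rather than hardcoded.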
So for write-intensive applications this should be a hardware + software solution. This is a case we can’t solve using software alone. And this applies not only to XtraDB Cluster. Let me show an analogy.
Let’s take a single InnoDB system. When we need good write performance from InnoDB, the usual setup is to have a disk array. You can’t have decent performance with a single disk.
If you need InnoDB to provide good performance and durability, you need not just a disk array, but an array with a cache backed by a battery.
The same goes for the cluster. For write-intensive applications and good performance in a cluster, you will need good networking, like 10 Gigabit or InfiniBand, and good storage; consider SSD drives or PCIe Flash cards.
Let’s get back to the JOIN process. As I promised, let me review the methods XtraDB Cluster can use to copy data. The process itself has a name: State Transfer.
In XtraDB Cluster we have two state transfers. The first is a full data copy: this is Snapshot State Transfer (SST). It happens when a totally new node joins the cluster, or when a node was in the cluster but for some reason was disconnected for a long period of time. The second is Incremental State Transfer (IST). This happens when a node was disconnected for a short period of time.
For Snapshot State Transfer (full copy of data), we have the following choices. mysqldump: obviously it is good for a small database, just to play with the cluster. Rsync: the data is copied using the rsync process. Usually it is a fast way to copy data, but the drawback is that the DONOR is disconnected from the cluster for the whole copying time. XtraBackup: with xtrabackup the donor is disconnected for a short period of time, but copying data and joining may be slower than with rsync.
Incremental State Transfer is used when a node was in the cluster but we had to take it down for a short period of time, for example to reboot the server or to change some configuration parameters. The second case where it could be used (but not yet, this is work in progress) is when a node crashes. Yes, it happens. Unfortunately, at this moment a node has to perform a full Snapshot State Transfer after a crash.
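The rules for picking a state transfer can be summarized in a few lines. This is only a sketch of the rules as described in this talk: what counts as a "short" disconnect is left to the caller as a boolean, and the function is hypothetical, not how the cluster decides internally.

```python
def choose_state_transfer(is_new_node: bool, crashed: bool,
                          short_disconnect: bool) -> str:
    """Return "SST" (full snapshot) or "IST" (incremental) following the
    rules described in the talk."""
    if is_new_node:
        return "SST"   # a brand new node has no data at all
    if crashed:
        return "SST"   # after a crash a full copy is currently required
    if not short_disconnect:
        return "SST"   # away for too long
    return "IST"       # clean, short absence such as a reboot


if __name__ == "__main__":
    print("new node:       ", choose_state_transfer(True, False, False))
    print("planned reboot: ", choose_state_transfer(False, False, True))
    print("crashed node:   ", choose_state_transfer(False, True, True))
```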
OK, if you are still with me, we can continue with the scalability topic.
Scalability for me is quite similar to availability. That’s why we can use XtraDB Cluster when we need to scale a load.
Scalability is similar to availability in the sense of how it can be handled: by redundancy. In this case the only difference is that the first system is unable to handle a user’s request not because it is down, but because it is overloaded. The experience for the user is the same: the system refuses to handle the query, so the system is not available to the user. It can be handled by redirecting the query to a second system.
In XtraDB Cluster it is easy to scale reads. Read queries do not require additional overhead or group communication.
Scaling writes is more complicated, as each write has to be replicated on every system. Each server has to handle writes coming from all servers.
We can make a rule of thumb. It is very approximate, just to get a basic understanding of how to look at it. If we have N servers and our workload is 100% reads, we can scale by as much as a factor of N. For 100% writes we can scale only by some constant, or can’t scale at all.
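The rule of thumb can be turned into a rough estimate. The sketch below assumes every server has the same capacity, reads are spread evenly across servers, and every write costs the same on every server as it does on the one that received it. The formula for mixed workloads is an extrapolation between the two extremes and is an assumption of this sketch; the names and the 100 queries per second figure are illustrative.

```python
def cluster_capacity(per_server_qps: float, servers: int,
                     read_fraction: float) -> float:
    """Very approximate total queries/s a cluster can serve.

    Each server executes all writes plus its 1/N share of the reads:
        total * ((1 - r) + r / N) <= per_server_qps
    """
    if servers < 1:
        raise ValueError("servers must be at least 1")
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    write_fraction = 1.0 - read_fraction
    return per_server_qps / (write_fraction + read_fraction / servers)


if __name__ == "__main__":
    for reads in (1.0, 0.9, 0.5, 0.0):
        total = cluster_capacity(100, servers=10, read_fraction=reads)
        print(f"{reads:.0%} reads on 10 servers: about {total:.0f} queries/s")
```

At 100% reads the estimate grows by a factor of N, and at 100% writes it stays at the capacity of a single server, which matches the two ends of the rule of thumb.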
That is, if 1 server can handle 100 read queries per second, then 10 servers can handle 1000 such queries per second. For 100% write traffic there is not much room to grow. Because of the internal communication I showed before, if 1 server can handle 100 update queries per second, then 10 servers are still able to handle only 100 such queries per second. Actually it can be a little better. Because internal communication happens in optimized
With all this group communication over the network and the synchronous process, is it fast? Actually it is reasonably fast. Certification and virtual synchrony minimize overhead. If we look at two performance characteristics, response time and throughput, the response time is the one that may take a hit. A network round trip will surely increase it. If that is critical, make sure you have a decent network and storage. The throughput is less affected. As we can do many operations in parallel, we can have reasonable performance numbers for throughput.
One of the setups where XtraDB Cluster is considered is intercontinental, or inter-coast, replication.
I do not like this question. I usually answer “It is different”. I mean, these are really different systems with different goals, so how can you compare them? But usually people do not like this answer.
That’s why I came up with this marketing-like table with a checklist.