This is a tutorial I gave with my colleague Kenny Gryp at Percona Live 2016 in Santa Clara.
Percona XtraDB Cluster is a high availability and high scalability solution for MySQL clustering. Percona XtraDB Cluster integrates Percona Server with the Galera synchronous replication library in a single product package, which enables you to create a cost-effective MySQL cluster.
For three years at Percona Live, we've introduced people to this technology... but what's next? This tutorial continues your education and targets users who already have experience with Percona XtraDB Cluster and want to go further.
This tutorial will cover the following topics:
- Bootstrapping in detail
- Certification errors: understanding and preventing them
- Replication failures and how to deal with them
- Secrets of the Galera Cache
- Mastering flow control
- Understanding and verifying replication throughput
- How to use WAN replication
- Implications of consistent reads
- Backups
- Load balancers and the proxy protocol
5. Setting Up The Environment
Fetch a USB stick, install VirtualBox & copy the 3 images
6. Attention - Hands On!
When you see the exercise icon in the bottom right,
there is an exercise that you should do!
When you see the QR icon, take your QR reader and participate!
7. Testing The Environment
Install QR Reader to update your Status.
Start all 3 VirtualBox images
ssh/putty to:
pxc1: ssh root@localhost -p 8821
pxc2: ssh root@localhost -p 8822
pxc3: ssh root@localhost -p 8823
root password is perconapassword
HAProxy is running on pxc1
(http://localhost:8881/)
Verify ssh between nodes
Open 2 ssh sessions to every node
10. Bootstrapping PXC
You all should know this already...
# service mysql bootstrap-pxc
# /etc/init.d/mysql bootstrap-pxc
# /etc/init.d/mysql start --wsrep-new-cluster
or with systemd environments like CentOS 7:
# systemctl start mysql@bootstrap
Today we are using 64-bit CentOS 7.
For /(Emily|Pim)/, just to be complete....
pxc1# systemctl start mysql@bootstrap
pxc2# systemctl start mysql
pxc3# systemctl start mysql
11. Bootstrapping PXC
CentOS 7
The node you bootstrapped...
Stop it with:
# systemctl stop mysql@bootstrap
Start it again with:
# systemctl start mysql
(unless a new cluster needs to be created)
12. Bootstrapping PXC
Bootstrapping a node gives it permission to form a new cluster
Bootstrapping should NOT happen automatically without a system with split-brain protection that can coordinate it; usually this is done manually
The bootstrapped node is the source of truth for all nodes going forward
13. Bootstrapping PXC
Recap On IST/SST
IST: Incremental State Transfer
Only transfer missing transactions
SST: State Snapshot Transfer
Snapshot the whole database and transfer, using:
Percona XtraBackup
rsync
mysqldump
One node of the cluster is DONOR
14. Bootstrapping PXC
With Stop/Start Of MySQL
When you need to start a new cluster from scratch, you decide which
node to start with and you bootstrap it
# systemctl start mysql@bootstrap
That node becomes the cluster source of truth (SSTs for all new nodes)
15. Bootstrapping PXC
Without Restarting MySQL
When a cluster is already partitioned and you want to bring it up again.
1 or more nodes need to be in Non-Primary state.
Choose the node that is newest and can be enabled (to work with
application)
To bootstrap online:
mysql> set global wsrep_provider_options="pc.bootstrap=true";
be sure there is NO OTHER PRIMARY partition or there will be a split-brain!!
16. Bootstrapping PXC
Without Restarting MySQL
Use Case:
Only 1 of the 3 nodes is available and the other 2 nodes crashed,
causing node 1 to go Non-Primary.
In Multi Datacenter environments:
DC1 has 2 nodes, DC2 has 1 node,
If DC1 dies, the single node in DC2 will go Non-Primary. To activate the secondary DC, a bootstrap is necessary
17. Recover Cleanly Shutdown Cluster
Run the application (run_app.sh haproxy-all) on pxc1
One by one, stop mysql on all 3 nodes
How can you know which node to bootstrap?
18. Recover Cleanly Shutdown Cluster
Run the application (run_app.sh haproxy-all) on pxc1
One by one, stop mysql on all 3 nodes
How can you know which node to bootstrap?
Solution
# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 3759f5c0-56f6-11e5-ad87-afbd92f4dcd2
seqno: 1933471
cert_index:
Bootstrap the node with the highest seqno and start the other nodes.
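Comparing the seqno across nodes can be scripted; a minimal sketch (the helper and the sample values are illustrative, not part of the lab environment — on the real nodes you would collect each value with something like `awk '/^seqno:/ {print $2}' /var/lib/mysql/grastate.dat`):

```shell
# Pick the node whose grastate.dat has the highest seqno.
# Reads "node seqno" pairs on stdin, prints the winning node name.
highest_seqno_node() {
  sort -k2,2n | tail -n1 | cut -d' ' -f1
}

# Illustrative sample values:
printf 'pxc1 1933471\npxc2 1933469\npxc3 1933470\n' | highest_seqno_node
# → pxc1
```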
19. Recover Unclean Stopped Cluster
Run the application (run_app.sh haproxy-all) on pxc1
On all nodes at the same time run:
# killall -9 mysqld mysqld_safe
How can you know which node has the latest commit?
20. Recover Unclean Stopped Cluster
Run the application (run_app.sh haproxy-all) on pxc1
On all nodes at the same time run:
# killall -9 mysqld mysqld_safe
How can you know which node has the latest commit?
Solution
# mysqld_safe --wsrep-recover
Logging to '/var/lib/mysql/error.log'.
Starting mysqld daemon with databases from /var/lib/mysql
WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.L
WSREP: Recovered position 44e54b4b-5c69-11e5-83a3-8fc879cb495e:1719976
mysqld from pid file /var/lib/mysql/pxc1.pid ended
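The recovered position can be pulled out of that output with a small filter — a sketch, using the log line shown above:

```shell
# Extract the "uuid:seqno" position from mysqld_safe --wsrep-recover output.
recovered_position() {
  grep 'Recovered position' | awk '{print $NF}'
}

echo 'WSREP: Recovered position 44e54b4b-5c69-11e5-83a3-8fc879cb495e:1719976' \
  | recovered_position
# → 44e54b4b-5c69-11e5-83a3-8fc879cb495e:1719976
```

Run it against each node's output and bootstrap the node with the highest seqno.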
22. Recover Unclean Stopped Cluster
What methods can we use to bring back the cluster?
Solutions
Bootstrap the most accurate server
Since PXC 5.6.19-25.6 we have pc.recovery (enabled by default) that uses the information stored in gvwstate.dat. We can then just start MySQL on all 3 nodes at the same time
# systemctl restart mysql
23. Avoiding SST
When MySQL cannot start due to an error, such as a
configuration error, an SST is always performed.
24. Avoiding SST
Let's try: stop MySQL on pxc2, modify /etc/my.cnf and add foobar under the [mysqld] section.
Then start MySQL. Does it fail? Check /var/lib/mysql/error.log.
# systemctl stop mysql
# systemctl stop mysql@bootstrap
# cat >> /etc/my.cnf << EOF
[mysqld]
foobar
EOF
# systemctl start mysql
25. Avoiding SST
Fix the error (remove the foobar configuration) and restart MySQL.
Does it perform SST?
Check /var/lib/mysql/error.log. Why?
26. Avoiding SST
Fix the error (remove the foobar configuration) and restart MySQL.
Does it perform SST?
Check /var/lib/mysql/error.log. Why?
SST is done, as we can see in the error.log:
[Warning] WSREP: Failed to prepare for incremental state transfer:
Local state UUID (00000000-0000-0000-0000-000000000000)
does not match group state UUID
(93a81eed-57b2-11e5-8f5e-82e53aab8d35): 1 (Operation not permitted)
...
WSREP_SST: [INFO] Proceeding with SST (20150916 19:13:50.990)
27. Avoiding SST
So how can we avoid SST?
It's easy: you need to hack /var/lib/mysql/grastate.dat.
Create the error again:
Bring the node back into the cluster
Add foobar to the configuration again
Start MySQL
Remove foobar again
29. Avoiding SST
# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 93a81eed-57b2-11e5-8f5e-82e53aab8d35
seqno: 1300762
cert_index:
When it fails due to an error, grastate.dat is reset to:
# GALERA saved state
version: 2.1
uuid: 00000000-0000-0000-0000-000000000000
seqno: -1
cert_index:
30. Avoiding SST
You need then to set the right uuid and seqno in grastate.dat
Run mysqld_safe --wsrep-recover to find the values to set:
[root@pxc2 mysql]# mysqld_safe --wsrep-recover
...
2015-09-16 19:26:14 6133 [Note] WSREP:
Recovered position: 93a81eed-57b2-11e5-8f5e-82e53aab8d35:1300762
31. Avoiding SST
Create grastate.dat with info from wsrep-recover:
# GALERA saved state
version: 2.1
uuid: 93a81eed-57b2-11e5-8f5e-82e53aab8d35
seqno: 1300762
cert_index:
Start MySQL again and check /var/lib/mysql/error.log:
[root@pxc2 mysql]# systemctl start mysql
...
150916 19:27:53 mysqld_safe
Assigning 93a81eed-57b2-11e5-8f5e-82e53aab8d35:1300762
to wsrep_start_position
...
WSREP_SST: [INFO] xtrabackup_ist received from donor:
Running IST (20150916 19:34:05.545
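The grastate.dat edit can also be scripted from the recovered position; a minimal sketch (it writes to stdout so you can inspect the result before replacing the real file):

```shell
# Turn a recovered "uuid:seqno" position into grastate.dat contents.
write_grastate() {
  uuid=${1%:*}     # everything before the last ':'
  seqno=${1##*:}   # everything after the last ':'
  printf '# GALERA saved state\nversion: 2.1\nuuid: %s\nseqno: %s\ncert_index:\n' \
    "$uuid" "$seqno"
}

write_grastate 93a81eed-57b2-11e5-8f5e-82e53aab8d35:1300762
```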
32. When putting in production unprepared...
Certification Errors
33. Certification
What it does:
Determine if writeset can be applied.
Based on unapplied earlier transactions on master
Such conflicts must come from other nodes
Happens on every node, individually
Deterministic
Results are not reported to other nodes in the cluster, as every node does certification and it is a deterministic process.
Pass: enter apply queue (commit success on master)
Fail: drop transaction (or return deadlock on master)
Serialized by GTID
Cost based on # of keys or # of rows
41. Conflict Detection
Local Certification Failure (lcf)
Transaction fails certification
Post-replication
Deadlock/Transaction Rollback
Status Counter: wsrep_local_cert_failures
Brute Force Abort (bfa)
(Most Common)
Deadlock/Transaction rolled back by applier threads
Pre-commit
Transaction Rollback
Status Counter: wsrep_local_bf_aborts
42. Conflict Deadlock/Rollback
note: a Transaction Rollback can occur on any statement, including SELECT and COMMIT
Example:
pxc1 mysql> commit;
ERROR 1213 (40001): Deadlock found when trying to get lock;
try restarting transaction
43. Multi-writer Conflict Types
Brute Force Abort (bfa)
Transaction rolled back by applier threads
Pre-commit
A Transaction Rollback can occur on any statement, including SELECT and COMMIT
Status Counter: wsrep_local_bf_aborts
65. Reproducing Conflicts - 1
On pxc1, create test table:
pxc1 mysql> CREATE TABLE test.deadlocks (
i INT UNSIGNED NOT NULL PRIMARY KEY,
j varchar(32),
t datetime
);
pxc1 mysql> INSERT INTO test.deadlocks VALUES (1, NULL, NULL);
Run myq_status on pxc1:
# myq_status wsrep
mycluster / pxc1 (idx: 1) / Galera 3.11(ra0189ab)
Cluster Node Repl Queue Ops Bytes Conflct Gcache Window
cnf # Stat Laten Up Dn Up Dn Up Dn lcf bfa ist idx dst appl comm
12 3 Sync N/A 0 0 0 0 0.0 0.0 0 0 1.8m 0 0 0
12 3 Sync N/A 0 0 0 0 0.0 0.0 0 0 1.8m 0 0 0
12 3 Sync N/A 0 0 0 0 0.0 0.0 0 0 1.8m 0 0 0
66. Reproducing Conflicts - 1
On pxc1:
pxc1 mysql> BEGIN;
pxc1 mysql> UPDATE test.deadlocks SET j='pxc1', t=now() WHERE i=1;
Before commit, go to pxc3:
pxc3 mysql> BEGIN;
pxc3 mysql> UPDATE test.deadlocks SET j='pxc3', t=now() WHERE i=1;
pxc3 mysql> COMMIT;
Now commit the transaction on pxc1:
pxc1 mysql> COMMIT;
pxc1 mysql> SELECT * FROM test.deadlocks;
68. Reproducing Conflicts - 1
It fails:
pxc1 mysql> commit;
ERROR 1213 (40001): Deadlock found when trying to get lock;
try restarting transaction
Which commit succeeded?
Is this a lcf or a bfa?
How would you diagnose this error?
69. Reproducing Conflicts - 1
Which commit succeeded? pxc3's, the first one that got into the cluster.
Is this a lcf or a bfa? BFA
How would you diagnose this error?
show global status like 'wsrep_local_bf%';
show global status like 'wsrep_local_cert%';
+---------------------------+-------+
| Variable_name | Value |
+---------------------------+-------+
| wsrep_local_bf_aborts | 1 |
| wsrep_local_cert_failures | 0 |
+---------------------------+-------+
# myq_status wsrep
mycluster / pxc1 (idx: 1) / Galera 3.11(ra0189ab)
Wsrep Cluster Node Repl Queue Ops Bytes Conflct Gcache Wi
time P cnf # Stat Laten Up Dn Up Dn Up Dn lcf bfa ist idx dst
10:49:43 P 12 3 Sync 1.1ms 0 0 0 0 0.0 0.0 0 0 3 3
10:49:44 P 12 3 Sync 1.1ms 0 0 0 0 0.0 0.0 0 0 3 3
10:49:45 P 12 3 Sync 1.1ms 0 0 0 0 0.0 0.0 0 0 3 3
10:49:46 P 12 3 Sync 1.1ms 0 0 0 1 0.0 0.3K 0 1 4 3
70. Reproducing Conflicts - 1
Log Conflicts
pxc1 mysql> set global wsrep_log_conflicts=on;
*** Priority TRANSACTION:
TRANSACTION 7743569, ACTIVE 0 sec starting index read
MySQL thread id 2, OS thread handle 0x93e78b70, query id 1395484 System lock
*** Victim TRANSACTION:
TRANSACTION 7743568, ACTIVE 9 sec
MySQL thread id 89984, OS thread handle 0x82bb1b70, query id 1395461 localhost roo
*** WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 80 page no 3 n bits 72 index PRIMARY of table test.deadlocks
trx id 7743568 lock_mode X locks rec but not gap
2015-09-19 12:36:17 4285 [Note] WSREP: cluster conflict due to high priority
abort for threads:
2015-09-19 12:36:17 4285 [Note] WSREP: Winning thread:
THD: 2, mode: applier, state: executing, conflict: no conflict,
seqno: 1824234
SQL: (null)
2015-09-19 12:36:17 4285 [Note] WSREP: Victim thread:
THD: 89984, mode: local, state: idle, conflict: no conflict,
seqno: -1
SQL: (null)
71. Reproducing Conflicts - 1
Log Conflicts - Debug
pxc1 mysql> set global wsrep_debug=on;
[Note] WSREP: BF kill (1, seqno: 1824243), victim: (90473) trx: 7743601
[Note] WSREP: Aborting query: void
[Note] WSREP: kill IDLE for 7743601
[Note] WSREP: enqueuing trx abort for (90473)
[Note] WSREP: signaling aborter
[Note] WSREP: WSREP rollback thread wakes for signal
[Note] WSREP: client rollback due to BF abort for (90473), query: (null)
[Note] WSREP: WSREP rollbacker aborted thd: (90473 2649955184)
[Note] WSREP: Deadlock error for: (null)
72. Reproducing Conflicts - 2
rollback; all transactions on all mysql clients
ensure SET GLOBAL wsrep_log_conflicts=on; on all nodes
run myq_status wsrep on pxc1 and pxc2
run run_app.sh lcf on pxc1 to reproduce an LCF
check:
output of run_app.sh lcf
myq_status
/var/lib/mysql/error.log
74. Reproducing Conflicts - 2
*** Priority TRANSACTION:
TRANSACTION 7787747, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
1 lock struct(s), heap size 312, 0 row lock(s)
MySQL thread id 1, OS thread handle 0x93e78b70, query id 301870 System lock
*** Victim TRANSACTION:
TRANSACTION 7787746, ACTIVE 0 sec
mysql tables in use 1, locked 1
2 lock struct(s), heap size 312, 1 row lock(s), undo log entries 1
MySQL thread id 2575, OS thread handle 0x82369b70, query id 286919 pxc1 192.168
update test.test set sec_col = 0 where id = 1
[Note] WSREP: Winning thread:
THD: 1, mode: applier, state: executing, conflict: no conflict,
seqno: 1846028, SQL: (null)
[Note] WSREP: Victim thread:
THD: 2575, mode: local, state: committing, conflict: no conflict,
seqno: -1, SQL: update test.test set sec_col = 0 where id = 1
[Note] WSREP: BF kill (1, seqno: 1846028), victim: (2575) trx: 7787746
[Note] WSREP: Aborting query: update test.test set sec_col = 0 where id = 1
[Note] WSREP: kill trx QUERY_COMMITTING for 7787746
[Note] WSREP: trx conflict for key (1,FLAT8)258634b1 d0506abd:
source: 5cb369ab-5eca-11e5-8151-7afe8943c31a version: 3 local: 1
state: MUST_ABORT flags: 1 conn_id: 2575 trx_id: 7787746
seqnos (l: 21977, g: 1846030, s: 1846027, d: 1838545, ts: 161299135504641)
<--X-->
source: 9b376860-5e09-11e5-ac17-e6e46a2459ee version: 3 local: 0
state: APPLYING flags: 1 conn_id: 95747 trx_id: 7787229
75. Reproducing Conflicts
Summary
Conflicts are a concern when determining whether PXC is a fit for the application:
Long running transactions increase chance of conflicts
Heavy write workload on multiple nodes
Large transactions increase chance of conflicts
Mark Callaghan's law: a given row can't be modified more often than 1/RTT times per second
These issues can usually be resolved by writing to 1 node only.
80. Replication Failures
When Do They Happen?
When a Total Order Isolation (TOI) error happens
DDL error: CREATE TABLE, ALTER TABLE...
GRANT failed
When there was a node inconsistency
Bug in Galera replication
Human error, for example skipping the binary log (SQL_LOG_BIN=0) when doing writes
81. Replication Failures
What Happens?
At every error:
A GRA_*.log file is created in the MySQL datadir
[root@pxc1 ~]# ls -alsh1 /var/lib/mysql/GRA_*
-rw-rw----. 1 mysql 89 Sep 15 10:26 /var/lib/mysql/GRA_1_127792.log
-rw-rw----. 1 mysql 83 Sep 10 12:00 /var/lib/mysql/GRA_2_5.log
A message is written to the error log
/var/lib/mysql/error.log
It's possible to decode them; they are binary logs
They can be safely removed
84. Replication Failures
Reading GRA Contents
Run Application only on pxc1 using only pxc1 as writer:
# run_app.sh pxc1
pxc1 mysql> create table test.nocolumns;
What do you get?
ERROR 1113 (42000): A table must have at least 1 column
Error Log Other Nodes:
[ERROR] Slave SQL: Error 'A table must have at least 1 column' on query.
Default database: ''. Query: 'create table test.nocolumns', Error_code: 1113
[Warning] WSREP: RBR event 1 Query apply warning: 1, 1881065
[Warning] WSREP: Ignoring error for TO isolated action:
source: 9b376860-5e09-11e5-ac17-e6e46a2459ee version: 3 local: 0
state: APPLYING flags: 65 conn_id: 106500 trx_id: -1
seqnos (l: 57292, g: 1881065, s: 1881064, d: 1881064, ts: 76501555632582)
85. Replication Failures
Making GRA Header File
note: binary log headers only differ between versions; they can be reused
Get a binary log without checksums:
pxc2 mysql> set global binlog_checksum=0;
pxc2 mysql> flush binary logs;
pxc2 mysql> show master status;
+-----------------+----------+--------------+------------------+------------------
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set
+-----------------+----------+--------------+------------------+------------------
| pxc2-bin.000018 | 790333 | | |
+-----------------+----------+--------------+------------------+------------------
Create the GRA_header file from the new binary log:
dd if=/var/lib/mysql/pxc2-bin.000018 bs=120 count=1 of=/root/GRA_header
86. Replication Failures
Reading GRA Contents
Join the header and one GRA_*.log file (note the seqno of 1881065)
cat /root/GRA_header /var/lib/mysql/GRA_1_1881065.log
>> /root/GRA_1_1881065-bin.log
View the content with mysqlbinlog
mysqlbinlog -vvv /root/GRA_1_1881065-bin.log
87. Replication Failures
Node Consistency Compromised
Delete some data while skipping binary log completely:
pxc2 mysql> set sql_log_bin=0;
Query OK, 0 rows affected (0.00 sec)
pxc2 mysql> delete from sbtest.sbtest1 limit 100;
Query OK, 100 rows affected (0.00 sec)
Repeat the DELETE until pxc2 crashes...
88. Replication Failures
Node Consistency Compromised
Error:
[ERROR] Slave SQL: Could not execute Update_rows event
on table sbtest.sbtest1;
Can't find record in 'sbtest1', Error_code: 1032;
handler error HA_ERR_KEY_NOT_FOUND;
the event's master log FIRST, end_log_pos 540, Error_code: 1032
[Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 1890959
[Warning] WSREP: Failed to apply app buffer: seqno: 1890959, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 2th time ...
Retrying 4th time ...
[ERROR] WSREP: Failed to apply trx:
source: 9b376860-5e09-11e5-ac17-e6e46a2459ee version: 3 local: 0
state: APPLYING flags: 1 conn_id: 117611 trx_id: 7877213
seqnos (l: 67341, g: 1890959, s: 1890958, d: 1890841, ts: 78933399926835)
[ERROR] WSREP: Failed to apply trx 1890959 4 times
[ERROR] WSREP: Node consistency compromized, aborting...
...
[Note] WSREP: /usr/sbin/mysqld: Terminated.
89. Replication Failures
Node Consistency Compromised
[root@pxc2 ~]# cat /root/GRA_header /var/lib/mysql/GRA_1_1890959.log |
mysqlbinlog -vvv -
BINLOG '
...MDQwNjUyMjMwMTQ=...
'/*!*/;
### UPDATE sbtest.sbtest1
### WHERE
### @1=3528 /* INT meta=0 nullable=0 is_null=0 */
### @2=4395 /* INT meta=0 nullable=0 is_null=0 */
### @3='01945529982-83991536409-94055999891-11150850160-46682230772-19159811582-
### @4='92814455222-06024456935-25380449439-64345775537-04065223014' /* STRING(6
### SET
### @1=3528 /* INT meta=0 nullable=0 is_null=0 */
### @2=4396 /* INT meta=0 nullable=0 is_null=0 */
### @3='01945529982-83991536409-94055999891-11150850160-46682230772-19159811582-
### @4='92814455222-06024456935-25380449439-64345775537-04065223014' /* STRING(6
# at 660
#150919 15:56:36 server id 1 end_log_pos 595 Table_map: sbtest.sbtest1 mapped t
# at 715
#150919 15:56:36 server id 1 end_log_pos 1005 Update_rows: table id 70 flags: ST
...
94. Galera Cache
All nodes contain a cache of recent writesets, used to perform IST
used to store the writesets in a circular-buffer style
preallocated file with a specific size, configurable:
wsrep_provider_options = "gcache.size=1G"
default size is 128M
Galera Cache is mmapped (I/O buffered to memory)
So the OS might swap (set vm.swappiness to 10)
use fincore-linux or dbsake fincore to see how much of the file is cached in memory
status counter wsrep_local_cached_downto to find the last seqno in the gcache
wsrep_gcache_pool_size shows the size of the page pool and/or dynamic memory allocated for gcache (since PXC 5.6.24)
97. Galera Cache
Calculating Optimal Size
It would be great if we could handle 1 hour of changes in the Galera cache for IST.
How large does the Galera cache need to be?
We can calculate how much writes happen over time
wsrep_replicated_bytes: writesets sent to other nodes
wsrep_received_bytes: writesets received from other nodes
SHOW GLOBAL STATUS LIKE 'wsrep_%d_bytes';
SELECT SLEEP(60);
SHOW GLOBAL STATUS LIKE 'wsrep_%d_bytes';
Sum up both replicated and received
for each status and subtract.
98. Galera Cache
Calculating Optimal Size
An easier way is:
SELECT ROUND(SUM(bytes)/1024/1024*60) AS megabytes_per_hour FROM
(SELECT SUM(VARIABLE_VALUE) * -1 AS bytes
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME IN ('wsrep_received_bytes',
'wsrep_replicated_bytes')
UNION ALL
SELECT sleep(60) AS bytes
UNION ALL
SELECT SUM(VARIABLE_VALUE) AS bytes
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME IN ('wsrep_received_bytes',
'wsrep_replicated_bytes')
) AS COUNTED
+--------------------+
| megabytes_per_hour |
+--------------------+
| 302 |
+--------------------+
1 row in set (1 min 0.00 sec)
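Turning that rate into a gcache.size value is simple arithmetic; a sketch with the figure from the example above (the 20% headroom is my assumption, not a Percona recommendation):

```shell
mb_per_hour=302    # from the query above
headroom_pct=20    # assumption: keep some margin beyond 1 hour of writes
echo "gcache.size=$(( mb_per_hour * (100 + headroom_pct) / 100 ))M"
# → gcache.size=362M
```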
99. Galera Cache
How Much Filesystem Cache Used?
Check the Galera cache's memory usage using dbsake:
dbsake fincore /var/lib/mysql/galera.cache
/var/lib/mysql/galera.cache: total_pages=32769 cached=448 percent=1.37
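The percent figure is just cached pages over total pages; the numbers above check out:

```shell
# fincore-style output: percent cached = cached / total_pages * 100
awk 'BEGIN { printf "%.2f\n", 448 / 32769 * 100 }'
# → 1.37
```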
101. Flow Control
Avoids nodes drifting behind (like slave lag)
Any node in the cluster can ask the other nodes to pause writes if it
lags behind too much.
Caused by wsrep_local_recv_queue exceeding a node's gcs.fc_limit
Can cause all writes on all nodes in the entire cluster to be
stalled.
122. Flow Control
Status Counters
wsrep_flow_control_paused_ns:
nanoseconds the cluster was stalled since the node started
wsrep_flow_control_recv:
number of flow control messages received from other nodes
wsrep_flow_control_sent:
number of flow control messages sent to other nodes
(wsrep_flow_control_paused: only use in Galera 2;
% of the time the cluster was stalled since the last SHOW GLOBAL STATUS)
124. Flow Control
Observing Flow Control
Run the application (run_app.sh)
Run myq_status wsrep on all nodes.
Take a read lock on pxc3 and observe its effect on the cluster.
FLUSH TABLES WITH READ LOCK
pxc3 mysql> flush tables with read lock;
wait until flow control kicks in...
pxc3 mysql> unlock tables;
126. Flow Control
Increase the limit
Increase the flow control limit on pxc3 to 20000 and perform the same exercise as previously.
pxc3 mysql> set global wsrep_provider_options="gcs.fc_limit=20000";
pxc3 mysql> flush tables with read lock;
wait until flow control kicks in...
pxc3 mysql> unlock tables;
127. Flow Control
Increase the limit
What do you see?
A node can lag behind more before sending flow control messages.
This can be controlled per node.
Is there another alternative?
129. Flow Control
DESYNC mode
It's possible to let a node fall behind beyond the flow control limit.
This can be done by setting wsrep_desync=ON
Try the same exercises but enable DESYNC on pxc3.
pxc3 mysql> set global wsrep_provider_options="gcs.fc_limit=16";
pxc3 mysql> set global wsrep_desync=on;
Don't forget when done:
pxc3 mysql> unlock tables;
130. How much more can we handle?
Max Replication Throughput
131. Max Replication Throughput
We can measure the write throughput of a node/cluster:
Put a node in wsrep_desync=on to avoid flow control messages being sent
Lock the replication with FLUSH TABLES WITH READ LOCK
Wait and build up a queue for a certain amount of time
Unlock replication again with UNLOCK TABLES
Measure how fast it syncs up again
Compare with normal workload
132. Max Replication Throughput
Measure
On pxc2 run:
show global status like 'wsrep_last_committed';
select sleep(60);
show global status like 'wsrep_last_committed';
One Liner:
UNLOCK TABLES; SELECT ROUND(SUM(trx)/60) AS transactions_per_second FROM
(SELECT VARIABLE_VALUE * -1 AS trx
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'wsrep_last_committed'
UNION ALL SELECT sleep(60) AS trx
UNION ALL SELECT VARIABLE_VALUE AS trx
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'wsrep_last_committed') AS COUNTED;
+-------------------------+
| transactions_per_second |
+-------------------------+
| 185 |
+-------------------------+
133. Max Replication Throughput
Measure
The following stored function is already installed:
USE test; DROP FUNCTION IF EXISTS galeraWaitUntilEmptyRecvQueue;
DELIMITER $$
CREATE
DEFINER=root@localhost FUNCTION galeraWaitUntilEmptyRecvQueue()
RETURNS INT UNSIGNED READS SQL DATA
BEGIN
DECLARE queue INT UNSIGNED;
DECLARE starttime TIMESTAMP;
DECLARE blackhole INT UNSIGNED;
SET starttime = SYSDATE();
SELECT VARIABLE_VALUE AS trx INTO queue
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'wsrep_local_recv_queue';
WHILE queue > 1 DO /* we allow the queue to be 1 */
SELECT VARIABLE_VALUE AS trx INTO queue
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'wsrep_local_recv_queue';
SELECT SLEEP(1) into blackhole;
END WHILE;
RETURN SYSDATE() - starttime;
END$$
134. Max Replication Throughput
Measure
SET GLOBAL wsrep_desync=on;
FLUSH TABLES WITH READ LOCK;
...wait until the queue rises to be quite high, about 20,000
UNLOCK TABLES; use test;
SELECT sum(trx) as transactions, sum(duration) as time,
IF(sum(duration) < 5, 'DID NOT TAKE LONG ENOUGH TO BE ACCURATE',
ROUND(SUM(trx)/SUM(duration)))
AS transactions_per_second
FROM
(SELECT VARIABLE_VALUE * -1 AS trx, null as duration
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'wsrep_last_committed'
UNION ALL
SELECT null as trx, galeraWaitUntilEmptyRecvQueue() AS duration
UNION ALL
SELECT VARIABLE_VALUE AS trx, null as duration
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'wsrep_last_committed'
) AS COUNTED;
+--------------+------+-------------------------+
| transactions | time | transactions_per_second |
+--------------+------+-------------------------+
| 17764 | 11 | 1615 |
+--------------+------+-------------------------+
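The transactions_per_second figure is simply the catch-up count divided by the catch-up time; with the numbers above:

```shell
# 17764 transactions applied in 11 seconds of catch-up
awk 'BEGIN { printf "%.0f\n", 17764 / 11 }'
# → 1615
```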
137. Networking
With Synchronous Replication, It Matters
Network issues cause cluster issues much faster than with asynchronous replication.
Network Partitioning
Nodes joining/leaving
Causing clusters to go Non-Primary, no longer accepting any reads or writes.
Latency has an impact on response time:
at COMMIT of a transaction
depending on the wsrep_sync_wait setting, for other statements too
138. Networking
Status Variables
pxc2 mysql> show global status like 'wsrep_evs_repl_latency';
+------------------------+-------------------------------------------------+
| Variable_name | Value |
+------------------------+-------------------------------------------------+
| wsrep_evs_repl_latency | 0.000745194/0.00175792/0.00832816/0.00184453/16 |
+------------------------+-------------------------------------------------+
Reset Interval with evs.stats_report_period=1min
# myq_status wsrep_latency
mycluster / pxc2 (idx: 2) / Galera 3.12(r9921e73)
Wsrep Cluster Node Ops Latencies
time P cnf # Stat Up Dn Size Min Avg Max Dev
22:55:48 P 53 3 Sync 0 65 9 681µs 1307µs 4192µs 1032µs
22:55:49 P 53 3 Sync 0 52 10 681µs 1274µs 4192µs 984µs
22:55:50 P 53 3 Sync 0 47 10 681µs 1274µs 4192µs 984µs
22:55:51 P 53 3 Sync 0 61 11 681µs 1234µs 4192µs 947µs
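wsrep_evs_repl_latency packs min/avg/max/stddev/sample-size into one string; a small filter (a sketch) converts the average to milliseconds:

```shell
# Field 2 of "min/avg/max/stddev/samples" is the average latency in seconds.
latency_avg_ms() {
  awk -F/ '{ printf "%.2f", $2 * 1000 }'
}

echo '0.000745194/0.00175792/0.00832816/0.00184453/16' | latency_avg_ms
# → 1.76
```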
139. Networking
Latency
On pxc1, start the 'application':
run_app.sh pxc1
[1125s] tps: 51.05, reads: 687.72, writes: 204.21, response time: 10.54ms (95%),
[1126s] tps: 33.98, reads: 475.77, writes: 135.94, response time: 15.07ms (95%),
[1127s] tps: 42.01, reads: 588.19, writes: 168.05, response time: 12.79ms (95%),
# myq_status wsrep_latency
mycluster / pxc2 (idx: 2) / Galera 3.12(r9921e73)
Wsrep Cluster Node Ops Latencies
time P cnf # Stat Up Dn Size Min Avg Max Dev
23:02:44 P 53 3 Sync 0 48 7 777µs 1236µs 2126µs 434µs
23:02:45 P 53 3 Sync 0 47 7 777µs 1236µs 2126µs 434µs
23:02:46 P 53 3 Sync 0 58 7 777µs 1236µs 2126µs 434µs
# myq_status wsrep
Wsrep Cluster Node Repl Queue Ops Bytes Conflct Gcache Win
time P cnf # Stat Laten Up Dn Up Dn Up Dn lcf bfa ist idx dst
23:02:44 P 53 3 Sync 1.2ms 0 0 0 49 0.0 80K 0 0 77k 178
23:02:45 P 53 3 Sync 1.2ms 0 0 0 43 0.0 70K 0 0 77k 176
23:02:46 P 53 3 Sync 1.2ms 0 0 0 55 0.0 90K 0 0 77k 164
141. Networking
WAN Impact on Latency
Change from a LAN setup into a cluster across 2 datacenters
last_node_to_dc2.sh enable
143. Networking
WAN Impact on Latency
last_node_to_dc2.sh enable
What can we observe in the cluster after running this command?
myq_status wsrep_latency is up ~200ms
run_app.sh throughput is a lot lower
run_app.sh response time is a lot higher
mycluster / pxc3 (idx: 0) / Galera 3.12(r9921e73)
Wsrep Cluster Node Ops Latencies
time P cnf # Stat Up Dn Size Min Avg Max Dev
23:23:34 P 53 3 Sync 0 14 6 201ms 202ms 206ms 2073µs
23:23:35 P 53 3 Sync 0 16 6 201ms 202ms 206ms 2073µs
23:23:36 P 53 3 Sync 0 14 6 201ms 202ms 206ms 2073µs
23:23:37 P 53 3 Sync 0 14 6 201ms 202ms 206ms 2073µs
23:23:38 P 53 3 Sync 0 14 6 201ms 202ms 206ms 2073µs
143
145. Networking
WAN Impact on Latency
Why is that?
Don't forget this is synchronous replication: the writeset is replicated
synchronously.
The writeset is delivered to all nodes in the cluster at trx commit.
All nodes acknowledge the writeset.
A GLOBAL ORDER is generated for that transaction (GTID).
The cost is ~ the round-trip latency of COMMIT to the furthest node.
GTIDs are serialized, but many writesets can be replicating in parallel.
Remember Mark Callaghan's Law: a given row can't be modified
more often than 1/RTT times per second.
145
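To put a number on Callaghan's Law: with the ~201ms WAN round trip observed above, a single hot row caps out at roughly five updates per second. A quick sketch of the arithmetic (the RTT value is taken from the earlier exercise):

```shell
# Upper bound on modifications of a single row: 1/RTT per second.
rtt=0.201   # seconds; the ~201ms WAN round trip observed above
max_updates=$(awk -v r="$rtt" 'BEGIN { printf "%.1f", 1/r }')
echo "max updates/sec on one hot row: $max_updates"   # → 5.0
```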
146. Networking
WAN Configuration
Don't forget in WAN to use higher timeouts and send windows:
evs.user_send_window=2 ~> 256
evs.send_window=4 ~> 512
evs.keepalive_period=PT1S ~> PT1S
evs.suspect_timeout=PT5S ~> PT15S
evs.inactive_timeout=PT15S ~> PT45S
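Assembled into my.cnf, the WAN-tuned values above could look like this (a sketch; tune the numbers to your actual links before using):

```ini
# Hypothetical my.cnf fragment combining the WAN values above
[mysqld]
wsrep_provider_options="evs.user_send_window=256;evs.send_window=512;evs.keepalive_period=PT1S;evs.suspect_timeout=PT15S;evs.inactive_timeout=PT45S"
```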
Don't forget to disable the WAN:
last_node_to_dc2.sh disable
146
148. Networking
WAN Configuration - Bandwidth
How to reduce the bandwidth used between datacenters?
Use segments (gmcast.segment) to reduce network traffic between
datacenters
Use minimal binlog_row_image to reduce binary log size
repl.key_format = FLAT8, which is already the smallest by default
148
153. Networking
Replication With Segments
Galera 3.0 introduced the segment concept
Replication traffic between segments is minimized
Donors are preferably selected from the local segment
153
156. Networking
Replication With Segments
Run run_app.sh on pxc1
In another terminal on pxc1, run speedometer:
speedometer -r eth1 -t eth1 -l -m 524288
Change the segment on pxc2 and pxc3 in /etc/my.cnf and restart
MySQL (this is not dynamic):
wsrep_provider_options='gmcast.segment=2'
Check the bandwidth usage again.
How do you explain this?
156
158. Networking
Binlog Row Image Format
On pxc1, run speedometer again:
speedometer -r eth1 -t eth1 -l -m 262144
On pxc1, set the binlog_row_image=minimal:
pxc1 mysql> SET GLOBAL binlog_row_image=minimal;
Check the bandwidth usage
158
160. Networking
Not Completely Synchronous
Applying transactions is asynchronous
By default, reads on different nodes might show stale data.
In practice, flow control keeps nodes from lagging too far behind,
which limits staleness.
Read consistency is configurable: we can enforce that a read returns
the latest committed data, cluster-wide.
What if we absolutely need consistency?
160
161. Networking
Not Completely Synchronous
What if we absolutely need consistency?
Since PXC 5.6.20-27.7:
SET <session|global> wsrep_sync_wait=[1|2|4];
1 indicates checks on READ statements, including SELECT,
SHOW, BEGIN/START TRANSACTION.
2 indicates checks on UPDATE and DELETE statements.
4 indicates checks on INSERT and REPLACE statements.
Before: SET <session|global> wsrep_causal_reads=[0|1];
161
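The three values are bit flags, so they can be combined with OR; wsrep_sync_wait=7 checks all statement types. A quick sketch of the arithmetic:

```shell
# wsrep_sync_wait is a bitmask: OR the flags to combine checks.
READ=1; UPD_DEL=2; INS_REP=4
combined=$(( READ | UPD_DEL | INS_REP ))
echo "SET SESSION wsrep_sync_wait=$combined; -- READs + UPDATE/DELETE + INSERT/REPLACE"
```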
162. Networking
Consistent Reads & Latency
How do wsrep_sync_wait consistent reads affect WAN
environments?
Stop the application run_app.sh
Move the last node to DC2:
last_node_to_dc2.sh enable
On pxc1, run:
pxc1 mysql> select * from sbtest.sbtest1 where id = 4;
...
1 row in set (0.00 sec)
162
165. Networking
Consistent Reads & Latency
Now change the causality check to ensure that READ statements are in
sync, and perform the same SELECT:
pxc1 mysql> SET SESSION wsrep_sync_wait=1;
pxc1 mysql> select * from sbtest.sbtest1 where id = 4;
What do you see ?
...
1 row in set (0.20 sec)
Put pxc3 back in DC1:
last_node_to_dc2.sh disable
165
166. Networking
Dirty Reads
By default, if a node is non-Primary, it will not serve reads & writes.
pxc1# killall -9 mysqld mysqld_safe
pxc2# killall -9 mysqld mysqld_safe
pxc3 mysql> SELECT * FROM information_schema.GLOBAL_STATUS
WHERE Variable_name='wsrep_cluster_status';
+----------------------+----------------+
| WSREP_CLUSTER_STATUS | non-Primary |
+----------------------+----------------+
pxc3 mysql> SELECT * FROM test.test LIMIT 1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use
166
167. Networking
Dirty Reads
But you can still read if you really want to, with
wsrep_dirty_reads=on (GLOBAL and SESSION).
pxc3 mysql> SELECT * FROM test.test LIMIT 1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use
pxc3 mysql> SET SESSION wsrep_dirty_reads=on;
pxc3 mysql> SELECT * FROM test.test LIMIT 1;
+----+---------+
| id | sec_col |
+----+---------+
| 1 | 0 |
+----+---------+
Fix it again:
pxc1# systemctl start mysql
pxc2# systemctl start mysql
167
170. Backups
Full: Percona XtraBackup
Feature-rich online physical Backups
Since PXC 5.6.21-25.8, there is LOCK TABLES FOR BACKUP
No FLUSH TABLES WITH READ LOCK anymore
Locks only DDL and MyISAM, leaves InnoDB fully unlocked
No more need to set the backup node to DESYNC to avoid Flow
Control
Point In Time Recovery: Binary Logs
It's also recommended to save the binary logs to perform
point-in-time recovery
With mysqlbinlog 5.6, it's possible to stream them to another
'backup' host.
170
172. Backups
Full Backup
On pxc1, run the application:
run_app.sh pxc1
On pxc3, take a full backup with Percona XtraBackup:
# xtrabackup --backup --target-dir=/root/backups/
160415 05:55:24 Connecting to MySQL server host: localhost, user: root, password:
Using server version 5.6.28-76.1-56-log
xtrabackup version 2.3.4 based on MySQL server 5.6.24 Linux (x86_64) (revision id:
160415 05:55:25 [01] Copying ./ibdata1 to /root/backups/ibdata1
160415 05:55:25 [01] ...done
160415 05:55:25 [01] Copying ./mysql/innodb_table_stats.ibd to /root/backups/mysql
...
160415 05:55:26 Executing LOCK TABLES FOR BACKUP...
160415 05:55:26 Starting to backup non-InnoDB tables and files
...
160415 05:55:27 Executing UNLOCK TABLES
...
160415 05:55:27 completed OK!
172
173. Backups
Full Backup
Apply the logs and get the seqno:
# xtrabackup --prepare --apply-log-only --target-dir=/root/backups/
# cat /root/backups/xtrabackup_galera_info
b55685a3-5f70-11e5-87f8-2f86c54ca425:1945
# cat /root/backups/xtrabackup_binlog_info
pxc3-bin.000001 3133515
We now have a full backup ready to be used.
173
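Both files shown above are plain text and easy to parse from a backup script; a sketch pulling the cluster UUID and seqno out of xtrabackup_galera_info (the value is copied from the output above):

```shell
# xtrabackup_galera_info holds "uuid:seqno"; value copied from above.
galera_info="b55685a3-5f70-11e5-87f8-2f86c54ca425:1945"
uuid=${galera_info%:*}     # strip the ":seqno" suffix
seqno=${galera_info##*:}   # keep only the part after the last ":"
echo "cluster uuid=$uuid, last seqno=$seqno"
```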
175. Backups
Stream Binary Logs
Now set up mysqlbinlog to stream the binlogs into /root/binlogs.
As a requirement, ensure the following is configured:
log_slave_updates
server-id=__ID__
Get mysqlbinlog running:
# mkdir /root/binlogs
# mysql -BN -e "show binary logs" | head -n1 | cut -f1
pxc3-bin.000001
# mysqlbinlog --read-from-remote-server --host=127.0.0.1
--raw --stop-never --result-file=/root/binlogs/ pxc3-bin.000001 &
175
177. Backups
Point-in-Time Recovery
On pxc2 we update a record:
pxc2 mysql> update sbtest.sbtest1 set pad = "PLAM2015" where id = 999;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
And now it's time to break things: on pxc2, TRUNCATE the sbtest1
table.
pxc2 mysql> truncate table sbtest.sbtest1;
Query OK, 0 rows affected (0.06 sec)
177
180. Backups
Point-in-Time Recovery
BROKEN! What now?
Let's stop MySQL on all nodes and restore from backup.
# systemctl stop mysql # on non-bootstrapped nodes
# systemctl stop mysql@bootstrap # on the bootstrapped node
Restore the backup on pxc3:
[root@pxc3 ~]# rm -rf /var/lib/mysql/*
[root@pxc3 ~]# xtrabackup --copy-back --target-dir=/root/backups/
[root@pxc3 ~]# chown mysql. -R /var/lib/mysql/
On pxc3, we bootstrap a completely new cluster:
[root@pxc3 ~]# systemctl start mysql@bootstrap
180
182. Backups
Point-in-Time Recovery
The full backup is restored, now we need to do point in time recovery....
Find the position of the "event" that caused the problems
We know the sbtest.sbtest1 table got truncated. Let's find that
statement:
[root@pxc3 ~]# mysqlbinlog /root/binlogs/pxc3-bin.* | grep -i truncate -B10
OC05MzAyOTU2MDQzOC0xNzU5MDQyMTM1NS02MDYyOTQ1OTk1MC0wODY4ODc0NTg2NTCjjIc=
'/*!*/;
# at 13961536
#150920 8:33:15 server id 1 end_log_pos 13961567 CRC32 0xc97eb41f Xid = 8667
COMMIT/*!*/;
# at 13961567
#150920 8:33:15 server id 2 end_log_pos 13961659 CRC32 0x491d7ff8 Query threa
SET TIMESTAMP=1442737995/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
truncate table sbtest.sbtest1
182
185. Backups
Point-in-Time Recovery
We need to recover up to the TRUNCATE TABLE, which was at position
13961567
We can replay the binary log(s) from the last position we backed up
# cat /root/backups/xtrabackup_binlog_info
pxc3-bin.000001 3133515
Note that if we didn't stream the binary logs from the backup
server (it can happen), you need to find the position from the Xid,
which is the Galera seqno:
#150920 8:33:15 server id 1 end_log_pos 13961567 CRC32 0xc97eb41f Xid = 8667
COMMIT/*!*/;
# at 13961567
185
188. Backups
Point-in-Time Recovery
Let's replay it now:
# mysqlbinlog /root/binlogs/pxc3-bin.000001
--start-position=3133515 --stop-position=13961567 | mysql
Let's Verify:
pxc3 mysql> select id, pad from sbtest.sbtest1 where id =999;
+-----+----------+
| id | pad |
+-----+----------+
| 999 | PLAM2015 |
+-----+----------+
1 row in set (0.00 sec)
You can now restart the other nodes and they will perform
SST.
188
190. Load Balancers
With PXC a Load Balancer is commonly used:
Layer 4
Lots of choice
Usually HAProxy (most common)
Layer 7:
MariaDB MaxScale
ProxySQL
ScaleArc (proprietary)
...
190
191. Load Balancers
Usually with Galera, people use a load balancer to route the MySQL
requests from the application to a node
Redirect writes to another node when problems happen
Mostly 1 node for writes, others for reads
Layer 4: 1 TCP port writes, 1 TCP port reads
Layer 7: Automatic (challenging)
191
192. Load Balancers
HAProxy
On pxc1, we have HAProxy configured like this, listening on port
3308:
## active-passive
listen 3308-active-passive-writes 0.0.0.0:3308
mode tcp
balance leastconn
option httpchk
server pxc1 pxc1:3306 check port 9200 inter 1000 rise 3 fall 3
server pxc2 pxc2:3306 check port 9200 inter 1000 rise 3 fall 3 backup
server pxc3 pxc3:3306 check port 9200 inter 1000 rise 3 fall 3 backup
192
194. Load Balancers
HAProxy
On pxc2 and pxc3, we connect to the load balancer and run a SELECT:
mysql -h pxc1 -P 3308 -utest -ptest -e "select @@wsrep_node_name, sleep(10)"
On pxc1, while the commands are running, we check the processlist:
pxc1 mysql> SELECT PROCESSLIST_ID AS id, PROCESSLIST_USER AS user,
PROCESSLIST_HOST AS host, PROCESSLIST_INFO
FROM performance_schema.threads
WHERE PROCESSLIST_INFO LIKE 'select @% sleep%';
+------+------+------+-------------------------------------+
| id | user | host | PROCESSLIST_INFO |
+------+------+------+-------------------------------------+
| 294 | test | pxc1 | select @@wsrep_node_name, sleep(10) |
| 297 | test | pxc1 | select @@wsrep_node_name, sleep(10) |
+------+------+------+-------------------------------------+
What do you notice ?
194
196. Load Balancers
HAProxy & Proxy Protocol
Since Percona XtraDB Cluster 5.6.25-73.1 we support proxy
protocol!
Let's enable this in my.cnf on all 3 nodes:
[mysqld]
...
proxy_protocol_networks=*
...
196
197. Load Balancers
HAProxy & Proxy Protocol
Restart them one by one:
[root@pxc1 ~]# systemctl restart mysql
...
[root@pxc2 ~]# systemctl restart mysql
...
[root@pxc3 ~]# systemctl restart mysql
On the bootstrapped node, you have to do:
[root@pxc1 ~]# systemctl stop mysql@bootstrap
[root@pxc1 ~]# systemctl start mysql
197
198. Load Balancers
HAProxy & Proxy Protocol
On pxc1, we have HAProxy configured like this, listening on port
3310 to support proxy protocol:
listen 3310-active-passive-writes 0.0.0.0:3310
mode tcp
balance roundrobin
option httpchk
server pxc1 pxc1:3306 send-proxy-v2 check port 9200 inter 1000 rise 3 fall 3
server pxc2 pxc2:3306 send-proxy-v2 check port 9200 inter 1000 backup
server pxc3 pxc3:3306 send-proxy-v2 check port 9200 inter 1000 backup
And restart HAProxy:
systemctl restart haproxy
198
199. Load Balancers
HAProxy & Proxy Protocol
On pxc2 and pxc3, we connect to the load balancer (using the new port)
and run a SELECT:
mysql -h pxc1 -P 3310 -utest -ptest -e "select @@wsrep_node_name, sleep(10)"
And on pxc1, while the previous command is running, we check the
processlist:
pxc1 mysql> SELECT PROCESSLIST_ID AS id, PROCESSLIST_USER AS user,
PROCESSLIST_HOST AS host, PROCESSLIST_INFO
FROM performance_schema.threads
WHERE PROCESSLIST_INFO LIKE 'select @% sleep%';
+------+------+------+-------------------------------------+
| id | user | host | PROCESSLIST_INFO |
+------+------+------+-------------------------------------+
| 75 | test | pxc2 | select @@wsrep_node_name, sleep(10) |
| 76 | test | pxc3 | select @@wsrep_node_name, sleep(10) |
+------+------+------+-------------------------------------+
199
202. Load Balancers
HAProxy & Proxy Protocol
Try to connect from pxc1 to pxc1, not using a load balancer:
pxc1 # mysql -h pxc1 -P 3306
What happens?
You can't connect to MySQL anymore.
When proxy_protocol_networks is enabled, connections that don't
send the proxy protocol header are refused!
Let's clean up proxy_protocol_networks and restart all nodes
before continuing.
202
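For reference, this is what the human-readable v1 form of the proxy protocol header looks like (addresses and ports here are made up for illustration; the send-proxy-v2 option in the HAProxy config actually sends the binary v2 form):

```shell
# Build a PROXY protocol v1 header like a load balancer would prepend.
# Source/destination addresses and ports are hypothetical.
src=192.168.70.2; dst=192.168.70.1; sport=45678; dport=3306
header=$(printf 'PROXY TCP4 %s %s %s %s' "$src" "$dst" "$sport" "$dport")
echo "$header"
```

This is how the server behind the proxy learns the real client address, which is what makes the processlist show pxc2/pxc3 instead of pxc1.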
205. Load Balancers
Maxscale - R/W split
Maxscale is configured to listen on port 4006 for read/write split
Start MaxScale:
pxc1# systemctl start maxscale
To find which server is the master, run:
pxc1# maxadmin -pmariadb list servers
205
207. Load Balancers
Maxscale - R/W split
Let's verify how it balances our write queries:
for i in `seq 1 10`
do
mysql -BN -h pxc1 -P 4006 -u test -ptest
-e "select @@hostname; select sleep(2);" 2>/dev/null &
done
Now try the same, but within a transaction:
for i in `seq 1 10`
do
mysql -BN -h pxc1 -P 4006 -u test -ptest
-e "begin; select @@hostname; select sleep(2); commit;" 2>/dev/null &
done
207
209. Load Balancers
Maxscale - R/W split
We can observe that the node considered as master doesn't
receive any reads. This is how Maxscale has been designed: all slaves get
the reads and the master gets the writes.
Every transaction is considered a possible WRITE.
What about READ ONLY TRANSACTIONS?
Test the following:
pxc1# mysql -h pxc1 -P 4006 -u test -ptest
mysql> select @@hostname;
mysql> start transaction read only;
mysql> select @@hostname;
209
210. Load Balancers
Maxscale - R/W split
You can also send reads to the master, in the service section of
/etc/maxscale.cnf:
[RW Split]
router_options=master_accept_reads=true
210
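In context, that option sits inside the service section; a hypothetical minimal fragment (the section name matches the slide, the other keys are assumptions about this setup):

```ini
# Hypothetical /etc/maxscale.cnf fragment
[RW Split]
type=service
router=readwritesplit
router_options=master_accept_reads=true
```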
212. Kenny 'kenji' Gryp & Frédéric 'lefred' Descamps
Advanced Percona XtraDB Cluster
in a nutshell... la suite
Q&A
Remember to rate our talk on the mobile app!
212