Consistency between Engine and Binlog under Reduced Durability
1. Consistency between Engine and Binlog under Reduced Durability
Yoshinori Matsunobu
Production Engineer, Facebook
Jan/2020
2. What we want to do
▪ When slave or master instances fail and recover, we want to make
them rejoin the replication chain (replica set), instead of dropping and
rebuilding
▪ Imagine a 10-minute network outage in one Availability Zone; we
want to recover the MySQL instances in that AZ
3. Agenda
▪ When binlog and storage engine consistency gets broken
▪ What can go wrong on restarting replica
▪ What can go wrong on restarting master
▪ Challenges to support multiple transactional storage engines
4. Consistency between binlog and engine
▪ MySQL separates Replication logs (Binary Logs) and Transactional Storage Engine logs
(InnoDB/MyRocks/NDB)
▪ Internally handles XA
▪ Commit ordering:
▪ Binlog Prepare (doing nothing)
▪ Engine Prepare (in parallel)
▪ Binlog Commit (ordered)
▪ Engine Commit (ordered, if binlog_order_commits==ON)
▪ If MySQL instance or host dies in between, Engine and Binlog might become inconsistent
▪ The possibility of inconsistency is higher when operating with reduced durability (sync_binlog != 1 and
innodb_flush_log_at_trx_commit != 1)
▪ Some binlog events that were persisted in engine may be lost
▪ Engine may lose some transactions that were persisted in binlog
▪ This talk is about how to address consistency issues under reduced durability
5. 5.6 Single Threaded Slave, Binlog < Engine
▪ An unplanned OS reboot on a slave may end up in an inconsistent state
between the Binlog GTID sets and the Engine state
▪ A big question is whether the slave can continue replication via START
SLAVE, without being entirely replaced
▪ Transactional Storage Engines (both InnoDB and MyRocks) store
last committed GTID, and it is visible from
mysql.slave_relay_log_info table. This table is updated for each
commit to the engine
▪ With Single Threaded Slave, you don’t have to think about out of
order execution
▪ Run with relay_log_recovery=1
▪ The slave discards its relay logs and restarts replication from the
master, starting at the engine's max GTID position
▪ Skips execution in engine if GTID < slave_relay_log_info
▪ Skips writing binlog events if binlog GTID overlaps
(Diagram) Master: GTID 1-100 | Replica: Binlog GTID 1-98, Engine Max GTID 99
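The recovery rule above can be sketched as a small decision function (a hypothetical illustration, not MySQL's actual code): after relay_log_recovery=1 discards the relay logs, replication restarts from the engine's max GTID, and each event is executed or logged only where it is still missing.

```python
def recovery_actions(engine_max, binlog_max, incoming_gtids):
    """Decide, per incoming GTID (sequence numbers of one server UUID),
    whether to execute it in the engine and whether to write its binlog
    event. Illustrative only."""
    actions = []
    for gtid in incoming_gtids:
        actions.append({
            "gtid": gtid,
            "apply_to_engine": gtid > engine_max,  # skip if already committed
            "write_to_binlog": gtid > binlog_max,  # skip if already logged
        })
    return actions

# Replica state from the slide: Binlog GTID 1-98, Engine Max GTID 99
acts = recovery_actions(engine_max=99, binlog_max=98, incoming_gtids=[99, 100])
print(acts[0])  # GTID 99: skip engine execution, write the binlog event
print(acts[1])  # GTID 100: apply to engine and write to binlog
```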
6. 5.6 Single Threaded Slave, Binlog > Engine
▪ Replication will continue from GTID 95 or
earlier
▪ GTIDs 96-98 are executed in the engine, but
their binlog events are not written again
▪ Normal replication flow continues after 99
(Diagram) Replica: Binlog GTID 1-98, Engine Max GTID 95 | Master: GTID 1-100
7. Multi Threaded Slave
(Diagram) Master: GTID 1-100 | Replica: Binlog GTID 1-98, Engine Max GTID 95
▪ mysql.slave_relay_log_info stores only max
executed GTID in the instance
▪ Under parallel database execution, MySQL has no
idea whether GTID 94 is in the engine or not
▪ Execution order might be 91 -> 92 -> 95
▪ In upstream 5.6, you can’t guarantee consistency
8. 5.7 gtid_executed table
(Diagram) Replica: Binlog GTID 1-98, gtid_executed table: 1-93, 95-98 | Master: GTID 1-100
▪ 5.7 gtid_executed table stores GTID sets in InnoDB
(crash safe)
▪ However, executed GTIDs are not updated on each
commit
▪ The table is updated on binlog rotation
▪ If it were updated on each commit, you could figure out
whether GTID 94 is there or not (you can't right now)
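The limitation can be shown with a toy comparison (hypothetical helpers, not MySQL APIs): a single max executed GTID leaves sub-maximum GTIDs ambiguous under parallel execution, while a per-commit GTID interval set answers membership exactly.

```python
def in_engine_with_max(max_gtid, gtid):
    # At or below the max, membership is ambiguous under parallel
    # execution; above it, the GTID is definitely not committed.
    return None if gtid <= max_gtid else False

def in_engine_with_set(gtid_intervals, gtid):
    # With (lo, hi) intervals updated per commit, membership is exact.
    return any(lo <= gtid <= hi for lo, hi in gtid_intervals)

executed = [(1, 93), (95, 98)]  # gtid_executed table from the slide
print(in_engine_with_max(98, 94))        # None -- unknowable from max alone
print(in_engine_with_set(executed, 94))  # False -- 94 is definitely missing
```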
9. FB Extension: Slave Idempotent Recovery
▪ Start replication from an old enough binlog GTID
▪ Re-execute binlog events against the engine, ignoring
all duplicate-key / row-not-found errors during
catch-up
▪ Eventual consistency
▪ Must use RBR, and tables must have primary keys
(Diagram) Master: GTID 1-100 | Replica: Binlog GTID 1-98, Engine GTID state: empty
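The idempotent-apply behavior can be sketched with a toy in-memory model keyed by primary key (illustrative, not the actual FB patch): with RBR and primary keys, replayed row events are absorbed instead of erroring, so re-execution converges to the same final state.

```python
def apply_row_event(table, event):
    """table: dict keyed by primary key; event: (op, pk, row)."""
    op, pk, row = event
    if op == "insert":
        table[pk] = row      # duplicate key -> overwrite instead of error
    elif op == "update":
        table[pk] = row      # row not found -> apply the after-image anyway
    elif op == "delete":
        table.pop(pk, None)  # row not found -> ignore

table = {}
events = [
    ("insert", 1, "a"), ("insert", 1, "a"),  # duplicate replay
    ("update", 2, "b"),                      # row not found
    ("delete", 3, None),                     # row not found
]
for ev in events:
    apply_row_event(table, ev)
print(table)  # {1: 'a', 2: 'b'} -- replays converge
```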
10. What can go wrong when restarting master
▪ Master may go down unexpectedly for various reasons
▪ Hitting segfaults (SIG=11), assertion (SIG=6), forcing kill (SIG=9), out of
memory
▪ Kernel panic
▪ Power outage, then restarted after a while
▪ Nowadays dead master promotion kicks in (Orchestrator, MHA)
▪ The question is whether the failed master can restart replication from the new master
▪ The dead master may come back before dead master promotion completes
▪ If the master lost some transactions that are already replicated, replicas may
not be able to continue replication
11. Master Promotion happening, Binlog < Engine
▪ “Loss-Less Semi-Synchronous Replication” guarantees that the semisync tailer receives binlog events before the master’s engine commit (so Engine on the
original master <= Binlog/Engine on the new master)
▪ You need to start replication from the last GTID in the engine
▪ In this case, the GTID Executed Set on the master is 1-98, but replication should start after 99
▪ The master’s engine execution order is serialized (with binlog-order-commits=1), so it’s guaranteed that 1-99 are in the engine
▪ However, this information is not visible from MySQL commands (only printed in err log)
▪ Feature Request to Oracle: InnoDB should add information_schema to print current committed last GTID, binlog file and position
▪ With Slave Idempotent Recovery, fetching the last committed GTID can be skipped, so automation can be simplified further.
(Diagram, before failover) Instance 1 (Master): Binlog 1-98, Engine 1-99 | Instance 2 (Replica): Binlog 1-98 | Instance 3 (Replica): Binlog 1-99 | Instance 4 (Replica): Binlog 1-98
(After failover) Instance 1 (Dead): Binlog 1-98, Engine 1-99 | Instance 2 (Replica): Binlog 1-98 | Instance 3 (Master): Binlog 1-99 | Instance 4 (Replica): Binlog 1-98
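The rule above can be sketched minimally (illustrative, assuming engine commits are serialized with binlog-order-commits=1): since everything up to the engine's last committed GTID is durable, the old master resumes from the next GTID, even though its binlog GTID set ends earlier.

```python
def replication_start(engine_max):
    # Everything up to engine_max is durably in the engine (serialized
    # commit order); fetch what follows from the new master.
    return engine_max + 1

# Slide 11 state: Binlog GTID set 1-98, engine last committed GTID 99
print(replication_start(99))  # 100 -- start fetching from the new master here
```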
12. Master Promotion happening, Binlog > Engine
(Diagram, before failover) Instance 1 (Master): Binlog 1-100, Engine 1-98 | Instance 2 (Replica): Binlog 1-98 | Instance 3 (Replica): Binlog 1-99 | Instance 4 (Replica): Binlog 1-98
(After failover) Instance 1 (Dead): Binlog 1-100, Engine 1-98 | Instance 2 (Replica): Binlog 1-98 | Instance 3 (Master): Binlog 1-99 | Instance 4 (Replica): Binlog 1-98
“100” should be discarded before replicating from the new master (instance 3)
InnoDB: Last binlog file position 79143, file name binlog.000005
InnoDB: Last MySQL Gtid UUID:98
▪ Binlog GTID 100 is on instance 1 only and was not acked to the client (with loss-less semisync)
▪ If the original master (instance 1) applies Binlog GTID 100, it can’t rejoin as a replica
▪ We need a way not to apply GTID 100 during recovery
13. FB Extension: Server Side Binlog Truncation
▪ At instance startup, truncate binlog events that don’t exist in the storage
engine
▪ The end-of-binlog position becomes the same as or smaller than the engine’s last committed GTID
▪ Retaining original binlog file as a backup
▪ All of the prepared state transactions in storage engines will be rolled back
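The truncation decision can be sketched as follows (names are illustrative, not the server's code): trailing binlog GTIDs beyond the engine's last committed GTID are dropped, while the original file is retained as a backup.

```python
def truncate_binlog(binlog_gtids, engine_last_committed):
    """Split binlog GTIDs into those kept and those truncated away."""
    keep = [g for g in binlog_gtids if g <= engine_last_committed]
    dropped = [g for g in binlog_gtids if g > engine_last_committed]
    return keep, dropped

# Slide 12 state: Binlog 1-100, Engine 1-98
keep, dropped = truncate_binlog(list(range(1, 101)), 98)
print(dropped)  # [99, 100] -- re-fetched later from the new master
```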
14. Master Promotion not happening
(Diagram, before restart) Instance 1 (Master): Binlog 1-100, Engine 1-98 | Instance 2 (Replica): Binlog 1-98 | Instance 3 (Replica): Binlog 1-99 | Instance 4 (Replica): Binlog 1-98
(After recovery) Instance 1 (Recovered): Binlog 1-98, Engine 1-98 | Instance 2 (Replica): Binlog 1-98 | Instance 3 (Replica): Binlog 1-99 | Instance 4 (Replica): Binlog 1-98
▪ An unplanned reboot on the master may end up losing transactions that were already replicated to slaves
▪ Instance 1 should not serve write requests until it catches up to Binlog GTID 99 from instance 3
15. Common Replica errors
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from
binary log: 'Slave has more GTIDs than the master has, using the
master's SERVER_UUID. This may indicate that the end of the binary log
was truncated or that the last binary log file was lost, e.g., after a
power or disk failure when sync_binlog != 1. The master may or may not
have rolled back transactions that were already replica’
▪ Set read_only=1 by default
▪ Find the most advanced slave, catch up from there, then start serving write requests
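The recovery procedure above can be sketched like this (instance names follow the slides; the helper is illustrative): stay read-only, find the most advanced replica by executed GTID, and catch up from it before serving writes.

```python
def most_advanced(replicas):
    """replicas: mapping of instance name -> max executed GTID."""
    return max(replicas, key=replicas.get)

replicas = {"instance2": 98, "instance3": 99, "instance4": 98}
source = most_advanced(replicas)
print(source)  # instance3 -- it holds GTID 99, so catch up from it first
```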
16. Dual Engine Consistency
▪ Binlog GTID Sets
▪ InnoDB
▪ MyRocks
▪ Binlog, InnoDB and MyRocks (or NDB) need to be consistent
▪ Binlog: GTID 1-200, InnoDB: GTID 190, MyRocks: GTID 197
▪ It is unclear whether 191-196 are committed
▪ Roll back all prepared transactions (server side binlog truncation)
▪ Idempotent recovery
▪ Recover from binlogs on semi-sync replica
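The dual-engine ambiguity can be sketched as follows (illustrative helper, not a server API): with InnoDB at GTID 190 and MyRocks at 197, transactions between the two engines' last commits may belong to either engine, so their commit status is unknown and must be resolved by one of the recovery methods above.

```python
def uncertain_range(engine_last_gtids):
    """engine_last_gtids: mapping of engine name -> last committed GTID.
    Returns the (inclusive) GTID range whose commit status is unknown,
    or None if there is no gap between the engines."""
    lo, hi = min(engine_last_gtids.values()), max(engine_last_gtids.values())
    return (lo + 1, hi - 1) if lo + 1 <= hi - 1 else None

print(uncertain_range({"innodb": 190, "myrocks": 197}))  # (191, 196)
```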
17. Dual Engine consistency without binlog
▪ 8.0 DDL is transactional
▪ Table metadata info is stored in InnoDB
▪ It is common to run DDL outside of replication
▪ FB OSC changes schema without binlog
▪ MyRocks table changes without binlog may end up in an inconsistent state
▪ There is no binlog to fix inconsistency
▪ DDL validation is our current workaround
18. Summary
▪ MySQL needs to be aware of executed engine GTID sets
▪ With low update costs
▪ Upstream MySQL doesn’t have this yet; it would be a nice feature
▪ We worked around it with Slave Idempotent Recovery
▪ Binlog Truncation during recovery, so that an old master can rejoin
as a replica