Consistency between Engine and Binlog
under Reduced Durability
Yoshinori Matsunobu
Production Engineer, Facebook
Jan/2020
What we want to do
▪ When slave or master instances fail and recover, we want them to
rejoin the replication chain (replica set) instead of being dropped and
rebuilt
▪ Imagine a 10-minute network outage in one Availability Zone: we
want to recover the MySQL instances in that AZ
Agenda
▪ When binlog and storage engine consistency gets broken
▪ What can go wrong when restarting a replica
▪ What can go wrong when restarting a master
▪ Challenges of supporting multiple transactional storage engines
Consistency between binlog and engine
▪ MySQL separates Replication logs (Binary Logs) and Transactional Storage Engine logs
(InnoDB/MyRocks/NDB)
▪ The two are coordinated internally with XA
▪ Commit ordering:
▪ Binlog Prepare (doing nothing)
▪ Engine Prepare (in parallel)
▪ Binlog Commit (ordered)
▪ Engine Commit (ordered, if binlog_order_commits==ON)
▪ If the MySQL instance or host dies in between, Engine and Binlog may become inconsistent
▪ Inconsistency is more likely when operating with reduced durability (sync_binlog != 1 and
innodb_flush_log_at_trx_commit != 1)
▪ Some binlog events that were persisted in the engine may be lost
▪ The engine may lose some transactions that were persisted in the binlog
▪ This talk is about how to address consistency issues under reduced durability
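The two failure directions listed above can be illustrated with a toy model. This is a simplified sketch, not MySQL's recovery logic: it assumes transactions are numbered 1..N and that a crash drops an unflushed tail from each log independently. The names `classify` and `after_crash` are hypothetical.

```python
def classify(binlog_last: int, engine_last: int) -> str:
    """Name the inconsistency case from the last transaction each side kept."""
    if binlog_last < engine_last:
        return "binlog < engine"
    if binlog_last > engine_last:
        return "binlog > engine"
    return "consistent"


def after_crash(committed: int, binlog_lost: int, engine_lost: int):
    """Toy model (assumption): each log loses its own unflushed tail on crash.

    committed   -- last transaction number committed before the crash
    binlog_lost -- trailing transactions not yet synced to the binlog file
    engine_lost -- trailing transactions not yet flushed to the engine log
    """
    binlog_last = committed - binlog_lost
    engine_last = committed - engine_lost
    return binlog_last, engine_last, classify(binlog_last, engine_last)
```

For example, `after_crash(100, binlog_lost=2, engine_lost=1)` gives a binlog ending at 98 and an engine ending at 99, which is the "Binlog < Engine" case discussed next.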
5.6 Single Threaded Slave, Binlog < Engine
▪ An unplanned OS reboot on a slave may leave an inconsistent state
between Binlog GTID sets and Engine state
▪ A big question is whether the slave can continue replication with START
SLAVE, without being entirely replaced
▪ Transactional Storage Engines (both InnoDB and MyRocks) store the
last committed GTID, which is visible in the
mysql.slave_relay_log_info table. This table is updated on each
commit to the engine
▪ With a Single Threaded Slave, you don’t have to think about
out-of-order execution
▪ Run with relay_log_recovery=1
▪ The slave discards relay logs and restarts replication from the master
at the engine max GTID position
▪ Skips execution in engine if GTID < slave_relay_log_info
▪ Skips writing binlog events if binlog GTID overlaps
Master: GTID 1-100
Replica: Binlog GTID 1-98, Engine Max GTID 99
5.6 Single Threaded Slave, Binlog > Engine
▪ Replication resumes from GTID 95 or earlier
▪ GTIDs 96-98 are executed in the engine, but their binlog events
are not written again
▪ Normal replication flow continues from GTID 99
Master: GTID 1-100
Replica: Binlog GTID 1-98, Engine Max GTID 95
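The skip rules for a single threaded slave can be sketched as a per-transaction decision. This is an illustration under stated assumptions, not the server's code: GTIDs are modeled as plain sequence numbers, catch-up starts just after the smaller of the two positions, and the engine max GTID itself is treated as already applied. `plan_apply` and `catch_up` are hypothetical names.

```python
def plan_apply(gtid: int, engine_max: int, binlog_max: int):
    """Return (execute_in_engine, write_binlog) for one replicated GTID."""
    execute_in_engine = gtid > engine_max  # engine already has <= engine_max
    write_binlog = gtid > binlog_max       # binlog already has <= binlog_max
    return execute_in_engine, write_binlog


def catch_up(master_max: int, engine_max: int, binlog_max: int):
    """Plan every GTID from the first one missing on either side."""
    start = min(engine_max, binlog_max) + 1
    return {
        gtid: plan_apply(gtid, engine_max, binlog_max)
        for gtid in range(start, master_max + 1)
    }
```

With `engine_max=95` and `binlog_max=98`, GTIDs 96-98 map to `(True, False)` (execute, no binlog write) and 99-100 map to `(True, True)`, matching the slide above.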
Multi Threaded Slave
Master: GTID 1-100
Replica: Binlog GTID 1-98, Engine Max GTID 95
▪ mysql.slave_relay_log_info stores only the max
executed GTID in the instance
▪ Under parallel (per-database) execution, MySQL cannot tell
whether GTID 94 is in the engine
▪ Execution order might be 91 -> 92 -> 95
▪ In upstream 5.6, consistency can’t be guaranteed
5.7 gtid_executed table
Master: GTID 1-100
Replica: Binlog GTID 1-98, gtid_executed table 1-93, 95-98
▪ The 5.7 gtid_executed table stores GTID sets in InnoDB
(crash safe)
▪ However, executed GTIDs are not updated on each
commit
▪ The table is updated on binlog rotation
▪ If it were updated on each commit, you could tell whether
GTID 94 is there. (Right now you can’t.)
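To see why a full GTID set answers the "is 94 there?" question while a single max GTID cannot, here is a small sketch. It assumes a simplified interval notation such as `1-93, 95-98` without the server UUID prefix that real GTID sets carry; `parse_intervals` and `gaps` are hypothetical helpers.

```python
def parse_intervals(text: str):
    """Parse '1-93, 95-98' into a sorted list of (start, end) tuples."""
    intervals = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        # A single number such as '5' is treated as the interval 5-5.
        intervals.append((int(start), int(end or start)))
    return sorted(intervals)


def gaps(intervals):
    """Return transaction numbers missing between consecutive intervals."""
    missing = []
    for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
        missing.extend(range(prev_end + 1, next_start))
    return missing
```

`gaps(parse_intervals("1-93, 95-98"))` returns `[94]`, whereas knowing only the maximum (98) says nothing about 94.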
FB Extension: Slave Idempotent Recovery
- Start replication from an old enough binlog GTID
- Re-execute binlog events against the engine, ignoring
all duplicate key / row not found errors during
catchup
- Eventual consistency
- Must use RBR, and tables must have primary keys
Master: GTID 1-100
Replica: Binlog GTID 1-98, Engine GTID state empty
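The idea of idempotent recovery can be sketched with a table modeled as a dict keyed by primary key. This is a toy illustration, not the FB MySQL implementation: the event format `(op, pk, row)`, the `ApplyError` class, and the `apply_event` / `replay` functions are all assumptions made for the example.

```python
class ApplyError(Exception):
    """Raised for duplicate key / row not found when not in idempotent mode."""


def apply_event(table: dict, event: tuple, idempotent: bool = False) -> None:
    """Apply one row event (op, pk, row) to a table keyed by primary key."""
    op, pk, row = event
    exists = pk in table
    if op == "insert" and exists:
        conflict = "duplicate key"
    elif op in ("update", "delete") and not exists:
        conflict = "row not found"
    else:
        conflict = None
    if conflict:
        if idempotent:
            return  # ignore the error during catchup and move on
        raise ApplyError(f"{conflict}: pk={pk}")
    if op == "delete":
        del table[pk]
    elif op in ("insert", "update"):
        table[pk] = row  # row-based: the event carries the full row image
    else:
        raise ValueError(f"unknown operation: {op}")


def replay(table: dict, events: list, idempotent: bool = False) -> dict:
    """Apply events in binlog order and return the table."""
    for event in events:
        apply_event(table, event, idempotent)
    return table
```

Replaying the whole event stream in idempotent mode over a table that already contains a prefix of it converges to the same final state as a clean apply, which is why primary keys and row-based events are required.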
What can go wrong when restarting master
▪ The master may go down unexpectedly for various reasons
▪ Segfaults (SIG=11), assertions (SIG=6), forced kills (SIG=9), out of
memory
▪ Kernel panic
▪ Power outages, with a restart after a while
▪ Nowadays dead master promotion kicks in (Orchestrator, MHA)
▪ The question is whether the failed master can restart replication from the new master
▪ The dead master may come back before dead master promotion
▪ If the master lost some transactions that were already replicated, replicas may
not be able to continue replication
Master Promotion happening, Binlog < Engine
▪ “Loss-Less Semi-Synchronous Replication” guarantees that the semisync tailer gets binlog events before the master engine commit (so Engine on
orig master <= Binlog/Engine on new master)
▪ You need to start replication from the last GTID in the engine
▪ In this case, the GTID Executed Set on the master is 1-98, but replication should start after 99
▪ The master’s engine execution order is serialized (with binlog_order_commits=ON), so it’s guaranteed that 1-99 are in the engine
▪ However, this information is not visible from MySQL commands (only printed in the error log)
▪ Feature Request to Oracle: InnoDB should add an information_schema table to print the current last committed GTID, binlog file and position
▪ With Slave Idempotent Recovery, fetching the last committed GTID can be skipped, which simplifies automation.
Before promotion:
Instance 1 (Master): Binlog 1-98, Engine 1-99
Instance 2 (Replica): Binlog 1-98
Instance 3 (Replica): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
After promotion:
Instance 1 (Dead): Binlog 1-98, Engine 1-99
Instance 2 (Replica): Binlog 1-98
Instance 3 (Master): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
Master Promotion happening, Binlog > Engine
Before promotion:
Instance 1 (Master): Binlog 1-100, Engine 1-98
Instance 2 (Replica): Binlog 1-98
Instance 3 (Replica): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
After promotion:
Instance 1 (Dead): Binlog 1-100, Engine 1-98
Instance 2 (Replica): Binlog 1-98
Instance 3 (Master): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
“100” should be discarded before replicating from the new master (instance 3)
InnoDB: Last binlog file position 79143, file name binlog.000005
InnoDB: Last MySQL Gtid UUID:98
▪ Binlog GTID 100 is on instance 1 only, and is not acked to the client (with lossless semisync)
▪ If the original master (instance 1) applies Binlog 100, it can’t rejoin as a replica
▪ We need some way to avoid applying GTID 100 during recovery
FB Extension: Server Side Binlog Truncation
▪ At instance startup, truncate binlog events that don’t exist in the storage
engine
▪ The end of the binlog is then at or before the engine’s last committed GTID
▪ The original binlog file is retained as a backup
▪ All prepared-state transactions in storage engines will be rolled back
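The truncation rule can be sketched on an in-memory model of the binlog. This is only an illustration of the rule stated above, under the assumptions that events are `(gtid, payload)` tuples in binlog order and that GTIDs are plain sequence numbers; `truncate_binlog` is a hypothetical name and does not touch real binlog files.

```python
def truncate_binlog(events: list, engine_last_gtid: int):
    """Drop binlog events that the engine never committed.

    events           -- list of (gtid, payload) tuples in binlog order
    engine_last_gtid -- last GTID the storage engine reports as committed
    Returns (kept, backup): the truncated binlog and an untouched copy
    of the original, mirroring "retain the original file as a backup".
    """
    backup = list(events)
    kept = [event for event in events if event[0] <= engine_last_gtid]
    return kept, backup
```

For the scenario on the previous slide (binlog 1-100, engine 1-98), the kept binlog ends at GTID 98 and the backup still holds all 100 events.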
Master Promotion not happening
Before restart:
Instance 1 (Master): Binlog 1-100, Engine 1-98
Instance 2 (Replica): Binlog 1-98
Instance 3 (Replica): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
After recovery:
Instance 1 (Recovered): Binlog 1-98, Engine 1-98
Instance 2 (Replica): Binlog 1-98
Instance 3 (Replica): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
▪ An unplanned reboot on the master may lose transactions that were already replicated to slaves
▪ Instance 1 should not serve write requests until it has caught up Binlog GTID 99 from instance 3
Common Replica errors
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from
binary log: 'Slave has more GTIDs than the master has, using the
master's SERVER_UUID. This may indicate that the end of the binary log
was truncated or that the last binary log file was lost, e.g., after a
power or disk failure when sync_binlog != 1. The master may or may not
have rolled back transactions that were already replica’
▪ Set read_only=1 by default
▪ Find the most advanced slave, catch up from there, then start serving write requests
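The "find the most advanced slave, catch up from there" step can be sketched with GTID sets modeled as Python sets of sequence numbers. This is an assumption-laden illustration: `most_advanced` and `missing_from` are hypothetical helpers, and real automation would compare server GTID sets rather than integers.

```python
def most_advanced(binlogs: dict):
    """Return the instance whose GTID set contains every other instance's set.

    binlogs maps an instance name to its set of binlogged GTIDs.
    Returns None when no single instance is a superset of all the others.
    """
    for name, gtids in binlogs.items():
        if all(gtids >= other for other in binlogs.values()):
            return name
    return None


def missing_from(target: set, source: set) -> list:
    """GTIDs the target must fetch from the source before serving writes."""
    return sorted(source - target)
```

With the instances from the "Master Promotion not happening" slide, instance 3 is the most advanced and the recovered instance 1 is missing only GTID 99.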
Dual Engine Consistency
▪ Binlog GTID Sets
▪ InnoDB
▪ MyRocks
▪ Binlog, InnoDB and MyRocks (or NDB) need to be consistent
▪ Binlog: GTID 1-200, InnoDB: GTID 190, MyRocks: GTID 197
▪ It is unclear if 191-196 are committed
▪ Roll back all prepared transactions (server side binlog truncation)
▪ Idempotent recovery
▪ Recover from binlogs on semi-sync replica
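The ambiguous range in the dual engine example can be computed directly. This is a small sketch assuming each engine reports only its max committed GTID as a plain number; `unclear_gtids` is a hypothetical name.

```python
def unclear_gtids(engine_max: dict) -> range:
    """GTIDs strictly between the lowest and highest engine max GTID.

    engine_max maps an engine name to its last committed GTID.
    For these GTIDs it is unknown whether every engine committed them.
    """
    lowest = min(engine_max.values())
    highest = max(engine_max.values())
    return range(lowest + 1, highest)
```

For `{"InnoDB": 190, "MyRocks": 197}` this yields 191 through 196, the range the slide calls unclear.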
Dual Engine consistency without binlog
▪ 8.0 DDL is transactional
▪ Table metadata info is stored in InnoDB
▪ It is common to run DDL outside of replication
▪ FB OSC changes schema without binlog
▪ MyRocks table changes made without binlog may end up inconsistent
▪ There is no binlog to fix the inconsistency
▪ DDL validation is our current workaround
Summary
▪ MySQL needs to be aware of executed engine GTID sets
▪ With low update costs
▪ Upstream MySQL doesn’t have this yet. It would be a nice feature
▪ We worked around it with Slave Idempotent Recovery
▪ Binlog Truncation during recovery, so that an old master can rejoin
as a replica
