How people build software!
MySQL Infrastructure Testing Automation @ GitHub
Ike Walker
GitHub
Boston MySQL Meetup
December 11, 2017
Agenda
• About
• MySQL @ GitHub
• Automation
• Backup/restores
• Failovers
• Schema migrations
About me
• Database Architect
• Working with MySQL since 2006
• Organizer of Boston MySQL Meetup
github.com/ikewalker
@iowalker
GitHub
• The world’s largest Octocat T-shirt and stickers store
• And water bottles
• And hoodies
• We also do stuff related to things
• Word is new swag is coming up
GitHub
• 66M repositories
• 24M developers
• 117K businesses
• More than a million teams
• World’s largest open source hosting
• Alexa top 100
• Critical path in build flows
MySQL at GitHub
• GitHub stores repositories in git, and uses MySQL as the backend database for all related metadata:
  • Repository metadata, users, issues, pull requests, comments, etc.
• Website/API/Auth/more all use MySQL.
• We run a few clusters (a growing number), totaling over 100 MySQL servers.
• The setup isn’t very large, but it is very busy.
MySQL at GitHub
• Our MySQL servers must be available, responsive, and in a good state
• GitHub has a 99.95% SLA
• Availability issues must be handled quickly, and as automatically as possible.
github/database-infrastructure
• @ggunson, @jessbreckenridge, @jonahberquist, @shlomi-noach, @tomkrouper, @gtowey
• Concerned with:
  • Data availability
  • Data integrity
Testing
Backups/restores
that ^
Your data
It’s important
Restores
• Dedicated restore servers.
• One per cluster.
• Continuously restores, catches up with replication, restores, catches up with replication, restores, …
• Sends a “success” event at the end of each cycle.
• We monitor the number of “success” events in the past 24-ish hours, per cluster.
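The monitoring idea above can be sketched in a few lines of Python. This is only an illustration: the event tuple shape, the function names, and the "at least one success per window" threshold are assumptions, not GitHub's actual tooling.

```python
from datetime import datetime, timedelta


def restore_successes(events, clusters, now, window=timedelta(hours=24)):
    """Count "success" events per cluster inside the window ending at `now`.

    `events` is an iterable of (cluster, status, timestamp) tuples.
    This shape is hypothetical; the slides do not describe the event format.
    """
    counts = {cluster: 0 for cluster in clusters}
    for cluster, status, ts in events:
        in_window = now - window <= ts <= now
        if cluster in counts and status == "success" and in_window:
            counts[cluster] += 1
    return counts


def clusters_to_alert(counts, minimum=1):
    """Return clusters with fewer than `minimum` successes (assumed threshold)."""
    return sorted(c for c, n in counts.items() if n < minimum)
```

Every known cluster starts at zero, so a cluster whose restore server has gone silent still shows up in the alert list.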
auto-restore replicas
[Diagram labels: master, production replicas, backup replica, auto-restore replica]
Restores
• New host provisioning uses the same flow as restore.
• A human may kick off a restore/reclone manually.
• Chatops:
.mysql backup-restore -H restore.this.host -r
Restore failure
• A specific backup/restore may fail because computers.
• No reason for panic.
• Previous backups/restores are proven to work
• At most we lose time
• Two sequential failures, or failures across clusters, are incidents to be investigated
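The incident rule above could be expressed roughly as follows. This is a sketch under assumptions: the history format is hypothetical, and "failures across clusters" is read here as the latest restore failing on two or more clusters at once.

```python
def is_incident(history):
    """Decide whether restore failures warrant an investigation.

    `history` maps cluster name -> list of outcomes, oldest first,
    where True means the restore cycle succeeded (hypothetical format).
    """
    # Rule 1: two sequential failures on any single cluster.
    for outcomes in history.values():
        if len(outcomes) >= 2 and not outcomes[-1] and not outcomes[-2]:
            return True
    # Rule 2 (assumed reading): the latest restore failed on several clusters.
    failing_now = [c for c, o in history.items() if o and not o[-1]]
    return len(failing_now) >= 2
```

A single isolated failure returns False, matching the "no reason for panic" case.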
Restore: delayed replica
• One delayed replica per cluster
• Lagging at 4 hours
• Chatops: .mysql panic
Failovers
^ that, too
MySQL setup @ GitHub
• Plain-old single-writer, master-replicas asynchronous replication.
• Not yet semi-sync
• Cross DC, multiple data centers
• MySQL 5.7, row-based replication (RBR)
• Servers with special roles: production replica, backup, auto-restore, migration-test, analytics, …
• 2-3 tiers of replication
• Occasional cluster split (functional sharding)
• Very dynamic, always changing
Points of failure
• Master failure, sev1
• Intermediate master failure
orchestrator
• Topology discovery
• Refactoring
• Failovers for masters and intermediate masters
• Open source, Apache 2 license
• github.com/github/orchestrator
orchestrator failovers @ GitHub
• Automated master & intermediate master failovers for all clusters.
• On failover, runs GitHub-specific hooks
  • Grabbing VIP/DNS
  • Updating server role
  • Kicking services (e.g. pt-heartbeat)
  • Notifying chat
  • Running puppet
Testing cluster
• Dedicated testing cluster in production
• Does not take production traffic
• “load-test” traffic
• Resembles a production topology:
  • OS, MySQL versions
  • Data centers
  • Server roles
  • DNS
  • Proxy
• Used for many of our deployment tests
Failover testing
• Multiple times per day:
  • Set up the cluster in the desired topology layout
  • Inject failure (kill/block/reject)
  • Wait, expect recovery
  • Check topology:
    • Expect new master, correct DNS changes, replica capacity, …
  • Restore old master from backup
    • (an implicit backup/restore test)
  • “success/failure” event
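One cycle of the loop above might look like the sketch below. The `topology` driver and all of its methods are hypothetical stand-ins for illustration; they are not orchestrator APIs. The 30 second default is an assumption borrowed from the expected production failover time.

```python
import time


def run_failover_test(topology, timeout_s=30.0, poll_s=1.0):
    """Run one failover test cycle and return "success" or "failure".

    `topology` is assumed to expose set_up(), inject_failure(),
    current_master(), checks_pass(), and restore_old_master(old).
    """
    topology.set_up()
    old_master = topology.current_master()
    topology.inject_failure()

    # Wait for a different master to appear, up to the timeout.
    deadline = time.monotonic() + timeout_s
    recovered = False
    while time.monotonic() < deadline:
        new_master = topology.current_master()
        if new_master is not None and new_master != old_master:
            recovered = True
            break
        time.sleep(poll_s)

    # Topology checks (DNS, replica capacity, ...) only matter after recovery.
    ok = recovered and topology.checks_pass()
    # Restoring the old master runs on both outcomes, so the next cycle can start.
    topology.restore_old_master(old_master)
    return "success" if ok else "failure"
```

The returned string would feed the same kind of success/failure event stream used for restores.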
Failover in production
• We expect < 30s failover
• Intermediate master failover has low impact on a subset of users, depending on cluster/DC/server
• Master failover implies outage
• Planned master switchover takes a few seconds
A moment of reflection
What builds trust in failovers?
A testing environment?
Chaos testing in production
• First steps into regular testing
• Manual
• Supported by our peers
• Learning, understanding impact
Tests that go wrong
• Many things can go wrong
• Corrupt replication
• Invalidated servers
• Unassigned DNS
• Cleanups
Schema migrations
Is your data correct?
The data you see is merely a ghost of your original data
gh-ost
• Young. 16 months old.
• In production at GitHub since it was born.
• Software
• Bugs
• Development
• Bugs
gh-ost testing
• gh-ost works perfectly well on our data
• Tested, re-tested, and tested again
• Full coverage of production tables
gh-ost testing servers
• Dedicated servers that run continuous tests
gh-ost testing replicas
[Diagram, two panels, each labeled: master, production replicas, testing replica]
gh-ost testing
• Trivial ENGINE=INNODB migration
• Stop replication
• Cut-over, cut-back
• Checksum both tables, compare
• Checksum failure: stop the world, alert
• Success/failure: event
• Drop ghost table
• Catch up
• Next table
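The "checksum both tables, compare" step can be illustrated with a small sketch. It hashes rows already fetched into memory, which is an assumption made to keep the example self-contained; the slides do not say how the checksum is actually computed.

```python
import hashlib


def table_checksum(rows):
    """Return a SHA-256 hex digest over a table's rows, independent of row order.

    `rows` is an iterable of tuples (hypothetical in-memory representation).
    """
    digest = hashlib.sha256()
    # Sort the textual form so None values and mixed types never break ordering.
    for text in sorted(repr(row) for row in rows):
        digest.update(text.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def compare_tables(original_rows, ghost_rows):
    """Return "success" when both tables hold identical rows, else "failure"."""
    same = table_checksum(original_rows) == table_checksum(ghost_rows)
    return "success" if same else "failure"
```

A "failure" result corresponds to the stop-the-world alert case; "success" lets the loop drop the ghost table and move on to the next table.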
gh-ost development cycle
• Work on branch

.deploy gh-ost/mybranch to prod/mysql_role=ghost_testing
• Let continuous tests run
• Depending on nature of change, observe hours/days/more.
• Merge
• Tests run regardless of deployed branch
Conclusion
• Backup & restore
• Failovers
• Schema migrations
Thank you!
Questions?
github.com/ikewalker
@iowalker