Do you think that High Availability is all about MySQL Replication? Have you tried to alter your tables to NDB to drink from the holy grail of shared-nothing architectures? High Availability is the #1 request for MySQL Servers, even more popular than scalability and performance. In this presentation we will talk about old and new tools to provide HA, automatic failover and disaster recovery for MySQL - there is a solution for every need.
MySQL HA reloaded - old tricks and cool new tools to guarantee high availability to your MySQL Servers
1. High Availability
Reloaded
IVAN ZORATTI
Chief Technology Officer
Oracle, MySQL and InnoDB are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.
Tuesday, 24 January 12
2. Agenda
• SkySQL - 3 (+1) slides!
• A bit of theory
• High availability solutions
• ...and the famous last words!
3. SkySQL Ab
• Funded by:
–MySQL® AB founders Monty Widenius and
David Axmark
–US Investment group OnCorps.org
• A team of 40, operating in 14 countries, 90%
from MySQL® AB
• Backed by:
–Product Engineering MontyProgram Ab
–Top Community contributors, commercial partners
and end users
4. SkySQL Offering
• SkySQL Enterprise Subscriptions
– Monitoring, Administration and End User tools
– Specialised modules for High Availability and performance
improvements (1)
• SkySQL Enterprise Cluster and SkySQL Enterprise HA
– Up to L3 Technical and Consultative Support for the most used
MySQL® distributions and branches
• SkySQL Consulting
– Top class team for MySQL® technology
– Extended service offering from Health Check to continuous
administration
• SkySQL Training
– MySQL® Training and Certification
(1) Option
5. The SkySQL Reference Architecture
[Figure: reference architecture diagram showing Components, Integration Tools and Migration Tools]
7. High Availability
“High availability is a system design protocol and
associated implementation that ensures a
certain degree of operational continuity during a
given measurement period.”
8. Fault-tolerant?
“Fault-tolerant design enables a system
to continue operation,
possibly at a reduced level
(also known as graceful degradation),
rather than failing completely,
when some part of the system fails.”
9. Switchover / Failover
• Switchover
– “Switchover is the capability to manually switch over from
one system to a redundant or standby computer server,
system, or network upon the failure or abnormal termination
of the previously active server, system, or network.”
• Failover
– “Failover is the capability to switch over automatically to a
redundant or standby computer server, system, or network
upon the failure or abnormal termination of the previously
active application, server, system, or network.”
• Aided Switchover?
• Failback?
10. Downtime
• Planned/Scheduled
• Unplanned/Unscheduled
• “Downtime or outage duration refers to a
period of time that a system fails to provide or
perform its primary function.”
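The definition above, together with the "given measurement period" from the High Availability slide, can be turned into numbers. The following is a minimal sketch, assuming a 365-day year as the measurement period; the function names are illustrative and not part of any tool mentioned in this deck.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap year as the measurement period


def downtime_budget_seconds(availability_percent, period_seconds=SECONDS_PER_YEAR):
    """Return the maximum downtime allowed in the period for a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_seconds * (1 - availability_percent / 100)


def measured_availability_percent(downtime_seconds, period_seconds=SECONDS_PER_YEAR):
    """Return the availability actually achieved, given the downtime observed in the period."""
    if not 0 <= downtime_seconds <= period_seconds:
        raise ValueError("downtime must fit inside the measurement period")
    return 100 * (1 - downtime_seconds / period_seconds)


if __name__ == "__main__":
    # Print the yearly downtime budget for two to five nines.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_budget_seconds(target) / 60
        print(f"{target}% allows about {minutes:.1f} minutes of downtime per year")
```

Whether planned downtime counts against the budget depends on the SLA, so the same function may need to be fed different inputs for scheduled and unscheduled outages.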
11. Single Point Of Failure - SPOF
“A Single Point of
Failure, (SPOF),
is a part of a system
which, if it fails,
will stop the entire
system from working.”
12. Disaster Recovery and
Business Continuity
“Disaster recovery is the process, policies and
procedures related to preparing for recovery or
continuation of technology infrastructure critical to
an organization after a natural or human-induced
disaster.”
“Disaster recovery planning is a subset of a larger
process known as business continuity planning and
should include planning for resumption of
applications, data, hardware, communications (such
as networking) and other IT infrastructure.”
14. Designing a Highly Available System
• Which level of High Availability do I need?
• Do I require no loss of data?
• Do I need failover or is switchover enough?
• Can I provide a reasonable service when a
component is down?
15. Something to clarify...
• Availability vs Scalability
• HA Costs
• HA for your systems, not only for your
database
• Review your SLAs
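The point about HA for the whole system, not only the database, can be illustrated with a little arithmetic. This is a minimal sketch, assuming that components fail independently (which is often not true in practice); the function names and the sample figures are illustrative only.

```python
from math import prod


def serial_availability(components):
    """Availability of a chain where every component is required: the values multiply."""
    return prod(components)


def redundant_availability(component, copies):
    """Availability of a group that works while at least one copy is up.

    Assumes independent failures and instantaneous, flawless failover.
    """
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1 - (1 - component) ** copies


if __name__ == "__main__":
    # Three tiers at 99.9% each give less than 99.9% end to end.
    print(serial_availability([0.999, 0.999, 0.999]))
    # Two redundant database servers at 99% each.
    print(redundant_availability(0.99, 2))
```

A highly available database behind a single load balancer or a single switch is still limited by that component, which is why the SLA should be reviewed for the whole chain.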
17. High Availability with MySQL
(listed from higher to lower availability)
• Combined solutions
• Shared nothing distributed cluster with MySQL
Cluster
• Geographical Replication for disaster recovery
• Virtualised Environments
• Active/Passive Clusters through shared storage
• MySQL synchronous replication
• Generic synchronous replication
• MySQL Replication with agents and failover
• MySQL Replication
18. MySQL Replication
• Something you may have missed...
–Asynchronous or Semi-synchronous
–Pros and Cons of RBR vs SBR
–Mono-thread pull from the slaves
–sync_binlog = 0/1
–Antelope vs Barracuda
–Group Commit
–Multi-engines
–Rolling upgrades
[Figure: a Read-Write master with its binlog, replicating to Read-Only slaves through their relaylogs]
20. MySQL Replication with MHA
• Something to consider...
–read-only=1 and
log-bin on slaves
–Master IP failover
–Filtering rules
–multi-tier replication
http://code.google.com/p/mysql-master-ha/
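One decision any replication failover has to make is which slave to promote. The following is a minimal sketch of that single step, not MHA's actual code: it assumes the usual numeric binlog suffix (for example mysql-bin.000012) and compares the Master_Log_File and Read_Master_Log_Pos values reported by SHOW SLAVE STATUS. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaveStatus:
    """Subset of SHOW SLAVE STATUS for one slave (hypothetical container)."""
    host: str
    master_log_file: str      # Master_Log_File, e.g. "mysql-bin.000012"
    read_master_log_pos: int  # Read_Master_Log_Pos


def binlog_coordinates(status):
    """Return (binlog sequence number, position) so that coordinates sort correctly."""
    sequence = int(status.master_log_file.rsplit(".", 1)[1])
    return (sequence, status.read_master_log_pos)


def most_advanced_slave(slaves):
    """Pick the slave that has read the most events from the failed master."""
    if not slaves:
        raise ValueError("no slaves available for promotion")
    return max(slaves, key=binlog_coordinates)
```

A real tool has more to do than this, such as bringing the other slaves up to the same point, moving the master IP and honouring filtering rules, which is what the considerations above are about.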
21. Tungsten Replicator
• Open Source, heterogeneous replication
• Truly multi-master and fan-in with Global ID
• Per-schema multi-thread
[Figure: Read-Write servers, each with a Replicator agent]
http://code.google.com/p/tungsten-replicator/
23. Synchronous Replication with DRBD
• Typical Active/Standby
• Cross active/active servers implementations
• Possible issues:
–Dependencies
–Infrastructure SPOFs
–Write performance impact
–InnoDB only
• DRBD in a virtualized environment
[Figure: Active/Hot Server and Passive/Std-by Server, each with its own Block Device]
24. Synchronous Replication through DRBD
Configuration
[Figure: DRBD configuration diagram. Gateway 192.168.1.1; VIP 192.168.1.X; Active/Hot Server and Passive/Std-by Server (labels 192.168.1.2, 15 and 16); heartbeat links HB1: 10.0.3.X and HB2: 10.0.4.X; DRBD link 10.0.5.X; on each server a Block Device /dev/sdb with /mysqldata]
25. Synchronous Replication with Galera
• Synchronous replication for InnoDB
• Multi-master, no SPOF
• Application failover must be managed
• Conflict resolution
[Figure: Read-Write nodes connected through wsrep]
http://www.codership.com
26. Percona XtraDB Cluster
• Alpha version of Galera + XtraDB
• Multi-master, no SPOF
• Application failover must be managed
• Conflict resolution with aborted COMMITs
• Auto Increment
• No XA TXN
• NoPK operations issues
[Figure: Read-Write nodes connected through wsrep]
http://www.percona.com/doc/percona-xtradb-cluster
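Conflict resolution with aborted COMMITs means the application must be ready to replay a transaction. The following is a minimal sketch of such a retry loop, assuming the driver reports the abort as an exception; CommitAborted is a hypothetical stand-in for that driver-specific error (in Galera-based clusters it usually surfaces as a deadlock error), and run_with_retry is an illustrative name.

```python
import time


class CommitAborted(Exception):
    """Hypothetical stand-in for the driver error raised when the cluster aborts a COMMIT."""


def run_with_retry(transaction, max_attempts=3, backoff_seconds=0.05, sleep=time.sleep):
    """Run a callable that performs a whole transaction, replaying it if the commit is aborted.

    The callable must be safe to run again from the start, because an aborted
    transaction has been rolled back entirely.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except CommitAborted:
            if attempt == max_attempts:
                raise  # give up and let the caller decide
            sleep(backoff_seconds * attempt)  # short, growing pause before replaying
```

The retry belongs in the application or in a thin data-access layer, since only the application knows whether replaying the transaction is still meaningful.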
27. SchoonerSQL
• Synchronous master-slave replication for InnoDB
• Retrieve/Inject in the transaction log and buffer pool
• Monitoring/Administration tool
• Closed source
28. Active/Passive Clusters using
Shared Storage
• Points to consider:
–Redundancy and replication must be guaranteed by
the shared storage (and this is not trivial)
–InnoDB only
–File Systems
[Figure: Active/Hot Server and Passive/Std-by Server attached to Shared Storage]
31. Virtualised Environments
• Data storage, high availability and load balancing are
provided and managed by the virtualised software
• In case of fault, the virtualised software restarts on
any other physical server
• MySQL Replication for disaster recovery
• InnoDB only
[Figure: virtual machines 01 to 08 distributed across physical servers on Shared Storage]
32. Geographical Replication for Disaster
Recovery
• Master-Master Asynchronous
Replication is used to update the
backup data centre
• In case of fault, the network
traffic is redirected to the
backup data centre. Failback
must be executed manually
• Cross-platform and cross-
engine
[Figure: Active Data Centre and Backup Data Centre]
33. Storage Snapshots for Disaster
Recovery
• Snapshots are managed by the NAS and SAN
firmware. There is usually a short read-only freeze
• Snapshots can be used as run-time backup
• InnoDB only, NetApp NASs and firmware are
certified using Snapshot and Snapmirror
[Figure: Active Data Centre and Backup Data Centre]
34. MySQL Cluster
• Shared-nothing, fully transactional and distributed architecture used for high volume and
small transactions.
• MySQL Cluster is based on the NDB (Network DataBase) Storage Engine
• Data is distributed for scalability and performance, and it is replicated for redundancy on
multiple data nodes.
• Nodes in a cluster:
– SQL Nodes: provide the SQL interface to NDB
– API Nodes: provide the native NDB API
– Data Nodes: store and retrieve data, manage transactions
– Management Nodes: manage the Cluster
• Load balanced
• Memory or disk-based
• Geographically replicated for disaster recovery with conflict resolution
• Full online operation for maintenance and administration
[Figure: Application Nodes (NDB API, ClusterJ/JPA), SQL Nodes, Management Nodes and Data Nodes]
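The idea that data is distributed for scalability and replicated for redundancy on multiple data nodes can be sketched as follows. This is a deliberately simplified illustration, not the real NDB partitioning algorithm: it assumes data nodes are split into groups of equal size (one group per set of replicas) and that a hash of the primary key selects the group. All names are hypothetical.

```python
import hashlib


def node_groups(data_nodes, replicas):
    """Split the data nodes into groups; every node in a group holds the same rows."""
    if replicas < 1 or len(data_nodes) % replicas != 0:
        raise ValueError("the number of data nodes must be a multiple of the replica count")
    return [data_nodes[i:i + replicas] for i in range(0, len(data_nodes), replicas)]


def nodes_for_key(primary_key, groups):
    """Return the data nodes that store the row identified by primary_key."""
    digest = hashlib.md5(primary_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(groups)
    return groups[index]


if __name__ == "__main__":
    groups = node_groups(["ndb1", "ndb2", "ndb3", "ndb4"], replicas=2)
    print(nodes_for_key("customer:42", groups))
```

With four data nodes and two replicas, each row lives on two nodes, so the loss of one node in a group leaves the data available, while adding groups spreads rows over more nodes.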
35. Client-based Failover and Proxies
• Connector/J
–jdbc:mysql://[host][,failoverhost...][:port]
• mysqlnd_ms for PHP
–connection pooling for mysqli, mysql and
PDO_MYSQL
• ScaleBase
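The host list in the Connector/J URL expresses a simple idea: try the primary host first, then each failover host in order. The following is a minimal sketch of that idea in Python, not the behaviour of any specific connector; the connect argument is a hypothetical callable supplied by the caller (for example a wrapper around a real driver), and AllHostsFailed is an illustrative exception.

```python
class AllHostsFailed(Exception):
    """Raised when no host in the list accepted the connection."""


def connect_with_failover(hosts, connect):
    """Try each host in order and return the first connection that succeeds.

    `connect` is any callable taking a host name and returning a connection,
    raising ConnectionError (or a subclass) when the host is unreachable.
    """
    errors = []
    for host in hosts:
        try:
            return connect(host)
        except ConnectionError as exc:
            errors.append(f"{host}: {exc}")  # remember why this host was skipped
    raise AllHostsFailed("; ".join(errors))
```

Client-side failover only decides where to connect. It does not check that the failover host is writable or up to date, so it needs to be combined with one of the server-side solutions described earlier.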
38. The famous last words...
• I need 5 nines
–Implement what you really need
• Everything must be automatic
–Aided switchover is sometimes more effective,
inexpensive and easy to implement/administer
• I want to migrate to MySQL Cluster
–Is your application designed for Cluster?
• I can’t afford to lose any data
–People lose data every day. Is the drop in
performance worth it?
• I need a sub-second failover
–Check the timeout periods and the caching warm-ups
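The last answer can be checked with simple arithmetic: the time to fail over is at least the sum of the timeouts involved. This is a minimal sketch under stated assumptions; the parameter names are hypothetical and the breakdown (detection, promotion, client reconnect) is one plausible way to split the total, not a measurement of any product.

```python
def failover_time_seconds(heartbeat_interval, missed_heartbeats, promotion, client_reconnect):
    """Estimate the time before clients can work again after a failure.

    detection = heartbeat_interval * missed_heartbeats (how long until the failure is declared)
    promotion = time to make the standby active
    client_reconnect = time for clients to notice and reconnect
    """
    detection = heartbeat_interval * missed_heartbeats
    return detection + promotion + client_reconnect


if __name__ == "__main__":
    # Illustrative values only: 1 s heartbeats, 3 misses, 2 s promotion, 0.5 s reconnect.
    print(failover_time_seconds(1.0, 3, 2.0, 0.5))
```

Cache warm-up comes on top of this figure: the service may be reachable again while still running well below its normal performance.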
39. SkySQL Enterprise HA
• Full HA solution, supported on
–Platforms: Linux, Windows, Solaris x86
–DB Servers: Oracle MySQL, MariaDB, Percona Server
–2 to 3 days implementation guaranteed with
acceptance tests
• Technologies:
–MySQL Replication
–DRBD Active/Passive or Cross Active
–MHA Tool with/without Multi-tier Replication
–Linux or Windows Shared Storage
–MySQL Cluster
–Tungsten Enterprise