Database availability is notoriously hard to measure and report on, although it is an important KPI in any SLA between you and your customer. We often define availability in terms of 9’s (e.g. 99.9% or 99.999%), although there is often a lack of understanding of what these numbers might mean, or how we can measure them.
Is the database available if an instance is up and running, but it is unable to serve any requests? Or if response times are excessively long, so that users consider the service unusable? Is the impact of one longer outage the same as multiple shorter outages? How do partial outages affect database availability, where some users are unable to use the service while others are completely unaffected?
Not agreeing on precise definitions with your customer might lead to dissatisfaction. The database team might be reporting that they have met their availability goals, while the customer is dissatisfied with the service. In this webinar, we will discuss the different factors that affect database availability. We will then see how you can measure your database availability in a realistic way.
AGENDA
- Defining availability targets
- Critical business functions
- Customer needs
- Duration and frequency of downtime
- Planned vs unplanned downtime
- SLA
- Measuring the database availability
- Failover/Switchover time
- Recovery time
- Upgrade time
- Query latency
- Restoration time from backup
- Service outage time
- Instrumentation and tools to measure database availability:
- Free & open-source tools
- CC's Operational Report
- Paid tools
SPEAKER
Bartlomiej Oles is a MySQL and Oracle DBA with over 15 years of experience in managing highly available production systems at IBM, Nordea Bank, Acxiom, Lufthansa, and other Fortune 500 companies. In the past five years, his focus has been on building and applying automation tools to manage multi-datacenter database environments.
Webinar slides: How to Measure Database Availability?
1. How to Measure Database Availability?
April 2018
Presenter: Bart Oleś, Support Engineer
bart@severalnines.com
2. Your host & some logistics
Copyright 2017 Severalnines AB
I'm Jean-Jérôme from the Severalnines Team and I'm your host for today's webinar!
Feel free to ask any questions in the Questions section of this application or via the Chat box.
You can also contact me directly via the chat box or via email: info@severalnines.com during or after the webinar.
11. What is a high availability database?
Copyright 2018 Severalnines AB
High availability databases use an architecture
that is designed to continue to function normally
even after hardware, software or network failures.
14. Availability vs reliability
Availability
A measure of the percentage of time a service is in a usable state.
Often expressed as a number of nines.
Reliability
A measure of the probability of the service being in a usable state for a period of
time. Measured as MTBF (Mean Time Between Failures), and the Failure Rate.
15. Duration and frequency of downtime
Planned downtime is for scheduled upgrades and routine maintenance of
hardware and software.
Unplanned downtime is when your systems crash unexpectedly, usually due to
hardware/software failure, natural disaster or human error.
Scheduled downtime does not count towards availability,
but it may impact customer satisfaction metrics. There are
some exceptions, however, such as telcos, 911, ...
16. Connecting availability and reliability
Over the course of a year, a database goes down once, for one hour:
Availability ≈ 99.99% (or four nines)
Reliability (MTBF) = 8759 hours
Percentages of a particular order of
magnitude are sometimes referred
to by the number of nines or "class
of nines" in the digits.
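The numbers on this slide can be reproduced with a short calculation. The sketch below assumes a 365-day year and treats MTBF as uptime divided by the number of failures; the function names are illustrative, not part of any tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_pct(total_hours, downtime_hours):
    """Share of the period during which the service was usable, in percent."""
    return (total_hours - downtime_hours) / total_hours * 100


def mtbf_hours(total_hours, downtime_hours, failures):
    """Mean Time Between Failures: uptime divided by the number of failures."""
    return (total_hours - downtime_hours) / failures


# One failure during the year, lasting one hour.
availability = availability_pct(HOURS_PER_YEAR, 1)
mtbf = mtbf_hours(HOURS_PER_YEAR, 1, failures=1)

print(f"Availability: {availability:.3f}%")  # 99.989%
print(f"MTBF: {mtbf:.0f} hours")             # 8759 hours
```

The unrounded availability is about 99.989%, which rounds to the 99.99% quoted above.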
17. Availability as a percentage
Source: https://en.wikipedia.org/wiki/High_availability
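To get a feel for what each class of nines allows, the permitted downtime can be derived directly from the percentage. This is a minimal sketch assuming a 365-day year; `allowed_downtime_minutes` is a name made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget for a given availability target over a period."""
    return period_minutes * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Passing a different `period_minutes` (for example, one month) gives the budget for a shorter SLA measurement period.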
20. Why measure availability?
The need for availability is governed by business objectives, and the primary
goals of measuring it are:
● To provide an availability baseline
● To help identify where to improve the systems
● To monitor and control improvement projects
21. Calculating availability
AST: Agreed Service Time
DT: Downtime
If AST is 100 hours and downtime is 2 hours, then the
availability would be:
Availability = (AST - DT) / AST × 100 = (100 - 2) / 100 × 100 = 98%
22. Calculating availability
The trouble with this is that, while this calculation is easy enough to
perform, and collecting the data to do it seems straightforward, it’s really
not at all clear what the number you end up with is actually telling you.
Define customer needs and availability targets!
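One way to see why the single number says so little: two very different outage patterns can produce exactly the same availability. The sketch below is illustrative only; the outage lists are invented data and `summarize` is a hypothetical helper.

```python
def summarize(agreed_service_hours, outage_hours):
    """Availability plus two figures the percentage alone hides."""
    downtime = sum(outage_hours)
    availability = (agreed_service_hours - downtime) / agreed_service_hours * 100
    return {
        "availability_pct": availability,
        "outages": len(outage_hours),
        "longest_outage_hours": max(outage_hours, default=0.0),
    }


# One 2-hour outage versus eight 15-minute outages in the same 100 hours.
one_long = summarize(100, [2.0])
many_short = summarize(100, [0.25] * 8)

print(one_long)
print(many_short)
```

Both patterns report 98% availability, yet a customer may experience them very differently, which is why the frequency and duration of outages belong in the agreed targets.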
23. Another method of calculating availability
In this example, you would calculate the availability as:
24. Classifying business functions by criticality
Identify the critical business functions of
your business.
Classify these critical business functions
into the following categories: high,
medium, and low
Complete the critical business functions
chart with each critical business
function.
Function | Criticality | Maximum Downtime | Person/Team | Required Resources | Impacted Functions | Brief process to complete functions
Example: Insurance claims | High | 2 Days | DBA Team 1 | 10 employees, claim mgmt software, paper forms | Claims assessing, filing | Take calls, document in system, file
Example: Open new savings account | Low | 1 Week | DBA Team 2 | 1 employee, account mgmt software | New accounts | Customer completes form onsite
26. Mean Time To Failure (MTTF)
27. Service Level Agreement (SLA)
The SLA is a contract negotiated and agreed between a
customer and a service provider.
28. SLA objectives and lifecycle
• Service description
• Reliability
• Responsiveness
• Procedure for reporting problems
• Monitoring & reporting service level
• Consequences for not meeting
service obligations
• Escape clauses or constraints
Service Level Agreement lifecycle:
1. Select service provider
2. Define SLA
3. Establish agreement
4. Monitor SLA violation
5. Terminate SLA
6. Enforce penalties for SLA violation
30. SLA common mistakes
Do not:
• Allow the service level agreement to become a
marketing document.
• Leave preparation of the Service Level Agreement
until the last minute.
• Have service levels without a compensation regime of
some sort.
• Have overly long service level measurement periods.
• Lose sight of your objectives.
33. Failure detection
• Check frequency
○ heartbeat check,
○ number of occurrences, counters
○ timeouts
• Notification delay
• Dashboard
• Service desk response time
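These factors add up to the time that passes before anyone starts working on a failure. The sketch below uses a deliberately simplified model: checks start on a fixed schedule, an alert fires after a number of consecutive failed checks, and the notification delay is constant. All parameter names and values are assumptions for illustration.

```python
def worst_case_detection_seconds(check_interval, failures_before_alert,
                                 check_timeout, notification_delay):
    """Upper bound on time from failure to a human being notified.

    A failure just after a check started is first seen by the next check,
    up to one interval later; the last required failing check starts within
    `failures_before_alert` intervals and may take `check_timeout` to give
    up; the notification then needs `notification_delay` to arrive.
    """
    return (check_interval * failures_before_alert
            + check_timeout
            + notification_delay)


# Example: check every 10 s, alert after 3 failed checks, 5 s timeout,
# 60 s until the notification reaches the on-call engineer.
delay = worst_case_detection_seconds(10, 3, 5, 60)
print(f"Worst-case detection delay: {delay} seconds")  # 95 seconds
```

Service desk response time comes on top of this figure and counts as downtime from the customer's point of view.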
34. Designing failover mechanisms
Failover is the operational process of switching between primary and secondary
systems or system components in the event of failure.
When designing failover mechanisms, organizations generally calculate
• RTO (Recovery Time Objective)
• RPO (Recovery Point Objective)
35. RTO & RPO
● RTO (Recovery Time Objective)
Time period within which service must be restored to avoid unacceptable
consequences.
● RPO (Recovery Point Objective)
Maximum tolerable period in which data may be lost. RPO defines how
much data an organisation can afford to lose. Based on this, optimum
backup frequency and recovery speed can be determined.
36. Defining RTO & RPO
● RTO
Time to:
○ Recall backup media,
○ Get on-call engineers on site (travel time),
○ Bring up infrastructure,
○ Restore data,
○ Bring up services,
○ Configure application,
○ Test and validate.
● RPO
○ Guaranteed last restorable point (PITR) (DEMO)
○ Delayed replication (DEMO)
37. Can RPO + RTO = 0?
● If RPO = 4 hours, you need backups of data that are no older than 4 hours.
● If it takes 2 hours to restore the last backup, which was taken 4 hours ago, RTO is >= 2 hours and RPO is 4 hours.
● If a master fails and the slave is 10 minutes behind, your RPO cannot be < 10 minutes.
● If the application needs to be bounced and that takes 10 minutes, the RTO cannot be < 10 minutes.
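The examples on this slide can be expressed as lower bounds. The sketch below assumes that recovery steps run one after another and that data can be recovered from whichever copy is freshest; the function names and step labels are hypothetical.

```python
def rto_floor_minutes(sequential_steps):
    """Lower bound on RTO: recovery steps that run one after another add up."""
    return sum(sequential_steps.values())


def rpo_floor_minutes(backup_age_minutes, replica_lag_minutes=None):
    """Lower bound on RPO: the age of the freshest copy you can recover from."""
    if replica_lag_minutes is None:
        return backup_age_minutes
    return min(backup_age_minutes, replica_lag_minutes)


# Restore from a backup taken 4 hours ago; the restore itself takes 2 hours.
backup_rto = rto_floor_minutes({"restore_backup": 120})
backup_rpo = rpo_floor_minutes(backup_age_minutes=240)

# Fail over to a slave that is 10 minutes behind, then bounce the application.
failover_rto = rto_floor_minutes({"bounce_application": 10})
failover_rpo = rpo_floor_minutes(backup_age_minutes=240, replica_lag_minutes=10)

print(f"Backup restore: RTO >= {backup_rto} min, RPO >= {backup_rpo} min")
print(f"Failover:       RTO >= {failover_rto} min, RPO >= {failover_rpo} min")
```

In this model neither bound reaches zero as long as any recovery step takes time or any copy lags behind the primary.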
38. Failure handling - replication
● Failure Detection
● Pre-failover
- find the most advanced slave
- wait until it has applied the remaining replication lag
- fail over the master
● Post-failover
- update application connection
(or use proxy)
- re-slave to new master
Additionally:
How much data you can lose
[Diagram: master A (read/write) replicating to slave B (read-only)]
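The "find the most advanced slave" step can be sketched as a simple comparison. This example assumes each slave reports how far it has applied the old master's binary log as a zero-padded file name plus an offset, so that the pairs sort correctly; the `Replica` type and the sample values are invented for illustration and do not come from any tool.

```python
from typing import NamedTuple


class Replica(NamedTuple):
    name: str
    log_file: str  # master binary log file applied so far (zero-padded suffix)
    log_pos: int   # offset applied within that file


def most_advanced(replicas):
    """Pick the replica that has applied the most of the master's log."""
    return max(replicas, key=lambda r: (r.log_file, r.log_pos))


candidates = [
    Replica("slave1", "binlog.000012", 98_304),
    Replica("slave2", "binlog.000013", 4_096),
    Replica("slave3", "binlog.000012", 120_000),
]

promoted = most_advanced(candidates)
print(f"Promote {promoted.name}")  # slave2: it is already on the newer file
```

Whatever the promoted slave had not yet received from the failed master is the data you can lose, which is why the replication lag bounds the RPO.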
39. Failure handling - Galera cluster
• Single node failure leads to partial app outage
• SST vs storage snapshot
• Non-blocking donor node & performance impact
• Bootstrap time
○ Determining the most advanced node
○ Bootstrap process
• IST & Galera cache size (Replication Window*)
*https://severalnines.com/blog/using-galera-replication-window-advisor-avoid-sst
[Diagram: three Galera nodes, A, B and C, each accepting reads and writes]
40. Failure handling - Load balancers
● Need to be able to handle transaction failures and retry them.
● Ability to check the health of the database servers.
● Keepalived & VIP failover.
Benchmarked failover times*:
ProxySQL 1.4.6 : 11 seconds
HAProxy 1.5.14 : 12 seconds
MaxScale 2.1.9 : 15 seconds
*https://severalnines.com/blog/comparing-database-proxy-failover-times-proxysql-maxscale-and-haproxy
[Diagram: a load balancer in front of Node A and Node B]
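Handling transaction failures and retrying them can look roughly like the sketch below. It is not tied to any particular driver: `TransientDatabaseError` stands in for whatever exception your client library raises when a connection is lost during failover, and the retry counts and delays are arbitrary.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' style error."""


def run_with_retry(transaction, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run `transaction`, retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except TransientDatabaseError:
            if attempt == attempts:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))


# Demo: a transaction that fails twice (as during a failover), then succeeds.
calls = []


def flaky_transaction():
    calls.append(1)
    if len(calls) < 3:
        raise TransientDatabaseError("backend unavailable")
    return "committed"


result = run_with_retry(flaky_transaction, sleep=lambda seconds: None)
print(result, "after", len(calls), "attempts")  # committed after 3 attempts
```

The total backoff should comfortably exceed the failover time of the proxy in use, and the transaction itself must be safe to run again from the start.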
41. Failure handling - InnoDB recovery time
● Checkpoint interval
● Size of the logs
● Data Access Locality
● Database size
● Buffer Pool Size
● Number of dirty buffers during the crash
42. Upgrade time
● Size of the database
● Backup time
● Buffer pool size
It can be minimised with:
● Rolling restart (in case of distributed setup)
● Upgrade combined with replication switchover (DEMO)
43. Query latency
MySQL users have a number of options for monitoring
query latency (DEMO):
Performance schema
events_statements_summary_by_digest
Sys schema
sys schema provides an organized set of metrics in a more
human-readable format:
SELECT * FROM
sys.statements_with_runtimes_in_95th_percentile;
Slow queries
SHOW VARIABLES LIKE 'long_query_time';
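Latency can also be folded into the availability figure itself by counting a slow response as unavailable. The sketch below assumes you already collect probe results as (succeeded, latency in milliseconds) pairs; the threshold and sample data are made up.

```python
def usable_pct(samples, max_latency_ms):
    """Percentage of probes that succeeded within the latency threshold."""
    if not samples:
        raise ValueError("no samples collected")
    usable = sum(1 for ok, latency_ms in samples
                 if ok and latency_ms <= max_latency_ms)
    return usable * 100 / len(samples)


# (succeeded, latency_ms) for five probe queries
probes = [(True, 12.0), (True, 15.5), (True, 2300.0), (False, 0.0), (True, 9.8)]

pct = usable_pct(probes, max_latency_ms=500)
print(f"Usable: {pct:.1f}% of probes")  # 60.0%: one error, one too slow
```

By uptime alone, four of the five probes succeeded; once the 2.3-second response is treated as unusable, only three count.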
44. Restoration time from a backup
What impacts RTO:
● Database size
● Network throughput
● Backup type
● Standalone or Cluster
Type of failure:
● Backup type – logical, physical, disk snapshot
● Partial restore on single node (DEMO)
● Cluster restore and bootstrap
● Datacenter
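A rough restore-time estimate follows from database size and throughput. This sketch assumes the backup is first copied over the network and then applied, one after the other, at constant rates; real restores vary with backup type and hardware, and all figures here are invented.

```python
def estimate_restore_minutes(db_size_gb, network_mb_per_s, restore_mb_per_s):
    """Transfer time plus apply time, assuming the two run sequentially."""
    size_mb = db_size_gb * 1024
    transfer_seconds = size_mb / network_mb_per_s
    apply_seconds = size_mb / restore_mb_per_s
    return (transfer_seconds + apply_seconds) / 60


# 500 GB database, 100 MB/s network, 200 MB/s restore rate
minutes = estimate_restore_minutes(500, 100, 200)
print(f"Estimated restore time: {minutes:.0f} minutes")  # 128 minutes
```

An estimate like this is only a starting point; timing an actual test restore gives the number to put in the SLA.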
45. Service outage time
Other services that can affect the database:
● Networking
● OS upgrade
● Disk resize or other system maintenance
● Application upgrade
Note: Define these separately if they are not within the control of the database team.
46. Instrumentation and tools to measure database availability
48. ClusterControl Operational Report
The idea behind creating Operational Reports is to put all of the most important data into a single document,
which can be quickly reviewed to get an understanding of the state of the databases.
● Availability Summary
● Cluster - Availability Details
● Cluster State History
50. Additional Resources
● Repair and recovery for your MySQL, MariaDB and
MongoDB Clusters
● Designing Open Source Databases for High Availability
● HA & Load Balancing Tutorials
● Download ClusterControl
● Contact us: info@severalnines.com