Webinar slides: How to Measure Database Availability?

April 2018
How to Measure Database
Availability?
Bart Oleś, Support Engineer
Presenter
bart@severalnines.com

Copyright 2017 Severalnines AB
Your host & some logistics
I'm Jean-Jérôme from the Severalnines Team and I'm your host for today's webinar!
Feel free to ask any questions in the Questions section of this application or via the Chat box.
You can also contact me directly via the chat box or via email: info@severalnines.com during or after the webinar.

About Severalnines and ClusterControl

What We Do
Manage, Scale, Monitor, Deploy

ClusterControl Automation & Management
Management
● Multi-Cluster / Multi-DC
● Automate Repair &
Recovery
● Database Upgrades
● Backups
● Configuration Management
● Database Cloning
● One-Click Scaling
Deployment
● Deploy a Cluster in Minutes
● On-Premises or in the Cloud (AWS)
Monitoring
● Systems View with 1sec Resolution
● DB / OS stats & Performance Advisors
● Configurable Dashboards
● Query Analyzer
● Real-time / historical

Supported Databases

Our Customers

Agenda
● Introduction
● Defining availability targets
● Database Availability - What to measure?
● Database Availability - How to measure?

Introduction

What is a high availability database?
High availability databases use an architecture
that is designed to continue to function normally
even after hardware, software or network failures.

Example high availability architecture

Outage visibility

Availability vs reliability
Availability
The percentage of time a service is in a usable state. Often expressed as a number of nines.
Reliability
The probability of the service staying in a usable state for a given period of time. Measured as MTBF (Mean Time Between Failures) and the failure rate.

Duration and frequency of downtime
Planned downtime is for scheduled upgrades and routine maintenance of hardware and software.
Unplanned downtime is when your systems crash unexpectedly, usually due to hardware or software failure, natural disaster or human error.
Scheduled downtime does not count towards availability, but may impact customer satisfaction metrics. There are exceptions, however, such as telcos and 911 services.

Connecting availability and reliability
Over a year, a database goes down once, for one hour:
Availability = 8759 / 8760 ≈ 99.99% (approximately four nines)
Reliability (MTBF) = 8759 hours
Percentages of a particular order of
magnitude are sometimes referred
to by the number of nines or "class
of nines" in the digits.
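The one-hour-per-year example can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def availability_pct(period_hours, downtime_hours):
    """Percentage of the period in which the service was usable."""
    return 100.0 * (period_hours - downtime_hours) / period_hours


def mtbf_hours(period_hours, downtime_hours, failures):
    """Mean Time Between Failures: total uptime divided by the failure count."""
    return (period_hours - downtime_hours) / failures


# One outage of one hour during the year, as in the example above.
avail = availability_pct(HOURS_PER_YEAR, 1.0)
mtbf = mtbf_hours(HOURS_PER_YEAR, 1.0, failures=1)

print(f"availability = {avail:.4f}%")  # availability = 99.9886%
print(f"MTBF = {mtbf:.0f} hours")      # MTBF = 8759 hours
```

Note that one hour of downtime rounds to 99.99% but is slightly above the strict four-nines budget of about 52.6 minutes per year.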

Availability as a percentage
Source: https://en.wikipedia.org/wiki/High_availability
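The downtime budget behind each class of nines follows directly from the percentage. The sketch below assumes a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines):
    """Yearly downtime budget for an availability with the given number of nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(1, 6):
    pct = 100.0 * (1 - 10.0 ** (-n))
    print(f"{n} nines ({pct:.3f}%): {allowed_downtime_minutes(n):,.1f} min/year")
```

Four nines, for instance, leaves roughly 52.6 minutes of downtime per year, and five nines roughly 5.3 minutes.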

Defining availability targets

Why measure availability?
The need for availability is governed by business objectives, and the primary
goal of its measurement is:
● To provide an availability baseline
● To help identify where to improve the systems
● To monitor and control improvement projects
Diagram: Availability baseline, Improvement, Monitor and control

Calculating availability
AST: Agreed Service Time
DT: Downtime
If AST is 100 hours and downtime is 2 hours then the
availability would be:
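The formula itself did not survive on this slide. A minimal sketch, assuming the commonly used service-management form Availability = (AST - DT) / AST x 100, gives the following for the example above (the function name is illustrative):

```python
def sla_availability_pct(ast_hours, downtime_hours):
    """Availability as a percentage of the Agreed Service Time (AST)."""
    if ast_hours <= 0:
        raise ValueError("agreed service time must be positive")
    return 100.0 * (ast_hours - downtime_hours) / ast_hours


# AST = 100 hours, DT = 2 hours, as in the example above.
result = sla_availability_pct(100, 2)
print(f"availability = {result:.1f}%")  # availability = 98.0%
```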

Calculating availability
The trouble with this is that, while this calculation is easy enough to
perform, and collecting the data to do it seems straightforward, it’s really
not at all clear what the number you end up with is actually telling you.
Define customer needs and availability targets!

Another method of calculating availability
In this example, you would calculate the availability as:

Classifying business functions by criticality
Identify the critical business functions of
your business.
Classify these critical business functions
into the following categories: high,
medium, and low
Complete the critical business functions
chart with each critical business
function.
Function | Criticality | Maximum Downtime | Person/Team | Required Resources | Impacted Functions | Brief process to complete functions
Example: Insurance claims | High | 2 Days | DBA Team 1 | 10 employees, claim mgmt software, paper forms | Claims assessing, filing | Take calls, document in system, file
Example: Open new savings acct. | Low | 1 Week | DBA Team 2 | 1 employee, account mgmt software | New accounts | Customer completes form onsite

Mean Time Between Failure (MTBF)

Mean Time To Failure (MTTF)

Service Level Agreement (SLA)
The SLA is a contract negotiated and agreed between a
customer and a service provider

SLA objectives and lifecycle
• Service description
• Reliability
• Responsiveness
• Procedure for reporting problems
• Monitoring & reporting service level
• Consequences for not meeting
service obligations
• Escape clauses or constraints
SLA lifecycle:
1. Select service provider
2. Define SLA
3. Establish agreement
4. Monitor SLA violation
5. Terminate SLA
6. Enforce penalties for SLA violation

SLA common mistakes
Do not:
• Allow the service level agreement to become a
marketing document.
• Leave preparation of the Service Level Agreement
until the last minute.
• Have service levels without a compensation regime of
some sort.
• Have overly long service level measurement periods.
• Lose sight of your objectives.

Database Availability - What to measure?

Outage timeline

Failure detection
• Check frequency
○ heartbeat check,
○ number of occurrences, counters
○ timeouts
• Notification delay
• Dashboard
• Service desk response time
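How check frequency, failure counters and timeouts add up to a detection delay can be sketched as follows. This is an illustrative model, not how any particular monitoring tool works: it assumes checks start at a fixed interval, each check gives up after a timeout, and an alarm is raised after a number of consecutive failed checks. All names and values are hypothetical.

```python
def worst_case_detection_seconds(check_interval, timeout, failures_required,
                                 notification_delay=0.0):
    """Upper bound on the time from failure to alert under the assumed model.

    The failure happens just after a successful check, so the Nth failing
    check starts up to N intervals later and takes up to `timeout` to fail.
    """
    return failures_required * check_interval + timeout + notification_delay


class HeartbeatMonitor:
    """Raises an alarm only after N consecutive failed heartbeat checks."""

    def __init__(self, failures_required):
        self.failures_required = failures_required
        self.consecutive_failures = 0

    def record(self, check_ok):
        """Record one check result; return True if the alarm should fire."""
        if check_ok:
            self.consecutive_failures = 0  # a good check resets the counter
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_required


monitor = HeartbeatMonitor(failures_required=3)
results = [True, False, True, False, False, False]
alarms = [monitor.record(ok) for ok in results]
print(alarms)  # [False, False, False, False, False, True]

# 5 s interval, 2 s timeout, 3 failures, 10 s notification delay.
print(worst_case_detection_seconds(5, 2, 3, notification_delay=10))  # 27
```

Requiring several consecutive failures avoids false alarms from a single dropped check, at the cost of a longer detection time.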

Designing failover mechanisms
Failover is the operational process of switching between primary and secondary
systems or system components in the event of failure.
When designing failover mechanisms, organizations generally calculate
• RTO (Recovery Time Objective)
• RPO (Recovery Point Objective)

RTO & RPO
● RTO (Recovery Time Objective)
Time period within which service must be restored to avoid unacceptable
consequences.
● RPO (Recovery Point Objective)
Maximum tolerable period in which data may be lost. RPO defines how
much data an organisation can afford to lose. Based on this, optimum
backup frequency and recovery speed can be determined.

Defining RTO & RPO
● RTO
Time to:
○ Recall backup media,
○ Travel time for on-call engineers,
○ Bring up infrastructure,
○ Restore data,
○ Bring up services,
○ Configure application,
○ Test and validate.
● RPO
○ Guaranteed last restorable point (PITR) (DEMO)
○ Delayed replication (DEMO)

● If RPO = 4 hours, you need backups of data that are no older than 4 hours.
● If it takes 2 hours to restore the last backup, which was taken 4 hours ago, then RTO is >= 2 hours and RPO is 4 hours.
● If a master fails and the slave is 10 minutes behind, your RPO cannot be < 10 minutes.
● If the application needs to be bounced and that takes 10 minutes, then the RTO cannot be < 10 minutes.
Can RPO + RTO = 0?
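The examples above can be expressed as two small checks. This is a sketch under the assumption that recovery steps run one after another; the step names and function names are illustrative.

```python
def rto_lower_bound_minutes(step_minutes):
    """Lower bound on RTO, assuming the recovery steps run sequentially."""
    return sum(step_minutes.values())


def meets_rpo(data_age_minutes, rpo_minutes):
    """True if the newest recoverable data is no older than the RPO allows."""
    return data_age_minutes <= rpo_minutes


# Restoring the last backup takes 2 hours, bouncing the application 10 minutes.
steps = {"restore last backup": 120, "bounce application": 10}
rto = rto_lower_bound_minutes(steps)
print(f"RTO cannot be below {rto} minutes")  # RTO cannot be below 130 minutes

print(meets_rpo(data_age_minutes=240, rpo_minutes=240))  # True: 4 h old backup, 4 h RPO
print(meets_rpo(data_age_minutes=10, rpo_minutes=5))     # False: slave 10 min behind
```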

Failure handling - replication
● Failure Detection
● Pre-failover
- find the most advanced slave
- wait until its replication lag has been applied
- fail over the master
● Post-failover
- update application connection
(or use proxy)
- re-slave to new master
Additionally: how much data can you afford to lose?
Diagram: Master A (RW) replicating to Slave B (RO)
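The pre-failover steps can be sketched as below. The `Replica` class and its integer positions are hypothetical stand-ins: a real MySQL setup would compare GTID sets or binary log coordinates instead.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    received_position: int  # last event received from the master (hypothetical)
    executed_position: int  # last event applied locally (hypothetical)


def most_advanced(replicas):
    """Pick the replica that received the most data from the failed master."""
    return max(replicas, key=lambda r: r.received_position)


def has_applied_everything(replica):
    """True once the replica has applied all events it received (no lag left)."""
    return replica.executed_position >= replica.received_position


replicas = [
    Replica("slave-b", received_position=1200, executed_position=1150),
    Replica("slave-c", received_position=1100, executed_position=1100),
]

candidate = most_advanced(replicas)
print(candidate.name)                     # slave-b
print(has_applied_everything(candidate))  # False: wait before promoting it
```

Events the master never sent to any slave are lost, which is what ties this step to the RPO.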

Failure handling - Galera cluster
Diagram: three-node Galera cluster (A, B, C), each node accepting reads and writes
• Single node failure leads to partial app outage
• SST vs storage snapshot
• Non-blocking donor node & performance impact
• Bootstrap time
○ Determining the most advanced node
○ Bootstrap process
• IST & Galera cache size (Replication Window*)
*https://severalnines.com/blog/using-galera-replication-window-advisor-avoid-sst
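A rough way to reason about the replication window is to divide the Galera cache size by the write rate. The sketch below is a simplification: it ignores cache overhead and assumes a steady write rate. The function names and figures are illustrative.

```python
def replication_window_seconds(gcache_bytes, write_bytes_per_second):
    """Approximate time a node can be away while IST is still possible."""
    if write_bytes_per_second <= 0:
        return float("inf")  # no writes: the cache never rolls over
    return gcache_bytes / write_bytes_per_second


def can_rejoin_with_ist(downtime_seconds, gcache_bytes, write_bytes_per_second):
    """True if the missed writesets should still be in the donor's cache."""
    window = replication_window_seconds(gcache_bytes, write_bytes_per_second)
    return downtime_seconds < window


GIB = 1024 ** 3
KIB = 1024

# Hypothetical figures: 1 GiB cache, 512 KiB/s of replicated writes.
window = replication_window_seconds(1 * GIB, 512 * KIB)
print(window)                                           # 2048.0
print(can_rejoin_with_ist(600, 1 * GIB, 512 * KIB))     # True: IST
print(can_rejoin_with_ist(3600, 1 * GIB, 512 * KIB))    # False: SST needed
```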

Failure handling - Load balancers
● Need to be able to handle transaction failures and retry them.
● Ability to check the health of the database servers.
● Keepalived & VIP failover.
Benchmarked failover times*:
ProxySQL 1.4.6: 11 seconds
HAProxy 1.5.14: 12 seconds
MaxScale 2.1.9: 15 seconds
*https://severalnines.com/blog/comparing-database-proxy-failover-times-proxysql-maxscale-and-haproxy
Diagram: Load Balancer in front of Node A and Node B
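Handling transaction failures and retrying them, as listed in the first bullet of this slide, could look like the sketch below on the application side. `TransientDatabaseError` is a hypothetical exception standing in for whatever a real driver raises when a connection drops during failover.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical error for a connection lost during failover."""


def run_with_retry(transaction, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run the whole transaction again on transient errors, with backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except TransientDatabaseError:
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** (attempt - 1))  # 0.1 s, 0.2 s, ...


calls = {"n": 0}


def flaky_transaction():
    """Fails twice, then succeeds, to simulate a short failover window."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientDatabaseError("connection lost during failover")
    return "committed"


result = run_with_retry(flaky_transaction, sleep=lambda seconds: None)
print(result, calls["n"])  # committed 3
```

The retry re-runs the whole transaction, so it is only safe when a failed attempt was rolled back rather than partially committed.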

Failure handling - InnoDB recovery time
● Checkpoint interval
● Size of the logs
● Data Access Locality
● Database size
● Buffer Pool Size
● Number of dirty buffers during the crash

Upgrade time
● Size of the database
● Backup time
● Buffer pool size
It can be minimised with:
● Rolling restart (in case of distributed setup)
● Upgrade combined with replication switchover (DEMO)

Query latency
MySQL users have a number of options for monitoring query latency (DEMO):
Performance schema
events_statements_summary_by_digest
Sys schema
The sys schema provides an organized set of metrics in a more human-readable format:
SELECT * FROM sys.statements_with_runtimes_in_95th_percentile;
Slow queries
SHOW VARIABLES LIKE 'long_query_time';
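To make the 95th-percentile cut-off concrete, the sketch below applies the nearest-rank definition to a list of latency samples. It does not query MySQL, and the sample values are made up.

```python
def percentile_nearest_rank(samples, percent):
    """Nearest-rank percentile: smallest value with at least `percent`% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (percent * len(ordered) + 99) // 100  # integer ceil(percent * n / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical query latencies in milliseconds.
latencies_ms = list(range(1, 21))  # 1, 2, ..., 20
p95 = percentile_nearest_rank(latencies_ms, 95)
print(p95)  # 19
```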

Restoration time from a backup
What impacts RTO:
● Database size
● Network throughput
● Backup type
● Standalone or Cluster
Type of failure:
● Backup type – logical, physical, disk snapshot
● Partial restore on single node (DEMO)
● Cluster restore and bootstrap
● Datacenter
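Database size and network throughput give a first estimate of restore time. The sketch below only covers copying the backup; it ignores decompression, log replay and cluster bootstrap. The figures are hypothetical.

```python
def transfer_minutes(db_size_gib, throughput_mib_per_s):
    """Time to copy a backup of the given size at a sustained throughput."""
    if throughput_mib_per_s <= 0:
        raise ValueError("throughput must be positive")
    return db_size_gib * 1024 / throughput_mib_per_s / 60


# Hypothetical: 500 GiB backup over a link sustaining 100 MiB/s.
estimate = transfer_minutes(500, 100)
print(f"{estimate:.1f} minutes")  # 85.3 minutes
```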

Service outage time
Other services that can affect the database:
● Networking
● OS upgrade
● Disk resize or other system maintenance
● Application upgrade
Note: Define separately if not within control of database team

Instrumentation and tools to measure database availability

Open-source and paid tools
● Nagios
● ClusterControl Community
● Zabbix
● PMM
● Grafana
● Cacti
● OpenNMS
● Icinga
● Oracle Enterprise Manager
● Monyog
● MongoDB Ops Manager
● ClusterControl Enterprise

ClusterControl Operational Report
The idea behind creating Operational Reports is to put all of the most important data into a single document,
which can be quickly reviewed to get an understanding of the state of the databases.
● Availability Summary
● Cluster - Availability Details
● Cluster State History

Q & A

Additional Resources
● Repair and recovery for your MySQL, MariaDB and
MongoDB Clusters
● Designing Open Source Databases for High Availability
● HA & Load Balancing Tutorials
● Download ClusterControl
● Contact us: info@severalnines.com
