Percona Live 2014 - Scaling MySQL in AWS

Scaling MySQL in AWS
Presented by: Laine Campbell
April 3rd, 2014

Agenda
1. Overview of options: RDS and EC2/MySQL
2. MySQL scaling patterns
3. Performance/Availability
4. Implementation choices
5. Common failure patterns

Who the *&%^#$ am I?
Laine Campbell
Co-Founder and CEO, Blackbird (formerly PalominoDB)
9 years building the DB team/infrastructure at
Travelocity.
7 years at PalominoDB/Blackbird, supporting 50+
companies, 1000s of databases and way too much
coffee.

AWS Options for MySQL:
RDS and EC2/MySQL
A love story...

AWS Relational Database Service
(RDS)
Basic Operations Managed
Ease of Deployment
Supports Scaling via Replication
Reliable via Replication, EBS RAID, Multi-AZ

Managed Operations
Backups and Recovery
Provisioning
Patching
Auto Failover
Replication

RDS Backup and Recovery
Storage is done via EBS
Snapshot and binlog based (point in time)
A Non Multi-AZ implementation creates spikes in
latency during backups
Avoided in Multi-AZ via backups on the secondary
Snapshots only

Advanced Backup and Recovery
Creating non-RDS backups done via mysqldump,
mydumper, custom extraction
You can create non-RDS replicas using a logical
backup in 5.6 only
non-RDS replicas will break during AZ failovers - thus
not useful for production or for large datasets

Disaster Recovery
Cross region replication is
supported in 5.6
Cross region replication incurs
cross-region data transfer
costs
Relay replicas recommended if
you wish to minimize expenses
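
A rough sketch of why a relay replica saves money: every replica that connects directly to the master in another region pulls its own copy of the binlog stream, while a single relay pulls it once and fans out inside the destination region. The binlog volume and the price per GB below are hypothetical placeholders, not AWS list prices.

```python
def monthly_transfer_cost(binlog_gb_per_day, cross_region_streams,
                          price_per_gb, days=30):
    """Cost of shipping binlogs across regions for one month.

    Each cross-region replication stream carries a full copy of the
    binlog, so the cost scales with the number of streams.
    """
    return binlog_gb_per_day * days * cross_region_streams * price_per_gb


# Hypothetical inputs: 20 GB of binlog per day, 4 replicas in the DR region,
# and an assumed transfer price of $0.02 per GB.
direct = monthly_transfer_cost(20, cross_region_streams=4, price_per_gb=0.02)
relayed = monthly_transfer_cost(20, cross_region_streams=1, price_per_gb=0.02)

print(f"direct:  ${direct:.2f}")
print(f"relayed: ${relayed:.2f}")
```

With a relay, the cross-region bill stays flat as you add replicas in the remote region.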

Provisioning
Initial creation of single or multi-AZ
masters
Single command replica creation
(serialized)
Built via snapshots; Multi-AZ avoids a
one-minute IO suspension.

Patching
Automatically managed in
maintenance windows
Alerts sent for the coming week, so
you can determine impact,
reschedule, etc…
Multi-AZ mitigates impact of
invasive maintenance

RDS Challenges (Opportunities?)
Abstraction from kernel, OS processlist, OS commands
etc...
No SUPER access, changes to management via Stored
Procedure (minimal but annoying)
Log access becomes more challenging (but
manageable)
The more experienced an operator you are, the
grumpier you will be!

RDS Challenges (Opportunities?)
Snapshot backups not
portable/accessible outside of
RDS
Multi-AZ failover can strand
replicas when binlog durability is
relaxed for performance
(sync_binlog=0).
Without the ability to manually
CHANGE MASTER, one must
rebuild all replicas after a failover.

RDS Visibility Impacts
Agent based instrumentation that requires localhost
installation won’t work
No access to TCPDUMP/Port listening
SAR, processlist for swapping, vmstat, iostat etc...
Log forensics become harder but manageable (must
download first)

EC2 and MySQL
All the MySQL you’ve come to love and hate
Any topology you can dream up
Access to many more types of instances and storage

Why RDS or EC2?
You can’t run 5.6, and you can’t tolerate the risk of a
single region? (~99.95% SLA per month) Use EC2
You don’t have operational expertise to manage
backups, provisioning and replication? Use RDS
pro-tip, if you can’t manage a system, how can you
troubleshoot advanced performance issues with the
visibility issues in RDS?

Why RDS or EC2?
Want MariaDB, XtraDB? Use EC2
Large data-sets generally require file level backups and
portability? Use EC2
pro-tip, if you can’t get a mysqldump or a parallel dump
to load/export in a timely fashion, you probably don’t
want RDS

Scaling Patterns for MySQL in AWS

Scaling in RDS - Vertical
RAM up to 244 GB per instance, creating excellent
ability to put large datasets in RAM
Network performance up to 10 Gbps
CPU up to 32 cores
Provisioned IOPS are game changers, and mandatory
for production, performance-sensitive applications.

Scaling in RDS - Provisioned IOPS
1,000 - 30,000 IOPS
100 GB to 3 TB
Stable, predictable IO
Realizing Max IOPS - 20,000
● cr1.8xlarge Instance Type
● MySQL 16 KB Page Size
● Full Duplex IO Channel
● 50% reads, 50% writes
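
A back-of-the-envelope check of what those figures mean in bandwidth terms. This is a sketch that assumes "16 KB" means 16 KiB and that the 50/50 mix maps evenly onto the two directions of the full-duplex channel.

```python
PAGE_KIB = 16          # MySQL page size from the slide
MAX_IOPS = 20_000      # realized maximum from the slide
READ_FRACTION = 0.5    # 50% reads, 50% writes


def mib_per_sec(iops, page_kib=PAGE_KIB):
    """Convert an IOPS figure at a fixed page size into MiB/s."""
    return iops * page_kib / 1024


read_iops = MAX_IOPS * READ_FRACTION
write_iops = MAX_IOPS - read_iops

print("read  MiB/s:", mib_per_sec(read_iops))
print("write MiB/s:", mib_per_sec(write_iops))
print("total MiB/s:", mib_per_sec(MAX_IOPS))
```

A workload skewed toward reads or writes fills one direction of the channel first, which is why it cannot reach the full 20,000.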

Scaling in RDS - Provisioned IOPS
Overprovisioning beyond the realized IOPS can still
reduce latency
● In an unbalanced workload, for instance reads
consuming channel limits
● Write channel bandwidth remains unsaturated
● By doubling IOPS, you increase concurrency, thus
reducing latency. Transaction rates increase
● Consumption of IOPS can reduce as transaction
rates increase, and manifest as:
○ Improved use of group commit
○ larger log writes

Scaling in RDS - Reads
Native replication allows for scale out of reads, just as
in EC2 or your own datacenter
RAM up to 244 GB per instance, creating much better
ability to put large datasets in RAM
5.6 allows for the memcached plugin

Scaling in RDS - Writes
Like any system, you must split workloads if writes
consume max capacity of PIOPS.
● Functional Partitioning
● Sharding
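
A minimal sketch of the sharding option: route each key to one of several independent masters with a stable hash. The endpoint names and the `shard_for` helper are hypothetical, and simple modulo hashing is only one of several possible routing schemes.

```python
import hashlib

# Hypothetical shard map: these names are placeholders, not real endpoints.
SHARDS = [
    "shard0.example.internal",
    "shard1.example.internal",
    "shard2.example.internal",
    "shard3.example.internal",
]


def shard_for(key: str, shards=SHARDS) -> str:
    """Map a sharding key (for example a customer id) to one shard endpoint.

    Uses a stable hash (MD5) rather than Python's built-in hash(), which is
    randomized per process for strings.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[bucket]


print(shard_for("customer-42"))
```

Note that with modulo routing, changing the number of shards remaps most keys, so adding or removing shards implies moving data.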

Scaling in RDS - Concerns
Sharding:
● Management of RDS instances to roll shards up and
down can be a new paradigm.
● Overall, this can be done, but does require a logical
shift.
Resource Constraints:
● No access to SSDs (up to 91,250 read or 78,750
write IOPS of 14KB size)
Data Movements:
● No access to data copies outside of replica builds
can dramatically increase data movement time

Scaling in EC2 - Vertical
Higher variety of instances. Similar top level
constraints of:
● RAM
● CPU
● PIOPS
● Network
Ephemeral SSD storage creates a whole new class of IO
performance (up to 91,250 read or 78,750 write IOPS
of 14KB size)

Scaling in EC2 - Reads
In addition to standard MySQL replication, you have
new options
● Galera, MariaDB/Galera and XtraDB Cluster
● Tungsten Replicator and Cluster

Scaling in EC2 - Writes
Sharding still becomes necessary, but in EC2 over
RDS, one has access to snapshots:
● Management of large datasets becomes much
easier
● Shard management functions in more typical
paradigms

Scaling in EC2 - Concerns
SSD and Ephemeral Storage
● Instances become even more volatile
● Backups via EBS snapshot are impossible, requiring
LVM snapshots or similar
● One might consider keeping writes on PIOPS volumes
(max 20,000 IOPS) and leveraging SSD for reads

AWS Availability: Regions and Zones

Amazon Regions equate to data-centers in different
geographical regions.
Availability zones are isolated from one another in the
same region to minimize impact of failures.

Amazon states AZs do not share:
•Cooling
•Network
•Security
•Generators
•Facilities

Apr, 2011 - US East Region EBS Failed
● Incorrect network failover.
● Saturated intra-node communications.
● Cascading failures impacted EBS in all AZs.
Jul, 2012 - US East Partial Impact
● Electrical storms impacted multiple sites.
● Failover of metadata DB took too long.
● EBS I/O was frozen to minimize corruption.

99.95% Monthly SLA for a region (multiple AZs)
● Implies multiple AZ is mandatory
● Implies multi-region is necessary for 99.99% or
higher

Availability in RDS - Multi-AZ
The core of an HA solution
Block level replication, active/passive
Saves you from most master crashes
Reduces impact of backups, upgrades, locks for
provisioning replicas
When not on 5.6, and using sync_binlog != 1, you often
lose replicas during failover

Availability in RDS - Multi-AZ
IO impact from
replication
You do not get to choose
the failover AZ, meaning
you must be ready to
move app servers

Availability in RDS - Replicas
Redundant replicas make total sense. N+1 meets most
needs with the ease of provisioning
You must have replicas in every AZ you have app
servers in (if using replicas for reads)
AWS states that cross-AZ latency adds low single-digit
milliseconds. Real-world experience shows occasional
much larger spikes
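
One way to act on this is to prefer a read replica in the application server's own AZ and only cross AZs as a fallback. The sketch below uses a hypothetical inventory of endpoints and a hypothetical `pick_read_replica` helper. It is not an RDS feature.

```python
import random

# Hypothetical inventory: replica endpoint -> availability zone.
REPLICAS = {
    "replica-a.example.internal": "us-east-1a",
    "replica-b.example.internal": "us-east-1b",
    "replica-c.example.internal": "us-east-1b",
}


def pick_read_replica(app_az, replicas=REPLICAS, rng=random):
    """Prefer a replica in the app server's AZ; fall back to any replica."""
    local = sorted(host for host, az in replicas.items() if az == app_az)
    candidates = local or sorted(replicas)
    return rng.choice(candidates)


print(pick_read_replica("us-east-1a"))
```

An app server in an AZ with no replica (for example after a Multi-AZ failover moved things around) still gets an endpoint, just with the cross-AZ latency described above.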

Availability in EC2 - Options
You can use Galera, XtraDB Cluster, or similar for a
read/write anywhere solution
MySQL MHA can be used to do failovers
Continuent’s Tungsten product can also manage
failovers

Impact of Changes
Type of Change             | EC2                       | RDS Master (Non Multi-AZ) | RDS Master (Multi-AZ)    | RDS Replica
Instance resize up/down    | Rolling migrations        | Moderate downtime         | Minimal downtime         | Moderate downtime (take out of service)
EBS <-> PIOPS              | Severe performance impact | Severe performance impact | Minor performance impact | Severe performance impact (take out of service)
PIOPS amount change        | Minor performance impact  | Minor performance impact  | Minor performance impact | Performance impact (take out of service)
Disk space change (add)    | Performance impact        | Performance impact        | Minor performance impact | Performance impact (take out of service)
Disk space change (reduce) | Rolling migrations        | Moderate downtime         | Moderate downtime        | Moderate downtime (take out of service)

Predicting and Managing Failure
Operations is about managing
change and mitigating risk

Local Failures
• Database crashes
• Human error
o Misconfigure
o Write to a replica
o Drop a table/database/career
• Localized EBS hangs and corruption
• Unacceptable/unpredictable performance

Local Failures
● When it goes bad, don’t waste time diagnosing.
o Shoot it in the head!
● Plan!
○ Simulate availability and region level failures
○ Wipe storage, reduce IOPS, shut down
○ Chaos monkey is your friend
● Observe!
○ Monitor for early failures, predict

Mitigation
In RDS:
Use Multi-AZ
Use replicas in multiple AZs
Replicate to multiple regions, and out of AWS
In EC2:
Use a failover solution (Galera, Tungsten, MHA/HAProxy)
Use multiple AZs and regions
Frequent Backups (practicing restores)
