In this session we will look at best practices for administering large MongoDB deployments in the cloud. We will discuss tips and tools for capacity planning, fully scripted provisioning using Chef and knife-ec2, and snapshotting your data safely, as well as using replica sets for high availability across AZs. We will cover the good, the bad, and the ugly of disk performance options on EC2, as well as several filesystem tricks for wringing more performance out of your block devices. And finally we will talk about some ways to prevent Mongo disaster spirals and minimize your downtime. This session is appropriate for anyone who already has experience administering MongoDB. Some experience with AWS or cloud computing is helpful, but not required.
2. Topics:
• Replica sets
• Resources and capacity planning
• Provisioning with chef
• Snapshotting
• Scaling tips
• Monitoring
• Disaster mitigation
3. Replica sets
• Always use replica sets
• Distribute across Availability Zones
• Avoid situations where you have even # voters
• 50% is not a majority!
• More votes are better than fewer (max is 7)
• Add an arbiter for more flexibility
• Always explicitly set the priority of your nodes. Surprise elections are terrible.
4. Basic sane replica set config
• Each node has one vote (default)
• Snapshot node does not serve read queries, cannot become master
• This configuration can survive any single node or Availability Zone outage
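A minimal sketch of this topology in the mongo shell (set name and hostnames are hypothetical; each member sits in a different AZ, and the snapshot node is hidden with priority 0):

```shell
# Three voters, odd count, explicit priorities -- no surprise elections.
mongo --host db1.us-east-1a.example.com <<'EOF'
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "db1.us-east-1a.example.com:27017", priority: 2 },
    { _id: 1, host: "db2.us-east-1b.example.com:27017", priority: 1 },
    { _id: 2, host: "snap.us-east-1c.example.com:27017",
      priority: 0, hidden: true }  // snapshot node: never primary, no reads
  ]
})
EOF
```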
5. Or manage votes with arbiters
• Three separate arbiter processes on each AZ arbiter node, one per cluster
• Maximum of seven votes per replica set
• Now you can survive all secondaries dying, or an AZ outage
• If you have even one healthy node, you can continue to serve traffic
• Arbiters tend to be more reliable than nodes because they have less to do.
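One way to sketch the per-AZ arbiter box: a single small instance running one arbiter process per replica set, each on its own port and dbpath (cluster names, paths, and ports here are hypothetical):

```shell
# One arbiter process per cluster on a shared arbiter node.
for rs in rs0 rs1 rs2; do
  port=$((27017 + ${rs#rs}))   # rs0 -> 27017, rs1 -> 27018, rs2 -> 27019
  mkdir -p /data/arb-$rs
  mongod --replSet $rs --port $port --dbpath /data/arb-$rs \
         --nojournal --smallfiles --fork --logpath /var/log/arb-$rs.log
done
# Then, from each cluster: rs.addArb("arb.us-east-1a.example.com:<port>")
```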
6. Provisioning
• Memory is your primary constraint, spend your money there
• Especially for read-heavy workloads
• Your working set should fit into RAM
• lots of page faults means it doesn’t fit
• 2.4 has a working set estimator in db.serverStatus!
• Your snapshot host can usually be smaller, if cost is a concern
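A quick way to check both signals from the shell (assumes a local mongod; on 2.4 the working set estimator is requested via `db.serverStatus({workingSet: 1})`):

```shell
# Working set estimate plus cumulative page faults.
mongo --quiet --eval '
  var s = db.serverStatus({workingSet: 1});
  print("working set pages: " + s.workingSet.pagesInMemory);
  print("page faults:       " + s.extra_info.page_faults);
'
```

If page faults climb steadily under normal load, the working set no longer fits in RAM.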
7. Disk options
• EBS -- just kidding, EBS is not an option
• EBS with Provisioned IOPS
• Ephemeral storage
• SSD
9. PIOPS
• Guaranteed # of IOPS, up to 2000/volume
• Variability of <0.1%
• RAID together multiple volumes for higher performance
• Supports EBS snapshots
• Costs 2x regular EBS
• Can only attach to certain instance types
10. Estimating PIOPS
• Estimate how many IOPS to provision with the “tps” column of sar -d 1
• Multiply that by 2-3x depending on your spikiness
• When you exceed your PIOPS limit, your disk stops for a few seconds. Avoid this.
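The arithmetic is simple; the peak tps value below is a made-up example, and in practice you would take the highest figure from something like `sar -d 1 60` under real load:

```shell
# Rough PIOPS sizing from observed disk throughput.
peak_tps=600       # hypothetical peak transfers/sec from the sar "tps" column
headroom=3         # 2-3x, depending on how spiky your traffic is
echo $((peak_tps * headroom))   # provision at least this many IOPS
```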
11. Ephemeral storage
• Cheap
• Fast
• No network latency
• You can snapshot with LVM + S3
• Data is lost forever if you stop or resize the instance
• Can use EBS on your snapshot node to take advantage of EBS tools
• makes restore a little more complicated
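A sketch of the LVM + S3 approach, assuming the data files live on an LVM logical volume; the volume group, mount point, and bucket names are all hypothetical:

```shell
# Snapshot ephemeral storage with LVM, then ship the snapshot to S3.
mongo --eval 'db.fsyncLock()'                        # flush and block writes
lvcreate --size 10G --snapshot --name mongosnap /dev/vg0/mongodata
mongo --eval 'db.fsyncUnlock()'                      # resume writes immediately
mount -o ro /dev/vg0/mongosnap /mnt/mongosnap
tar -czf - -C /mnt/mongosnap . \
  | aws s3 cp - s3://my-backups/mongo-$(date +%F).tar.gz
umount /mnt/mongosnap && lvremove -f /dev/vg0/mongosnap
```

The LVM snapshot makes the copy-out window short: writes are only blocked for the `lvcreate`, not for the upload.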
12. Filesystem
• Use ext4
• Raise file descriptor limits (cat /proc/<mongo pid>/limits to verify)
• If you’re using ubuntu, use upstart
• Set your blockdev --setra to something sane, or you won’t use all your RAM
• If you’re using mdadm, make sure your md device and its volumes have a small enough block size
• RAID 10 is the safest and best-performing, RAID 0 is fine if you understand the risks
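The read-ahead and fd-limit checks above look roughly like this (device names are hypothetical; set read-ahead on the md device and its members, since Mongo's access pattern is mostly random):

```shell
# Small read-ahead so the page cache holds hot data, not speculative reads.
blockdev --setra 32 /dev/md0
for dev in /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde; do
  blockdev --setra 32 $dev          # member volumes too
done
blockdev --getra /dev/md0           # confirm the setting took

# Verify the running mongod actually got the raised fd limit.
grep 'open files' /proc/$(pidof mongod)/limits
```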
13. Chef everything
• Role attributes for backup volumes, cluster names
• Nodes are effectively disposable
• Provision and attach EBS RAID arrays via AWS cookbook
• Delete volumes and AWS attributes, run chef-client to re-provision
• Restore from snapshot automatically with our backup scripts
Our mongo cookbook and backup scripts: https://github.com/ParsePlatform/Ops/
14. Bringing up a new node from the most recent mongo snapshot is as simple as this:
It’s faster for us to re-provision a node from scratch than to repair a RAID array or fix most problems.
15. Each replica set has its own role, where it sets the cluster name, the snapshot host name, and the EBS volumes to snapshot.
When you provision a new node for this role, mongodb::raid_data will build it off the most recent completed set of snapshots for the volumes specified in backups => mongo_volumes.
16. Snapshots
• Snapshot often
• Set snapshot node to priority = 0, hidden = 1
• Lock Mongo OR stop mongod during snapshot
• Snapshot all RAID volumes
• We use ec2-consistent-snapshot (http://eric.lubow.org/2011/databases/mongodb/ec2-consistent-snapshot-with-mongo), with a wrapper script for chef to generate the backup volume ids
• Always warm up a snapshot before promoting
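Stripped of the chef wrapper, the lock-snapshot-unlock cycle is roughly this (region, description, and volume ids are placeholders; in our setup the wrapper generates the ids from role attributes):

```shell
# Lock mongo, snapshot every volume in the RAID array, unlock.
mongo --eval 'db.fsyncLock()'
ec2-consistent-snapshot --region us-east-1 \
  --description "mongo rs0 $(date +%F)" \
  vol-aaaa1111 vol-bbbb2222 vol-cccc3333 vol-dddd4444
mongo --eval 'db.fsyncUnlock()'
```

Snapshotting all RAID volumes inside one lock window is what keeps the array consistent as a set.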
17. Warming a secondary
• Warm up both indexes and data
• Use dd or vmtouch to load files from S3
• Scan for most commonly used collections on primary, read those into memory on secondary
• Read collections into memory
• Natural sort
• Full table scan
• Search for something that doesn’t exist
http://blog.parse.com/2013/03/07/techniques-for-warming-up-mongodb/
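The warming techniques above might look like this in practice (paths, database, and collection names are hypothetical):

```shell
# Pull the data files into the page cache first...
vmtouch -t /var/lib/mongodb/mydb.*

# ...then touch indexes and documents through mongod itself.
mongo mydb --quiet --eval '
  db.users.find().sort({$natural: 1}).itcount();        // natural-order scan
  db.users.find({no_such_field: "warmup"}).itcount();   // miss forces a full scan
'
```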
18. Fragmentation
• Your RAM gets fragmented too!
• Leads to underuse of memory
• Deletes are not the only source of fragmentation
• db.<collection>.stats() to find the padding factor (between 1 and 2; the higher, the more fragmentation)
• Repair, compact, or reslave regularly (db.printReplicationInfo() to get the length of your oplog, to see if repair is a viable option)
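Both checks fit in one-liners (database and collection names are hypothetical):

```shell
# Padding factor: close to 2 means heavy fragmentation.
mongo mydb --quiet --eval 'print("paddingFactor: " + db.mycollection.stats().paddingFactor)'

# Oplog length: if the repair takes longer than this window, you cannot
# catch back up afterward and will have to resync instead.
mongo mydb --quiet --eval 'db.printReplicationInfo()'
```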
20. Compaction
• We recommend running a continuous compaction script on your snapshot host
• Every time you provision a new host, it will be freshly compacted
• Plan to rotate in a compacted primary regularly (quarterly or yearly, depending on rate of decay)
• If you also delete a lot of collections, you may need to periodically run db.repairDatabase() on each db
http://blog.parse.com/2013/03/26/always-be-compacting/
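A continuous compaction loop on the snapshot host could be sketched like this (the hour-long sleep is an arbitrary example; see the blog post above for the version we actually run):

```shell
# Compact every collection in every database, forever. Safe on the
# snapshot host because it is hidden and serves no traffic.
while true; do
  for dbname in $(mongo --quiet --eval 'db.getMongo().getDBNames().join("\n")'); do
    [ "$dbname" = "local" ] && continue    # leave the oplog alone
    mongo $dbname --quiet --eval '
      db.getCollectionNames().forEach(function (c) {
        db.runCommand({compact: c});
      });
    '
  done
  sleep 3600
done
```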
21. Scaling strategies
• Horizontal scaling
• Query optimization, index optimization
• Throw money at it (hardware)
• Upgrade to > 2.2 to get rid of global lock
• Read from secondaries
• Put the journal on a different volume
• Repair, compact, or reslave
22. Monitoring
• MMS
• Ganglia + nagios
• correlate graphs with local metrics like disk i/o
• graph your own index ops
• graph your own aggregate lock percentages
• alert on replication lag, replication error
• alert if the primary changes, connection limit
• Use chef! Generate all your monitoring from roles
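A nagios-style replication-lag check could be sketched as below; the 30-second threshold is a hypothetical example, and it assumes the check runs against a member of the set:

```shell
# Compute this node's lag behind the primary from rs.status() optimes.
lag=$(mongo --quiet --eval '
  var s = rs.status(), primary, me;
  s.members.forEach(function (m) {
    if (m.state === 1) primary = m;   // state 1 == PRIMARY
    if (m.self) me = m;
  });
  print(Math.floor((primary.optimeDate - me.optimeDate) / 1000));
')
if [ "$lag" -gt 30 ]; then
  echo "CRITICAL: replication lag ${lag}s"; exit 2
fi
echo "OK: replication lag ${lag}s"
```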
23. Fun with MMS
• opcounters are color-coded by op type
• a big bgflush spike means there was an EBS event
• lots of page faults means reading lots of cold data into memory from disk
• lock percentage is your single best gauge of fragility
24. so ... what can go wrong?
• Your queues are rising and queries are piling up
• Everything seems to be getting vaguely slower
• Your secondaries are in a crash loop
• You run out of available connections
• You can’t elect a primary
• You have an AWS or EBS outage or degradation
• You have terrible latency spikes
• Replication stops
25. ... when queries pile up ...
• Know what your healthy cluster looks like
• Don’t switch your primary or restart when overloaded
• Do kill queries before the tipping point
• Write your kill script before you need it
• Read your mongodb.log. Enable profiling!
• Check db.currentOp():
• check to see if you’re building any indexes
• check queries with a high numYields
• check for long running queries
• use explain() on them, check for full table scans
• sort by number of queries/write locks per namespace
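A kill script written ahead of time might look something like this sketch; the 20-second cutoff is a hypothetical threshold you would tune to your own workload:

```shell
# Kill long-running reads before they tip the cluster over.
mongo --quiet --eval '
  db.currentOp().inprog.forEach(function (op) {
    if (op.op === "query" && op.secs_running > 20 &&
        op.ns.indexOf("local.oplog") === -1) {    // never kill replication
      print("killing op " + op.opid + " on " + op.ns);
      db.killOp(op.opid);
    }
  });
'
```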
26. ... everything getting slower ...
• Is your RAID array degraded?
• Do you need to compact your collections or databases?
• Are you having EBS problems? Check bgflush
• Are you reaching your PIOPS limit?
• Are you snapshotting while serving traffic?
27. ... terrible latency spikes ...
28. ... AWS or EBS outage ...
• Full outages are often less painful than degradation
• Take down the degraded nodes
• Stop mongodb to close all connections
• Hopefully you have balanced across AZs and are coasting
• If you are down and can’t elect a primary, bring up a new node with the same hostname and port as a downed node