MongoDB Management Service (MMS) is a cloud-based suite of services for managing MongoDB deployments, providing both monitoring and backup capabilities. In this webinar we'll outline 5 alerts you should set up in MMS to keep your MongoDB deployment on track. We’ll explore what each alert means for a MongoDB instance, as well as how to calibrate the alert triggers to be relevant to your environment.
11.
• Is there an absolute limit to alert on?
• What is normal (baseline)?
• What is worrying (warning)?
• What is a definite problem (critical)?
• Likelihood of false positives?
... there is no magic formula
12. Five recommended alerts
• Host Recovering (All, but by definition
Secondary)
• Replication Lag (Secondary)
• Connections (All mongos, mongod)
• Lock % (Primary, Secondary)
• Replica (Primary, Secondary)
13. Host Recovering
• General alert triggered if any instance
enters RECOVERING mode
• Required for all use-cases
• All Replica Sets should have this.
• Sometimes, during maintenance this
may be expected
15. Replication Lag
• No secondary should be behind
• Secondary reads affected
• All Replica Sets should have this
• Only exception is configured slaveDelay
16. Replication Lag
Absolute Limit?
Yes, about 1 or 2s. To prevent false positives, alert on an
absolute threshold of > 240s
Normal
Lag is ideally 0s
Worrying
< 60s, some false positives
Critical
> 240s
False positives
Above 240s likelihood low.
18. Example: replication lag
• Secondaries underspecified vs. primaries
• Access patterns between primary /
secondaries
• Insufficient bandwidth
• Foreground index builds on secondaries
“…when you have eliminated the impossible, whatever remains,
however improbable, must be the truth…” -- Sherlock Holmes
Sir Arthur Conan Doyle, The Sign of the Four
19. Example: replication lag
Example:
• ~1500 ops per minute (opcounters)
• 0.1 MB per object (average object size,
local db)
~1500 ops/min / 60 s/min * 0.1 MB/op *
8 b/B ≈ 20 Mbps required bandwidth
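The arithmetic on this slide can be sketched in Python. The function and parameter names are illustrative only and are not part of MMS; the inputs are the slide's own figures.

```python
def required_bandwidth_mbps(ops_per_minute: float, mb_per_op: float) -> float:
    """Rough replication bandwidth estimate in megabits per second."""
    ops_per_second = ops_per_minute / 60               # opcounters figure is per minute
    megabytes_per_second = ops_per_second * mb_per_op  # data volume to replicate
    return megabytes_per_second * 8                    # 8 bits per byte


# Figures from the slide: ~1500 ops/min at 0.1 MB per object.
estimate = required_bandwidth_mbps(1500, 0.1)
print(f"~{estimate:.0f} Mbps")
```

This is an average-rate estimate; bursts above the average would need headroom on top of it.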
20. Connections
• Each connection consumes ~ 1MB and
a file descriptor
• 5000 connections => 5GB of RAM
• Stability and predictability are key
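As a quick check of the memory figure above, here is a minimal sketch. The ~1 MB per connection comes from the slide; the function name and the choice of 1 GB = 1000 MB (to match the slide's round number) are assumptions.

```python
MB_PER_CONNECTION = 1  # approximate per-connection cost quoted on the slide


def connection_memory_gb(connections: int, mb_per_connection: float = MB_PER_CONNECTION) -> float:
    """Approximate RAM used by open connections, in GB (1 GB taken as 1000 MB)."""
    return connections * mb_per_connection / 1000


print(connection_memory_gb(5000))
```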
21. Pro-Tip: know thyself
You have to recognize normal to know when it isn’t.
Source: http://www.flickr.com/photos/skippy/6853920/
22. Connections
Absolute Limit?
Yes, but this is too high. We need to alert before that
Normal
TBD based on deployment, number of nodes, connection
pool settings, app servers, load etc.
Say, X during peak load
Worrying
50% increase, so, 1.5X
Critical
Double, so 2X
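The 1.5X and 2X rules above can be written down as a small classifier. This is only a sketch: the function name and the baseline of 400 connections are hypothetical, and X has to come from your own peak-load measurements.

```python
def classify_connections(current: int, peak_baseline: int) -> str:
    """Classify a connection count against the measured peak baseline X."""
    if current >= 2 * peak_baseline:      # double the baseline: critical
        return "critical"
    if current >= 1.5 * peak_baseline:    # 50% above the baseline: worrying
        return "worrying"
    return "normal"


# Hypothetical baseline: 400 connections at peak load.
for count in (450, 650, 900):
    print(count, classify_connections(count, peak_baseline=400))
```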
23. Lock %
• Lock contention degrades performance
• High lock % starves replication, reads.
• Bounds need to be determined
24. Lock %
Absolute Limit?
Yes: above 80%, occasional degraded performance; above 90%,
major impact regularly
Normal
TBD. Write heavy loads see higher values. Normal, say
X% during peak load
Worrying
Double, so approximately 2X%
Critical
TBD. For Prod > 80%
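One way to turn these bounds into numbers is sketched below. The 80% production ceiling and the "double the baseline" rule come from the slide; the function name is hypothetical, and capping the worrying level at the critical level is an added assumption for write-heavy deployments whose baseline is already above 40%.

```python
PROD_CRITICAL_LOCK_PCT = 80.0  # production ceiling given on the slide


def lock_thresholds(peak_baseline_pct: float) -> tuple[float, float]:
    """Return (worrying, critical) lock % thresholds for a measured baseline X%."""
    critical = PROD_CRITICAL_LOCK_PCT
    # Worrying is roughly double the baseline; capping it at the critical
    # level is an assumption added here so the two thresholds stay ordered.
    worrying = min(2 * peak_baseline_pct, critical)
    return worrying, critical


# Hypothetical baseline: 15% lock at peak load.
print(lock_thresholds(15))
```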
25. Replica
• Represents oplog window
• Depends on
– Rate of operations inserted into oplog
– Size of operations
– Size of oplog capped collection
• Normal maintenance window × 3
• Resizing the oplog is non-trivial
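The dependencies listed above suggest a back-of-the-envelope estimate of the oplog window. The sketch below assumes a steady write rate, which real workloads rarely have; the function names and the example deployment figures are hypothetical.

```python
def oplog_window_hours(oplog_size_mb: float, ops_per_second: float, avg_op_size_mb: float) -> float:
    """Rough hours of history the oplog holds at a steady write rate."""
    mb_written_per_hour = ops_per_second * avg_op_size_mb * 3600
    return oplog_size_mb / mb_written_per_hour


def covers_maintenance(window_hours: float, maintenance_hours: float, factor: float = 3) -> bool:
    """True if the oplog window is at least `factor` times the maintenance window."""
    return window_hours >= factor * maintenance_hours


# Hypothetical deployment: 50 GB oplog, 25 ops/s, 0.001 MB per oplog entry.
window = oplog_window_hours(50_000, 25, 0.001)
print(f"window: ~{window:.0f} h, covers 3 x 8 h maintenance: {covers_maintenance(window, 8)}")
```

Because resizing the oplog is non-trivial, an estimate like this is most useful before the replica set is built.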
27. Summary
• Use similar approach for other metrics
• Different audiences for alerts
– Worrying goes to the ops team
– Critical goes out to a wider audience
• Get started with MMS Monitoring and
alerts!
A member of a replica set enters RECOVERING state when it is not ready to accept reads. The RECOVERING state can occur during normal operation, and doesn’t necessarily reflect an error condition. Members in the RECOVERING state are eligible to vote in elections, but are not eligible to enter the PRIMARY state.
----- Meeting Notes (1/9/14 11:17) -----
Initial Sync, rollback, stale