2. 2
Sensu and Sensibility
I’m part of the SRE team at Yelp.
One of my jobs is “don’t break the site, ever.”
Another job is to enable developer productivity and fast innovation.
These two things can be in conflict.
3. Cycle of failure and
disappointment
• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production
3
This talk is about one particular instance of this conflict - monitoring.
We used Nagios. It sucked. This is half to do with Nagios, half to do with the way we used it.
4. 4
This leads to developers being separated from production.
Pager details out of date. Not all hosts running a service are monitored, because services move around.
Permissions issues mean developers can’t ack alerts. No sane ack system.
5. Cycle of failure and
disappointment
• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production
• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems
5
Ops have a lot of pain too. Alerts are too noisy, when they’re for services we can’t triage them. Host issues end up with ops sending email to developers@
and praying.
Ops get alert fatigue, stuff gets missed, everything is terrible
6. 6
If monitoring is an ‘ops problem’, everything looks on fire all the time.
It’s very hard to know what’s actually broken.
Lack of situational awareness; expecting broken windows stops people from taking responsibility.
7. Cycle of failure and
disappointment
• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production
• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems
• High friction, low trust, low visibility.
7
Both sides are actually being reasonable.
This isn’t even a Hanlon’s razor situation - everyone is really trying.
8. “Normality”
8
http://gunshowcomic.com/648
It’s just the way we’ve built our monitoring system is killing us with a thousand cuts.
And we’ve got Stockholm syndrome.
9. “Normality”
This is dysfunctional
9
http://gunshowcomic.com/648
I’m painting a bleak picture here - not actually saying that everything was _this_ bad in our organization.
But these were the types of problems we identified.
11. 11
Sensibility
One of our core competencies is getting monitoring right!
So, we decided to change everything!!!!1111
12. “51% viewed their ERP implementation as unsuccessful”
12
The Robbins-Gioia Survey (2001)
Why the hell would we do that? It’s clearly a massive project
13. The Conference Board Survey (2001)
“40% of the projects failed to achieve their
business case within one year of going live”
13
And pretty high risk.
If we screw the monitoring up, well, lets just not do that?
14. McKinsey & Company in conjunction
with the University of Oxford (2012)
• “17 percent of large IT projects go so
badly that they can threaten the very
existence of the company”
• “On average, large IT projects run 45
percent over budget and 7 percent over
time, while delivering 56 percent less
value than predicted”
14
This is actually really scary.
15. Failure is an option
blog.parasoft.com/single-greatest-barrier-with-sw-delivery
15
You’re not gonna get it right first time
Different teams want to work in different ways.
Different environments are different
How do you test your monitoring system?
16. Sensibility
16
Large team + many teams - decentralized (multiple time zones for some teams)
Integration - we can’t pick a product off the shelf (and get the level of value we need)
17. 17
Sensibility
No big bang change, has to be incremental.
We don’t know what our requirements are (beyond that the current system doesn’t meet them)
Iteration is absolutely key to project success
18. Why Sensu?
• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)
18
So why did we choose Sensu - Nagios is workable, right?
Want to work with the monitoring system to integrate it into our infra, not hack around it.
19. ‘industry standard’
‘enterprise class’
19
So we do have / did have nagios.
It’s workable. In fact, it works fine, and scales pretty well (to a point).
This is not a hate on nagios. It _could_ do all the things I talk about here….
21. 21
It tries to solve the full-stack monitoring problem.
We’d already migrated most alert contacts to PagerDuty, with the rest to follow.
Half the objects useless to us. Monolithic.
24. 24
Centralized
Ephemeral clients are a problem.
Whitelisting (needing to explicitly add hosts/services) is a problem
Exported resources are horrible (slow + bad for ephemeral envs)
25. 25
To be fair, this diagram does Sensu no favors at all :)
26. How we use Sensu
• Don’t use all of this!
• ‘Standalone’ checks only
• Default in the puppet module
26
We don’t use it like this, much simpler model!
27. Sensu data flow
• Sensu client runs checks on each machine
• Pushes results to RabbitMQ
• Clustered, clients/messages will fail over.
• Sensu server (multiple, ha)
• Processes check results, invokes handlers
• Writes state to redis
• Redis + sentinel
• Read by API (2 instances)
• All layers behind haproxy
27
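As a sketch of what flows through that pipeline: a classic Sensu check result is a JSON envelope wrapping the client name and the full check definition. The field names below follow the Sensu 0.x result format; the helper and metadata values are illustrative, not Yelp code.

```python
import json

def make_result(client, name, output, status, metadata=None):
    """Build a Sensu-0.x-style check result: the whole check definition
    (including any custom metadata) travels with every result."""
    check = {"name": name, "output": output, "status": status}
    check.update(metadata or {})
    return {"client": client, "check": check}

# Example: an OK result from a host check, carrying custom metadata
payload = make_result("web-1", "apache-external", "TCP OK", 0,
                      {"team": "operations", "runbook": "y/apache"})
print(json.dumps(payload, sort_keys=True))
```

Because the definition rides along with each result, the servers stay stateless about check configuration - any server can process any result.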
28. Quis custodiet ipsos custodes?
28
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
Nagios does all of these things itself.
With no introspection - ‘how deep are my queues, why are things not getting scheduled’
29. Mutually assured monitoring
• Multiple independent Sensu installs (per-datacenter)
• Monitor each other!
29
We have a big environment, we run a Sensu per DC, they can monitor each other.
30. Machine readable config
• /etc/sensu/conf.d/checks/check_name.json
• Extensible with arbitrary metadata
• Hash merge
• Never edit by hand!
30
One of (IMO) the nice decisions is the use of JSON for config.
JSON is a terrible format for hand-edited config, but we deploy all the config by puppet.
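The hash-merge behavior is worth spelling out: everything under conf.d gets combined, with later values winning on conflicts. A minimal sketch of that merge (a hypothetical standalone function, not Sensu's actual loader code):

```python
def deep_merge(base, override):
    """Recursively merge two config dicts, override winning on conflicts -
    a sketch of how config files under /etc/sensu/conf.d get combined."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# A site-specific file overrides one key without clobbering the rest
defaults = {"checks": {"disk": {"interval": 60, "handlers": ["default"]}}}
site = {"checks": {"disk": {"interval": 300}}}
config = deep_merge(defaults, site)
```

This is why one-file-per-check works so well with config management: puppet can drop or remove a single JSON file without ever rewriting a monolithic config.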
31. monitoring_check
monitoring_check { 'systems-apache-external':
  page          => true,
  command       => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443",
  check_every   => '5m',
  alert_after   => '30m',
  realert_every => 10,
  runbook       => 'y/apache',
}
31
This is our interface to Sensu in puppet.
It’s a custom define which applies our business rules.
32. monitoring_check
monitoring_check { 'systems-apache-external':
  page          => true,
  command       => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443",
  check_every   => '5m',
  alert_after   => '30m',
  realert_every => 10,
  runbook       => 'y/apache',
}
32
Default to not paging people (for sanity), but turn that on easily.
Automatically uses the default team (whoever owns the box). Can be overridden.
33. monitoring_check
monitoring_check { 'systems-apache-external':
  page          => true,
  command       => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443",
  check_every   => '5m',
  alert_after   => '30m',
  realert_every => 10,
  runbook       => 'y/apache',
}
33
We didn’t like Sensu’s alert scheduling logic. So we rewrote it :) (This is easy - just in the base class)
35. sensu::check
• monitoring_check wraps this
• Writes a JSON file for each check
• Comment safe
35
We do use the Sensu official puppet module.
“Comment safe” - if you comment the puppet code out, the check goes away.
We’re now working on auto-resolving checks that get deleted!
36. "disk_ro_mounts": {
"standalone": true, "handlers": [“default"], "subscribers": [],
"command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
"interval": 60,
"alert_after": 0, "realert_every": “-1",
"dependencies": [],
"runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
"annotation": "https://gitweb.yelpcorp.com/?
p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
"team": "operations",
"irc_channels": "operations-notifications",
"notification_email": "undef",
"ticket": true,
"project": “OPS”,
"page": false,
"tip": false
}
36
This is what an actual auto generated check JSON looks like
BIG BLOB OF JSON!
Don’t stress, we’ll work through it.
37. "disk_ro_mounts": {
"standalone": true, "handlers": [“default"], "subscribers": [],
"command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
"interval": 60,
"alert_after": 0, "realert_every": “-1",
"dependencies": [],
"runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
"annotation": "https://gitweb.yelpcorp.com/?
p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
"team": "operations",
"irc_channels": "operations-notifications",
"notification_email": "undef",
"ticket": true,
"project": “OPS”,
"page": false,
"tip": false
}
37
This looks the same for all of our Sensu checks.
This is us using ‘simple mode’ and turning off half the features - servers can’t/don’t trigger checks on clients; it’s all client-scheduled.
38. "disk_ro_mounts": {
"standalone": true, "handlers": [“default"], "subscribers": [],
"command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
"interval": 60,
"alert_after": 0, "realert_every": “-1",
"dependencies": [],
"runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
"annotation": "https://gitweb.yelpcorp.com/?
p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
"team": "operations",
"irc_channels": "operations-notifications",
"notification_email": "undef",
"ticket": true,
"project": “OPS”,
"page": false,
"tip": false
}
38
These are custom (in our base handler) - as noted before in the define.
Times are converted to seconds (in puppet) so that all time intervals in JSON are seconds.
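That conversion can be sketched like this (a hypothetical helper; the real one lives in our puppet code):

```python
def human_time_to_seconds(spec):
    """Convert '5m' / '30m' / '1h' style intervals to integer seconds,
    mirroring what the puppet define does before writing the check JSON."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    if isinstance(spec, int):
        return spec  # already plain seconds
    return int(spec[:-1]) * units[spec[-1]]

# '5m' check_every and '30m' alert_after from the earlier example
interval = human_time_to_seconds("5m")      # 300
alert_after = human_time_to_seconds("30m")  # 1800
```

Humans get readable intervals in puppet; everything downstream only ever sees seconds, so the handlers never have to parse units.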
39. "disk_ro_mounts": {
"standalone": true, "handlers": [“default"], "subscribers": [],
"command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
"interval": 60,
"alert_after": 0, "realert_every": “-1",
"dependencies": [],
"runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
"annotation": "https://gitweb.yelpcorp.com/?
p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
"team": "operations",
"irc_channels": "operations-notifications",
"notification_email": "undef",
"ticket": true,
"project": “OPS”,
"page": false,
"tip": false
}
39
Every check has to have a run book!
40. "disk_ro_mounts": {
"standalone": true, "handlers": [“default"], "subscribers": [],
"command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
"interval": 60,
"alert_after": 0, "realert_every": “-1",
"dependencies": [],
"runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
"annotation": "https://gitweb.yelpcorp.com/?
p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
"team": "operations",
"irc_channels": "operations-notifications",
"notification_email": "undef",
"ticket": true,
"project": “OPS”,
"page": false,
"tip": false
}
40
Generated by a custom function.
Goes up the parser stack and finds where it was called from.
41. "disk_ro_mounts": {
"standalone": true, "handlers": [“default"], "subscribers": [],
"command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
"interval": 60,
"alert_after": 0, "realert_every": “-1",
"dependencies": [],
"runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
"annotation": "https://gitweb.yelpcorp.com/?
p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
"team": "operations",
"irc_channels": "operations-notifications",
"notification_email": "undef",
"ticket": true,
"project": “OPS”,
"page": false,
"tip": false
}
41
This stuff (more than half the check!) is the custom metadata
Every alert has a team owning it.
We can report in irc, JIRA, email (why? but some people do want this) or page!
42. Check scripts
• Same as nagios checks
• Simple (text) output
• Exit code
• Result sent to server, along with check definition
• Including all the custom metadata
• Our handlers use the extra data.
42
So, to recap - checks are scheduled and run on the client.
The client pushes its results (along with the check definition) to RabbitMQ, which delivers them to the server.
The server then pipes everything to the configured handlers.
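A check script only has to print one line and exit with the right code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Here's a minimal Python sketch in the spirit of check_ro_mounts - illustrative only, not Yelp's actual plugin - operating on /proc/mounts-style lines:

```python
def check_ro_mounts(mount_lines):
    """Nagios plugin protocol: return (exit_code, one_line_output).
    0=OK, 2=CRITICAL; the output text goes into the alert verbatim."""
    ro = [fields[1]  # mount point is the second /proc/mounts field
          for fields in (line.split() for line in mount_lines)
          if "ro" in fields[3].split(",")]  # mount options are field four
    if ro:
        return 2, "CRITICAL: read-only mounts: %s" % ",".join(ro)
    return 0, "OK: no read-only mounts"

status, output = check_ro_mounts(["/dev/sda1 / ext4 rw,relatime 0 0"])
```

Because the contract is just exit code plus text, you can write checks in shell, Ruby, Python, whatever - and reuse every existing Nagios plugin unmodified.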
44. How do checks get run?
• Every machine runs the client.
• Client managed by puppet
• Client has a TCP socket you can send JSON to
• Custom checks + pysensu-yelp
44
Check scripts are simple (as with Nagios). Can write them in shell/ruby/python/whatever.
More complex things can send data to the local socket. We have a python library for this (also use the ruby libraries from the sensu project)
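In raw form (without pysensu-yelp), feeding the client socket looks roughly like this. The payload fields mirror a normal check result, and port 3030 is the Sensu client's default input socket; the check name and metadata here are made up for illustration:

```python
import json
import socket

def build_result(name, output, status, **metadata):
    """A check result as JSON; extra keys (team, runbook, ...) ride
    along as custom metadata for the handlers to use."""
    return json.dumps(dict(name=name, output=output, status=status, **metadata))

def emit_result(payload, host="localhost", port=3030):
    """Push a result at the local Sensu client's TCP input socket."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(payload.encode("utf-8"))

# Build a CRITICAL result for a (hypothetical) batch job
payload = build_result("batch-job-foo", "job failed", 2,
                       team="operations", runbook="y/batch-job-foo")
# emit_result(payload)  # only works on a box running the sensu client
```

This is what lets long-running daemons and batch jobs report in without being scheduled checks at all - the client forwards anything it receives on that socket as if it had run the check itself.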
45. 45
Sensu servers know which machine is the master right now (their own leadership election).
Deploy some checks to sensu servers (e.g. cloudwatch checks!), run on the master.
Fake hostname!
46. Situational awareness
46
Send alerts about dev box resource usage to the developers using that box.
Why page ops because a developer used 90% of the disk?
47. Single source of truth
• DNS is canonical for sensu servers
• Configure things in one place!
47
One place can be DNS, or hiera, or whatever - but not multiple places.
DNS AND hiera sucks
48. Single source of truth
• DNS is canonical for sensu servers
• Configure things in one place!
48
puppet-netstdlib
structured facts
49. Automatic monitoring
• E.g. cron jobs - check successful recently!
• cron::d
49
There are a bunch of general patterns where you can automate monitoring.
Who hates ‘cron spam’?
We use a custom define which defaults to /dev/null
Check jobs completed successfully (with Sensu) - make JIRA tickets!
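One pattern for that (illustrative, not the actual cron::d implementation): each job touches a stamp file on success, and a Sensu check alerts when the stamp goes stale.

```python
import os
import time

def check_stamp_fresh(path, max_age_seconds, now=None):
    """CRITICAL if the job's success-stamp file is missing or stale;
    a handler can then open a JIRA ticket instead of paging anyone."""
    now = time.time() if now is None else now
    try:
        age = int(now - os.path.getmtime(path))
    except OSError:
        return 2, "CRITICAL: no success stamp at %s" % path
    if age > max_age_seconds:
        return 2, "CRITICAL: last success %ds ago (max %ds)" % (age, max_age_seconds)
    return 0, "OK: last success %ds ago" % age
```

The cron job itself stays silent (output to /dev/null, no cron spam); the monitoring only fires when the job actually stops succeeding.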
50. Automatic monitoring
• E.g. cron jobs - check successful recently!
• cron::d
50
Generic handling!
Annotations!
51. Generate monitoring_check
51
And under the hood this runs create_resources to generate monitoring_checks
create_resources is your friend!
52. User specified monitoring
52
This is a cunning one.
The check returns OK (assuming it can hit graphite), but also emits a bunch of additional check results to the local socket
53. User specified monitoring
• Data lives in the service config
• Next to the code to emit metrics!
53
This is awesome, as it reads our service configs.
Developers can add their own alerts.
54. User specified monitoring
• Simple checks for free!
54
This example is in ruby :)
55. User specified monitoring
• Data lives in the service config
• Next to the code to emit metrics
• Next to metadata about SLAs and LB timeouts
• Developers can push without OPS
55
Allowing developers to add their own monitoring is awesome.
Putting the config for the monitoring in their application codebase is awesome.
56. Cluster checks
• We’re working on this currently
• Assert some % of machines are healthy.
• Use to reduce alert noise.
• If a service becomes fully unavailable to clients,
you want to page someone.
• If one machine goes belly up, you don’t (make
a JIRA ticket for handling later!)
56
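The aggregation logic we're after can be sketched as follows (a hypothetical helper; the thresholds and severity mapping are illustrative):

```python
def cluster_status(host_results, min_healthy_pct=50):
    """Page (CRITICAL) only when too few hosts are healthy; a lone sick
    host is WARNING, which a handler can turn into a JIRA ticket."""
    total = len(host_results)
    healthy = sum(1 for status in host_results.values() if status == 0)
    pct = 100.0 * healthy / total
    if pct < min_healthy_pct:
        return 2, "CRITICAL: only %d/%d hosts healthy (%.0f%%)" % (healthy, total, pct)
    if healthy < total:
        return 1, "WARNING: %d/%d hosts unhealthy" % (total - healthy, total)
    return 0, "OK: all %d hosts healthy" % total

# One of three web hosts failing: noisy enough for a ticket, not a page
status, output = cluster_status({"web-1": 0, "web-2": 0, "web-3": 2})
```

The per-host results are already in the Sensu API, so a check like this can read them back out and emit a single aggregate result.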
57. WIP
• This is all still a work in progress.
• We’ve not 100% migrated off of Nagios
• Open sourcing the pieces
57
58. Thanks!
• Slides will be online shortly:
• slideshare.net/bobtfish
• @bobtfish
• Some (most?) of our code is open source:
• https://github.com/Yelp/sensu/commit/aa5c43c2fdfde5e8739952c0b8082000934f3ad2
• https://github.com/Yelp/puppet-monitoring_check
• https://github.com/Yelp/puppet-netstdlib
• https://github.com/Yelp/sensu_handlers
• https://github.com/Yelp/pysensu-yelp
58