SENSE AND
SENSU-BILITY
Painless Metrics And Monitoring
In The Cloud with Sensu
Bethany Erskine
Velocity NYC 2013
http://github.com/skymob/sensu-tutorial

Monday, October 14, 13
BEFORE I BEGIN...
IF YOU DID NOT SET UP SENSU-TUTORIAL
BEFORE THE CLASS:
1. grab a USB key
2. follow the instructions on the README
If you don’t have a computer, no sweat!

DO YOU LOVE
YOUR
MONITORING
SETUP?
#MONITORINGLOVE

MY STORY

+

(╯︵╰,)

WHY SENSU
✓Ruby
✓Plugins can be written in any language
✓sensu-chef cookbook
✓community

WHY SENSU
✓re-use Nagios checks!
✓metrics and checks all collected by one system
✓Graphite integration
✓easy to scale

WHY SENSU

✓“Can I do X with Sensu?” probably!

WHY SENSU?
✓Sensu source is well-written and easy to parse
✓https://github.com/sensu

WHY SENSU?
✓sensu-community-plugins
✓80 contributors
✓over 600 plugins
✓https://github.com/sensu/sensu-community-plugins

TODAY at
PAPERLESS
Two Sensu environments (prod/testing)
~ 250 - 275 instances of sensu-client
4-6 Sensu-server instances
25k Metrics/Hour to Graphite
1 custom dashboard
1 custom CLI

RESOURCES
✓All of our Sensu infrastructure is virtualized.
✓We typically give a sensu-server box 1.5GB RAM and 2 processors, scaling up RAM for any box running more than one Sensu service on it.
✓4GB RAM for a monolithic Sensu install (Rabbit, Redis, all Sensu components on one)

AS WE GREW
Growing pains and lessons learned...

NEEDS MORE SENSU
✓High load on Sensu server
✓Backed-up queues in RabbitMQ
✓TIP: set up a check to monitor the RabbitMQ ready queue size; you'll want an email when the queue grows to about 10K and stays there

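One way to sketch the queue-size tip is shown below. It is only an illustration, not a plugin from the talk: the function name `queue_status`, the sample depths and the exact 10,000 threshold are assumptions, and a real check would read the ready-queue depth from RabbitMQ instead of a hard-coded list. The exit statuses follow the Nagios convention that Sensu checks also use (0 OK, 1 WARNING, 2 CRITICAL).

```python
# Sketch of a queue-depth check; threshold and sample values are illustrative.
OK, WARNING, CRITICAL = 0, 1, 2  # Nagios-style exit statuses, reused by Sensu checks


def queue_status(samples, threshold=10_000):
    """Classify recent ready-queue depths, listed oldest first.

    CRITICAL: every sample is above the threshold (the queue grew and stayed there).
    WARNING:  the newest sample is above the threshold but an earlier one was not.
    OK:       the newest sample is at or below the threshold.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if all(depth > threshold for depth in samples):
        return CRITICAL, f"CRITICAL: ready queue above {threshold} for {len(samples)} samples"
    if samples[-1] > threshold:
        return WARNING, f"WARNING: ready queue at {samples[-1]}, above {threshold}"
    return OK, f"OK: ready queue at {samples[-1]}"


# Hypothetical depths from three consecutive check runs.
status, message = queue_status([12_500, 11_900, 13_200])
print(status, message)
```

A real plugin would print the message and exit with the returned status so that Sensu can route the result to a handler.
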
HOW TO SCALE
✓Add more sensu-server instances
✓No special configuration needed
✓checks will be distributed in round-robin fashion to the sensu-servers

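The round-robin idea can be illustrated with a few lines of Python. This is a toy model of the distribution pattern only, not Sensu's implementation; the server names and check names are made up for the example.

```python
from itertools import cycle


def distribute(results, servers):
    """Hand each result to the next server in turn (round-robin)."""
    assignment = {server: [] for server in servers}
    for result, server in zip(results, cycle(servers)):
        assignment[server].append(result)
    return assignment


# Hypothetical names: five check results spread over two sensu-server instances.
plan = distribute(["disk", "swap", "zombies", "ro_fs", "cornbread"], ["server-1", "server-2"])
print(plan)
```

Adding a third name to the server list spreads the same work over three instances without touching the checks themselves, which is the property the slide relies on.
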
GRAPHITE PAINS
✓symptoms: backed up queues in RabbitMQ, spotty graphs
✓cluster couldn’t keep up with the large amount of metrics we were now serving it via AMQP

GRAPHITE PAINS
✓Solution: stop collecting metrics every 10 seconds (excessive!)
✓moved staging metrics to staging Graphite cluster
✓Moved prod Graphite cluster to SSD

THE MIGRATION
or, How To Quit Nagios in Ten Easy Steps

STEP 1: NUKE AND
PAVE

STEP 2: PLAN
METRICS AND MONITORING SURVEY

STEP 3: DEFINE GLOBALS
✓CHECKS: must be actionable!
✓METRICS: go nuts
✓HANDLERS: EMAIL for everything initially, added Pagerduty later.

OUR GLOBALS
✓CHECKS: disk usage, swap usage, zombie processes, RO filesystems
✓METRICS: vmstat, disk usage, cpu, memory, interface and disk perf
✓HANDLERS: Email, Campfire, Pagerduty

STEP 4: DEFINE SPECIFICS
✓For each server role, define additional states to be checked and alerted on:
✓Process Checks
✓System Checks
✓Service Checks
✓Service Metrics

STEP 5: SET UP A PLACE TO TEST
✓Set up a permanent testing Sensu stack using your CM tool of choice
✓we used sensu-chef cookbook

STEP 6: SET A WORKFLOW
✓Develop and document a workflow for implementing, testing, deploying and signing off on checks
✓You’ll get the best coverage if anyone (developers or ops) can easily add checks and metrics to Sensu

EXAMPLE WORKFLOW
✓add new sensu_check definitions to the appropriate cookbook in Chef
✓deploy new check to staging env using Chef
✓Pull Request with sample graphs or alerts
✓Code Review from colleague
✓Deploy to Prod

SENSU IN CHEF

STEP 7: EXECUTE WORKFLOW
✓Starting with the low-hanging fruit (plugins that already existed in sensu-community-plugins repository), configure and deploy each check in the worksheet to the testing Sensu server
✓deploy sensu-client to a few select machines

STEP 8: WATCH THE WATCHER
✓Set up some bare-minimum 3rd party monitoring for the Sensu servers
✓We use Panopta’s agent to check for aliveness, disk usage and CPU usage.

MONITOR THE MONITOR
✓Other ideas: have Testing Sensu monitor Prod Sensu
✓Sensu can collect metrics about itself

STEP 9: ROLLOUT
✓Deploy your Production server infrastructure
✓Roll out the client and checks to the rest of your prod environments.

STEP 10: TUNE
✓Expect to need to tune thresholds and alert occurrences.
✓Laissez le bon alertes roulent!

SENSU
ARCHITECTURE

OMNIBUS
INSTALLER
is awesome

LET’S PLAY WITH
SENSU
If you haven’t been able to get your
sandboxes up and running,
please pair with someone near you.

SANDBOX GOALS
✓Get familiar with Sensu configuration
✓Install a Handler
✓Deploy a check
✓Trigger an alert on that check
✓Give you something to take home and hack on

OOPS
If you mess anything up:
vagrant halt; vagrant up
Worst case:
vagrant destroy; vagrant up

TWO
VIRTUALBOXES
Sensu-Server and Sensu-Client
Vagrant/Chef
Centos 6.4
Sensu Version 0.10.2

SENSU CONFIGURATION
✓Please open up a terminal and SSH into both your sensu-server and sensu-client VMs
✓sudo su
✓cd /etc/sensu

SENSU CONFIGURATION
✓/etc/sensu/config.json - config for redis, rabbitmq, api and dashboard
✓/etc/sensu/conf.d/ - checks go here
✓/etc/sensu/conf.d/client.json - client configuration, subscriptions
✓/etc/sensu/{extensions|handlers|mutators|plugins}

TRIGGER AN
ALERT!
On sensu-client:
service sensu-client stop

CHECK YOUR DASHBOARD
✓Open a web browser and go to http://10.254.254.10:8080
✓username: admin / password: secret

HANDLERS
✓A HANDLER takes action on an event using a pipe, TCP, UDP, AMQP, or a set of other handlers
✓Examples: send an email, send event to Pagerduty, send metrics to Graphite
✓Default is “debug”

HANDLER EXAMPLES
✓BASIC: send an email to ops@
✓ADVANCED: attempt to remediate the alert (e.g. run a custom script that spins up additional ec2 instances)

HANDLERS
✓Let’s configure an EMAIL handler to send an informative email for an event.
✓/etc/sensu/handlers/mailer.rb plugin is installed for you, we just need to configure and install it

CONFIGURE THE PLUGIN
ON SENSU SERVER:
vim /etc/sensu/conf.d/handlers/mailer.json
{
  "mailer": {
    "mail_from": "sensu@you.com",
    "mail_to": "you@yourdomain.com"
  }
}

CONFIGURE THE HANDLER
cp /etc/sensu/conf.d/handlers/default.json /etc/sensu/conf.d/handlers/email.json
vim /etc/sensu/conf.d/handlers/email.json

EMAIL.JSON
{
  "handlers": {
    "email": {
      "type": "pipe",
      "command": "/etc/sensu/handlers/mailer.rb"
    }
  }
}

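A "pipe" handler is simply a command that receives the event as JSON on STDIN. The sketch below shows that contract in Python rather than in the Ruby of mailer.rb; it is not the mailer plugin. The field names it reads (`client.name`, `check.name`, `check.status`, `check.output`) are assumptions about the event shape, and the sample event is invented for the demo.

```python
import io
import json


def summarize(event):
    """Build a one-line subject from a Sensu event dict.

    Assumes the event carries client.name, check.name, check.status and
    check.output; adjust to the event shape your Sensu version emits.
    """
    labels = {0: "OK", 1: "WARNING", 2: "CRITICAL"}
    check = event["check"]
    level = labels.get(check["status"], "UNKNOWN")
    return f"{level}: {event['client']['name']}/{check['name']} - {check['output'].strip()}"


def handle(stream):
    """Read one JSON event from `stream` (STDIN in a real handler) and summarize it."""
    return summarize(json.load(stream))


# Hypothetical event, fed through an in-memory stream instead of STDIN.
sample = io.StringIO(json.dumps({
    "client": {"name": "sensu-client"},
    "check": {"name": "cornbread_process", "status": 1,
              "output": "no cornbread processes found\n"},
}))
print(handle(sample))
```

A real handler would call `handle(sys.stdin)` and then send the summary somewhere useful, such as an email or a chat room.
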
CHECK GEM
DEPENDENCIES
/opt/sensu/embedded/bin/gem list | grep mail

FIX PERMISSIONS

chown -R .sensu /etc/sensu/conf.d/

RESTART SERVICES
service sensu-server restart
tail -100 /var/log/sensu/sensu-server.log | grep mail

CHECKS
✓Sensu-client runs CHECKS that are defined and scheduled either locally (standalone) or on the sensu-server (subscription).
✓A CHECK sends a RESULT as an EVENT to a HANDLER - this applies to anything - service checks, metrics, etc

CHECK EXECUTION
✓Either scheduled by the server (subscription) or scheduled by the client (standalone)
✓Today we will configure a subscription-based check on the server that will run on our client

LET’S CONFIGURE A CHECK
✓Use check-procs.rb to make sure at least one instance of cornbread is running

DETERMINE OUR CHECK COMMAND
On your SENSU CLIENT:
/opt/sensu/embedded/bin/ruby /etc/sensu/plugins/check-procs.rb -p cornbread -W1

INSTALL OUR CHECK
✓On your SENSU SERVER:
✓vim /etc/sensu/conf.d/checks/cornbread_process.json

CORNBREAD_PROCESS.JSON

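The contents of this slide did not survive extraction, so the following is a guess at the general shape of a subscription check definition rather than the file used in the tutorial. The subscription name `all`, the 60-second interval and the choice of the `email` handler are assumptions; the command is the one determined on the client in the previous step.

```python
import json


def build_check(name, command, subscribers, interval=60, handlers=("email",)):
    """Return a Sensu-style check definition as a plain dict."""
    return {
        "checks": {
            name: {
                "command": command,
                "subscribers": list(subscribers),  # which client subscriptions run it
                "interval": interval,              # seconds between runs (assumed value)
                "handlers": list(handlers),        # handlers that receive its events
            }
        }
    }


definition = build_check(
    "cornbread_process",
    "/opt/sensu/embedded/bin/ruby /etc/sensu/plugins/check-procs.rb -p cornbread -W1",
    subscribers=["all"],  # hypothetical subscription name
)
print(json.dumps(definition, indent=2))
```

The printed JSON is what would go into /etc/sensu/conf.d/checks/cornbread_process.json, once the subscription name matches one listed in the client's client.json.
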
RESTART SERVICES
service sensu-server restart
tail -100 /var/log/sensu/sensu-server.log | grep cornbread

CHECK YOUR
DASHBOARD

CHECK YOUR
EMAIL

SENSU API
✓REST API
✓HTTP/4567
✓on SENSU SERVER try:
curl -l http://localhost:4567/events | python -mjson.tool

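The same request can be made from a script, which is roughly how a custom CLI or dashboard would talk to the API. This is a minimal sketch using only the standard library; the helper names are made up, and because the event layout differs between Sensu versions, `status_of` accepts either a nested `check.status` or a top-level `status` (both are assumptions to verify against your own API output).

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_events(base_url="http://localhost:4567"):
    """GET /events from the Sensu API and return the decoded JSON list."""
    with urlopen(f"{base_url}/events", timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def status_of(event):
    """Return the numeric status, whether it is nested under 'check' or top-level."""
    check = event.get("check")
    if isinstance(check, dict) and "status" in check:
        return check["status"]
    return event.get("status")


def count_by_status(events):
    """Count events per status code, e.g. how many are WARNING (1) or CRITICAL (2)."""
    return Counter(status_of(event) for event in events)


# Hypothetical events in both shapes; call fetch_events() against a live server instead.
demo = [
    {"client": "sensu-client", "check": "keepalive", "status": 2},
    {"client": {"name": "sensu-client"}, "check": {"name": "cornbread_process", "status": 1}},
]
print(count_by_status(demo))
```
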
SENSU SERVICES
✓Sensu API
✓Sensu Server
✓Sensu Client
✓Sensu Dashboard

EVERYTHING OK?
✓/etc/init.d/sensu-service {client|server|api|dashboard} {start|stop|status|restart}
✓ps -ef | grep sensu
✓tail -f /var/log/sensu/*.log
✓curl -l localhost:4567/info

COOL SENSU
TRICKS

SEND DIRECTLY TO SENSU
netcat to: 127.0.0.1:3030

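The local client socket accepts a small JSON document describing a check result, so anything that can open a TCP connection can report to Sensu. The sketch below does from Python what the netcat one-liner does from a shell; the check name `backup_job` and its output text are invented, and `send_result` assumes a sensu-client is listening on the default local socket.

```python
import json
import socket


def build_result(name, output, status):
    """Serialize a check result; status uses the usual 0/1/2/3 exit-code meanings."""
    if status not in (0, 1, 2, 3):
        raise ValueError("status must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    return json.dumps({"name": name, "output": output, "status": status})


def send_result(payload, host="127.0.0.1", port=3030):
    """Write the JSON payload to the sensu-client socket on this machine."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload.encode("utf-8"))


# Hypothetical result from a nightly job; uncomment the send on a box running sensu-client.
payload = build_result("backup_job", "backup finished", 0)
print(payload)
# send_result(payload)
```

This is handy for cron jobs and application code that want to raise an event without being scheduled as a check.
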
AGGREGATE ALERTS
✓Alert when X% of checks are not OK
✓Handy for preventing alert floods

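The arithmetic behind an aggregate alert is small enough to show directly. This sketch only illustrates the idea of a percentage threshold; it is not how Sensu computes aggregates, and the status lists and the 50% threshold are example values.

```python
def percent_not_ok(statuses):
    """Percentage of check results whose status is anything other than 0 (OK)."""
    if not statuses:
        return 0.0
    bad = sum(1 for status in statuses if status != 0)
    return 100.0 * bad / len(statuses)


def should_alert(statuses, threshold_percent):
    """Alert once the share of non-OK results reaches the threshold."""
    return percent_not_ok(statuses) >= threshold_percent


# One failing web node out of four stays quiet; two out of four raises a single alert.
print(should_alert([0, 0, 0, 2], threshold_percent=50))
print(should_alert([2, 2, 0, 0], threshold_percent=50))
```

The point of the design is that a fleet-wide problem produces one alert for the group instead of one per machine.
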
MY SENSU TIPS
✓install the RabbitMQ management web interface and bookmark it (see http://10.254.254.10:15672/#/ )
✓lock your plugins’ gem dependency versions

TIPS TIPS TIPS
✓have alternate ways to access your Dashboard information
✓we integrated our command-line developer tools with Sensu API
✓we also created our own Ops dashboard that queries Sensu, Graphite and our app for data

MORE TIPS
✓Put NGINX in front of sensu-dashboard

HA SENSU
✓Redundancy is easy (bring up more sensu-servers)
✓Making Redis and RabbitMQ HA is more challenging
✓We’re still running one solitary Redis and RabbitMQ but are OK with this risk for now

WHERE TO GO FOR HELP
✓IRC: #sensu - freenode
✓sensu-users mailing list
✓http://docs.sensuapp.org

QUESTIONS

THANK YOU
bethany@paperlesspost.com
@skymob - twitter
robotwitharose - #sensu on IRC (freenode)


Sense and Sensu-bility: Painless Metrics And Monitoring In The Cloud with Sensu

Editor's Notes

  1. I’m curious: how many Nagios users do we have here? Anyone here running Sensu in production? How many people here are happy with their monitoring system? And how many of you are here today because you think monitoring sucks?
  2. So, the reason I’m here today is because I can safely say that I LOVE monitoring. But it hasn’t always been this way...
  3. In 2011 I joined Paperless Post as the second ever member of the Operations team. Our new small team faced many challenges, one of which was fixing our monitoring infrastructure’s sad state. We had one monolithic Nagios server with no version control and no configuration management. Every time we needed to add hosts, services, etc, we manually edited the files and restarted the Nagios server. Even though I had years of experience living with Nagios and could write configs in my sleep, I dreaded ever having to add new hosts or checks.
  4. Our metrics collection setup was in even worse shape. Munin was deployed for a handful of servers, but was so awkward to work with it’d all but been abandoned. We were sending data directly from our Rails app to Graphite, but no server system metrics were making it there at all. This was no way to be. But I don’t want to spent all morning telling you how much Nagios sucks, let me tell you a little monitoring love story about Paperless Post and Sensu.
  5. In the fall of 2012 we’d outgrown our old managed hosting service and found a new provider and were making preparations to move to a new datacenter. At this point our entire infrastructure with the exception of Nagios were managed by Chef, and our plan was to bring up the new datacenter infrastructure entirely using Chef. We saw an opportunity to start fresh and explored our options, and quickly fell in love with Sensu
  6. Sensu is Ruby, which we know and love at Paperless. Although the Sensu components are written in Ruby, checks and plugins can be written in any language. There was already a fully-featured Chef recipe for Sensu - in fact, Sensu was designed with configuration management in mind. We saw an opportunity to get involved with a young project that we could potentially contribute back to. The sensu-community-plugins were my first real open-source contributions, and after nearly two years with it, I feel strongly enough about the project to keep supporting it in any way I can, which is why I’m here today.
  7. Because we were on a tight deadline to deploy Sensu, the prospect of re-using existing Nagios checks appealed to me, with the option of re-writing them in Ruby using the Sensu plugin libraries later on down the line. Metrics and checks are all handled by one system: we were fully sold on being able to gather metrics using the same client that ran our health checks, and were excited about the prospect of seeing our system-level metrics on the same system as our application-level metrics via Graphite/Graphiti. Sensu had the potential to scale easily, something we’d end up needing to call on later.
  8. Sensu is an incredibly flexible tool. I’ve yet to come up with a device or situation that couldn’t somehow be handled by Sensu. It’s sometimes referred to as a monitoring “router”, which is a very accurate description. It can handle any input and pass it off to any other script, system, or handler that you want.
  9. Sensu LOVES the cloud and deals beautifully with ephemeral machine environments. We simply added an API call to our devtools so that deleting a node is as simple as saying `pp sensu delete_client foo`. This command can also be run from Jenkins or even theoretically from a client node itself before shutting down. We're able to silence entire environments at a time using one simple command: `pp sensu silence production` collects all production nodes from Chef and then silences them using the Sensu API.
  10. Most use the sensu-plugin gem and are written in Ruby, but all languages are welcome.
  11. A little about our Sensu setup at Paperless Post. We have two Sensu environments: production and testing. Production runs 3, sometimes 4 instances of sensu-server, and testing 1, sometimes 2. We do not have this elasticity automated, but I’ll touch a little later on how we know when to scale out by adding another sensu-server to the cluster. We’re pushing 25K metrics per hour through Sensu to our Graphite cluster using Sensu’s AMQP handler.
  12. Overall our transition from Nagios to Sensu was incredibly smooth. But as we grew there were of course problems here and there...
  13. Initially we’d deployed a single Sensu server to handle all of production, but it became obvious it was time to scale when we saw some of these symptoms: high load on the Sensu server and backed-up queues in RabbitMQ. We have a Sensu check set up to alert us if the RabbitMQ queue size grows over 10K messages and stays there for longer than five minutes.
  14. How do you scale? It’s as simple as bootstrapping another sensu-server. In our case, that means applying the Chef role[sensu-server], which brings up a box running just sensu-server (no API or Dashboard). No other special configuration is needed: just use the same config as the rest of the environment, and checks will be distributed in a round-robin fashion to your Sensu servers.
  15. The only major pains we experienced with Sensu have been related to Graphite. We started seeing backed-up queues in RabbitMQ and spotty graphs. Throwing more sensu-servers at the problem didn’t help in this case, and it turns out that our Graphite cluster just couldn’t keep up with the large volume of metrics we were now sending it via AMQP. AMQP works, but in some ways isn’t ideal - in our case, AMQP bypasses carbon-relay and thus the replication schema, and sends every metric to every cluster node, which is overkill for a six-node cluster.
  16. We experimented with writing our own consumer, but ended up with the following solution: we stopped collecting metrics every 10 seconds (which was overkill anyway), and moved our staging metrics off of the production Graphite cluster and onto their own staging Graphite cluster. We then moved the production Graphite cluster’s VMs onto SSDs. In fact, I spent most of last week writing scripts to migrate Whisper files off of a six-node VMware Graphite cluster onto a 2-node dedicated hardware cluster w/ SSDs.
  17. Now, I want to tell you our tried and true method for a successful and happy transition from Nagios, or your monitoring system of choice, to Sensu.
  18. There is a lot of talk in the Ops community about Alert Fatigue, and moving to a new monitoring system is a golden opportunity to clear your slate, clean up your alerts and determine what you REALLY care about. Also, because of differences in the way each monitoring system implements checks, it usually makes sense to just start from scratch rather than try to port existing check schemas over to a new system. This is a great opportunity to stop sending emails for things that don't matter - do you really need an email every time your CPU is pegged? Probably not.
  19. The Metrics and Monitoring planning spreadsheet is a tool we used to survey all of our servers and determine what needed to be gathered and monitored.
  20. I’ve shared this document with you on my Github in the “sensu-tutorial” repository. This spreadsheet contains a column for ... Example:
  21. DETERMINE YOUR BASELINE - For the ‘base’ role we made a list of things we wanted to know about every single machine. Our criterion for a CHECK is that it must be actionable. IF it’s something we want to know but don’t necessarily need to act on, make it a METRIC.
  22. For CHECKS: disk usage, swap usage, zombie processes, RO filesystems. For METRICS, we gather vmstat, disk usage, CPU, memory, interface and disk performance metrics on every machine. For HANDLERS, we chose email for everything initially, then added PagerDuty later for only the most critical, must-wake-up-at-3am type alerts. We have a dedicated room in Campfire for receiving Sensu alerts.
  23. DEFINE SPECIFICS - For each role (in our case, Chef roles, but could be any machine, device or server role), we gathered the following: Process Checks (at least 4 Unicorn workers should be running but no more than 20); System Checks (anything beyond our baseline system checks - say maybe we want to check for RO mounts only on servers that actually mount something); Service Checks (database locks, database connections, HTTP response); Service Metrics (haproxy bytes in/out); Other.
  24. SET UP A TESTING ENVIRONMENT: This will get you familiar with deploying and administering Sensu. I strongly recommend having a permanent place to test all of your Sensu checks and configuration changes using your CM tool of choice. It can be dual purpose and serve your staging environments, and is a good place to test things like Sensu package upgrades. We set up a testing Sensu infrastructure in the old datacenter, deploying with the sensu-chef cookbook, which we customized as needed.
  25. Develop a workflow for implementing, testing, deploying and signing off on checks. You’ll get the best check coverage if anyone on your team (developers, ops) can easily add checks or metrics to Sensu.
  26. Our workflow at Paperless Post: using Chef (which we’re deploying using our devtools with the help of Jenkins), we develop and deploy our checks to the testing environment. We then open a pull request, including any notes about how we tested, along with sample metrics graphs or outputs. We have a colleague do a quick code review and approve that pull request, then we deploy to prod.
  27. Now the fun part: START DEPLOYING CHECKS! Starting with the low-hanging fruit (checks that use plugins that already exist in the sensu-community-plugins repository), start deploying each check that you defined in the worksheet to the testing Sensu server. If a suitable plugin didn’t already exist in sensu-community-plugins, we had two choices: 1) re-use a Nagios check or 2) write our own in Ruby or Bash.
  28. Monitor your monitoring system! This should be self-explanatory. Set up some bare-minimum 3rd party monitoring for the Sensu servers themselves so you’ll know if the VM goes completely down (this has not yet happened to us!) or runs out of disk space.
  29. We use Panopta’s agent-based monitor to check for aliveness, disk usage and CPU usage.
  30. Other ideas: have your testing Sensu setup monitor production Sensu. Sensu can collect metrics about itself so there’s no need for a 3rd party system there.
  31. This step is simple: Deploy your now well-tested server infrastructure using your now well-tested Configuration Management recipes. This should go smoothly because you’ve had plenty of practice rolling out and administering your testing setup as well as all of your checks. First you’ll want to stand up the production Sensu server stack, then you’ll roll out sensu-client to the rest of your production servers or VMs.
  32. Let the alerts roll in! You’ll likely need to tune thresholds, alert occurrences, etc. once you have your checks running against actual production traffic.
  33. Quick overview of the Sensu architecture and how it’s deployed on your VirtualBoxes. Sensu uses RabbitMQ for all communication between the client and the server. RabbitMQ and Redis are both running on your sensu-server VM, as well as the Dashboard (not pictured here), the API, and the Server. Redis is used to persist data for use by the API.
  34. The Sensu package contains all of its dependencies in an "omnibus" installer, meaning it embeds everything it needs into /opt/sensu. This is great because you don’t need to worry about whether your system Ruby is going to work with it, and you don’t even need to install a system-wide Ruby if you don’t need it.
  35. BREAK HERE if needed :)
  36. A little background on the Sandbox. I used Vagrant and Chef to bring up these boxes. The original Vagrantfile will be available online for you. I didn’t want to spend too much time showing you how to deploy Sensu with Chef because I didn’t want to give the impression that Chef is your only option for deploying Sensu. However, if you are already familiar with Chef, you can check my sensu-tutorial github to see (and use) the recipes used to build these boxes. Today we’re going to do some hand-configuration, just for you to get familiar with how Checks and Handlers work, but in reality, you’d be using your configuration management system of choice to deploy all of these.
  37. If you open config.json on both the sensu-server and sensu-client VMs, you’ll see they are exactly the same.
  38. Let’s jump right in and trigger an alert! By default, a Keepalive warning alert will be raised if the server doesn’t hear from the client after 120 seconds; the critical threshold is 180 seconds. This is tunable on a per-client basis.
  39. A handler is what takes action on an event, basically how the alert reaches a human. All events are displayed on the Dashboard, regardless of handler. A handler can send events through a pipe, TCP, UDP, or AMQP, or pass them to a set of other handlers.
  40. So let’s configure a handler to send an email notification out for an event. I went ahead and installed the `mailer.rb` plugin and gem deps for you. Make sure you are on the server for all of the following config steps.
  41. Now let’s install the handler. Let’s use the ‘default’ handler config as a template, and copy it over to email.json.
  42. I’ve actually already installed the `mail` gem dependency for you, which you can see by issuing the above command.
  43. Now we need to set up a check to use the handler we just set up.
  44. If you want to try this on your Sensu sandbox, you’ll need to `yum install nc`. Please don’t all try this right now :)
  45. guest/guest
  46. Put Nginx in front of sensu-dashboard: the Sensu dashboard runs on port 8080 and requires authentication, neither of which is yet configurable. We resolved this minor annoyance by running Nginx in front of the dashboard, proxying to 8080 and injecting authentication headers into Sensu so we don’t need to log in when viewing Sensu on our VPN.
  47. Making sensu-server redundant is easy - all you need to do is bring up more instances of sensu-server - but scaling out and making Redis and RabbitMQ highly available can be more challenging from an operational perspective. At Paperless, we are still running one solitary Redis instance for Sensu, but are comfortable with this because a) bootstrapping a new one with Chef would be trivial, b) the data it contains is not mission critical and could be easily re-generated, and c) we’ve had zero performance or stability issues with it thus far. Because RabbitMQ is a mission-critical piece of Sensu, we would like to, at some point, separate out Rabbit into a cluster with one disk node and one RAM node with HAProxy in front. However, I’ve never quite been able to get HAProxy tuned to Sensu’s liking. When and if I do, expect a blog post. If anyone here has experience running RabbitMQ clusters, I’d love to hear from you!
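
As a rough illustration of two thresholds mentioned in these notes (the Unicorn worker range from note 23 and the keepalive defaults from note 38), here is a minimal Ruby sketch of the decision logic a check might use. It is not code from the talk or from sensu-community-plugins. The method names, the choice to treat both worker-count violations as critical, and the handling of the exact boundary values are all assumptions made for illustration. The numeric statuses follow the Nagios exit-code convention that Sensu checks reuse (0 OK, 1 warning, 2 critical).

```ruby
# Statuses use the Nagios exit-code convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.

# Hypothetical helper: classify a Unicorn worker count against the
# "at least 4, no more than 20" rule. Treating both violations as
# CRITICAL is an assumption, not something the notes specify.
def unicorn_worker_status(count, min: 4, max: 20)
  if count < min
    [2, "CRITICAL: #{count} Unicorn workers running, expected at least #{min}"]
  elsif count > max
    [2, "CRITICAL: #{count} Unicorn workers running, expected at most #{max}"]
  else
    [0, "OK: #{count} Unicorn workers running"]
  end
end

# Hypothetical helper: classify how long ago a client last sent a keepalive,
# using the defaults quoted in the notes (120s warning, 180s critical).
# Whether the boundary values themselves trigger is an assumption.
def keepalive_status(seconds_since_last, warning: 120, critical: 180)
  if seconds_since_last >= critical
    [2, "CRITICAL: no keepalive for #{seconds_since_last} seconds"]
  elsif seconds_since_last >= warning
    [1, "WARNING: no keepalive for #{seconds_since_last} seconds"]
  else
    [0, "OK: keepalive received #{seconds_since_last} seconds ago"]
  end
end

status, message = unicorn_worker_status(3)
puts "#{status} #{message}"
# prints "2 CRITICAL: 3 Unicorn workers running, expected at least 4"
```

A real check script would print the message and then exit with the status, so that the client can report the result to the server. Keeping the classification in a pure method like this makes it easy to exercise in a testing environment before the check is deployed.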