Icinga at Hyves.nl
Jeffrey Lensen
System Engineer
Hyves
• Dutch social network website
• 3 billion pageviews / month
• 10M Dutch members (17M population)
• ~7M unique visitors / month (Comscore 09/2011)
• ~2.3M unique visitors / day
• 800,000 photo uploads / day
• 7M chat messages / day
• 6Gbps daily outgoing traffic
Hyves environment
• 3000 hosts running Gentoo
• 3 Datacenters
• 190 types of server functions
• 160 Employees
• System Engineering team: 12
• Developers: 45
Back in the day
• 1 Datacenter
• 150 servers
• 4 System Engineers
• 1 Nagios instance
• Manual configuration
Keep up with serverpark growth
• Popularity required expansion
• Receiving 100 - 200 servers at a time
• Manual configuration became unmanageable
Solutions to growth
• Templates for host and hostgroup configurations
• Servicechecks defined per hostgroup
• Automated configuration with scripts (hosts, hostgroups, servicedependencies)
• Server management database as source
• Servicedependencies generated based on check_name prefix
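The idea of generating host and hostgroup configuration from a server database can be sketched as follows. This is a minimal illustration, not the scripts used at Hyves: the record fields (`hostname`, `ip`, `function`), the sample data, and both templates are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows, shaped as they might come from a server management database.
SERVERS = [
    {"hostname": "web001", "ip": "10.0.0.1", "function": "webserver"},
    {"hostname": "web002", "ip": "10.0.0.2", "function": "webserver"},
    {"hostname": "db001", "ip": "10.0.1.1", "function": "database"},
]

# Assumed templates; "generic-host" is the template name used later in the slides.
HOST_TEMPLATE = """define host {{
    use        generic-host
    host_name  {hostname}
    address    {ip}
}}
"""

HOSTGROUP_TEMPLATE = """define hostgroup {{
    hostgroup_name  {name}
    members         {members}
}}
"""


def render_hosts(servers):
    """Render one host definition per server record."""
    return "".join(HOST_TEMPLATE.format(**server) for server in servers)


def render_hostgroups(servers):
    """Group hosts by function and render one hostgroup per function."""
    groups = defaultdict(list)
    for server in servers:
        groups[server["function"]].append(server["hostname"])
    return "".join(
        HOSTGROUP_TEMPLATE.format(name=name, members=",".join(sorted(members)))
        for name, members in sorted(groups.items())
    )
```

Because service checks are defined per hostgroup, adding a server to the database is then enough to give it the checks of its function.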
Keep up with more serverpark growth
• From 1 to 3 datacenters
• Serverpark grew to 1500 hosts
• 1 Nagios host isn’t enough anymore
Solutions to more growth
• Distributed Nagios setup consisting of:
• 1 central Nagios server for alerting and the web interface
• 9 distributed Nagios servers
• Required few changes to the configuration scripts
• Distribution based on location and function
Watching the watchers
• Monitoring Nagios hosts with Nagios on NOC
• NOC monitored by one of the Nagios hosts
• Monitoring all datacenters from HQ
Distributed Nagios scaling problems
• Long reloads due to large configuration (mainly on the central server)
• Freezes during large (network) outages -> No alerting!
• Web interface could no longer load
Icinga
• Switched in November 2010
• No more central monitoring server needed
• Standalone web interface
• Database backend
• API
• Rapid development
• Painless migration:
• sed -i 's/nagios/icinga/g' /etc/nagios/*cfg
• mv /etc/nagios/* /etc/icinga/
Icinga setup
• 12 Icinga hosts
• 1 NOC Icinga host
• 100,000 service checks
• 3,500 hosts
Icinga setup
• 2 Icinga-web + database hosts
• Load-balanced database and API
• Easy failover
Make use of the API: Overview checks
• Overview checks for hostgroups and services
• Minimizes alerts during large failures
• Python script using API
• Example:
python check_monitoring_overview.py --hostgroup webserver --service HTTP,HipHop -w 5% -c 10%
All 472 'HTTP', 'HipHop' services for 'mainweb' are OK
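The threshold logic behind such an overview check could look roughly like this. It is a sketch under assumptions, not the actual check_monitoring_overview.py: the function name, its arguments, and the wording of the non-OK message are invented here, and the service states are passed in directly instead of being fetched from the Icinga API.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_overview(states, warn_pct, crit_pct, services, hostgroup):
    """Return (exit_code, message) for a list of service states.

    states: one plugin state per service (0 = OK, anything else = problem).
    warn_pct / crit_pct: percentage of problem services that triggers
    WARNING / CRITICAL, like the -w and -c options in the example above.
    """
    total = len(states)
    names = ", ".join("'%s'" % name for name in services)
    if total == 0:
        return UNKNOWN, "No %s services found for '%s'" % (names, hostgroup)

    problems = sum(1 for state in states if state != OK)
    if problems == 0:
        return OK, "All %d %s services for '%s' are OK" % (total, names, hostgroup)

    pct = 100.0 * problems / total
    message = "%d of %d %s services for '%s' are not OK (%.1f%%)" % (
        problems, total, names, hostgroup, pct)
    if pct >= crit_pct:
        return CRITICAL, message
    if pct >= warn_pct:
        return WARNING, message
    return OK, message
```

A single overview service alerts once when a whole group degrades, instead of producing one alert per host.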
15
Missing monitoring
• Is everything that should be monitored actually being monitored?
• You won’t realize it until it’s too late
• Angry people...
16
Solution: Puppet
Puppet is an open-source next-generation server automation
tool. It is composed of a declarative language for expressing
system configuration, a client and server for distributing it, and a
library for realizing the configuration.
• Modules for each application (Nginx, Postfix, SNMP etc.)
• Roles based on function as set in server management database
• Everything is defined in Puppet
17
Example: Nginx module
class nginx {
  tag("nginx")
  package { "nginx":
    ensure   => "latest",
    category => "www-servers"
  }
  service { "nginx":
    enable => true,
    ensure => running
  }
}
18
Example: Role module
class role::webserver inherits role {
  include nginx
}
19
Using Puppet to generate configs
• Supports “Nagios” Exported Resources
• Exported Resources stored in MySQL backend
• Define nagios_services in the matching modules
20
Include monitoring in NGINX module
modules/nginx/manifests/init.pp:
class nginx {
  tag("nginx")
  <snip>
  @@nagios_service { "HTTP $hostname":
    service_description => "HTTP",
    check_command       => "check_web_http",
    event_handler       => "service_restart!nginx!CRITICAL",
    contact_groups      => "sysadmins"
  }
}
21
Predefine defaults in defines.pp
# Notifications follow the server status from the server management database
$__notifications_enabled = $systemstatus ? {
  operational => "1",
  fail        => "0"
}
# Resource defaults applied to every nagios_service
Nagios_service {
  ensure                => present,
  host_name             => "${hostname}.${domain}",
  use                   => "generic-service",
  notifications_enabled => $__notifications_enabled,
  target                => "/etc/icinga/puppetgenerated/services/$hostname.cfg",
  notes                 => $monitoringhost
}
Predefine defaults in defines.pp
# Resource defaults applied to every nagios_host
Nagios_host {
  ensure                => present,
  host_name             => "${hostname}.${domain}",
  hostgroups            => $role,
  use                   => "generic-host",
  alias                 => $hostname,
  notifications_enabled => $__notifications_enabled,
  target                => "/etc/icinga/puppetgenerated/hosts/$hostname.cfg",
  notes                 => $monitoringhost
}
Define host in monitoring module
modules/monitoring/manifests/init.pp:
class monitoring {
  @@nagios_host { "$hostname":
    address => $ip
  }
}
modules/role/manifests/init.pp:
class role {
  include monitoring
}
Retrieving resources
class icinga {
  tag("icinga")
  Nagios_host <<| notes == "$hostname" |>> {
    require => File["/etc/icinga/puppetgenerated/hosts"]
  }
  Nagios_service <<| notes == "$hostname" |>> {
    require => File["/etc/icinga/puppetgenerated/services"]
  }
}
Checking generated configuration
class icinga {
  <snip>
  # Validate the freshly generated configuration before using it
  exec { "verify new cfg":
    command => "/usr/bin/icinga -v /etc/icinga/verify-puppetgenerated.cfg",
    require => Class["get_icinga_puppet_resources"]
  }
  # Replace the live configuration only after verification succeeded
  exec { "mv cfgs":
    command => "rm -rf /etc/icinga/puppet/*; mv /etc/icinga/puppetgenerated/* /etc/icinga/puppet/",
    require => Exec["verify new cfg"]
  }
  # Ask Icinga to restart through its external command file
  exec { "restart icinga":
    command => "/usr/bin/printf '[] RESTART_PROGRAM\n' > /var/icinga/rw/icinga.cmd",
    require => [
      Exec["mv cfgs"],
      Service["icinga"]
    ]
  }
}
Problems exporting resources
• Puppet runs on Icinga hosts took between 10 and 30 minutes!
• Makes it hard to quickly change monitoring
• Most time spent retrieving and processing (Nagios) resources
get_icinga_puppet_resources.py
• Determined queries used by Puppet
• Get all resource IDs
• For each ID get parameter name and value
• Write to defined file (“target”)
• Finishes in 15 seconds!
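The steps above can be sketched as follows. This is not the real get_icinga_puppet_resources.py: the two-table schema (`resources`, `params`) is a simplified stand-in invented for the example, the actual storeconfigs schema in MySQL differs, and SQLite is used only so that the sketch is self-contained.

```python
import sqlite3
from collections import defaultdict


def fetch_resources(conn, restype):
    """Return {resource_id: {"title": ..., "params": {...}}} for one resource type."""
    resources = {}
    # Step 1: get all exported resource IDs (and titles) of the requested type.
    for res_id, title in conn.execute(
            "SELECT id, title FROM resources WHERE restype = ? AND exported = 1",
            (restype,)):
        resources[res_id] = {"title": title, "params": {}}
    # Step 2: for each ID, get the parameter names and values.
    for res_id, resource in resources.items():
        for name, value in conn.execute(
                "SELECT name, value FROM params WHERE resource_id = ?",
                (res_id,)):
            resource["params"][name] = value
    return resources


def render_by_target(resources, object_type):
    """Render object definitions, grouped by their "target" file path."""
    files = defaultdict(list)
    for resource in resources.values():
        params = dict(resource["params"])
        target = params.pop("target", None)
        params.pop("ensure", None)  # Puppet metaparameter, not an Icinga directive
        if target is None:
            continue  # no output file defined for this resource
        lines = ["define %s {" % object_type]
        lines += ["    %-24s %s" % (key, value)
                  for key, value in sorted(params.items())]
        lines.append("}")
        files[target].append("\n".join(lines) + "\n")
    return {target: "".join(chunks) for target, chunks in files.items()}
```

Writing each rendered string to its target path then produces the same per-host files that the Puppet collectors would have written.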
Retrieving resources ourselves
class icinga {
  <snip>
  exec { "get_icinga_puppet_resources":
    command => "/usr/bin/python /usr/local/bin/get_icinga_puppet_resources.py",
    require => [
      File["/etc/icinga/puppetgenerated/hosts"],
      File["/etc/icinga/puppetgenerated/services"]
    ]
  }
}
Other cool stuff to do with Puppet
• Generate daemon checks for servers based on config file
• Generate overview daemon checks using Icinga API
Retrieve daemons from config
modules/role/lib/facter/hyvesfacters.rb:
Facter.add("hyves_daemons") do
  # Default value for hosts without a daemon configuration
  daemonarray = ["None"]
  if File.exist?("/<path_to_config>/daemons.conf")
    daemonarray = []
    daemonconf = %x{grep name /<path_to_config>/daemons.conf}
    daemonconf.each_line do |daemon|
      # Strip everything up to and including " name:"
      daemonarray.push(daemon.sub(/.* name:/, '').chomp)
    end
  end
  setcode do
    daemonarray.uniq
  end
end
Create services for daemons
modules/daemons/manifests/init.pp:
class daemons {
  define add_daemon_check {
    @@nagios_service { "$name Daemon $hostname":
      use                 => "Daemon-check",
      service_description => "$name Daemon",
      check_command       => "check_daemon!$name"
    }
  }
  add_daemon_check { $hyves_daemons: }
}
Retrieving unique daemons from API
require 'net/http'
module Puppet::Parser::Functions
  newfunction(:get_daemons, :type => :rvalue, :docs => "
    This function returns an array of all current daemons, based on the Icinga API
  ") do |args|
    domain = "<icinga-web_url>"
    url = "/icinga-web/web/api/service/filter[AND(SERVICE_NAME%7Clike%7C*Daemon)]/columns[SERVICE_NAME]/order[SERVICE_NAME;ASC]/authkey=<api_key>/json"
    response = Net::HTTP.get_response(domain, url)
    data = response.body
    results = PSON.parse(data)
    daemons = Array.new
    results.each { |result|
      daemon = result['SERVICE_NAME']
      daemon.sub!(/ Daemon/, '')
      daemons << daemon
    }
    daemons.uniq
  end
end
Create overview services for daemons
modules/icinga/manifests/noc.pp:
$__daemons = get_daemons()
templatefile { "/etc/icinga/puppetgenerated/other/daemons.cfg":
  template => template("icinga/hyvesdaemons.cfg.erb")
}
hyvesdaemons.cfg.erb:
<% __daemons.each do |daemon| -%>
define service{
  use                 DaemonOverview-check
  host_name           daemons
  service_description <%= daemon %>
}
<% end -%>
Deployment
• Deploy script to start Puppet runs on all monitoring hosts
• Reports status of Puppet runs once they’re finished
• Starts Puppet run on NOC monitoring host
What if a machine doesn’t run Puppet?
• A check that verifies the monitoring configuration
• Retrieve all operational hosts from the server management DB
• Retrieve all hosts from Icinga API
• Alert if something is missing or notifications are off
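The comparison in this check comes down to a set difference. The following is a minimal sketch, not the check used at Hyves: both inputs are passed in as plain Python values, whereas the real check would query the server management DB and the Icinga API, and the function names and message format are assumptions.

```python
def find_monitoring_gaps(operational_hosts, icinga_hosts):
    """Compare the server database with what Icinga actually monitors.

    operational_hosts: iterable of host names marked operational in the database.
    icinga_hosts: dict mapping host name -> notifications enabled (bool).
    Returns (missing, muted): hosts unknown to Icinga, and hosts that are
    monitored but have notifications switched off.
    """
    expected = set(operational_hosts)
    known = set(icinga_hosts)
    missing = sorted(expected - known)
    muted = sorted(host for host in expected & known if not icinga_hosts[host])
    return missing, muted


def check(operational_hosts, icinga_hosts):
    """Return a plugin-style (exit_code, message) tuple."""
    missing, muted = find_monitoring_gaps(operational_hosts, icinga_hosts)
    if missing or muted:
        return 2, "CRITICAL: missing=%s notifications_off=%s" % (missing, muted)
    count = len(set(operational_hosts))
    return 0, "OK: all %d operational hosts are monitored" % count
```

Hosts that exist only in Icinga are ignored here, since the slide asks only about operational hosts that are missing or muted.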
What about failover?
• Requires a Puppet run on all servers
• Speed up Puppet “runs” with --noop
• Redeploy Icinga
ICL (Icinga CommandLine)
• Python based script
• Libraries for access to Icinga API and MK_Livestatus
• Library for things like translating exit codes and statuses
• See host/service status information
• Control monitoring and alerting
• Quickly see open problems
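A helper library of this kind might translate numeric states as shown below. ICL itself is not shown in the slides, so every function name here is hypothetical; only the numeric codes are the standard Nagios/Icinga plugin and host state values.

```python
# Plugin return codes as defined by the Nagios plugin guidelines.
SERVICE_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
# Host states as reported in Nagios/Icinga status data.
HOST_STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE"}


def service_state_name(code):
    """Translate a service state code into its name."""
    return SERVICE_STATES.get(code, "UNKNOWN")


def host_state_name(code):
    """Translate a host state code into its name."""
    return HOST_STATES.get(code, "UNKNOWN")


def open_problems(services):
    """Filter (host, description, state_code) tuples down to non-OK services."""
    return [
        (host, description, service_state_name(state))
        for host, description, state in services
        if state != 0
    ]
```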
Integration with other tools
• Integration with server administration script to change status
• Fail -> disable notifications
• Operational -> check if everything is OK + enable notifications
• Deprecated -> disable notifications + remove from Puppet DB
• Integration with failover scripts
• Deploy monitoring when adding new servers
• Scripts can check status of hosts and services before continuing
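One way a server administration script could toggle notifications on a status change is through Icinga's external command file. The slides do not say that Hyves did it this way (the Puppet defaults shown earlier set `notifications_enabled` from `$systemstatus`), so this is only a sketch of an alternative. The function name is hypothetical; the command names are standard Nagios/Icinga external commands, and the status names follow the slide above.

```python
import time


def notification_commands(host, new_status, now=None):
    """Return external command lines for a server status change.

    new_status: "operational", "fail" or "deprecated".
    now: optional Unix timestamp, mainly useful for testing.
    """
    timestamp = int(time.time() if now is None else now)
    if new_status == "operational":
        verbs = ["ENABLE_HOST_NOTIFICATIONS", "ENABLE_HOST_SVC_NOTIFICATIONS"]
    elif new_status in ("fail", "deprecated"):
        verbs = ["DISABLE_HOST_NOTIFICATIONS", "DISABLE_HOST_SVC_NOTIFICATIONS"]
    else:
        raise ValueError("unknown status: %r" % new_status)
    # External command format: [timestamp] COMMAND;argument
    return ["[%d] %s;%s\n" % (timestamp, verb, host) for verb in verbs]
```

Each returned line would be written to the command file, for example the /var/icinga/rw/icinga.cmd path used in the restart exec earlier in the deck.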
Demo time
Plans for the (near) future
• Upgrade Icinga to 1.6
• Clean up ICL and make compatible with Icinga 1.6
• Put ICL on GitHub
• Expose API to developers
• Trend analysis / integration with Ganglia/Graphite
Thank you, questions?
Puppet: http://puppetlabs.com/
Github: https://github.com/hyves-org/
Email: jeffrey@hyves.nl
Hyves: http://skyler.hyves.nl/
Twitter: @0skyler0

OSMC 2011 | Case Study - Icinga at Hyves.nl by Jeffrey Lensen
