NWC 2011

Monitoring a Cloud Infrastructure in a Multi-Region Topology
Nicolas Brousse
nicolas@tubemogul.com
September 29th 2011

2011 TubeMogul Incorporated All rights reserved.

Introduction - About the speaker
• My name is Nicolas Brousse
• I previously worked for many industry-leading companies in France
– From web hosting to online video services (Lycos, MultiMania, Kewego, MediaPlazza...)
– Heavy-traffic environments and large user databases
• I have worked as Lead Operations Engineer at TubeMogul.com since 2008
• I help TubeMogul scale its infrastructure
– From 20 servers to 500+ servers
– Using 4 Amazon EC2 regions + 1 colo
– Monitoring over 6,000 active services and 1,000 passive services with Nagios
– Collecting over 80,000 metrics with Ganglia
– Managing over 300 TB of data in Hadoop HDFS
– Billions of HTTP queries a day
• I occasionally contribute to open source projects
– Ganglia (PHP and Perl modules)
– PHP Judy

Introduction - About TubeMogul
• Created in November 2006 by John Hughes and Brett Wilson
• Formerly a video distribution and analytics platform
• Acquired Illuminex, a Flash analytics firm, in October 2008
• New platform called PlayTime™:
– TubeMogul is a video marketing company
– Built for branding
– Integrates real-time media buying, ad serving, targeting, optimization and brand measurement

TubeMogul simplifies the delivery of video ads and maximizes the impact of every dollar spent by brand marketers
http://www.tubemogul.com/company/about_us

Our Environment
• 10+ servers hosted at LiquidWeb
• A few VPS instances on Linode
• 500+ instances on Amazon EC2
– Over 50 different server configurations

• Our technology stack:
– Java, PHP
– Hadoop: HDFS, MapReduce, HBase, Hive
– Membase
– Memcache
– MySQL
– And more...

• Monitoring with Nagios
– Using NSCA when possible

• Graphing and Trending using Ganglia with Python plugins
– Some legacy servers using Munin

• Configuration Management using Puppet

Amazon Cloud Environment
• We like it because...
– We can quickly start new servers/clusters
– We can quickly start new servers/clusters in many regions
• US East (Virginia)
• US West (Northern California)
• Europe (Dublin)
• Asia Pacific (Tokyo & Singapore)
– We can use different types of instances (RAM, CPU, disks, etc.)
– It’s easy to automate with the EC2 API
– It’s easy to plug into a configuration management tool

• But...
– It can be hard to troubleshoot some failures or network problems
– We are occasionally notified of hardware failures after the fact
– No multicast (though possible with Amazon VPC)
– Bandwidth between regions can get expensive

What’s the plan?
• Our monitoring must be able to scale
• We need a better graphing/trending solution
• Our monitoring configuration must be automated
– How do we monitor a cluster whose number of servers changes every hour?
– How do we change configuration in multiple regions without missing something?

• A failure in one region shouldn’t impact other regions
• We want to be woken up only when it really matters
• We have limited resources
– Can’t spend big bucks on monitoring
– Small operations team

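Graphing, Trending...
[Diagram: Munin vs. Ganglia. Munin: munin-update and munin-graph pull data from the munin-nodes by sequential polling. Ganglia: Gmond nodes push metrics; Gmetad pulls from Gmond, and a Gmetad can in turn be pulled by another Gmetad.]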

Graphing, Trending...
• Why did we switch from Munin to Ganglia?
– Pretty much: pull vs. push
• The Munin server fetches data from the Munin clients (munin-nodes)
– This can quickly overload the Munin server in disk I/O and CPU
– Data is collected sequentially, so each poll is affected by the run time of the previous ones and by server load
• Ganglia clients send data to representative cluster nodes. The data is federated periodically by a Gmetad process.
– Lighter on the aggregation side
– Clients push data at a defined interval
– Thresholds can be used to send data only when it makes sense
» using time_threshold and value_threshold in the metric configuration
– Ganglia is designed for clusters and grids
• You can use multiple layers of gmond/gmetad processes
• You don’t need to manually add servers to your configuration
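
The effect of the two thresholds can be illustrated with a small simulation. This is a minimal sketch of the send decision, not gmond's actual code: it assumes a sample is pushed when the value has moved by more than value_threshold since the last send, or when time_threshold seconds have passed. The function names and the sample data are hypothetical.

```python
def should_send(now, value, last_sent_time, last_sent_value,
                time_threshold, value_threshold):
    """Return True when a collected sample should be pushed.

    Assumed semantics (a simplification of gmond's behaviour):
    - always send the first sample;
    - send when the value moved by more than value_threshold;
    - send when time_threshold seconds passed since the last send.
    """
    if last_sent_time is None:
        return True
    if abs(value - last_sent_value) > value_threshold:
        return True
    return now - last_sent_time >= time_threshold


def simulate(samples, time_threshold, value_threshold):
    """Return the (timestamp, value) samples that would be sent."""
    sent = []
    last_time, last_value = None, None
    for now, value in samples:
        if should_send(now, value, last_time, last_value,
                       time_threshold, value_threshold):
            sent.append((now, value))
            last_time, last_value = now, value
    return sent
```

With a stable metric, most collected samples are never sent, which is what keeps the aggregation side light compared with polling every node on every run.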

Monitoring with Nagios

Automating Nagios configuration
• Puppet configures our monitoring instance in each region
– We use Nagios regex matching: use_regexp_matching=1
– But we don’t use true regex matching: use_true_regexp_matching=0
– We run NSCA under Upstart
– We don’t use perfdata
– We include our configuration from 3 directories
- objects => templates, contacts, commands, event_handlers
- servers => contains a configuration file for each server
- clusters => contains a configuration file for each cluster
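
With use_regexp_matching=1 and use_true_regexp_matching=0, the Nagios Core documentation states that a directive value is treated as a regular expression only if it contains *, ?, + or \. ; any other value is matched literally. The sketch below mimics that rule in Python. It is an illustration only: the helper names and host names are hypothetical, and Python's re module is not identical to the regex engine Nagios uses.

```python
import re

# Markers that make Nagios treat a directive value as a regex when
# use_regexp_matching=1 and use_true_regexp_matching=0.
REGEX_MARKERS = ("*", "?", "+", "\\.")


def is_treated_as_regex(value):
    """True when the value contains one of the regex markers."""
    return any(marker in value for marker in REGEX_MARKERS)


def matching_hosts(directive_value, known_hosts):
    """Return the hosts a host_name-style directive value would select."""
    if is_treated_as_regex(directive_value):
        pattern = re.compile(directive_value)
        return [host for host in known_hosts if pattern.search(host)]
    # No marker present: plain, literal name match.
    return [host for host in known_hosts if host == directive_value]
```

This is what lets a single service definition cover every host of a cluster by name pattern, while ordinary host names keep their literal meaning.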

Automating Nagios configuration
Sequence of events when a new host starts and is added to our monitoring:
1. We start a new instance using Cerveza and Cloud-init
2. Puppet configures Gmond on the instance
3. Our monitoring server running Gmetad gets data from the new instance
4. A Nagios check runs every minute and looks for new hosts in Ganglia
5. If a new host is found, the check script rebuilds the Nagios config and reloads Nagios
6. If the config is corrupt, the check script sends a critical alert
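
Step 4 can be sketched as a comparison between the hosts Gmetad has written RRD files for and the hosts Nagios already has a config file for. This is a hypothetical illustration, not the actual check script: it assumes the usual Gmetad layout of `<rrd_root>/<cluster>/<host>/` directories and one `<host>.cfg` file per server; all function names and paths are assumptions.

```python
import os

SUMMARY_DIR = "__SummaryInfo__"  # Gmetad's aggregate directory, not a host


def hosts_in_ganglia(rrd_root):
    """Map each host seen by Gmetad to its cluster name.

    Assumes the layout <rrd_root>/<cluster>/<host>/*.rrd
    """
    hosts = {}
    for cluster in sorted(os.listdir(rrd_root)):
        cluster_dir = os.path.join(rrd_root, cluster)
        if cluster == SUMMARY_DIR or not os.path.isdir(cluster_dir):
            continue
        for host in sorted(os.listdir(cluster_dir)):
            host_dir = os.path.join(cluster_dir, host)
            if host != SUMMARY_DIR and os.path.isdir(host_dir):
                hosts[host] = cluster
    return hosts


def hosts_in_nagios(servers_dir):
    """Host names that already have a <host>.cfg file."""
    return {
        name[:-len(".cfg")]
        for name in os.listdir(servers_dir)
        if name.endswith(".cfg")
    }


def find_new_hosts(rrd_root, servers_dir):
    """Hosts reporting to Ganglia that Nagios does not know about yet."""
    known = hosts_in_nagios(servers_dir)
    return {
        host: cluster
        for host, cluster in hosts_in_ganglia(rrd_root).items()
        if host not in known
    }
```

A real check would then generate the missing config files, validate the result (for example with `nagios -v`) and only reload Nagios if validation passes, returning CRITICAL otherwise.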

Automating Nagios configuration
• Each server configuration is generated from a template
• Our Nagios plugin, “check_tm_clusters”, goes over the RRD files generated by Ganglia
• If a new host is found, it simply copies the template to the servers config directory and replaces the variables with values reported by Ganglia and found in DNS entries
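
The template step might look like the sketch below. The template content, the variable names and the `generic-host` parent are assumptions made for illustration; the slides do not show the real template or the internals of check_tm_clusters.

```python
import os
import socket
from string import Template

# Hypothetical host template; the real one is not shown in the slides.
HOST_TEMPLATE = Template("""\
define host {
    use         $template
    host_name   $host_name
    hostgroups  $cluster
    address     $address
}
""")


def resolve_address(host_name):
    """Look the host up in DNS, falling back to the name itself."""
    try:
        return socket.gethostbyname(host_name)
    except OSError:
        return host_name


def render_host_config(host_name, cluster, address, template="generic-host"):
    """Fill the template with values reported by Ganglia and DNS."""
    return HOST_TEMPLATE.substitute(
        template=template, host_name=host_name,
        cluster=cluster, address=address)


def write_host_config(servers_dir, host_name, cluster):
    """Write <servers_dir>/<host_name>.cfg and return its path."""
    path = os.path.join(servers_dir, host_name + ".cfg")
    config = render_host_config(host_name, cluster, resolve_address(host_name))
    with open(path, "w") as handle:
        handle.write(config)
    return path
```

Keeping one file per host in the servers directory means removing a host is just deleting its file before the next reload.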

Reducing noise and false positives
• We disable most notifications and only care about cluster status

• Most of our checks are based on Ganglia RRD files
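
Caring about cluster status rather than individual hosts amounts to collapsing many per-host results into a single check result. The following is a minimal sketch of that idea using the standard Nagios plugin exit codes; it is not the implementation of check_tm_clusters or check_cluster, and the thresholds and host names are hypothetical.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING",
               CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def cluster_status(host_states, warn_down, crit_down):
    """Collapse per-host states into one cluster-level result.

    host_states: dict of host name -> Nagios state code.
    warn_down / crit_down: number of non-OK hosts that triggers
    WARNING / CRITICAL for the whole cluster.
    Returns (exit_code, message) in the usual plugin style.
    """
    if not host_states:
        return UNKNOWN, "CLUSTER UNKNOWN: no hosts found"
    down = sorted(h for h, s in host_states.items() if s != OK)
    if len(down) >= crit_down:
        code = CRITICAL
    elif len(down) >= warn_down:
        code = WARNING
    else:
        code = OK
    message = "CLUSTER %s: %d/%d hosts not OK" % (
        STATE_NAMES[code], len(down), len(host_states))
    if down:
        message += " (" + ", ".join(down) + ")"
    return code, message
```

With thresholds like these, a single failed instance in a large cluster stays visible in the message without turning into an alert.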

Reducing noise and false positives
• It becomes really easy to monitor any metric returned by Ganglia

Reducing noise and false positives
• We can check cluster status by hosts/services, but also by returned messages!

Reducing noise and false positives
• We extensively use our “check_cluster” plugin
• We limit email notifications as much as possible
• We use a custom variable, _PAGING, to identify pageable services
• Paging ONLY on Critical alerts for services/hosts with _PAGING=yes
• Use different contacts and time periods to send alerts to the right person
• We use Nagios Checker for Firefox and Chrome
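
The paging rule above reduces to a two-condition test. The sketch below is an assumption about how a notification script could apply it; the function names and the alert dictionaries are hypothetical, and only the _PAGING variable and the "Critical only" rule come from the slide.

```python
def should_page(state, custom_vars):
    """Page only for CRITICAL results on objects flagged _PAGING=yes."""
    paging = custom_vars.get("_PAGING", "no")
    return state == "CRITICAL" and paging.strip().lower() == "yes"


def pageable_alerts(alerts):
    """Filter a list of alert dicts down to those that should page.

    Each alert is assumed to look like:
    {"name": ..., "state": ..., "custom_vars": {...}}
    """
    return [
        alert["name"]
        for alert in alerts
        if should_page(alert["state"], alert.get("custom_vars", {}))
    ]
```

Everything that does not pass this test can still be seen in the Nagios interface or a browser plugin, but it does not wake anyone up.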

Thank You...
TubeMogul is hiring!
http://www.tubemogul.com/company/careers
jobs@tubemogul.com

Follow us on Twitter
@tubemogul

@orieg