Monitoring Best Practices
We often get asked about suggested practices for monitoring
servers and it’s a legitimate request – there are so many moving
parts it’s hard to know where to start. There are two things you
want your monitoring to do for you:
• Watch systems 24×7 and alert you if there is a problem
• Show you current and historical data (usually charts) to help you get a feel for overall
health and future needs
All of the suggestions below are for the general case. There are definitely specific situations
where one or more recommendations won’t apply (high memory usage may be desirable on a
database server, for example), so weigh each recommendation against your own situation.
Alerting
For alerting, it’s a good idea to think of what issues are absolutely
critical and must be handled now (corporate web site is down) vs
things that need attention, but can wait a bit (disk space is under
10% free).
For critical alerts, email is a start, but probably not enough. You want a pager or phone to beep
at someone to get their attention. SMS texts, iPhone push notifications, etc. would be a good
idea.
Non-critical alerts can go to email. Sometimes emails get deleted or forgotten, so it’s a good
idea to have some sort of reminder or event escalation (where alerts get sent higher up the
chain of command the longer the issue is left unresolved).
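
The split between critical and non-critical alerts, plus escalation, can be sketched roughly as below. This is only an illustration: the channel names, the escalation chain, and the delays are made-up assumptions, not settings from any particular monitoring product.

```python
# Hypothetical escalation chain: (minutes the alert has gone unresolved, who is notified).
# The roles and delays are illustrative assumptions only.
ESCALATION_CHAIN = [
    (0, "on-call admin"),
    (30, "team lead"),
    (120, "IT manager"),
]


def channels_for(severity):
    """Critical alerts should reach a phone; everything else goes to email only."""
    if severity == "critical":
        return ["email", "sms"]
    return ["email"]


def recipients_for(minutes_unresolved):
    """Return everyone in the chain whose escalation delay has already elapsed."""
    return [who for delay, who in ESCALATION_CHAIN if minutes_unresolved >= delay]
```

For example, `recipients_for(45)` would include the on-call admin and the team lead, but not yet the IT manager.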
Basic Monitoring
The core of most monitoring products is ping, CPU, memory, disk, and web pages, so we’ll start
there.
Ping
All of the rest of the monitoring isn’t worth much if the server or device isn’t up and running.
Pinging fairly often (at least once a minute) helps you stay on top of problems as they happen.
The trick is to not get hit with a lot of false-positives, which can easily happen on a busy
network. So make sure you’re only alerted after a few pings in a row have failed.
Alert Setting: Check once a minute. Count a check as an error if there is no response or the
response takes > 300ms, and alert when there are 3 errors in 4 minutes.
Chart Setting: Show peak response times for the past 24 hours
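
One way to read the alert setting above is as a sliding window over recent errors. The sketch below assumes that a timeout and a slow reply both count as an error; the class name and parameters are hypothetical, and the timestamps are supplied by the caller so the logic is easy to test.

```python
from collections import deque


class PingAlertRule:
    """Alert when max_errors failed or slow pings fall within window_s seconds."""

    def __init__(self, max_ms=300.0, max_errors=3, window_s=240.0):
        self.max_ms = max_ms
        self.max_errors = max_errors
        self.window_s = window_s
        self._errors = deque()  # timestamps (seconds) of recent errors

    def record(self, timestamp_s, response_ms):
        """Record one ping result; response_ms is None when the ping timed out.

        Returns True if an alert should be raised after this result.
        """
        if response_ms is None or response_ms > self.max_ms:
            self._errors.append(timestamp_s)
        # Forget errors that have fallen out of the time window.
        while self._errors and timestamp_s - self._errors[0] >= self.window_s:
            self._errors.popleft()
        return len(self._errors) >= self.max_errors
```

A single dropped ping on a busy network never triggers this rule; only three errors close together do.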
CPU
Monitor the CPU usage (normally a percentage of total possible CPU output). It’s normal for it
to go up and down depending on the load. Having a very low average value means your server
isn’t being utilized much, and that server might be a good candidate for virtualization. If the
value is quite high (90%) for an extended period, the CPU might be a bottleneck. If it’s at
100% for very long at all, the system is probably not functioning well.
Alert Setting: Alert on sustained usage of > 90%
Chart Setting: Show average usage for the past 3 days to spot any unusual patterns
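
"Sustained" needs a concrete definition before it can drive an alert. A minimal sketch, assuming it means that every one of the last few readings is above the threshold (the function name and the default of 5 readings are assumptions):

```python
def sustained_above(samples, threshold=90.0, min_samples=5):
    """Return True if the most recent min_samples readings all exceed threshold.

    samples is a list of CPU usage percentages, oldest first.
    """
    if len(samples) < min_samples:
        return False  # not enough history yet to call it sustained
    return all(value > threshold for value in samples[-min_samples:])
```

With this definition a brief spike to 100% is ignored, while several high readings in a row raise the alert.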
Memory
Measuring memory can be tricky since there are so many definitions to consider. Total physical
RAM in use (you’d like to have 100% in use!)? Total memory allocated (which can be greater
than physical RAM)? Amount of allocated memory swapped out to disk?
Personally, I like to know what percentage of the total available memory (i.e. RAM plus
swap/page file) is in use. On Windows, this is the Memory\% Committed Bytes In Use
counter, which is defined as:
“% Committed Bytes In Use is the ratio of Memory\Committed Bytes to the
Memory\Commit Limit. Committed memory is the physical memory in use for which space
has been reserved in the paging file should it need to be written to disk. The commit limit is
determined by the size of the paging file. If the paging file is enlarged, the commit limit
increases, and the ratio is reduced). This counter displays the current percentage value only; it
is not an average.”
If the % of total memory is high, you might be swapping to disk a lot and thus getting lower
server performance. You can check this by also monitoring how much of the swap/page file is in
use.
Alert Setting: Alert on sustained % memory used > 90%
Alert on swap/page file use > 70%
Chart Setting: Show 5-minute maximum for the past 3 days to spot any unusual patterns
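
The ratio described in the counter definition, and the two thresholds above, can be written out as a small sketch. The function names are hypothetical, the inputs are assumed to come from whatever the monitoring tool already collects, and the "sustained" part of the rule is left out for brevity.

```python
def percent_committed(committed_bytes, commit_limit_bytes):
    """Committed bytes as a percentage of the commit limit."""
    if commit_limit_bytes <= 0:
        raise ValueError("commit limit must be positive")
    return 100.0 * committed_bytes / commit_limit_bytes


def memory_alerts(committed_pct, pagefile_pct, mem_limit=90.0, pagefile_limit=70.0):
    """Return the names of the thresholds that the current readings exceed."""
    alerts = []
    if committed_pct > mem_limit:
        alerts.append("memory")
    if pagefile_pct > pagefile_limit:
        alerts.append("pagefile")
    return alerts
```

For instance, 12 GB committed against a 16 GB commit limit works out to 75%, which is below the 90% alert threshold.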
Disk
I’ve experienced cases where an OS has crashed because there was no free disk space.
Certainly databases, mail servers, etc. don’t function well when they can’t write their data to
disk. Low disk space is a critical problem, but usually (hopefully) a slow moving one so you
have time to fix it.
One useful feature to watch for is trend analysis where the monitoring product looks at disk
growth rates and tries to predict when you’ll run out of disk space. This gives you an early
heads up so you can be proactive rather than reactive.
Alert Setting: Alert when free disk space < 10%
Chart Setting: Because disk space normally changes slowly, chart 30 days so you can
visually see trends
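
Trend analysis of this kind can be approximated with a straight-line fit through recent usage samples. The sketch below is an assumption about how such a prediction might work, not a description of any specific product; real disk growth is often not linear, so treat the result as a rough early warning.

```python
def days_until_full(samples, capacity):
    """Estimate days from the last sample until the disk is full.

    samples is a list of (day, used) pairs, oldest first, with used in the
    same unit as capacity. Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None  # all samples taken on the same day
    # Least-squares slope: growth in used space per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    return max(0.0, full_day - samples[-1][0])
```

A disk growing by 2 GB per day with 440 GB still free would be reported as roughly 220 days from full.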
Web Page Performance
If you or your company has a website, knowing the website is up is pretty darn important. A
web page monitor should be able to check:
• Is the site up?
• Is it responding as quickly as expected?
• Are there any errors on the page?
Some monitoring products can also check resources (i.e. that a file is where it should be), SSL
certificate expiration, etc.
You’ll need to decide how important the website is. If it’s absolutely critical to your business,
checking once every couple of minutes makes sense. If it’s a personal blog, maybe once an
hour is OK.
Hint: Since checking a page often could skew your traffic stats, have a separate page (maybe in
a separate folder) used just for polling if you can. That way it’s easy to filter those requests out
of the stats. Or if you need to hit the main page, consider adding something to the URL like
?MONITOR=true for the same reason.
Hint 2: Some people want to check that the webserver is able to access the database. I
recommend having one page that hits the database and then outputs “OK” or “DATABASE
ERROR”. Then your web page monitor can check that page and alert if it sees “DATABASE
ERROR”.
Alert Setting: Check once every couple of minutes or once per hour, depending on how critical
the site is. Pick a threshold for page load time that seems appropriate (alert if longer than 4
seconds, for example)
Chart Setting: Maximum response time over the past 24 hours
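
The web page checks and the two hints can be combined into one small sketch using only Python's standard library. The function names are hypothetical, and the URL in the usage note is a placeholder, not a real endpoint.

```python
import time
import urllib.error
import urllib.request


def evaluate_page(status, body, elapsed_s, max_seconds=4.0):
    """Return a list of problems found in one page check (empty means healthy)."""
    problems = []
    if status != 200:
        problems.append(f"HTTP status {status}")
    if "DATABASE ERROR" in body:
        problems.append("database error reported")
    if elapsed_s > max_seconds:
        problems.append("slow response")
    return problems


def check_url(url, timeout_s=10.0):
    """Fetch url once and evaluate the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        status, body = err.code, ""
    except OSError as err:
        # Covers connection failures and timeouts.
        return [f"site unreachable: {err}"]
    return evaluate_page(status, body, time.monotonic() - start)
```

A monitor could then call something like `check_url("https://www.example.com/dbcheck?MONITOR=true")` every couple of minutes and alert whenever the returned list is not empty.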
Advanced Monitoring
The next article in this series will explore some advanced monitoring scenarios, like watching
Event Logs for specific events (user login for example), watching log files for errors, and more.
If you are looking for a product that can do all of the above, we just happen to know about
a good one.
