Common Sense Performance Indicators

Nick Gerner
June 24, 2010
Goals
 Common Sense in the Cloud
     same as outside the cloud


1. Tune performance
2. Investigate issues
3. Visualize architecture
Nick Gerner
              www.nickgerner.com
                  @gerner

•   Formerly senior engineer at SEOmoz
•   Linkscape: index of the web for SEO
•   Lead data services
•   Developer
•   Back-end ops guy
SEOmoz
• Seattle-based Startup (~7 engineers)
• SEO Blog and Community
• Toolset and Platform
    OpenSiteExplorer.org
• 300TB/month processing pipeline
• 5 million API requests/day
SEOmoz Engineering
• 50 < nodes < 500
• AWS based since 2008
  – EC2 – Linux root access to bare VM
  – S3 – networked disk
  – EBS – local disk I/O
  – ELB – load balancing as a service
SEOmoz Architecture: Processing
[Diagram: The Web, Crawlers, Raw Storage, Process, Prepare (Data Pipeline)]
SEOmoz Architecture: API
[Diagram: S3; three Memcache / App / Lighttpd stacks; ELB; clients: Partners, SEOmoz Apps]
End-to-End Performance Indicators
• Latency
• Conversion Rate
• DNS
• Time to On-load
• Web Object Count
Great
...but not the focus of this talk

Performance Indicators
[Diagram: System Characteristics (CPU, Mem, Disk, Net) and App Stack (Front-End, Middleware, Caching, Back-end, Database, WS-API), linked by arrows labeled "Drives" and "Competes For"]
                             http://www.flickr.com/photos/dnisbet/3118888630/
/proc
• System stats
• Per-process stats
• It all comes from here
    ...but use tools to see it
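As a rough illustration of "it all comes from here", the sketch below parses text in the shape of /proc/<pid>/status and pulls out one per-process number (resident memory). The sample values and the helper names `parse_proc_status` and `rss_kib` are made up for this example; they are not from the talk.

```python
def parse_proc_status(text):
    """Parse the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def rss_kib(fields):
    """Return resident memory in KiB from a parsed status dict, or None."""
    value = fields.get("VmRSS")  # e.g. "5120 kB"; may be absent for some processes
    return int(value.split()[0]) if value else None


# Sample text shaped like /proc/<pid>/status (values are invented).
SAMPLE = "Name:\tlighttpd\nState:\tS (sleeping)\nPid:\t1234\nVmRSS:\t    5120 kB\n"
status = parse_proc_status(SAMPLE)
print(status["Name"], rss_kib(status))
```

On a Linux node the same function could be fed `open("/proc/self/status").read()`; in practice the tools named later in the deck do this reading for you.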
System Characteristics

      Load Average
          CPU
        Memory
          Disk
        Network
Load Average
• Combines a few things
• Good place to start
• Explains nothing


                http://www.flickr.com/photos/maple03/4176389418/
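A minimal sketch of reading the load average, assuming the usual /proc/loadavg layout (three averages followed by other fields). Dividing by the CPU count is one common way to make the number comparable across node sizes; that normalization is an assumption of this example, not something the slide states.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def load_per_cpu(load, cpu_count):
    """Normalize a load average by the number of CPUs on the node."""
    return load / cpu_count


# Sample line shaped like /proc/loadavg (values are invented).
one, five, fifteen = parse_loadavg("3.00 1.50 0.75 2/345 6789")
print(load_per_cpu(one, cpu_count=4))
```

The number says that something is queued up, not what; the CPU, disk, and network breakdowns that follow are where the explanation comes from.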
CPU
• Break out by process
• Break out user vs system
• User, System, I/O wait, Idle


                     http://www.flickr.com/photos/pacdog/213442876/
Why watch it?
•   Who's doing work
•   Is CPU maxed?
•   Blocked on I/O?
•   Compare to Load Average
                    http://www.flickr.com/photos/pacdog/213442876/
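The user / system / I/O wait / idle split can be sketched from two samples of the aggregate `cpu` line in /proc/stat. The counters are cumulative, so only the difference between two readings is meaningful. The sample numbers below are invented, and the field list assumes the common ordering of the first eight columns.

```python
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]
            return dict(zip(CPU_FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Percentage of time spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in before}
    total = sum(deltas.values())
    if total == 0:
        return {k: 0.0 for k in deltas}
    return {k: 100.0 * d / total for k, d in deltas.items()}


# Two invented samples taken some seconds apart.
before = parse_cpu_line("cpu  100 0 50 800 50 0 0 0\ncpu0 100 0 50 800 50 0 0 0\n")
after = parse_cpu_line("cpu  160 0 70 1500 270 0 0 0\ncpu0 160 0 70 1500 270 0 0 0\n")
pct = cpu_percentages(before, after)
print(pct["user"], pct["system"], pct["iowait"], pct["idle"])
```

In this made-up sample the CPU is mostly idle but spends a noticeable share of time in I/O wait, which is the "Blocked on I/O?" case from the slide.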
Memory
• Break out by Process
• Free, cached, used



                 http://www.flickr.com/photos/williamhook/3118248600/
Why watch it?
• Cached + Free = Available
• Do you have spare memory?
  – App uses
  – Memcache
  – DB cache

               http://www.flickr.com/photos/williamhook/3118248600/
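The "Cached + Free = Available" rule of thumb can be sketched against text shaped like /proc/meminfo. The sample values are invented. Newer kernels also report a `MemAvailable` line, which is likely a better estimate where present; this example sticks to the slide's simpler sum.

```python
def parse_meminfo(text):
    """Parse 'Key:   12345 kB' lines from /proc/meminfo into a dict of KiB values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        tokens = rest.split()
        if sep and tokens:
            info[key.strip()] = int(tokens[0])
    return info


def available_kib(info):
    """Slide's rule of thumb: memory that is free or only used as page cache."""
    return info["MemFree"] + info["Cached"]


# Sample text shaped like /proc/meminfo (values are invented).
SAMPLE = "MemTotal:        8000000 kB\nMemFree:         1000000 kB\nCached:          3000000 kB\n"
mem = parse_meminfo(SAMPLE)
spare = available_kib(mem)
print(spare, mem["MemTotal"] - spare)
```

If the spare figure stays large over time, that memory could go to the application, Memcache, or a database cache, as the slide suggests.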
Disk
• Read bytes/sec
• Write bytes/sec
• Disk utilization


                     http://www.flickr.com/photos/robfon/2174992215/
Why watch it?

• Is disk busy?
• When?
• Who's using it?


                    http://www.flickr.com/photos/robfon/2174992215/
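Read bytes/sec, write bytes/sec, and utilization can all be derived from two samples of /proc/diskstats. The sketch below assumes the standard column layout (sectors read in column 6, sectors written in column 10, milliseconds spent doing I/O in column 13) and 512-byte sectors; the device name and numbers are invented.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Return {device: counters} from /proc/diskstats text."""
    disks = {}
    for line in text.splitlines():
        p = line.split()
        if len(p) < 14:
            continue  # skip short or malformed lines
        disks[p[2]] = {
            "sectors_read": int(p[5]),
            "sectors_written": int(p[9]),
            "io_ms": int(p[12]),
        }
    return disks


def disk_rates(before, after, interval_s):
    """Per-device throughput and utilization between two samples."""
    rates = {}
    for name, now in after.items():
        prev = before.get(name)
        if prev is None:
            continue  # device appeared between samples
        rates[name] = {
            "read_bytes_per_s": (now["sectors_read"] - prev["sectors_read"]) * SECTOR_BYTES / interval_s,
            "write_bytes_per_s": (now["sectors_written"] - prev["sectors_written"]) * SECTOR_BYTES / interval_s,
            "util_pct": 100.0 * (now["io_ms"] - prev["io_ms"]) / (interval_s * 1000.0),
        }
    return rates


# Two invented samples taken 10 seconds apart.
t0 = parse_diskstats("   8       0 sda 1000 0 20000 500 2000 0 40000 900 0 1200 1400\n")
t1 = parse_diskstats("   8       0 sda 1100 0 22000 600 2400 0 48000 1500 0 6200 7000\n")
rates = disk_rates(t0, t1, interval_s=10)
print(rates["sda"])
```

This answers "is disk busy?" and "when?"; the "who's using it?" question needs per-process data, which is what `iotop` reports.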
Network
• Read bytes/sec
• Write bytes/sec
• Established connections


                     http://www.flickr.com/photos/ahkitj/20853609/
Why watch it?
• Max connections
      (~1024 is magic)
• Bandwidth is $$$
• When are you busy?
• SOA considerations
                     http://www.flickr.com/photos/ahkitj/20853609/
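A sketch of counting established connections from text shaped like /proc/net/tcp, where the `st` column value `01` means ESTABLISHED. The slide's "~1024" figure is treated here simply as a configurable limit; the `headroom` parameter and the sample rows are assumptions made for illustration.

```python
TCP_ESTABLISHED = "01"  # value of the 'st' column for established sockets


def count_established(text):
    """Count established TCP connections in /proc/net/tcp text."""
    count = 0
    for line in text.splitlines()[1:]:  # first line is the column header
        parts = line.split()
        if len(parts) > 3 and parts[3] == TCP_ESTABLISHED:
            count += 1
    return count


def near_limit(count, limit=1024, headroom=0.8):
    """True when the connection count is within the chosen headroom of the limit."""
    return count >= limit * headroom


# Invented sample: one listening socket and two established connections.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0050 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1001
   1: 0100007F:0050 0100007F:C350 01 00000000:00000000 00:00000000 00000000    33        0 1002
   2: 0100007F:0050 0100007F:C351 01 00000000:00000000 00:00000000 00000000    33        0 1003
"""
established = count_established(SAMPLE)
print(established, near_limit(established))
```

`netstat -tnp`, listed under Ad-hoc Tools, gives the same view with the owning process attached.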
Perf Monitoring Solution
FREE, in Apt

  1. data collection (collectd)
  2. data storage (rrdtool)
  3. dashboard management (drraw)
Perf Monitoring Architecture
• Multiple Clusters
• Multiple Applications
• Nodes come up and go down
Perf Monitoring Architecture
• collectd agents
• new nodes get generic config
• node names follow convention according to role
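One way a role-based naming convention might pay off on the dashboard side is sketched below. The talk does not say what the convention was; the `<role>-<number>` pattern and the host names here are purely hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical convention: "<role>-<number>", e.g. "api-01" or "crawler-07".
NAME_PATTERN = re.compile(r"^(?P<role>[a-z]+)-(?P<index>\d+)$")


def group_by_role(hostnames):
    """Group node names by the role encoded in the name."""
    groups = defaultdict(list)
    for name in hostnames:
        match = NAME_PATTERN.match(name)
        role = match.group("role") if match else "unknown"
        groups[role].append(name)
    return dict(groups)


nodes = ["api-01", "api-02", "crawler-07", "scratch"]
print(group_by_role(nodes))
```

With names like these, a new node can ship the same generic agent config and still land in the right group of graphs without per-node setup.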
Perf Monitoring Architecture
Perf Monitoring Server, on its own server:
• collectd server
• Web server
• drraw.cgi
• allows connections from new nodes
• perf data backed up daily
Perf Monitoring Architecture
Happy Sysadmin
• Visibility into system
• history of performance
Perf Dashboard Features

1. Summarize nodes/systems
2. Visualize data over time
3. Stack measurements
   – Per-process
   – Per-node
4. Handle new nodes
Batch Mode Dashboard [screenshots: CPU, Memory, Disk, Network]
Web Server Dashboard [screenshots: Web Requests, mod_status]
System-Wide Dashboard [screenshot: Per-request]
•   cpu, mem, disk, net
•   over time
•   per node
•   per process
•   Throw in relevant app measures
      e.g. per request stats:
       • req/sec
       • median latency/req
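The two per-request measures can be sketched as below, assuming the application already records one latency value per request within a known time window. The function name, the window length, and the sample latencies are invented for illustration.

```python
from statistics import median


def request_stats(latencies_ms, window_s):
    """Summarize one time window of requests: rate and median latency."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return {
        "req_per_sec": len(latencies_ms) / window_s,
        "median_latency_ms": median(latencies_ms) if latencies_ms else None,
    }


# Five invented requests observed over a 10 second window; one is a slow outlier.
stats = request_stats([12, 15, 11, 240, 14], window_s=10)
print(stats)
```

The median barely moves when a single request is slow (the mean of this sample would be 58.4 ms), which is one reason to graph it next to the system measures.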
Ad-hoc Tools
• $ dstat -cdnml
    system characteristics
• $ iotop
    per-process disk I/O
• $ iostat -x 3
    detailed disk stats
• $ netstat -tnp
    fast, per-process TCP connection stats
Resources
• Perf Testing: What, How, Why
      http://www.nickgerner.com/2010/02/performance-testing-what-andhow-why/

• Perf Testing Case Study: OSE
      http://www.nickgerner.com/2010/01/performance-testing-case-study-ose/

• S3 Benchmarks
      http://twopieceset.blogspot.com/2009/06/s3-performance-benchmarks.html

• Perf Measurement
      http://twopieceset.blogspot.com/2009/03/performance-measurement-for-small-and.html
More Resources
•   http://www.collectd.org
•   http://oss.oetiker.ch/rrdtool/
•   http://web.taranis.org/drraw/
•   http://dag.wieers.com/home-made/dstat/

• $ man proc
Q: Why? A: Perf Tuning
[Cycle diagram: Test, Measure, Interpret, Improve, Validate]
Q: Why? A: System Arch
• Better Devs/Ops
• Identify Bottlenecks
• Scaling Considerations
Q: Why? A: Issue Investigation
•   Machine Specific?
•   System Wide?
•   Which Component?
•   Timeline?
•   Cascading Failures?
