Finger Pointing
    Mahendra Kutare
 mahendra@boundary.com
   twitter - @imaxxs
FingerPointing ?

FingerPointing is a way through
which humans communicate
emotions of urgency, surprise, joy,
acknowledgment, achievement,
blame, frustration, fear and more.
FingerPointing ?




Some do it with one..
                    Some need two..
Systems FingerPointing ?




    Some do it everywhere...
Human Computer FingerPointing ?




        Some do it with....
Systems Control Loop
[Diagram: control loop. Local side: Monitor, Recover. Global side: Collect, Analysis. Intervals labeled Time to Collect (Info), Time to Detect/Analyze, and Time to Recover (Act).]
Systems Control Loop
[Diagram: control loop with concrete components. Local side: Meter, Recover. Global side: Collector, Engine. Intervals labeled Time to Collect, Time to Detect/Analyze, and Time to Recover.]
Problem Determination

Detection - Identifies violations or
anomalies.
Diagnosis - Analyzes violations or
anomalies.
Remediation - Recovers the
system to a normal state.
Detection

Threshold
Signature
Anomaly
Detection
Thresholds - Matching a single value or predicate.

Signature - Matching faults with known fault
signatures. It can detect a set of known faults.

Anomalies - Learn to recognize the normal
runtime behavior. It can detect previously
unseen faults.
Aniketos
 No use of statistical machine learning.

 Uses computational geometry - convex hull.

 Convex hull - Encompassing shape around a
 group of points.

 Works regardless of whether metrics are
 correlated.


Stehle, Lynch et al., ICAC 2010
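
The convex hull idea can be sketched in two dimensions as follows. This is a minimal illustration, not the Aniketos implementation: it assumes only two metrics, at least three non-collinear training points, and uses Andrew's monotone chain algorithm as a stand-in for whatever hull construction the authors use. The names `convex_hull` and `is_normal` are hypothetical.

```python
def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def is_normal(hull, point):
    """True if the point lies inside or on the boundary of the hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))


# Training data: observations of two metrics during normal operation.
training = [(0, 0), (10, 0), (10, 10)]
hull = convex_hull(training)
print(is_normal(hull, (8, 2)))  # inside the hull
print(is_normal(hull, (2, 8)))  # inside the bounding box, outside the hull
```

The second query point shows why a hull can be tighter than per-metric ranges: (2, 8) is within the range of each metric taken separately, yet it was never near any combination seen in training.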
Fault Detection
Training Phase

No one knows when enough training data is
collected.

If a system has an extensive test suite that
represents normal behavior, then executing
the test suite will produce a good training
dataset.

Replay request logs of the production system on
a test system.
Bounded Box Example
Given two metrics A and B, if the safe range of A
is 5 to 10 and B is 10 to 20, the normal behavior of
the system can be represented as a 2D rectangle
with vertices (5,10), (5,20), (10,20) and (10,10).

Any datapoint that falls within that rectangle, for
example (7,15), is classified as normal.

Any datapoint that falls outside of the rectangle,
for example (15,15), is classified as anomalous.
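
The bounding box check above can be written in a few lines. This is a sketch using the ranges from the example; the `SAFE_RANGES` table and the `classify` function are hypothetical names, and treating the range endpoints as inclusive is an assumption.

```python
# Safe range per metric, taken from the example: A in [5, 10], B in [10, 20].
SAFE_RANGES = {"A": (5, 10), "B": (10, 20)}


def classify(point, safe_ranges=SAFE_RANGES):
    """Return "normal" if every metric is inside its safe range, else "anomalous"."""
    for name, (low, high) in safe_ranges.items():
        # Endpoints are treated as inclusive (an assumption).
        if not (low <= point[name] <= high):
            return "anomalous"
    return "normal"


print(classify({"A": 7, "B": 15}))   # the in-range datapoint from the example
print(classify({"A": 15, "B": 15}))  # the out-of-range datapoint from the example
```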
Detection Phase
Egress/Ingress Data




volume_1s_meter_ip query, 6000 data points
Egress/Ingress Data




volume_1s_meter_ip query, 150,000 data points
Fault Detection Comparison




Maximum fault coverage, traded off against false positives
Diagnosis

Dependency Inference
Correlation Analysis
Peer Analysis
E2EProf
Useful for debugging distributed systems of black boxes.




         Sandeep et al., DSN 2007
Service Paths

Client requests take different “paths” through the
software invoking dynamic dependencies across
distributed systems. Ensemble of paths taken by
client requests - “Service Paths”

Key idea - Convert per-service-node message
traces to per-edge signals and compute the
cross-correlations of these signals.
Path Discovery
A request path VC1->VS1->VS2->VS4

Collect timestamps and source/dest IPs at each VS
node.

Calculate the cross-correlation between the time
series signals across VS nodes.

If the cross-correlation has a spike at a phase
lag equal to the latency between nodes, there
exists a path/edge between those VS nodes.
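
The spike test can be illustrated with a small sketch. It is not the E2EProf algorithm itself: it assumes each edge signal is a list of message counts per fixed time bin, uses a plain normalized cross-correlation over non-negative lags, and picks an arbitrary spike threshold of 0.5. The names `cross_correlation` and `find_edge` are hypothetical.

```python
def cross_correlation(x, y, max_lag):
    """Return {lag: normalized correlation of x[t] with y[t + lag]}."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    norm = (sum(v * v for v in dx) * sum(v * v for v in dy)) ** 0.5
    result = {}
    for lag in range(max_lag + 1):
        total = sum(dx[t] * dy[t + lag] for t in range(n - lag))
        result[lag] = total / norm if norm else 0.0
    return result


def find_edge(x, y, max_lag, threshold=0.5):
    """Return the lag of the correlation spike, or None if no spike is found."""
    corr = cross_correlation(x, y, max_lag)
    lag = max(corr, key=corr.get)
    return lag if corr[lag] >= threshold else None


# Messages per time bin seen at an upstream node (made-up data).
upstream = [0, 5, 0, 0, 9, 0, 2, 0, 0, 7, 0, 0, 3, 0, 0, 0, 8, 0, 1, 0]
# The downstream node sees the same traffic three bins later.
downstream = [0, 0, 0] + upstream[:-3]

print(find_edge(upstream, downstream, max_lag=8))
```

In this made-up trace the strongest correlation is at the lag that matches the three-bin delay, which is read as evidence of an edge with that latency.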
App Vis




   Network topology view
Augment with “service paths” ??
Remediation
Software Rejuvenation for Software Aging

  Reactive - Reboots, Micro Reboots

  Proactive - Time or load based

Checkpointing and Recovery

Treating bugs as allergies
Software Aging

Patriot missiles, used during the Gulf War to
destroy Iraq’s Scud missiles, relied on a computer
whose software accumulated errors over time,
i.e., software aging.

The effect of aging in this case was the
misinterpretation of an incoming Scud as a false
alarm rather than a missile, which resulted in
the deaths of 28 US soldiers.
Software Rejuvenation

Periodic preemptive rollback of continuously running
applications to prevent failures in the future.

Open - Not based on feedback from the system -
Elapsed Time, Cumulative jobs in system

Closed - Based on some notion of system health.
Continuously monitor, analyze the estimated time to
exhaustion of a resource.


    Trivedi et al., Duke University.
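
A closed policy needs an estimate of the time to exhaustion. The sketch below fits a straight line to recent resource usage samples and extrapolates to the capacity limit. The linear trend is an assumption made for illustration, not the estimator from the cited work, and `time_to_exhaustion` and `should_rejuvenate` are hypothetical names.

```python
def time_to_exhaustion(samples, capacity):
    """Estimate time left after the last sample until usage reaches capacity.

    samples is a list of (time, usage) pairs. Returns None when usage is not
    growing, because a linear trend then never reaches the capacity.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def should_rejuvenate(samples, capacity, safety_margin):
    """Trigger rejuvenation when the estimated time left is within the margin."""
    remaining = time_to_exhaustion(samples, capacity)
    return remaining is not None and remaining <= safety_margin


# Memory usage in MB sampled once per hour (made-up data), 200 MB limit.
usage = [(0, 100), (1, 110), (2, 120), (3, 130)]
print(time_to_exhaustion(usage, capacity=200))
print(should_rejuvenate(usage, capacity=200, safety_margin=10))
```

An open policy, by contrast, would ignore `usage` entirely and restart after a fixed elapsed time or job count.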
Apache Web Server
MaxRequestsPerChild - If this is set
to a positive value, the parent
process of Apache kills a child process as
soon as that child has handled
MaxRequestsPerChild requests.

By doing this, Apache limits “the amount
of memory a process can consume by
accidental memory leak” and “helps reduce
the number of processes when server load
reduces.”
Treating Bugs as Allergies

 Inspired by allergy treatment in real life. If
 you are allergic to milk, remove dairy
 products from your diet.

 Roll back the program to a recent checkpoint
 when a bug is detected, dynamically change
 the execution environment based on failure
 symptoms, and then re-execute the program
 in the modified environment.

     Qin et al., SOSP 2005
Treating Bugs As Allergies
Examples

Uninitialized reads may be avoided if every
newly allocated buffer is filled with zeros.

Data races can be avoided by changing timing-
related events such as thread scheduling and
asynchronous events.
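
The checkpoint, rollback and re-execute cycle can be sketched as a retry loop over candidate environments. This is a toy model, not the Rx implementation: program state is a plain dictionary, the checkpoint is a deep copy, and the only environment change modeled is the zero-fill of new buffers from the first example. All names (`BugDetected`, `run_with_rx`, `allocate`, `program`) are hypothetical.

```python
import copy


class BugDetected(Exception):
    """Raised by the toy program when a failure symptom is observed."""


def run_with_rx(program, state, environments):
    """Run program, rolling back and retrying under each environment in turn."""
    checkpoint = copy.deepcopy(state)              # checkpoint before execution
    for env in environments:
        attempt_state = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            return program(attempt_state, env), env
        except BugDetected:
            continue                               # try the next environment change
    raise BugDetected("no environment change avoided the failure")


def allocate(size, env):
    """Toy allocator; None stands in for uninitialized memory."""
    fill = 0 if env["zero_fill"] else None
    return [fill] * size


def program(state, env):
    """Toy program with an uninitialized read of a fresh buffer."""
    buf = allocate(4, env)
    if buf[0] is None:
        raise BugDetected("uninitialized read")
    state["total"] += buf[0] + state["increment"]
    return state["total"]


environments = [{"zero_fill": False}, {"zero_fill": True}]
result, used_env = run_with_rx(program, {"total": 10, "increment": 5}, environments)
print(result, used_env)
```

The first attempt fails under the default environment; after the rollback, the second attempt runs with zero-filled buffers and completes.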
Environment Changes
Comparison of Rx and
       Alternative Approaches




For systems where reboot ~5sec is not good enough
   Checkpoint, Replay bounded by reboot ~5sec