Finger Pointing
    Mahendra Kutare
 mahendra@boundary.com
   twitter - @imaxxs
FingerPointing ?

FingerPointing is a way through
which humans communicate
emotions of urgency, surprise, joy,
acknowledgment, achievement,
blame, frustration, fear and more.
FingerPointing ?




Some do it with one..
                    Some need two..
Systems FingerPointing ?




    Some do it everywhere...
Human Computer FingerPointing ?




        Some do it with....
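Systems Control Loop
[Diagram: a loop of four stages: Monitor, Collect, Analysis, Recover.
 Delays on the loop: Time to Collect, Time to Detect/Analyze, Time to Recover.
 Center labels: Info, Act.
 Left column (Monitor, Recover) is labeled Local; right column (Collect, Analysis) is labeled Global.]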
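Systems Control Loop
[Diagram: the same loop with concrete components: Meter, Collector, Engine, Recover.
 Delays on the loop: Time to Collect, Time to Detect/Analyze, Time to Recover.
 Left column (Meter, Recover) is labeled Local; right column (Collector, Engine) is labeled Global.]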
Problem Determination

Detection - Identifies violations or
anomalies.
Diagnosis - Analyzes violations or
anomalies.
Remediation - Recovers the
system to a normal state.
Detection

Threshold
Signature
Anomaly
Detection
Thresholds - Matching a single value against a predicate.

Signature - Matching faults against known fault
signatures. It can detect only a set of known faults.

Anomalies - Learn to recognize the normal
runtime behavior. It can detect previously
unseen faults.
Aniketos
 No use of statistical machine learning.

 Uses computational geometry - convex hull.

 Convex hull - Encompassing shape around a
 group of points.

 Works regardless of whether the metrics are
 correlated.


Stehle, Lynch et al., ICAC 2010
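
The convex hull idea can be sketched in two dimensions as follows. This is a minimal illustration, not the Aniketos implementation: it assumes exactly two metrics, uses Andrew's monotone chain algorithm to build the hull, and the function names (convex_hull, in_hull) are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def in_hull(hull: Sequence[Point], p: Point) -> bool:
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    if n < 3:
        return p in hull  # degenerate hull: only exact matches count as normal
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


# Training phase: metric samples (A, B) observed during normal operation.
training = [(5, 10), (5, 20), (10, 20), (10, 10), (7, 15), (8, 12)]
hull = convex_hull(training)

# Detection phase: classify new samples against the learned hull.
print(in_hull(hull, (7, 15)))   # normal
print(in_hull(hull, (15, 15)))  # anomalous
```

Because the hull follows the shape of the training points, it needs no assumption about how the metrics relate to each other.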
Fault Detection
Training Phase

No one knows when enough training data has been
collected.

If a system has an extensive test suite that
represents normal behavior, then executing
the test suite will produce a good training
dataset.

Replay request logs of the production system on
a test system.
Bounded Box Example
Given two metrics A and B, if the safe range of A
is 5 to 10 and that of B is 10 to 20, the normal behavior of
the system can be represented as a 2D rectangle
with vertices (5,10), (5,20), (10,20) and (10,10).

Any datapoint that falls within that rectangle, for
example (7,15), is classified as normal.

Any datapoint that falls outside of the rectangle,
for example (15,15), is classified as anomalous.
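
The bounded box check above can be written as a short sketch. The helper names (learn_ranges, is_normal) and the idea of learning the ranges from the minimum and maximum of the training data are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Range = Tuple[float, float]


def learn_ranges(training: Sequence[Sequence[float]]) -> List[Range]:
    """Per-metric (min, max) seen in the training data."""
    return [(min(column), max(column)) for column in zip(*training)]


def is_normal(point: Sequence[float], ranges: Sequence[Range]) -> bool:
    """A point is normal only if every metric lies inside its safe range."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(point, ranges))


# Safe ranges from the example: A in [5, 10], B in [10, 20].
SAFE_RANGES = [(5, 10), (10, 20)]

print(is_normal((7, 15), SAFE_RANGES))   # inside the rectangle
print(is_normal((15, 15), SAFE_RANGES))  # A is out of range
```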
Detection Phase
Egress/Ingress Data




volume_1s_meter_ip query, 6000 data points
Egress/Ingress Data




volume_1s_meter_ip query, 150,000 data points
Fault Detection Comparison




Maximum fault coverage, tradeoff false positives
Diagnosis

Dependency Inference
Correlation Analysis
Peer Analysis
E2EProf
Useful for debugging distributed systems of black boxes.




         Sandeep et al., DSN 2007
Service Paths

Client requests take different “paths” through the
software, invoking dynamic dependencies across
distributed systems. The ensemble of paths taken by
client requests is called the “Service Paths”.

Key idea - Convert the per-service-node message traces
into per-edge signals and compute the cross
correlations of these signals.
Path Discovery
A request path VC1->VS1->VS2->VS4

Collect the timestamp and source/destination IP at each VS
node.

Calculate the cross correlation between the time
series signals across VS nodes.

If the cross correlation has a spike at a phase
lag equal to the latency between nodes, there exists a
path/edge between those VS nodes.
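
A minimal sketch of the spike test, assuming each edge signal is a list of message counts per fixed time bucket. It uses a raw, unnormalized cross correlation, and the spike rule (peak at least spike_ratio times the median of the other lags) is an illustrative assumption, not the E2EProf algorithm. The function names are hypothetical.

```python
from statistics import median
from typing import List, Optional, Sequence


def cross_correlation(x: Sequence[float], y: Sequence[float], max_lag: int) -> List[float]:
    """corr[lag] = sum over t of x[t] * y[t + lag], for lag in 0..max_lag."""
    n = min(len(x), len(y))
    return [
        sum(x[t] * y[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def find_edge(x: Sequence[float], y: Sequence[float], max_lag: int,
              spike_ratio: float = 3.0) -> Optional[int]:
    """Return the lag of a clear correlation spike, or None if there is none."""
    corr = cross_correlation(x, y, max_lag)
    peak_lag = max(range(len(corr)), key=corr.__getitem__)
    peak = corr[peak_lag]
    rest = corr[:peak_lag] + corr[peak_lag + 1:]
    baseline = median(rest) if rest else 0
    if peak > 0 and peak >= spike_ratio * baseline:
        return peak_lag
    return None


# Messages per time bucket seen at an upstream node.
upstream = [0, 5, 0, 0, 2, 0, 7, 0, 0, 1, 0, 0, 4, 0, 0, 0]
# The downstream node sees the same traffic three buckets later.
downstream = [0, 0, 0] + upstream[:-3]

print(find_edge(upstream, downstream, max_lag=6))  # lag of the spike, in buckets
```

The returned lag is in units of the bucket width, so it can be read as the estimated latency of the edge between the two nodes.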
App Vis




   Network topology view
Augment with “service paths” ??
Remediation
Software Rejuvenation for Software Aging

  Reactive - Reboots, Micro Reboots

  Proactive - Time or load based

Checkpointing and Recovery

Treating bugs as allergies
Software Aging

Patriot missiles, used during the Gulf war to
destroy Iraqi Scud missiles, used a computer
whose software accumulated errors, i.e.
software aging.

The effect of aging in this case was the
misinterpretation of an incoming Scud as not a
missile but just a false alarm, which resulted
in the deaths of 28 US soldiers.
Software Rejuvenation

Periodic preemptive rollback of continuously running
applications to prevent failures in the future.

Open - Not based on feedback from the system -
Elapsed Time, Cumulative jobs in system

Closed - Based on some notion of system health.
Continuously monitor, analyze the estimated time to
exhaustion of a resource.


    Trivedi et al., Duke University.
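
A closed policy can be sketched by extrapolating a resource trend. The least-squares line fit below is a simple stand-in chosen for illustration, not the estimator used in the cited work, and the names (time_to_exhaustion, should_rejuvenate, safety_margin) are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, resource used)


def time_to_exhaustion(samples: Sequence[Sample], capacity: float) -> Optional[float]:
    """Seconds from the last sample until the fitted line reaches capacity.

    Returns None when usage is not growing or there is too little data.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_exhausted = (capacity - intercept) / slope
    return max(0.0, t_exhausted - samples[-1][0])


def should_rejuvenate(samples: Sequence[Sample], capacity: float,
                      safety_margin: float) -> bool:
    """Closed policy: restart once the estimated time left drops below the margin."""
    remaining = time_to_exhaustion(samples, capacity)
    return remaining is not None and remaining <= safety_margin


# Memory use (MB) sampled every 10 s, leaking about 1 MB/s, with a 200 MB limit.
samples = [(0, 100), (10, 110), (20, 120), (30, 130)]
print(time_to_exhaustion(samples, capacity=200))
print(should_rejuvenate(samples, capacity=200, safety_margin=60))
```

An open policy needs none of this: it restarts on a fixed elapsed time or job count regardless of what the samples show.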
Apache Web Server
MaxRequestsPerChild - If this is set
to a positive value, then the parent
process of Apache kills a child process as
soon as MaxRequestsPerChild requests
have been handled by this child process.

By doing this, Apache limits “the amount
of memory a process can consume by
accidental memory leak” and “helps reduce
the number of processes when server load
reduces.”
Treating Bugs as Allergies

 Inspired by allergy treatment in real life. If
 you are allergic to milk, remove dairy
 products from your diet.

 Roll back the program to a recent checkpoint
 when a bug is detected, dynamically change
 the execution environment based on the failure
 symptoms, and then re-execute the program
 in the modified environment.

     Qin et al., SOSP 2005
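
The rollback and re-execute cycle can be illustrated with a toy sketch. Rx itself checkpoints whole processes; here a deep copy of a dictionary stands in for the checkpoint, None stands in for uninitialized memory, and all names (run_with_rx, allocate, buggy_step, zero_fill) are hypothetical.

```python
import copy
from typing import Any, Callable, Dict, Sequence, Tuple

Env = Dict[str, Any]
State = Dict[str, Any]


def run_with_rx(step: Callable[[State, Env], Any], state: State,
                environments: Sequence[Env]) -> Tuple[Any, Env]:
    """Run step; on failure, roll back and retry under the next environment."""
    checkpoint = copy.deepcopy(state)       # take a checkpoint before executing
    for env in environments:
        trial = copy.deepcopy(checkpoint)   # roll back to the checkpoint
        try:
            return step(trial, env), env    # re-execute in this environment
        except Exception:
            continue                        # failure symptom: try the next change
    raise RuntimeError("no environment change avoided the failure")


def allocate(size: int, env: Env) -> list:
    """Allocator whose behavior depends on the execution environment."""
    fill = 0 if env.get("zero_fill") else None  # None models uninitialized memory
    return [fill] * size


def buggy_step(state: State, env: Env) -> int:
    buf = allocate(4, env)
    buf[0] = state["value"]      # bug: only the first slot is initialized
    state["total"] = sum(buf)    # raises TypeError if any slot is still None
    return state["total"]


original = {"value": 7}
result, env_used = run_with_rx(buggy_step, original, [{}, {"zero_fill": True}])
print(result, env_used)
```

The first environment is the unmodified one; the zero-fill change is only applied after the failure is observed.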
Treating Bugs As Allergies
Examples

Uninitialized reads may be avoided if every
newly allocated buffer is filled with zeros.

Data races can be avoided by changing timing-
related events such as thread scheduling and
asynchronous event delivery.
Environment Changes
Comparison of Rx and Alternative Approaches




For systems where a reboot of ~5 sec is not good enough.
   Checkpoint and replay are bounded by the ~5 sec reboot time.