Second-by-second application monitoring changes the game as boundaries shift; this will sometimes affect how you apply your problem analysis steps.
2. FingerPointing ?
FingerPointing is a way through
which humans communicate
emotions of urgency, surprise, joy,
acknowledgment, achievement,
blame, frustration, fear and more.
7. Systems Control Loop
[Figure: control loop. Stages: Monitor, Collect, Info, Analysis, Act, Recover. Intervals: Time to Collect, Time to Detect/Analyze, Time to Recover. Scope: Local, Global.]
8. Systems Control Loop
[Figure: control loop. Components: Meter, Collector, Recover Engine. Intervals: Time to Collect, Time to Detect/Analyze, Time to Recover. Scope: Local, Global.]
9. Problem Determination
Detection - Identifies violations or
anomalies.
Diagnosis - Analyzes violations or
anomalies.
Remediation - Recovers the
system to normal state
11. Detection
Thresholds - Matching single value/predicate.
Signature - Matching faults with known fault
signatures. It can detect a set of known faults.
Anomalies - Learn to recognize the normal
runtime behavior. It can detect previously
unseen faults.
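
The first two detection styles above can be sketched in a few lines. This is a minimal illustration only: the metric names, limits, and fault signatures below are hypothetical and do not come from any particular monitoring product.

```python
# Hypothetical thresholds: metric name -> maximum allowed value.
THRESHOLDS = {"cpu_pct": 90.0, "latency_ms": 500.0}

# Hypothetical fault signatures: fault name -> symptoms that must all be present.
SIGNATURES = {
    "memory_leak": {"heap_growing": True, "gc_time_high": True},
    "disk_full": {"write_errors": True, "disk_free_low": True},
}


def threshold_violations(sample):
    """Threshold detection: return the metrics whose value exceeds its limit."""
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if sample.get(name, 0.0) > limit
    )


def match_signature(symptoms):
    """Signature detection: return the first known fault whose symptoms all match."""
    for fault, pattern in SIGNATURES.items():
        if all(symptoms.get(key) == value for key, value in pattern.items()):
            return fault
    return None  # unknown fault: signatures only cover faults seen before


if __name__ == "__main__":
    print(threshold_violations({"cpu_pct": 95.0, "latency_ms": 120.0}))
    print(match_signature({"heap_growing": True, "gc_time_high": True}))
```

Signature matching returns nothing for a fault it has never seen, which is the gap anomaly detection is meant to fill.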
12. Aniketos
No use of statistical machine learning.
Uses computational geometry - convex hull.
Convex hull - Encompassing shape around a
group of points.
Works independent of whether metrics are
correlated or not.
Stehle, Lynch et al., ICAC 2010
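
To make the convex hull idea concrete, here is a minimal 2D sketch. It is not the Aniketos implementation: it assumes only two metrics, uses the standard monotone chain hull algorithm, and the training points are made up for illustration.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive when o -> a -> b turns left."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone chain convex hull; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def is_normal(hull, point):
    """A point is normal if it lies inside or on the hull of the training data."""
    n = len(hull)
    if n < 3:
        return point in hull  # degenerate hull: only exact matches count
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))


if __name__ == "__main__":
    # Hypothetical training samples of two metrics observed during normal runs.
    training = [(1, 1), (2, 1), (3, 2), (3, 3), (2, 3), (1, 2)]
    hull = convex_hull(training)
    print(is_normal(hull, (2, 2)))  # inside the hull
    print(is_normal(hull, (1, 3)))  # inside the bounding box, outside the hull
```

The point (1, 3) lies within the per-metric ranges of the training data yet falls outside the hull, which is the kind of case a hull can flag and independent ranges cannot.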
14. Training Phase
No one knows when enough training data is
collected.
If a system has an extensive test suite that
represents normal behavior, then executing
the test suite will produce a good training
dataset.
Replay request logs of the production system
on a test system.
15. Bounded Box Example
Given two metrics A and B, if the safe range of A
is 5 to 10 and that of B is 10 to 20, the normal
behavior of the system can be represented as a 2D
rectangle with vertices (5,10), (5,20), (10,20) and (10,10).
Any datapoint that falls within that rectangle, for
example (7,15), is classified as normal.
Any datapoint that falls outside of the rectangle,
for example (15,15), is classified as anomalous.
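
The rectangle test above amounts to a per-metric range check. A minimal sketch, using the ranges from this slide (the `SAFE_RANGES` and `classify` names are illustrative):

```python
# Safe ranges from the slide: metric name -> (low, high), inclusive.
SAFE_RANGES = {"A": (5, 10), "B": (10, 20)}


def classify(sample):
    """Return 'normal' if every metric is within its safe range, else 'anomalous'."""
    inside = all(
        low <= sample[metric] <= high
        for metric, (low, high) in SAFE_RANGES.items()
    )
    return "normal" if inside else "anomalous"


if __name__ == "__main__":
    print(classify({"A": 7, "B": 15}))   # inside the rectangle
    print(classify({"A": 15, "B": 15}))  # A is out of range
```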
22. Service Paths
Client requests take different “paths” through the
software invoking dynamic dependencies across
distributed systems. Ensemble of paths taken by
client requests - “Service Paths”
Key idea - Convert message traces per service
node to per edge signals and compute cross
correlations of these signals.
23. Path Discovery
A request path VC1->VS1->VS2->VS4
Collect timestamp, source/destination IP at each VS
node.
Calculate cross correlation between time
series signals across VS nodes.
If cross correlation has a spike at a phase
lag = latency between nodes, there exists a
path/edge between VS nodes.
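
The cross-correlation step can be sketched as follows. This is a simplified illustration under stated assumptions: message timestamps are binned into one-second counts, the correlation is unnormalized, and the "spike" is taken to be the lag with the largest value. The timestamps and the two-second latency are invented for the example; a real system would also need a significance threshold before declaring an edge.

```python
def to_signal(timestamps, n_bins, bin_width=1.0):
    """Convert message timestamps into a per-bin message count signal."""
    counts = [0] * n_bins
    for ts in timestamps:
        idx = int(ts // bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def cross_correlation(x, y, max_lag):
    """Return {lag: sum over t of x[t] * y[t + lag]} for lag in 0..max_lag."""
    n = min(len(x), len(y))
    return {
        lag: sum(x[t] * y[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    }


def best_lag(x, y, max_lag):
    """Lag at which the cross correlation peaks (candidate inter-node latency)."""
    corr = cross_correlation(x, y, max_lag)
    return max(corr, key=corr.get)


if __name__ == "__main__":
    # Hypothetical traces: VS2 sees each VS1 message two seconds later.
    vs1_times = [1, 4, 4, 6, 9, 9, 9]
    vs2_times = [t + 2 for t in vs1_times]
    vs1 = to_signal(vs1_times, n_bins=12)
    vs2 = to_signal(vs2_times, n_bins=12)
    print(best_lag(vs1, vs2, max_lag=5))
```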
24. App Vis
Network topology view
Augment with “service paths” ??
25. Remediation
Software Rejuvenation for Software Aging
Reactive - Reboots, Micro Reboots
Proactive - Time or load based
Checkpointing and Recovery
Treating bugs as allergies
26. Software Aging
Patriot missiles, used during the Gulf War to
destroy Iraqi Scud missiles, relied on a computer
whose software accumulated errors, i.e.
software aging.
The effect of aging in this case was the
misinterpretation of an incoming Scud as not a
missile but just a false alarm, which resulted
in the death of 28 US soldiers.
27. Software Rejuvenation
Periodic preemptive rollback of continuously running
applications to prevent failures in the future.
Open - Not based on feedback from the system -
Elapsed Time, Cumulative jobs in system
Closed - Based on some notion of system health.
Continuously monitor the system and estimate the
time to exhaustion of a resource.
Trivedi et al., Duke University.
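
One way to picture the closed-loop variant is a trend fit over recent resource samples. The sketch below assumes a plain least-squares line, which is a stand-in rather than the estimation method used in the cited work; the function names, sample values, and horizon are hypothetical.

```python
def time_to_exhaustion(samples, capacity):
    """Estimate seconds from the last sample until usage reaches capacity.

    samples is a list of (time_s, used) pairs. A least-squares line is fitted;
    None is returned when usage is not growing.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def should_rejuvenate(samples, capacity, horizon_s):
    """Trigger rejuvenation when exhaustion is predicted within the horizon."""
    eta = time_to_exhaustion(samples, capacity)
    return eta is not None and eta <= horizon_s


if __name__ == "__main__":
    # Hypothetical memory readings (MB) taken once a minute.
    readings = [(0, 100), (60, 110), (120, 120), (180, 130)]
    print(time_to_exhaustion(readings, capacity=200))
    print(should_rejuvenate(readings, capacity=200, horizon_s=600))
```

An open-loop policy would skip the estimate entirely and restart on a fixed timer or job count.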
28. Apache Web Server
MaxRequestsPerChild - If this value is set
to a positive value, then the parent
process of Apache kills a child process as
soon as MaxRequestsPerChild requests
have been handled by this child process.
By doing this, Apache limits “the amount
of memory a process can consume by
accidental memory leak” and “helps reduce
the number of processes when server load
reduces.”
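
The effect of recycling children can be shown with a toy simulation. This is not Apache code: the `Worker` class, the leak size, and the single-worker loop are simplifying assumptions made purely to show how a request limit caps the memory any one process can leak.

```python
class Worker:
    """Stand-in for a child process that leaks a little memory per request."""

    def __init__(self):
        self.handled = 0
        self.leaked_bytes = 0

    def handle(self, leak_per_request):
        self.handled += 1
        self.leaked_bytes += leak_per_request


def peak_leak(total_requests, max_requests_per_child, leak_per_request=1024):
    """Return the largest leak any single worker accumulates.

    A limit of 0 means workers are never recycled.
    """
    worker = Worker()
    peak = 0
    for _ in range(total_requests):
        worker.handle(leak_per_request)
        peak = max(peak, worker.leaked_bytes)
        if 0 < max_requests_per_child <= worker.handled:
            worker = Worker()  # parent replaces the child with a fresh one
    return peak


if __name__ == "__main__":
    print(peak_leak(1000, max_requests_per_child=0))
    print(peak_leak(1000, max_requests_per_child=100))
```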
29. Treating Bugs as Allergies
Inspired by allergy treatment in real life. If
you are allergic to milk, remove dairy
products from your diet.
Roll back the program to a recent checkpoint
when a bug is detected, dynamically change
the execution environment based on failure
symptoms, and then re-execute the program
in the modified environment.
Qin et al., SOSP 2005
31. Examples
Uninitialized reads may be avoided if every
newly allocated buffer is filled with zeros.
Data races can be avoided by changing timing-
related events such as thread scheduling and
asynchronous event delivery.
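
The rollback and re-execute cycle, together with the zero-fill example, can be sketched as a toy. Everything here is hypothetical: `None` values stand in for uninitialized memory, a Python exception stands in for a detected failure, and a deep copy stands in for a real checkpoint.

```python
import copy


def alloc_uninitialized(n):
    """Default environment: buffer contents are garbage (modeled as None)."""
    return [None] * n


def alloc_zero_filled(n):
    """Modified environment: every newly allocated buffer is filled with zeros."""
    return [0] * n


def buggy_sum(state, env):
    """Buggy task: initializes only the first slot, then reads the whole buffer."""
    buf = env["alloc"](4)
    buf[0] = state["value"]
    return sum(buf)  # fails on uninitialized slots


def run_with_recovery(task, state, environments):
    """Re-execute from the checkpoint under each environment until one succeeds."""
    checkpoint = copy.deepcopy(state)
    for env in environments:
        attempt = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            return task(attempt, env), env["name"]
        except Exception:
            continue  # failure detected: try the next environment change
    raise RuntimeError("no environment change avoided the failure")


if __name__ == "__main__":
    envs = [
        {"name": "default", "alloc": alloc_uninitialized},
        {"name": "zero_fill", "alloc": alloc_zero_filled},
    ]
    print(run_with_recovery(buggy_sum, {"value": 5}, envs))
```

The bug itself is never fixed; the run succeeds because the environment no longer exposes it.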