2. About Darren
● CTO at Intergral (The FusionReactor people…)
● 18 yrs CF experience (CF released 20 years ago!)
● Over 33 years in Software
● Worked in HP’s OpenView Network + System
Management Software Division before Intergral
● Background in Network and System Management
for banks
● Responsible for all Fusion(X) products
● Based in Stuttgart, Germany for last 25 years :-)
3. Overview
• The need for monitoring
• Gartner Application Performance Model
• Core APM
• Stability
• When things go wrong
• World Premiere!
• Monitoring ProfileBox and FusionReactor
4. The Need for APM
Modern IT solutions need to be monitored and managed
in a complete, end-to-end manner
Detail remains important and has to be set into a well-
understood overall picture of system behavior
Five distinct dimensions of application performance
exist, each one complementary to the others
5. Gartner's APM Model
Five Dimensions:
End-user experience monitoring
Transaction profiling
Runtime application architecture
Component deep-dive monitoring
Analytics
12. Stability Antipatterns
● Blocked Threads
Almost all stability issues relate to Blocked Threads eventually.
Caused by locks, synchronizers, resource waits, exhaustion
● Chain Reaction
Blocked threads on one server increase load on others. This
slows them down, causing more blocked threads...
● Integration Point
Exit points from the platform. Typical systems today may touch
8 or more on average. You're at the mercy of someone else...
● Cascade Failure
Occurs when problems in one layer cause problems in its
callers. Cracks jump from system to system. Be paranoid
about integration points and stay up even if they go down.
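One way to watch for the Blocked Threads antipattern from inside the JVM is the standard `java.lang.management` API. The sketch below is only an illustration, not how FusionReactor or ProfileBox work; the class name `BlockedThreadCheck` is hypothetical. Note that `BLOCKED` only covers threads waiting for a monitor (`synchronized`); threads parked on `java.util.concurrent` locks or waiting on I/O show up in other states.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: counts threads that are stuck waiting for a monitor lock.
class BlockedThreadCheck {
    static int countBlocked() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int blocked = 0;
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            // An entry is null if the thread died after the id list was taken.
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        return blocked;
    }
}
```

Sampling `BlockedThreadCheck.countBlocked()` periodically and alerting when it keeps rising would be one simple early warning before a chain reaction sets in.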
14. Stability Patterns
● Circuit Breaker
Protects callers by not calling if Integration Point has failed.
Fast-fail when the breaker is open.
● Steady-State
System must run without you touching it. Anything that grows a
resource (DB, files) must have something that cleans it up. Use
caching to maintain performance.
● Bulkhead
Partitions capacity to preserve functionality. Use pools to protect
critical actions.
● Timeouts
Use timeouts to prevent integration points becoming blocked
threads. Consider (delayed) retries.
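The Circuit Breaker pattern above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not a production implementation: the class name `CircuitBreaker`, the failure threshold, the open interval, and the injectable clock are all hypothetical choices made for the example.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative circuit breaker: opens after N consecutive failures,
// fast-fails while open, and allows a trial call once the open interval has passed.
class CircuitBreaker {
    private final int failureThreshold; // consecutive failures before opening
    private final long openMillis;      // how long to fast-fail before a trial call
    private final LongSupplier clock;   // time source in milliseconds (injectable for tests)
    private int failures = 0;
    private long openedAt = -1;         // -1 means the breaker is closed

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < openMillis;
    }

    <T> T call(Supplier<T> integrationPoint, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // fast-fail: the integration point is not called
        }
        try {
            // No lock is held during the remote call, so callers do not block each other.
            T result = integrationPoint.get();
            reset();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }

    private synchronized void reset() {
        failures = 0;
        openedAt = -1;
    }

    private synchronized void recordFailure() {
        failures++;
        if (failures >= failureThreshold) {
            openedAt = clock.getAsLong(); // open (or re-open after a failed trial call)
        }
    }
}
```

A caller might create it as `new CircuitBreaker(3, 30_000, System::currentTimeMillis)` and wrap each call to an integration point in `call(...)`, passing a fallback that returns a cached or reduced result. Combining this with a timeout on the wrapped call keeps a slow integration point from turning into blocked threads.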
15. When things go wrong
• Avoid Blame!!!
• Reduce Service instead of Outage
• Monitor and Gather Data
• Mean Time to Restore Service (MTRS)
• Always generate a test for every bug you find
• Tools are critical (ProfileBox)
• How can you debug a production problem?