This is a high-level presentation on how to develop a monitoring improvement program. The topic of what to monitor is covered in a separate presentation.
2. The Same Old Problem
Corporate LANs & VPNs
ISP Connection
DNS & Internet Services
Content Mgmt System
Social Network Widgets
Site Tracking & Analytics
Banner Ads & Revenue Generators
Multimedia & CDN Content
Home Wireless & Broadband
Mobile Broadband
Is It My Data Center?
• Configuration errors
• Application design issues
• Code defects
• Insufficient infrastructure
• Oversubscription Issues
• Poor routing optimization
• Low cache hit rate
Is It a Service Provider Problem?
• Non-optimized mobile content
• Bad performance under load
• Blocking content delivery
• Incorrect geo-targeted content
Is It an ISP Problem?
• Peering problems
• ISP outages
Is It My Code or a Browser Problem?
• Missing content
• Poorly performing JavaScript
• Inconsistent CSS rendering
• Browser/device incompatibility
• Page size too big
• Conflicting HTML tag support
• Too many objects
• Content not optimized for device
The Cloud
Distributed
Database
Mainframe
Network
Middleware
Storage
3. Anatomy of an Outage
Corporate LANs & VPNs
Load Balancer
Firewall
Web Servers
Message Queue
zOS
CICS
WAS
Database
WAS
Database
zOS
MQ
DB2
1. 5:45-ish pm: CICS ABENDs start flooding the console, but not at a level high enough to generate a ticket
2. 6:00-ish pm: MQ flows are interrupted and start alerting in Flow Diagnostics
3. 6:04pm: Synthetic transactions fail, and at 6:14pm the Ops Center confirms the issue and creates a P0 Incident
4. 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
5. 10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
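The cost of this outage is easier to see as elapsed time. The sketch below is only an illustration: it uses the times from the timeline (treating the "-ish" entries as exact, which is an assumption) and hypothetical key names to work out how long the first symptom went unticketed and how long the incident stayed open before the CICS reset.

```python
from datetime import datetime

FMT = "%H:%M"

# Times from the outage timeline, in 24-hour form.
# The 17:45 and 18:00 entries are approximate ("-ish") on the slide.
events = {
    "first_symptom": "17:45",        # CICS ABENDs start flooding the console
    "flows_interrupted": "18:00",    # MQ flows alert in Flow Diagnostics
    "synthetic_failure": "18:04",    # synthetic transactions fail
    "incident_opened": "18:14",      # Ops Center creates the P0 incident
    "back_end_identified": "18:54",  # flows traced to a "back-end" problem
    "cics_reset_decided": "22:29",   # MQ ruled out, CICS reset chosen
}

def minutes_between(start_key, end_key):
    """Whole minutes between two named events on the same day."""
    start = datetime.strptime(events[start_key], FMT)
    end = datetime.strptime(events[end_key], FMT)
    return int((end - start).total_seconds() // 60)

time_to_detect = minutes_between("first_symptom", "incident_opened")
time_to_resolve = minutes_between("incident_opened", "cics_reset_decided")

print(f"First symptom to P0 incident: {time_to_detect} minutes")
print(f"P0 incident to CICS reset decision: {time_to_resolve} minutes")
```

Under these assumptions, the first symptom preceded the incident by 29 minutes, and the fix that addressed that first symptom was chosen 255 minutes after the incident opened.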
4. Gaining Perspective
Requires Balance
Packet Capture
Synthetic Transactions
Client Monitoring
Client Monitoring
Synthetic Transactions
Server Probe
1. Client to the Server
2. Server to the Client
3. “3rd Party” Vantage Point
4. Synthetic Transactions
Four Perspectives of User Experience
5. Why Multiple Perspectives?
Know Your Customer:
• What they do?
§ Customers care about completing tasks, NOT whether the homepage is available
• Where they do it from?
§ Your customers don’t live in the cloud; test from their perspective
• When they do it?
§ Test at peak and normal traffic levels, to find all the problems
• What expectations do customers have?
§ Is 5 seconds fast enough or does it have to be quicker?
6. Picking Better Monitors
Current Approach
• Itemize the existing monitors
• Brainstorm potential gaps to fill
• Deploy new monitors
Proposed Approach
• Identify the potential risks
• Itemize the existing monitors
• Determine which gaps exist
• Fill the monitoring gaps
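The proposed approach starts from risks rather than from the monitors already in place. A minimal sketch of that comparison follows; the risk names, monitor names, and the idea of recording which risks each monitor detects are all hypothetical and only serve to show the gap calculation.

```python
# Hypothetical risks identified for one service.
risks = {
    "cics_abend_storm",
    "mq_flow_interrupted",
    "web_server_down",
    "db_tablespace_full",
}

# Hypothetical inventory: each existing monitor and the risks it can detect.
monitors = {
    "synthetic_login": {"web_server_down"},
    "flow_diagnostics": {"mq_flow_interrupted"},
    "cpu_utilization": set(),  # detects none of the identified risks
}

def find_gaps(risks, monitors):
    """Risks that no existing monitor detects."""
    covered = set().union(*monitors.values())
    return sorted(risks - covered)

def unmapped_monitors(risks, monitors):
    """Monitors that do not map to any identified risk."""
    return sorted(name for name, detects in monitors.items()
                  if not (detects & risks))

print("Gaps to fill:", find_gaps(risks, monitors))
print("Monitors with no mapped risk:", unmapped_monitors(risks, monitors))
```

Starting from the risk list makes both outputs possible: the gaps to fill, and the existing monitors that may not be earning their keep.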
7. What Does Good Monitoring Look Like?
Corporate LANs & VPNs
Load Balancer
Load Balancer
Firewall
Switch
Web Server Farm
Database
Data Power
Mainframe
Middleware
Load Balancer
1. System Availability
2. Operating System Performance
3. Hardware Monitoring
4. Service/Daemon and Process Availability
5. Error Logs
6. Application Resource KPIs
7. End-to-End Transactions
8. Point of Failure Transactions
9. Fail-Over Success
10. “Activity Monitors” and “Reverse Hockey Stick”
Elements of Good Monitoring
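The slide does not define “Activity Monitors” or “Reverse Hockey Stick”. The sketch below assumes one plausible reading: an activity monitor alerts when expected work stops arriving, and a reverse hockey stick is a sudden drop in a volume chart. The function name, the sample counts, and the 50% threshold are all illustrative assumptions.

```python
from statistics import mean

def activity_dropped(history, current, drop_ratio=0.5):
    """Return True when current activity is below drop_ratio of the recent average.

    history    -- recent per-interval transaction counts (the baseline)
    current    -- the count for the latest interval
    drop_ratio -- assumed threshold; 0.5 flags a drop of more than half
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return current < mean(history) * drop_ratio

recent_counts = [120, 130, 110, 125]  # hypothetical transactions per minute

print(activity_dropped(recent_counts, 115))  # normal variation
print(activity_dropped(recent_counts, 20))   # sudden drop in activity
```

A check like this can fire even when every component reports healthy, because it watches for missing work rather than for errors.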
8. What Matters Most?
Dr. Lee Goldman
Cook County Hospital, Chicago, IL
1. Is the patient feeling unstable angina?
2. Is there fluid in the patient’s lungs?
3. Is the patient’s systolic blood pressure below 100?
The Goldman Algorithm
[Chart: Prediction of Patients Expected to Have a Heart Attack Within 72 Hours. Bars compare Traditional Techniques with the Goldman Algorithm on a 0 to 100 scale.]
By paying attention to what really matters, Dr.
Goldman improved the “false negatives” by 20
percentage points and eliminated the “false
positives” altogether.
9. The Goldman Algorithm
Patient suspected of Acute Cardiac Ischemia
Perform Electrocardiogram (EKG)
ECG Evidence of Acute Myocardial Infarction (MI)? (Yes / No)
• ST-Segment Elevation ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age) or
• Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)
ECG Evidence of Acute Ischemia? (Yes / No)
• ST-Segment Depression ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age) or
• T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age) or
• Left Bundle-Branch Block (New or Unknown Age)
Urgent Factors Present? (this box appears twice in the flowchart)
• Rales Above Both Lung Bases
• Systolic Blood Pressure < 100 mmHg
• Unstable Ischemic Heart Disease
Factor counts on the branches: 0 Factors, 1 Factor, 0 or 1 Factors, 2 or 3 Factors (twice)
Risk levels: High Risk, Moderate Risk, Low Risk, Very Low Risk
Destinations: Coronary Care Unit, Inpatient Telemetry Unit, Observation Unit
10. Seven Deadly Sins
Although Companies Realize the Importance of an Effective Monitoring System,
Most Fall Prey to Common Mistakes That Erode the Value
Life Cycle Activity: Selection, Collection, Reporting, Usage
Nature of Mistake: Strategic, Tactical
• Ignoring the possibilities: Lack of optimal utilization of available data
• One size fits all: Lack of audience segmentation
• “Metrics Toilet”: Lack of aggregation and screening of low-level metrics, resulting in cumbersome reports
• Waiting for the perfect tool: Lack of focus on process, leading to over-reliance on technology
• An arbitrary exercise: Lack of defined criteria for target setting
• Metrics that (don’t) matter: Lack of actionable metrics
• IT’s World View: Lack of user involvement in metrics selection and refinement
Source: Infrastructure Executive Council, 2003
11. Finding Metrics That Matter
§ Will the metric be used in a report? If so, which one? How is it used in the report?
§ Will the metric be used in a dashboard? If so, which one? How will it be used?
§ What action(s) will be taken if an alert is generated? Who are the actors? Will a
ticket be generated? If so, what severity?
§ How often is this event likely to occur? What is the impact if the event occurs?
What is the likelihood it can be detected by monitoring?
§ Will the metric help identify the source of a problem? Is it a coincident /
symptomatic indicator?
§ Is the metric always associated with a single problem? Could this metric become a
false indicator?
§ What is the impact if this goes undetected?
§ What is the lifespan for this metric? What is the potential for changes that may
reduce the efficacy of the metric?
Evaluating the Effectiveness of a Metric
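Several of these questions can be recorded per metric and checked mechanically. The sketch below is one hypothetical way to do that; the `MetricReview` fields, the sample metrics, and the wording of the concerns are assumptions made for illustration and cover only a few of the questions.

```python
from dataclasses import dataclass, field

@dataclass
class MetricReview:
    """Answers to a subset of the evaluation questions for one metric."""
    name: str
    reports: list = field(default_factory=list)     # reports that use the metric
    dashboards: list = field(default_factory=list)  # dashboards that use it
    alert_action: str = ""                          # action taken when it alerts
    actors: list = field(default_factory=list)      # who carries out the action
    single_cause: bool = True                       # tied to a single problem?

def review_concerns(review):
    """List the reasons a metric may not be worth collecting as defined."""
    concerns = []
    if not (review.reports or review.dashboards or review.alert_action):
        concerns.append("not used in any report, dashboard, or alert")
    if review.alert_action and not review.actors:
        concerns.append("alert action has no named actors")
    if not review.single_cause:
        concerns.append("may act as a false indicator")
    return concerns

cpu = MetricReview(name="cpu_utilization")
login = MetricReview(
    name="login_transaction_time",
    dashboards=["ops_wall"],
    alert_action="open ticket",
    actors=["Ops Center"],
)

for review in (cpu, login):
    print(review.name, "->", review_concerns(review) or "no concerns")
```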
12. The Layered Approach
Operating System Monitoring: The bulk of the monitoring performed measures the health of the operating system
Application Resource Monitoring: This monitoring is developed specifically for the technologies used by the application to determine if they are functioning correctly
Transaction Monitoring: Transaction monitoring is the key to good monitoring as it provides depth and the capability to determine customer impact
The overlap between the layers ensures sufficient fault detection
13. Monitoring Patterns
Layers of Pre-Defined Monitoring Patterns
• The OS template is deployed when the
server is provisioned
• As a server is customized to fit its role,
additional templates are deployed
• Templates are stacked on top of each
other until no gaps remain
• This approach provides a high degree of
standardization without sacrificing the
ability to develop a custom solution
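Template stacking can be pictured as merging lists of monitors in a fixed order. The following is a minimal sketch under stated assumptions: the template names and the monitors inside them are hypothetical, and a real monitoring tool would have its own template format.

```python
# Hypothetical pre-defined templates: template name -> monitors it deploys.
TEMPLATES = {
    "os_linux": ["ping", "cpu", "memory", "disk"],
    "web_server": ["httpd_process", "http_response_time", "disk"],
    "database": ["db_listener", "tablespace_usage"],
}

def build_monitoring(roles, os_template="os_linux"):
    """Stack the OS template first, then one template per server role.

    Monitors already deployed by an earlier template are not duplicated.
    """
    stacked = []
    for template in [os_template, *roles]:
        for monitor in TEMPLATES[template]:
            if monitor not in stacked:
                stacked.append(monitor)
    return stacked

# A freshly provisioned server gets only the OS template.
print(build_monitoring([]))
# Customizing it as a web server stacks the web template on top.
print(build_monitoring(["web_server"]))
```

Anything a role needs beyond its templates can still be added by hand, which is where the custom part of the solution comes in.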
14. Application-Technology Matrix
Maps services, applications and technologies
enabling:
•Monitoring investment prioritization
•Monitoring maturity
•Which templates need to be deployed when
new hardware is acquired
•Whether a service has sufficient monitoring
coverage based on its application components
•This approach allows for anticipating changes
to a customer’s monitoring needs
Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy
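A small data structure is enough to show how such a matrix answers the coverage and prioritization questions. In this sketch the 0 to 2 scale comes from the slide, while the application names, the scores assigned to each technology, and the ranking rule are illustrative assumptions.

```python
# Score per technology: 0 = no strategy, 1 = limited monitoring,
# 2 = fully integrated strategy. The values here are made up.
TECH_SCORES = {"WAS": 2, "MQ": 1, "DB2": 2, "CICS": 0}

# Hypothetical mapping of applications to the technologies they depend on.
APP_TECH = {
    "online_banking": ["WAS", "MQ", "CICS"],
    "reporting": ["WAS", "DB2"],
}

def coverage(app):
    """Weakest score and under-monitored technologies for one application."""
    techs = APP_TECH[app]
    return {
        "weakest": min(TECH_SCORES[t] for t in techs),
        "gaps": [t for t in techs if TECH_SCORES[t] < 2],
    }

def investment_priority():
    """Rank technologies: lowest score first, then most widely used."""
    usage = {t: sum(t in techs for techs in APP_TECH.values())
             for t in TECH_SCORES}
    return sorted(TECH_SCORES, key=lambda t: (TECH_SCORES[t], -usage[t], t))

print(coverage("online_banking"))
print(coverage("reporting"))
print(investment_priority())
```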
15. Integrate Your Processes
Presentation Framework
Asset Management & Topology Database
Aggregation and Analysis
Security Management
Availability Management
Configuration Management
Change Management
Performance Management
Enterprise Data Sources
Business
Telemetry
Information
Configuration Discrepancies
Enrichment Data
Business Activity Data
Historical Data
“Enriched” Events
Change Activity
Topology Snapshots
Trend-Related Faults
Discovered Problems
Status Indications
Incidents
Audit Information and Suspicious Activity
Enrichment Data Business Activity Data
Automated Discovery
16. Processing Streams
Situational Awareness Engine
Adapted from http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795
Real-Time Event Streams
Detected and Predicted Situations
Patterns from Historical Data
Causal Relationship from Past RCAs
17. Complex Event Processing
Event Pipeline
Event Queries
Time Window
Data Events
Control Event
Other Events
Event Filter
Scenarios
A
B
C
Feedback Loop
Event Intelligence
Action Events
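The pipeline elements named on this slide (event filter, time window, scenarios, action events) can be illustrated with a very small correlation rule. This is a sketch only, not a real CEP engine: the class name, the event type names, and the 30-minute window are hypothetical, and the scenario loosely echoes a CICS problem followed by interrupted MQ flows.

```python
from collections import deque

class WindowRule:
    """Emit an action event when two event types occur within a time window."""

    def __init__(self, first_type, second_type, window_seconds):
        self.first_type = first_type
        self.second_type = second_type
        self.window_seconds = window_seconds
        self.recent = deque()  # (timestamp, event_type) pairs inside the window

    def process(self, timestamp, event_type):
        # Event filter: ignore event types this scenario does not use.
        if event_type not in (self.first_type, self.second_type):
            return None
        # Time window: discard events that are too old to correlate.
        while self.recent and timestamp - self.recent[0][0] > self.window_seconds:
            self.recent.popleft()
        self.recent.append((timestamp, event_type))
        seen = {etype for _, etype in self.recent}
        if {self.first_type, self.second_type} <= seen:
            self.recent.clear()  # avoid firing again on the same pair
            return f"ACTION: {self.first_type} and {self.second_type} correlated"
        return None

rule = WindowRule("cics_abend", "mq_flow_interrupted", window_seconds=1800)
stream = [(0, "cics_abend"), (100, "cpu_high"), (900, "mq_flow_interrupted")]
for timestamp, event_type in stream:
    action = rule.process(timestamp, event_type)
    if action:
        print(timestamp, action)
```

A feedback loop, as drawn on the slide, would adjust rules such as the window length based on how useful past action events turned out to be; that part is not modeled here.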
19. The Management Eco-System
Capacity Management
Compute Storage Network Facilities
Event Management / Manager of Managers
CMDB
Billing & Chargeback
Software Tracking
Server Monitoring
Storage Manager
Network Performance Manager
Data Center Infrastructure Manager
Capacity Management
Predictive Insights Capacity Analyzer
Automated Reporting Engine
Cloud Orchestrator
Interface for Capacity Planners
Interface for Business Users
Policies Manager
Data Warehouse
20. Improvement Lifecycle
Develop Recommended Best Practices
Deliverables
•Industry Recommendations
•ESM Best Practices
•Questions for the QA Session
Question & Answer Session
Deliverables
•Physical & Logical Diagrams
•Asset List (Hardware & Software)
•PBRA Recommendations for Monitoring
•Existing “Home Grown” Monitoring Identified
•Solution Discussion
Incident History Analysis & Monitor Discovery
Deliverables
•Ticket History Report
•Points of Failure List
•Monitor Inventory List
•Alert History Report
•Alert Logic Flow Chart
•Non-Standard Monitoring Audit
Gap Analysis and Monitoring Strategy Design
Deliverables
•Monitoring Strategy
•Deployment Plan
•Application/Technology Matrix
•Additional Questions
Plan Approval
Deliverables
•Solution Discussion
•Plan Approval Document
Deploy Monitoring
Deliverables
•Monitors
•Alerts
•Netcool Facts
•Readiness Test Results
Integration Required? (Y / N)
Event Integration Design
Deliverables
•Event Life Cycle Matrix
•Data Flow Diagram
•Integration Stories
Build Integration Solution
Deliverables
•Design Document Package
•Integration Rules
•As-Built Document
•Test Plan & Results
•Code Review Results
•Quality Inspection Checklist
Event Integration Test
Deliverables
•Acceptance Document
Closeout Meeting
Deliverables
•Acceptance Document
Legend
SMC: Systems Monitoring Consultant
SA: Systems Administrator
Arch: Platform Architect
SM: Service Manager
PM: Project Manager