Monitoring Improvement
Assessment Process
The Same Old Problem
Corporate LANs & VPNs
ISP Connection
DNS & Internet Services
Content Mgmt System
Social Network Widgets
Site Tracking & Analytics
Banner Ads & Revenue Generators
Multimedia & CDN Content
Home Wireless & Broadband
Mobile Broadband
Is It My Data Center?
• Configuration errors
• Application design issues
• Code defects
• Insufficient infrastructure
• Oversubscription issues
• Poor routing optimization
• Low cache hit rate
Is It a Service Provider Problem?
• Non-optimized mobile content
• Bad performance under load
• Blocking content delivery
• Incorrect geo-targeted content
Is It an ISP Problem?
• Peering problems
• ISP outages
Is It My Code or a Browser Problem?
• Missing content
• Poorly performing JavaScript
• Inconsistent CSS rendering
• Browser/device incompatibility
• Page size too big
• Conflicting HTML tag support
• Too many objects
• Content not optimized for device
The Cloud
Distributed
Database
Mainframe
Network
Middleware
Storage
Anatomy of an Outage
Corporate LANs & VPNs
Load Balancer
Firewall
Web Servers
Message Queue
zOS
CICS
WAS
Database
WAS
Database
zOS
MQ
DB2
1. 5:45-ish pm: CICS ABENDs start flooding the console, but not at a volume high enough to generate a ticket
2. 6:00-ish pm: MQ flows are interrupted and start alerting in Flow Diagnostics
3. 6:04 pm: Synthetic transactions fail, and at 6:14 pm the Ops Center confirms the issue and creates a P0 Incident
4. 6:54 pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
5. 10:29 pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
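As a worked example, the intervals in this timeline can be computed directly. The sketch below is illustrative only: the "-ish" times are approximate on the slide, the event key names are hypothetical, and the calendar date is a placeholder so the clock times can be subtracted.

```python
from datetime import datetime

# Clock times taken from the slide; the "-ish" entries are approximate.
# DAY is a placeholder date (an assumption) used only for subtraction.
DAY = "2000-01-01"
timeline = {
    "cics_abends_start": "17:45",
    "mq_flows_interrupted": "18:00",
    "synthetic_transactions_fail": "18:04",
    "p0_incident_created": "18:14",
    "backend_problem_identified": "18:54",
    "cics_reset_decided": "22:29",
}


def minutes_between(start_key: str, end_key: str) -> int:
    """Return whole minutes elapsed between two named timeline events."""
    fmt = "%Y-%m-%d %H:%M"
    start = datetime.strptime(f"{DAY} {timeline[start_key]}", fmt)
    end = datetime.strptime(f"{DAY} {timeline[end_key]}", fmt)
    return int((end - start).total_seconds() // 60)


# First symptom (CICS ABENDs) to the P0 incident being opened.
time_to_detect = minutes_between("cics_abends_start", "p0_incident_created")
# P0 incident to the decision to reset CICS.
time_to_isolate = minutes_between("p0_incident_created", "cics_reset_decided")
print(time_to_detect, time_to_isolate)
```

On these figures, roughly half an hour passed before an incident existed, and more than four further hours passed before the component that showed the first symptom was reset.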
Gaining Perspective
Requires Balance
Packet Capture
Synthetic Transactions
Client Monitoring
Client Monitoring
Synthetic Transactions
Server Probe
1. Client to the Server
2. Server to the Client
3. “3rd Party” Vantage Point
4. Synthetic Transactions
Four Perspectives of User Experience
Why Multiple Perspectives?
Know Your Customer:
• What do they do?
§ Customers care about completing tasks, NOT whether the homepage is available
• Where do they do it from?
§ Your customers don’t live in the cloud; test from their perspective
• When do they do it?
§ Test at both peak and normal traffic levels to find all the problems
• What expectations do customers have?
§ Is 5 seconds fast enough, or does it have to be quicker?
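One way to test task completion rather than homepage availability is a multi-step synthetic transaction with a response-time target. The following is a minimal sketch, not a reference to any particular monitoring product: `run_synthetic_transaction`, `TransactionResult`, and the step names are hypothetical, and the 5-second default simply echoes the question above.

```python
import time
from typing import Callable, List, NamedTuple, Tuple


class TransactionResult(NamedTuple):
    completed: bool         # did every step of the task succeed?
    failed_step: str        # name of the first failing step, "" if none
    elapsed_seconds: float  # wall-clock time for the whole task
    within_target: bool     # completed AND at or under the target time


def run_synthetic_transaction(
    steps: List[Tuple[str, Callable[[], bool]]],
    target_seconds: float = 5.0,
    clock: Callable[[], float] = time.perf_counter,
) -> TransactionResult:
    """Run the steps of a customer task in order and time the whole task."""
    start = clock()
    for name, step in steps:
        try:
            ok = step()
        except Exception:
            ok = False  # an exception counts as a failed step
        if not ok:
            return TransactionResult(False, name, clock() - start, False)
    elapsed = clock() - start
    return TransactionResult(True, "", elapsed, elapsed <= target_seconds)


# Hypothetical task: each callable would normally drive a real request.
result = run_synthetic_transaction(
    [("login", lambda: True), ("search", lambda: True), ("checkout", lambda: False)]
)
print(result.completed, result.failed_step)
```

Running the same step list from several locations and at different times of day would cover the "where" and "when" questions as well.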
Picking Better Monitors
Current Approach
1. Itemize the existing monitors
2. Brainstorm potential gaps to fill
3. Deploy new monitors
Proposed Approach
1. Identify the potential risks
2. Itemize the existing monitors
3. Determine which gaps exist
4. Fill the monitoring gaps
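The proposed approach starts from risks rather than from the monitors already in place. A minimal sketch of that comparison follows; the function name, risk names, and monitor names are all hypothetical examples, not taken from the deck.

```python
from typing import Dict, List, Set


def find_monitoring_gaps(risks: List[str], monitors: Dict[str, Set[str]]) -> List[str]:
    """Return the risks that no existing monitor claims to detect.

    `monitors` maps a monitor name to the set of risks it covers.
    """
    covered: Set[str] = set().union(*monitors.values())
    return [risk for risk in risks if risk not in covered]


# Step 1: identify the potential risks (hypothetical examples).
risks = ["CICS abend storm", "MQ flow interruption", "Disk full"]
# Step 2: itemize the existing monitors and what each one covers.
monitors = {
    "os_disk_check": {"Disk full"},
    "flow_diagnostics": {"MQ flow interruption"},
}
# Step 3: determine which gaps exist.
gaps = find_monitoring_gaps(risks, monitors)
print(gaps)
```

Step 4, filling the gaps, then becomes a concrete work list instead of a brainstorm.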
What Does Good
Monitoring Look Like?
Corporate LANs & VPNs
Load Balancer
Load Balancer
Firewall
Switch
Web Server Farm
Database
Data Power
Mainframe
Middleware
Load Balancer
1. System Availability
2. Operating System Performance
3. Hardware Monitoring
4. Service/Daemon and Process Availability
5. Error Logs
6. Application Resource KPIs
7. End-to-End Transactions
8. Point of Failure Transactions
9. Fail-Over Success
10. “Activity Monitors” and “Reverse Hockey Stick”
Elements of Good Monitoring
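The deck does not define item 10. The sketch below assumes, and this is only an assumption, that an "activity monitor" watches for a sudden drop in normal activity (the "reverse hockey stick" shape) rather than for an explicit error. The function name, window size, and ratio are hypothetical.

```python
from statistics import mean
from typing import Sequence


def activity_dropped(
    counts: Sequence[float],
    baseline_window: int = 5,
    drop_ratio: float = 0.5,
) -> bool:
    """Flag when the latest count falls below a fraction of the recent average.

    `counts` holds per-interval activity (for example, transactions per
    minute), oldest first. The last value is compared with the mean of the
    `baseline_window` values before it.
    """
    if len(counts) < baseline_window + 1:
        return False  # not enough history to form a baseline
    baseline = mean(counts[-(baseline_window + 1):-1])
    if baseline == 0:
        return False  # no normal activity to drop from
    return counts[-1] < baseline * drop_ratio


print(activity_dropped([100, 98, 102, 101, 99, 20]))
```

A check like this can raise an alert even when no component reports an error, which is the situation the other nine elements may miss.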
What Matters Most?
Dr. Lee Goldman
Cook County Hospital, Chicago, IL
1. Is the patient feeling unstable angina?
2. Is there fluid in the patient’s lungs?
3. Is the patient’s systolic blood pressure below 100?
The Goldman Algorithm
[Chart: Prediction of Patients Expected to Have a Heart Attack Within 72 Hours, comparing Traditional Techniques with the Goldman Algorithm on a scale of 0 to 100]
By paying attention to what really matters, Dr.
Goldman improved the “false negatives” by 20
percentage points and eliminated the “false
positives” altogether.
The Goldman Algorithm
Patient suspected of Acute Cardiac Ischemia: Perform Electrocardiogram (EKG)
ECG Evidence of Acute Myocardial Infarction (MI)? (Yes / No)
• ST-Segment Elevation ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age), or
• Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)
ECG Evidence of Acute Ischemia? (Yes / No)
• ST-Segment Depression ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age), or
• T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age), or
• Left Bundle-Branch Block (New or Unknown Age)
Urgent Factors Present?
• Rales Above Both Lung Bases
• Systolic Blood Pressure < 100 mmHg
• Unstable Ischemic Heart Disease
Branches by number of urgent factors: 0 Factors, 1 Factor, 0 or 1 Factors, 2 or 3 Factors
Risk levels: High Risk, Moderate Risk, Low Risk, Very Low Risk
Care settings: Coronary Care Unit, Inpatient Telemetry Unit, Observation Unit
Seven Deadly Sins
Although Companies Realize the Importance of an Effective Monitoring System, Most Fall Prey to Common Mistakes That Erode the Value
[Chart axes: Life Cycle Activity (Selection, Collection, Reporting, Usage) against Nature of Mistake (Strategic, Tactical)]
• Ignoring the possibilities: Lack of optimal utilization of available data
• One size fits all: Lack of audience segmentation
• “Metrics Toilet”: Lack of aggregation and screening of low-level metrics, resulting in cumbersome reports
• Waiting for the perfect tool: Lack of focus on process, leading to over-reliance on technology
• An arbitrary exercise: Lack of defined criteria for target setting
• Metrics that (don’t) matter: Lack of actionable metrics
• IT’s World View: Lack of user involvement in metrics selection and refinement
Source: Infrastructure Executive Council, 2003
Finding Metrics That Matter
§ Will the metric be used in a report? If so, which one? How is it used in the report?
§ Will the metric be used in a dashboard? If so, which one? How will it be used?
§ What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket be generated? If so, what severity?
§ How often is this event likely to occur? What is the impact if the event occurs? What is the likelihood it can be detected by monitoring?
§ Will the metric help identify the source of a problem? Is it a coincident / symptomatic indicator?
§ Is the metric always associated with a single problem? Could this metric become a false indicator?
§ What is the impact if this goes undetected?
§ What is the lifespan for this metric? What is the potential for changes that may reduce the efficacy of the metric?
Evaluating the Effectiveness of a Metric
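The questions above can be recorded per metric so that metrics with no consumer and no action stand out. This is a minimal sketch under stated assumptions: the `MetricReview` class, its fields, and the rule in `is_actionable` are hypothetical simplifications of the checklist, not a method given in the deck.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetricReview:
    """Answers to a subset of the evaluation questions for one metric."""
    name: str
    reports: List[str] = field(default_factory=list)     # which reports use it
    dashboards: List[str] = field(default_factory=list)  # which dashboards use it
    alert_action: str = ""      # what is done, and by whom, when it alerts
    ticket_severity: str = ""   # severity of any ticket generated
    single_problem: bool = False  # always tied to exactly one problem?

    def is_actionable(self) -> bool:
        # Assumption: a metric matters if something consumes it or acts on it.
        return bool(self.reports or self.dashboards or self.alert_action)

    def open_questions(self) -> List[str]:
        """List checklist items that still need an answer or a second look."""
        questions = []
        if self.alert_action and not self.ticket_severity:
            questions.append("What severity will the ticket have?")
        if not self.single_problem:
            questions.append("Could this metric become a false indicator?")
        return questions


idle = MetricReview("cpu_idle_percent")
depth = MetricReview("mq_queue_depth", dashboards=["Ops Center"],
                     alert_action="page MQ support", single_problem=True)
print(idle.is_actionable(), depth.is_actionable(), depth.open_questions())
```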
The Layered Approach
• Operating System Monitoring: The bulk of the monitoring performed measures the health of the operating system.
• Application Resource Monitoring: This monitoring is developed specifically for the technologies used by the application to determine whether they are functioning correctly.
• Transaction Monitoring: Transaction monitoring is the key to good monitoring, as it provides depth and the capability to determine customer impact.
• The overlap between layers ensures sufficient fault detection.
Monitoring Patterns
Layers of Pre-Defined Monitoring Patterns
• The OS template is deployed when the server is provisioned
• As a server is customized to fit its role, additional templates are deployed
• Templates are stacked on top of each other until no gaps remain
• This approach provides a high degree of standardization without sacrificing the ability to develop a custom solution
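Template stacking can be pictured as a set union: start with the OS template, add one template per role, and check what is still uncovered. The sketch below is illustrative only; the template names and the monitors inside them are hypothetical.

```python
from typing import Dict, Iterable, Set

# Hypothetical pre-defined monitoring patterns, keyed by template name.
TEMPLATES: Dict[str, Set[str]] = {
    "os": {"system_availability", "cpu", "memory", "disk"},
    "web_server": {"httpd_process", "http_error_log"},
    "database": {"db_process", "db_error_log", "tablespace"},
}


def stack_templates(roles: Iterable[str]) -> Set[str]:
    """Return the monitors deployed for a server with the given roles.

    The OS template is always applied first, as at provisioning time.
    """
    monitors = set(TEMPLATES["os"])
    for role in roles:
        monitors |= TEMPLATES[role]
    return monitors


def remaining_gaps(required: Iterable[str], roles: Iterable[str]) -> Set[str]:
    """Return required monitors not supplied by the stacked templates."""
    return set(required) - stack_templates(roles)


print(sorted(remaining_gaps({"cpu", "db_process"}, ["web_server"])))
```

Anything left in the result of `remaining_gaps` is where either another template or a custom monitor is needed.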
Application-Technology Matrix
Maps services, applications and technologies, enabling:
•Monitoring investment prioritization
•Monitoring maturity
•Identifying which templates need to be deployed when new hardware is acquired
•Determining whether a service has sufficient monitoring coverage based on its application components
•Anticipating changes to a customer’s monitoring needs
Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy
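A small sketch of how such a matrix might be queried follows. The 0 to 2 scores come from the slide; everything else is an assumption for illustration: the technology scores, the application names, and in particular the rule that an application's coverage equals its weakest technology.

```python
from typing import Dict, List

SCORE_LABELS = {
    0: "No Strategy",
    1: "Limited Monitoring",
    2: "Fully Integrated Strategy",
}

# Hypothetical scores per technology and hypothetical application make-up.
technology_scores: Dict[str, int] = {"WAS": 2, "MQ": 1, "CICS": 0, "DB2": 2}
application_technologies: Dict[str, List[str]] = {
    "online_banking": ["WAS", "MQ", "CICS", "DB2"],
    "intranet": ["WAS", "DB2"],
}


def coverage(application: str) -> int:
    """Score an application by its least-monitored technology (an assumption)."""
    return min(technology_scores[t] for t in application_technologies[application])


def weakest_technologies(application: str) -> List[str]:
    """Return the technologies that hold the application's score down."""
    score = coverage(application)
    return [t for t in application_technologies[application]
            if technology_scores[t] == score]


for app in application_technologies:
    print(app, SCORE_LABELS[coverage(app)], weakest_technologies(app))
```

Sorting applications by this score gives one possible ordering for monitoring investment.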
Integrate Your Processes
Presentation
Framework
Asset Management
& Topology
Database
Aggregation
and Analysis
Security
Management
Availability
Management
Configuration
Management
Change
Management
Performance
Management
Enterprise Data
Sources
Business
Telemetry
Information
Configuration Discrepancies
Enrichment Data
Business Activity Data
Historical Data
“Enriched” Events
Change Activity
Topology Snapshots
Trend-Related Faults
Discovered Problems
Status Indications
Incidents
Audit Information and Suspicious Activity
Enrichment Data Business Activity Data
Automated
Discovery
Processing Streams
Situational
Awareness
Engine
Adapted from http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795
Real-Time
Event Streams
Detected and
Predicted Situations
Patterns from
Historical Data
Causal Relationship
from Past RCAs
Complex Event Processing
Event Pipeline
Event Queries
Time Window
Data Events
Control Event
Other Events
Event Filter
Scenarios
A
B
C
Feedback Loop
Event Intelligence
Action Events
Automated Action
Notification and Escalation
Business Impact Analysis
Root Cause Analysis
Correlation and Event Suppression
Enrichment
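Three of the pipeline stages named on this slide (event filter, time window, and enrichment, with simple suppression of repeats) can be sketched in a few lines. This is a simplified illustration, not the deck's implementation: the `Event` fields, the severity scale, the thresholds, and the `topology` lookup are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since an arbitrary start
    source: str       # component that raised the event
    kind: str         # event type, e.g. "abend"
    severity: int     # hypothetical scale, higher is worse


def process(
    events: Iterable[Event],
    min_severity: int = 3,
    window_seconds: float = 300.0,
    topology: Optional[Dict[str, str]] = None,
) -> List[dict]:
    """Filter, suppress repeats inside a time window, then enrich."""
    topology = topology or {}
    last_seen: Dict[Tuple[str, str], float] = {}
    output: List[dict] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        # Event filter: drop low-severity noise.
        if event.severity < min_severity:
            continue
        # Suppression: skip repeats of the same source and kind that arrive
        # within the window of the previous occurrence (sliding window).
        key = (event.source, event.kind)
        previous = last_seen.get(key)
        last_seen[key] = event.timestamp
        if previous is not None and event.timestamp - previous < window_seconds:
            continue
        # Enrichment: attach the affected service from topology data.
        output.append({"event": event,
                       "service": topology.get(event.source, "unknown")})
    return output


events = [
    Event(0, "cics1", "abend", 4),
    Event(60, "cics1", "abend", 4),
    Event(70, "mq1", "flow_interrupted", 5),
    Event(80, "web1", "info", 1),
    Event(500, "cics1", "abend", 4),
]
enriched = process(events, topology={"cics1": "online_banking"})
print([(e["event"].source, e["service"]) for e in enriched])
```

Because the window slides with every repeat, a continuous flood stays suppressed after its first event, which may or may not be the desired behaviour for a given alert.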
Meta-Data Integration Bus
Distributed Collectors
Distributed Collectors
LOB Managed
Monitoring
System
Service Provider
Monitoring
System
Vendor Managed
Monitoring
System
Element
Manager
Element
Manager
Element
Manager
Other
Enterprise
Data
Document
Sharing
Service Desk CMDB
Batch
Scheduling
Knowledge
Database
Online Run
Book
PBX/Call
Manager
Visualization Framework
Common Event Format
Topology And
Relationship
Database
Automated
Action Tools
Distributed Collectors
Automated
Provisioning
System
Predictive
Analysis
Automated
Change
Reconciliation
Security
Management
Archive and Report
Business
Telemetry Data
Service Center
and Enterprise
Notification Tool
Event Processing
The Management Eco-System
Capacity Management
Compute, Storage, Network, Facilities
Event Management /
Manager of Managers
CMDB
Billing &
Chargeback
Software
Tracking
Server
Monitoring
Storage
Manager
Network
Performance
Manager
Data Center
Infrastructure
Manager
Capacity Management
Predictive Insights Capacity Analyzer
Automated Reporting Engine
Cloud Orchestrator
Interface for
Capacity Planners
Interface for
Business Users
Policies Manager
Data Warehouse
Improvement Lifecycle
Closeout Meeting
Deliverables
•Acceptance Document
Event Integration Test
Deliverables
•Acceptance Document
Build Integration Solution
Deliverables
•Design Document Package
•Integration Rules
•As-Built Document
•Test Plan & Results
•Code Review Results
•Quality Inspection Checklist
Event Integration Design
Deliverables
•Event Life Cycle Matrix
•Data Flow Diagram
•Integration Stories
Integration Required? (Y / N)
Deploy Monitoring
Deliverables
•Monitors
•Alerts
•Netcool Facts
•Readiness Test Results
Plan Approval
Deliverables
•Solution Discussion
•Plan Approval Document
Gap Analysis and Monitoring Strategy Design
Deliverables
•Monitoring Strategy
•Deployment Plan
•Application/Technology Matrix
•Additional Questions
Incident History Analysis & Monitor Discovery
Deliverables
•Ticket History Report
•Points of Failure List
•Monitor Inventory List
•Alert History Report
•Alert Logic Flow Chart
•Non-Standard Monitoring Audit
Question & Answer Session
Deliverables
•Physical & Logical Diagrams
•Asset List (Hardware & Software)
•PBRA Recommendations for Monitoring
•Existing “Home Grown” Monitoring Identified
•Solution Discussion
Develop Recommended Best Practices
Deliverables
•Industry Recommendations
•ESM Best Practices
•Questions for the QA Session
Legend
SA: Systems Administrator
SMC: Systems Monitoring Consultant
Arch: Platform Architect
SM: Service Manager
PM: Project Manager

How to improve your system monitoring

  • 2. The Same Old Problem Corporate LANs & VPNs ISP Connection DNS & Internet Services Content Mgmt System Social Network Widgets Site Tracking & Analytics Banner Ads & Revenue Generators Multimedia & CDN Content Home Wireless & Broadband Mobile Broadband Is It My Data Center? • Configuration errors • Application design issues • Code defects • Insufficient infrastructure • Oversubscription Issues • Poor routing optimization • Low cache hit rate Is It a Service Provider Problem? • Non-optimized mobile content • Bad performance under load • Blocking content delivery • Incorrect geo-targeted content Is it an ISP Problem? • Peering problems • ISP Outages Is it My Code or a Browser Problem? • Missing content • Poorly performing JavaScript • Inconsistent CSS rendering • Browser/device incompatibility • Page size too big • Conflicting HTML tag support • Too many objects • Content not optimized for device The Cloud Distributed Database Mainframe Network Middleware Storage
  • 3. Anatomy of an Outage Corporate LANs & VPNs Load Balancer Firewall Web Servers Message Queue zOS CICS WAS Database WAS Database zOS MQ DB2 4 3 1 5:45-ish pm: CICS ABENDS start flooding the console but not high enough to ticket 2 6:00-ish pm: MQ flows start are interrupted and are alerting in Flow Diagnostics 6:04pm: Synthetic transactions fail at and 6:14 the Ops Center confirms the issue and creates a P0 Incident 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem 10:29pm: Support teams investigate MQ and ultimately and rule it out and ultimately decide to reset CICS to resolve the issue 5
  • 4. Gaining Perspective Requires Balance Packet Capture Synthetic Transactions Client Monitoring Client Monitoring Synthetic Transactions Server Probe 1. Client to the Server 2. Server to the Client 3. “3rd Party” Vantage Point 4. Synthetic Transactions Four Perspectives of User Experience
  • 5. Why Multiple Perspectives? Know Your Customer: • What they do? § Customers care about completing tasks NOT whether the homepage is available • Where they do it from? § Your customers don’t live in the cloud, test from their perspective • When they do it? § Test at peak and normal traffic levels, to find all the problems • What expectations do customers have? § Is 5 seconds fast enough or does it have to be quicker?
  • 6. Itemize the existing monitors Brainstorm potential gaps to fill Deploy new monitors Identify the potential risks Itemize the existing monitors Determine if which gaps exist Fill the monitoring gaps Current Approach Proposed Approach Picking Better Monitors
  • 7. What Does Good Monitoring Look Like? Corporate LANs & VPNs Load Balancer Load Balancer Firewall Switch Web Server Farm Database Data Power Mainframe Middleware Load Balancer 1. System Availability 2. Operating SystemPerformance 3. Hardware Monitoring 4. Service/Daemon and Process Availability 5. Error Logs 6. Application Resource KPIs 7. End-to-End Transactions 8. Point of Failure Transactions 9. Fail-Over Success 10.“Activity Monitors” and “Reverse Hockey Stick” Elements of Good Monitoring 32 4 5 61 7 8 9 10
  • 8. What Matters Most? Dr. Lee Goldman Cook County Hospital, Chicago, IL 1. Is the patient feelingunstable angina? 2. Is there fluid in the patient’s lungs? 3. Is the patient’s systolic blood pressure below 100? The Goldman Algorithm Prediction of Patients Expected to Have a Heart Attack Within 72 Hours 0 20 40 60 80 100 Traditional Techniques Goldman Algorithm By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.
  • 9. The Goldman Algorithm ECG Evidenceof Acute Ischemia? ST-Segment Depression ≥ 1mmin ≥ 2 ContiguousLeads (New or Unknown Age) or T- Wave Inversion in ≥ 2 Contiguous Leads(Newor Unknown Age) or Left Bundle-Branch Block (Newor UnknownAge) Observation Unit Inpatient Telemetry Unit High Risk Low Risk Very Low RiskModerate Risk Yes N o Coronary Care Unit N o ECG Evidenceof Acute MyocardialInfarction (MI)? ST-Segment Elevation ≥ 1mmin ≥ 2 Contiguous Leads (New orUnknown Age) or Pathologic Q Waves in ≥ 2 ContiguousLeads (New or Unknown Age) Yes Patient suspected of Acute Cardiac Ischema Perform Electrocardiogram (EKG) 0 Factors2 or 3 Factors 1 Factors0 or 1 Factors2 or 3 Factors Urgent FactorsPresent? Rates Above BothLung Bases Systolic BloodPressure <100 mmHg UnstableIschemic Heart Disease Urgent FactorsPresent? Rates Above BothLung Bases Systolic BloodPressure <100 mmHg UnstableIschemic Heart Disease
  • 10. Seven Deadly Sins Although Companies Realize the Importance of an Effective Monitoring System, Most Fall Prey to Common Mistakes That Erode the Value UsageReportingCollectionSelection Strategic Tactical NatureofMistake Life Cycle Activity Ignoring the possibilities: Lack of optimal utilization of available data One size fits all: Lack of audience segmentation “Metrics Toilet”: Lack of aggregation and screening of low-level metrics, resulting in cumbersome reports Waiting for the perfect tool: Lack of focus on process, leading to over-reliance on technology An arbitrary exercise: Lack of defined criteria for target setting Metrics that (don’t) matter: Lack of actionable metrics IT’s World View: Lack of user involvement in metrics selection and refinement Source: Infrastructure Executive Council, 2003
  • 11. Finding Metrics That Matter: Evaluating the Effectiveness of a Metric • Will the metric be used in a report? If so, which one? How is it used in the report? • Will the metric be used in a dashboard? If so, which one? How will it be used? • What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket be generated? If so, what severity? • How often is this event likely to occur? What is the impact if the event occurs? What is the likelihood it can be detected by monitoring? • Will the metric help identify the source of a problem? Is it a coincident / symptomatic indicator? • Is the metric always associated with a single problem? Could this metric become a false indicator? • What is the impact if this goes undetected? • What is the lifespan for this metric? What is the potential for changes that may reduce the efficacy of the metric?
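One way to make the questions above comparable across candidate metrics is to record the answers and score them, in the style of a failure-mode risk ranking. The sketch below is an illustration only and is not part of the original deck: the `MetricCandidate` fields, the 1 to 5 scales, and the `priority` formula are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class MetricCandidate:
    name: str
    used_in_report: bool      # appears in at least one report
    used_in_dashboard: bool   # appears in at least one dashboard
    has_action: bool          # an alert leads to a defined action and owner
    likelihood: int           # how often the event occurs, 1 (rare) to 5 (frequent)
    impact: int               # impact if the event occurs, 1 (minor) to 5 (severe)
    detectability: int        # chance monitoring catches it, 1 (unlikely) to 5 (certain)

    def is_actionable(self) -> bool:
        """A metric nobody reports on, displays, or acts on is a candidate to drop."""
        return self.has_action or self.used_in_report or self.used_in_dashboard

    def priority(self) -> int:
        """Hypothetical ranking score: frequent, severe, detectable events rank first."""
        if not self.is_actionable():
            return 0
        return self.likelihood * self.impact * self.detectability


def rank(candidates):
    """Order candidates from highest to lowest priority score."""
    return sorted(candidates, key=lambda m: m.priority(), reverse=True)
```

A metric that scores 0 here is one that fails the first three questions; it may still be worth collecting for later analysis, but it should probably not generate alerts.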
  • 12. The Layered Approach. Operating System Monitoring: The bulk of the monitoring performed measures the health of the operating system. Application Resource Monitoring: This monitoring is developed specially for the technologies used by the application to determine if they are functioning correctly. Transaction Monitoring: Transaction monitoring is the key to good monitoring as it provides depth and the capability to determine customer impact. The overlap ensures sufficient fault detection.
  • 13. Monitoring Patterns Layers of Pre-Defined Monitoring Patterns • The OS template is deployed when the server is provisioned • As a server is customized to fit its role, additional templates are deployed • Templates are stacked on top of each other until no gaps remain • This approach provides a high degree of standardization without sacrificing the ability to develop a custom solution
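The stacking idea can be sketched as a set union over templates. Everything named below is hypothetical and chosen only for illustration: the template names, the monitors in each template, and the role-to-template mapping are assumptions, not the deck's actual catalog.

```python
# Hypothetical template catalog: template name -> set of monitors it deploys.
TEMPLATES = {
    "os-linux": {"cpu", "memory", "disk", "filesystem"},
    "web-server": {"http_response", "worker_threads"},
    "database": {"connections", "tablespace", "replication_lag"},
}

# Hypothetical mapping from a server role to the templates stacked for it.
ROLE_STACKS = {
    "base": ["os-linux"],
    "web": ["os-linux", "web-server"],
    "db": ["os-linux", "database"],
}


def monitors_for(role, custom=()):
    """Union the monitors of every template stacked for a role, plus custom ones."""
    deployed = set()
    for template in ROLE_STACKS[role]:
        deployed |= TEMPLATES[template]
    return deployed | set(custom)


def coverage_gaps(role, required, custom=()):
    """Return the required monitors that no stacked template or custom monitor provides."""
    return set(required) - monitors_for(role, custom)
```

For example, `coverage_gaps("web", {"cpu", "http_response", "cert_expiry"})` reports `cert_expiry` as a gap, which signals that another template or a custom monitor is still needed.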
  • 14. Application-Technology Matrix. Maps services, applications and technologies, enabling: • Monitoring investment prioritization • Monitoring maturity • Which templates need to be deployed when new hardware is acquired • Whether a service has sufficient monitoring coverage based on its application components • This approach allows for anticipating changes to a customer’s monitoring needs. Scores indicate: 0 – No Strategy; 1 – Limited Monitoring; 2 – Fully Integrated Strategy
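A minimal sketch of how such a matrix could answer the coverage and prioritization questions. The 0/1/2 scale comes from the slide; the service, application, and technology names and their scores are made-up sample data, and both helper functions are assumptions.

```python
# Scores follow the slide: 0 = no strategy, 1 = limited monitoring, 2 = fully integrated.
# Technology names and scores are made-up sample data.
TECH_SCORES = {"WebSphere": 2, "DB2": 1, "MQ": 0}

# Hypothetical mapping of services to applications and of applications to technologies.
SERVICE_APPS = {"online-banking": ["web-portal", "payments"]}
APP_TECH = {"web-portal": ["WebSphere"], "payments": ["MQ", "DB2"]}


def service_coverage(service):
    """Return (weakest score, technologies at that score) for a service."""
    techs = {t for app in SERVICE_APPS[service] for t in APP_TECH[app]}
    weakest = min(TECH_SCORES[t] for t in techs)
    return weakest, sorted(t for t in techs if TECH_SCORES[t] == weakest)


def investment_priorities():
    """Technologies ordered from least to most mature monitoring."""
    return sorted(TECH_SCORES, key=lambda t: (TECH_SCORES[t], t))
```

Treating a service as only as well covered as its weakest technology is a design choice of this sketch; an average or weighted score would be an equally plausible reading of the matrix.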
  • 15. Integrate Your Processes Presentation Framework Asset Management & Topology Database Aggregation and Analysis Security Management Availability Management Configuration Management Change Management Performance Management Enterprise Data Sources Business Telemetry Information Configuration Discrepancies Enrichment Data Business Activity Data Historical Data “Enriched” Events Change Activity Topology Snapshots Trend-Related Faults Discovered Problems Status Indications Incidents Audit Information and Suspicious Activity Enrichment Data Business Activity Data Automated Discovery
  • 16. Processing Streams. Situational Awareness Engine (adapted from http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795): Real-Time Event Streams; Detected and Predicted Situations; Patterns from Historical Data; Causal Relationship from Past RCAs
  • 17. Complex Event Processing Event Pipeline Event Queries Time Window Data Events Control Event Other Events Event Filter Scenarios A B C Feedback Loop Event Intelligence Action Events
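The filter, time window, and scenario stages of the pipeline can be sketched with a small sliding-window detector. This is an illustration under stated assumptions, not the engine shown on the slide: the `WindowedScenario` class, the event dictionary layout (`type`, `ts`), and the reset-after-firing behavior are all hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import deque


class WindowedScenario:
    """Emit an action event when `threshold` matching events fall within `window` seconds."""

    def __init__(self, event_type, threshold, window):
        self.event_type = event_type
        self.threshold = threshold
        self.window = window
        self.timestamps = deque()

    def process(self, event):
        """Feed one event dict with 'type' and 'ts' (seconds); return an action event or None."""
        if event["type"] != self.event_type:      # event filter
            return None
        self.timestamps.append(event["ts"])
        # Slide the time window: drop events older than `window` seconds.
        while event["ts"] - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.threshold:
            self.timestamps.clear()               # reset so one burst yields one action
            return {"type": "action", "reason": self.event_type, "ts": event["ts"]}
        return None
```

With `WindowedScenario("abend", 3, 60)`, three `abend` events inside one minute produce a single action event, while the same three events spread over several minutes produce none.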
  • 18. Event Processing. Automated Action Notification and Escalation Business Impact Analysis Root Cause Analysis Correlation and Event Suppression Enrichment Meta-Data Integration Bus Distributed Collectors Distributed Collectors LOB Managed Monitoring System Service Provider Monitoring System Vendor Managed Monitoring System Element Manager Element Manager Element Manager Other Enterprise Data Document Sharing Service Desk CMDB Batch Scheduling Knowledge Database Online Run Book PBX/Call Manager Visualization Framework Common Event Format Topology And Relationship Database Automated Action Tools Distributed Collectors Automated Provisioning System Predictive Analysis Automated Change Reconciliation Security Management Archive and Report Business Telemetry Data Service Center and Enterprise Notification Tool
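Two of the stages named in this architecture, enrichment and event suppression, can be sketched in a few lines. The code below is a hedged illustration only: the event fields (`host`, `check`, `ts`), the topology lookup, and the 300-second suppression window are assumptions and do not describe any particular product.

```python
def enrich(event, topology):
    """Attach topology metadata (hypothetical 'service' and 'owner' fields) to an event."""
    extra = topology.get(event["host"], {})
    return {**event, **extra}


def suppress_duplicates(events, window=300):
    """Keep the first event per (host, check) and drop repeats within `window` seconds."""
    last_kept = {}
    kept = []
    for event in sorted(events, key=lambda e: e["ts"]):
        key = (event["host"], event["check"])
        if key not in last_kept or event["ts"] - last_kept[key] > window:
            kept.append(event)
            last_kept[key] = event["ts"]
    return kept
```

Suppressing before enriching keeps the topology lookups to one per surviving event; a real pipeline might instead enrich first so that suppression rules can use the added fields.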
  • 19. The Management Eco-System Capacity Management Compute Storage Network Facilities Event Management / Manager of Managers CMDB Billing & Chargeback Software Tracking Server Monitoring Storage Manager Network Performance Manager Data Center Infrastructure Manager Capacity Management Predictive Insights Capacity Analyzer Automated Reporting Engine Cloud Orchestrator Interface for Capacity Planners Interface for Business Users Policies Manager Data Warehouse
  • 20. Improvement Lifecycle. Closeout Meeting (Deliverables: Acceptance Document). Event Integration Test (Deliverables: Acceptance Document). Build Integration Solution (Deliverables: Design Document Package; Integration Rules; As-Built Document; Test Plan & Results; Code Review Results; Quality Inspection Checklist). Event Integration Design (Deliverables: Event Life Cycle Matrix; Data Flow Diagram; Integration Stories). Integration Required? (Y / N). Deploy Monitoring (Deliverables: Monitors; Alerts; Netcool Facts; Readiness Test Results). Plan Approval (Deliverables: Solution Discussion; Plan Approval Document). Gap Analysis and Monitoring Strategy Design (Deliverables: Monitoring Strategy; Deployment Plan; Application/Technology Matrix; Additional Questions). Incident History Analysis & Monitor Discovery (Deliverables: Ticket History Report; Points of Failure List; Monitor Inventory List; Alert History Report; Alert Logic Flow Chart; Non-Standard Monitoring Audit). Question & Answer Session (Deliverables: Physical & Logical Diagrams; Asset List (Hardware & Software); PBRA Recommendations for Monitoring; Existing “Home Grown” Monitoring Identified; Solution Discussion). Develop Recommended Best Practices (Deliverables: Industry Recommendations; ESM Best Practices; Questions for the QA Session). Legend: SMC = Systems Monitoring Consultant; SA = Systems Administrator; Arch = Platform Architect; SM = Service Manager; PM = Project Manager. [Diagram: each step is tagged with the roles involved.]