2. Overview
● Where IT Operations is Today
● Integrated Monitoring
● Automated Operations
● AI-Ops
3. Where We Are Now - Diversity
● Lots of IT Systems
● Physical Servers, Disks, RAID
● Network Devices of Many Types
● SANs & Storage Systems
4. Where We Are Now - Diversity
● Private Clouds - VMware & OpenStack
● Public Clouds - Many
● Hybrid Clouds & Everything Else
● and more ...
5. Where We Are Now - Many Monitoring Systems
● Each Piece Monitored a Different Way
● Zabbix, Prometheus, Cacti
● Networks by SNMP, Cacti
● Commercial via BMC, etc.
● APM & Tracing, too
● Logs to ELK, or nowhere
7. Where We Are Now - Resource Monitoring
● 75% of Monitoring is Resources
○ CPU, RAM, Network
● 20% for Services
○ Performance
● 5% for URLs
● Very Little For:
○ Apps & Customer Experience
○ Internal Services & Golden Signals
○ Architecture, Topology
○ Configuration
8. Challenges I
● Alarm Overload & Fatigue
● Hard to Set Thresholds
● Hard to Know What’s Wrong
● Getting Worse
○ DevOps & Faster Releases
○ Dynamic Systems
○ Microservices
○ Clouds & Cloud Services
● More Players - Ops, Developers, DevOps, SRE
9. Challenges II
● Systems Not Connected
● Collection Methods Vary
● Metric Definitions Vary
● Alerts Vary
● Can’t Unify Anything
○ Metrics
○ Alerts
○ Incidents
○ Understanding
11. Goals - Single System
● Single Monitoring System
● Single Metric System
● Single Source of Truth
● Anomaly Detection
● Resource Metrics
● Golden Signals
● Logs & Events
● APM & Tracing
12. Goals - Rich Data & Information
● Add Context & Details
● Discovery - Metrics on All
○ Hosts & Nodes
○ Services
○ Connections
○ Dependencies
● CMDB
○ Deep Service Configuration
○ Security & Governance
13. How to Get There - Re-Thinking Monitoring
● Monitoring Strategy
● What to Monitor
● How to Monitor
● How to Alert
● How to Troubleshoot
● How to Manage Incidents
15. Monitoring Strategy
● Business Level - Key KPIs, drivers of below items
● User Level - What the User Experiences
● App Level - Engineers, Managers, Users think in apps
● Service Level - The Real Work & Real Problems
● Resource Level - Underlying Everything
● Security - Important Everywhere
16. Monitoring Sources - Need Them All
● OS Metrics & Logs
● Service Metrics & Logs
● App Metrics & Logs
● Cloud Metrics & Logs
● APM Data & Tracing
● CMDB Configs
● Architecture & Dependencies
● Auto-Discovery of Everything
17. What to Monitor
● Focus on User & App Level (Results)
○ User & Browser via RUM (Real User Monitoring)
● Modern Golden Signals
● System Health (Health & Status Endpoints)
● Hard Errors (Only Alert if User Impact)
○ Disk full, service/server dead, etc.
● Background Useful Data (no alerts)
18. Golden Signals - Key to SRE
● Modern Golden Signals
○ USE - Resources (Utilization, Saturation, Errors)
○ RED - Results (Rate, Errors, Duration/Latency)
● Use Specialized Agents
● For Every Service
● At Every System Level
● Down to Disks, Networks, etc.
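A minimal sketch of the RED side of the golden signals, assuming request records are already available in memory. The `Request` fields and the `red_signals` helper are hypothetical names for illustration, not part of any particular agent.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    timestamp: float    # seconds since epoch (hypothetical field)
    duration_ms: float  # time taken to serve the request
    ok: bool            # False if the request ended in an error


def red_signals(requests, window_seconds):
    """Compute Rate, Errors, and Duration for one window of requests."""
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "median_ms": None}
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate": total / window_seconds,  # requests per second
        "error_ratio": errors / total,   # share of failed requests
        "median_ms": median(r.duration_ms for r in requests),
    }


sample = [
    Request(0.0, 100.0, True),
    Request(1.0, 200.0, True),
    Request(2.0, 900.0, False),
    Request(3.0, 300.0, True),
]
signals = red_signals(sample, window_seconds=60)
```

The same three numbers can be computed per service and per system level, which is what makes them comparable across very different components.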
19. Getting Data & Metrics
● Many Collection Methods
○ Agents, SNMP, SSH, Cloud APIs
○ Defined vs. Ad Hoc Metrics
● Focus on Golden Signals
○ Need Special Tools for RED
● Use Good Statistics
○ Medians, Percentiles, Sampling
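To illustrate why medians and percentiles matter, the sketch below summarizes a made-up latency sample in which one slow request distorts the mean. The sample values and the `summarize` helper are assumptions for illustration only.

```python
from statistics import mean, median, quantiles

# Hypothetical latency samples in milliseconds; one request is very slow.
latencies_ms = [12, 14, 15, 15, 16, 17, 18, 20, 22, 2500]


def summarize(samples):
    """Return the mean, median, and upper percentiles of a sample."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = quantiles(samples, n=100, method="inclusive")
    return {
        "mean": mean(samples),
        "median": median(samples),
        "p95": cuts[94],
        "p99": cuts[98],
    }


summary = summarize(latencies_ms)
```

Here the mean is pulled far above what a typical request experiences, while the median stays close to the bulk of the data and the high percentiles expose the slow tail.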
20. CMDB
● Key to Many Processes
● Service Discovery
● Security & Compliance
● Change Tracking
● Faster Troubleshooting
● Expert System Source
● Needs Specialized Collector
21. Architecture & Topology
● Use Special Agents to Discover
● Key to Understanding
● Drives System Diagram
● Drives Understanding
● Drives Dependencies
● Often Changing
22. Dependencies
● What Depends on What
● Critical for Microservices
● Key to Alert Impact Analyses
● Key to Alert Consolidation
● Key to AIOps Root Cause
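A minimal sketch of alert impact analysis over a dependency map, assuming the map has already been discovered. The service names and the `impacted_by` helper are hypothetical.

```python
from collections import deque

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["db", "cache"],
    "reports": ["db"],
    "db": [],
    "cache": [],
}


def impacted_by(failed, depends_on):
    """Return every service that depends on `failed`, directly or indirectly."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


db_impact = impacted_by("db", DEPENDS_ON)
```

The same traversal can drive alert consolidation: alerts on services inside the impacted set are likely symptoms of the failed dependency rather than separate problems.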
23. Observability - For Developers & DevOps
● Add Metrics to Application Code
● Uses Structured Logs & Emitted Events
○ With Metrics, Latency, and Errors
● Lots of Tags
○ User, Customer, Browser, Product, OS, much more
● Canned & Ad Hoc Analytics & Exploration
● Should be Integrated into Single System
○ Correlate USE/RED Metrics with APM Data
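A minimal sketch of a structured event emitted from application code, carrying latency, error, and tags. The function names, field names, and tag values are all assumptions for illustration; a real system would ship these events to the unified platform instead of printing them.

```python
import json
import time
from contextlib import contextmanager


def make_event(name, duration_ms, error=None, **tags):
    """Build one structured log event as a JSON string."""
    event = dict(tags)  # tags first, so core fields cannot be overwritten
    event.update({
        "event": name,
        "ts": time.time(),
        "duration_ms": round(duration_ms, 3),
        "error": error,
    })
    return json.dumps(event, sort_keys=True)


@contextmanager
def timed_event(name, emit=print, **tags):
    """Time a block of code and emit one event with latency and error."""
    start = time.perf_counter()
    error = None
    try:
        yield
    except Exception as exc:
        error = type(exc).__name__
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        emit(make_event(name, elapsed_ms, error, **tags))


# Usage: tags such as customer and browser are hypothetical examples.
with timed_event("checkout", customer="acme", browser="firefox"):
    pass  # application work goes here
```

Because every event carries its own tags, the same data supports both canned dashboards and ad hoc slicing by user, customer, browser, and so on.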
25. How to Alert
● Two Types of Alerts
○ Alert to Wake Someone Up/Urgent
○ Alert as Information (FYI)
● Alert on User Impact
● All the Rest is Background Info
● Smart Alert Strategy
○ Thresholds where it makes sense
○ Anomalies are Key, but noisy, too
26. Anomaly Alerting
● Many Types
● Historical Checks
○ Univariate, Multi-variate, Neural Networks, Seasons
● Cluster Checks
○ Different from Peers
● Ratio Checks
○ Metrics don’t Match
○ e.g. Requests vs. Queries
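The three check types above can be sketched in their simplest form as follows. The z-score test stands in for the univariate historical check only; the function names, limits, and tolerances are assumptions, and real systems would also handle seasonality and multivariate cases.

```python
from statistics import mean, median, stdev


def historical_anomaly(history, current, z_limit=3.0):
    """Historical check: is `current` far from this metric's own past?"""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_limit


def cluster_outliers(values_by_host, tolerance=0.5):
    """Cluster check: which hosts differ from the median of their peers?"""
    med = median(values_by_host.values())
    limit = tolerance * abs(med)
    return sorted(h for h, v in values_by_host.items() if abs(v - med) > limit)


def ratio_anomaly(requests, queries, expected_ratio, tolerance=0.25):
    """Ratio check: do two related metrics still move together?"""
    if requests == 0:
        return queries > 0
    ratio = queries / requests
    return abs(ratio - expected_ratio) > tolerance * expected_ratio
```

For example, a web tier that normally issues about three database queries per request would trip the ratio check if it suddenly issued nine, even when both metrics are individually within their thresholds.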
27. How to Manage Incidents
● Incidents are Real Issues (ITIL)
○ Something Broke
○ Combine Many Alerts
● Categorize & Document
● Troubleshoot, Fix, Document
● Communicate
● Review & Report (Post Mortem)
28. How to Troubleshoot
● Train People in Troubleshooting
● Defined Processes - Especially for Emergencies
● Use All the Data
○ Alerts & Incidents
○ Metrics
○ Logs & Events
○ Topology & Dependencies
● Root Cause Analyses Critical
● Runbooks & Team Communications
29. How to Manage Problems
● Problems are Recurring Incidents (ITIL)
● Key to Reducing False Alerts & Fatigue
● Key to Improving Alert Thresholds
● Key to Improving Systems
● Ideally Dedicated Team or Resources
● Needs Monitoring System Support
33. Upgrade & Integrate Monitoring
● Plan What to Monitor
● Plan How to Monitor
● Single Unified System
● Multi-Phase Process
● Temporary Integrations
34. Add Golden Signals, App/User Focus
● Set Up App & Service Structures
● Plan Golden Signals
○ Needs Special Agents & Techniques
○ Varies a Lot by Service
● Set Baselines & Anomalies
35. Add Discovery & Dependencies
● Driven by Monitoring System
● Get Data
● Build Diagrams
● Verify Dependencies
36. Add APM & Tracing
● Part of Monitoring
● Initial Setup & Test
● Tune for Transactions
● Extract User/RUM Metrics
● Set Alerts as Needed
● Train Developers on Tracing
● Correlate USE/RED Metrics with APM Data
38. Upgrade & Integrate Logging
● Logs are Key Part of Troubleshooting
○ OS Level - Linux & Windows
○ Services - Web, Java/Tomcat, MySQL, etc.
● Send All to Unified Platform
● Get Metrics & Analytics
● Add Alerting on Logs - Errors & Metrics
39. Add Observability in Code
● Structured Logs are Best - Events
● Emitted by App Code with Tags & Dimensions
● Include Metrics - Ideally Latency & Errors
● Build Analysis Dashboards
40. Upgrade Alerting, Incidents, Problems
● Move to ITIL Naming & Processes
● Build Procedures
● Dedicated Problem Team
41. Training Teams
● General Training
● ITIL & ITOP Training
● Golden Signals Thinking
● APM, RUM, Tracing Usage
43. Automated Operations - Goals
● Automate Things
● Build Things Faster
● Change Things Faster
● Fix Things Faster
● Reduce Manual Mistakes
● Improve Consistency
● Support Large Scale Systems
44. Key Components
● Clouds & Dynamic Systems
● Infrastructure as Code
● Config Management Systems
○ Ansible, Puppet, Chef, SaltStack
● Automated Troubleshooting
● Auto-Healing
● Auto-Governance
45. Clouds & Dynamic Systems
● Cloud APIs Support Automation
● Clouds Can Change Themselves
○ Auto-Scaling
● CI/CD Systems Can Change Them
● Core of Lots of Automated Processes
46. Infrastructure as Code
● Define Infrastructure Programmatically
● CloudFormation, Terraform, etc.
● Build & Change Continually
● Continually Updated
● Versioned & Can Diff
47. Config Management Systems
● Support Auto Deployments
● Leverage Infra-as-Code
● Ansible, Puppet, Chef, SaltStack
● Build Large, Reproducible Systems
48. Automated Troubleshooting
● Built on Data & Rule Systems
● Auto-Gather More Details
● Help Find Root Causes
● Use Automation System
● Advanced Use Needs AI-Ops
49. Auto-Healing
● Automatically Fix Things
● Driven by Rule Engines
● Fix Things Faster
● Rapid Response 24x7
● Use Automation System
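A minimal sketch of a rule-driven auto-healing loop. The rules, state fields, and actions here are hypothetical; the actions only return a description of what they would do, where a real setup would call the automation system.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict], bool]  # does this rule apply to the state?
    action: Callable[[Dict], str]      # what to do when it applies


def restart_service(state):
    # Placeholder: a real action would trigger the automation system.
    return f"restart {state['service']}"


def clean_tmp(state):
    # Placeholder: a real action would run a cleanup job on the host.
    return f"clean /tmp on {state['host']}"


RULES = [
    Rule("service-down", lambda s: s.get("service_up") is False, restart_service),
    Rule("disk-nearly-full", lambda s: s.get("disk_used_pct", 0) >= 90, clean_tmp),
]


def heal(state, rules=RULES):
    """Run every rule whose condition matches and report the actions taken."""
    return [(r.name, r.action(state)) for r in rules if r.condition(state)]


actions = heal({"host": "web1", "service": "nginx",
                "service_up": False, "disk_used_pct": 95})
```

Keeping conditions and actions as data makes the rules easy to review, version, and extend without changing the engine itself.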
52. AI-Ops - What is it?
● Analytical IT Operations
● Artificial Intelligence (AI) Operations
● Machine Learning for Operations
53. Really, What is It?
● Synthesis of Many Sources of Information
● Better Understanding of Problems & Situation
● Predictions for the Future
● Usually with Machine Learning / Big Data
57. Many Vendors
● Most Focus on Alert Consolidation
● Some on Root Alert/Issue
● Few on Real Root Cause
● Need Deep Data for RCA
59. Alert Reduction & Consolidation
● Merge Related & Duplicate Alerts
○ By Time, Dependencies, History, etc.
● Helps Ops Teams Focus
● Avoids Missing Key Things in Noise
● First Phase of any AIOps
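A minimal sketch of consolidation by time only, assuming alerts arrive as dictionaries with `ts`, `host`, and `check` keys (hypothetical field names). Alerts that follow each other within a window are merged into one group, and duplicates are counted instead of repeated.

```python
def consolidate(alerts, window_seconds=120):
    """Group alerts that occur close together and count duplicates."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if groups and alert["ts"] - groups[-1]["last_ts"] <= window_seconds:
            group = groups[-1]  # close enough to the previous alert
        else:
            group = {"first_ts": alert["ts"], "last_ts": alert["ts"], "alerts": {}}
            groups.append(group)
        group["last_ts"] = alert["ts"]
        key = (alert["host"], alert["check"])
        group["alerts"][key] = group["alerts"].get(key, 0) + 1
    return groups


raw_alerts = [
    {"ts": 0, "host": "db1", "check": "disk"},
    {"ts": 30, "host": "db1", "check": "disk"},
    {"ts": 60, "host": "web1", "check": "http"},
    {"ts": 1000, "host": "web2", "check": "cpu"},
]
groups = consolidate(raw_alerts)
```

Grouping by dependencies or history would add further keys to the merge decision; time alone is simply the easiest place to start.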
60. Root Cause Analysis
● Multi-Mode & Method
● Expert System Uses Everything
● Use Dependencies from Discovery
● Use History with Feedback
● Sort & Prioritize
● Enrich with Additional Data Collection
○ Including Automated Actions
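A minimal sketch of one way to use discovered dependencies when sorting alerts: among the alerting services, prefer those whose own direct dependencies are healthy. This is a simple heuristic that finds a likely root alert, not a full root cause analysis, and the service names and `root_candidates` helper are hypothetical.

```python
# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["db", "cache"],
    "reports": ["db"],
    "db": [],
    "cache": [],
}


def root_candidates(alerting, depends_on):
    """Return alerting services that have no alerting direct dependency."""
    alerting = set(alerting)
    return sorted(
        service
        for service in alerting
        if not any(dep in alerting for dep in depends_on.get(service, []))
    )


candidates = root_candidates(["web", "api", "db"], DEPENDS_ON)
```

An expert system would combine a ranking like this with history, feedback, and additional data collection before naming a real root cause.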