Unified Operations Vision
By Steve Mushero
March, 2019
Overview
● Where IT Operations is Today
● Integrated Monitoring
● Automated Operations
● AI-Ops
Where We Are Now - Diversity
● Lots of IT Systems
● Physical Servers, Disks, RAID
● Network Devices of Many Types
● SANs & Storage Systems
Where We Are Now - Diversity
● Private Clouds - VMware & OpenStack
● Public Clouds - Many
● Hybrid Clouds & Everything Else
● and more ...
Where We Are Now - Many Monitoring Systems
● Each Piece Monitored a Different Way
● Zabbix, Prometheus, Cacti
● Networks by SNMP, Cacti
● Commercial via BMC, etc.
● APM & Tracing, too
● Logs to ELK, or nowhere
So Many Tools ...
Where We Are Now - Resource Monitoring
● 75% of Monitoring is Resources
○ CPU, RAM, Network
● 20% for Services
○ Performance
● 5% for URLs
● Very Little For:
○ Apps & Customer Experience
○ Internal Services & Golden Signals
○ Architecture, Topology
○ Configuration
Challenges I
● Alarm Overload & Fatigue
● Hard to Set Thresholds
● Hard to Know What’s Wrong
● Getting Worse
○ DevOps & Faster Releases
○ Dynamic Systems
○ Microservices
○ Clouds & Cloud Services
● More Players - Ops, Developers, DevOps, SRE
Challenges II
● Systems Not Connected
● Collection Methods Vary
● Metric Definitions Vary
● Alerts Vary
● Can’t Unify Anything
○ Metrics
○ Alerts
○ Incidents
○ Understanding
Integrated Monitoring
Goals - Single System
● Single Monitoring System
● Single Metric System
● Single Source of Truth
● Anomaly Detection
● Resource Metrics
● Golden Signals
● Logs & Events
● APM & Tracing
Goals - Rich Data & Information
● Add Context & Details
● Discovery - Metrics on All
○ Hosts & Nodes
○ Services
○ Connections
○ Dependencies
● CMDB
○ Deep Service Configuration
○ Security & Governance
How to Get There - Re-Thinking Monitoring
● Monitoring Strategy
● What to Monitor
● How to Monitor
● How to Alert
● How to Troubleshoot
● How to Manage Incidents
What to Monitor
Monitoring Strategy
● Business Level - Key KPIs, drivers of below items
● User Level - What the User Experiences
● App Level - Engineers, Managers, Users think in apps
● Service Level - The Real Work & Real Problems
● Resource Level - Underlying Everything
● Security - Important Everywhere
Monitoring Sources - Need Them All
● OS Metrics & Logs
● Service Metrics & Logs
● App Metrics & Logs
● Cloud Metrics & Logs
● APM Data & Tracing
● CMDB Configs
● Architecture & Dependencies
● Auto-Discovery of Everything
What to Monitor
● Focus on User & App Level (Results)
○ User & Browser via RUM (Real User Monitoring)
● Modern Golden Signals
● System Health (Health & Status Endpoints)
● Hard Errors (Only Alert if User Impact)
○ Disk full, service/server dead, etc.
● Background Useful Data (no alerts)
Golden Signals - Key to SRE
● Modern Golden Signals
○ USE - Resources (Utilization, Saturation, Errors)
○ RED - Results (Rate, Errors, Duration/Latency)
● Use Specialized Agents
● For Every Service
● At Every System Level
● Down to Disks, Networks, etc.
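A minimal sketch of the RED side of the golden signals, assuming each request in a time window has already been recorded with a duration and a success flag. The `Request` type and the `red_signals` function are hypothetical names used only for illustration; a real agent would read these values from service logs or instrumentation.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # False if the request ended in an error


def red_signals(requests, window_seconds):
    """Summarize one time window of requests as Rate, Errors, Duration."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "median_ms": None}
    failed = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": len(requests) / window_seconds,       # Rate
        "error_ratio": failed / len(requests),               # Errors
        "median_ms": statistics.median(r.duration_ms for r in requests),  # Duration
    }


if __name__ == "__main__":
    window = [Request(10, True), Request(20, True), Request(30, False), Request(40, True)]
    print(red_signals(window, window_seconds=2))
```

The same three numbers can be computed per service and per system level, which is what makes them comparable across very different components.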
Getting Data & Metrics
● Many Collection Methods
○ Agents, SNMP, SSH, Cloud APIs
○ Defined vs. Ad Hoc Metrics
● Focus on Golden Signals
○ Need Special Tools for RED
● Use Good Statistics
○ Medians, Percentiles, Sampling
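To illustrate why medians and percentiles are preferred over averages, here is a small sketch using the nearest-rank percentile definition. The latency values are made up for the example, and `percentile` is a hypothetical helper, not part of any monitoring tool.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Nine fast requests and one very slow one (illustrative data, in ms).
    latencies_ms = [12, 14, 15, 13, 16, 14, 15, 13, 900, 14]
    print("mean:", sum(latencies_ms) / len(latencies_ms))
    print("p50: ", percentile(latencies_ms, 50))
    print("p90: ", percentile(latencies_ms, 90))
    print("p99: ", percentile(latencies_ms, 99))
```

A single outlier pulls the mean far away from what most users experienced, while the median stays close to the typical request and the high percentile exposes the slow tail.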
CMDB
● Key to Many Processes
● Service Discovery
● Security & Compliance
● Change Tracking
● Faster Troubleshooting
● Expert System Source
● Needs Specialized Collector
Architecture & Topology
● Use Special Agents to Discover
● Key to Understanding
● Drives System Diagram
● Drives Understanding
● Drives Dependencies
● Often Changing
Dependencies
● What Depends on What
● Critical for MicroServices
● Key to Alert Impact Analyses
● Key to Alert Consolidation
● Key to AIOps Root Cause
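As a rough sketch of alert impact analysis, the dependency data can be treated as a graph and walked "upward" from a failed component. The service names and the `impacted_by` function are assumptions for illustration; in practice the map would come from discovery rather than being written by hand.

```python
from collections import defaultdict, deque


def impacted_by(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert "A depends on B" into "B is depended on by A".
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    impacted = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    impacted.discard(failed)  # only relevant if the graph contains a cycle
    return impacted


if __name__ == "__main__":
    depends_on = {
        "web": ["api"],
        "api": ["mysql", "redis"],
        "reports": ["mysql"],
        "mysql": ["disk-1"],
    }
    print(sorted(impacted_by(depends_on, "mysql")))
```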
Observability - For Developers & DevOps
● Add Metrics to Application Code
● Uses Structured Logs & Emitted Events
○ With Metrics, Latency, and Errors
● Lots of Tags
○ User, Customer, Browser, Product, OS, much more
● Canned & Ad Hoc Analytics & Exploration
● Should be Integrated into Single System
○ Correlate USE/RED Metrics with APM Data
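A structured event emitted from application code might look like the sketch below: one JSON object per line, carrying latency, error, and free-form tags. The field names (`ts`, `event`, `duration_ms`, `error`, `tags`) and the helper functions are assumptions for the example, not a standard schema.

```python
import json
import time


def make_event(name, duration_ms, error=None, **tags):
    """Build one structured event with a metric (latency), an error field, and tags."""
    return {
        "ts": time.time(),
        "event": name,
        "duration_ms": round(duration_ms, 3),
        "error": error,
        "tags": tags,
    }


def to_line(event):
    """Serialize an event as a single JSON line, ready for a log pipeline."""
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    start = time.perf_counter()
    # ... the application work being measured would run here ...
    elapsed_ms = (time.perf_counter() - start) * 1000
    event = make_event("checkout", elapsed_ms, customer="acme", browser="firefox", os="linux")
    print(to_line(event))
```

Because every event carries the same tags, the analytics side can slice latency and errors by customer, browser, or product without changes to the code.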
Visualization
● Dashboards
● Diagrams - Layered Arch, Dependencies
● Graphs - Summary, System, Service, Deeper
● Advanced - Heatmaps, Box Plots, Histograms
● Analytics - Cluster, Deviance, Cycles
How to Alert
● Two Types of Alerts
○ Alert to Wake Someone Up/Urgent
○ Alert as Information (FYI)
● Alert on User Impact
● All the Rest is Background Info
● Smart Alert Strategy
○ Thresholds where it makes sense
○ Anomalies are Key, but noisy, too
Anomaly Alerting
● Many Types
● Historical Checks
○ Univariate, Multi-variate, Neural Networks, Seasons
● Cluster Checks
○ Different from Peers
● Ratio Checks
○ Metrics don’t Match
○ e.g. Requests vs. Queries
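The three check types above can each be sketched in a few lines. These are deliberately simple stand-ins: a z-score against history, distance from the peer median, and a ratio compared with an expected value. Function names, default limits, and sample data are assumptions for illustration; production systems would also handle seasonality and multivariate effects.

```python
import statistics


def historical_anomaly(history, current, z_limit=3.0):
    """Historical check: is `current` more than z_limit standard deviations from the past mean?"""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # needs at least two history points
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_limit


def cluster_outliers(peer_values, tolerance=0.5):
    """Cluster check: hosts whose value differs from the peer median by more than `tolerance` (a fraction)."""
    median = statistics.median(peer_values.values())
    if median == 0:
        return []
    return sorted(host for host, value in peer_values.items()
                  if abs(value - median) / median > tolerance)


def ratio_anomaly(requests, queries, expected_ratio, tolerance=0.25):
    """Ratio check: do queries per request stray from the expected ratio?"""
    if requests == 0:
        return queries > 0
    ratio = queries / requests
    return abs(ratio - expected_ratio) / expected_ratio > tolerance


if __name__ == "__main__":
    print(historical_anomaly([100, 102, 98, 101, 99, 100, 103, 97], 120))
    print(cluster_outliers({"web-1": 40, "web-2": 42, "web-3": 41, "web-4": 95}))
    print(ratio_anomaly(requests=1000, queries=9000, expected_ratio=3))
```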
How to Manage Incidents
● Incidents are Real Issues (ITIL)
○ Something Broke
○ Combine Many Alerts
● Categorize & Document
● Troubleshoot, Fix, Document
● Communicate
● Review & Report (Post Mortem)
How to Troubleshoot
● Train People in Troubleshooting
● Defined Processes - Especially for Emergencies
● Use All the Data
○ Alerts & Incidents
○ Metrics
○ Logs & Events
○ Topology & Dependencies
● Root Cause Analyses Critical
● Runbooks & Team Communications
How to Manage Problems
● Problems are Recurring Incidents (ITIL)
● Key to Reducing False Alerts & Fatigue
● Key to Improving Alert Thresholds
● Key to Improving Systems
● Ideally Dedicated Team or Resources
● Needs Monitoring System Support
Getting To Integrated Monitoring
Getting to Integrated Monitoring
● Big Project
● Multiple Phases
● Lots of Details
● Involves Integration
● And New Strategies
Getting to Integrated Monitoring
● Upgrade & Integrate Monitoring
● Add Golden Signals, App/User Focus
● Add Discovery & Dependencies
● Add APM & Tracing
● Upgrade & Integrate Logging
● Add Observability in Code
● Upgraded Alerting, Incident, Problems
● Train Teams
Upgrade & Integrate Monitoring
● Plan What to Monitor
● Plan How to Monitor
● Single Unified System
● Multi-Phase Process
● Temporary Integrations
Add Golden Signals, App/User Focus
● Set Up App & Service Structures
● Plan Golden Signals
○ Needs Special Agents & Techniques
○ Varies a Lot by Service
● Set Baselines & Anomalies
Add Discovery & Dependencies
● Driven by Monitoring System
● Get Data
● Build Diagrams
● Verify Dependencies
Add APM & Tracing
● Part of Monitoring
● Initial Setup & Test
● Tune for Transactions
● Extract User/RUM Metrics
● Set Alerts as Needed
● Train Developers on Tracing
● Correlate USE/RED Metrics with APM Data
Tracing Example
Upgrade & Integrate Logging
● Logs are Key Part of Troubleshooting
○ OS Level - Linux & Windows
○ Services - Web, Java/Tomcat, MySQL, etc.
● Send All to Unified Platform
● Get Metrics & Analytics
● Add Alerting on Logs - Errors & Metrics
Add Observability in Code
● Structured Logs are Best - Events
● Emitted by App Code with Tags & Dimensions
● Include Metrics - Ideally Latency & Errors
● Build Analyses Dashboards
Upgraded Alerting, Incident, Problems
● Move to ITIL Naming & Processes
● Build Procedures
● Dedicated Problem Team
Training Teams
● General Training
● ITIL & ITOP Training
● Golden Signals Thinking
● APM, RUM, Tracing Usage
Automated Operations
Automated Operations - Goals
● Automate Things
● Build Things Faster
● Change Things Faster
● Fix Things Faster
● Reduce Manual Mistakes
● Improve Consistency
● Support Large Scale Systems
Key Components
● Clouds & Dynamic Systems
● Infrastructure as Code
● Config Management Systems
○ Ansible, Puppet, Chef, SaltStack
● Automated Troubleshooting
● Auto-Healing
● Auto-Governance
Clouds & Dynamic Systems
● Cloud APIs Support Automation
● Clouds Can Change Themselves
○ Auto-Scaling
● CI/CD Systems Can Change Them
● Core of Lots of Automated Processes
Infrastructure as Code
● Define Infrastructure Programmatically
● CloudFormation, Terraform, etc.
● Build & Change Continually
● Continually Updated
● Versioned & Can Diff
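The "can diff" property can be illustrated with a toy comparison between a desired definition and the current state. This is only a conceptual sketch: the resource names and the `plan` function are hypothetical, and it is not how CloudFormation or Terraform work internally.

```python
def plan(desired, actual):
    """Compare desired and actual resource definitions and list what would change."""
    desired_names, actual_names = set(desired), set(actual)
    return {
        "create": sorted(desired_names - actual_names),
        "delete": sorted(actual_names - desired_names),
        "update": sorted(name for name in desired_names & actual_names
                         if desired[name] != actual[name]),
    }


if __name__ == "__main__":
    desired = {
        "web-1": {"size": "large", "image": "v2"},
        "web-2": {"size": "large", "image": "v2"},
        "db-1": {"size": "xlarge", "image": "v1"},
    }
    actual = {
        "web-1": {"size": "small", "image": "v2"},
        "db-1": {"size": "xlarge", "image": "v1"},
        "old-batch": {"size": "small", "image": "v0"},
    }
    print(plan(desired, actual))
```

Keeping the desired definition in version control is what makes this comparison repeatable and reviewable.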
Config Management Systems
● Support Auto Deployments
● Leverage Infra-as-Code
● Ansible, Puppet, Chef, SaltStack
● Build Large, Reproducible Systems
Automated Troubleshooting
● Built on Data & Rule Systems
● Auto-Gather More Details
● Help Find Root Causes
● Use Automation System
● Advanced Use Needs AI-Ops
Auto-Healing
● Automatically Fix Things
● Driven by Rule Engines
● Fix Things Faster
● Rapid Response 7x24
● Use Automation System
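A rule engine for auto-healing can be sketched as an ordered list of condition and action pairs. Everything here is hypothetical: the alert fields, the rule names, and the actions, which only return a description of what they would do rather than touching any system. A real setup would hand the action to the automation platform.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # does this rule apply to the alert?
    action: Callable[[dict], str]    # what to do about it (dry-run description here)


def restart_service(alert):
    return f"would restart {alert['service']} on {alert['host']}"


def clean_tmp(alert):
    return f"would clean temporary files on {alert['host']}"


RULES = [
    Rule("service-down",
         lambda a: a.get("check") == "process" and a.get("state") == "down",
         restart_service),
    Rule("disk-full",
         lambda a: a.get("check") == "disk" and a.get("used_pct", 0) >= 95,
         clean_tmp),
]


def heal(alert, rules=RULES):
    """Run the first matching rule; fall back to a human when nothing matches."""
    for rule in rules:
        if rule.matches(alert):
            return rule.name, rule.action(alert)
    return None, "no rule matched; escalate to on-call"


if __name__ == "__main__":
    print(heal({"check": "process", "state": "down", "service": "nginx", "host": "web-1"}))
    print(heal({"check": "cpu", "host": "db-1"}))
```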
Auto-Governance
● Continual Compliance
● Guardrails to Prevent Risks
● Systems Auto-Correct
○ Remove Bad Security
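One possible guardrail, sketched under simple assumptions: firewall rules are plain dictionaries with a `cidr` and a `port`, and any rule that opens an administrative port to the whole internet is removed. The rule format, the port list, and the function name are all hypothetical.

```python
def remove_open_admin_rules(rules, admin_ports=(22, 3389)):
    """Split rules into those kept and those removed for exposing admin ports to everyone."""
    kept, removed = [], []
    for rule in rules:
        world_open = rule["cidr"] == "0.0.0.0/0"
        if world_open and rule["port"] in admin_ports:
            removed.append(rule)
        else:
            kept.append(rule)
    return kept, removed


if __name__ == "__main__":
    rules = [
        {"cidr": "0.0.0.0/0", "port": 443},   # public HTTPS is fine
        {"cidr": "0.0.0.0/0", "port": 22},    # SSH open to the world
        {"cidr": "10.0.0.0/8", "port": 22},   # SSH from the internal network
    ]
    kept, removed = remove_open_admin_rules(rules)
    print("kept:", kept)
    print("removed:", removed)
```

Running a check like this continually, rather than at audit time, is what turns a policy into continual compliance.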
AI-Ops
AI-Ops - What is it?
● Analytical IT Operations
● Artificial Intelligence (AI) Operations
● Machine Learning for Operations
Really, What is It?
● Synthesis of Many Sources of Information
● Better Understanding of Problems & Situation
● Predictions for the Future
● Usually with Machine Learning / Big Data
From Gartner
Goals
● Impact Analysis
● Alert Reduction
● Alert Consolidation
● Root Cause Analysis
● Auto-Healing
● Prediction
Method - Combine Everything
● Alerts
● Events
● Metrics
● History
● Topology
● Dependencies
Many Vendors
● Most Focus on Alert Consolidation
● Some on Root Alert/Issue
● Few on Real Root Cause
● Need Deep Data for RCA
Alert Reduction & Consolidation
● Merge Related & Duplicate Alerts
○ By Time, Dependencies, History, etc.
● Helps Ops Teams Focus
● Avoids Missing Key Things in Noise
● First Phase of any AIOps
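A minimal sketch of merging duplicate alerts by time only: alerts with the same source and check are folded into one group as long as each arrives within a window of the previous one. The alert fields (`ts`, `source`, `check`), the window length, and the function name are assumptions; merging by dependencies or history would build on the same idea.

```python
def consolidate(alerts, window_seconds=300):
    """Fold repeated (source, check) alerts that arrive close together into single groups."""
    groups = []
    latest = {}  # (source, check) -> most recent group for that key
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["source"], alert["check"])
        group = latest.get(key)
        if group is not None and alert["ts"] - group["last_ts"] <= window_seconds:
            group["count"] += 1
            group["last_ts"] = alert["ts"]
        else:
            group = {"source": alert["source"], "check": alert["check"],
                     "first_ts": alert["ts"], "last_ts": alert["ts"], "count": 1}
            latest[key] = group
            groups.append(group)
    return groups


if __name__ == "__main__":
    alerts = [
        {"ts": 0, "source": "db-1", "check": "disk"},
        {"ts": 60, "source": "db-1", "check": "disk"},
        {"ts": 120, "source": "db-1", "check": "disk"},
        {"ts": 30, "source": "web-1", "check": "cpu"},
        {"ts": 1000, "source": "db-1", "check": "disk"},
    ]
    for group in consolidate(alerts):
        print(group)
```

Here five raw alerts become three groups, and the on-call engineer sees a count instead of a stream of repeats.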
Root Cause Analysis
● Multi-Mode & Method
● Expert System Uses Everything
● Use Dependencies from Discovery
● Use History with Feedback
● Sort & Prioritize
● Enrich with Additional Data Collection
○ Including Automated Actions
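One simple way to use discovered dependencies for root cause ranking: among all alerting components, keep only those with no alerting component beneath them. This is a heuristic sketch, not a full expert system; the service names and the `root_candidates` function are hypothetical.

```python
def root_candidates(depends_on, alerting):
    """Return alerting nodes that have no alerting dependency, direct or transitive."""
    alerting = set(alerting)

    def has_alerting_upstream(node, seen):
        for dep in depends_on.get(node, ()):
            if dep in seen:
                continue  # already visited; also protects against cycles
            seen.add(dep)
            if dep in alerting or has_alerting_upstream(dep, seen):
                return True
        return False

    return sorted(node for node in alerting
                  if not has_alerting_upstream(node, {node}))


if __name__ == "__main__":
    depends_on = {
        "web": ["api"],
        "api": ["mysql", "redis"],
        "reports": ["mysql"],
        "mysql": ["disk-1"],
    }
    # web and api are alerting only because mysql is.
    print(root_candidates(depends_on, {"web", "api", "mysql"}))
```

The remaining candidates are where additional data collection or automated diagnostics are most worth running first.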
Manual-Healing
● One-Click Fixing
● Procedure/Runbooks
● For high-risk, complex solutions
● Automatic Real-Time
● Uses Automation Platform
Auto-Healing
● Automatic Real-Time Fixing
● Uses Automation Platform
● Helps 7x24, Reducing On-Call
● Responds in Seconds, not Hours
Prediction
● See Problems in Advance
● Solve Problems in Advance
● Capacity Planning - Resources
● Elevated Errors & Pending Failures
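For capacity planning, the simplest form of prediction is a straight-line trend. The sketch below fits a least-squares line to daily disk usage and estimates how many days remain until a limit is reached. It assumes one sample per day and roughly linear growth; the function name and data are illustrative.

```python
def days_until_full(daily_usage_pct, limit_pct=100.0):
    """Estimate days from the latest sample until usage reaches limit_pct, or None if not growing."""
    n = len(daily_usage_pct)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_pct))
    slope = sxy / sxx                      # growth in percentage points per day
    if slope <= 0:
        return None                        # flat or shrinking: no predicted fill date
    intercept = mean_y - slope * mean_x
    day_full = (limit_pct - intercept) / slope
    return max(0.0, day_full - (n - 1))    # measured from the most recent sample


if __name__ == "__main__":
    usage = [50, 52, 54, 56, 58]           # percent used, one sample per day
    print(days_until_full(usage))
```

An alert on "fewer than N days remaining" is often more useful than a fixed threshold on the current percentage.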
Summary
● IT Operations is Manual & Messy
● Monitoring is Diverse & Distributed
● Automation helps take Action
● AI-Ops Helps Fix Stuff Faster
Thank You
www.Siglos.io

Unified Operations Vision

  • 1. Unified Operations Vision By Steve Mushero March, 2019
  • 2. Overview ● Where IT Operations is Today ● Integrated Monitoring ● Automated Operations ● AI-Ops
  • 3. Where We Are Now - Diversity ● Lots of IT Systems ● Physical Servers, Disks, RAID ● Network Devices of Many Types ● SANs & Storage Systems
  • 4. Where We Are Now - Diversity ● Private Clouds - VMWare & OpenStack ● Public Clouds - Many ● Hybrid Clouds & Everything Else ● and more ...
  • 5. Where We Are Now - Many Monitoring Systems ● Each Piece Monitored a Different Way ● Zabbix, Prometheus, Cacti ● Networks by SNMP, Cacti, ● Commercial via BMC, etc. ● APM & Tracing, too ● Logs to ELK, or nowhere
  • 7. Where We Are Now - Resource Monitoring ● 75% of Monitoring is Resources ○ CPU, RAM, Network, ● 20% for Services ○ Performance ● 5% for URLs ● Very Little For: ○ Apps & Customer Experience ○ Internal Services & Golden Signals ○ Architecture, Topology ○ Configuration
  • 8. Challenges I ● Alarm Overload & Fatigue ● Hard to Set Thresholds ● Hard to Know What’s Wrong ● Getting Worse ○ DevOps & Faster Releases ○ Dynamic Systems ○ Microservices ○ Clouds & Cloud Services ● More Players - Ops, Developers, DevOps, SRE
  • 9. Challenges II ● Systems Not Connected ● Collection Methods Vary ● Metric Definitions Vary ● Alerts Vary ● Can’t Unify Anything ○ Metrics ○ Alerts ○ Incidents ○ Understanding
  • 11. Goals - Single System ● Single Monitoring System ● Single Metric System ● Single Source of Truth ● Anomaly Detection ● Resource Metrics ● Golden Signals ● Logs & Events ● APM & Tracing
  • 12. Goals - Rich Data & Information ● Add Context & Details ● Discovery - Metrics on All ○ Hosts & Nodes ○ Services ○ Connections ○ Dependencies ● CMDB ○ Deep Service Configuration ○ Security & Governance
  • 13. How to Get There - Re-Thinking Monitoring ● Monitoring Strategy ● What to Monitor ● How to Monitor ● How to Alert ● How to Troubleshoot ● How to Manage Incidents
  • 15. Monitoring Strategy ● Business Level - Key KPIs, drivers of below items ● User Level - What the User Experiences ● App Level - Engineers, Managers, Users think in apps ● Service Level - The Real Work & Real Problems ● Resource Level - Underlying Everything ● Security - Important Everywhere
  • 16. Monitoring Sources - Need Them All ● OS Metrics & Logs ● Service Metrics & Logs ● App Metrics & Logs ● Cloud Metrics & Logs ● APM Data & Tracing ● CMDB Configs ● Architecture & Dependencies ● Auto-Discovery of Everything
  • 17. What to Monitor ● Focus on User & App Level (Results) ○ User & Browser via RUM (Real User Monitoring) ● Modern Golden Signals ● System Health (Health & Status Endpoints) ● Hard Errors (Only Alert if User Impact) ○ Disk full, service/server dead, etc. ● Background Useful Data (no alerts)
  • 18. Golden Signals - Key to SRE ● Modern Golden Signals ○ USE - Resources (Utilization, Saturation, Errors) ○ RED - Results (Rate, Errors, Duration/Latency) ● Use Specialized Agents ● For Every Service ● At Every System Level ● Down to Disks, Networks, etc.
  • 19. Getting Data & Metrics ● Many Collection Methods ○ Agents, SNMP, SSH, Cloud APIs ○ Defined vs. Ad Hoc Metrics ● Focus on Golden Signals ○ Need Special Tools for RED ● Use Good Statistics ○ Medians, Percentiles, Sampling
  • 20. CMDB ● Key to Many Processes ● Service Discovery ● Security & Compliance ● Change Tracking ● Faster Troubleshooting ● Expert System Source ● Needs Specialized Collector
  • 21. Architecture & Topology ● Use Special Agents to Discover ● Key to Understanding ● Drives System Diagram ● Drives Understanding ● Drives Dependencies ● Often Changing
  • 22. Dependencies ● What Depends on What ● Critical for MicroServices ● Key to Alert Impact Analyses ● Key to Alert Consolidation ● Key to AIOps Root Cause
  • 23. Observability - For Developers & DevOps ● Add Metrics to Application Code ● Uses Structured Logs & Emitted Events ○ With Metrics, Latency, and Errors ● Lots of Tags ○ User, Customer, Browser, Product, OS, much more ● Canned & Ad Hoc Analytics & Exploration ● Should be Integrated into Single System ○ Correlate USE/RED Metrics with APM Data
  • 24. Visualization ● Dashboards ● Diagrams - Layered Arch, Dependencies ● Graphs - Summary, System, Service, Deeper ● Advanced - Heatmaps, Box Plots, Histograms ● Analytics - Cluster, Deviance, Cycles
  • 25. How to Alert ● Two Types of Alerts ○ Alert to Wake Someone Up/Urgent ○ Alert as Information (FYI) ● Alert on User Impact ● All the Rest is Background Info ● Smart Alert Strategy ○ Thresholds where it makes sense ○ Anomalies are Key, but noisy, too
  • 26. Anomaly Alerting ● Many Types ● Historical Checks ○ Univariate, Multi-variate, Neural Networks, Seasons ● Cluster Checks ○ Different rom Peers ● Ratio Checks ○ Metrics don’t Match ○ e.g. Requests vs. Queries
  • 27. How to Manage Incidents ● Incidents are Real Issues (ITIL) ○ Something Broke ○ Combine Many Alerts ● Categorize & Document ● Troubleshoot, Fix, Document ● Communicate ● Review & Report (Post Mortem)
  • 28. How to Troubleshoot ● Train People in Troubleshooting ● Defined Processes - Especially for Emergencies ● Use All the Data ○ Alerts & Incidents ○ Metrics ○ Logs & Events ○ Topology & Dependencies ● Root Cause Analyses Critical ● Runbooks & Team Communications
  • 29. How to Manage Problems ● Problems are Recurring Incidents (ITIL) ● Key to Reducing False Alerts & Fatigue ● Key to Improving Alert Thresholds ● Key to Improving Systems ● Ideally Dedicated Team or Resources ● Needs Monitoring System Support
  • 30. Getting To Integrated Monitoring
  • 31. Getting to Integrated Monitoring ● Big Project ● Multiple Phases ● Lots of Details ● Involves Integration ● And New Strategies
  • 32. Getting to Integrated Monitoring ● Upgrade & Integrate Monitoring ● Add Golden Signals, App/User Focus ● Add Discovery & Dependencies ● Add APM & Tracing ● Upgrade & Integrate Logging ● Add Observability in Code ● Upgraded Alerting, Incident, Problems ● Train Teams
  • 33. Upgrade & Integrate Monitoring ● Plan What to Monitor ● Plan How to Monitor ● Single Unified System ● Multi-Phase Process ● Temporary Integrations
  • 34. Add Golden Signals, App/User Focus ● Set Up App & Service Structures ● Plan Golden Signals ○ Needs Special Agents & Techniques ○ Varies a Lot by Service ● Set Baselines & Anomalies
  • 35. Add Discovery & Dependencies ● Driven by Monitoring System ● Get Data ● Build Diagrams ● Verify Dependencies
  • 36. Add APM & Tracing ● Part of Monitoring ● Initial Setup & Test ● Tune for Transactions ● Extract User/RUM Metrics ● Set Alerts as Needed ● Train Developers on Tracing ● Correlate USE/RED Metrics with APM Data
  • 38. Upgrade & Integrate Logging ● Logs are Key Part of Troubleshooting ○ OS Level - Linux & Windows ○ Services - Web, Java/Tomcat, MySQL, etc. ● Send All to Unified Platform ● Get Metrics & Analytics ● Add Alerting on Logs - Errors & Metrics
  • 39. Add Observability in Code ● Structured Logs are Best - Events ● Emitted by App Code with Tags & Dimensions ● Include Metrics - Ideally Latency & Errors ● Build Analyses Dashboards
  • 40. Upgraded Alerting, Incident, Problems ● Move to ITIL Naming & Processes ● Build Procedures ● Dedicated Problem Team
  • 41. Training Teams ● General Training ● ITIL & ITOP Training ● Golden Signals Thinking ● APM, RUM, Tracing Usage
  • 43. Automated Operations - Goals ● Automate Things ● Build Things Faster ● Change Things Faster ● Fix Things Faster ● Reduce Manual Mistakes ● Improve Consistency ● Support Large Scale Systems
  • 44. Key Components ● Clouds & Dynamic Systems ● Infrastructure as Code ● Config Management Systems ○ Ansible, Puppet, Chef, SaltStack ● Automated Troubleshooting ● Auto-Healing ● Auto-Governance
  • 45. Clouds & Dynamic Systems ● Cloud APIs Support Automation ● Clouds Can Change Themselves ○ Auto-Scaling ● CI/CD Systems Can Change Them ● Core of Lots of Automated Processes
  • 46. Infrastructure as Code ● Define Infrastructure Programmatically ● CloudFormation, Terraform, etc. ● Build & Change Continually ● Continually Updated ● Versioned & Can Diff
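The "Versioned & Can Diff" idea can be illustrated by comparing a desired definition against the current state. This is only a conceptual sketch and does not reflect how CloudFormation or Terraform work internally; the resource names and the `plan_changes` function are hypothetical.

```python
def plan_changes(desired, actual):
    """Compare desired and actual resource definitions and list the changes."""
    desired_names, actual_names = set(desired), set(actual)
    return {
        "create": sorted(desired_names - actual_names),
        "delete": sorted(actual_names - desired_names),
        "update": sorted(
            name for name in desired_names & actual_names
            if desired[name] != actual[name]
        ),
    }

# Hypothetical definitions, as they might be stored in version control.
desired = {"web": {"size": "large", "count": 4}, "db": {"size": "xlarge"}}
actual = {"web": {"size": "large", "count": 2}, "old-batch": {"size": "small"}}
print(plan_changes(desired, actual))
```

Because both sides are plain data, the same comparison works between two versions of the definition, which is what makes reviewing infrastructure changes practical.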
  • 47. Config Management Systems ● Support Auto Deployments ● Leverage Infra-as-Code ● Ansible, Puppet, Chef, SaltStack ● Build Large, Reproducible Systems
  • 48. Automated Troubleshooting ● Built on Data & Rule Systems ● Auto-Gather More Details ● Help Find Root Causes ● Use Automation System ● Advanced Use Needs AI-Ops
  • 49. Auto-Healing ● Automatically Fix Things ● Driven by Rule Engines ● Fix Things Faster ● Rapid Response 7x24 ● Use Automation System
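A rule engine for auto-healing can be as simple as an ordered list of condition and action pairs. The sketch below is an assumption about one possible shape; the alert fields and action functions are hypothetical, and the actions only describe what they would do rather than touching any system.

```python
# Hypothetical actions: they return a description instead of acting.
def restart_service(alert):
    return f"restart {alert['service']} on {alert['host']}"

def clean_temp_files(alert):
    return f"clean temp files on {alert['host']}"

# Ordered rules as (condition, action) pairs. The first match wins.
HEALING_RULES = [
    (lambda a: a["type"] == "process_down", restart_service),
    (lambda a: a["type"] == "disk_full" and a.get("mount") == "/tmp",
     clean_temp_files),
]

def auto_heal(alert, rules=HEALING_RULES):
    """Return the action taken for this alert, or None if no rule applies."""
    for matches, action in rules:
        if matches(alert):
            return action(alert)
    return None  # no rule matched: leave the alert for a human
```

Returning `None` for unmatched alerts keeps the engine conservative: anything without an explicit rule still goes to the on-call team.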
  • 50. Auto-Governance ● Continual Compliance ● Guardrails to Prevent Risks ● Systems Auto-Correct ○ Remove Bad Security
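As one illustration of systems auto-correcting bad security, a guardrail might strip firewall rules that expose administrative ports to the whole internet. The rule format and the choice of blocked ports here are assumptions for the example, not a specific cloud provider's API.

```python
def enforce_guardrails(rules, blocked_ports=(22, 3389)):
    """Split firewall rules into those kept and those removed by the guardrail.

    A rule is removed when it opens a blocked port to any address.
    """
    kept, removed = [], []
    for rule in rules:
        if rule["cidr"] == "0.0.0.0/0" and rule["port"] in blocked_ports:
            removed.append(rule)
        else:
            kept.append(rule)
    return kept, removed
```

Running a check like this continually, and logging what was removed, is one way to get both continual compliance and an audit trail.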
  • 52. AI-Ops - What is it? ● Analytical IT Operations ● Artificial Intelligence (AI) Operations ● Machine Learning for Operations
  • 53. Really, What is It? ● Synthesis of Many Sources of Information ● Better Understanding of Problems & Situation ● Predictions for the Future ● Usually with Machine Learning / Big Data
  • 55. Goals ● Impact Analysis ● Alert Reduction ● Alert Consolidation ● Root Cause Analysis ● Auto-Healing ● Prediction
  • 56. Method - Combine Everything ● Alerts ● Events ● Metrics ● History ● Topology ● Dependencies
  • 57. Many Vendors ● Most Focus on Alert Consolidation ● Some on Root Alert/Issue ● Few on Real Root Cause ● Need Deep Data for RCA
  • 59. Alert Reduction & Consolidation ● Merge Related & Duplicate Alerts ○ By Time, Dependencies, History, etc. ● Helps Ops Teams Focus ● Avoids Missing Key Things in Noise ● First Phase of any AIOps
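Merging related alerts by time can be sketched as follows. This simplified version groups only by service and time window; merging by dependencies or history, as the slide also suggests, would need additional data. The alert fields and the five-minute window are assumptions.

```python
def consolidate(alerts, window_seconds=300):
    """Group alerts on the same service that fire close together in time."""
    groups = []
    latest_group = {}  # service -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a["time"]):
        service = alert["service"]
        group = latest_group.get(service)
        if group and alert["time"] - group["last_time"] <= window_seconds:
            # Close enough to the previous alert: merge into the same group.
            group["alerts"].append(alert)
            group["last_time"] = alert["time"]
        else:
            # Too far apart, or first alert for this service: start a new group.
            group = {"service": service, "last_time": alert["time"],
                     "alerts": [alert]}
            latest_group[service] = group
            groups.append(group)
    return groups
```

Each resulting group can then be shown to the Ops team as a single item with a count, instead of one notification per raw alert.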
  • 60. Root Cause Analysis ● Multi-Mode & Method ● Expert System Uses Everything ● Use Dependencies from Discovery ● Use History with Feedback ● Sort & Prioritize ● Enrich with Additional Data Collection ○ Including Automated Actions
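Using dependencies from discovery for root cause analysis can be illustrated with a simple heuristic: among the alerting services, the likely root causes are those whose own dependencies are healthy. This is a deliberately narrow sketch that looks only at direct dependencies and ignores history and feedback; the map and names are hypothetical.

```python
def likely_root_causes(alerting, depends_on):
    """Return alerting services none of whose direct dependencies are alerting."""
    alerting = set(alerting)
    roots = []
    for service in alerting:
        deps = depends_on.get(service, [])
        if not any(dep in alerting for dep in deps):
            roots.append(service)
    return sorted(roots)

# Hypothetical dependency map: each service lists what it depends on.
depends_on = {"web": ["api"], "api": ["mysql", "cache"], "mysql": [], "cache": []}
print(likely_root_causes(["web", "api", "mysql"], depends_on))
```

When `web`, `api`, and `mysql` all alert together, the heuristic points at `mysql`, since the other two can be explained by a failing dependency.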
  • 61. Manual-Healing ● One-Click Fixing ● Procedure/Runbooks ● For high-risk, complex solutions ● Automatic Real-Time ● Uses Automation Platform
  • 62. Auto-Healing ● Automatic Real-Time Fixing ● Uses Automation Platform ● Helps 7x24, Reducing On-Call ● Responds in Seconds, not Hours
  • 63. Prediction ● See Problems in Advance ● Solve Problems in Advance ● Capacity Planning - Resources ● Elevated Errors & Pending Failures
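For capacity planning, one of the simplest predictions is a straight-line trend on resource usage. The sketch below fits a least-squares line to daily samples and estimates when a limit would be reached; it assumes one sample per day and roughly linear growth, which real workloads often violate.

```python
def days_until_full(daily_usage_pct, limit_pct=100.0):
    """Estimate days from the last sample until usage reaches `limit_pct`.

    Returns None when there is too little data or usage is not growing.
    """
    n = len(daily_usage_pct)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(daily_usage_pct) / n
    # Least-squares slope and intercept over day indexes 0..n-1.
    slope = sum(
        (x - x_mean) * (y - y_mean) for x, y in enumerate(daily_usage_pct)
    ) / sum((x - x_mean) ** 2 for x in range(n))
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    day_full = (limit_pct - intercept) / slope
    return max(0.0, day_full - (n - 1))
```

A disk growing two percentage points per day from 50% gives an estimate of about three weeks, which is the kind of lead time that lets a team solve the problem in advance.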
  • 64. Summary ● IT Operations is Manual & Messy ● Monitoring is Diverse & Distributed ● Automation helps take Action ● AI-Ops helps Fix Stuff Faster