Intelligent Monitoring

        Denis A. Vieira Jr.
       Ricardo Clemente
Intelligent Monitoring

Summary:


 Motivation
 Where are we?
 Where are we going?
 Action Plan
 Event Correlation
Intelligent Monitoring


Motivation:

    Only point-in-time monitoring available

    Decrease time to repair incidents

    Proactive monitoring

    Realistic view from live environment
Intelligent Monitoring


Motivation:

    Learn (identify patterns)

    Automation

    Store historical data with no loss

    Improve credibility and Situational Awareness
Intelligent Monitoring


 Where are we?:

    Lots of information (1200 servers with more than 14000 monitors)
     – more than 40000 graphs being plotted

    Lots of tools for monitoring running (SME, IPMonitor, Cricket,
     SiteScope, SiteSeer, Logs)

    Difficulties with specific customizations, performance and cost

    Alarms have little credibility (too many emails), though this is much
     better than before.
Intelligent Monitoring


Where are we going?:

    Use of events. E.g.: Appenders for log frameworks to integrate
     information from applications

    Knowledge to anticipate undesired situations

    Unified interface for monitoring

    Root cause detection
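
A minimal sketch of the "appender" idea, assuming Python's standard logging module in place of a Java log framework. The `publish` callable is hypothetical and stands in for whatever client sends messages to the event bus; the event field names are also assumptions.

```python
import json
import logging


class EventHandler(logging.Handler):
    """Logging handler that turns each log record into a monitoring event.

    `publish` is a hypothetical callable standing in for an event bus
    client; it receives one JSON string per log record.
    """

    def __init__(self, publish, event_type="Log"):
        super().__init__()
        self.publish = publish
        self.event_type = event_type

    def emit(self, record):
        # Field names are illustrative, not a format defined by the slides.
        event = {
            "eventtype": self.event_type,
            "logger": record.name,
            "level": record.levelname,
            "text": record.getMessage(),
            "timestamp": record.created,
        }
        self.publish(json.dumps(event))


# Usage: collect events in a list instead of sending them to a real bus.
sent = []
log = logging.getLogger("checkout")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(EventHandler(sent.append))
log.error("java.lang.OutOfMemoryError: heap space")
```

The application keeps logging as usual; only the handler knows that records also become events.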
Intelligent Monitoring


Action Plan:

    Unify the monitoring tools with Nagios (scalability and integration)

    Integrate Nagios with correlation system using NEB (Nagios Event
     Broker)
    available at:
         code.google.com/p/neb2activemq

    Map events and systems to correlate
   (a manual and analytical task)
Intelligent Monitoring

Summary:


 Motivation
 Where are we?
 Where are we going?
 Action Plan
 Event Correlation
    Overview and system architecture
    Event Bus
    Correlation technique
    Correlation engine
    Visualization
    Machine Learning
    Project
Overview and system architecture

 Modular and event-driven architecture



             COLLECTOR            CORRELATION ENGINE

                        EVENT BUS

          MACHINE LEARNING          VISUALIZATION
Overview and system architecture
What is the system architecture?

 Single bus for message exchange
 Modules are separate operating system processes and can run on
  different machines
 Modules can publish / subscribe to queues / topics on the bus

Why an event-driven architecture?

 Loosely coupled and distributed
    Less intrusive for monitored systems
    Modules are independent
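
A toy in-process sketch of the publish / subscribe idea, not ActiveMQ itself. It assumes the usual broker semantics: a topic broadcasts each event to every subscriber, while a queue hands each event to one consumer (round-robin here). The class and channel names are hypothetical.

```python
from collections import defaultdict


class EventBus:
    """Toy bus: topics broadcast, queues deliver each event to one consumer."""

    def __init__(self):
        self._topic_subs = defaultdict(list)
        self._queue_subs = defaultdict(list)
        self._queue_next = defaultdict(int)  # round-robin position per queue

    def subscribe(self, channel_type, name, callback):
        if channel_type == "topic":
            self._topic_subs[name].append(callback)
        elif channel_type == "queue":
            self._queue_subs[name].append(callback)
        else:
            raise ValueError("channel_type must be 'queue' or 'topic'")

    def publish(self, channel_type, name, event):
        if channel_type == "topic":
            for callback in self._topic_subs[name]:
                callback(event)
        elif channel_type == "queue":
            consumers = self._queue_subs[name]
            if consumers:  # a real broker would store the event instead
                index = self._queue_next[name] % len(consumers)
                consumers[index](event)
                self._queue_next[name] = index + 1
        else:
            raise ValueError("channel_type must be 'queue' or 'topic'")


# Usage: two engine instances share a queue, two modules listen to a topic.
bus = EventBus()
engine_a, engine_b, dashboard, audit = [], [], [], []
bus.subscribe("queue", "events", engine_a.append)
bus.subscribe("queue", "events", engine_b.append)
bus.subscribe("topic", "alerts", dashboard.append)
bus.subscribe("topic", "alerts", audit.append)
for i in range(4):
    bus.publish("queue", "events", {"id": i})
bus.publish("topic", "alerts", {"alert": "db"})
```

Publishers never reference the consumers directly, which is what keeps the modules independent.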
Event bus
Open source project

We chose Apache ActiveMQ:
 Stable
 Good performance
 Active community
 Connectivity
     JMS
     STOMP
     REST
     XMPP (...)
Event Bus
Message format

 JSON (not XML)
     Simplicity
 Structure
     Header: channel type (queue or topic) and event type
     Body: data



 $ curl -d 'type=queue&body={"idle":70,"sys":20,"usr":10,"host":"ws122"}&eventtype=CPU' \
     http://barramento/message/events
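
A minimal sketch of building the same message programmatically, assuming the body is JSON and that the header fields travel as form parameters named `type` and `eventtype`, as in the curl call. The helper `build_message` is hypothetical; sending the payload is left out.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_message(channel_type, event_type, data):
    """Return a form-encoded payload: header fields plus a JSON body."""
    if channel_type not in ("queue", "topic"):
        raise ValueError("channel_type must be 'queue' or 'topic'")
    return urlencode({
        "type": channel_type,      # header: channel type
        "eventtype": event_type,   # header: event type
        "body": json.dumps(data),  # body: the event data as JSON
    })


payload = build_message(
    "queue", "CPU", {"idle": 70, "sys": 20, "usr": 10, "host": "ws122"}
)

# Round trip: decode the payload the way a receiver might.
fields = parse_qs(payload)
body = json.loads(fields["body"][0])
```

Unlike the raw curl call, `urlencode` percent-encodes the JSON, so a body containing `&` or `=` is still parsed correctly.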
Correlation Technique

CEP (Complex Event Processing)
 Technology that processes multiple events in real time in order to
  identify the meaningful ones
 Based on rules or queries ("SQL-like")
 Queries can be created at run time

History
 In 1995, Professor David Luckham of Stanford, working on the Rapide
  project, coined the term CEP
 Database research topic: Data Stream Management Systems (DSMS)
Correlation technique
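                 "upside-down database"

[Figure: two panels. Traditional database: a query is processed in memory against persistent relations (stored data) and returns an answer. Data stream system: continuous queries are held in memory, the data stream flows through them, and they produce continuous answers.]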
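
A minimal sketch of the "upside-down" idea: the query is registered once and stays in memory, and every arriving event produces a new answer. The class `ContinuousQuery`, its parameters, and the sample field are hypothetical; a real engine such as Esper does far more.

```python
from collections import deque


class ContinuousQuery:
    """Standing query: average of `field` over the last `window` seconds."""

    def __init__(self, field, window, on_answer):
        self.field = field
        self.window = window
        self.on_answer = on_answer
        self._events = deque()  # (timestamp, value) pairs inside the window
        self._total = 0.0

    def insert(self, timestamp, event):
        value = event[self.field]
        self._events.append((timestamp, value))
        self._total += value
        # Expire events that fell out of the time window. The event just
        # appended is always inside it, so the deque never becomes empty.
        while self._events[0][0] <= timestamp - self.window:
            _, old = self._events.popleft()
            self._total -= old
        self.on_answer(timestamp, self._total / len(self._events))


# Usage: the query is fixed, the data streams through it.
answers = []
query = ContinuousQuery("idle", window=60,
                        on_answer=lambda t, avg: answers.append((t, avg)))
query.insert(0, {"idle": 10})
query.insert(30, {"idle": 20})
query.insert(90, {"idle": 60})
```

Only the events inside the window are kept, which is why the data can be a stream that is never stored in full.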
Correlation Technique
 Market
 Trend (buzz)
  The CEP market is estimated at 460 million dollars by 2010 (source: IEEE
   Computer Society – April 2009)

 Useful wherever there are data streams and a need to extract
   information from them in real time
  Financial markets
  Logistics processes (RFID)
  Airport control
  ICUs
  Datacenters
Correlation Technique
 Big Players
Correlation Technique
 Open Source Players
 Academic projects:
  STREAM – Stanford – 2003 (officially deprecated)
  TelegraphCQ – Berkeley – 2003
      Based on PostgreSQL 7.3.2
      No activity
  Cayuga – Cornell

 From the industry:
 Esper, a Codehaus project, complete in terms of features
  Compact and flexible syntax
  Excellent documentation
  Performance
  Our choice!
Correlation Engine
 Application




                     If sessions rose 10% in the
                     last 3 min, the servers' average
                     CPU did not rise 5%, and MySQL
                     slow queries are above 10, then
                     a database bottleneck is
                     causing users to queue
Correlation Engine
Application
[Figure: three timelines. Vip session and Server cpu_usr are each compared between t – 3 min and t; Mysql slow_query is read at t.]
Correlation Engine
 Application

 // Compare the last minute with the last 3 minutes for sessions and CPU
 SELECT Server.host, Server.cpu_usr, Server_PAST.cpu_usr, Vip.session,
   Vip_PAST.session, Mysql.slow_query
   FROM
          Server.win:time(1 min) as Server,
          Server.win:ext_timed(current_timestamp(), 3 min) as Server_PAST,
          Vip.win:time(1 min) as Vip,
          Vip.win:ext_timed(current_timestamp(), 3 min) as Vip_PAST,
          Mysql.win:time(1 min) as Mysql
   HAVING
          Vip.session > Vip_PAST.session * 1.10 AND                    // sessions rose more than 10%
          avg(Server.cpu_usr) < avg(Server_PAST.cpu_usr) * 1.05 AND    // average CPU rose less than 5%
          Mysql.slow_query > 10                                        // more than 10 slow queries
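
The three conditions of the rule can also be read as a plain predicate. This is a sketch of the logic only, assuming the window contents have already been collected; the function `database_bottleneck` and its parameters are hypothetical names.

```python
from statistics import mean


def database_bottleneck(sessions_now, sessions_past,
                        cpu_now, cpu_past, slow_queries):
    """True when sessions grew but CPU did not, and slow queries pile up.

    sessions_now / sessions_past: session counts at t and at t - 3 min.
    cpu_now / cpu_past: cpu_usr samples for the recent and past windows.
    slow_queries: current MySQL slow query count.
    """
    sessions_up = sessions_now > sessions_past * 1.10      # rose more than 10%
    cpu_flat = mean(cpu_now) < mean(cpu_past) * 1.05       # rose less than 5%
    return sessions_up and cpu_flat and slow_queries > 10


# Usage: more users, flat CPU, many slow queries.
alert = database_bottleneck(1200, 1000, [40, 42], [40, 41], 15)
no_growth = database_bottleneck(1050, 1000, [40, 42], [40, 41], 15)
cpu_busy = database_bottleneck(1200, 1000, [60, 62], [40, 41], 15)
```

The intuition: if the servers were doing the extra work, CPU would rise with the sessions; when it does not, requests are waiting on something else.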
Correlation Engine
 Identifying an outlier
     select host, free, avg(free)
     from Memory.win:time(240 sec) group by host
     having free < avg(free)

 Event sequence
    select * from
       pattern [every Memory(free < 10) ->
            (timer:interval(60 sec) and Log(text like '%OutOfMemory%'))]

 Scheduling and extensions
     select idle from pattern [every timer:at(*, [16:22], *, [0,3], *)].win:time(30 sec),
        CPU.win:time(30) where idle < 30 AND Filter.isInNode(id,
        "Sports.BigFarm")
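
A rough Python reading of the first query (the outlier check), assuming one sliding time window per host. It approximates the EPL semantics and is not how Esper evaluates the statement; the class name is hypothetical.

```python
from collections import defaultdict, deque


class MemoryOutlierDetector:
    """Flags a sample whose free memory is below its host's window average."""

    def __init__(self, window=240):
        self.window = window                 # window length in seconds
        self._samples = defaultdict(deque)   # host -> (timestamp, free) pairs

    def observe(self, timestamp, host, free):
        samples = self._samples[host]
        samples.append((timestamp, free))
        # Drop samples older than the window; the new sample always stays.
        while samples[0][0] <= timestamp - self.window:
            samples.popleft()
        average = sum(value for _, value in samples) / len(samples)
        return free < average


# Usage: ws122 drops from 100 to 40 within the window.
detector = MemoryOutlierDetector(window=240)
first = detector.observe(0, "ws122", 100)
second = detector.observe(60, "ws122", 100)
drop = detector.observe(120, "ws122", 40)
other_host = detector.observe(120, "ws123", 40)
after_expiry = detector.observe(400, "ws122", 40)
```

Grouping by host matters: 40 is an outlier for ws122 but normal for a host that has only ever reported 40.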
Correlation engine
 Esper performance

      Item                     Specification
      Esper server hardware    2 x Intel Xeon 5130 2GHz (4 cores total), 16GB RAM
      VM config                -Xms2g -Xmx2g -Xns128m -Xgc:gencon

  Query: select '$' as ticker from Market(ticker='$').win:length(1000).stat:weighted_avg('price', 'volume') output last every 30 seconds

  # queries    evt/s      Latency           Average latency    Note
  1000         519 728    99.66% < 10us     2.8us              CPU at 85%, 70 Mbit/s



Source: Esper Performance - http://docs.codehaus.org/display/ESPER/Esper+performance
Correlation engine
 Processing inside the correlation engine
Visualization – Console
Querying the live environment
Visualization – Troubleshooting
Anticipating and solving incidents faster
Visualization – Dashboard
Consolidated view of the environment
What about unseen problems?
Machine Learning

We chose unsupervised, incremental algorithms

Incremental PCA
 Transforms a number of possibly correlated variables into a smaller
  number of uncorrelated ones, the principal components
 A change in the principal components means a broken correlation, or
  anomaly
 Can be used for data compression

Inspired by a paper from Carnegie Mellon University (Hoke et al. 2006)
Source: http://www.pdl.cmu.edu/PDL-FTP/SelfStar/osr_sub.pdf


The implementation had two main challenges: measurements with missing values
  and different scales
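
A minimal sketch of the idea using Oja's rule, a simple incremental estimator of the first principal component. This is a swapped-in textbook technique, not the algorithm from the cited paper nor the implementation described here, and it ignores missing values and scaling. The two signals and the learning rate are made up for illustration.

```python
import math
import random


def oja_update(w, x, rate=0.05):
    """One incremental step of Oja's rule toward the first principal component."""
    y = sum(wi * xi for wi, xi in zip(w, x))  # projection of x on w
    return [wi + rate * y * (xi - y * wi) for wi, xi in zip(w, x)]


def residual(w, x):
    """Distance between x and its reconstruction from the component w."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return math.sqrt(sum((xi - y * wi) ** 2 for wi, xi in zip(w, x)))


# Two correlated signals: the second is always twice the first.
rng = random.Random(0)
w = [1.0, 0.0]
for _ in range(2000):
    t = rng.uniform(-1.0, 1.0)
    w = oja_update(w, [t, 2.0 * t])

normal_error = residual(w, [1.0, 2.0])    # follows the learned correlation
anomaly_error = residual(w, [1.0, -2.0])  # breaks the correlation
```

A sample that respects the learned correlation reconstructs almost perfectly from one component; a sample that breaks it leaves a large residual, which is the anomaly signal.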
Machine Learning

60 input signals
Machine Learning

Summarized in 1 principal component + generation matrix
Machine Learning


[Figure: second principal component, with the sensitivity threshold and three anomalies marked.]
Project

Status

 All functionality developed

 Algorithms being validated through tests with
  RRDs and meetings with the operations team

 Performance tests ongoing

 System running in the live environment with reduced scope
Project at Globo.com – Next challenges


Scale
    Event "sharding"
    Rule balancing
    Cache
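
One possible reading of event "sharding", sketched under the assumption that events are partitioned by a key such as the host name so that all events for one host reach the same engine instance. The function `shard_for` and the choice of key are hypothetical.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index using a hash that is stable across runs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Usage: route each event by its host.
events = [
    {"host": "ws122", "idle": 70},
    {"host": "ws123", "idle": 55},
    {"host": "ws122", "idle": 65},
]
shards = {index: [] for index in range(4)}
for event in events:
    shards[shard_for(event["host"], 4)].append(event)
```

A cryptographic digest is used instead of Python's built-in `hash` because the latter is randomized per process for strings, which would route the same host differently on each module.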

Optimize the algorithm
    Adaptive control of memory and sensitivity parameters
    Add a supervised layer
    Other algorithms to cooperate
Intelligent Monitoring

      Final considerations
References




       http://delicious.com/fisl10
Questions

 Contacts
   Denis A. Vieira Jr
   denis@corp.globo.com (www.globo.com)
   Ricardo Clemente
   ricardo@intelie.com.br (www.intelie.com.br)

 Globo.com stand
    This afternoon

 Raise your hand!

Mais conteúdo relacionado

Destaque

INTELIE - Inteligência em Operação
INTELIE - Inteligência em OperaçãoINTELIE - Inteligência em Operação
INTELIE - Inteligência em OperaçãoDC-DinsmoreCompass
 
Security Events correlation with ESPER
Security Events correlation with ESPERSecurity Events correlation with ESPER
Security Events correlation with ESPERNikolay Klendar
 
Gartner ITxpo 2015 - 3 casos de operações digitais mais inteligentes usando r...
Gartner ITxpo 2015 - 3 casos de operações digitais mais inteligentes usando r...Gartner ITxpo 2015 - 3 casos de operações digitais mais inteligentes usando r...
Gartner ITxpo 2015 - 3 casos de operações digitais mais inteligentes usando r...Intelie
 
Machine Learning & Data Lake for IoT scenarios on AWS
Machine Learning & Data Lake for IoT scenarios on AWSMachine Learning & Data Lake for IoT scenarios on AWS
Machine Learning & Data Lake for IoT scenarios on AWSAmazon Web Services
 
Complex Event Processing with Esper
Complex Event Processing with EsperComplex Event Processing with Esper
Complex Event Processing with EsperAntónio Alegria
 

Destaque (6)

Intelie BPMS
Intelie BPMSIntelie BPMS
Intelie BPMS
 
INTELIE - Inteligência em Operação
INTELIE - Inteligência em OperaçãoINTELIE - Inteligência em Operação
INTELIE - Inteligência em Operação
 
Security Events correlation with ESPER
Security Events correlation with ESPERSecurity Events correlation with ESPER
Security Events correlation with ESPER
 
Gartner ITxpo 2015 - 3 casos de operações digitais mais inteligentes usando r...
Gartner ITxpo 2015 - 3 casos de operações digitais mais inteligentes usando r...Gartner ITxpo 2015 - 3 casos de operações digitais mais inteligentes usando r...
Gartner ITxpo 2015 - 3 casos de operações digitais mais inteligentes usando r...
 
Machine Learning & Data Lake for IoT scenarios on AWS
Machine Learning & Data Lake for IoT scenarios on AWSMachine Learning & Data Lake for IoT scenarios on AWS
Machine Learning & Data Lake for IoT scenarios on AWS
 
Complex Event Processing with Esper
Complex Event Processing with EsperComplex Event Processing with Esper
Complex Event Processing with Esper
 

Semelhante a Intelligent Monitoring

Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...Flink Forward
 
IMC Summit 2016 Breakout - Matt Coventon - Test Driving Streaming and CEP on ...
IMC Summit 2016 Breakout - Matt Coventon - Test Driving Streaming and CEP on ...IMC Summit 2016 Breakout - Matt Coventon - Test Driving Streaming and CEP on ...
IMC Summit 2016 Breakout - Matt Coventon - Test Driving Streaming and CEP on ...In-Memory Computing Summit
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesSigmoid
 
Advanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Advanced Open IoT Platform for Prevention and Early Detection of Forest FiresAdvanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Advanced Open IoT Platform for Prevention and Early Detection of Forest FiresIvo Andreev
 
Webinar: Cutting Time, Complexity and Cost from Data Science to Production
Webinar: Cutting Time, Complexity and Cost from Data Science to ProductionWebinar: Cutting Time, Complexity and Cost from Data Science to Production
Webinar: Cutting Time, Complexity and Cost from Data Science to Productioniguazio
 
Embedded Intro India05
Embedded Intro India05Embedded Intro India05
Embedded Intro India05Rajesh Gupta
 
Complex Event Processing - A brief overview
Complex Event Processing - A brief overviewComplex Event Processing - A brief overview
Complex Event Processing - A brief overviewIstván Dávid
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaAlluxio, Inc.
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemC4Media
 
Observability foundations in dynamically evolving architectures
Observability foundations in dynamically evolving architecturesObservability foundations in dynamically evolving architectures
Observability foundations in dynamically evolving architecturesBoyan Dimitrov
 
MongoDB for Time Series Data
MongoDB for Time Series DataMongoDB for Time Series Data
MongoDB for Time Series DataMongoDB
 
spChains: A Declarative Framework for Data Stream Processing in Pervasive App...
spChains: A Declarative Framework for Data Stream Processing in Pervasive App...spChains: A Declarative Framework for Data Stream Processing in Pervasive App...
spChains: A Declarative Framework for Data Stream Processing in Pervasive App...Fulvio Corno
 
Chapter 1 computer abstractions and technology
Chapter 1 computer abstractions and technologyChapter 1 computer abstractions and technology
Chapter 1 computer abstractions and technologyBATMUNHMUNHZAYA
 
Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...
Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...
Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...Priyanka Aash
 
Adventures in Observability - Clickhouse and Instana
Adventures in Observability - Clickhouse and InstanaAdventures in Observability - Clickhouse and Instana
Adventures in Observability - Clickhouse and InstanaMarcel Birkner
 
Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
 Adventures in Observability: How in-house ClickHouse deployment enabled Inst... Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
Adventures in Observability: How in-house ClickHouse deployment enabled Inst...Altinity Ltd
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 

Semelhante a Intelligent Monitoring (20)

Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
 
IMC Summit 2016 Breakout - Matt Coventon - Test Driving Streaming and CEP on ...
IMC Summit 2016 Breakout - Matt Coventon - Test Driving Streaming and CEP on ...IMC Summit 2016 Breakout - Matt Coventon - Test Driving Streaming and CEP on ...
IMC Summit 2016 Breakout - Matt Coventon - Test Driving Streaming and CEP on ...
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time Series
 
Advanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Advanced Open IoT Platform for Prevention and Early Detection of Forest FiresAdvanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Advanced Open IoT Platform for Prevention and Early Detection of Forest Fires
 
Webinar: Cutting Time, Complexity and Cost from Data Science to Production
Webinar: Cutting Time, Complexity and Cost from Data Science to ProductionWebinar: Cutting Time, Complexity and Cost from Data Science to Production
Webinar: Cutting Time, Complexity and Cost from Data Science to Production
 
Embedded Intro India05
Embedded Intro India05Embedded Intro India05
Embedded Intro India05
 
Complex Event Processing - A brief overview
Complex Event Processing - A brief overviewComplex Event Processing - A brief overview
Complex Event Processing - A brief overview
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at Helixa
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing System
 
Observability foundations in dynamically evolving architectures
Observability foundations in dynamically evolving architecturesObservability foundations in dynamically evolving architectures
Observability foundations in dynamically evolving architectures
 
Linux capacity planning
Linux capacity planningLinux capacity planning
Linux capacity planning
 
MongoDB for Time Series Data
MongoDB for Time Series DataMongoDB for Time Series Data
MongoDB for Time Series Data
 
Stream Processing Overview
Stream Processing OverviewStream Processing Overview
Stream Processing Overview
 
spChains: A Declarative Framework for Data Stream Processing in Pervasive App...
spChains: A Declarative Framework for Data Stream Processing in Pervasive App...spChains: A Declarative Framework for Data Stream Processing in Pervasive App...
spChains: A Declarative Framework for Data Stream Processing in Pervasive App...
 
Chapter 1 computer abstractions and technology
Chapter 1 computer abstractions and technologyChapter 1 computer abstractions and technology
Chapter 1 computer abstractions and technology
 
Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...
Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...
Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...
 
Adventures in Observability - Clickhouse and Instana
Adventures in Observability - Clickhouse and InstanaAdventures in Observability - Clickhouse and Instana
Adventures in Observability - Clickhouse and Instana
 
Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
 Adventures in Observability: How in-house ClickHouse deployment enabled Inst... Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 

Último

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Último (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Intelligent Monitoring

  • 1. Intelligent Monitoring Denis A. Vieira Jr. Ricardo Clemente
  • 2. Intelligent Monitoring Summary:  Motivation  Where are we?  Where are we going?  Action Plan  Event Correlation
  • 3. Intelligent Monitoring Summary:  Motivation  Where are we?  Where are we going?  Action Plan  Event Correlation
  • 4. Intelligent Monitoring Motivation:  Only ponctual monitoring available  Decrease time to repair incidents  Proactive monitoring  Realistic view from live environment
  • 5. Intelligent Monitoring Motivation:  Learn (identify patterns )  Automation  Store historical data with no loss  Improve credibility and Situational Awareness
  • 6. Intelligent Monitoring Summary:  Motivation  Where are we?  Where are we going?  Action Plan  Event Correlation
  • 7. Intelligent Monitoring Where are we?:  Lots of information (1200 servers with more than 14000 monitors) – more than 40000 graphs being plot  Lots of tools for monitoring running (SME, IPMonitor, Cricket, SiteScope, SiteSeer, Logs)  Difficulties with specific customizations, performance and cost  No credibility (lots of emails) with alarms. But much better than before.
  • 8. Intelligent Monitoring Summary:  Motivation  Where are we?  Where are we going?  Action Plan  Event Correlation
  • 9. Intelligent Monitoring Were are we going:  Use of events. E.g.: Appenders for log frameworks to integrate information from applications  Knowledge to antecipate undesired situations  Unified interface for monitoring  Root cause detection
  • 10. Intelligent Monitoring Summary:  Motivation  Where are we?  Where are we going?  Action Plan  Event Correlation
  • 11. Intelligent Monitoring Action Plan:  Unify the monitoring tools with Nagios (scalability and integration)  Integrate Nagios with the correlation system using NEB (Nagios Event Broker), available at: code.google.com/p/neb2activemq  Map events and systems to correlate (manual and analytic task)
  • 12. Intelligent Monitoring Summary:  Motivation  Where are we?  Where are we going?  Action Plan  Event Correlation  Overview and system architecture  Event Bus  Correlation technique  Correlation engine  Visualization  Machine Learning  Project
  • 13. Overview and system architecture  Modular and event-driven architecture  [Diagram: Collector, Correlation Engine, Machine Learning and Visualization modules around the Event Bus]
  • 14. Overview and system architecture What is the system architecture?  Single bus for message exchange  Modules are separate operating system processes and can run on different machines  Modules can publish / subscribe to queues / topics on the bus Why an event-driven architecture?  Loosely coupled and distributed  Less intrusive for monitored systems  Modules are independent
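
The publish / subscribe idea on this slide can be sketched in a few lines of Python. This is only an in-process stand-in for the bus, written for illustration: the `EventBus` class and the topic name `events.cpu` are hypothetical, and in the system described here the modules are separate processes talking to a message broker.

```python
from collections import defaultdict


class EventBus:
    """In-process stand-in for a message bus (illustrative only)."""

    def __init__(self):
        # topic name -> list of subscriber callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Deliver the event to every subscriber of the topic.
        for callback in self._subscribers[topic]:
            callback(event)


bus = EventBus()
seen_by_correlation = []
seen_by_visualization = []

# Two independent modules subscribe to the same (hypothetical) topic.
bus.subscribe("events.cpu", seen_by_correlation.append)
bus.subscribe("events.cpu", seen_by_visualization.append)

# A collector publishes without knowing who consumes the event.
bus.publish("events.cpu", {"host": "ws122", "idle": 70})
```

The publisher never references its consumers, which is the loose coupling the slide argues for: a new module can be added by subscribing, without touching the collector.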
  • 15. Event bus  Open source project chosen: Apache ActiveMQ  Stable  Performance  Active community  Connectivity: JMS, STOMP, REST, XMPP (...)
  • 16. Event Bus Message format  JSON (not XML)  Simplicity  Structure  Header: channel type (queue or topic) and event type  Body: data  $ curl -d 'type=queue&body={"idle": 70, "sys": 20, "usr": 10, "host": "ws122"}&eventtype=CPU' http://barramento/message/events
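
As a rough sketch of the message structure on this slide, the Python below builds the same form fields as the curl example (`type`, `eventtype`, `body`) and encodes the body as JSON. The helper name `build_event_message` is hypothetical, and nothing is sent over the network; how the real endpoint validates these fields is not stated in the slides.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_event_message(channel_type, event_type, data):
    """Return a form-encoded payload: header fields plus a JSON body."""
    if channel_type not in ("queue", "topic"):
        raise ValueError("channel type must be 'queue' or 'topic'")
    return urlencode({
        "type": channel_type,       # header: queue or topic
        "eventtype": event_type,    # header: event type, e.g. CPU
        "body": json.dumps(data),   # body: the measurement itself
    })


payload = build_event_message(
    "queue", "CPU", {"idle": 70, "sys": 20, "usr": 10, "host": "ws122"}
)

# Decode it again to confirm the body survives as valid JSON.
fields = parse_qs(payload)
decoded_body = json.loads(fields["body"][0])
```

Letting `urlencode` escape the JSON avoids the shell quoting problems that come with writing the body by hand.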
  • 17. Correlation Technique CEP (Complex Event Processing)  Technology that processes multiple events in real time in order to identify meaningful events  Based on rules or queries (“SQL-like”)  Queries can be created at run time History  In 1995, Professor David Luckham of Stanford, working on the Rapide project, coined the term CEP  Database research topic: Data Stream Management Systems (DSMS)
  • 18. Correlation technique “Upside-down database”  [Diagram contrasting two models. Database: a query is processed against persistent relations and returns an answer. Data stream: the data stream is processed against queries held in memory and returns a continuous answer]
  • 19. Correlation Technique Market trend (buzz)  The CEP market is estimated at 460 million dollars by 2010 (source: IEEE Computer Society, April 2009) Useful wherever there are data streams and a need to extract information from them in real time  Financial markets  Logistics processes (RFID)  Airport control  ICUs  Datacenters
  • 21. Correlation Technique Open Source Players Academic projects:  STREAM - Stanford - 2003 (officially deprecated)  TelegraphCQ - Berkeley - 2003  Based on PostgreSQL 7.3.2  No activity  Cayuga - Cornell From the industry: Esper, a Codehaus project, complete in terms of features  Compact and flexible syntax  Excellent documentation  Performance  Our choice!
  • 22. Correlation Engine Application  If sessions rose more than 10% in the last 3 min, the servers' average CPU did not rise 5%, and MySQL slow queries are above 10, then there is a database bottleneck causing users to queue
  • 23. Correlation Engine Application  [Timeline diagram: Vip session and Server cpu_usr are each compared between t - 3 min and t; Mysql slow_query is evaluated at t]
  • 24. Correlation Engine Application SELECT Server.host, Server.cpu_usr, Server_PAST.cpu_usr, Vip.session, Vip_PAST.session, Mysql.slow_query FROM Server.win:time(1 min) as Server, Server.win:ext_timed(current_timestamp(), 3 min) as Server_PAST, Vip.win:time(1 min) as Vip, Vip.win:ext_timed(current_timestamp(), 3 min) as Vip_PAST, Mysql.win:time(1 min) as Mysql HAVING Vip.session > Vip_PAST.session * 1.10 AND avg(Server.cpu_usr) < avg(Server_PAST.cpu_usr) * 1.05 AND Mysql.slow_query > 10
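
To make the three conditions of this rule easier to follow, here is a plain Python sketch of the same comparison, assuming the windowing has already been done and the function simply receives the current and 3-minutes-ago values. The function name and its parameters are hypothetical; this is not how the correlation engine evaluates the query internally.

```python
def database_bottleneck(sessions_now, sessions_past,
                        cpu_now, cpu_past, slow_queries):
    """Evaluate the rule on already-windowed values.

    sessions_now / sessions_past: VIP session counts now and 3 min ago
    cpu_now / cpu_past: per-server cpu_usr samples for the two windows
    slow_queries: current MySQL slow query count
    """
    avg_now = sum(cpu_now) / len(cpu_now)
    avg_past = sum(cpu_past) / len(cpu_past)

    sessions_rose = sessions_now > sessions_past * 1.10  # more than +10%
    cpu_flat = avg_now < avg_past * 1.05                 # less than +5%
    db_slow = slow_queries > 10

    return sessions_rose and cpu_flat and db_slow


# Sessions up 20%, CPU roughly flat, 15 slow queries: the rule fires.
fires = database_bottleneck(1200, 1000, [40, 42], [40, 41], 15)

# Same traffic, but CPU rose with it: the servers are simply busy.
busy = database_bottleneck(1200, 1000, [60, 62], [40, 41], 15)
```

The point of combining the three signals is that each one alone is ambiguous; only sessions growing while CPU stays flat and slow queries pile up points at the database.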
  • 25. Correlation Engine Identifying an outlier: select host, free, avg(free) from Memory.win:time(240 sec) group by host having free < avg(free)  Event sequence: select * from pattern [every Memory(free < 10) -> (timer:interval(60 sec) and Log(text like '%OutOfMemory%'))]  Schedule and extensions: select idle from pattern [every timer:at(*, [16:22], *, [0,3], *)].win:time(30 sec), CPU.win:time(30) where idle < 30 AND Filter.isInNode(id, "Sports.BigFarm")
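
The outlier query above keeps a 240-second window per host and reports readings below that window's average. A minimal Python sketch of the same idea follows, assuming events arrive in timestamp order; the class name `BelowAverageDetector` is hypothetical and the code is a stand-in for illustration, not the engine's implementation.

```python
from collections import defaultdict, deque


class BelowAverageDetector:
    """Per-host sliding time window; flags readings below the window mean."""

    def __init__(self, window_seconds=240):
        self.window_seconds = window_seconds
        # host -> deque of (timestamp, free) pairs, oldest first
        self._windows = defaultdict(deque)

    def observe(self, host, timestamp, free):
        window = self._windows[host]
        window.append((timestamp, free))
        # Evict readings that have fallen out of the time window.
        # The reading just added is always kept, so the window is never empty.
        while window[0][0] <= timestamp - self.window_seconds:
            window.popleft()
        average = sum(value for _, value in window) / len(window)
        return free < average


detector = BelowAverageDetector()
readings = [(0, 500), (60, 500), (120, 500), (180, 100)]
flags = [detector.observe("ws122", t, free) for t, free in readings]

# Windows are kept per host, so another host starts from scratch.
other_host_flag = detector.observe("ws123", 180, 100)
```

As in the query, any reading below the window average is reported, so in practice a margin or a minimum window size would probably be needed to avoid noisy alarms.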
  • 26. Correlation Engine Esper performance  Hardware (Esper server): 2 x Intel Xeon 5130 2GHz (4 cores total), 16GB RAM  VM config: -Xms2g -Xmx2g -Xns128m -Xgc:gencon  Query: select '$' as ticker from Market(ticker='$').win:length(1000).stat:weighted_avg('price', 'volume') output last every 30 seconds  Number of queries: 1000  Events/s: 519 728  Latency: 99.66% < 10us  Average latency: 2.8us  Note: CPU at 85%, 70 Mbit/s  Source: Esper Performance - http://docs.codehaus.org/display/ESPER/Esper+performance
  • 27. Correlation engine Process inside the correlation engine
  • 28. Visualization – Console Querying the live environment
  • 29. Visualization – Troubleshooting Anticipating and solving incidents more quickly
  • 31. What about unseen problems?
  • 32. Machine Learning Choice of unsupervised and incremental algorithms Incremental PCA  Transforms a number of possibly correlated variables into a smaller number of uncorrelated ones, the principal components  A change in the principal components means a broken correlation, or anomaly  Can be used for data compression Inspired by a paper from Carnegie Mellon University (Hoke et al. 2006) Source: http://www.pdl.cmu.edu/PDL-FTP/SelfStar/osr_sub.pdf Implementation had two main challenges: measures with missing values and different scales
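
The idea that a broken correlation shows up as a departure from the principal component can be sketched in Python as follows. This is a generic incremental update of a single dominant direction with a forgetting factor, written for illustration; it is not claimed to be the algorithm used in the project or in the cited paper. The class name, the forgetting value and the synthetic data are all assumptions, and the sketch ignores the two challenges named on the slide (missing values and different scales) as well as mean centering.

```python
import math


class DominantDirectionTracker:
    """Tracks one dominant direction of a stream, one sample at a time."""

    def __init__(self, dim, forgetting=0.96):
        self.w = [1.0] + [0.0] * (dim - 1)  # arbitrary starting direction
        self.energy = 1e-3                  # small positive start value
        self.forgetting = forgetting        # assumed value, tune per stream

    def update(self, x):
        # Project the sample, then nudge w towards what the projection missed.
        y = sum(wi * xi for wi, xi in zip(self.w, x))
        self.energy = self.forgetting * self.energy + y * y
        error = [xi - y * wi for xi, wi in zip(x, self.w)]
        self.w = [wi + (y / self.energy) * ei
                  for wi, ei in zip(self.w, error)]

    def residual(self, x):
        # Distance from x to the line spanned by the tracked direction.
        norm = math.sqrt(sum(wi * wi for wi in self.w))
        u = [wi / norm for wi in self.w]
        proj = sum(ui * xi for ui, xi in zip(u, x))
        return math.sqrt(sum((xi - proj * ui) ** 2 for xi, ui in zip(x, u)))


# Synthetic stream: the second metric is always twice the first.
tracker = DominantDirectionTracker(dim=2)
for t in range(300):
    s = 1.0 + 0.5 * math.sin(t / 5.0)
    tracker.update((s, 2.0 * s))

normal_score = tracker.residual((1.0, 2.0))   # correlation holds
anomaly_score = tracker.residual((1.0, 0.0))  # correlation broken
```

A sample that respects the learned correlation lies on the tracked direction and has a residual near zero, while a sample that breaks it has a large residual, which is what could be thresholded as an anomaly.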
  • 34. Machine Learning Summarized in 1 principal component + generation matrix
  • 35. Machine Learning Second principal component sensitivity: three anomalies
  • 36. Project Status  All functionality developed  Algorithms being validated through tests with RRDs and meetings with the operations team  Performance tests ongoing  System running in the live environment with reduced scope
  • 37. Project at Globo.com – Next challenges  Scale  Event “sharding”  Rule balancing  Cache  Optimize algorithm  Adaptive control of memory and sensitivity parameters  Insert a supervised layer  Other algorithms to cooperate
  • 38. Intelligent Monitoring Final considerations
  • 39. References http://delicious.com/fisl10
  • 40. Questions Contacts Denis A. Vieira Jr denis@corp.globo.com (www.globo.com) Ricardo Clemente ricardo@intelie.com.br (www.intelie.com.br) Globo.com stand This afternoon Raise your hand!