Never Fail Twice
How Playtech Mastered Failure Detection Across Distributed Systems
Bio
• Technical Architect with more than 18 years of experience
• Passionate about IT
• Financial and Data Science background
• Recent years spent on Research and Design projects
Agenda
• What is observability and monitoring
• Why this is hard
• Possible approaches
• How we solved it
• The future of instrumentation and observability
Objectives
• Get acquainted with time-series analysis
• Understand the pros and cons of distributed systems
• Understand observability and instrumentation concepts
Observability
• Monitoring is for operating software/systems
• Instrumentation is for writing software
• Observability is for understanding systems
Charity Majors
Why is it difficult?
1. Various problems may lead to non-obvious system behaviour.
2. Various metrics may have different correlations in time and space.
3. Monitoring a complex application is a significant engineering endeavor in and of itself.
4. There is a mix of different measurements and metrics.
System monitoring in Playtech
• 50+ multibranded sites, distributed all over the world
• Multiple products
• Multichannel
• Different mix of integrations
On the shoulders of giants
Many companies have built their own solutions for monitoring their systems.
These were not always success stories.
Etsy
• Etsy is a large online marketplace of handmade goods
• Their engineering team collected more than 250,000 different metrics from their servers
• They tried to find anomalies using complex mathematical approaches
Lessons learnt from KALE 1.0
• Anomalies in other metrics should be used for root cause analysis
• Alerts should only be sent out when anomalies are detected in business and user metrics
• A one-size-fits-all type of approach will probably not fit at all
• Anomaly detection is more than just outlier detection
Google SRE team’s BorgMon
• Google has trended toward simpler and faster monitoring systems, with better tools for post hoc analysis
• [They] avoid “magic” systems that try to learn thresholds or automatically detect causality
• Rules that generate alerts for humans should be simple to understand and represent a clear failure
According to the authors of Site Reliability Engineering
Playtech case
• The previous tool from HP was a “one-size-fits-all” solution
• Low efficiency and side effects
• False positives and missed incidents
• Horrible operability
Time Series
• A time series is a series of data points indexed (or listed or graphed) in time order
• Economic processes have a regular structure
• Examples include sales volumes in shops, champagne production, and online transactions
• Usually they have seasonal periods and trend lines
• Using this information simplifies the analysis
Stationary Time-Series Data
• A stochastic process whose statistical characteristics do not change over time
• Example: white noise
Non-Stationary Time Series
• Trend line
• Change in dispersion (variance)
How to model that?
• Every measurement consists of a signal and an error component (noise), because our processes are affected by many factors
• Point_of_measurement = signal + error
• Subtract the model’s values from our measurements
• The more closely our model resembles the real signal, the more closely the residual will approximate the error component, that is, a stationary series or white noise
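The decomposition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is assumed to be a straight line fitted by least squares, the data is synthetic, and the names `fit_trend` and `residuals` are hypothetical rather than taken from the talk.

```python
import random


def fit_trend(values):
    """Least-squares line through the points (0, v0), (1, v1), ...

    Returns (slope, intercept). The straight line is an assumed model.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(values):
    """Measurement minus model value, point by point."""
    slope, intercept = fit_trend(values)
    return [y - (slope * x + intercept) for x, y in enumerate(values)]


# Synthetic data: Point_of_measurement = signal + error
rng = random.Random(42)
signal = [100 + 0.5 * t for t in range(200)]
measurements = [s + rng.gauss(0, 2) for s in signal]

slope, intercept = fit_trend(measurements)
res = residuals(measurements)
print(f"estimated slope: {slope:.3f}")
print(f"mean residual: {sum(res) / len(res):.6f}")
```

If the fitted line is close to the real signal, the residuals should carry only the noise term, which is what the stationarity check later in the deck examines.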
Example
A 30-minute slice of the data
Regression: finding a trend line
Trend line subtracted: the residual looks like white noise
Dickey-Fuller test on the initial piece of data: stationarity is rejected
After subtraction: the result is a stationary time series
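The test in this example can be illustrated with a simplified, non-augmented Dickey-Fuller statistic in plain Python. This is a sketch under assumptions, not the code used in the talk: the data is synthetic, the function names are hypothetical, and the threshold of -2.86 is the approximate large-sample 5% critical value for the constant-only variant. In practice a library routine such as `adfuller` from `statsmodels` would normally be used instead.

```python
import math
import random


def detrend(values):
    """Subtract a least-squares straight line from the values."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in enumerate(values)]


def dickey_fuller_stat(series):
    """t-statistic of gamma in: diff[t] = alpha + gamma * series[t-1] + noise."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series, series[1:])]
    n = len(diffs)
    mean_x = sum(lagged) / n
    mean_y = sum(diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, diffs))
    gamma = sxy / sxx
    alpha = mean_y - gamma * mean_x
    rss = sum((y - alpha - gamma * x) ** 2 for x, y in zip(lagged, diffs))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return gamma / std_err


# Approximate 5% critical value (constant-only variant, large samples).
# Using it for the detrended series as well is a simplification.
CRITICAL_5PCT = -2.86


def looks_stationary(series):
    """True when the unit-root hypothesis is rejected at roughly 5%."""
    return dickey_fuller_stat(series) < CRITICAL_5PCT


rng = random.Random(7)
raw = [100 + 0.5 * t + rng.gauss(0, 2) for t in range(300)]  # trend + noise
print("raw slice stationary:", looks_stationary(raw))
print("after subtraction stationary:", looks_stationary(detrend(raw)))
```

The null hypothesis of the test is that the series has a unit root, so a statistic below the critical value is evidence in favour of stationarity.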
Let’s take a moving average from our example
A bit of Salvador Dali
Compared with the next week’s data
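A moving-average model and its comparison with the following week can be sketched as below. The window size, the tolerance, the sample numbers, and the helper names `moving_average` and `deviations` are all assumptions made for illustration.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average; early points use as many values as exist."""
    out = []
    buf = deque(maxlen=window)
    total = 0.0
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # value about to be pushed out of the window
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


def deviations(model, actual, tolerance):
    """Indices where the actual value is further than tolerance from the model."""
    return [i for i, (m, a) in enumerate(zip(model, actual)) if abs(a - m) > tolerance]


last_week = [10, 12, 11, 13, 12, 14, 13]  # made-up sample values
next_week = [10, 11, 12, 12, 30, 13, 13]  # one obvious spike at index 4

model = moving_average(last_week, window=3)
anomalies = deviations(model, next_week, tolerance=5)
print(model)      # [10.0, 11.0, 11.0, 12.0, 12.0, 13.0, 13.0]
print(anomalies)  # [4]
```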
Why a Time Series DB matters
• Optimized for handling time series data
• No updates: facts never change
• Data is only appended
• Recent data is queried most often
• InfluxDB is one of the best time series databases
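The access pattern described on this slide (append only, no updates, mostly recent reads) can be shown with a toy in-memory structure. This is not how InfluxDB is implemented; the class `AppendOnlySeries` and its methods are hypothetical and exist only to make the pattern concrete.

```python
import bisect


class AppendOnlySeries:
    """Toy time series: points can only be appended in time order."""

    def __init__(self):
        self._times = []
        self._values = []

    def append(self, timestamp, value):
        # No updates and no out-of-order writes: facts do not change.
        if self._times and timestamp < self._times[-1]:
            raise ValueError("points must arrive in time order")
        self._times.append(timestamp)
        self._values.append(value)

    def range(self, start, end):
        """All points with start <= timestamp <= end, found by binary search."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))

    def latest(self, count):
        """The most recent points, which are the ones queried most often."""
        if count <= 0:
            return []
        return list(zip(self._times[-count:], self._values[-count:]))


series = AppendOnlySeries()
for ts, value in [(1, 10), (2, 20), (3, 30)]:
    series.append(ts, value)
print(series.range(2, 3))  # [(2, 20), (3, 30)]
print(series.latest(1))    # [(3, 30)]
```

Because timestamps only grow, the list stays sorted without any extra work, which is what makes range queries cheap.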
An Important Notice
The first level involves searching for anomalies in metrics and sending out notifications if outliers are found. This is the information emission level.
The second level involves receiving such information and deciding whether it represents a real problem or outage. This is the information consumption level.
Overall Architecture
• Python stack
• Built as a set of loosely coupled components
• Each component is executed in its own Python virtual machine
• Event-driven design
Event Streamer
• A component that holds Workers, fetches data regularly, and tests this data against the statistical models managed by the Workers
• A Worker is the main working unit; it holds a set of models together with meta-information
• Workers are fully independent, and every cycle is executed using a thread pool
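The structure described above might look roughly like the following sketch. Everything here is an assumption for illustration: the class names, the `ThresholdModel` stand-in for a statistical model, and the `fetch` callable that replaces a real data source.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class ThresholdModel:
    """Hypothetical stand-in for a statistical model: flags values outside a band."""
    low: float
    high: float

    def violates(self, value):
        return not (self.low <= value <= self.high)


@dataclass
class Worker:
    """Holds a set of models plus meta-information for one metric."""
    metric: str
    models: list
    meta: dict = field(default_factory=dict)

    def run_cycle(self, fetch):
        value = fetch(self.metric)
        violated = [i for i, m in enumerate(self.models) if m.violates(value)]
        return {"metric": self.metric, "value": value,
                "violations": violated, "meta": self.meta}


class EventStreamer:
    """Runs every Worker once per cycle on a thread pool and emits violations."""

    def __init__(self, workers, fetch, max_threads=4):
        self.workers = workers
        self.fetch = fetch
        self.max_threads = max_threads

    def run_once(self):
        with ThreadPoolExecutor(max_workers=self.max_threads) as pool:
            results = list(pool.map(lambda w: w.run_cycle(self.fetch), self.workers))
        return [r for r in results if r["violations"]]


# Fake data source for the example; a real one would query the metrics store.
current = {"logins": 40.0, "payments": 105.0}
streamer = EventStreamer(
    workers=[
        Worker("logins", [ThresholdModel(80, 120)], {"site": "example"}),
        Worker("payments", [ThresholdModel(90, 110)]),
    ],
    fetch=current.__getitem__,
)
events = streamer.run_once()
print(events)
```

Only the `logins` Worker produces an event here, because its value falls outside the band of its single model.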
Rule Engine
• Consumes the information provided by the Event Streamer
• Rules are built as abstract syntax trees
• Around 1,500 matches per second in a single process
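One way to represent a rule as a tree of nodes that is evaluated against an incoming event is sketched below. The node types, the event fields, and the sample rule are hypothetical; the actual rule syntax from the talk is not shown in the slides.

```python
import operator
from dataclasses import dataclass

# Comparison operators allowed in leaf nodes.
OPS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
}


@dataclass(frozen=True)
class Compare:
    """Leaf node: compares one event attribute with a constant."""
    key: str
    op: str
    value: object

    def evaluate(self, event):
        return OPS[self.op](event[self.key], self.value)


@dataclass(frozen=True)
class And:
    left: object
    right: object

    def evaluate(self, event):
        return self.left.evaluate(event) and self.right.evaluate(event)


@dataclass(frozen=True)
class Or:
    left: object
    right: object

    def evaluate(self, event):
        return self.left.evaluate(event) or self.right.evaluate(event)


# Hypothetical rule: logins are degrading fast OR violated many models.
rule = And(
    Compare("metric", "==", "logins"),
    Or(Compare("speed", "<", -5), Compare("violations", ">=", 3)),
)

print(rule.evaluate({"metric": "logins", "speed": -8.0, "violations": 1}))  # True
print(rule.evaluate({"metric": "logins", "speed": -1.0, "violations": 1}))  # False
```

Keeping rules as data rather than code makes them easy to store, inspect, and combine, which fits the slide's point that rules are built as trees.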
We also measure dynamics
• We can take into account the speed and acceleration of the degradation of the metrics
• These correspond, respectively, to the severity of the incident and the predicted change in its severity
• Speed is the slope (angular coefficient), or discrete derivative, of a particular metric, calculated for every violation
• The same applies to acceleration, which is the second-order derivative
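As a small worked example, speed and acceleration can be computed as first and second discrete differences. The sample values and the function name `discrete_derivative` are made up for illustration, and a fixed sampling step is assumed.

```python
def discrete_derivative(values, step=1.0):
    """Difference between consecutive samples divided by the sampling step."""
    return [(b - a) / step for a, b in zip(values, values[1:])]


metric = [100, 98, 94, 88, 80]               # a metric degrading faster and faster
speed = discrete_derivative(metric)          # first-order derivative
acceleration = discrete_derivative(speed)    # second-order derivative

print(speed)         # [-2.0, -4.0, -6.0, -8.0]
print(acceleration)  # [-2.0, -2.0, -2.0]
```

A negative speed shows the metric is dropping, and a negative acceleration shows the drop is getting steeper, so the incident is likely to become more severe.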
Some examples of our rules
The model ensemble can be fine-tuned
A report is created for every alert
Alerta: an open-source product for alert aggregation
Non-functional aspects
What the future brings
Q&A
• Thank you very much
• aleks.tavgen@gmail.com
• https://medium.com/@ATavgen/time-series-modelling-a9bf4f467687
• https://medium.com/@ATavgen/never-fail-twice-608147cb49b

Monitoring Distributed Systems

  • 1. Never Fail Twice How Playtech Mastered Failure Detection Across Distributed Systems
  • 2. Bio  Technical Architect with more than 18 y. of experience  Passionate about IT  Financial and Data Science background  Last years in Research and Design projects
  • 3. Agenda • What is observability and monitoring • Why this is hard • Possible approaches • How we solved it • Future of the instrumentation and observability
  • 4. Objectives  Get in touch with time-series analysis  Understanding Distributed Systems pro’s and con’s  Understanding observability and instrumentation concepts
  • 5. Observability  Monitoring is for operating software/systems  Instrumentation is for writing software  Observability is for understanding systems Charity Majors
  • 6. Why is it difficult  1. Various problems may lead to non-obvious system behaviour.  2. Various metrics may have different correlations in time and space.  3. Monitoring a complex application is a significant engineering endeavor in and of itself.  4. There is a mix of different measurements and metrics.
  • 7. System monitoring in Playtech  50+ multibranded sites, distributed all over the world  Multiple products  Multichannel  Different mix of integrations
  • 8. On the shoulders of giants A lot of companies built their own solutions for monitoring their systems. There was not always success stories.
  • 9. Etsy  Etsy is a large online marketplace of handmade goods  Their engineering team collected more than 250,000 different metrics from their servers  They tried to find anomalies using complex math approaches.
  • 10. Lessons learnt from KALE 1.0 Anomalies in other metrics should be used for root cause analysis. Alerts should only be sent out when anomalies are detected in business and user metrics A one-size-fits-all type of approach will probably not fit at all Anomaly detection is more than just outlier detection
  • 11. Google SRE team’s BorgMon  Google has trended toward simpler and faster monitoring systems, with better tools for post hoc analysis  [They] avoid “magic” systems that try to learn thresholds or automatically detect causality  Rules that generate alerts for humans should be simple to understand and represent a clear failure According to the authors of Site Reliability Engineering
  • 12. Playtech case Past tool from HP is “one-fits-for- all” Low efficiency and side effects False Positives and missed incidents Horrible operability
  • 13. Time Series  A time series is a series of data points indexed (or listed or graphed) in time order  Economical processes have a regular structure  These are amount of sales in the shops, production of champagne, online transactions  Usually they have seasonal periods and trend lines  Using this information, simplifies analysis
  • 14. Stationary Time-Series Data  Is a stochastic process, which characteristics does not change  White noise
  • 15. Non Stationary Time Series  Trend line  Dispersion change
  • 16. How to model that? Every measurement consists of a signal and an error component (noise), because our processes are affected by many factors: Point_of_measurement = signal + error. Subtract the model’s values from our measurements. The more closely our model resembles the real signal, the more closely the residual approximates the error component, that is, stationary white noise.
  • 18. Cut out a 30-minute piece of data
  • 19. Regression or finding a trend line
  • 20. Trend line subtracted Looks like white noise
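The trend-subtraction step on the slides above can be sketched as follows. This is a minimal illustration in plain Python, not the code behind the slides: the synthetic data, the one-point-per-second sampling, and the names `fit_trend` and `residuals` are all assumptions made for the example.

```python
import random
from statistics import mean


def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    xs = range(len(values))
    x_mean = mean(xs)
    y_mean = mean(values)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def residuals(values):
    """Measurements minus the fitted trend line (the model)."""
    slope, intercept = fit_trend(values)
    return [y - (intercept + slope * x) for x, y in enumerate(values)]


# Hypothetical 30-minute window, one point per second:
# a linear trend (the "signal") plus Gaussian noise (the "error").
rng = random.Random(42)
window = [100.0 + 0.05 * t + rng.gauss(0.0, 1.0) for t in range(1800)]

slope, intercept = fit_trend(window)
resid = residuals(window)  # should look like noise around zero
```

If the straight line is a good model of the signal, `resid` fluctuates around zero with no visible trend, which is what slide 20 describes as looking like white noise.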
  • 21. Dickey-Fuller test on the initial piece of data: the stationarity hypothesis is rejected
  • 22. And after subtraction: the result is a stationary time series
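The before-and-after check can be illustrated with a hand-rolled version of the simplest Dickey-Fuller regression. This is a sketch under stated assumptions: it uses the non-augmented, constant-only form of the test and compares the statistic with -2.86, the commonly tabulated asymptotic 5% critical value for that form. The data is synthetic and the function names are invented for the example. The slides do not say which implementation was used; in practice a library routine such as `adfuller` from statsmodels would be the usual choice.

```python
import math
import random


def ols(xs, ys):
    """Least-squares fit ys ~ intercept + slope * xs; also returns Sxx."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean, sxx


def dickey_fuller_stat(series):
    """t-statistic of gamma in: y[t] - y[t-1] = alpha + gamma * y[t-1] + error."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series[:-1], series[1:])]
    gamma, alpha, sxx = ols(lagged, diffs)
    ssr = sum((d - (alpha + gamma * x)) ** 2 for x, d in zip(lagged, diffs))
    std_err = math.sqrt(ssr / (len(diffs) - 2) / sxx)
    return gamma / std_err


CRITICAL_5PCT = -2.86  # assumed: asymptotic 5% value, constant-only test


def looks_stationary(series):
    """True when the unit-root hypothesis is rejected at the 5% level."""
    return dickey_fuller_stat(series) < CRITICAL_5PCT


# Synthetic window: linear trend plus noise (illustrative data only).
rng = random.Random(7)
ts = list(range(1800))
raw = [100.0 + 0.05 * t + rng.gauss(0.0, 1.0) for t in ts]

# Subtract the fitted trend line, then test both series.
slope, intercept, _ = ols(ts, raw)
detrended = [y - (intercept + slope * t) for t, y in zip(ts, raw)]

raw_is_stationary = looks_stationary(raw)
detrended_is_stationary = looks_stationary(detrended)
```

On data like this, the raw window is not classified as stationary, while the detrended residual is, mirroring the two slides above.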
  • 23. Let’s take a moving average from our example
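A trailing moving average is one of the simplest models of the signal. The sketch below is illustrative only; the window size and the function name are assumptions, not values taken from the talk.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average; one output value per full window."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # this value is evicted by the append below
        buf.append(v)
        total += v
        if len(buf) == window:
            out.append(total / window)
    return out


smoothed = moving_average([1, 2, 3, 4, 5], window=3)  # [2.0, 3.0, 4.0]
```

The smoothed series can then play the role of the model whose values are subtracted from new measurements.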
  • 24. A bit of Salvador Dali
  • 25. Compared with the next week’s data
  • 26. Why Time Series DB matters: Optimized for handling time series data. No updates: facts never change. Data is only appended. The most recent data is queried most often. InfluxDB is one of the best time series databases.
  • 27. An Important Notice: The first level involves searching for anomalies in metrics and sending out notifications if outliers are found. This is the information emission level. The second level involves receiving such information and deciding whether it represents a real problem or outage. This is the information consumption level.
  • 28. Overall Architecture: Python stack. Built as a set of loosely coupled components. Executed on their own Python virtual machines. Event-driven design.
  • 29. Event Streamer: Component that holds Workers, fetches data regularly, and tests this data against the statistical models managed by Workers. A Worker is the main working unit that holds a set of models together with meta-information. Workers are fully independent and every cycle is executed using a threading pool.
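The Worker and thread-pool arrangement might look roughly like the sketch below. Everything here is hypothetical: the `Worker` fields, the toy "expected level plus tolerance" model, and the shape of the emitted event are invented for illustration. The default of 8 threads follows the figure given in the editor's notes for the current configuration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Worker:
    """One metric with its data connector and a toy statistical model."""
    metric: str
    fetch: Callable[[], float]  # hypothetical data connector
    expected: float             # toy model: expected level of the metric
    tolerance: float            # allowed deviation before a violation

    def run_cycle(self) -> Optional[dict]:
        """Fetch a value, test it against the model, return an event or None."""
        value = self.fetch()
        if abs(value - self.expected) > self.tolerance:
            return {"metric": self.metric, "value": value, "expected": self.expected}
        return None


def stream_once(workers: List[Worker], max_threads: int = 8) -> List[dict]:
    """Run one cycle of every worker in a thread pool and collect violations."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        results = list(pool.map(lambda w: w.run_cycle(), workers))
    return [event for event in results if event is not None]


workers = [
    Worker("logins", lambda: 95.0, expected=100.0, tolerance=10.0),
    Worker("payments", lambda: 40.0, expected=100.0, tolerance=10.0),
]
events = stream_once(workers)  # only "payments" violates its model
```

Threads suit this workload because each cycle is dominated by waiting on data fetches, so the workers spend most of their time blocked on I/O rather than competing for the interpreter.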
  • 30. Rule Engine: Consumes the information provided by the Event Streamer. Rules are built as an Abstract Syntax Tree. Around 1500 matches per second in one process.
  • 31. We also measure dynamics: We can take into account the speed and acceleration of the degradation of the metrics. These correspond, respectively, to the severity and the predicted change in the severity of the incident. Speed is the slope (angular coefficient), or discrete derivative, of a particular metric, calculated for every violation. The same applies to acceleration, the second-order derivative.
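Speed and acceleration as discrete derivatives can be shown with a few lines of Python. The sample values and the unit time step are assumptions made for the example.

```python
def discrete_derivative(values, step=1.0):
    """First-order differences divided by the (assumed constant) time step."""
    return [(b - a) / step for a, b in zip(values, values[1:])]


# Hypothetical metric values recorded at four consecutive violations.
metric = [100.0, 96.0, 90.0, 80.0]

speed = discrete_derivative(metric)        # [-4.0, -6.0, -10.0]
acceleration = discrete_derivative(speed)  # [-2.0, -4.0]
```

Here the metric is falling (negative speed) and the fall is getting faster (negative acceleration), which in the terms of the slide means a severe incident whose severity is predicted to grow.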
  • 32. Some examples of our rules
  • 33. The model ensemble can be fine-tuned
  • 34. A report is created for every alert
  • 35. Alerta: an open-source product for alert aggregation
  • 38. Q&A: Thank you very much. aleks.tavgen@gmail.com https://medium.com/@ATavgen/time-series-modelling-a9bf4f467687 https://medium.com/@ATavgen/never-fail-twice-608147cb49b

Editor's Notes

  1. First, because of the high complexity of the products and the huge number of settings, there are situations where incorrect settings degrade financial indicators, or where hidden bugs in the logic affect the overall functionality of the whole system. Second, there are country-specific third-party integrations, and problems that arise at partners start leaking into our systems. Problems of this kind are not caught by low-level monitoring; to solve them you need to monitor key performance indicators (KPIs), compare them with system statistics, and look for correlations. The company had previously deployed Hewlett Packard Service Health Analyzer, which was (to put it mildly) far from ideal. According to the marketing brochure, it is a self-learning system that provides early detection of problems (SHA can take data streams from multiple sources and apply advanced, predictive algorithms in order to alert to and diagnose potential problems before they occur). In practice it was a black box that could not be configured; every problem had to be escalated to HP, followed by months of waiting for support engineers to deliver something that still did not work as needed. On top of that came a terrible user interface, an old JVM ecosystem (Java 6.0) and, most importantly, a large number of false positives and (even worse) false negatives: some serious problems were either not detected at all or were caught much later than they should have been, which translated into very concrete financial losses.
  2. The experience of the Kale project points to a very important lesson. Alerting is not the same as searching for anomalies and outliers in metrics, because, as already mentioned, there will always be anomalies in individual metrics. In reality, we have two logical levels. The first is searching for anomalies in metrics and sending a notification about the violation if an anomaly is found. This is the information emission level. The second level is a component that receives information about violations and decides whether or not they constitute a critical incident. This is also how we humans act when investigating a problem: we look at something, look further when we notice a deviation from the norm, and then make a decision based on our observations. At the start of the project we decided to try Kapacitor, because it supports user-defined functions in Python. But each function sits in a separate process, which would have created overhead for hundreds or thousands of metrics. Because of this and some other problems, we decided to drop it. Python was chosen as the main stack for building our own system, because it has an excellent ecosystem for data analysis, fast libraries (pandas, numpy, etc.), and excellent support for web solutions. You name it. For me this was the first large project done entirely in Python. I came to Python from the Java world. I did not want to multiply the zoo of stacks for a single system, and that was ultimately rewarded.
  3. Overall architecture. The system is built as a set of loosely coupled components or services that run in their own processes on their own Python VMs. This follows naturally from the general logical split (events emitter / rules engine) and brings other advantages. Each component does a limited number of specific things. Later this will make it possible to extend the system very quickly and add new user interfaces without touching the core logic or being afraid of breaking it. Fairly clear boundaries are drawn between the components. A distributed deployment is convenient when an agent needs to be placed locally, closer to the site it monitors; alternatively, a large number of different systems can be aggregated together. Communication has to be message-based, because the whole system must be asynchronous. I chose ActiveMQ as the message queue, but switching to, for example, RabbitMQ would not be a problem, because all components communicate over the standard STOMP protocol.
  4. A Worker is the main working unit, which holds one model of one metric together with meta-information. It consists of a data connector and a handler to which it passes the data. The handler tests the data against the statistical model and, if it detects violations, passes them to an agent, which sends an event to the queue. Workers are fully independent of each other, and every cycle is executed through a thread pool. Since most of the time is spent on I/O operations, Python's Global Interpreter Lock does not affect the result much. The number of threads is set in the config; in the current configuration, 8 threads turned out to be optimal. Information emitter
  5. Information consuming. Every message is sent to the Rule Engine, and this is where the most interesting part begins. At the very start of development I hard-coded the rules: when one metric falls and another rises, send an alert. But this solution is not universal and requires digging into the code for any extension. So some kind of language for defining rules was needed. Here I had to recall Abstract Syntax Trees and define a simple language for describing rules. When the component starts, the rules are read and a syntax tree is built. On receipt of each message, all events from that site within one tick are checked against the defined rules, and if a rule fires, an alert is generated. More than one rule may fire. If we consider the dynamics of incidents evolving over time, we can also take into account the speed of the drop (the severity level) and the change in that speed (the forecast of the change in severity).
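To make the idea of rules as a syntax tree concrete, here is a minimal sketch of how the originally hard-coded rule (one metric falls while another rises) could be expressed as a small tree and evaluated against the events of one tick. The node classes, the metric names, and the convention that an event maps a metric name to the speed of its violation are all assumptions for illustration; the actual rule language of the system is not shown in the notes.

```python
from dataclasses import dataclass
from typing import Dict

Events = Dict[str, float]  # assumed shape: metric name -> speed of the violation


@dataclass
class Falling:
    metric: str

    def matches(self, events: Events) -> bool:
        return events.get(self.metric, 0.0) < 0.0


@dataclass
class Rising:
    metric: str

    def matches(self, events: Events) -> bool:
        return events.get(self.metric, 0.0) > 0.0


@dataclass
class And:
    left: object
    right: object

    def matches(self, events: Events) -> bool:
        return self.left.matches(events) and self.right.matches(events)


@dataclass
class Or:
    left: object
    right: object

    def matches(self, events: Events) -> bool:
        return self.left.matches(events) or self.right.matches(events)


# "Logins fall while errors rise" as a tree instead of hard-coded logic.
rule = And(Falling("logins"), Rising("errors"))

tick_events = {"logins": -3.0, "errors": 2.5}
alert = rule.matches(tick_events)          # True: both conditions hold
no_alert = rule.matches({"logins": -3.0})  # False: no event for "errors"
```

Because each node only knows how to evaluate itself, new rule types can be added as new node classes without changing the evaluation loop.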
  6. Speed is the slope (angular coefficient), or discrete derivative, calculated for every violation. The same applies to acceleration, the second-order discrete derivative. These values can be specified in the rules. Cumulative first- and second-order derivatives can be taken into account in the overall assessment of an incident.
  7. Rules are described in YAML format, but any other format can be used; it is enough to add your own parser. Rules are specified as regular expressions over metric names, or simply as metric prefixes. Speed is the rate of degradation of the metrics; more on this below.
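Matching rules against metric names by regular expression or prefix can be sketched as below. The rule schema (`name`, `pattern`, `max_speed`) and the metric names are hypothetical; the list of dictionaries stands in for what a YAML parser might produce, since the real rule file format is not shown here.

```python
import re

# Stand-in for parsed YAML rules (hypothetical schema).
RULES = [
    {"name": "payments_degrading", "pattern": r"^payments\..*", "max_speed": -5.0},
    {"name": "any_login_drop", "pattern": r"^login", "max_speed": -1.0},
]


def fired_rules(rules, events):
    """Return the names of rules with a matching metric that degrades fast enough.

    `events` maps a metric name to the speed of its violation in one tick;
    a rule fires when a matching metric has speed <= the rule's max_speed.
    """
    fired = []
    for rule in rules:
        regex = re.compile(rule["pattern"])
        if any(
            regex.search(metric) is not None and speed <= rule["max_speed"]
            for metric, speed in events.items()
        ):
            fired.append(rule["name"])
    return fired


events = {"payments.deposit.count": -7.5, "login.success": -0.5}
alerts = fired_rules(RULES, events)  # ["payments_degrading"]
```

A plain prefix such as `^login` is just a special case of a regular expression, so both styles of rule can share one matching path.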
  8. SHA required an Oracle server with tens of GB of RAM. PT-Pas: 28962 lines of code, 1500 matches per second in one process, up to 125 000 matches in the current configuration (sharding probabilities)
  9. Non funfunc 28982 LOC