Making sure that one application works 24/7 is never easy, but making sure all services are running in a large enterprise is definitely a challenging task. On the other hand, today everybody is talking about big data, AI, and machine learning, but many companies still struggle to find a use case where machine learning can make a difference, or to build a production-ready ML system. AIOPS tries to combine those two fields, big data and machine learning, to automate IT operations processes. How can machine learning help in IT operations? What does it take to build a machine learning system that cooperates with developers? What can we expect in the future? These are just some of the questions we will try to answer in this talk.
6. How to train the model?
[Diagram: raw logs are aggregated into 30-second features (average latency, number of logs, errors, part of day). These features describe the normal state for service X. New data is passed to the machine learning model.]
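The training idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not say which model is used, so a simple per-feature mean and standard deviation baseline stands in for the "machine learning model", and the record fields (`ts`, `latency_ms`, `level`), the part-of-day boundaries, and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean, pstdev

WINDOW_SECONDS = 30  # window size taken from the slide ("30 sec features")
NUMERIC = ("avg_latency", "num_logs", "errors")


def part_of_day(ts):
    """Map an epoch timestamp to a coarse part of day (UTC; boundaries are assumed)."""
    hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


def window_features(logs):
    """Aggregate raw log records into one feature row per 30-second window.

    Each record is a dict with hypothetical keys 'ts', 'latency_ms', 'level'.
    """
    buckets = defaultdict(list)
    for rec in logs:
        buckets[int(rec["ts"]) // WINDOW_SECONDS].append(rec)
    rows = []
    for bucket in sorted(buckets):
        recs = buckets[bucket]
        start = bucket * WINDOW_SECONDS
        rows.append({
            "window_start": start,
            "avg_latency": mean(r["latency_ms"] for r in recs),
            "num_logs": len(recs),
            "errors": sum(1 for r in recs if r["level"] == "ERROR"),
            "part_of_day": part_of_day(start),
        })
    return rows


def fit_normal_state(rows):
    """Learn the 'normal state' as (mean, std) per numeric feature.

    A zero standard deviation is replaced by 1.0 to avoid division by zero.
    """
    model = {}
    for name in NUMERIC:
        values = [r[name] for r in rows]
        model[name] = (mean(values), pstdev(values) or 1.0)
    return model


def anomaly_score(model, row):
    """Score new data as the largest absolute z-score across numeric features."""
    return max(abs(row[n] - model[n][0]) / model[n][1] for n in NUMERIC)
```

In this sketch `part_of_day` is computed but not used for scoring. A fuller model would probably condition on it, for example by keeping one baseline per part of day, since normal traffic at night differs from normal traffic in the afternoon.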
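8. Building AIOPS solution
[Diagram: app logs, APM logs, and DB logs feed an ETL service, which writes to a feature store. The AI service produces an anomaly score, which leads to an alert and then an action. Other components shown: model repository, dataset repository, retrain jobs, and data prep jobs. The diagram distinguishes a training environment and an inferencing environment.]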
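To make the relationship between the components on this slide more concrete, here is a minimal in-memory sketch. It is not the architecture from the talk: every class and function name is hypothetical, the "model" is just a function that returns an anomaly score, and the retrain job uses a trivial error-count baseline as a stand-in for real training.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Features = Dict[str, float]
Model = Callable[[Features], float]  # a model maps a feature row to an anomaly score


@dataclass
class FeatureStore:
    """Holds feature rows written by the ETL service (in memory for this sketch)."""
    rows: List[Features] = field(default_factory=list)

    def write(self, row: Features) -> None:
        self.rows.append(row)

    def latest(self) -> Features:
        return self.rows[-1]


@dataclass
class ModelRepository:
    """Keeps every published model; the inferencing side always reads the newest."""
    versions: List[Model] = field(default_factory=list)

    def publish(self, model: Model) -> int:
        self.versions.append(model)
        return len(self.versions)  # 1-based version number

    def latest(self) -> Model:
        return self.versions[-1]


@dataclass
class Alert:
    service: str
    score: float


def retrain_job(dataset: List[Features], repo: ModelRepository) -> int:
    """Training environment: fit a baseline on a dataset and publish a new model."""
    baseline = sum(r["errors"] for r in dataset) / len(dataset)
    return repo.publish(lambda row: row["errors"] - baseline)


class AIService:
    """Inferencing environment: score the newest features and alert above a threshold."""

    def __init__(self, store: FeatureStore, repo: ModelRepository, threshold: float):
        self.store = store
        self.repo = repo
        self.threshold = threshold

    def check(self, service: str) -> Optional[Alert]:
        score = self.repo.latest()(self.store.latest())
        return Alert(service, score) if score >= self.threshold else None
```

Keeping the model repository as the only link between `retrain_job` and `AIService` means the two sides can run on different schedules: a retrain job publishes a new version, and the next call to `check` picks it up without a restart.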