AutoRemediation and Workflow at LinkedIn

Nurse Tech Presentation given on May 14 2015 at the AutoRemediation meetup. http://www.meetup.com/Auto-Remediation-and-Event-Driven-Automation/events/222051597/

Brian Cory Sherwin
Site Reliability Engineer
LinkedIn
AutoRemediation and Workflow
@ LinkedIn

• Key Concepts
• Work Flow Ideas
• LinkedIn’s Solution
Agenda

• Monitoring Systems
• Remediation Systems
• Action Systems
Separation of Powers
[Diagram: Workflow, Monitoring, Action Systems]

Restart An Application
• Grab some logs
• Start an application
• Open a ticket
[Diagram: Gather Data, Restart, Ticket]
Simple Work Flow Example

Key Goals
• Broker between action systems
• Linear Execution of Events
• Collaboration and ease of use
• Focus on Simple use cases
[Diagram: Remediation Broker, Monitoring, Remote Execution, Ticketing]
Building an AutoRemediation @ LinkedIn

• Guaranteed Data Collection
• Better Accountability
• Formalized automation
• Extensibility
[Diagram: Gather Data, Restart, Ticket]
Why Use a Workflow

• Linear Execution
• Best Effort
• Limited Work Flow Control
Work Flow @ LinkedIn
[Diagram: Remediation Broker, Monitoring, Remote Execution, Ticketing]

Work Flow Control Types
• Best Effort
• Guaranteed
• Abort
• OnFailure (planned)
[Diagram: Gather Data, Restart, Ticket]
LinkedIn: Work Flow Control

• Brian Cory Sherwin (bcs)
• LinkedIn
• bsherwin@linkedin
Questions?

At least in my designs, the key concept is to separate the monitoring system, the workflow system, and the systems doing the work. A Swiss Army knife might have a corkscrew and a screwdriver, but it isn't necessarily very good at either of those jobs. This is potentially a system trying to solve your issues before they become issues. Do you want a purpose-built system, or do you want something trying to masquerade as three separate systems?
Let's talk about a simple example of a work flow. This is a very simple example that I'll reference a few times. Throughout the presentation I'll refer to the whole work flow as a plan and to the individual units of work as jobs.
We already had a healthy set of systems in place: monitoring, remote execution, code deployment. What we were missing was the glue between these systems. But how to glue? Focus on simple use cases that you're already doing. Restarting a web server? Re-kicking a box? Automate these first so you don't have to run them manually. Make sure the system is easy to use; this should be one of the key design goals. Creating, running, and scheduling remediation work flows should not be challenging. That's not to detract from the importance of understanding what is going on. At the end of the day you need to have faith in your monitoring system.
Let's go back to the example work flow. We can guarantee a data collection attempt. I've been in a few situations where the ops team was fixing problems without gathering the data needed to resolve them. As an app owner you should be aware of what data you need to fix an issue.
Better accountability: we know exactly how many times we've done something. Ops teams can sometimes toil in the darkness restarting applications (or rely on simpler systems that just restart them automatically). By keeping better records we can make better business decisions about fixing bugs. Is it a 0.1% problem or a 0.01% problem? Without good record keeping we'd never know.
Related to the above, formalized automation: simpler solutions that just restart applications automatically can hide problems more easily than a formalized system does. Additionally, by using a formalized system we can train less technical people to use it.
In addition to formalizing, extensibility is key. We do similar actions across our platform. We have dozens of applications with similar infrastructure. We can recycle automations from one group to the next without having to train people to use new systems.
Linear: we execute jobs with no branching and no conditionals. Many work flows can be solved this way. Branching work flows are not a necessary feature and can just lead to complicated configurations. Best effort: the monitoring system should be telling us to fix a problem. Each time the monitoring system tells us to fix something, we begin the work flow. If we fail, it shouldn't be an issue, because the monitoring system will know it's still wrong and remind us to run the work flow again. We offer users some limited work flow control options. We'll detail those on the next slide.
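As a rough illustration of the linear, best-effort idea, here is a minimal Python sketch. It is not the LinkedIn implementation; the names `run_linear`, `remediate`, and `max_rounds` are hypothetical, and the `while` loop only stands in for a monitoring system that keeps alerting until the problem is gone.

```python
from typing import Callable, List, Tuple

# A job is a name plus a callable that returns True on success (assumed shape).
Job = Tuple[str, Callable[[], bool]]


def run_linear(jobs: List[Job]) -> List[Tuple[str, bool]]:
    """Run every job once, in order, with no branching or conditionals."""
    results = []
    for name, action in jobs:
        try:
            ok = bool(action())
        except Exception:
            # Best effort: a failing job is recorded, not fatal.
            ok = False
        results.append((name, ok))
    return results


def remediate(is_alerting: Callable[[], bool], jobs: List[Job],
              max_rounds: int = 3) -> int:
    """Re-run the plan while monitoring still alerts; return rounds used.

    max_rounds is an illustrative safety cap, not something from the talk.
    """
    rounds = 0
    while is_alerting() and rounds < max_rounds:
        run_linear(jobs)
        rounds += 1
    return rounds
```

The point of the sketch is that the plan itself carries no retry logic: a failed run simply leaves the alert in place, and the next alert starts the same plan again.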
The key thing to understand here is what to do when plan health changes. If gathering data fails, do you want to skip the restart attempt? The answer varies by environment. The following descriptions are some of the work flow ideas we've come up with during our sojourn into auto remediation.
• Best Effort: Runs only when the plan state is healthy. A failure of this unit of work has no bearing on further execution.
• Guaranteed: Runs regardless of plan state. Its failure moves the plan state to unhealthy.
• Abort: Runs only when the plan state is healthy. On failure, it makes the plan state unhealthy.
• OnFailure: Runs when the plan state is unhealthy. Since we're still designing it, its success could possibly move the plan state back to healthy (or perhaps leave it unhealthy).
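The control types described in these notes can be sketched as a small state machine over a single healthy/unhealthy plan flag. This is a hedged Python sketch under stated assumptions, not the actual system: `Control`, `Job`, and `run_plan` are hypothetical names, and because the OnFailure design is still open, the sketch arbitrarily leaves the plan state unchanged after an OnFailure job.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Control(Enum):
    BEST_EFFORT = "best_effort"
    GUARANTEED = "guaranteed"
    ABORT = "abort"
    ON_FAILURE = "on_failure"  # planned; semantics below are an assumption


@dataclass
class Job:
    name: str
    action: Callable[[], bool]  # returns True on success
    control: Control = Control.BEST_EFFORT


def run_plan(jobs: List[Job]) -> Tuple[bool, List[str]]:
    """Run jobs in order; return (plan is healthy, names of jobs that ran)."""
    healthy = True
    executed: List[str] = []
    for job in jobs:
        if job.control is Control.GUARANTEED:
            should_run = True          # runs regardless of plan state
        elif job.control is Control.ON_FAILURE:
            should_run = not healthy   # runs only on an unhealthy plan
        else:
            should_run = healthy       # Best Effort and Abort need a healthy plan
        if not should_run:
            continue
        executed.append(job.name)
        try:
            ok = bool(job.action())
        except Exception:
            ok = False
        # Only Guaranteed and Abort failures mark the plan unhealthy.
        if not ok and job.control in (Control.GUARANTEED, Control.ABORT):
            healthy = False
        # Assumption: ON_FAILURE jobs never change the plan state here.
    return healthy, executed
```

For the running example, marking the data-gathering job as Abort and the ticket job as Guaranteed means a failed log grab skips the restart but still opens a ticket; marking data gathering as Best Effort instead lets the restart proceed either way.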