Goals to strive for in automating infrastructure deployment and monitoring. This focuses on all aspects of automation, from the development cycle, to deployment, to maintaining live services, all the way through data analysis.
2. “Big Picture” Goals
What should we be aiming for?
■ Don’t try to do everything perfectly
■ Do tighten every feedback loop to respond as quickly as possible to problems
3. Goals in Practice
How do we accomplish this?
■ One-click manual steps
■ Monitoring results at every phase
■ Automatic reporting or action
5. Development
■ One-click builds
■ Continuous builds
■ Test suites
■ Static code analysis
Deployment
■ One-click deployments
■ Minimizing downtime
■ Monitoring rollout health
■ Incremental node rollout
Production
■ Instance health
■ Process/service health
■ Third-party service health
■ User activity
Analysis
■ Automatic data collection
■ Scheduled analysis
■ Report notifications
■ Automatic rollbacks
7. Development
The primary goals of development automation are:
■ Notify developers of the error(s)
■ Prevent bad code from being released
8. Development
(There shouldn’t be anything new here for most developers)
■ One-click (or one-command) builds
Ideally, this is exactly the same regardless of developer OS
■ Sanity checks proactively fail builds
Unit testing
Property testing
Static code analysis
■ Continuous build systems
More thorough functional/integration tests
Every customer-reported issue should have an automated regression test!
Deeper code analysis
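As an illustration of pairing a customer-reported issue with an automated regression test, here is a minimal sketch using Python's standard unittest module. The function `parse_quantity` and the reported issue are hypothetical and not taken from the slides.

```python
import unittest


def parse_quantity(text: str) -> int:
    """Hypothetical function under test: parse a quantity typed by a user."""
    # Assumed fix for the reported issue: accept thousands separators.
    return int(text.replace(",", "").strip())


class ParseQuantityRegressionTest(unittest.TestCase):
    def test_thousands_separator_is_accepted(self):
        # Hypothetical customer report: "1,000" was rejected as invalid.
        # This test keeps the fix from silently regressing.
        self.assertEqual(parse_quantity("1,000"), 1000)

    def test_plain_number_still_works(self):
        self.assertEqual(parse_quantity("42"), 42)
```

A continuous build would run tests like these on every check-in, so a reintroduced bug fails the build instead of reaching a customer a second time.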
9. Development - Feedback Loop
■ Developer systems: Failed builds should prevent code check-ins
■ Continuous build failures
Send out notifications
Automatically roll back check-ins to release branches
Alternatively, success automatically integrates into release branches
Push system - the system is the gatekeeper
■ Continuous builds also generate reports about lower-threshold warnings
Static code analysis, test code coverage
Pull system - up to developers to be proactive
Minimize these as much as possible!
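The "failed builds should prevent code check-ins" idea can be sketched as a small gate script. This is one possible shape, not a prescribed tool: the check names and commands are placeholders you would replace with your own build, test, and lint commands.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple

# A check is a (name, command) pair; the command is an argv list.
Check = Tuple[str, List[str]]


def run_checks(checks: Sequence[Check]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    failed = []
    for name, command in checks:
        result = subprocess.run(command, capture_output=True)
        if result.returncode != 0:
            failed.append(name)
    return failed


def gate(checks: Sequence[Check]) -> int:
    """Return a process exit code: 0 if all checks pass, 1 otherwise."""
    failed = run_checks(checks)
    for name in failed:
        print(f"check failed: {name}", file=sys.stderr)
    return 1 if failed else 0
```

Wired in as a Git pre-commit hook, a non-zero exit code from such a script aborts the commit, which makes the developer's own machine the first gatekeeper.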
12. Deployment
The primary goals of deployment automation are:
■ Automatically push out changes
■ Actively monitor rollout for problems
■ Automatically roll back to known good states
13. Deployment
■ One-click (one-command) automatic rollouts
Should be staged across instances/regions
Should minimize downtime - hot swap!
■ Monitor rollout health
Node availability
Process/service availability
Data migration health
■ Failure thresholds
Developer notifications
Rollouts automatically unwind to known-good states
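The staged rollout, health monitoring, and failure-threshold points above can be sketched as a single loop. This is a simplified sketch under stated assumptions: `deploy`, `is_healthy`, and `rollback` are hypothetical callbacks you would supply, and the 10% default threshold is an arbitrary illustrative value.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    max_failure_ratio: float = 0.1,  # assumed threshold, tune per service
) -> bool:
    """Deploy stage by stage; unwind everything if a stage is too unhealthy."""
    deployed: List[str] = []
    for stage in stages:
        failures = 0
        for node in stage:
            deploy(node)
            deployed.append(node)
            if not is_healthy(node):
                failures += 1
        if stage and failures / len(stage) > max_failure_ratio:
            # Unwind in reverse order back to the known-good state.
            for node in reversed(deployed):
                rollback(node)
            return False
    return True
```

Because later stages only start after earlier ones pass the threshold, a bad release is stopped while it still affects a small share of nodes.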
14. Deployment
With enough seamless monitoring in place:
■ Deployments should be invisible to users
■ A “good” code check-in can automatically drive a new deployment
17. Production
The primary goals of end-user production automation are:
■ Monitor for problems
■ Proactively address problems
■ If necessary, roll back to known good states
18. Production
There are two distinct elements to monitoring in production
■ Detecting system problems
■ Monitoring users
19. Production
■ System monitoring
Notifications if systems/instances go down or are overloaded
Automatically scale up resources on demand
■ Service watchdogs
Automatic service restarts
Capture and storage of logs
Pushed by client, service, or cron job/scheduled task
■ Third-party APIs
Periodically check health/accessibility
Notifications upon failure
A problem with a necessary third-party service is a problem for your service
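Periodic health checks with notifications on failure can be sketched as follows. The probe names and the `notify` callback are hypothetical; in practice `notify` might page someone or post to a chat channel, and the function would be called on a schedule by a cron job or scheduled task.

```python
import urllib.request
from typing import Callable, Dict


def http_probe(url: str, timeout: float = 5.0) -> Callable[[], bool]:
    """Build a probe for `url`; urlopen raises on errors and timeouts."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    return probe


def check_services(
    probes: Dict[str, Callable[[], bool]],
    notify: Callable[[str], None],
) -> Dict[str, bool]:
    """Run each probe once; notify for every service that looks unhealthy."""
    results: Dict[str, bool] = {}
    for name, probe in probes.items():
        try:
            healthy = bool(probe())
        except Exception:
            # A probe that raises (timeout, DNS failure, HTTP error) counts as down.
            healthy = False
        results[name] = healthy
        if not healthy:
            notify(f"{name} is unhealthy")
    return results
```

The same function covers your own services and third-party APIs, which matches the point that an outage in a necessary dependency is an outage for you.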
20. Production
■ User monitoring
How many users are active?
What services are those users using?
What services are users hitting errors with?
■ Extended user monitoring
Email
Social media
App store reviews
Automatic notifications!
■ Users like interaction - people like to be noticed
Immediate, graceful interaction is likely to earn positive public feedback
Even from users who were complaining about a problem!
21. Production
Resolve problems
■ Automatically spin up/down resources to adapt to user load
■ Proactive notifications about errors
■ For critical issues, allow the production environment to automatically roll back to the last known-good state
■ Users who feel like you helped them personally are likely to become your evangelists
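The "spin up/down resources to adapt to user load" step comes down to a sizing decision. Here is a minimal sketch of one, assuming load is measured in requests per second and that each instance comfortably handles a known rate. All parameter names and default limits are illustrative assumptions.

```python
def desired_instances(
    total_rps: int,
    rps_per_instance: int,
    minimum: int = 2,   # assumed floor, kept for redundancy
    maximum: int = 20,  # assumed ceiling, caps cost
) -> int:
    """Return how many instances are needed for the current request rate."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    # Integer ceiling division: round up so capacity is never short.
    wanted = -(-total_rps // rps_per_instance)
    return max(minimum, min(maximum, wanted))
```

For example, `desired_instances(950, 100)` asks for 10 instances, while a quiet period at 50 requests per second falls back to the floor of 2.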
24. Analysis
The primary goals of analysis automation are:
■ Proactive, early warning of known problems
Notifications of significant issues
Automatically resolving where possible
Unwinding bad deployments upon certain thresholds
■ Ability to more easily detect unknown problems
Requires prior collection of good enough information to resolve
Usually feeds into the next development iteration
■ Really touches all of the previous pieces, as already shown
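The early-warning goals above (notify on significant issues, unwind at certain thresholds) can be sketched as a small decision function over collected metrics. The 1% and 5% thresholds are illustrative assumptions, not values from the slides.

```python
def evaluate_error_rate(
    errors: int,
    requests: int,
    warn_at: float = 0.01,      # assumed: notify at 1% errors
    rollback_at: float = 0.05,  # assumed: unwind at 5% errors
) -> str:
    """Map an observed error rate to an action: 'ok', 'notify', or 'rollback'."""
    if requests == 0:
        return "ok"  # no traffic, nothing to judge
    rate = errors / requests
    if rate >= rollback_at:
        return "rollback"
    if rate >= warn_at:
        return "notify"
    return "ok"
```

Run on a schedule against freshly collected data, a function like this turns analysis into an automatic action rather than a report someone has to remember to read.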
25. Analysis
■ Analysis is listed separately here because it should be treated as a first-class citizen
■ If you’re not driving development (or even features) through the use of measurement, all you’re really doing is educated guessing
[Figure labels: “Not this problem”; “Fix this problem first”]
26. Recap - Problem Resolution
The main points applicable to every stage
■ Automatic notifications of failures
■ Rollback to known-good state
■ Automatic resource scaling (up/down)
What this buys us
■ Immediate visibility into every link in the chain
■ Rapid, iterative releases for problem resolution
■ Rapid learning about your users