8.
- Every problem only once
- Stop the line if anything fails
- Fast response vs. prevention
(Eric Ries)
9. Layers, and what we're learning at each:
- Culture: how not to waste
- Test-Driven Development: how our code should behave
- Continuous Integration: how our changes affect others' code
- Immune System: how external behavior and internal deployments change actionable metrics
- Continuous Deployment: how to resiliently introduce change
10. (development)
- Trunk stable
- Small commits
- Trivial rollbacks
- Broken build: no checkins (stop the line)
11. (development)
- Only add fields/columns
- Only delete when you can prove there are no more consumers
12. (testing)
- Every problem only once
- Only automated testing
- Forbidden calls
- Bad code snippets
- Stuff-not-tested-test
- Disallow I/O in tests
Imagine you need to care for a garden, to plant it, weed it, water it regularly. (It kind of looks like a server farm, racks of vegetables….) Which tool would you use to water a garden….
A bucket or a hose? A bucket is heavy, splashes water as you walk with it, is hard to place correctly, and you might pour too little or, more likely, too much. Or you could use a hose: you can place it precisely, you can control the rate of flow, and so on. Software deployment works exactly like watering a garden, and most people are using a bucket, with all its deficiencies. (bucket/hose metaphor from Eishay Smith)
It’s even worse than using a bucket. We have a whole warehouse full of buckets. Just like in manufacturing, developers write code that goes into long processes that have to sit on the shelf of some warehouse. (version control, waterfall processes, etc.) Code that’s sitting on a shelf, not being used, is waste.
So I have to conclude that deploying all the time is the only way to be safe. The smaller the deployment, the less change and risk we are adding to the production system, and by reducing our risk we can deploy more and more often.
Just because you can use this idea to go faster doesn’t mean you’re actually on the path to a better product, more revenue, or whatever your goal is. Here’s a common situation: the product isn’t doing well in the market, releases are falling further and further behind schedule, and management concludes that something must be wrong with how the product is being built. That seems plausible, so the engineers respond by trying to do better: writing more tests, making the product faster or more responsive, or creating more features for customers. Engineers are optimizing, and they are good at it. But then the results don’t change: customers still have problems, or the company just isn’t growing any faster. The problem is that instead of *optimizing* some arbitrary metrics, the company needs to be *learning* what the right metrics are and correlating the engineering work with real value created for the company.
Eric Ries calls this process “validated learning”, and the goal is not to optimize any one step, but to minimize the TOTAL time through the loop. By learning faster we can be most effective and reduce our wasted efforts.
Eric’s algorithm for this has three main points:
- Every problem only once: fool me once, shame on you; fool me twice, shame on me. When a problem happens, write automation to ensure it won’t happen again.
- Stop the line if anything fails: any failure you ignore will only compound other failures in the future, i.e., technical (or other) debt you must repay because you don’t yet understand something well enough.
- Fast response vs. prevention: the 80/20 rule applies; don’t over-optimize your testing processes. If you can detect and fix problems quickly, without major customer or business impact, that’s more effective. Roll forward, don’t roll back.
This algorithm can be applied at many layers, and all are needed to be successful. We need to learn (as fast as possible) at each layer.
At the development layer we follow the techniques above. Regarding no checkins when the build is broken: if the build is broken, then something is wrong and we need to learn something, not only about the broken code itself but about how we’re developing. Perhaps we’re not running the unit tests locally before we check in. Perhaps our commits are too large, making it too hard to prove that we’re not breaking other (dependent) code.
Migrating a database is always tough, and it gets harder when you have many deployments with potential rollbacks. The basic rules: only add fields/columns, and only delete a field when you can prove there are no more consumers of it.
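One way to enforce the additive-only rule is to check migration scripts at build time. Here is a minimal sketch, assuming migrations arrive as SQL text; the patterns are illustrative, not a complete SQL parser:

```java
import java.util.regex.Pattern;

// Sketch of an additive-only migration check: during normal development a
// migration may only ADD columns/tables; destructive statements fail the
// build until every consumer of the old schema is provably gone.
class MigrationCheck {
    // Illustrative patterns for destructive DDL; not exhaustive.
    static final Pattern DESTRUCTIVE = Pattern.compile(
            "(?i)\\b(DROP\\s+(TABLE|COLUMN)|ALTER\\s+TABLE\\s+\\w+\\s+DROP|RENAME)\\b");

    /** Returns true if the migration contains only additive changes. */
    public static boolean isAdditive(String sql) {
        return !DESTRUCTIVE.matcher(sql).find();
    }
}
```

A build step would run every pending migration through `isAdditive` and stop the line on the first destructive statement.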
Testing is obviously about having every problem only once.
- Forbidden calls: fail the build if there are invocations of particular methods or functions that are buggy or indicate bad usage.
- Bad code snippets: disallow certain coding idioms (the check can be as simple as a grep!) and fail the build if they are found.
- Stuff-not-tested-test: for every new X (class, servlet, etc.), make sure that tests exist; otherwise fail the build.
- I/O: you can actually hook into the JVM during tests and fail the build if I/O is used, because you want tests to be fast and I/O is slow.
Examples of forbidden calls and their alternatives.
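A minimal sketch of the forbidden-calls check itself, scanning source text for banned invocations. The specific calls listed here are hypothetical examples, not the author's actual list; a real build would load them from configuration:

```java
import java.util.List;

// Sketch of a "forbidden calls" build check: scan source text for
// invocations we banned after they caused a problem once.
class ForbiddenCalls {
    // Illustrative examples of calls that could be banned.
    static final List<String> FORBIDDEN = List.of(
            "Thread.stop(",     // deprecated and unsafe
            "System.exit(",     // kills the whole JVM from library code
            "printStackTrace(" // bypasses the logging pipeline
    );

    /** Returns the first forbidden call found in the source, or null if clean. */
    public static String findViolation(String source) {
        for (String call : FORBIDDEN) {
            if (source.contains(call)) {
                return call;
            }
        }
        return null;
    }
}
```

The build fails on any non-null result, pointing the developer at the approved alternative.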
Be lazy! If there are higher-order actions to execute across multiple systems, don’t force yourself to click on a bunch of web pages or write extra commands. You can push your intention into the commit messages and have automation perform the actions for you, orchestrating the whole process. For example, #deploy:um means that if the build passes from this commit, deploy the “um” service. No need to go click something.
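The commit-message convention above can be parsed with a few lines of code. This sketch follows the `#deploy:um` syntax from the text; the surrounding build integration (running it only after a green build) is assumed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of commit-message orchestration: after a green build, scan the
// commit message for "#deploy:<service>" tags and collect the services
// that should be deployed automatically.
class CommitIntent {
    static final Pattern DEPLOY_TAG = Pattern.compile("#deploy:(\\w+)");

    /** Extracts every service named by a #deploy: tag in the message. */
    public static List<String> servicesToDeploy(String commitMessage) {
        List<String> services = new ArrayList<>();
        Matcher m = DEPLOY_TAG.matcher(commitMessage);
        while (m.find()) {
            services.add(m.group(1));
        }
        return services;
    }
}
```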
Rather than branch using version control, which is just like putting code on the shelf, branch in code so that at any time you can deploy the main line. Then you can selectively turn on features when they are ready to be tested against a particular user population. Perhaps you start at 5% of the users seeing the new code, and ramp up to 100%.
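Branching in code usually means a feature flag plus a percentage rollout. A minimal sketch, assuming the percentage comes from live configuration (not shown); hashing the user id keeps each user's experience stable as the ramp grows from 5% to 100%:

```java
// Sketch of "branching in code": a feature flag gated by a rollout
// percentage, so the main line is always deployable and the new path
// is turned on for a growing slice of users.
class FeatureFlag {
    /** True if this user falls inside the current rollout percentage. */
    public static boolean isEnabled(String userId, int rolloutPercent) {
        if (rolloutPercent >= 100) return true;
        if (rolloutPercent <= 0) return false;
        // Mask the sign bit so the bucket is always non-negative.
        int bucket = (userId.hashCode() & 0x7fffffff) % 100;
        return bucket < rolloutPercent;
    }
}
```

Because the bucket depends only on the user id, raising the percentage only ever adds users to the new code path; nobody flips back and forth between releases.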
This is the state machine for a service being deployed; coordination services like Apache ZooKeeper make it easy to implement. Canaries: deploy one instance of a service and watch it for a while to see if there are problems. If so, you can roll it back without affecting the majority of your customers; if not, continue with more instances until you’ve upgraded all of them. Self-test: many times you can “only test in production”, so one technique to mitigate risk is to have a service run a self-check to see if it can satisfy its dependencies: connecting to a database, talking to a particular remote service, etc. If it can’t, the self-test fails, the instance is rolled back, and the impact to customers is minimized.
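The canary/self-test flow described above can be sketched as a small state machine. The states and transitions here are an assumption drawn from that description, not a specification; a real implementation would keep this state in a coordination service such as ZooKeeper:

```java
// Sketch of the deploy state machine: a new instance comes up as a
// canary, runs a self-test against its dependencies, and is either
// promoted to live or rolled back before most customers notice.
class DeployStateMachine {
    enum State { DEPLOYING, CANARY, LIVE, ROLLED_BACK }

    private State state = State.DEPLOYING;

    public State state() { return state; }

    /** The canary instance has started and is running its self-test. */
    public void canaryStarted() { state = State.CANARY; }

    /** Self-test result decides: promote to LIVE or roll back. */
    public void selfTestResult(boolean healthy) {
        if (state != State.CANARY) {
            throw new IllegalStateException("no canary running");
        }
        state = healthy ? State.LIVE : State.ROLLED_BACK;
    }
}
```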
Some existing and needed tools for implementing continuous deployment. Bots help orchestrate changes to your environments and give you a shared space for dealing with issues. Create a “shell” for your company, with suites of commands that can query and manipulate the production environment. New commands can be added as needed and it is a great learning tool for everyone. When in doubt, write some code to fix problems!
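The "company shell" idea is essentially a registry of named commands that anyone can extend. A minimal sketch; the commands registered in the usage example are illustrative placeholders, not real production operations:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a "company shell": a growing suite of named commands that
// query and manipulate the environment, with new commands added as
// people learn what they need.
class CompanyShell {
    private final Map<String, Function<String, String>> commands = new HashMap<>();

    /** Anyone can register a new command as a need is discovered. */
    public void register(String name, Function<String, String> command) {
        commands.put(name, command);
    }

    /** Runs a named command, or reports that it doesn't exist yet. */
    public String run(String name, String arg) {
        Function<String, String> cmd = commands.get(name);
        return cmd == null ? "unknown command: " + name : cmd.apply(arg);
    }
}
```

Because the command set grows with the team's needs, the shell doubles as a record of what the team has learned about operating the system.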