This presentation discusses architectural patterns in continuous integration, deployment and optimization, and shares some of the lessons learned at Amazon.com.
The goal of the presentation is to convince you that if you invest your time where you get the maximum learning from your customers and automate everything else in the cloud (CI + CD + CO), you get fast feedback and will be able to release early, release often and recover quickly from your mistakes. The dynamism of the cloud allows you to increase the speed of iteration and reduce the cost of mistakes, so you can continuously innovate while keeping your costs down.
13. Cloud-powered Continuous Integration. Goal: have a working state of the code at any given time. Benefit: fix bugs earlier, when they are cheaper to fix. Metric: a new hire can check out and compile on their first day on the job.
27. Cloud Continuous Integration (diagram): Dev commits to Git/master; the CI server pulls the code and runs distributed builds with tests in parallel; a build report is sent to the dev (stop everything if the build fails); the package builder creates a repo and installs packages; AMIs and CloudFormation templates are generated for each environment (code, config, tests); the deploy server pushes config to the test, staging and prod environments.
32. RDS Snapshots – test something quickly (diagram): snapshot the very large PROD server and its storage, then restore it onto very small test servers (one each for Bob, Ted and Mary) that run only from 7 AM to 6 PM.
34. Cloud-powered Continuous Deployment. Goal: 1-click deploy and 1-click rollback. Benefit: release early, release often and iterate quickly. Metric: fast feedback on your features.
38. Amazon Continuous Deployment stats for May (production hosts and environments only):
- Mean time between deployments (weekday): 11.6 seconds
- Max # of deployments in a single hour: 1,079
- Mean # of hosts simultaneously receiving a deployment: 10,000
- Max # of hosts simultaneously receiving a deployment: 30,000
41. Break your problem into small batches: small deployments, incremental changes, easy rollbacks.
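The 1-click deploy and 1-click rollback goal from the previous slides can be sketched as a deployment history kept as a stack; the class and method names below are illustrative, not an actual Amazon tool:

```python
class DeployHistory:
    """Minimal sketch of 1-click deploy / 1-click rollback.

    Each deploy records the version that is live; rollback simply
    re-activates the previous version. Names are illustrative only.
    """

    def __init__(self, initial_version):
        self._history = [initial_version]  # oldest .. newest

    @property
    def live(self):
        return self._history[-1]

    def deploy(self, version):
        # Small batches: each deploy is one small, easily reversible step.
        self._history.append(version)
        return self.live

    def rollback(self):
        # 1-click rollback: pop the latest version; the previous becomes live.
        if len(self._history) > 1:
            self._history.pop()
        return self.live

fleet = DeployHistory("v1.1")
fleet.deploy("v1.2")
fleet.rollback()        # back to v1.1 in one step
```

Because each deployment is a small increment, the rollback step undoes only a small change, which is exactly what makes rollbacks easy.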
42. Virtual Images = Real Productivity Gain (diagram): the layers of a stack (OS, web server, app server, MVC, caching, DB, packages, libraries, framework, your code) baked into a virtual image. Examples: a Java stack (Linux, Apache, Tomcat, Struts, Hibernate, Spring, Log4J, JEE, your code), a .NET stack (Windows, IIS, ASP.NET, ASP.NET MVC, nHibernate, Spring.NET, Log4Net, your code) and a RoR stack (CentOS, Apache, Mongrel, Rails, memcached, RubyGems, Ruby runtime, logger, your code).
43. Push to an AMI or Pull from an Instance (diagram): three ways to manage your inventory of AMIs on Amazon EC2.
- "Frozen pizza" model: a golden AMI with the full stack (Linux, JEE, Apache, Tomcat, Struts, Hibernate, Spring, Log4J, your code) baked in.
- "Take-and-bake" model: a Java AMI with the base stack; your code and the remaining binaries are fetched from Amazon S3 on boot.
- "Made-to-order" model: a JeOS AMI plus a library of Chef/Puppet recipes (install scripts) that build the whole stack on boot.
44. Cloud Continuous Deployment, simple variant (diagram): Dev commits to Git/master; the CI server pulls the code, builds and runs tests (code, config, tests; a build report is sent to the dev, stop everything if the build fails); the package builder pushes packages to a repo; the deploy server pushes config and launches instances from base AMIs using cloud-init scripts and CloudFormation templates.
47. Cloud Continuous Deployment with configuration management (diagram): the same flow, but the CI server pushes new code to a Chef Server (config, package dependencies) or Puppet Master (manifests, config and mappings), which applies recipes to the managed instances and reports back; the deploy server launches the managed instances from base AMIs with scripts and CloudFormation templates.
48. Automate deployment to multiple Availability Zones (diagram): within an AWS Region, the deploy server targets Availability Zones 1, 2 and 3 (fault-tolerant zones).
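One hedged way to picture zone-by-zone automation is a rolling deploy that stops at the first failing zone, so the remaining zones keep serving the old version (the deploy function here is a stand-in, not an AWS API):

```python
def rolling_deploy(zones, deploy_fn):
    """Deploy to one Availability Zone at a time; stop on the first
    failure so untouched zones keep serving the previous version."""
    done = []
    for zone in zones:
        if not deploy_fn(zone):       # deploy_fn returns True on success
            return done, zone         # zones updated so far, failed zone
        done.append(zone)
    return done, None                 # all zones updated, no failure

updated, failed = rolling_deploy(
    ["us-east-1a", "us-east-1b", "us-east-1c"],
    lambda z: z != "us-east-1c",      # simulate a failure in the third AZ
)
```

With this shape, a failed deploy never takes down more than one zone's worth of capacity at a time.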
50. (diagram): Load Balancing (ELB) in front of a Web Server Fleet (Amazon EC2) running v1.1 and v1.2 instances, backed by a Database Fleet (RDS or DB on EC2); Monitoring (CloudWatch) flags a high error rate.
51. Auto Scaling:
- Launch Configuration (AMI ID, instance type, UserData, security groups ..)
- Auto Scaling Group (min/max # of instances, Availability Zones ..)
- Health Check (maintain min # active ..)
- Scaling Trigger (metric, upper threshold, lower threshold, time interval ..)
- Types of Scaling (scale by schedule, scale by policy)
- Alarm (notification via email, SMS, SQS, HTTP)
- Availability Zones and Regions
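The scale-by-policy trigger above can be sketched as a pure decision function; metric names, thresholds and the step size are illustrative, not AWS API parameters:

```python
def scaling_decision(metric, upper, lower, current, minimum, maximum, step=1):
    """Sketch of a scale-by-policy trigger: compare a metric (e.g. average
    CPU utilization) against the upper and lower thresholds and return the
    new desired instance count, clamped to the group's min/max."""
    if metric > upper:
        desired = current + step      # breach of upper threshold: scale out
    elif metric < lower:
        desired = current - step      # below lower threshold: scale in
    else:
        desired = current             # within the band: no change
    return max(minimum, min(maximum, desired))
```

For example, with thresholds 70/30 and a group of min 2, max 8: a metric of 85 on 4 instances yields 5, while a metric of 20 on 2 instances stays at 2 because the minimum is enforced.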
52. "Auto Scaling" (diagram): Load Balancing (ELB) over a Web Server Fleet (Amazon EC2) with v1.1 and v1.2 instances and a Database Fleet (RDS or DB on EC2); Auto Scaling (min/max instances) grows each fleet via a scaling trigger (custom metrics, upper threshold, lower threshold, increment).
56. Blue-green ramp-up via an A/B Testing Service (diagram): Load Balancing (ELB), Web Server Fleet (Amazon EC2), Database Fleet (RDS or DB on EC2); 99% of traffic goes to v1.1 instances, 1% to v1.2.
57. (diagram): dial up to 90% v1.1 / 10% v1.2; more v1.2 instances come online.
58. (diagram): dial up to 70% v1.1 / 30% v1.2.
59. (diagram): at 70/30, Monitoring (CloudWatch) detects a high error rate; roll back.
60. (diagram): traffic routed back (90% v1.1 / 10% v1.2); debug and fix in dev/test.
61. (diagram): ramp up again to 70% v1.1 / 30% v1.2.
62. (diagram): 50% v1.1 / 50% v1.2.
63. (diagram): 30% v1.1 / 70% v1.2.
64. (diagram): 5% v1.1 / 95% v1.2.
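The dial-up in the slides above can be sketched as a stable hash-based traffic split: each user maps to a fixed bucket, so raising the percentage only moves new users onto v1.2 and everyone keeps seeing a consistent version. The function below is an illustrative stand-in for an A/B testing service, not an AWS feature:

```python
import hashlib

def assign_version(user_id, treatment_percent):
    """Route a user to v1.2 (treatment) or v1.1 (control) based on a
    stable hash of the user id. As the dial-up percentage grows
    (1% -> 10% -> ... -> 95%), users already on v1.2 stay on v1.2."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "v1.2" if bucket < treatment_percent else "v1.1"
```

Because the bucket is derived from the user id rather than drawn at random per request, rolling back is just lowering the percentage in the config.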
65. Split users across multiple experiments: A = Control, B = Treatment 1, C = Treatment 2, D = Treatment 3.
66. (diagram): running several treatments at once behind the A/B Testing Service: 90% of traffic to v1.1, 5% to v1.2, 3% to v1.2.1, 2% to v1.2.2, all behind Load Balancing (ELB) over the Web Server Fleet (Amazon EC2) and Database Fleet (RDS or DB on EC2).
68. Timeline v2.1 example, ORDERID stored as Char:

ID       NAME        ADDRESS  ORDERID (Char)
23234    Joe Doe     xxx      333424
45322    Rob Smith   xxxx     234
2342342  Jane Smith  xxxx     23424
2342265  Anne Lee    xxxx     2342425
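One hedged way to make such a Char-to-Integer change in small, reversible steps is an expand/backfill/contract pattern, shown here with an in-memory SQLite table; the table and column names are made up for illustration:

```python
import sqlite3

# Expand/contract sketch of the Char -> Integer ORDERID change:
# 1) expand: add a new integer column alongside the old one,
# 2) backfill: convert existing data while both app versions run,
# 3) contract: once v2.1 reads only the new column, retire the old one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, orderid_char TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(23234, "Joe Doe", "333424"), (45322, "Rob Smith", "234")])

# Step 1: expand (old app is unaffected, so this deploys with zero downtime)
conn.execute("ALTER TABLE customers ADD COLUMN orderid_int INTEGER")

# Step 2: backfill (old app keeps writing orderid_char; new app writes both)
conn.execute("UPDATE customers SET orderid_int = CAST(orderid_char AS INTEGER)")
rows = conn.execute(
    "SELECT id, orderid_int FROM customers ORDER BY id").fetchall()
```

Each step is independently deployable and independently rollbackable, which is what makes the migration compatible with blue-green style releases.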
87. Continuous optimization of your architecture and cloud infrastructure results in recurring savings on next month's bill.
88. Cost-aware instances (diagram): each instance PUTs custom metrics (free memory, free CPU, free HDD) to an Amazon CloudWatch alarm at 1-minute intervals; after 2 weeks of headroom the alert says: "You could save a bunch of money by switching to a small instance. Click on the CloudFormation script to save."
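The cost-aware alert can be sketched as a simple check over the collected headroom samples; the function name and 60% threshold are illustrative, not part of CloudWatch:

```python
def recommend_downsize(cpu_free_samples, mem_free_samples, threshold=0.6):
    """If both CPU and memory headroom stayed above the threshold across
    the whole observation window (e.g. 2 weeks of 1-minute custom
    metrics), suggest a smaller instance type; otherwise stay put."""
    if not cpu_free_samples or not mem_free_samples:
        return None
    if min(cpu_free_samples) > threshold and min(mem_free_samples) > threshold:
        return "Consider switching to a smaller instance type"
    return None
```

Using the minimum rather than the average means a single busy spike within the window is enough to suppress the recommendation.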
89. Choosing the right pricing model On-demand Instances Reserved Instances Spot Instances
90. Steady-state usage: total cost of 2 application servers.

1-year term:
  Option 1, On-Demand only:       usage fee $1493, one-time fee -,    total $1493, savings -
  Option 2, On-Demand + Reserved: usage fee $1008, one-time fee $227, total $1234, savings ~20%
  Option 3, All Reserved:         usage fee $528,  one-time fee $455, total $983,  savings ~35%

3-year term (same 2 application servers):
  Option 1, On-Demand only:       usage fee $4479, one-time fee -,    total $4479, savings -
  Option 2, On-Demand + Reserved: usage fee $3024, one-time fee $350, total $3374, savings ~30%
  Option 3, All Reserved:         usage fee $1584, one-time fee $700, total $2284, savings ~50%
91.
1. 1-year RI versus On-Demand: cost savings realized after the first 6 months of usage.
2. 3-year RI versus On-Demand: cost savings realized after the first 9 months of usage.
3. 3-year RI versus 1-year RI: net savings begin by month 13 and continue throughout the RI term (an additional 23 months of savings).
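The break-even arithmetic behind these numbers can be sketched as follows; the hourly rates in the example are placeholders, not actual AWS prices:

```python
def breakeven_month(one_time_fee, ri_hourly, od_hourly, hours_per_month=730):
    """First month in which cumulative Reserved Instance cost (upfront
    fee plus discounted hourly rate) drops below cumulative On-Demand
    cost, or None if the RI never pays off within the search window."""
    for month in range(1, 1201):
        ri_total = one_time_fee + ri_hourly * hours_per_month * month
        od_total = od_hourly * hours_per_month * month
        if ri_total < od_total:
            return month
    return None

# Placeholder rates: a $227 upfront fee, $0.03/hr reserved vs $0.085/hr
# on demand, breaks even in month 6, matching point 1 above.
month = breakeven_month(227, 0.03, 0.085)
```

The same function shows why a more expensive hourly rate can never break even: the loop runs out and returns None.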
95. Summary
- Invest in areas where you get to learn from your customers quickly; automate everything else.
- Cloud Continuous Integration: release early, release often, iterate quickly; get fast feedback; distributed builds, automated tests in parallel.
- Cloud Continuous Deployment: reduce the cost of mistakes; increase the speed of iteration; leverage the cloud for blue-green deployments.
- Cloud Continuous Optimization: keep optimizing and further reduce infrastructure costs.
96. Jinesh Varia [email_address] Twitter: @jinman Credits to Jon Jenkins, Paul Duvall and several engineers at Amazon Thank you!
99. Optimize by implementing elasticity (diagram): with traditional hardware, infrastructure cost is a large capital expenditure sized for predicted demand; when actual demand exceeds it, you just lost customers. The cloud's automated elasticity tracks actual demand over time.
100. Reference architecture (diagram): an Amazon Route 53 hosted zone (www.yourApp.com) routes to an Elastic Load Balancer in front of an Auto Scaling group of Amazon EC2 instances (app tier) in AZ-1 of a Region, with an ElastiCache tier, Amazon RDS, Amazon SimpleDB domains, an Amazon S3 bucket, Amazon CloudFront serving static data (media.yourApp.com), Amazon SES for email, Amazon CloudWatch alarms and Amazon SNS notifications.
101. The new cloud-ready enterprise IT (diagram): an Amazon VPC in an AWS Region with VPC subnets in Availability Zones 1 and 2, connected over a VPN gateway and customer gateway to the corporate data center, corporate headquarters and branch offices, plus a 10G Direct Connect location, an internet gateway and router, and access to Amazon S3, Amazon SimpleDB, Amazon SES and Amazon SQS.
Editor's Notes
Jinesh Varia – Penn State; data center and FDIC background, and 5.5 years at AWS.
Customers are at the center of all equations and discussions. We continuously listen to customers for feature requests that will make things easier. At times, when customers don't know what they want, we rely on statistical significance and confidence intervals to understand the behavior that is actually impacting the business. Hence learning from customers should, in my opinion, be given more importance than anything else, so that we can get fast feedback, change the course of action, steer in the right direction and align ourselves in a way that makes sense.
Typical workflow from Idea to Release
Focus on learning; automation with the cloud helps you learn faster and iterate quickly.
There are 3 main ideas
Poka-yoke is a design practice that originated at Toyota in Japan; it is essentially about mistake-proofing. Design systems such that customers don't make mistakes when they use them. For example, if you were designing an ATM and focusing on how the card will be read, a design where the machine takes the card in means there is a probability the customer forgets it in the machine, or the mechanics fail. A poka-yoke design instead uses swiping: the customer will never forget the card because it stays in their hand, and the design is less complex. Continuous integration gives a poka-yoke design to your build and test systems. Developers get alerted about bugs ahead of time, when they are cheaper to fix.
Everything should be in version control: OS, patch levels, OS configuration, your app stack, its configuration, infrastructure config. Infrastructure is code, build scripts are code, test scripts are code.
Automate everything. People take more time, make errors, are not auditable, and are expensive. Plus the work is boring and repetitive; they have better things to do.
Tools soup
Have only one way to deploy, so you know exactly who deployed, which version, with what configuration, to which hosts/machines, and you get easy rollback.
Leverage the cloud for distributed builds and reduce your build and test run time so you get fast feedback
Stop when you need to: run your builds three times a day and, in between, pay only for storage.
Several tools available – most have support for EC2/cloud
You can configure Bamboo to start and shut down elastic instances automatically, based on build queue demands. Please refer to Configuring Elastic Bamboo for more information.
Let's put everything in the context of a web application.
You can put this all into a CloudFormation script. I recommend that you put that in version control and let it manage the resources for you. The CloudFormation engine will manage the ordering, rollback, updates etc.
Architecture for cloud continuous integration:
1. Developer commits to git/master.
2. CI server pulls from git/master.
3. CI server runs all regression tests against git/master.
4. CI server returns a report to the developer and halts if any test fails.
5. CI server builds the artifacts.
6. CI server pushes the project manifest to the image manager.
7. Package builder (PB) gets configuration information from the version control system.
8. PB builds packages (gem, .deb, rpm ..) and stores them in the repo.
9. PB installs dependencies on an instance/host and creates one or more virtual images.
10. PB generates a CloudFormation template from the config for each environment and stores them in the repo.
11. CI server calls the deployment server to deploy the image to a given environment (using the CloudFormation template).
12. The deployment server deploys the image to instances using CloudFormation scripts and alerts the monitoring service.
The key advance was using our continuous build system to build not only the artifact from source code, but the complete software stack, all the way up to a deployable image in the form of an AMI (Amazon Machine Image for AWS EC2).
This is where the cloud really shines: where else can you get hundreds or thousands of machines, run them in parallel, and shut them down when you are done? The same goes for benchmarking. Testing in the cloud means automated, virtual test labs that are live only when you need them.
- Stress test in the cloud: create AMIs with libraries and dependencies; add compute power when needed and turn it off to reduce costs; find the sources of latency, potential crashes and points of failure; get profile information through logs and instrumentation, and measure.
- Load test in the cloud: generate load from one Availability Zone to test another; start a pre-configured test box (EC2 instance) in minutes. For example, 100 concurrent browsing users that randomly click on links, increased by 100 users every 10 minutes until the total expected load of 100,000 users is reached.
- Performance test in the cloud: test at global scale, measuring latency from different parts of the world and how fast a page loads for a given user in a given state; store all instrumentation data in S3 or SimpleDB.
- Web app testing: browser-based Ajax/Selenium testing from different Availability Zones (US and EU); create different deployment environments using scripts.
- Usability testing: use an on-demand workforce.
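The parallel test-run idea can be sketched locally with threads standing in for cloud instances; the suite names and the trivial test body are made up:

```python
import concurrent.futures
import time

def run_test(name):
    """Stand-in for one test suite; in the cloud, each of these would
    run on its own EC2 instance instead of a local worker thread."""
    time.sleep(0.01)          # simulate the suite taking some time
    return name, "PASS"

suites = [f"suite-{i}" for i in range(8)]

# Fan the suites out across workers and collect the results; total wall
# time approaches the slowest single suite rather than the sum of all.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(run_test, suites))
```

The payoff is the feedback loop: eight suites finish in roughly the time of one, and the workers (or instances) are released the moment the run ends.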
Interesting things can be done now: restore production environments and fix bugs at any time.
Software inventory: put code into prod as fast as you can. Code that is not in production and in the hands of the customer is lost revenue. We really want to iterate quickly, deploy often, test, get instant feedback and roll back if necessary, but get it out to the customer as quickly as possible.
Apollo environments and stages, month of May: at Amazon, someone kicks off a deployment into production every 11.6 seconds on an average weekday. For the month of May, we kicked off a total of 1,079 deployments in a single hour into prod. On average around 10,000 hosts receive a deployment simultaneously across a whole bunch of environments, with a peak of around 30,000 hosts simultaneously receiving a deployment.
The speed of iteration matters: not whether you are making the right move, but whether you are making the move really, really fast. This is great for people like me who often make wrong moves; I can iterate quickly and find the right solution through trial and error, failing fast, without fear of failure. It is really important to keep iterating, because it is through iteration and incremental improvement that we learn what works next and build best practices.
We have to be wrong a lot in order to be right a lot. The cloud really helps you reduce the cost of failure.
And we can do that by breaking your deployments into a stream of small deployments.
The first step is to create bootstrapped instances so that when they come up they automatically install everything you need. You can do that in two ways: have a preconfigured AMI, or install everything at boot time.
Simple architecture for continuous deployment using cloud-init:
1. Developer commits to git/master.
2. CI server pulls from git/master (on every check-in).
3. CI server runs all regression tests against git/master.
4. CI server returns a report to the developer and halts if any test fails.
5. CI server builds the artifacts and packages and stores them in a repository (for faster installs).
6. CI server pushes the project manifest to the image manager.
7. Deploy server provisions cloud resources.
8. Deploy server deploys to instances using CloudFormation scripts or EC2 APIs, leveraging base AMIs bootstrapped with cloud-init.
Cloud-init and CloudFormation.
Simple architecture for continuous deployment using Chef and/or Puppet: the same workflow as the cloud-init note above, except that a config server (a Chef Server or Puppet Master) manages the chef nodes or puppet clients, generates reports and provides a UI for the managed instances. The CI server sends new code directly to the config server, and the deploy server provisions cloud resources and deploys via CloudFormation scripts or EC2 APIs, leveraging a base AMI with the client (chef-client or puppet agent) pre-installed.
Automate deployment to multiple Availability Zones, and increase the number of AZs as needed.
So how can you deploy quickly and without any downtime?
One simple technique is blue-green deployment. Green is your current deployment; you deploy the new version to the blue environment and switch traffic over.
BG + Autoscaling = Cost savings + Scaling up
BG + dark launches, in which you deploy both versions side by side but customers don't know about it. There are two separate code paths, but only one is activated; the other is activated by a feature flag. This can be your beta test, where you explicitly turn the feature on.
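A dark launch can be sketched as a feature flag guarding the two code paths; the flag name and checkout function below are hypothetical:

```python
# Dark-launched: the new path ships in the same deploy but stays off
# until the flag is flipped. Flag and function names are made up.
FLAGS = {"new_checkout": False}

def checkout(cart, flags=FLAGS):
    """Both code paths are deployed together; the flag decides which
    one runs. Flipping the flag is the beta-test switch, no redeploy."""
    if flags["new_checkout"]:
        return ("v2", sum(cart))   # new path, dark until enabled
    return ("v1", sum(cart))       # current path

assert checkout([10, 5]) == ("v1", 15)   # flag off: old path
FLAGS["new_checkout"] = True
assert checkout([10, 5]) == ("v2", 15)   # flag on: new path, same deploy
```

Turning the feature off again is just as instant, which makes the flag itself a rollback mechanism independent of the deployment pipeline.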
The goal is to get instant feedback from your users and see how your application behaves under real load. A/B testing is great: control and treatment. http://20bits.com/articles/an-introduction-to-ab-testing/ Prove the success of interface changes; prove interest in new features.
What if you have thousands of nodes? Blue-green deployment works well, but you may not want to spin up an additional 100% of capacity. Instead, you can spin up gradually in small increments of 10%, so you only pay for an additional 10% of your nodes, and you learn more about customer behavior too, by routing only a small share of users to the newly deployed code base. Start small, with 1%: A/B testing randomizes the traffic and collects feedback on what is and isn't working, with a simple configuration change, while Auto Scaling handles the scaling. You don't want to send all your traffic to the new nodes just yet. Note: you have tested the whole application and brought up additional capacity before you route real traffic.
Slowly ramp up traffic by simply changing the configuration file in the A/B testing service, and watch Auto Scaling kick off additional instances as needed. Note that you are not disrupting user behavior, because each user only ever sees the version they hit.
As you slowly ramp up traffic, you learn whether the feature is working for your customers, and you learn more about the customer.
Pager beeping. Something went wrong. Time to do rollback. Rollbacks are easy…
Change the config file and route all the traffic back to your previous hosts. Test what went wrong, fix it, and deploy new instances using the same process.
Start ramping up again…..
You are gradually deploying new software, using the dynamism of the cloud with zero downtime. And that’s cool.
Autoscaling handles all the capacity planning for you.
The dial-up is based on users, so each user gets a consistent experience. The beauty of this approach is that you can run multiple measured experiments on different features for your customers, get instant feedback, and deploy only if it makes sense. Based on statistical significance and confidence intervals, you determine which feature is really causing the improvement and creating an impact on the business.
What about databases? They can be tricky; they are not as simple to cut over from green to blue. For most practical purposes I would recommend using a non-relational database, but you can also do blue-green style database deployments with zero downtime, using several approaches. Approach #3 is the best, because you leverage the decoupling of the DB from the application and break one big job into multiple tasks.
A simple example: changing a database column's schema from character to integer.
Autoscaling in recurring mode
Rollbacks will be easy, because you can deploy the application and DB combination that worked in the previous iteration.
Again, not everything can be done when it comes to RDBMS migration, but you can migrate the database and the application in innovative ways, deploying a new version in a series of steps with zero downtime and easy rollback.
The goal is to do it in small batches
By doing this, and continuously deploying new versions, you reduce the cost of making mistakes.
And you just saw how cloud really helps here
You keep optimizing, further reduce your cost, and impress your boss; an individual can make a significant improvement to the bottom line. With one small DB optimization that reduced the database footprint, you saved the company a bunch of money right away. This does not happen outside the cloud, where even if you implement an optimization, you have already paid for the infrastructure. Again, poka-yoke: the cloud really helps you design right, so you work towards efficiency.
Only happens in the cloud
Build websites that sleep at night. Build machines that only live when you need them.
Shrink your server fleet from 6 to 2 at night and bring it back in the morning.
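The shrink-at-night idea can be sketched as a scale-by-schedule rule; the capacities and hours below are illustrative, not a recommendation:

```python
def desired_capacity(hour, day_capacity=6, night_capacity=2,
                     day_start=7, day_end=18):
    """Scale-by-schedule sketch: run the full fleet of 6 servers from
    7 AM to 6 PM and shrink to 2 overnight, mirroring the idea of
    websites that sleep at night. Values are illustrative."""
    return day_capacity if day_start <= hour < day_end else night_capacity

assert desired_capacity(10) == 6   # mid-morning: full fleet
assert desired_capacity(23) == 2   # overnight: shrunk fleet
```

In practice, such a rule would drive the Auto Scaling group's desired count via a scheduled action, so the fleet contracts and expands without anyone on call.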
Now, this is just the beginning; we really haven't spent enough time here. But one developer wrote a script that dumps all the free-resource data as custom metrics to CloudWatch, and if resources sit idle for 2 weeks it automatically alerts the user to save money. Now that is cool! This intelligence can now be baked into your platform. It can only happen when you have a great API and you pay only for the infrastructure you use.