This is a session given by Matt Boyle at the Nordic APIs 2016 Platform Summit on October 25th in Stockholm, Sweden.
Description:
We’ve spent a lot of time over the years at Shapeways building, honing, and improving our deployment and test process for our web properties and API. We started with straight-to-prod commits (which caused quite a bit of downtime!), graduated to working in two- and then one-week release cycles (which caused a lot of anxiety!), to where we are today: releasing 5-15 times a day, with automated testing, using continuous improvement and delivery best practices and tools. We’ve taken the complexity and anxiety out of our deployment process by implementing ChatOps, or using a bot to handle the sorts of tasks computers are great at, namely performing complex tasks repeatedly without error. This enables humans to focus on tasks that we’re uniquely suited for, namely solving complex problems and architecting reliable, resilient, and scalable solutions for our users. We’d love to share some of what we’ve learned along the way, from building automated testing tools, to selecting and implementing open-source solutions, to how we took our global deployment process from one hour to 4 minutes. We’d also like to share our vision of the future: what inspires us, what we hope to achieve in the coming weeks, months, and years, and how we’re going about doing it.
2. What is Shapeways?
The world's largest 3D printing service and marketplace
Marketplace, service hub, and community hosted at Shapeways.com
REST API provided at api.shapeways.com
6. Releasing Software Gets REALLY hard
Build artifact
Run tests
Deploy to multiple datacenters
Intra-DC deployment to servers
Manage artifacts on server
Rotate through load balancer
Generate documentation
Restart services
Detect regressions
Monitor for breakages
Deployments like this take time and resources
If you don't automate, you'll spend as much time running deployments as you do writing code
10. Pro’ing up - Jenkins
Active-active Jenkins instances – one per datacenter
Use the Jenkins API for remote execution of deployments and test jobs
Flow-based deployment using the Jenkins Pipeline DSL
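The remote-execution piece relies on Jenkins' standard REST endpoint for parameterised builds. A rough sketch of triggering a deploy job is below; the host, job name, credentials, and parameters are made up, and depending on Jenkins security settings a CSRF crumb may also be required.

```python
# Rough sketch of remote job execution against Jenkins' standard REST API.
# Host, job name, credentials, and parameters are hypothetical.
import requests

JENKINS_URL = "https://jenkins.example.internal"
AUTH = ("deploy-bot", "jenkins-api-token")  # Jenkins user + API token

def trigger_job(job: str, **params: str) -> None:
    """Queue a parameterised build; Jenkins answers 201 when the build is queued."""
    resp = requests.post(
        f"{JENKINS_URL}/job/{job}/buildWithParameters",
        auth=AUTH,
        params=params,
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    # e.g. kick off a deploy of a given branch in one datacenter
    trigger_job("deploy-web", BRANCH="release-2016-10-25", DATACENTER="dc-nl")
```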
11. Pro’ing up - Coyote Framework
In-house FOSS testing framework for integration and functional testing
Thoughtfully developed testing infrastructure to ensure consistent, reliable, and maintainable tests
Two-level API testing: contract enforcement and functional flow
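Coyote's own API isn't shown in the talk, so the pytest-style sketch below only illustrates the two levels against a hypothetical /models endpoint: a contract test that checks response shape, and a functional-flow test that chains calls the way a real client would.

```python
"""Illustrative only: not Coyote code. Endpoints and contract fields are invented."""
import requests

API = "https://api.example.com"  # stand-in for the API under test

def test_contract_get_model():
    # Level 1 - contract enforcement: check the shape of the response, not business logic.
    resp = requests.get(f"{API}/models/123", timeout=10)
    assert resp.status_code == 200
    body = resp.json()
    for field in ("modelId", "title", "materials"):  # hypothetical contract fields
        assert field in body, f"contract violation: missing {field}"

def test_functional_flow_upload_then_price():
    # Level 2 - functional flow: chain calls the way a real API client would.
    upload = requests.post(f"{API}/models", files={"file": b"...stl bytes..."}, timeout=30)
    assert upload.status_code == 201
    model_id = upload.json()["modelId"]

    price = requests.get(f"{API}/models/{model_id}/price", timeout=10)
    assert price.status_code == 200
    assert price.json()["price"] > 0
```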
12. Pro’ing up - ACME
In-house developed test management application
Schedule test runs on demand for developers
Test analysis and intelligence
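ACME is internal to Shapeways, so the sketch below only illustrates the on-demand scheduling idea: provision an environment, run the requested suites, and always tear the environment down again. All names are invented.

```python
"""Sketch of on-demand test scheduling; not ACME code."""
from contextlib import contextmanager

@contextmanager
def test_environment(branch: str):
    print(f"provisioning QA environment for {branch}")  # prep + spin up
    try:
        yield f"qa-{branch}"                             # hand the environment to the run
    finally:
        print("cleaning and spinning the environment down")

def run_scheduled_tests(branch: str, suites: list[str]) -> None:
    with test_environment(branch) as env:
        for suite in suites:
            print(f"running {suite} against {env}")      # e.g. delegate to Jenkins/Coyote

if __name__ == "__main__":
    run_scheduled_tests("feature-new-pricing", ["api-contract", "checkout-flow"])
```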
13. Pro’ing up - ChatOps via Hubot
GitHub-developed extensible chat bot written in CoffeeScript
Manages deployments, development/QA infrastructure, and permissions
Acts as the interface between our different tools: Jenkins, Coyote, ACME, Slack
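Hubot scripts themselves are written in CoffeeScript; purely to illustrate the pattern, here is a Python sketch of the bot's role as glue: parse a chat command, hand it to a deploy job, and reply in the channel. The command syntax and wiring are assumptions, not the Shapeways setup.

```python
"""Python sketch of a ChatOps command handler; not a Hubot script."""
import re

DEPLOY_RE = re.compile(r"^deploy\s+(?P<branch>\S+)\s+to\s+(?P<dc>\S+)$", re.IGNORECASE)

def handle_message(text: str, reply) -> None:
    match = DEPLOY_RE.match(text.strip())
    if not match:
        reply("usage: deploy <branch> to <datacenter>")
        return
    branch, dc = match.group("branch"), match.group("dc")
    # In the real setup the bot would call the Jenkins API here (see the
    # buildWithParameters sketch above) and report progress back to Slack.
    reply(f"queuing deploy of {branch} to {dc}")

if __name__ == "__main__":
    handle_message("deploy release-2016-10-25 to dc-nl", print)
```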
14. Looking Ahead – Staged Rollouts
Present deployment is fast but requires us to be all-in on new code
Ability to use infrastructure for new-feature A/B testing rather than user groups
Increased deployment time, but faster recovery should something go wrong
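A rough sketch of what a staged rollout could look like: drain a node from the load balancer, deploy to it, re-admit it at a small traffic share, and only ramp it back up once the new code looks healthy. The load-balancer client and node names below are stand-ins, not a real API.

```python
"""Sketch of the staged-rollout idea; the LB client and node names are invented."""

class LoadBalancer:
    def set_weight(self, node: str, weight: int) -> None:
        print(f"lb: {node} -> weight {weight}")  # stub for a real load-balancer call

def staged_rollout(lb: LoadBalancer, nodes: list[str], artifact: str,
                   canary_weight: int = 10) -> None:
    for node in nodes:
        lb.set_weight(node, 0)                    # drain the node
        print(f"deploying {artifact} to {node}")  # install artifact, restart services
        lb.set_weight(node, canary_weight)        # re-admit at a small traffic share
        # Watch error rates and latency here; set the weight back to 0 to roll
        # back, or ramp up once the new code looks healthy.
        lb.set_weight(node, 100)

if __name__ == "__main__":
    staged_rollout(LoadBalancer(), ["dc-nl-web-1", "dc-nl-web-2"], "release-2016-10-25")
```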
15. Looking Ahead - Swagger for all APIs, int+ext
Easier documentation generation
Client generation/update for multiple languages
Source of truth for service-based applications – contract enforcement, test generation
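One way the spec can act as a source of truth is contract checking: validate live responses against the schema the Swagger document declares. The schema fragment and endpoint below are invented, and jsonschema is used only as a stand-in for full Swagger-aware tooling.

```python
"""Sketch of contract checking against a spec-derived schema; data is invented."""
import requests
from jsonschema import validate  # pip install jsonschema

MODEL_SCHEMA = {  # would normally be read out of the Swagger document itself
    "type": "object",
    "required": ["modelId", "title"],
    "properties": {
        "modelId": {"type": "integer"},
        "title": {"type": "string"},
    },
}

def check_contract(url: str) -> None:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    validate(instance=resp.json(), schema=MODEL_SCHEMA)  # raises if the contract is broken

if __name__ == "__main__":
    check_contract("https://api.example.com/models/123")  # hypothetical endpoint
```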
16. Looking Ahead - Improved Testing Tooling
Generate testing profiles based on changes made in the codebase
Code coverage for functional testing via URL mapping
Performance and capacity measurement based on previous test runs and environments
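Change-based test selection boils down to a dependency map from endpoints (or modules) to the tests that exercise them. The map below is hand-written for illustration; the idea in the talk is to derive it automatically from the codebase (for example via an AST walk) so that only affected tests run.

```python
"""Sketch of change-based test selection; the dependency map is illustrative only."""

# endpoint/module -> tests that exercise it
DEPENDENCY_MAP = {
    "models/info": ["test_model_info_contract", "test_model_pricing_flow"],
    "cart": ["test_add_to_cart", "test_checkout_flow"],
}

def tests_for_change(changed_modules: set[str]) -> list[str]:
    selected: set[str] = set()
    for module, tests in DEPENDENCY_MAP.items():
        if module in changed_modules:
            selected.update(tests)  # run everything affected by the change...
    return sorted(selected)         # ...but nothing more

if __name__ == "__main__":
    # e.g. a change that only touches the model-info pricing function
    print(tests_for_change({"models/info"}))
```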
Editor's Notes
We’re a large service and marketplace – thousands of models a day uploaded and ordered for manufacturing
Global manufacturing and fulfillment – internal factories in the Netherlands and New York, external partners globally
Marketplace, community, makerspace hosted at shapeways.com
API provided for customers who prefer to own their own experiences – can upload, curate, add to cart, and purchase models via API just as you would via website
Consider your time
Time spent doing mechanical work == time not spent doing real work
Broken/bad releases
Git pull!
Site Up?
Margaritas!
Git pull! ….apache reload? Symlink new directory? Build FrontEnd?
This starts to take some time.
Build Artifact?
Deploy to multiple servers? Multiple Datacenters?
Rotate through Load Balancer?
Generate documentation?
Prevent regression?
Watch for breakage!
Buildbot-managed svn up on prod servers - ouch
Deployment jobs took up to 90 minutes due to server load
Unit tests existed, run by developers
Manual QA/Acceptance process
No integration tests to ensure API integrity
Artifact deployment based on Jenkins - single master, procedural deploys
Continuous Integration - Unit Tests run on artifact build on trunk
Functional Tests running against QA envs hourly via jenkins
Integration tests running against APIs to ensure integrity
Requiring functional tests to ship w/ code
One Jenkins master per DC controlling deployment – talk about how this abstracts responsibility for deployments to the DCs themselves
Leverage Jenkins API for job start/stop – Permits remote execution, enabling us to run our infrastructure anywhere
Use the Jenkins Pipeline DSL to create a clear, flow-based deployment process for visibility and speed – talk about reuse as well here
In-house FOSS – we didn't see anything quite like this out there. Acknowledge build bias. Talk about open sourcing
Test infrastructure – again focus on reuse and reliability
API testing – contract enforcement++, becoming a key component of our development process as we move towards services. Talk about functional flow and verification, and how Coyote can do this where other platforms need to be glued together
The Jenkins UI has its limitations, and we weren't interested in plugin development – we needed to know more about our tests than pass/fail rates (performance profiling, branch comparison, etc.)
Test scheduling means that there’s no more waiting for a test environment: acme preps, spins up, tests, cleans, and spins down environments as needed.
Test intelligence is key here: pass rates of tests, re-run of flaky tests, comparison with production run helps identify and resolve issues more quickly
1. ChatOps not invented here, but embraced fully – thanks GitHub! We do this in slack
2. Talk about ease-of-deployment, ownership, etc here – drive home correlation between developers ownership and engagement
3. Hubot allows us to again abstract away the need for developers to know how every part of our process works: they engage with the tools, they don't have to know how they're built. Car analogy here.
1. Presently, we deploy to all nodes at the same time. Fast, but also sudden, and requires us to be all-in on new code from a traffic perspective
Instead rotate nodes out of LB, deploy, then put back in at certain traffic percentage
Safety - does new feature work?
Performance management - did we make it faster?
Analytics/intelligence - do people like it?
We fake this today w/ feature-flagging via user groups. This works, but it isn't unbiased: people who like new things sign up for beta groups, making them more likely to react positively to change than your average user
Doing this would increase deployment time, but would also improve our ability to react quickly if something goes wrong. Rotate out of the load balancer, for example
Currently using doxygen: works fine, but limited in what it can deliver for us. Prefer dynamic documentation generation which can be used to build a quality doc hub
Free clients for many languages: removes need to actively maintain and update a broad collection of clients
Source of truth helps remove ambiguity around implementation intentions. It works or it doesn't. This also lets us perform test generation to ensure that our applications are meeting the contract expectations carried by the API
1. Currently run all tests, all the time - takes 30-40 mins. Requirement for larger or core changes to prevent regression. Sometimes overkill for smaller changes - what if I just changed the PUT /models/<modelid>/info endpoint pricing function?
Determine tests to run by AST dependency map - which tests leverage the /models/<modelid>/info endpoint?
Run all of them, but nothing more - nothing else is impacted by this change.
2. Code “coverage” via URL detection. Which URLs were called by this test run, and which underlying controllers/managers/what have you did they trigger? Lets us know roughly if we’re testing the things we think we’re testing
3. Performance profiling will let us better understand the impact of our changes in terms of response times. Capacity profiling, achieved via monitoring our testing infrastructure for load, etc during test runs will help us measure our ability to scale