Changing Etsy's Architectural Foundations with Continuous Deployment

•Download as ODP, PDF•

6 likes•1,755 views

Matt Graham

Technology

Marketplace for
Handmade Goods

Gross Sales 2011: $537 million
Total Members: 19 million
Items For Sale: 15 million
Uniques / month: 40 million
Page Views / month: 1.4 billion

Passing Time => Change
● Scale
● Product

● Technology

● Engineering Team

Passing Time => Change
● Scale
● Product

● Technology

● Engineering Team

● The Correct Architecture

Changes

What it's all about

● Reduce Failure Time
● Start with Culture

● Tools Help

● Enable the Unfeasible

Six Bugs with
Monthly Deploys
4 caught
--->

2 missed
<---

fix live:
24 hours

Six Bugs with
Continuous Deploys
2 caught
--->

4 missed
<---

fix live:
6 hours

Failure Time

2 Bugs * 24 Hours = 48 BH
4 Bugs * 6 Hours = 24 BH

Minimize BugHours
24 < 48

Cost of Recovery

Photons Minimal

Electrons Low

Protons & Neutrons High

Humans Prohibitive

Cost of Recovery

$6 million in 1973 = $31m today

Good Excuses

● Infrequent Changes
● Infrequent Executions

● Life and Death

● Physical Investment

Culture Before Tools

● Throw out the deploy schedule

● Ship changes when tested & ready

● Software is stable & supported

Jenkins

● Unit Tests
● Functional Tests

Jenkins

● Unit Tests
● Functional Tests

● Manual Testing

Nagios & Naglite2

github.com/lozzd/Naglite2

StatsD
if ($success) {
StatsD::timing('query.runtime', $time);
} else {
StatsD::increment('query.failure');
}

github.com/etsy/statsd

Practices @ Etsy

Feature Flags
Customer Communication

Dark Launch

def get_payment_link():
return ...

Dark Launch

def get_payment_link():
if enabled('creditcards'):
return creditcard_link()
else:
return check_link()

Dark Launch

application_config:
- creditcards: admin
- NewFeatureB: off
- NewFeatureC: on

Ramp Up

application_config
- creditcards: 1%
- NewFeatureB: off
- NewFeatureC: on

Ramp Up

application_config
- creditcards: 5%
- NewFeatureB: off
- NewFeatureC: on

Whoops!

application_config:
- creditcards: admin
- NewFeatureB: off
- NewFeatureC: on

Ramp Up

application_config
- creditcards: 25%
- NewFeatureB: off
- NewFeatureC: on

Credit Cards are ON

application_config
- creditcards: 100%
- NewFeatureB: off
- NewFeatureC: on

AB Testing

● Prove success of interface
changes

● Prove interest in new
features

Deployment is First Class

Deployment is a
First Class Feature

Examples from Etsy

● Photos From Twisted to PHP

● PostgreSQL to MySQL Shards

From Twisted to PHP
● Run Apache/PHP on a new port
● Implement one service in PHP

● Ramp up users on new service

● Repeat for each service

● Shut down Twisted version

PostgreSQL to MySQL Shards
● Migrate table by table
● Tee writes to both DBs

● Copy old data from PostgreSQL

● Verify data matches

● Ramp up reads from MySQL

● Stop PostgreSQL writes

Continuous Deploy Pattern
● Change in small steps
● Dark launch via config

● Iterations to prod while dark

● Maintain old & new in parallel

● Ramp up new architecture

● Remove old architecture

Once Again
● Minimize BugHours
● Trash the Schedule

● Iterate on the Tools

● Make Big Changes

Changing Etsy's
Architectural Foundation
with
Continuous Deployment
Matt Graham
http://twitter.com/lapsu
http://lapsu.tv

Core Engineer @ Etsy
Continuous Deployer
http://codeascraft.etsy.com
http://www.etsy.com/careers

What's hot

Contributing to Koha

Libriotech

Gitlab meets Kubernetes

inovex GmbH

Managing Magento Projects by Viacheslav Kravchuk from Atwix

Atwix

Ctndeck 2 1-2011

Aaron Cohen

OpenText MBPM Q&A Webinar

convedo Group

Ivan Dryzhyruk “Ducks Don’t Like Bugs”

LogeekNightUkraine

What's hot (6)

Contributing to Koha

Gitlab meets Kubernetes

Managing Magento Projects by Viacheslav Kravchuk from Atwix

Ctndeck 2 1-2011

OpenText MBPM Q&A Webinar

Ivan Dryzhyruk “Ducks Don’t Like Bugs”

Viewers also liked

افضل برنامج محاسبة للفنادق من شركه المنارة هو أفضل برنامج محاسبى لاداره الحجوزات الفندقيه فهو يعد برنامج متكامل وبسيط كبرنامج لتنظيم الحجوزات الفندقية و برنامج ادارى للفنادق وهومن اهم واقوى البرامج المحاسبية التى حدثت فى مجال البرمجه الخاصة بمجال اداره الفنادق والشقق الفندقية

برنامج محاسبة للفنادق

almanara web

فنادق رخيصة فى الخبر Holdinn.com.sa -

holdinnsa

Hyatt Hotel Typical Floor

mcbaldwin

Hotel Architect of the Year

sidb7

MODEL MAKING_WEBSITE(LOW RES)

Mehnaj Tabassum

الاعتبارات البصرية و أسس الاضاءة في المباني

Ahmad Fahed

Burj Al Arab, Tower Of The Arabs

Laura Domínguez

5 star hotel desing.compressed

Mehnaj Tabassum

Alternaty is pleased to launch the first issue of a multi part series entitled “Common Mistakes in Hotel Planning and Operation”. The series aims to highlight the most common mistakes made by hotel developers and offers advice on how to avoid them. This first issue, Common Mistakes in Hotel Design, highlights the most common mistakes made during the design process that can have long lasting negative impacts on operations. See our blog at http://blog.alternaty.com/ to download our latest exclusive releases. Table of content: - Hotel room design - Bathroom design - Food and beverage - Lobby and public areas - Back of house - Spa, gym and swimming pool - Elevators and corridors - About Alternaty Stay tuned for the next issues!

Alternaty - Common mistakes in hotel design

Alternaty

Design of hotel

Ahmed SHoukry ELhfnawy

Portman Hotel Case Study Analysis

Mohammad Mohtashim

Case Study -Hotel design

hebasayeed

Hotel Design - Midpoint Thesis Book

rajensen00

Burj khalifa

Safa Aboelssaad

Viewers also liked (14)

برنامج محاسبة للفنادق

فنادق رخيصة فى الخبر Holdinn.com.sa -

Hyatt Hotel Typical Floor

Hotel Architect of the Year

MODEL MAKING_WEBSITE(LOW RES)

الاعتبارات البصرية و أسس الاضاءة في المباني

Burj Al Arab, Tower Of The Arabs

5 star hotel desing.compressed

Alternaty - Common mistakes in hotel design

Design of hotel

Portman Hotel Case Study Analysis

Case Study -Hotel design

Hotel Design - Midpoint Thesis Book

Burj khalifa

Similar to Changing Etsy's Architectural Foundations with Continuous Deployment

Authors: Adrian Perreau de Pinninck, Manu Cupcic Conference: CAS 2014 (Barcelona) Have you ever wondered how large software companies with an engineering culture make sure they are able to deliver software over and over to production? How do you coordinate 100+ software engineers so that there are no bottlenecks and quality is not compromised? In this talk you will see how a Continuous Delivery system was implemented at Criteo, the fastest growing IT company in EMEA 2012. Before starting the project there were 160+ code repositories with dependency hell. They were being built independently and releases to production were error prone and painful. You will see the technical architecture behind a successful implementation of a Continuous Delivery system. The system was made up of a Gerrit code review tool connected to a Jenkins build pipeline, building 160 repositories with over 7M lines of code. We will explore different architectural choices such as branching system, hot fixes, sandbox and pre-production environments, and how these were developed and used by the large R&D department.

Delivery at Scale

Agilar

Have you ever wondered how large software companies with an engineering culture make sure they are able to deliver software over and over to production? How do you coordinate 100+ software engineers so that there are no bottlenecks and quality is not compromised? In this talk you will see how a Continuous Delivery system was implemented at Criteo, the fastest growing IT company in EMEA 2012. Before starting the project there were 160+ code repositories with dependency hell. They were being built independently and releases to production were error prone and painful. You will see the technical architecture behind a successful implementation of a Continuous Delivery system. The system was made up of a Gerrit code review tool connected to a Jenkins build pipeline, building 160 repositories with over 7M lines of code. We will explore different architectural choices such as branching system, hot fixes, sandbox and pre-production environments, and how these were developed and used by the large R&D department. Authors: Adrian Perreau de Pinninck, Manu Cupcic

Delivery at Scale

Adrian Perreau de Pinninck

The promise of DevOps is that we can push new ideas out to market faster while avoiding delivering serious defects into production. Andreas Grabner explains that testers are no longer measured by the number of defect reports they enter, nor are developers measured by the lines of code they write. As a team, you are measured by how fast you can deploy high quality functionality to the end user. Achieving this goal requires testers to increase their skills. It’s all about finding solutions—not just problems. Testers must transition from reporting “app crashes” to providing details such as “memory leak caused by bad cache implementation.” Instead of reporting “it’s slow,” testers must discover “wrong hibernate configuration causes too much traffic from the database.” Using three real-life examples, Andreas illustrates what it takes for testing teams to become part of the DevOps transformation—bringing more value to the entire organization.

DevOps: Find Solutions, Not More Defects

TechWell

Gatling - Bordeaux JUG

slandelle

Presentation given at SaltConf 16. Many of today's popular DevOps practices were pioneered by and for smaller, more agile tech shops. How do these principles apply to large, slow-moving enterprise IT organizations? Learn how SaltStack can help overcome the challenges of silos, old architecture, bureaucracy, and poor communication to help large IT organizations put popular DevOps practices into action.

Using SaltStack to DevOps the enterprise

Christian McHugh

7 tools for your devops stack

Kris Buytaert

Agile conference 2013

gbgruver

Garelic: Google Analytics as App Performance monitoring

Jano Suchal

RandomTest - Random Software Integration Tests That Just Work for C/C++, Java...

dcieslak

Frappe Open Day - September 2018

Frappe Technologies Pvt. Ltd.

Continuous Infrastructure First

Kris Buytaert

Engineering Velocity @indeed eng presented on Sept 24 2014 at Beyond Agile

KenAtIndeed

Ship code like a keptn

Rob Jahn

Buytaert kris tools

kuchinskaya

Db migrations equal pain

Eugen Oskin

GitOps , done Right

Kris Buytaert

Continuous Deployment Applied at MyHeritage

Ran Levy

This presentation shows how to be a pragmatic Android programmer by showing real examples of applications/products developed in BQ attending to three important topics: - Why and how to implement a pragmatic CLEAN architecture with a custom dependency injection framework and ReactiveX features. - Why and how to implement unit, integration, view and smoke tests in every CLEAN layer and what tools should be used. - Why and how to implement pragmatic Continuous Development/Testing/Integration/Delivery by showing several tricks, plugins and snippets that you could use as a daily basis.

Droidcon Spain 2016 - The Pragmatic Android Programmer: from hype to reality

Daniel Gallego Vico

This presentation is the first in a series on Improving Rails application performance. This session covers the basic motivations and goals for improving performance, the best way to approach a performance assessment, and a review of the tools and techniques that will yield the best results. Tools covered include: Firebug, yslow, page speed, speed tracer, dom monster, request log analyzer, oink, rack bug, new relic rpm, rails metrics, showslow.org, msfast, webpagetest.org and gtmetrix.org. The upcoming sessions will focus on: Improving sql queries, and active record use Improving general rails/ruby code Improving the front-end And a final presentation will cover how to be a more efficient and effective developer! This series will be compressed into a best of session for the 2010 http://windycityRails.org conference

improving the performance of Rails web Applications

John McCaffrey

Introduction to git & github

Vinothini KadambavanaSundaram

Similar to Changing Etsy's Architectural Foundations with Continuous Deployment (20)

Delivery at Scale

DevOps: Find Solutions, Not More Defects

Gatling - Bordeaux JUG

Using SaltStack to DevOps the enterprise

7 tools for your devops stack

Agile conference 2013

Garelic: Google Analytics as App Performance monitoring

RandomTest - Random Software Integration Tests That Just Work for C/C++, Java...

Frappe Open Day - September 2018

Continuous Infrastructure First

Engineering Velocity @indeed eng presented on Sept 24 2014 at Beyond Agile

Ship code like a keptn

Buytaert kris tools

Db migrations equal pain

GitOps , done Right

Continuous Deployment Applied at MyHeritage

Droidcon Spain 2016 - The Pragmatic Android Programmer: from hype to reality

improving the performance of Rails web Applications

Introduction to git & github

Recently uploaded

A Beginners Guide to Building a RAG App Using Open Source Milvus

Zilliz

As privacy and data protection regulations evolve rapidly, organizations operating in multiple jurisdictions face mounting challenges to ensure compliance and safeguard customer data. With state-specific privacy laws coming up in multiple states this year, it is essential to understand what their unique data protection regulations will require clearly. How will data privacy evolve in the US in 2024? How to stay compliant? Our panellists will guide you through the intricacies of these states' specific data privacy laws, clarifying complex legal frameworks and compliance requirements. This webinar will review: - The essential aspects of each state's privacy landscape and the latest updates - Common compliance challenges faced by organizations operating in multiple states and best practices to achieve regulatory adherence - Valuable insights into potential changes to existing regulations and prepare your organization for the evolving landscape

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

TrustArc

Created by Mozilla Research in 2012 and now part of Linux Foundation Europe, the Servo project is an experimental rendering engine written in Rust. It combines memory safety and concurrency to create an independent, modular, and embeddable rendering engine that adheres to web standards. Stewardship of Servo moved from Mozilla Research to the Linux Foundation in 2020, where its mission remains unchanged. After some slow years, in 2023 there has been renewed activity on the project, with a roadmap now focused on improving the engine’s CSS 2 conformance, exploring Android support, and making Servo a practical embeddable rendering engine. In this presentation, Rakhi Sharma reviews the status of the project, our recent developments in 2023, our collaboration with Tauri to make Servo an easy-to-use embeddable rendering engine, and our plans for the future to make Servo an alternative web rendering engine for the embedded devices industry. (c) Embedded Open Source Summit 2024 April 16-18, 2024 Seattle, Washington (US) https://events.linuxfoundation.org/embedded-open-source-summit/ https://ossna2024.sched.com/event/1aBNF/a-year-of-servo-reboot-where-are-we-now-rakhi-sharma-igalia

A Year of the Servo Reboot: Where Are We Now?

Igalia

The value of a flexible API Management solution for Open Banking Steve Melan, Manager for IT Innovation and Architecture - State's and Saving's Bank of Luxembourg Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - The value of a flexible API Management solution for O...

apidays

FWD Group - Insurer Innovation Award 2024

The Digital Insurer

AXA XL - Insurer Innovation Award Americas 2024

The Digital Insurer

Three things you will take away from the session: • How to run an effective tenant-to-tenant migration • Best practices for before, during, and after migration • Tips for using migration as a springboard to prepare for Copilot in Microsoft 365 Main ideas: Migration Overview: The presentation covers the current reality of cross-tenant migrations, the triggers, phases, best practices, and benefits of a successful tenant migration Considerations: When considering a migration, it is important to consider the migration scope, performance, customization, flexibility, user-friendly interface, automation, monitoring, support, training, scalability, data integrity, data security, cost, and licensing structure Next Wave: The next wave of change includes the launch of Copilot, which requires businesses to be prepared for upcoming changes related to Copilot and the cloud, and to consolidate data and tighten governance ShareGate: ShareGate can help with pre-migration analysis, configurable migration tool, and automated, end-user driven collaborative governance

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff

sammart93

2024: Domino Containers - The Next Step. News from the Domino Container commu...

Martijn de Jong

Scaling API-first – The story of a global engineering organization Ian Reasor, Senior Computer Scientist - Adobe Radu Cotescu, Senior Computer Scientist - Adobe Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

apidays

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood

Juan lago vázquez

Manulife - Insurer Transformation Award 2024

The Digital Insurer

Boost Fertility New Invention Ups Success Rates.pdf

sudhanshuwaghmare1

Abhishek Deb(1), Mr Abdul Kalam(2) M. Des (UX) , School of Design, DIT University , Dehradun. This paper explores the future potential of AI-enabled smartphone processors, aiming to investigate the advancements, capabilities, and implications of integrating artificial intelligence (AI) into smartphone technology. The research study goals consist of evaluating the development of AI in mobile phone processors, analyzing the existing state as well as abilities of AI-enabled cpus determining future patterns as well as chances together with reviewing obstacles as well as factors to consider for more growth.

Exploring the Future Potential of AI-Enabled Smartphone Processors

debabhi2

Real Time Object Detection Using Open CV

Khem

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving

Edi Saputra

The Good, the Bad and the Governed - Why is governance a dirty word? David O'Neill, Chief Operating Officer - APIContext Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

apidays

Architecting Cloud Native Applications

WSO2

GenAI Risks & Security Meetup 01052024.pdf

lior mazor

MINDCTI Revenue Release Quarter One 2024

MIND CTI

Axa Assurance Maroc - Insurer Innovation Award 2024

The Digital Insurer

Recently uploaded (20)

A Beginners Guide to Building a RAG App Using Open Source Milvus

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

A Year of the Servo Reboot: Where Are We Now?

Apidays New York 2024 - The value of a flexible API Management solution for O...

FWD Group - Insurer Innovation Award 2024

AXA XL - Insurer Innovation Award Americas 2024

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff

2024: Domino Containers - The Next Step. News from the Domino Container commu...

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood

Manulife - Insurer Transformation Award 2024

Boost Fertility New Invention Ups Success Rates.pdf

Exploring the Future Potential of AI-Enabled Smartphone Processors

Real Time Object Detection Using Open CV

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Architecting Cloud Native Applications

GenAI Risks & Security Meetup 01052024.pdf

MINDCTI Revenue Release Quarter One 2024

Axa Assurance Maroc - Insurer Innovation Award 2024

Changing Etsy's Architectural Foundations with Continuous Deployment

1. Changing Etsy's Architectural Foundation with Continuous Deployment Matt Graham Core Engineer @ Etsy Continuous Deployer #surgecon September 28, 2012

2. Marketplace for Handmade Goods Gross Sales 2011: $537 million Total Members: 19 million Items For Sale: 15 million Uniques / month: 40 million Page Views / month: 1.4 billion

3. Architecture is Relative

4. Organic Architecture

5. Premature Architecture

6. Premature Architecture

7. Passing Time => Change ● Scale ● Product ● Technology ● Engineering Team

8. Passing Time => Change ● Scale ● Product ● Technology ● Engineering Team ● The Correct Architecture Changes

9. Architectural Change Antipattern

10. A Brief History of Deployment

11. The Internet

12. Agility

13. Continuous

14. What it's all about ● Reduce Failure Time ● Start with Culture ● Tools Help ● Enable the Unfeasible

15. A Tale of Six Bugs

16. Six Bugs with Monthly Deploys 4 caught ---> 2 missed <--- fix live: 24 hours

17. Six Bugs with Continuous Deploys 2 caught ---> 4 missed <--- fix live: 6 hours

18. Failure Time 2 Bugs * 24 Hours = 48 BH 4 Bugs * 6 Hours = 24 BH Minimize BugHours 24 < 48

19. MTTR vs MTTF

20. Cost of Recovery Photons Minimal Electrons Low Protons & Neutrons High Humans Prohibitive

21. Cost of Recovery $6 million in 1973 = $31m today

22. Good Excuses ● Infrequent Changes ● Infrequent Executions ● Life and Death ● Physical Investment

23. Medical Devices? No

24. NASA? No

25. Enterprise Software? Yes!

26. Print of Cards? No

27. App Store? No

28. Financial Transactions?

29. Financial Transactions? Yes!

30. Getting Started

31. Culture Before Tools ● Throw out the deploy schedule ● Ship changes when tested & ready ● Software is stable & supported

32. Tools of Etsy Deployment

33. Jenkins ● Unit Tests ● Functional Tests

34. Jenkins ● Unit Tests ● Functional Tests ● Manual Testing

35. Jenkins ● Unit Tests ● Functional Tests ● Manual Testing

36. Nagios & Naglite2 github.com/lozzd/Naglite2

37. tail -f | grep

38. github.com/etsy/deployinator

39. IRC

40. Graphs!!!

41. Ganglia

42. Graphite

43. Event Overlay

44. StatsD if ($success) { StatsD::timing('query.runtime', $time); } else { StatsD::increment('query.failure'); } github.com/etsy/statsd

45. github.com/etsy/logster

46. github.com/etsy/logster

47. Practices @ Etsy Feature Flags Customer Communication

48. Feature Flags Deploy != Product Launch

49. Dark Launch def get_payment_link(): return ...

50. Dark Launch def get_payment_link(): if enabled('creditcards'): return creditcard_link() else: return check_link()

51. Dark Launch application_config: - creditcards: admin - NewFeatureB: off - NewFeatureC: on

52. Ramp Up application_config - creditcards: 1% - NewFeatureB: off - NewFeatureC: on

53. Ramp Up application_config - creditcards: 5% - NewFeatureB: off - NewFeatureC: on

54. Whoops! application_config: - creditcards: admin - NewFeatureB: off - NewFeatureC: on

55. Ramp Up application_config - creditcards: 5% - NewFeatureB: off - NewFeatureC: on

56. Ramp Up application_config - creditcards: 25% - NewFeatureB: off - NewFeatureC: on

57. Credit Cards are ON application_config - creditcards: 100% - NewFeatureB: off - NewFeatureC: on

58. AB Testing ● Prove success of interface changes ● Prove interest in new features

59. Community Communication

60. Forums / Message Boards

61. etsystatus.com

62. twitter.com/etsystatus

63. twitter.com/etsystatus

64. Deployment is First Class Deployment is a First Class Feature

65. Engineers are Users Too

66. Examples from Etsy ● Photos From Twisted to PHP ● PostgreSQL to MySQL Shards

67. From Twisted to PHP ● Run Apache/PHP on a new port ● Implement one service in PHP ● Ramp up users on new service ● Repeat for each service ● Shut down Twisted version

68. PostgreSQL to MySQL Shards ● Migrate table by table ● Tee writes to both DBs ● Copy old data from PostgreSQL ● Verify data matches ● Ramp up reads from MySQL ● Stop PostgreSQL writes

69. Continuous Deploy Pattern ● Change in small steps ● Dark launch via config ● Iterations to prod while dark ● Maintain old & new in parallel ● Ramp up new architecture ● Remove old architecture

70. Once Again ● Minimize BugHours ● Trash the Schedule ● Iterate on the Tools ● Make Big Changes

71. Mean Time To Addiction

72. Changing Etsy's Architectural Foundation with Continuous Deployment Matt Graham http://twitter.com/lapsu http://lapsu.tv Core Engineer @ Etsy Continuous Deployer http://codeascraft.etsy.com http://www.etsy.com/careers

Editor's Notes

I'm here to talk about continuous deployment and how it helps with BIG architectural changes. I'm an engineer on Etsy's Core Team and I'm coming at this from the perspective of an engineer involved with making fundamental changes to Etsy's infrastructure. I've found continuous deployment to be a very good way to work and I'm here to spread the good word and hopefully trigger some ideas about how it could help you and how to get started with it.
As the business grows, there is change pressuring the software from all over.
Start with an axiom: that good architecture is not static. The same company had 2 different architectures at 2 points in its history. While the 2005 architecture never would have scaled to 2012, the 2012 architecture would have been complete overkill in 2005. Etsy wouldn't have had time to build the features that got it started.
Here's a very not to scale graph of how the business and architecture grew together.
Here's what it would look like if they had shot straight to the 2012 architecture back in 2005. First, they probably wouldn't have gotten it right, but let's assume they did. We're done right?
What now? We overshot the 2005 architecture to be able to handle 2012, but now we're still not prepared for 2017. What do we do now? The point here is that you can't escape architectural change. All you can do is try to make it easier. Continuous deployment makes architectural change easier.
As the business grows, there is change pressuring the software from all over.
Ultimately, the correct architecture needs to change too.
We have to be able to make big architectural changes, and we need them to go better than this The key to success here is breaking down big changes into many smaller changes When we write code, we break it down into manageable modules. But when it comes time to deploy it, we mash it back together into an unmanageable chunk. This limits the scale of changes. With continuous deployment we remove that limit.
Let's step back and look at how deployment got to where it is. And we'll start here, the 80s. In the olden days... software had to be copied onto floppy disks, put in a box, shipped to a store and then finally purchased from the store And you wouldn't want to give people updates for free, so what happened? They'd all be batched up in an “upgrade” release or a new version altogether. Deploys were understandably rare
And then this happened. Likewise, we even started distributing our thick client software over the internet: Windows Update is a good example We took applications that used to be physically distributed thick clients and made them “web applications” For the most part we were still using the old school deploy cycle
But some people wanted to go faster. We realized, “Hey, it's a website, we can deploy it every month.”
But... why can't we just deploy and deploy and deploy? Well, we can. We can invert the unit of measurement from days per deploy, to be deploys per day
Makes it possible to do what might otherwise be too risky
Most modern software is dealing primarily with electrons The impact on the real world is minimal and indirect In these cases, MTTR is far cheaper than MTTF. Say you're coming up to a monthly release and you have 3 people spend 6 days testing for 100 different bugs. They find 4 and miss 2. The 2 that were missed take 2 days to get fixed and deployed. With continuous deployment, say we find 2 and miss 3.
Most modern software is dealing primarily with electrons The impact on the real world is minimal and indirect In these cases, MTTR is far cheaper than MTTF.
Continuous Deployment minimizes bug hours
Not all bugs are equal though With MTTF, you're telling yourself, if we test it enough, there won't be any bugs. With MTTR, you're saying, we know there will be bugs, let's fix them as quickly as possible.
Cost to recover Steve Austin – $6 million $31 million today
Cost to recover Steve Austin – $6 million $31 million today
Most other cases, continuous deployment may help
GE MRI? No
NASA? No
Enterprise software? Yes!
Printing Health Insurance, Credit Cards? No
Continuously deploy to the App Store?
What about when it comes to processing financial transactions?
Etsy is PCI compliant, so we are financial software. The process is different for our credit card processing software, but we don't deploy on a schedule. We push code whenever it's necessary.
This is all it is, and they're all cultural. Everything else is an implementation detail. Doing it is cultural, the technical part is just improving how well you're doing it. I'll talk about a few things that Etsy does to help us, but they're not necessary if you want to start continuous deployment. It's just a few of things you're likely to find helpful once you do start. Continuous Deployment is like everything else in software ship early and iterate.
This is all it is, and they're all cultural. Everything else is an implementation detail. Doing it is cultural, the technical part is just improving how well you're doing it. I'll talk about a few things that Etsy does to help us, but they're not necessary if you want to start continuous deployment. It's just a few of things you're likely to find helpful once you do start. Continuous Deployment is like everything else in software ship early and iterate.
This is all it is, and they're all cultural. Everything else is an implementation detail. Doing it is cultural, the technical part is just improving how well you're doing it. I'll talk about a few things that Etsy does to help us, but they're not necessary if you want to start continuous deployment. It's just a few of things you're likely to find helpful once you do start. Continuous Deployment is like everything else in software ship early and iterate.
Everything else measures how good you are at continuous deployment.
We don't have the mythical 100% automated test coverage, so we do manual testing too.
Great incentive to add automated tests Manual testing once/month or week can be tolerated Manual testing for multiple deploys/day is painful
Laurie Denness
grep and Enter the Dragon both released in 1973 Tailing a log and using grep to filter out uninteresting stuff is a great way to monitor the health of the system.
We use a tool called deployinator to actually execute our deploys. Deployinator has buttons on it to kick off each stage of the deploy It triggers a shell script that uses dsh to do stuff on each server And it logs what's happening on deploys It's designed to allow only minimal human output as a feature. All we do is say, “Start.” There are no options that we might screw up. The “deploy button” is probably the tool that most contributes to cutting down time spent on deploys
With ~100 developers, there is going to be contention for doing deploys We use another advanced technology for resolving that contention: the topic of an IRC channel. I'm mattg and I'm at the front of the queue. will_gallego is sharing my deploy and Michael Horowitz will do his deploy when he's done.
Ganglia is a common graphing tool. It's great for looking at a pool of machines. Each band here is a separate machine.
Graphite is another graphing tool. It let's you easily apply functions or stack graphs and is a better for displaying system level and busines metrics. These 2 graphs actually show the same event where we switched to an optimized version of libjpeg.
Here's another ganglia graph and there's a clear drop in memcache connections. What caused that? We draw these vertical lines at each deploys and this one is blue. That means there was a configuration change at that time that led to the drop. Now I know I can check the deploy logs to see what went out.
Graphs are great to look at, but they don't help if there's not an easy way for developers to get the right data into them. We use a tool that we've open sourced called StatsD. It's a node.js UDP server that just listens for incoming data and sends it to Graphite. From our application code, the only thing we need to write is this little bit.
Logster is another tool we use to easily get data into graphs. It scans production logs Uses plugins to parse out interesting information Pushes it to Graphite or Ganglia
Logster is another tool we use to easily get data into graphs. It scans production logs Uses plugins to parse out interesting information Pushes it to Graphite or Ganglia
A deploy is not the same as product launch Just because you deploy frequently, doesn't mean you have to give up control of when software is “launched” Feature flags are the tools that give that control through a “dark launch” Credit Card Processing
Imagine if you have this function to get a link for feature A. It formats the string and returns it.
So now to dark launch it, change that function to check a config value. If it's enabled, return the value from a new funciton otherwise return the value from the old function. This is just to generate a link, but we use these all over our code so there's not really any limit to how you use feature flags.
A sample of configuration At Etsy, we use admin to mean only Etsy employees. This so FeatureA is dark launched for employees only so we can see how the development is progressing.
Now we're ready to let real users start seeing the new feature so we increase it to 1% of all users. We can also white list which specific users get the new feature This ramp up is a powerful way to reduce risk in a change and is why continuous deployment could work for financial trading software.
If at any point we see a problem, we just roll back to admin only.
A sample of configuration At Etsy, we use admin to mean only Etsy employees. This so FeatureA is dark launched for employees only so we can see how the development is progressing.
If at any point we see a problem, we just roll back to admin only.
Now 100% of people are using NewFeatureA
Have an interface change and want to see if it moves metrics? Split users across different options and see what happens Have a new feature and want to see if people like it before getting behind it, let people select themselves into a beta group. Google does this with the Labs. With continuous deployment you can make these changes and instantly see results and then make more changes.
Communication a very helpful tool in both directions.
First, if something small breaks, you want to have a feedback path for users to inform you Not specific to continuous deployment but changes are spread with low intensity over a period of time, so you want to have a good low intensity Forums or message boards are a great, low intensity way for users to send feedback
If something big breaks, you want to be able to inform users out of band of a potentially non-functional site At etsy we have a blog hosted by wordpress where we post outages or even slowness on the site
We also have an etsystatus twitter account
And here's one reason to have 2 channels
What we're doing with all these tools is making deployment a first class member of the system Compare to tech support features or business intelligence
Listing photos are a core part of our site as it's what lets buyers see what people are selling and it gets 400k uploads per day. The postgres DB was our central DB and we're migrating all of the data there over to the shards. All this happens live.
Ultimately, the correct architecture needs to change too.
Ultimately, the correct architecture needs to change too.
Finally, get rid of the old stuff. This is the most satisfying step. Also very important to keep unused stuff from causing confusion.

Changing Etsy's Architectural Foundations with Continuous Deployment

Recommended

Recommended

More Related Content

What's hot

What's hot (6)

Viewers also liked

Viewers also liked (14)

Similar to Changing Etsy's Architectural Foundations with Continuous Deployment

Similar to Changing Etsy's Architectural Foundations with Continuous Deployment (20)

Recently uploaded

Recently uploaded (20)

Changing Etsy's Architectural Foundations with Continuous Deployment

Editor's Notes