SmugMug’s Zero Downtime Migration to AWS
ARC312
Andrew Shieh, SmugMug Operations
shandrew @ smugmug.com
November 15, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Friday, November 15, 13
SmugMug—Who are we?

The early days of SmugMug
• Gradual bootstrapped growth
• Multiple self-managed datacenter cages
• Too many servers of varying types
• Too many disks
• Tons of valuable skilled employee hours spent in cages

Data Center Fantasy

Data Center Reality

SmugMug <3 AWS
• Early adopter of Amazon S3
• Over the years, moved rendering, upload, archiving, payments, permissions, email, and more compute to AWS
• Before mid-2012, no ultra-high-performance I/O
SmugMug Architecture ~2006

AWS: S3
SV: Web, DB, Image*

SmugMug Architecture ~2011

AWS: S3, Image (upload, processing, render, video, …)
SV: Web, DB

SmugMug Architecture - Transition

AWS: S3, Image*, Web
SV: Web, DB
DC: Replication DB, Direct Connect

SmugMug Architecture Today

Ø
AWS: S3, Image*, Web, DB

How did we get there?

Our database I/O evolution: Always cutting edge
• Started with MySQL on spinning disk RAID, max RAM
• Moved to ZFS SSD + SSD cache + spinning disks
• Moved to custom 24-SSD arrays

hi1.4xlarge FTW
• Our custom, obscure hardware => difficult to resolve problems, difficult to upgrade
• hi1 overall DB I/O performance comparable to 8 x SSD RAID10
• < 3%/yr hi1 instance failure rate!
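
A rough back-of-the-envelope reading of the failure-rate bullet: even at under 3% per instance per year, a fleet of database instances will still see failures, which matters when the data sits on ephemeral storage. The sketch below assumes failures are independent and uses a made-up fleet size of 20; neither comes from the talk, and because the slide says "< 3%", the results are upper bounds.

```python
def expected_failures(instances, annual_rate=0.03):
    """Expected number of instance failures per year across the fleet."""
    return instances * annual_rate


def prob_at_least_one_failure(instances, annual_rate=0.03):
    """Chance that at least one instance fails in a year (assumes independence)."""
    return 1.0 - (1.0 - annual_rate) ** instances


if __name__ == "__main__":
    fleet = 20  # hypothetical fleet size, not SmugMug's
    print(f"expected failures/yr: {expected_failures(fleet):.2f}")
    print(f"P(at least one failure/yr): {prob_at_least_one_failure(fleet):.2f}")
```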
Amazon VPC - also a big win
• Easy mapping of internal / external network security model to AWS

Zero downtime move?

Zero Downtime Move
• Flexibility of the AWS cloud makes a zero downtime move inexpensive. Pay for only what you use. Provision fast.
• Plan
• Test
• Plan and test again

Major changes post-move
• Database storage goes from SSD to hi1.4xlarge ephemeral
• Hardware load balancers become Elastic Load Balancing (ELB) load balancers
• haproxy layer 7 load/traffic directing goes from static to dynamic config
• Web servers autoscale for each cluster
• Membase to ElastiCache (later to Amazon EC2)
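
The shift from static to dynamic haproxy config can be pictured as regenerating a backend stanza whenever cluster membership changes. This is only a minimal sketch, not SmugMug's tooling: `render_backend`, the cluster name, and the IP addresses are hypothetical, and in practice the server list would come from whatever tracks the autoscaled web servers.

```python
def render_backend(cluster, servers, port=80):
    """Render an haproxy backend stanza from (name, ip) pairs."""
    lines = [f"backend {cluster}", "    balance roundrobin"]
    for name, ip in sorted(servers):
        # "check" turns on haproxy health checks for this server.
        lines.append(f"    server {name} {ip}:{port} check")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical membership, e.g. after an autoscaling event.
    members = [("web2", "10.0.1.12"), ("web1", "10.0.1.11")]
    print(render_backend("web_cluster", members))
```

The generated text would then be written to the haproxy config and the process reloaded; that step is left out here.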
Zero Downtime Move Requirements
• Read-only site mode
• Traffic control — shadow load
• Cross-country MySQL replication + sufficient bandwidth
• Bot testing
• Read-only live site testing w/ QA
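
The first two requirements can be sketched in a few lines. The talk does not say how SmugMug implemented either one, so everything here is an assumption: a site-wide read-only switch that only admits HTTP methods that should not modify data, and a deterministic sampler that picks a percentage of live requests to replay against the new environment as shadow load. The function names are hypothetical.

```python
import zlib

# Methods assumed not to modify data; a real site would need to audit this.
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def allow_request(method, read_only):
    """In read-only mode, admit only methods that should not write."""
    return (not read_only) or method.upper() in SAFE_METHODS


def should_shadow(request_id, shadow_percent):
    """Deterministically select about shadow_percent% of requests to replay."""
    bucket = zlib.crc32(request_id.encode("utf-8")) % 100
    return bucket < shadow_percent
```

Hashing the request ID rather than calling a random generator keeps the decision repeatable, which helps when comparing old and new stacks on the same requests.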

More on moving
• Full-scale read-write testing is difficult
• Be aware of AWS limits
• Talk to support for big growth
• Roll back plan - manage risky change

Flipping the switch to AWS
• “The biggest, scariest engineering change we've made in the company's history” - Don, SmugMug Chief Geek
• Go read-only (1 min)
• Pre-scale up big
• MHA to reassign MySQL masters and their replication (30 min)
• Point DNS + CDN to Elastic Load Balancing (5-30 min)
• Test! (60 min)
• When read-only is all good, go to read-write (5 min)
• Test! Inevitable bugs at this step (hours)
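
Adding up the step durations from the slides gives a feel for how long the site stays read-only. The sketch assumes the steps run strictly one after another and leaves out the pre-scale step, for which the slides give no duration; both are assumptions, and the final "hours" of read-write testing falls outside the read-only window.

```python
# (step, min_minutes, max_minutes); None means the slides give no duration.
STEPS = [
    ("Go read-only", 1, 1),
    ("Pre-scale up", None, None),
    ("MHA reassigns MySQL masters and replication", 30, 30),
    ("Point DNS + CDN to Elastic Load Balancing", 5, 30),
    ("Test in read-only mode", 60, 60),
    ("Go read-write", 5, 5),
]


def read_only_window(steps):
    """Return (shortest, longest) read-only time in minutes, skipping unknowns."""
    shortest = sum(lo for _, lo, _ in steps if lo is not None)
    longest = sum(hi for _, _, hi in steps if hi is not None)
    return shortest, longest


if __name__ == "__main__":
    print(read_only_window(STEPS))  # (101, 126)
```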

MHA?
• Facebook, DeNA
• Helps to reliably reassign MySQL masters and replication, maintaining consistency

• Manual failover in MySQL 5.5 and earlier is painful, time-consuming
• Be careful with automation for rare events; it can bite
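
One way to picture the "maintaining consistency" part of a planned master switch: with the site read-only, a replica should only take over once it has applied everything the old master wrote. The toy check below illustrates that idea only. It is not MHA's code, configuration, or algorithm, and the host names and binlog positions are made up.

```python
from typing import NamedTuple


class Position(NamedTuple):
    binlog_file: str  # e.g. "mysql-bin.000042" (hypothetical)
    offset: int


def replicas_behind(master_pos, replica_positions):
    """Hosts whose executed position has not yet reached the master's."""
    return sorted(h for h, pos in replica_positions.items() if pos != master_pos)


def safe_to_switch(master_pos, replica_positions, candidate):
    """True only if the candidate has applied everything from the old master."""
    return replica_positions.get(candidate) == master_pos
```

A runbook could poll `replicas_behind` after going read-only and proceed once it returns an empty list.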

Problems?
• Completely redundant network links can fail
• Bugs related to IP address change
• ElastiCache performance
• New Relic! Use it or a similar APM product

Results
• Data Center - performance fluctuated throughout the day
• AWS w/ scaling - flat performance throughout the day - significant scalability limits removed
• Networking was a key improvement
• Success!

Lessons Learned
• We love AWS even more than before
• Automate everything
• Understand Amazon EBS, and understand underlying details of AWS services
• Unpredictable Ops schedules vs. large projects

Lessons Learned

Job #1: Making business happen
We made more changes, because we could
• As long as we’re moving our infrastructure, why not rebuild most of it too?
• Linux, MySQL, package versions upgraded
• New monitoring tools
• NFS dependencies eliminated, moved to Amazon S3 or DynamoDB
• Code pushes managed by nice distributed tools utilizing Amazon S3 + internal torrent
One last thing...
• Go Multi-availability-zone!
• Load balancers send traffic to multiple haproxy per AZ with AZ-specific web clusters, DB replicas
• Backed up w/ cross AZ
• Keep SPOFs in one AZ
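
The multi-AZ layout above can be sketched as a small topology map: the load balancer targets haproxy nodes in every healthy AZ, and any single points of failure are kept together so that only one AZ's outage can affect them. The AZ names, host names, and the choice of a DB master as the SPOF are all made up for illustration and are not taken from the talk.

```python
# Hypothetical topology; AZ and host names are illustrative only.
TOPOLOGY = {
    "az-a": {"haproxy": ["haproxy-a1", "haproxy-a2"], "spofs": ["db-master"]},
    "az-b": {"haproxy": ["haproxy-b1", "haproxy-b2"], "spofs": []},
}


def lb_targets(topology, failed_azs=()):
    """haproxy nodes the load balancer should send traffic to."""
    return [
        node
        for az in sorted(topology)
        if az not in failed_azs
        for node in topology[az]["haproxy"]
    ]


def azs_holding_spofs(topology):
    """AZs that contain at least one single point of failure."""
    return sorted(az for az, parts in topology.items() if parts["spofs"])
```

A deployment check could assert that `azs_holding_spofs` returns at most one AZ.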

Questions?
Andrew Shieh, Sunnyvale, CA
shandrew@smugmug.com
@shandrew
http://www.smugmug.com/
http://pics.shieh.info/
Thank you!


SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

  • 1. SmugMug’s Zero Downtime Migration to AWS ARC312 Andrew Shieh, SmugMug Operations shandrew @ smugmug.com November 15, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Friday, November 15, 13
  • 3. The early days of SmugMug • Gradual bootstrapped growth • Multiple self-managed datacenter cages • Too many servers of varying types • Too many disks • Tons of valuable skilled employee hours spent in cages
  • 5. Data Center Reality
  • 6. Data Center Reality
  • 7. SmugMug <3 AWS • Early adopter of Amazon S3 • Over the years, moved rendering, upload, archiving, payments, permissions, email, and more compute to AWS • Before mid-2012, no ultra-high performance I/O
  • 8. SmugMug Architecture ~2006 • AWS: S3 • SV: Web, DB, Image* • AWS: S3
  • 9. SmugMug Architecture ~2011 • AWS: S3 • AWS: S3, Image (upload, processing, render, video, …) • SV: Web, DB
  • 10. SmugMug Architecture - Transition • AWS: S3 • SV: Web, DB • AWS: S3, Image*, Web • DC: Replication DB, Direct Connect
  • 11. SmugMug Architecture Today • Ø • AWS: S3, Image*, Web, DB
  • 12. How did we get there?
  • 13. Our database I/O evolution: Always cutting edge • Started with MySQL on spinning disk RAID, max RAM • Moved to ZFS SSD + SSD cache + spinning disks • Moved to custom 24-SSD arrays
  • 14. hi1.4xlarge FTW • Our custom, obscure hardware => difficult to resolve problems, difficult to upgrade • hi1 overall DB IO performance comparable to 8 x SSD RAID10 • < 3%/yr hi1 instance failure rate!
  • 15. Amazon VPC - also a big win • Easy mapping of internal / external network security model to AWS
  • 16. Zero downtime move?
  • 19. Zero Downtime Move • Flexibility of the AWS cloud makes a zero downtime move inexpensive. Pay for only what you use. Provision fast. • Plan • Test • Plan and test again
  • 20. Major changes post-move • Database storage goes from SSD to hi1.4xlarge ephemeral • Hardware load balancers become Elastic Load Balancing load balancers
  • 21. Major changes post-move • Database storage goes from SSD to hi1.4xlarge ephemeral • Hardware load balancers become ELB • haproxy layer 7 load/traffic directing goes from static to dynamic config • Web servers autoscale for each cluster • Membase to ElastiCache (later to Amazon EC2)
  • 22. Zero Downtime Move Requirements • Read-only site mode • Traffic control: shadow load • Cross-country MySQL replication + sufficient bandwidth
  • 23. Zero Downtime Move Requirements • Read-only site mode • Traffic control: shadow load • Cross-country MySQL replication + sufficient bandwidth • Bot testing • Read-only live site testing w/ QA
  • 24. More on moving • Full-scale read-write testing is difficult • Be aware of AWS limits • Talk to support for big growth • Rollback plan - manage risky change
  • 25. Flipping the switch to AWS • “The biggest, scariest engineering change we've made in the company's history” - Don, SmugMug Chief Geek • Go read-only (1 min) • Pre-scale up big • MHA to reassign MySQL masters and their replication (30 min) • Point DNS+CDN to Elastic Load Balancing (5-30 min)
  • 26. Flipping the switch to AWS • Test! (60 min) • When read-only is all good, go to read-write (5 min) • Test! Inevitable bugs at this step (hours)
  • 27. MHA? • Facebook, DeNA • Helps to reliably reassign MySQL masters and replication, maintaining consistency
  • 28. MHA? • Manual failover in MySQL 5.5 and earlier is painful, time-consuming • Be careful with automation for rare events; it can bite
  • 29. Problems? • Completely redundant network links can fail • Bugs related to IP address change • ElastiCache performance • NewRelic! Use it or a similar APM product
  • 32. Results • Data Center - performance fluctuated through day • AWS w/scaling - flat performance throughout the day - significant scalability limits removed • Networking was a key improvement • Success!
  • 33. Lessons Learned • We love AWS even more than before • Automate everything • Understand Amazon EBS, and understand underlying details of AWS services • Unpredictable Ops schedules vs. large projects
  • 35. We made more changes, because we could • As long as we’re moving our infrastructure, why not rebuild most of it too? • Linux, MySQL, package versions upgraded • New monitoring tools • NFS dependencies eliminated, moved to Amazon S3 or DynamoDB • Code pushes managed by nice distributed tools utilizing Amazon S3 + internal torrent
  • 36. One last thing... • Go Multi-availability-zone! • Load balancers send traffic to multiple haproxy per AZ with AZ-specific web clusters, DB replicas • Backed up w/ cross AZ • Keep SPOFs in one AZ
  • 37. Questions? Andrew Shieh, Sunnyvale, CA shandrew@smugmug.com @shandrew http://www.smugmug.com/ http://pics.shieh.info/ Thank you!
  • 38. Please give us your feedback on this presentation ARC312 - SmugMug’s Zero Downtime Migration to AWS As a thank you, we will select prize winners daily for completed surveys! Thank You
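
The step durations quoted on the two "Flipping the switch to AWS" slides (25 and 26) can be added up to get a rough idea of how long the site would sit in read-only mode. The following is only a sketch: it assumes the timed steps run strictly one after another, which the slides do not state, and it leaves out the steps that carry no duration (pre-scaling, and the open-ended read-write testing measured in "hours"). The `Step` class and `read_only_window` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One timed cutover step; durations are in minutes."""
    name: str
    min_minutes: int
    max_minutes: int


# Durations as quoted on slides 25 and 26.
# Assumption: the steps are sequential and do not overlap.
CUTOVER_STEPS = [
    Step("Go read-only", 1, 1),
    Step("MHA reassigns MySQL masters and replication", 30, 30),
    Step("Point DNS+CDN to Elastic Load Balancing", 5, 30),
    Step("Test in read-only mode", 60, 60),
    Step("Go read-write", 5, 5),
]


def read_only_window(steps):
    """Return (shortest, longest) total duration in minutes."""
    shortest = sum(step.min_minutes for step in steps)
    longest = sum(step.max_minutes for step in steps)
    return shortest, longest


if __name__ == "__main__":
    low, high = read_only_window(CUTOVER_STEPS)
    print(f"Estimated read-only window: {low}-{high} minutes")
```

Under these assumptions the timed steps total between 101 and 126 minutes, with the DNS+CDN switch accounting for all of the spread.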