AMAZON FAIL
      DC Public Library’s Lessons Learned from the Amazon Cloud Outage




Friday, June 24, 2011
BACKGROUND

    •   DClibrary.org was the first major DC Government website to use
        cloud-based hosting, beginning circa June 2009

    •   Initial architecture designed to leverage the low cost of large-instance
        Amazon Web Services (AWS) servers for database operations
        and lower-cost small and mid-size servers for WWW services

    •   DClibrary.org Content Management System is Drupal 6

    •   Bonus: Experimental Drupal 7 Amazon Machine Image (AMI) available
        on our website; currently undergoing user testing

WHAT WENT WRONG

    •   Background: AWS decouples the physical hard disk space (called Elastic
        Block Store or EBS) from the CPUs (called “compute instances”)

    •   Late April 2011: an AWS engineer mistakenly routed “backplane” traffic (the
        internal server traffic that connects EBS to the CPUs) through a system that
        could not handle the load

    •   This triggered an alarm; since everything in AWS is redundant, the systems
        thought the backup EBS drives had all failed simultaneously, causing an
        overload as the system tried to compensate

    •   In a nutshell, it’s almost as if the CPUs no longer had hard drives

2009 ARCHITECTURE

    •   June 2009 architecture focused
        on load balancing and database
        replication across Amazon
        Availability Zones

    •   SVN machine was also in the cloud

    •   Too reliant on one service
        provider (Amazon)


PRE-OUTAGE ARCHITECTURE


    •   AWS began a new service called “RDS” (Relational Database Service) in 2010.
        This was a managed database service (MySQL) that was more powerful
        and simpler to administer than running the database ourselves on large servers

    •   We migrated to RDS in 2010

    •   The rest of the architecture, with the mid-instance front ends and load
        balancers, stayed the same




KEY LESSONS LEARNED
    •   Failover across Amazon’s multiple Availability Zones is not reliable
         •    Multiple zones do not imply separate physical or logical facilities!
         •    Amazon’s poor communication during the outage compounded this problem
         •    Due to Amazon’s poor initial incident response communications, we decided on the spot to
              launch new machine instances (from AMIs) in a different geographic region (US-West vs. US-East) and
              copy over the “offsite” one-day-old SVN and DB backups
         •    Downtime minimized to 1.5 hours; many websites (Reddit, Quora, Foursquare) were down for
              days
    •   Future Worst Case: Amazon goes completely offline. This means we need a very recent full backup of
        both WWW and DB instances in a physically and logically separate facility, plus the ability to load
        balance or change DNS quickly
         •    Solution was to scale up Rackspace instances and make daily copies to those servers
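
The "very recent full backup" requirement above can be checked mechanically. The following is only a minimal sketch, not the library's actual tooling: the 24-hour threshold is an assumption chosen to match the daily copies, and the function name and the `www`/`db` keys are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed threshold: daily copies imply a backup should never be
# more than about 24 hours old. Adjust to your own schedule.
MAX_BACKUP_AGE = timedelta(hours=24)


def stale_backups(last_backup_times, now, max_age=MAX_BACKUP_AGE):
    """Return the sorted names of backups whose age exceeds max_age.

    last_backup_times maps a backup name (hypothetical, e.g. "www" or
    "db") to the datetime of its last completed copy on the offsite server.
    """
    return sorted(
        name
        for name, finished_at in last_backup_times.items()
        if now - finished_at > max_age
    )


if __name__ == "__main__":
    now = datetime(2011, 6, 24, 12, 0)
    times = {
        "www": datetime(2011, 6, 24, 2, 0),  # 10 hours old
        "db": datetime(2011, 6, 22, 2, 0),   # 58 hours old
    }
    print(stale_backups(times, now))
```

A check like this could run from the monitoring service so that a missed nightly copy raises an alert before the backup is needed.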

2011 ARCHITECTURE




WHAT WE RECOMMEND
    •   get physically and logically separate backup servers
    •   do nightly full copy backups to the above servers
    •   have a clear, written process in place for the following things:
         •    communicating with superiors about what’s happening
         •    what steps need to be taken to failover
         •    when the “worst-case” failover plan is implemented (can be time-based or circumstance-based
              or both)
    •   either implement automatic load balancing or (not as good) have complete control over your DNS
    •   use a very good alerts monitoring service; some of the best ones are cheap/free. We use
        binarycanary.com.
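
A written failover process is easier to follow under pressure if the trigger is stated as an explicit rule. The sketch below illustrates one way to combine a time-based and a circumstance-based trigger; the class, field names, and the 30-minute limit are all hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class FailoverPolicy:
    # Time-based trigger: fail over once the outage has lasted this long.
    max_outage: timedelta
    # Circumstance-based trigger: fail over when the provider gives no
    # usable status information (an assumed example of a "circumstance").
    fail_on_no_status: bool


def should_fail_over(policy, outage_duration, provider_status_known):
    """Return True if either trigger in the policy has fired."""
    time_trigger = outage_duration >= policy.max_outage
    circumstance_trigger = policy.fail_on_no_status and not provider_status_known
    return time_trigger or circumstance_trigger


if __name__ == "__main__":
    policy = FailoverPolicy(max_outage=timedelta(minutes=30), fail_on_no_status=True)
    # Short outage, but the provider is silent: the circumstance trigger fires.
    print(should_fail_over(policy, timedelta(minutes=5), provider_status_known=False))
```

Writing the rule down this way also gives superiors a clear answer to "when do we switch?" before an outage happens.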





Mais conteúdo relacionado

Mais procurados

HA and DR for Cloud Workloads
HA and DR for Cloud WorkloadsHA and DR for Cloud Workloads
HA and DR for Cloud Workloadsswamybabu
 
AWS Summit Auckland 2014 | AWSome Data Protection with Veeam - Session Sponso...
AWS Summit Auckland 2014 | AWSome Data Protection with Veeam - Session Sponso...AWS Summit Auckland 2014 | AWSome Data Protection with Veeam - Session Sponso...
AWS Summit Auckland 2014 | AWSome Data Protection with Veeam - Session Sponso...Amazon Web Services
 
Five Years of EC2 Distilled
Five Years of EC2 DistilledFive Years of EC2 Distilled
Five Years of EC2 DistilledGrig Gheorghiu
 
AWS Customer Presentation - Visiware - AWS Summit Paris
AWS Customer Presentation -  Visiware - AWS Summit ParisAWS Customer Presentation -  Visiware - AWS Summit Paris
AWS Customer Presentation - Visiware - AWS Summit ParisAmazon Web Services
 
Creating CentOS Template For CloudStack
Creating CentOS Template For CloudStackCreating CentOS Template For CloudStack
Creating CentOS Template For CloudStackShanker Balan
 
INFOGRAPHIC: #BackupExec 2014 - Backup Anything. Restore Anywhere.
INFOGRAPHIC: #BackupExec 2014 - Backup Anything. Restore Anywhere.INFOGRAPHIC: #BackupExec 2014 - Backup Anything. Restore Anywhere.
INFOGRAPHIC: #BackupExec 2014 - Backup Anything. Restore Anywhere.Symantec
 
SYMANTEC Backup Exec 2014 - infographic
SYMANTEC Backup Exec 2014 - infographicSYMANTEC Backup Exec 2014 - infographic
SYMANTEC Backup Exec 2014 - infographicMZERMA Amine
 
KVM High Availability Regardless of Storage - Gabriel Brascher, VP of Apache ...
KVM High Availability Regardless of Storage - Gabriel Brascher, VP of Apache ...KVM High Availability Regardless of Storage - Gabriel Brascher, VP of Apache ...
KVM High Availability Regardless of Storage - Gabriel Brascher, VP of Apache ...ShapeBlue
 
Active Cloud DB at CloudComp '10
Active Cloud DB at CloudComp '10Active Cloud DB at CloudComp '10
Active Cloud DB at CloudComp '10Chris Bunch
 
A brief introduction to CloudFormation
A brief introduction to CloudFormationA brief introduction to CloudFormation
A brief introduction to CloudFormationSWIFTotter Solutions
 
Deploying your application on open stack using bosh presentation
Deploying your application on open stack using bosh presentationDeploying your application on open stack using bosh presentation
Deploying your application on open stack using bosh presentationcapouch
 
Container management with docker & kubernetes
Container management with docker & kubernetesContainer management with docker & kubernetes
Container management with docker & kubernetesKasun Rajapakse
 
Architecting in Cloud : Your Guide to Amazon Web Services
Architecting in Cloud : Your Guide to Amazon Web ServicesArchitecting in Cloud : Your Guide to Amazon Web Services
Architecting in Cloud : Your Guide to Amazon Web ServicesEdureka!
 
Microservices with Terraform, Docker and the Cloud. DevOps Wet 2018
Microservices with Terraform, Docker and the Cloud. DevOps Wet 2018Microservices with Terraform, Docker and the Cloud. DevOps Wet 2018
Microservices with Terraform, Docker and the Cloud. DevOps Wet 2018Derek Ashmore
 
Henry been database-per-tenant with 50k databases
Henry been   database-per-tenant with 50k databasesHenry been   database-per-tenant with 50k databases
Henry been database-per-tenant with 50k databasesHenry Been
 
Building a multi-tenant application using 45.000 databases - Henry Been - Cod...
Building a multi-tenant application using 45.000 databases - Henry Been - Cod...Building a multi-tenant application using 45.000 databases - Henry Been - Cod...
Building a multi-tenant application using 45.000 databases - Henry Been - Cod...Codemotion
 
AppScale @ LA.rb
AppScale @ LA.rbAppScale @ LA.rb
AppScale @ LA.rbChris Bunch
 
Architecture Best Practices on Windows Azure
Architecture Best Practices on Windows AzureArchitecture Best Practices on Windows Azure
Architecture Best Practices on Windows AzureNuno Godinho
 

Mais procurados (20)

HA and DR for Cloud Workloads
HA and DR for Cloud WorkloadsHA and DR for Cloud Workloads
HA and DR for Cloud Workloads
 
AWS Summit Auckland 2014 | AWSome Data Protection with Veeam - Session Sponso...
AWS Summit Auckland 2014 | AWSome Data Protection with Veeam - Session Sponso...AWS Summit Auckland 2014 | AWSome Data Protection with Veeam - Session Sponso...
AWS Summit Auckland 2014 | AWSome Data Protection with Veeam - Session Sponso...
 
Five Years of EC2 Distilled
Five Years of EC2 DistilledFive Years of EC2 Distilled
Five Years of EC2 Distilled
 
AWS Customer Presentation - Visiware - AWS Summit Paris
AWS Customer Presentation -  Visiware - AWS Summit ParisAWS Customer Presentation -  Visiware - AWS Summit Paris
AWS Customer Presentation - Visiware - AWS Summit Paris
 
Creating CentOS Template For CloudStack
Creating CentOS Template For CloudStackCreating CentOS Template For CloudStack
Creating CentOS Template For CloudStack
 
INFOGRAPHIC: #BackupExec 2014 - Backup Anything. Restore Anywhere.
INFOGRAPHIC: #BackupExec 2014 - Backup Anything. Restore Anywhere.INFOGRAPHIC: #BackupExec 2014 - Backup Anything. Restore Anywhere.
INFOGRAPHIC: #BackupExec 2014 - Backup Anything. Restore Anywhere.
 
SYMANTEC Backup Exec 2014 - infographic
SYMANTEC Backup Exec 2014 - infographicSYMANTEC Backup Exec 2014 - infographic
SYMANTEC Backup Exec 2014 - infographic
 
KVM High Availability Regardless of Storage - Gabriel Brascher, VP of Apache ...
KVM High Availability Regardless of Storage - Gabriel Brascher, VP of Apache ...KVM High Availability Regardless of Storage - Gabriel Brascher, VP of Apache ...
KVM High Availability Regardless of Storage - Gabriel Brascher, VP of Apache ...
 
Active Cloud DB at CloudComp '10
Active Cloud DB at CloudComp '10Active Cloud DB at CloudComp '10
Active Cloud DB at CloudComp '10
 
A brief introduction to CloudFormation
A brief introduction to CloudFormationA brief introduction to CloudFormation
A brief introduction to CloudFormation
 
Deploying your application on open stack using bosh presentation
Deploying your application on open stack using bosh presentationDeploying your application on open stack using bosh presentation
Deploying your application on open stack using bosh presentation
 
Container management with docker & kubernetes
Container management with docker & kubernetesContainer management with docker & kubernetes
Container management with docker & kubernetes
 
Architecting in Cloud : Your Guide to Amazon Web Services
Architecting in Cloud : Your Guide to Amazon Web ServicesArchitecting in Cloud : Your Guide to Amazon Web Services
Architecting in Cloud : Your Guide to Amazon Web Services
 
Microservices with Terraform, Docker and the Cloud. DevOps Wet 2018
Microservices with Terraform, Docker and the Cloud. DevOps Wet 2018Microservices with Terraform, Docker and the Cloud. DevOps Wet 2018
Microservices with Terraform, Docker and the Cloud. DevOps Wet 2018
 
Aws ec2
Aws ec2Aws ec2
Aws ec2
 
Henry been database-per-tenant with 50k databases
Henry been   database-per-tenant with 50k databasesHenry been   database-per-tenant with 50k databases
Henry been database-per-tenant with 50k databases
 
Building a multi-tenant application using 45.000 databases - Henry Been - Cod...
Building a multi-tenant application using 45.000 databases - Henry Been - Cod...Building a multi-tenant application using 45.000 databases - Henry Been - Cod...
Building a multi-tenant application using 45.000 databases - Henry Been - Cod...
 
Ph.D. Defense
Ph.D. DefensePh.D. Defense
Ph.D. Defense
 
AppScale @ LA.rb
AppScale @ LA.rbAppScale @ LA.rb
AppScale @ LA.rb
 
Architecture Best Practices on Windows Azure
Architecture Best Practices on Windows AzureArchitecture Best Practices on Windows Azure
Architecture Best Practices on Windows Azure
 

Destaque

Cloud Computing Outages - Analysis of Key Outages 2009 - 2012
Cloud Computing Outages - Analysis of Key Outages 2009 - 2012 Cloud Computing Outages - Analysis of Key Outages 2009 - 2012
Cloud Computing Outages - Analysis of Key Outages 2009 - 2012 Rajesh Prabhakar
 
Analyzing and Surveying Trust In Cloud Computing Environment
Analyzing and Surveying Trust In Cloud Computing EnvironmentAnalyzing and Surveying Trust In Cloud Computing Environment
Analyzing and Surveying Trust In Cloud Computing Environmentiosrjce
 
Cloud Computing & ITSM - For Better of for Worse?
Cloud Computing & ITSM - For Better of for Worse?Cloud Computing & ITSM - For Better of for Worse?
Cloud Computing & ITSM - For Better of for Worse?ITpreneurs
 
European Utility Week 2015: Next Generation Outage Management
European Utility Week 2015: Next Generation Outage ManagementEuropean Utility Week 2015: Next Generation Outage Management
European Utility Week 2015: Next Generation Outage ManagementOMNETRIC
 
DC architectures future proof
DC architectures future proofDC architectures future proof
DC architectures future proofGuido Frabotti
 
(SEC404) Incident Response in the Cloud | AWS re:Invent 2014
(SEC404) Incident Response in the Cloud | AWS re:Invent 2014(SEC404) Incident Response in the Cloud | AWS re:Invent 2014
(SEC404) Incident Response in the Cloud | AWS re:Invent 2014Amazon Web Services
 
Successful Outage Management Lessons Learned From Global Generation Leaders
Successful Outage Management   Lessons Learned From Global Generation LeadersSuccessful Outage Management   Lessons Learned From Global Generation Leaders
Successful Outage Management Lessons Learned From Global Generation LeadersTedLemmers
 
The Inevitable Cloud Outage
The Inevitable Cloud OutageThe Inevitable Cloud Outage
The Inevitable Cloud OutageNewvewm
 
Avoiding Cloud Outage
Avoiding Cloud OutageAvoiding Cloud Outage
Avoiding Cloud OutageNati Shalom
 
26 Time Management Hacks I Wish I'd Known at 20
26 Time Management Hacks I Wish I'd Known at 2026 Time Management Hacks I Wish I'd Known at 20
26 Time Management Hacks I Wish I'd Known at 20Étienne Garbugli
 

Destaque (12)

Cloud Computing Outages - Analysis of Key Outages 2009 - 2012
Cloud Computing Outages - Analysis of Key Outages 2009 - 2012 Cloud Computing Outages - Analysis of Key Outages 2009 - 2012
Cloud Computing Outages - Analysis of Key Outages 2009 - 2012
 
Analyzing and Surveying Trust In Cloud Computing Environment
Analyzing and Surveying Trust In Cloud Computing EnvironmentAnalyzing and Surveying Trust In Cloud Computing Environment
Analyzing and Surveying Trust In Cloud Computing Environment
 
Henry
HenryHenry
Henry
 
Cloud malfunction up11
Cloud malfunction up11Cloud malfunction up11
Cloud malfunction up11
 
Cloud Computing & ITSM - For Better of for Worse?
Cloud Computing & ITSM - For Better of for Worse?Cloud Computing & ITSM - For Better of for Worse?
Cloud Computing & ITSM - For Better of for Worse?
 
European Utility Week 2015: Next Generation Outage Management
European Utility Week 2015: Next Generation Outage ManagementEuropean Utility Week 2015: Next Generation Outage Management
European Utility Week 2015: Next Generation Outage Management
 
DC architectures future proof
DC architectures future proofDC architectures future proof
DC architectures future proof
 
(SEC404) Incident Response in the Cloud | AWS re:Invent 2014
(SEC404) Incident Response in the Cloud | AWS re:Invent 2014(SEC404) Incident Response in the Cloud | AWS re:Invent 2014
(SEC404) Incident Response in the Cloud | AWS re:Invent 2014
 
Successful Outage Management Lessons Learned From Global Generation Leaders
Successful Outage Management   Lessons Learned From Global Generation LeadersSuccessful Outage Management   Lessons Learned From Global Generation Leaders
Successful Outage Management Lessons Learned From Global Generation Leaders
 
The Inevitable Cloud Outage
The Inevitable Cloud OutageThe Inevitable Cloud Outage
The Inevitable Cloud Outage
 
Avoiding Cloud Outage
Avoiding Cloud OutageAvoiding Cloud Outage
Avoiding Cloud Outage
 
26 Time Management Hacks I Wish I'd Known at 20
26 Time Management Hacks I Wish I'd Known at 2026 Time Management Hacks I Wish I'd Known at 20
26 Time Management Hacks I Wish I'd Known at 20
 

Semelhante a Dcpl cloud computing amazon fail

High Availability in the Cloud - Architectural Best Practices
High Availability in the Cloud - Architectural Best PracticesHigh Availability in the Cloud - Architectural Best Practices
High Availability in the Cloud - Architectural Best PracticesRightScale
 
Running High Availability Websites with Acquia and AWS
Running High Availability Websites with Acquia and AWSRunning High Availability Websites with Acquia and AWS
Running High Availability Websites with Acquia and AWSAcquia
 
Ceate a Scalable Cloud Architecture
Ceate a Scalable Cloud ArchitectureCeate a Scalable Cloud Architecture
Ceate a Scalable Cloud ArchitectureAmazon Web Services
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersAmazon Web Services
 
Moving to the Cloud: AWS, Zend, RightScale
Moving to the Cloud: AWS, Zend, RightScaleMoving to the Cloud: AWS, Zend, RightScale
Moving to the Cloud: AWS, Zend, RightScalemmoline
 
Oracle Peoplesoft on AWS: A quick introduction
Oracle Peoplesoft on AWS: A quick introductionOracle Peoplesoft on AWS: A quick introduction
Oracle Peoplesoft on AWS: A quick introductionTom Laszewski
 
Migrating enterprise workloads to AWS
Migrating enterprise workloads to AWS Migrating enterprise workloads to AWS
Migrating enterprise workloads to AWS Tom Laszewski
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersAmazon Web Services
 
Best practices for_scaling_java_applications_with_distributed_caching
Best practices for_scaling_java_applications_with_distributed_cachingBest practices for_scaling_java_applications_with_distributed_caching
Best practices for_scaling_java_applications_with_distributed_cachingyamingd
 
Amazon Web Services Building Blocks for Drupal Applications and Hosting
Amazon Web Services Building Blocks for Drupal Applications and HostingAmazon Web Services Building Blocks for Drupal Applications and Hosting
Amazon Web Services Building Blocks for Drupal Applications and HostingAcquia
 
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million UsersAWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million UsersAmazon Web Services
 
Migrating enterprise workloads to AWS
Migrating enterprise workloads to AWSMigrating enterprise workloads to AWS
Migrating enterprise workloads to AWSTom Laszewski
 
cse40822-amazon.pptx
cse40822-amazon.pptxcse40822-amazon.pptx
cse40822-amazon.pptxprathamgunj
 
Web App Security -Pradeep K.pptx
Web App Security -Pradeep K.pptxWeb App Security -Pradeep K.pptx
Web App Security -Pradeep K.pptxPradeepK344324
 
AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
 AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million UsersAmazon Web Services
 
AWS Webcast - How to Migrate On-premise NAS Storage to Cloud NAS Storage
AWS Webcast - How to Migrate On-premise NAS Storage to Cloud NAS StorageAWS Webcast - How to Migrate On-premise NAS Storage to Cloud NAS Storage
AWS Webcast - How to Migrate On-premise NAS Storage to Cloud NAS StorageAmazon Web Services
 
Scaling up to your first 10 million users - Pop-up Loft Tel Aviv
Scaling up to your first 10 million users - Pop-up Loft Tel AvivScaling up to your first 10 million users - Pop-up Loft Tel Aviv
Scaling up to your first 10 million users - Pop-up Loft Tel AvivAmazon Web Services
 
Harness the Power of Hybrid Cloud with AWS and Avere
Harness the Power of Hybrid Cloud with AWS and AvereHarness the Power of Hybrid Cloud with AWS and Avere
Harness the Power of Hybrid Cloud with AWS and AvereAmazon Web Services
 
Aws webcast - Scaling on AWS 13 08-20
Aws webcast - Scaling on AWS 13 08-20Aws webcast - Scaling on AWS 13 08-20
Aws webcast - Scaling on AWS 13 08-20Amazon Web Services
 

Semelhante a Dcpl cloud computing amazon fail (20)

High Availability in the Cloud - Architectural Best Practices
High Availability in the Cloud - Architectural Best PracticesHigh Availability in the Cloud - Architectural Best Practices
High Availability in the Cloud - Architectural Best Practices
 
Running High Availability Websites with Acquia and AWS
Running High Availability Websites with Acquia and AWSRunning High Availability Websites with Acquia and AWS
Running High Availability Websites with Acquia and AWS
 
Ceate a Scalable Cloud Architecture
Ceate a Scalable Cloud ArchitectureCeate a Scalable Cloud Architecture
Ceate a Scalable Cloud Architecture
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users
 
Moving to the Cloud: AWS, Zend, RightScale
Moving to the Cloud: AWS, Zend, RightScaleMoving to the Cloud: AWS, Zend, RightScale
Moving to the Cloud: AWS, Zend, RightScale
 
Oracle Peoplesoft on AWS: A quick introduction
Oracle Peoplesoft on AWS: A quick introductionOracle Peoplesoft on AWS: A quick introduction
Oracle Peoplesoft on AWS: A quick introduction
 
Migrating enterprise workloads to AWS
Migrating enterprise workloads to AWS Migrating enterprise workloads to AWS
Migrating enterprise workloads to AWS
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users
 
Best practices for_scaling_java_applications_with_distributed_caching
Best practices for_scaling_java_applications_with_distributed_cachingBest practices for_scaling_java_applications_with_distributed_caching
Best practices for_scaling_java_applications_with_distributed_caching
 
AWS Distilled
AWS DistilledAWS Distilled
AWS Distilled
 
Amazon Web Services Building Blocks for Drupal Applications and Hosting
Amazon Web Services Building Blocks for Drupal Applications and HostingAmazon Web Services Building Blocks for Drupal Applications and Hosting
Amazon Web Services Building Blocks for Drupal Applications and Hosting
 
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million UsersAWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
 
Migrating enterprise workloads to AWS
Migrating enterprise workloads to AWSMigrating enterprise workloads to AWS
Migrating enterprise workloads to AWS
 
cse40822-amazon.pptx
cse40822-amazon.pptxcse40822-amazon.pptx
cse40822-amazon.pptx
 
Web App Security -Pradeep K.pptx
Web App Security -Pradeep K.pptxWeb App Security -Pradeep K.pptx
Web App Security -Pradeep K.pptx
 
AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
 AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
 
AWS Webcast - How to Migrate On-premise NAS Storage to Cloud NAS Storage
AWS Webcast - How to Migrate On-premise NAS Storage to Cloud NAS StorageAWS Webcast - How to Migrate On-premise NAS Storage to Cloud NAS Storage
AWS Webcast - How to Migrate On-premise NAS Storage to Cloud NAS Storage
 
Scaling up to your first 10 million users - Pop-up Loft Tel Aviv
Scaling up to your first 10 million users - Pop-up Loft Tel AvivScaling up to your first 10 million users - Pop-up Loft Tel Aviv
Scaling up to your first 10 million users - Pop-up Loft Tel Aviv
 
Harness the Power of Hybrid Cloud with AWS and Avere
Harness the Power of Hybrid Cloud with AWS and AvereHarness the Power of Hybrid Cloud with AWS and Avere
Harness the Power of Hybrid Cloud with AWS and Avere
 
Aws webcast - Scaling on AWS 13 08-20
Aws webcast - Scaling on AWS 13 08-20Aws webcast - Scaling on AWS 13 08-20
Aws webcast - Scaling on AWS 13 08-20
 

Último

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 

Último (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Six Myths about Ontologies: The Basics of Formal Ontology

Dcpl cloud computing amazon fail

  • 1. AMAZON FAIL: DC Public Library’s Lessons Learned from the Amazon Cloud Outage (Friday, June 24, 2011)
  • 2. BACKGROUND
      • DClibrary.org was the first major DC Government website to use cloud-based hosting, beginning circa June 2009
      • Initial architecture designed to leverage the low cost of large Amazon Web Services (AWS) instances for database operations and lower-cost small and mid-sized instances for WWW services
      • DClibrary.org Content Management System is Drupal 6
      • Bonus: Experimental Drupal 7 Amazon machine instance available on our website; currently undergoing user testing
  • 3. WHAT WENT WRONG
      • Background: AWS de-couples the physical hard disk space (called Elastic Block Store or EBS) from the CPUs (called “compute instances”)
      • Late April 2011: an AWS engineer mistakenly routed the “backplane” traffic (internal server traffic) that connects EBS to the CPUs through a system that could not handle the load
      • This triggered an alarm; since everything in AWS is redundant, the systems thought the backup EBS drives had all failed simultaneously, causing an overload as the system tried to compensate
      • In a nutshell, it’s almost as if the CPUs no longer had hard drives
  • 4. 2009 ARCHITECTURE
      • June 2009 architecture focused on load balancing and database replication across Amazon Availability Zones
      • SVN machine was also in the cloud
      • Too reliant on one service provider (Amazon)
  • 5. PRE-OUTAGE ARCHITECTURE
      • AWS began a new service called “RDS” (Relational Database Service) in 2010. This was a managed MySQL database service that was more powerful and simpler to administer than running the database ourselves on large servers
      • We migrated to RDS in 2010
      • The remaining architecture, with the mid-instance front ends and load balancers, remained the same
  • 6. KEY LESSONS LEARNED
      • Failover across Amazon’s multiple availability zones is not reliable
          • Multiple availability zones do not imply separate physical or logical facilities!
      • Amazon’s poor communication during the outage compounded this problem
      • Due to Amazon’s poor initial incident-response communications, we decided on the spot to create new machine instances (AMIs) in a different geographic region (US-West vs. US-East) and copy over the “offsite” one-day-old SVN and DB backups
      • Downtime minimized to 1.5 hours; many websites (Reddit, Quora, Foursquare) were down for days
      • Future worst case: Amazon goes completely offline. This means we need a very recent full backup of both WWW and DB instances in a physically and logically separate facility, plus the ability to load balance or change DNS quickly
      • Solution was to scale up Rackspace instances and make daily copies to those servers
  • 8. WHAT WE RECOMMEND
      • Get physically and logically separate backup servers
      • Do nightly full-copy backups to the above servers
      • Have a clear, written process in place for the following things:
          • communicating with superiors about what’s happening
          • what steps need to be taken to fail over
          • when the “worst-case” failover plan is implemented (can be time-based or circumstance-based or both)
      • Either implement automatic load balancing or (not as good) have complete control over your DNS
      • Use a very good alerts monitoring service; some of the best ones are cheap/free. We use binarycanary.com.
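
The written failover process recommended above could be encoded as a small check. What follows is a minimal sketch in Python, not the library's actual tooling: the names `OutageStatus`, `should_fail_over` and `stale_backups`, the 30-minute wait and the 24-hour backup age limit are all assumptions made for illustration. It shows a trigger that is time-based, circumstance-based, or both, plus a check that the nightly copies on the separate servers are recent enough to fail over to.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class OutageStatus:
    """Snapshot of an ongoing outage (all fields are hypothetical)."""
    started_at: datetime          # when monitoring first reported the site down
    provider_acknowledged: bool   # has the provider posted a status update?
    primary_db_reachable: bool    # can the primary database still be reached?


def should_fail_over(status: OutageStatus, now: datetime,
                     max_wait: timedelta = timedelta(minutes=30)) -> bool:
    """Return True when the worst-case failover plan should be triggered."""
    # Time-based trigger: the outage has lasted at least the agreed wait.
    time_trigger = now - status.started_at >= max_wait
    # Circumstance-based trigger: the database is unreachable and the
    # provider has said nothing, so there is no basis for waiting.
    circumstance_trigger = (not status.primary_db_reachable
                            and not status.provider_acknowledged)
    return time_trigger or circumstance_trigger


def stale_backups(last_backup_times: dict[str, datetime], now: datetime,
                  max_age: timedelta = timedelta(hours=24)) -> list[str]:
    """Return the names of backups older than max_age, sorted by name."""
    return sorted(name for name, taken_at in last_backup_times.items()
                  if now - taken_at > max_age)
```

In practice the thresholds would come from the written process itself, and a non-empty result from `stale_backups` would be a reason to fix the nightly copy before an outage makes it matter.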