Serverless Days Helsinki 2019 Rolf Koski - Business Driven Availability

This talk examines the issues at play when operating systems that run on public clouds. It should get you thinking about why service levels should not be treated as a sequence of 9s, how to take a more holistic approach, and how to invest the right amount in resilience before going live and running in production. It is equally important to understand the human element, which is where most errors occur, and to minimize both the occurrence and the impact of human errors. The key takeaway is that everything can and eventually will fail, and that your design should let you handle those situations gracefully.

Published in: Internet

  1. Rolf Koski – Business driven availability, April 25 2019
  2. Who am I: Rolf Koski, CTO, Cybercom AWS Business Group, rolf.koski@cybercom.com, rolle, therolle - “Guy with the sticker” - Cloud Advisor & Evangelist - Community Leader - AWS Partner Ambassador - Well-Architected Lead
  3. Why SLAs are not an excuse for poor architectures. Disclaimer: this presentation makes you ask questions more than it gives answers…
  4. Everything Fails (so if you think you can have 100%, you are lying to yourself)
  5. SL(A) ? Objective vs. Agreement
  6. Quick introduction to availability arithmetic
  7. Aggregate availability – Series: 99% and 99% give ~98%
  8. Aggregate availability – Parallel: 99% and 99% give 99.99%
  9. Aggregate availability – Combination: four 99% components give 99.98%
  10. Aggregate availability – Partial failure: four 99% components, 99.98%, 20% failing
  11. Function execution availability (diagram: parallel execution over time)
  12. Function execution availability (diagram: parallel execution over time; one execution fails; 95%?)
  13. Function execution availability (diagram: parallel execution over time; one execution fails; 95%?; retry; success)
  14. Service Level is not just nines
  15. Service Level is not just nines • What service is provided • How it is supported • During which time service is to be provided • What performance is to be expected • What are the responsibilities of the agreement parties
  16. The Serverless Promise: Built-in availability & fault tolerance
  17. The Serverless Promise: Built-in availability & fault tolerance • What about cold starts? • What about endless retries or multiple executions? • What about “dead letters”? • What about timeouts? • What about running out of memory?
  18. SLA Credits Suck (and they have no real business value whatsoever)
  19. Example: S3 SLA. Monthly Uptime Percentage equal to or greater than 99.0% but less than 99.9%: Service Credit Percentage 10%. Monthly Uptime Percentage less than 99.0%: Service Credit Percentage 25%. In literal terms: for 1 TB of data which was unavailable for up to 7 hours and 12 minutes, you get service credits for $2.34
  20. The Cost of Availability (and when enough is enough)
  21. Total Cost of Service Level (chart: Cost against Number of 9’s, with curves for Cost of breach and Cost of service level target)
  22. So, how to decide what to optimize?
  23. Analyze & classify
  24. Analysis • How much is loss/corruption of data worth to you? • How much is downtime worth to you? • How much is a malicious breach worth to you? • How much is your public image worth to you? • How much are you willing to invest in advance? • How much are you willing to set aside for corrective action? • How much risk are you willing to accumulate with regard to legislation, compliance and similar?
  25. Classification • Business criticality • Data privacy / confidentiality • Availability • Consistency • Resiliency • Original or derivative
  26. Everything is not equal
  27. Your most valuable availability metric is probably not in %
  28. Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden). Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer). Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).
  29. It’s actually not IF it works, but HOW it works
  30. Some real advice
  31. Some real advice • Automation and deployment pipeline • Infrastructure as Code • Versioning and ability to roll back • Deployment scenarios (A/B, B/G, Canary) • Immutable and stateless • Origin data vs. recomputable data • Feature flags and support for partial failure • Throttling and DLQs • Multi-AZ, multi-region • Monitoring: shallow & deep
  32. Resilient Design
  33. Resilient Design • People • Application implementation • Network & Data architecture • Infrastructure
  34. Humans fail too. (Actually, more than you’d like)
  35. Who is responsible in the Cloud? (It’s You)
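
The availability arithmetic on slides 7 to 13 and the downtime figure on slide 19 can be checked with a few lines of Python. This is a minimal sketch, not the speaker's own material: the function names are hypothetical, components are assumed to fail independently, the "combination" topology (two parallel pairs placed in series) is an assumption chosen because it reproduces the 99.98% on the slide, and the "95%?" on slide 12 is read here as a per-execution success rate.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one component must be up: 1 minus the chance that all are down."""
    return 1 - prod(1 - a for a in availabilities)


def with_retries(success_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent executions succeeds."""
    return 1 - (1 - success_rate) ** attempts


def allowed_downtime_hours(uptime: float, days: int = 30) -> float:
    """Downtime budget in hours for a month of `days` days at the given uptime."""
    return (1 - uptime) * days * 24


if __name__ == "__main__":
    a = 0.99
    pair = parallel(a, a)  # one redundant pair of 99% components

    print(f"series (slide 7):       {series(a, a):.2%}")
    print(f"parallel (slide 8):     {pair:.2%}")
    # Assumed topology: two redundant pairs, one after the other.
    print(f"combination (slide 9):  {series(pair, pair):.2%}")
    # Assumed reading of "95%?": each execution succeeds 95% of the time.
    print(f"one retry (slide 13):   {with_retries(0.95, 2):.2%}")
    # 99.0% uptime over a 30-day month leaves about 7.2 hours, i.e. 7 h 12 min.
    print(f"downtime budget at 99%: {allowed_downtime_hours(0.99):.1f} h")
```

The independence assumption is the weak point of this kind of calculation: components that share a region, a deployment pipeline, or an operator tend to fail together, which is one reason the talk argues that a service level is more than a count of nines.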
