
Reliability at scale

DevOpsDaysIndia2017
Talk by Praveen Shukla (Product Engineer at Go-Jek)

  1. Reliability At Scale, Praveen Shukla
  2. About Me: @_praveenshukla, @achilles42, Chaos Kid (Product Engineer)
  3. - Transport, logistics, hyperlocal delivery and payments startup in Indonesia - 18 products - 40 million downloads - 20 million active users - 300K drivers on the platform - Top 50 Innovative Companies on Fortune (Go-Jek at #17) - 1M TPS (internal API calls)
  4. Agenda: - What is reliability? - Simple solutions to achieve reliability when Go-Jek was small - How we grew tremendously - Problems while scaling - How we came up with better, scalable solutions - Iteration
  5. Define “Reliable”
  6. ● 4 nines ● MTBF ● Failures per year (AFR) ● QoS ● SLA
  7. “FAILURES”: the system operates outside its specified parameters.
  8. Business: users are complaining
  9. Failure is subjective!
  10. In 2015
  11. Always a trade-off between velocity and stability
  12. ● CI / CD ● Configuration management ● Monitoring ● Alerting
  17. In 2017
  18. 2015 vs. 2017: ● Products: 4 → 18 (4X) ● Microservices: 10+ → 250+ (25X) ● Instances: 100+ → 8K+ across 3 datacenters (80X) ● Tech people: 50+ → 350+ (7X)
  19. CI / CD: ● Pipeline access management ● Custom deployments ● DSL repo management ● No branch-based deployments
  20. Configuration management (CM): ● Every service has its own cookbook ● Cookbook dependency management
  21. Monitoring & Alerting: ● Alerts getting lost ● Alerts not reaching the right person ● Too many people getting too many pages ● Unclear who is responsible for acting on a particular alert
  22. Serious production outages. Business loss.
  23. SOLUTIONS
  24. GitLab and GitLab CI: ● Single place for code, build and deploy ● Access control ● The CI pipeline is just a YAML file ● The CI file is part of the same source code ● Freedom to tweak the pipeline for a specific use case ● Branch- and tag-based deploys come effortlessly
  25. Configuration Management
  26. ● Master cookbook concept ● Single cookbook to manage ● Number of stacks == number of cookbooks
  27. Smart Alert Router: ● Every product has a group ● A group can have multiple microservices ● A microservice can have multiple servers ● One member can belong to many groups
  28. Architecture (diagram components): VM ● Telegraf (agent on VM) ● TS database ● Grafana (visualization) ● Kapacitor (alerting) ● Smart Alert Router
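
The group model on slide 27 can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `Group`, `AlertRouter` and `route`, and all sample products, hosts and members, are hypothetical and are not taken from Go-Jek's router.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Group:
    """One product group: owns microservices and has members to page."""
    name: str
    members: set[str] = field(default_factory=set)
    # microservice name -> hostnames of the servers running it
    services: dict[str, set[str]] = field(default_factory=dict)


class AlertRouter:
    """Hypothetical router: maps an alerting server to the owning group's members."""

    def __init__(self, groups: list[Group]) -> None:
        self.groups = groups

    def route(self, server: str) -> set[str]:
        """Return the members to notify for an alert raised on `server`."""
        recipients: set[str] = set()
        for group in self.groups:
            for servers in group.services.values():
                if server in servers:
                    recipients |= group.members
        return recipients


# Sample data; every name here is made up for illustration.
ride = Group("ride", {"asha", "budi"}, {"booking": {"booking-01", "booking-02"}})
pay = Group("pay", {"budi", "citra"}, {"wallet": {"wallet-01"}})
router = AlertRouter([ride, pay])

print(sorted(router.route("wallet-01")))
```

Here `budi` belongs to both groups, so he is notified for alerts from either product, while an alert from `wallet-01` never reaches `asha`.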
  29. Multiple Dependencies (diagram labels): R = 99 %
  30. Multiple Dependencies (diagram labels): R = 96 %; R = 99 %, R = 99 %, R = 99 %
  31. Multiple Dependencies (diagram labels): R = 88 %; R = 96 %, R = 96 %, R = 96 %; R = 99 %, R = 99 %, R = 99 %, R = 99 %, R = 99 %
  32. Multiple Dependencies (diagram labels): R = 88 %; R = 96 %, R = 96 %, R = 96 %; R = 99 %, R = 99 %, R = 99 %, R = 99 %, R = 99 %. CIRCUIT BREAKERS!!
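
Slide 32 names circuit breakers as the answer to compounding dependency failures. A minimal sketch of the pattern follows, assuming a simple consecutive-failure threshold and a fixed cool-down; the class name, parameters and defaults are illustrative and do not describe the implementation used in the talk.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative breaker: opens after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable for testing
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on an unhealthy dependency.
                raise CircuitOpenError("dependency call skipped")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound dependency call, for example `breaker.call(fetch_price, order_id)` with a hypothetical `fetch_price`, and serve a fallback when `CircuitOpenError` is raised.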
  33. OPTION 1:
  34. OPTION 2:
  35. How does it really affect your system? ● 99.99%^30 = 99.7% uptime (30 dependencies, each with 99.99% uptime) ● 0.3% of 1 billion requests = 3,000,000 failures ● 2+ hours of downtime every month, even when every dependency has excellent uptime
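
The arithmetic on slide 35 can be checked with a short script. It assumes the dependencies fail independently and that a request needs all of them; the function name and the 30-day month are assumptions made for this sketch.

```python
def combined_availability(availability: float, dependencies: int) -> float:
    """Availability of a request that needs every one of `dependencies`
    independent services, each up `availability` of the time."""
    return availability ** dependencies


uptime = combined_availability(0.9999, 30)
failed = (1 - uptime) * 1_000_000_000    # failures per 1 billion requests
downtime_hours = (1 - uptime) * 30 * 24  # assumes a 30-day month

print(f"uptime: {uptime:.4%}")
print(f"failed requests per billion: {failed:,.0f}")
print(f"downtime per month: {downtime_hours:.2f} h")
```

The results come out at roughly 99.7% uptime, about 3 million failed requests per billion, and a little over 2 hours of downtime per month, in line with the slide's rounded figures. Under one reading of the Multiple Dependencies slides, the same function gives `combined_availability(0.99, 4)` ≈ 0.96 and `combined_availability(0.96, 3)` ≈ 0.88.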
  36. Queuing Delay: Delay = p / (1 - p), where p = system utilization
  37. Queuing Delay: Delay = p / (1 - p), where p = system utilization. THROTTLE YOUR SYSTEM!!
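
To see why slide 37 recommends throttling, the sketch below evaluates p / (1 - p) at a few utilization levels and adds a very simple admission limit. It treats the delay as a unitless factor (a multiple of the service time), and the `Throttle` class with its 80% cap is a hypothetical example, not the mechanism used in the talk.

```python
def queuing_delay(p: float) -> float:
    """Relative queuing delay p / (1 - p) for utilization 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("utilization must be in [0, 1)")
    return p / (1 - p)


class Throttle:
    """Hypothetical admission control: reject work above a utilization cap."""

    def __init__(self, capacity: int, max_utilization: float = 0.8) -> None:
        # Maximum number of requests allowed in flight at once.
        self.limit = int(capacity * max_utilization)
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.limit:
            return False  # shed load instead of letting the queue grow
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight = max(0, self.in_flight - 1)


for p in (0.5, 0.8, 0.9, 0.99):
    print(f"p = {p:.2f} -> delay factor {queuing_delay(p):.1f}")
```

The delay factor grows from 1 at 50% utilization to about 99 at 99% utilization, so capping utilization well below 100% keeps latency bounded at the cost of rejecting some requests.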
  38. Reliability is an iterative process
  39. Thanks! Questions? @_praveenshukla
