O slideshow foi denunciado.
Utilizamos seu perfil e dados de atividades no LinkedIn para personalizar e exibir anúncios mais relevantes. Altere suas preferências de anúncios quando desejar.

How to Build Observable Distributed Systems

49 visualizações

Publicada em

Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2HaAz9v.

Pierre Vincent covers key techniques to build a clearer picture of distributed applications in production, including details on useful health checks, best practices for instrumentation with metrics, logging and tracing. Filmed at qconlondon.com.

Pierre Vincent is SRE manager at Poppulo, where he helps teams embrace DevOps practices, focusing on building maintainable applications and continuously improving their processes. He strongly believes that DevOps is a key turning point for our industry, finally bringing together Continuous Delivery practices and Lean philosophy, all supported by safe culture of learning and innovation.

Publicada em: Tecnologia
  • Seja o primeiro a comentar

How to Build Observable Distributed Systems

  1. 1. @PierreVincent How to build observable distributed systems March 8th, 2018 – QCon London @PierreVincent pvincent.io
  2. 2. InfoQ.com: News & Community Site Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ observable-distributed-ststems • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  4. 4. @PierreVincent
  5. 5. @PierreVincent Pierre Vincent SRE Manager at Poppulo @PierreVincent pvincent.io
  6. 6. @PierreVincent Reaching production is only the beginning
  7. 7. @PierreVincent No system is immune to failure Be ready to recover
  8. 8. @PierreVincent When distributing a system, we’re also distributing the places where things might go wrong
  9. 9. @PierreVincent Monitoring only applies to known failure modes. What about everything else?
  10. 10. @PierreVincent Monitoring tells you whether the system works. Observability lets you ask why it's not working. “ ”– Baron Schwarz
  11. 11. @PierreVincent Healthchecks Metrics Logging Tracing
  12. 12. @PierreVincent Healthchecks
  13. 13. @PierreVincent Is it running? Can it perform its task? Can it accept more work? Healthchecks
  14. 14. @PierreVincent Broadcast Register Expose Healthchecks
  15. 15. @PierreVincent Healthchecks /health { "service": "registration-service", "healthy": true, "workload": { "healthy": true }, "dependencies": [ { "name": "cassandra", "healthy": true }, { "name": "billing-svc", "healthy": true }, ] } GET 200 OK
  16. 16. @PierreVincent Source: HTTP Healthchecks for a Resilient Platform - Chris O’Dell skeltonthatcher.com/blog/http-healthchecks-for-a-resilient-platform Overzealous Healthchecks can be counter-productive
  17. 17. @PierreVincent Metrics
  18. 18. @PierreVincent System metrics Application metrics Business metrics CPU usage Error rates Customer conversions Metrics
  19. 19. @PierreVincent Servers / VMs Appliances/Infra Services Metrics collector Metrics query engine Dashboards Alerts Metrics
  20. 20. @PierreVincent Servers / VMs Appliances/Infra Services /metrics /metrics /metrics Prometheus Metrics
  21. 21. @PierreVincent Watch out for over-reliance on metrics Limitations at high-cardinality Not every metric deserves an alert Real-time querying means some trade-offs on retention Limit alerting to user-impacting symptoms Poor fine-grained debugging e.g. CustomerId Not suitable for long-term trend analysis
  22. 22. @PierreVincent Logging
  23. 23. @PierreVincent Searchable Correlated Logging Making sense of (a lot of) logs Centralised
  24. 24. @PierreVincent A F H D J B E C G a1b2c3 a1b2c3 a1b2c3 ERROR [svc=H][trace=a1b2c3] Failed to save order Cause: Cassandra timeout exception ERROR [svc=F][trace=a1b2c3] Failed to complete order Cause: Shipping service responded with 500 ERROR [svc=A][trace=a1b2c3] Failed to process order Cause: Order process manager responded with 500 a1b2c3 INFO [svc=G][trace=a1b2c3] Items verified in stock Log Correlation
  25. 25. @PierreVincent JSON 2018-02-20T16:38:23+00:00 ERROR Read timed out timestamp 2018-02-20T16:38:23+00:00 requestID ec667cb45 level ERROR team eventsservice registration-service commit 542a8b8e build 542a8b8e.7 node node_e79f3e52 log Read timed out region europe-west2 runtime java-1.8.0_161 When did it happen? Can I trace it? What is it? What is it running? Where is it? What is the message? customerID 55123Who caused it? userID 458 ... ...Any other info? Hmmm thanks… ?
  26. 26. @PierreVincent Structured logs unleash high-cardinality exploration Error rate spike isolated by build version Activity spike isolated for single customer
  27. 27. @PierreVincent Tracing
  28. 28. @PierreVincent Tracing get_confirmed_attendees get_attendees get_confirm_status cassandra/select mysql/select event-mgt-api attendees-service registration-service Trace Spans 0ms 50ms 100ms 150ms 172ms 172ms 73ms 54ms 78ms 41ms
  29. 29. @PierreVincent
  30. 30. @PierreVincent Unknown Unknowns Known Unknowns Healthchecks Metrics (Alerting) Logs Metrics (Queries) Tracing Events Monitoring & Resiliency Debugging & Exploration
  31. 31. @PierreVincent Cheap to instrument Easy to explore Reliable & Trustworthy Usability of tooling is key to adoption
  32. 32. @PierreVincent Visibility builds trust but requires safety
  33. 33. @PierreVincent Visibility helps justify decisions
  34. 34. @PierreVincent Visibility enables operability
  35. 35. @PierreVincent Test (a little bit*) lessTest (a little bit*) lessDon’t spend all your time testing ...keep some for instrumenting Thank you! @PierreVincent pvincent.io
  36. 36. @PierreVincent Thank you! Pierre Vincent SRE Manager at Poppulo @PierreVincent pvincent.io
  37. 37. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ observable-distributed-ststems