This document contains the slides from a workshop on observability presented by Kevin Crawley of Instana and Single Music. The workshop covered distributed tracing using Jaeger and Prometheus, challenges with open source monitoring tools, and advanced use cases for distributed tracing demonstrated through Single Music's experience. The agenda included labs on setting up Kubernetes and applications, monitoring metrics with Grafana and Prometheus, distributed tracing with Jaeger, and analytics use cases.
2. Observability Workshop
w/ Jaeger and Prometheus
https://bit.ly/ot-ee-workshop
3. A system is observable if the behavior of the
entire system can be determined by only looking
at its inputs and outputs.
Lesson: control theory is a well-documented
approach we can learn from rather than trying
to reinvent
What is Observability?
Kalman, 1961, "On the general theory of control systems"
"Observability aims to provide highly granular insights
into the behavior of production systems along with
rich context, perfect for debugging and performance
analysis purposes." – Cindy Sridharan @copyconstruct
What is the goal of observability?
5. How many of you are running
staging environments?
Why does my organization need
Observability?
6. Now, how many of you
actually trust your staging
environments?
Why does my organization need
Observability?
10. • Gain a basic understanding of Distributed Tracing and
“How it works”
• Implement Metrics and Tracing in a small microservice
app using FOSS tools
• Understand how metrics and distributed tracing can help
your organization manage complexity
• Understand the limitations of FOSS and the challenges
ahead
What are the goals of this workshop?
11. • Workshop (1-1.5 hours)
• How Does Distributed Tracing Work
• Challenges with FOSS monitoring
• Advanced Use Cases w/ Distributed Tracing
(Single Music)
• Q&A
Agenda
12. Lab 01 - Setting up Kubernetes in Docker Enterprise Edition Lab
Lab 02 - Setting up GitLab, our Microservice Application
Repository, and Kubernetes Integration
Lab 03 - Deploying our Microservice Application and Adding
Observability
Lab 04 - Monitoring Application Metrics with Grafana / Prometheus
Lab 05 - Observing with Jaeger and Breaking Things
Lab 06 - Advanced Analytics and Use Cases with Automated
Distributed Tracing
Workshop Overview
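Lab 04 boils down to exposing application counters in the Prometheus text exposition format and graphing them in Grafana. A minimal stdlib-only sketch of what a `/metrics` endpoint returns (the metric name is a common convention, not taken from the labs; real services would use a Prometheus client library):

```python
def render_prometheus(counters):
    """Render {metric_name: (help_text, value)} counters in the
    Prometheus text exposition format, i.e. the payload a /metrics
    endpoint serves for Prometheus to scrape."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

page = render_prometheus({
    "http_requests_total": ("Total HTTP requests served.", 1027),
})
```

Grafana then queries Prometheus (e.g. `rate(http_requests_total[5m])`) to build the dashboards used in the lab.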
13. Observability Workshop
w/ Jaeger and Prometheus
https://bit.ly/ot-ee-workshop
At runtime, custom headers / metadata are injected into
each request; these include identifiers that enable tracing
backends to correlate spans across requests
• X-B3-TraceId: 128 or 64 lower-hex encoded bits
• X-B3-SpanId: 64 lower-hex encoded bits
• X-B3-ParentSpanId: 64 lower-hex encoded bits
• X-B3-Sampled: Bool
• X-B3-Flags: "1" enables DEBUG
It’s literally just headers / meta data
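Because it really is just headers, the whole scheme fits in a few lines. A hypothetical helper (real services would rely on a tracing client such as Jaeger's rather than hand-rolling IDs):

```python
import secrets

def make_b3_headers(incoming=None):
    """Build the B3 headers for an outgoing request.

    With no incoming headers we start a new trace; otherwise we keep
    the trace id, open a fresh span, and record the span we received
    as the parent -- exactly the relationships a trace backend uses
    to stitch the spans back together.
    """
    span_id = secrets.token_hex(8)  # 64 lower-hex encoded bits
    if incoming is None:
        return {
            "X-B3-TraceId": secrets.token_hex(16),  # 128 bits
            "X-B3-SpanId": span_id,
            "X-B3-Sampled": "1",
        }
    return {
        "X-B3-TraceId": incoming["X-B3-TraceId"],      # never changes
        "X-B3-SpanId": span_id,                        # new span per hop
        "X-B3-ParentSpanId": incoming["X-B3-SpanId"],  # links the hops
        "X-B3-Sampled": incoming.get("X-B3-Sampled", "1"),
    }
```

Each hop repeats the same move: copy the trace id, mint a span id, demote the received span id to parent, as the HTTP example on the next slide shows.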
16. HTTP Request Example
service-a requests:
GET service-b:8080/api/groceries
X-B3-TraceId: af38bc9
X-B3-SpanId: b9ca
X-B3-ParentSpanId: nil
service-b receives:
GET service-b:8080/api/groceries
X-B3-TraceId: af38bc9
X-B3-SpanId: b9ca
X-B3-ParentSpanId: nil
service-b requests:
GET service-c:8080/api/products
X-B3-TraceId: af38bc9
X-B3-SpanId: a3bc
X-B3-ParentSpanId: b9ca
service-c receives:
GET service-c:8080/api/products
X-B3-TraceId: af38bc9
X-B3-SpanId: a3bc
X-B3-ParentSpanId: b9ca
18. • Correlation is nearly impossible across
multiple vendors / solutions (Logging, Metrics,
Traces)
• Large scale applications require equally large
scale monitoring (cpu/mem, i/o, distributed
systems, clustered storage, sharded TSDB)
Challenges of FOSS monitoring
Is anything ever truly free?
19. • Distributed tracing exposes a lot of data which
goes unanalyzed by FOSS tools
• The same holds true for Metrics and Logging
• … and Alerting
Actually, can I just show you what is possible?
Current solutions only collect / display
There is no analysis of the data
21. • Operated by 3 engineers (1 FE/1 BE/1 SRE)
• Over 20k transactions / hour, 20+ integrations,
100k LOC, with less than 15% test coverage
• Launched in 2018 with 15 microservices on
Docker Swarm – has since expanded to over 28
microservices with zero additional engineering
personnel
31. • DBO (Hibernate Query) causing O(n log n)
rise in latency and processing time
• Application Dashboard indicated an issue with
overall latency increasing
• Fix deployed and improvement was observed
immediately
Rise in Latency + Processing Time
34. • We implemented Redis for caching, and
processing time went down
• However, we didn’t account for token policies
changing and they suddenly began to expire
after 30 seconds
• Alerting around error rates for this endpoint
raised our awareness around this issue
Caching Solved one problem
… but caused another
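The failure mode above is easy to reproduce: the cache's TTL was chosen when tokens lived much longer, so once the issuer's policy dropped to 30 seconds, cached tokens outlived their validity. A minimal in-process stand-in for the Redis cache (illustrative only; all names invented):

```python
import time

class TokenCache:
    """Cache tokens for `cache_ttl` seconds. If the token issuer's
    own expiry drops below cache_ttl, we keep serving tokens that
    are already dead -- the bug the error-rate alert surfaced."""

    def __init__(self, cache_ttl, fetch):
        self.cache_ttl = cache_ttl
        self.fetch = fetch    # callable that obtains a fresh token
        self._store = {}      # key -> (token, expires_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        hit = self._store.get(key)
        if hit is not None and now < hit[1]:
            return hit[0]     # cache hit: no round-trip
        token = self.fetch()
        self._store[key] = (token, now + self.cache_ttl)
        return token
```

Alerting on the endpoint's error rate, rather than on latency alone, is what surfaced the mismatch: the stale tokens failed fast, so latency dashboards looked healthy.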
38. Context is critical when doing
Contributing Factor Analysis
Metrics are not
standalone, they
have relationships
41. Logs can benefit from analytics too!
Let’s not forget about
Logging
45. We utilize a mix of Instana, Logz.io
and Grafana to manage our systems
Custom Dashboards
deliver peace of mind
48. • Using FOSS monitoring is a great way to both
learn and demonstrate the value of
observability to your peers
• Understand the limitations of FOSS and be
prepared to invest in either 3rd-party tooling or
managing your own monitoring infrastructure
Focus on what matters to your business
… at Single Music we focus on delivering music
49. Schedule a meeting with me!
Want to learn more?
Come visit our booth @
Instana Booth #S23
50. Rate & Share
Rate this session in the
DockerCon App
Follow me @notsureifkevin
and share #DockerCon