This talk, given at All Things Open 2018 in Raleigh, covers how we at Expedia are trying to get greater observability into our stack using Haystack, our open-sourced distributed tracing and analysis system.
Observability Events
Logs: stateless events generated by the application
Metrics: timeseries events containing measurements
Traces: correlated events to track causal ordering
Traces
A span typically represents a service call or a block of code
A trace represents a collection of spans correlated by an identifier
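To make these definitions concrete, here is a minimal sketch of the data a span typically carries. The field set follows common tracing models and the IDs discussed later in this talk; it is an assumption for illustration, not Haystack's actual span schema (which is a protobuf message).

```java
import java.util.HashMap;
import java.util.Map;

// Minimal illustrative span. Field names follow common tracing models,
// not Haystack's real protobuf schema.
public class Span {
    String traceId;       // shared by every span in the same trace
    String spanId;        // unique to this span
    String parentSpanId;  // the caller's spanId; empty for the root span
    String serviceName;   // which service emitted this span
    String operationName; // which endpoint or code block was invoked
    long startTimeMicros; // when the call began
    long durationMicros;  // how long it took
    Map<String, String> tags = new HashMap<>(); // e.g. error=true, siteid=us

    boolean isRoot() {
        return parentSpanId == null || parentSpanId.isEmpty();
    }
}
```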
Distributed tracing tracks production requests as they traverse different parts of the architecture.
Traces – Context Propagation
Traces – Why do I care?
In today's microservice architectures, there is a lot going on in the backend while serving a request: multiple service interactions, several levels of resiliency, multiple layers of caching, etc. So when something goes wrong, it is not always evident why it happened.
Observability is the ability to understand and troubleshoot our systems in production by collecting a series of timestamped events.
These events can be either request-scoped or system-scoped. A garbage-collection event would most likely not be associated with a request, whereas a response-time event would be.
For the sake of this presentation, we are going to talk about events that are request-scoped.
So what kinds of events are we talking about here? They can be broadly classified into three types:
1. Logs
2. Metrics
3. Traces
Collecting each kind of event has its own use cases, but the boundaries between them are not very clear. For instance, an audit log that records the response time of each incoming request can be used to compute an average-response-time metric; in that case, as you can see, you never explicitly collect a metric event.
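As a small illustration of that point, the sketch below derives an average-response-time metric purely from audit-log lines; the log format shown is hypothetical.

```java
import java.util.List;

// Hypothetical audit-log format: "GET /checkout 200 183ms", where the
// last token is the response time. No metric event is ever emitted; the
// metric is computed from the log events after the fact.
public class LogDerivedMetric {
    static double averageResponseTimeMs(List<String> auditLogLines) {
        return auditLogLines.stream()
                .map(line -> line.substring(line.lastIndexOf(' ') + 1)) // "183ms"
                .mapToDouble(token -> Double.parseDouble(token.replace("ms", "")))
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "GET /checkout 200 183ms",
                "GET /checkout 200 97ms",
                "GET /checkout 500 412ms");
        System.out.println(averageResponseTimeMs(lines)); // 230.66...
    }
}
```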
Distributed tracing tracks production requests by correlating the different service interactions in the architecture.
Context propagation
Reduce time to triage by contextualizing errors and delays
Visualize latencies over the network
This is the architecture of the Haystack system. Kafka is the central nervous system backing Haystack.
1. Componentized: Haystack includes all of the necessary subsystems to make the system ready to use, but the overall system is designed so that you can replace any given subsystem to better meet your own needs.
2. Resilient: There is no single point of failure.
3. Scalable: The system is completely decentralized, which lets us scale every component individually.
The architecture can be broken down into 3 parts:
Subsystems: Haystack includes various subsystems for tracing, trending, the service graph, etc. We will go over these subsystems in a bit.
Data Stores: We have 3 data stores: Cassandra to store the raw stitched spans, i.e., traces; Elasticsearch as an indexer so the data can be queried faster; and MetricTank, backed by Cassandra, to store trends in the metrics 2.0 format.
Visualization: Haystack UI is a central place to visualize the processed data, such as traces, trends, and alerts, from the various Haystack subsystems.
Let’s see the subsystems one by one.
I will do deep dives into the use cases and architecture of each of Haystack's current subsystems.
The Traces subsystem is mandatory; the others are optional. When you deploy, you can configure Haystack to run only a subset of them, as long as Traces is included.
Some of them depend on others. Specifically, Anomaly Detection requires Trends, since you need trends to detect anomalies: the output of Trends goes into Kafka, and Anomaly Detection picks it up from there.
Feel free to add any new subsystem on top of the Kafka backbone. It doesn't need to be part of Haystack's repositories: if you need something specific to your company's needs, you can build it and run it on top of Haystack's Kafka. You don't need to come and talk to us about adding anything new.
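As a rough sketch of what such a custom subsystem amounts to, here is a plain Kafka consumer reading from the span topic. The topic name "proto-spans", the broker address, and the byte-array span encoding are assumptions about the deployment; check your configuration for the real values.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// A hypothetical custom subsystem on Haystack's Kafka backbone: just a
// consumer in its own consumer group, so the built-in subsystems
// (indexer, trends, ...) are unaffected.
public class MyCustomSubsystem {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // assumed broker address
        props.put("group.id", "my-custom-subsystem");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("proto-spans")); // assumed span topic name
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, byte[]> record : records) {
                    // deserialize the span and apply your company-specific logic here
                }
            }
        }
    }
}
```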
Demo
If you know the traceId, you can jump straight to the timeline/waterfall view showing how a single end-user request was served inside your system.
In this example, the user request was to the stark service at /stark/endpoint.
You might have used Zipkin or Jaeger before
Use cases:
Identifying the root cause of errors
Finding performance bottlenecks
Understanding the flow of requests
OpenTracing compliant.
We use three IDs: traceId, spanId, and parentSpanId.
The spanId needs to be passed from one service to the next; it is up to your code whether to pass it in an HTTP header or in the payload. When the next service logs its own span, it uses the caller's spanId as its parentSpanId.
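Here is a minimal sketch of that propagation logic. The header names ("Trace-ID", "Span-ID") and the use of UUIDs are assumptions for illustration; use whatever convention your tracing client library defines.

```java
import java.net.http.HttpRequest;
import java.util.UUID;

// Illustrative context propagation; header names are hypothetical.
public class ContextPropagation {
    // Outgoing call: this service's spanId travels with the request, and
    // the callee will record it as the parentSpanId of its own span.
    static HttpRequest.Builder inject(HttpRequest.Builder request,
                                      String traceId, String currentSpanId) {
        return request
                .header("Trace-ID", traceId)
                .header("Span-ID", currentSpanId);
    }

    // Incoming call: start a child span that keeps the traceId, gets a
    // fresh spanId, and points its parentSpanId at the caller's spanId.
    static String[] startChildSpan(String traceIdHeader, String callerSpanIdHeader) {
        String traceId = traceIdHeader != null ? traceIdHeader : UUID.randomUUID().toString();
        String spanId = UUID.randomUUID().toString();
        String parentSpanId = callerSpanIdHeader; // null for a root request
        return new String[] {traceId, spanId, parentSpanId};
    }
}
```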
We are also looking into supporting Zipkin-style IDs, which have a slight but crucial difference.
Use case:
You might not have traceIds handy.
For example, let's say your site has started showing intermittent errors for the US siteId; you might want to search for traces where error = true and siteid = us and inspect the traces for that scenario.
You can set up a number of whitelisted fields, and they become searchable in haystack-ui.
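Under the hood, such a search boils down to a tag query against the Elasticsearch index. The sketch below shows roughly what that looks like; the index name, field paths, and query shape are assumptions for illustration, and in practice haystack-ui builds and issues the query for you.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical search for traces with error=true and siteid=us. The
// index name and field paths are illustrative, not Haystack's real mapping.
public class TraceSearch {
    public static void main(String[] args) throws Exception {
        String query = """
                {"query": {"bool": {"filter": [
                    {"term": {"spans.tags.error": "true"}},
                    {"term": {"spans.tags.siteid": "us"}}
                ]}}}""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://elasticsearch:9200/haystack-traces/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // matching traces/traceIds
    }
}
```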
Click on any of these traces and you will get the timeline/waterfall view
On the architecture side, there are two apps in the Traces subsystem:
Indexer
Reader
The Trends subsystem is responsible for reading spans and generating vital service health trends.
Here we introduce a new term: operation. Use the user service -> loyalty service example to explain what a service is and what an operation is.
This system is loosely coupled and can be run on demand. It has two components:
haystack-span-timeseries-transformer: responsible for reading spans and converting them to metrics 2.0 compatible MetricPoints, which are then pushed back to Kafka.
haystack-timeseries-aggregator: responsible for reading metric points, aggregating them based on rules, and pushing the aggregated metric points back to Kafka. The metric points are MetricTank compliant and can be consumed directly by MetricTank, a timeseries database.
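Conceptually, the transformer is a small Kafka Streams topology: spans in, metric points out. The sketch below shows that shape; the topic names, string serdes, and the toMetricPoints helper are assumptions for illustration, not the real implementation.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

// Illustrative span -> MetricPoint topology; topic names and the
// string encoding of spans are hypothetical.
public class SpanToTimeseriesSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("proto-spans")
                // e.g. one point for duration and one for success/failure,
                // each tagged with serviceName and operationName
                .flatMapValues(span -> toMetricPoints(span))
                .to("metric-points");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "span-timeseries-transformer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }

    static List<String> toMetricPoints(String span) {
        return List.of(); // placeholder: parse the span, build metrics 2.0 points
    }
}
```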
Currently we compute four trends for each combination of service and operation. These are:
total_count [count]
success_count [count]
failure_count [count]
duration [mean, median, std-dev, 99 percentile, 95 percentile]
Each trend is computed for 4 intervals [1min, 5min, 15min, 1hour].
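To show the arithmetic behind those trends, here is a minimal aggregate for one (service, operation) pair and one interval bucket. The real haystack-timeseries-aggregator is rule-driven and windowed over Kafka; this sketch only illustrates the computed statistics.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// One interval bucket (e.g. 1min) for one service/operation combination.
class IntervalAggregate {
    final List<Long> durationsMicros = new ArrayList<>();
    long totalCount, successCount, failureCount;

    void record(long durationMicros, boolean success) {
        totalCount++;
        if (success) successCount++; else failureCount++;
        durationsMicros.add(durationMicros);
    }

    // Nearest-rank percentile: percentile(0.99) gives the TP99 duration
    // trend for this interval; percentile(0.95) gives TP95.
    long percentile(double p) {
        Collections.sort(durationsMicros);
        int idx = (int) Math.ceil(p * durationsMicros.size()) - 1;
        return durationsMicros.get(Math.max(idx, 0));
    }
}
```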
The Alerts view is used to show alerts for any anomalous behavior in the service health trends. Currently Haystack alerts on total count, failure count, and duration (TP99). These alerts are powered by the adaptive alerting system, another of Expedia's OSS projects.
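To illustrate the idea of alerting on anomalous trend behavior, here is a deliberately naive z-score detector. This is not how adaptive alerting works; Expedia's adaptive alerting project uses much more sophisticated, self-tuning models, and this sketch only shows the basic notion of flagging a point that deviates from a trend's recent history.

```java
// Naive anomaly check: flag the latest trend value if it sits more than
// `threshold` standard deviations away from the recent mean.
public class NaiveAnomalyDetector {
    static boolean isAnomalous(double[] recentValues, double latest, double threshold) {
        double mean = 0;
        for (double v : recentValues) mean += v;
        mean /= recentValues.length;

        double variance = 0;
        for (double v : recentValues) variance += (v - mean) * (v - mean);
        double stdDev = Math.sqrt(variance / recentValues.length);

        return stdDev > 0 && Math.abs(latest - mean) / stdDev > threshold;
    }

    public static void main(String[] args) {
        double[] tp99History = {210, 195, 220, 205, 215}; // ms, one point per minute
        System.out.println(isAnomalous(tp99History, 450, 3.0)); // true: latency spike
    }
}
```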