The document proposes a distributed tracing method and tools for Swift object storage. It involves adding middleware to collect trace data with unique IDs as requests pass through Swift components. Trace data including timing would be sent to a repository and correlated to reconstruct processing paths. Analysis tools would allow querying trace data and visualizing span trees to diagnose performance and identify bottlenecks across the distributed Swift infrastructure.
2. Agenda
Background
Tracing Proposal
Tracing Architecture
Tracing Data Model
Tracing Analysis Tools
Reference
3. Background
• Swift is a large-scale distributed object store spanning thousands of nodes
across multiple zones and different regions.
– End-to-end performance is critical to the success of Swift.
– Tools that aid in understanding behavior and reasoning about performance issues are
invaluable.
• Motivation
– For a particular client request X, what is the actual route when it is served by
different services? Does the actual route differ from the expected route, even when we
know the access patterns?
– What is the performance behavior of the server components and third-party services?
Which part is slower than expected?
– How can we quickly diagnose the problem when a request breaks at some point?
e.g. PUT request X:
Client (1) → X → Proxy-Server (1) → X’ → Container-Server (1) → X1” → Account-Server (1)
Proxy-Server (1) → X’ → Container-Server (2) → X2” → Account-Server (2)
Proxy-Server (1) → X’ → Container-Server (3) → X3” → Account-Server (3)
4. Which part is slow? Looking at your logs?
When a request is made to Swift, it is given a unique transaction id. This id should appear
in every log line that has to do with that request. This can be useful when looking at all
the services that are hit by a single request. But is it efficient or handy to do?
5. Correlate the logs
Proxy server log @ node-P
Container server log @ node-C
Account server log @ node-A
Object server log @ node-O
Correlate the information pieces by transaction id and client IP from all logs of related hashed nodes!
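The manual correlation step can be sketched as a small script. This is only an illustration: the log lines are made up, and the transaction id pattern ("tx" followed by hex digits and dashes) is an assumption that should be adjusted to the real log format.

```python
import re
from collections import defaultdict

# Assumed shape of a transaction id: "tx" followed by hex digits and dashes.
# Real Swift ids may differ; adjust the pattern to your log format.
TXN_RE = re.compile(r"\btx[0-9a-f-]{8,}\b")


def correlate(logs):
    """Group log lines from several nodes by the transaction id they contain.

    `logs` maps a node name to an iterable of log lines.
    Returns {txn_id: [(node, line), ...]} in the order the lines were read.
    """
    by_txn = defaultdict(list)
    for node, lines in logs.items():
        for line in lines:
            match = TXN_RE.search(line)
            if match:
                by_txn[match.group(0)].append((node, line.rstrip("\n")))
    return dict(by_txn)


# Made-up sample lines, only to exercise the function.
sample_logs = {
    "node-P": ["proxy-server PUT /v1/a/c 201 txabc123def456"],
    "node-C": ["container-server PUT /sda/1/a/c 201 txabc123def456",
               "container-server GET /sda/2/a/d 200 tx0000aaaa1111"],
    "node-A": ["account-server PUT /sdb/3/a/c 201 txabc123def456"],
}
grouped = correlate(sample_logs)
```

Even with such a script, the logs first have to be gathered from every node involved, which is the inefficiency the previous slide points at.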
7. Pros and cons of current implementation
• Rethink it
Can we provide a real-time, end-to-end performance tracing/tracking tool in the Swift
infrastructure for developers and users, to facilitate their analysis in development and
operation environments?
statsD
Pros
• Real-time performance metrics to monitor the health of the Swift cluster
• Low performance impact: metrics data is sent via UDP, with no hit on local disk I/O
• Supported by different backends for reporting and visualization
Cons
• Designed for cluster-level health, not for end-to-end performance
• Cannot provide metrics data for a specific set of requests
• No relationship between different sets of metrics for specific transactions or requests
logging
Pros
• Lightweight
• Simple to use
• Rich logging tools
Cons
• Not designed for real time
• Requires more effort to collect and analyze
• No representation for individual spans
• Message size limitation
8. Our Proposal
• Goal
– For researchers, developers and admins: provide a method of traceability to
understand end-to-end performance issues and identify the bottlenecks.
• Scope
– Add WSGI middleware and hooks into Swift components to collect trace data
– The middleware controls the activation and generation of traces
– Generate trace and span ids, collect the data and tie them together
– Send trace data to an aggregator and save it into a repository
– Minor fixes to the current Swift implementation so that the path includes all hops.
Similar to trans-id, the trace-id and span-id need to be propagated correctly through HTTP headers between
services and components.
– Analysis tools for reporting and visualization
– Query the trace data by tiered trace ids
– Reconstruct the span tree for each trace
10. Span Tree of Trace
Example: create a new container: PUT /account/container
[Figure: span tree across Swift Client, Proxy Server, Auth, three Container Servers and three Account Servers. Messages shown in the figure:]
– Request-X PUT (X-Trace-Id: 1234) / Response-X PUT
– Request-X’ GET / Response-X’ GET
– Request-X’’ PUT (X-Trace-Id: 1234, X-Span-Id: 1) / Response-X’’ PUT
– Request-X’’’ PUT (X-Trace-Id: 1234, X-Span-Id: 2) / Response-X’’’ PUT
• X-Trace-Id: identification of each trace
– Use X-Trans-Id to support different clusters?
– Or generate a new id for this purpose?
• X-Span-Id: identification of each span, to represent an individual
HTTP RESTful call or WSGI call.
– Generate a new span id for this purpose
(note: UUID can be used for implementation)
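A minimal sketch of how the two ids might be minted with UUIDs, as the note suggests. The helper names and the `X-Parent-Span-Id` header are assumptions made for illustration; the slides only define `X-Trace-Id` and `X-Span-Id`.

```python
import uuid


def new_id():
    """Return a random 32-character hex id (UUID4), as the slide's note suggests."""
    return uuid.uuid4().hex


def child_headers(incoming):
    """Build headers for the next hop from the headers of the current request.

    Reuses the incoming X-Trace-Id (or starts a new trace), mints a new
    X-Span-Id, and records the caller's span as X-Parent-Span-Id.
    X-Parent-Span-Id is an assumed header name, not defined on the slide.
    """
    headers = {
        "X-Trace-Id": incoming.get("X-Trace-Id") or new_id(),
        "X-Span-Id": new_id(),
    }
    if "X-Span-Id" in incoming:
        headers["X-Parent-Span-Id"] = incoming["X-Span-Id"]
    return headers


first_hop = child_headers({})          # client request with no trace yet
second_hop = child_headers(first_hop)  # e.g. proxy calling a container server
```

Every hop keeps the same trace id while each call gets its own span id, which is what later allows the spans to be tied back into a tree.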
11. X-trace Middleware Architecture
1. Generate trace ids based on configuration.
2. Create spans and collect trace data
3. Propagate trace ids to next hop
4. Send trace data to a repository via a
separate transport protocol/channel
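The four steps could look roughly like the following WSGI middleware. This is a sketch under stated assumptions, not the proposed implementation: the class name, the `sink` callable and the `enabled` flag are hypothetical, and a plain list stands in for the separate transport channel.

```python
import time
import uuid


class TraceMiddleware:
    """Hypothetical WSGI middleware covering the four steps listed above."""

    def __init__(self, app, name, sink, enabled=True):
        self.app = app          # next WSGI app in the pipeline
        self.name = name        # e.g. "proxy-server"
        self.sink = sink        # callable receiving one span dict (step 4)
        self.enabled = enabled  # stands in for configuration (step 1)

    def __call__(self, environ, start_response):
        if not self.enabled:
            return self.app(environ, start_response)
        # Step 1: reuse the caller's trace id or generate a new one.
        trace_id = environ.get("HTTP_X_TRACE_ID") or uuid.uuid4().hex
        # Step 2: open a span whose parent is the caller's span, if any.
        span = {
            "trace_id": trace_id,
            "span_id": uuid.uuid4().hex,
            "parent_span_id": environ.get("HTTP_X_SPAN_ID"),
            "name": "%s.%s" % (self.name, environ.get("REQUEST_METHOD", "")),
            "start": time.time(),
        }
        # Step 3: expose the ids so downstream code can forward them.
        environ["HTTP_X_TRACE_ID"] = trace_id
        environ["HTTP_X_SPAN_ID"] = span["span_id"]
        try:
            return self.app(environ, start_response)
        finally:
            # Step 4: hand the finished span to the transport (a list here).
            span["end"] = time.time()
            self.sink(span)


def demo_app(environ, start_response):
    start_response("201 Created", [("Content-Length", "0")])
    return [b""]


collected = []
app = TraceMiddleware(demo_app, "proxy-server", collected.append)
body = app({"REQUEST_METHOD": "PUT", "HTTP_X_TRACE_ID": "1234"},
           lambda status, headers: None)
```

Note that the span closes when the wrapped app returns its response iterable, so time spent streaming the body is not covered by this sketch.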
[Figure: Swift Client, Proxy Server, Auth, three Container Servers and three Account Servers, with x-trace middleware sending trace data to the trace data repository]
12. Patches to fix the request path
• The trace id is passed along by the proxy
server in HTTP headers, but it is lost
at some points because a new request
is created for the next hop.
• Patches are needed to fix this problem
and form a complete tracing path through
the container server, object server, etc.
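The kind of fix described above can be sketched as a helper that copies the trace headers from the incoming WSGI environ into the headers of the newly created request. The function names are hypothetical and this is not Swift's actual code; `X-Timestamp` is used only as an example of a header the new request already carries.

```python
TRACE_HEADERS = ("X-Trace-Id", "X-Span-Id")


def wsgi_key(header):
    """Map an HTTP header name to its WSGI environ key, e.g. HTTP_X_TRACE_ID."""
    return "HTTP_" + header.upper().replace("-", "_")


def with_trace_headers(environ, headers=None):
    """Return headers for a newly built backend request.

    Copies the trace headers found in the incoming WSGI environ so the
    next hop can attach its span to the same trace.
    """
    out = dict(headers or {})
    for header in TRACE_HEADERS:
        value = environ.get(wsgi_key(header))
        if value is not None:
            out[header] = value
    return out


incoming = {"REQUEST_METHOD": "PUT", "HTTP_X_TRACE_ID": "1234", "HTTP_X_SPAN_ID": "1"}
backend_headers = with_trace_headers(incoming, {"X-Timestamp": "0000000001.00000"})
```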
[Figure: Swift Client, Proxy Server, Auth, three Container Servers and three Account Servers, with x-trace middleware sending trace data to the trace data repository; annotation: propagate trace id in next new request]
13. Tie together tracing data
Reconstruct causal and temporal relationship view for PUT container call
[Figure: span timeline with axis marks at 0 ms, 50 ms, 100 ms, 150 ms and 200 ms; responses are marked 201]
Swift-Client.PUT parent-span-id=none, span-id=0
Proxy-Server.PUT parent-span-id=0, span-id=1
Container-Server.PUT parent-span-id=1, span-id=2
Container-Server.PUT parent-span-id=1, span-id=3
Container-Server.PUT parent-span-id=1, span-id=4
Account-Server.PUT parent-span-id=2, span-id=5
Account-Server.PUT parent-span-id=3, span-id=6
Account-Server.PUT parent-span-id=4, span-id=7
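Tying the records together amounts to rebuilding a tree from the parent-span-id links. The sketch below uses the span ids from this slide; the function names and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

# Span records copied from the slide: (name, parent-span-id, span-id).
SPANS = [
    ("Swift-Client.PUT", None, 0),
    ("Proxy-Server.PUT", 0, 1),
    ("Container-Server.PUT", 1, 2),
    ("Container-Server.PUT", 1, 3),
    ("Container-Server.PUT", 1, 4),
    ("Account-Server.PUT", 2, 5),
    ("Account-Server.PUT", 3, 6),
    ("Account-Server.PUT", 4, 7),
]


def build_tree(spans):
    """Index spans by parent id; returns (roots, children)."""
    children = defaultdict(list)
    roots = []
    for name, parent, span_id in spans:
        if parent is None:
            roots.append((name, span_id))
        else:
            children[parent].append((name, span_id))
    return roots, children


def render(spans):
    """Return the span tree as indented text lines, depth first."""
    roots, children = build_tree(spans)
    lines = []

    def walk(node, depth):
        name, span_id = node
        lines.append("%s%s span-id=%d" % ("  " * depth, name, span_id))
        for child in sorted(children[span_id], key=lambda c: c[1]):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


tree_lines = render(SPANS)
print("\n".join(tree_lines))
```

Adding start and end timestamps to each record would give the temporal view in addition to the causal one.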
14. Another example: upload an object
[Figure: span timeline with axis marks at 0 ms, 50 ms, 100 ms, 150 ms and 200 ms; responses are marked 201]
Swift-Client.PUT parent-span-id=none, span-id=0
Proxy-Server.PUT parent-span-id=0, span-id=1
Object-Server.PUT parent-span-id=1, span-id=2
Object-Server.PUT parent-span-id=1, span-id=3
Object-Server.PUT parent-span-id=1, span-id=4
Container-Server.PUT parent-span-id=2, span-id=5
Container-Server.PUT parent-span-id=3, span-id=6
Container-Server.PUT parent-span-id=4, span-id=7
15. Trace into middleware of the pipeline (pipeline:main)
• Expand the trace path into the
WSGI calls between middleware to
get more complete trace data.
• Possible choices
– Decorators for __call__
@trace_here()
def __call__(self, environ, start_response):
– Hack the Paste Deployment package
– Profile with filters
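The decorator option could be sketched as below. `trace_here` is the name used on the slide, but its body, the `sink` parameter, the in-memory `TRACE_LOG` and the toy `Healthcheck` class are all assumptions made for illustration.

```python
import functools
import time

TRACE_LOG = []  # hypothetical in-memory sink; a real one would ship spans out


def trace_here(sink=TRACE_LOG.append):
    """Decorator factory that times a WSGI __call__ and records the result."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, environ, start_response):
            start = time.time()
            try:
                return func(self, environ, start_response)
            finally:
                sink({
                    "middleware": type(self).__name__,
                    "trace_id": environ.get("HTTP_X_TRACE_ID"),
                    "elapsed": time.time() - start,
                })
        return wrapper
    return decorator


class Healthcheck:
    """Toy stand-in for a pipeline filter, not Swift's healthcheck."""

    def __init__(self, app):
        self.app = app

    @trace_here()
    def __call__(self, environ, start_response):
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    start_response("200 OK", [])
    return [b"OK"]


result = Healthcheck(demo_app)({"HTTP_X_TRACE_ID": "1234"}, lambda s, h: None)
```

A decorator requires touching each middleware class, which is presumably why the slide also lists approaches that hook in at the pipeline-loading level.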
[Figure: Swift Client calling the Proxy Server, whose pipeline filters (tempauth, cache, tempurl, dlo, slo, …) each report through x-trace to the trace data repository]
Pipeline = catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk slo dlo ratelimit crossdomain tempauth tempurl formpost staticweb container-quotas account-quotas proxy-logging proxy-serve
17. Query and analysis tools
• Query
– Query trace data by trace_id or span_id, order or filter by time range, group by node
or annotation key
• Trace timeline
– Plot the spans on the timeline with causal relationships
• Diagnose
– Analyze the critical path for a successful response
– Identify the failure point in the path
• Simulation
– Replay the recorded processing of the requests
• Data Mining
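A rough sketch of what the query and critical-path tools might do over stored spans. The record layout, function names and timings are made up for illustration (the timings are not measured data), and the critical-path rule assumes a parent waits for its slowest child.

```python
# Illustrative spans with made-up timings in milliseconds (not measured data).
SPANS = [
    {"trace_id": "1234", "span_id": 1, "parent": None,
     "name": "Proxy-Server.PUT", "start": 0, "end": 180},
    {"trace_id": "1234", "span_id": 2, "parent": 1,
     "name": "Container-Server.PUT", "start": 10, "end": 90},
    {"trace_id": "1234", "span_id": 3, "parent": 1,
     "name": "Container-Server.PUT", "start": 10, "end": 170},
    {"trace_id": "1234", "span_id": 6, "parent": 3,
     "name": "Account-Server.PUT", "start": 60, "end": 160},
    {"trace_id": "9999", "span_id": 1, "parent": None,
     "name": "Proxy-Server.GET", "start": 5, "end": 20},
]


def query(spans, trace_id, start=None, end=None):
    """Return spans of one trace that overlap a time window, ordered by start."""
    hits = [s for s in spans
            if s["trace_id"] == trace_id
            and (start is None or s["end"] >= start)
            and (end is None or s["start"] <= end)]
    return sorted(hits, key=lambda s: s["start"])


def critical_path(spans, trace_id):
    """Follow, from the root, the child that finishes last at each level."""
    trace = query(spans, trace_id)
    node = next(s for s in trace if s["parent"] is None)
    path = [node]
    while True:
        kids = [s for s in trace if s["parent"] == node["span_id"]]
        if not kids:
            return path
        node = max(kids, key=lambda s: s["end"])
        path.append(node)


path_ids = [s["span_id"] for s in critical_path(SPANS, "1234")]
```

In this sample the path runs through the slower container server and its account server call, which is the kind of bottleneck the diagnosis tools are meant to surface.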
18. Reference
• Google Dapper: a large-scale distributed systems tracing infrastructure
• Twitter Zipkin: a distributed tracing system that helps gather timing
data for all the disparate services at Twitter
• Berkeley X-Trace: a pervasive network tracing framework