6. 6 APRICOT 2017
• What if something goes wrong in our system?
• How can we resolve the problems as quickly as possible?
7. Current Logging solution (1)
ELK, Graylog:
Collecting logs from systems and appliances.
Indexing and filtering to support root cause analysis (RCA).
Multiple Alert/Notify mechanisms.
Visualization based on user’s needs.
8. Current Logging solution (2)
Pros:
Quickly troubleshoot problems in systems/appliances.
Reduce the cost of storing logs as required by PCI DSS or HIPAA.
Cons:
Depend mostly on system/appliance logs.
Require more effort to size, deploy, maintain and operate the logging solution.
Consume resources (mostly storage), so they may not be suitable for small sensors.
9. Current Logging solution (3)
Example 01:
A single request to launch one VM in an OpenStack cloud system can go through at least four micro-services.
INFO-level logs sometimes contain misleading or insufficient information for troubleshooting.
Turning on the DEBUG log level produces too much information and eats up storage.
The overhead threshold is hard to control.
10. Current Logging solution (4)
Example 02:
ELK/Graylog require tweaking and extra effort for visualization, collection, profiling and RCA in a distributed environment.
Consider the following queries in environments with >10 services:
“Find me the root cause of all failed requests that handle business process X.”
“Find me requests where the user was logged in, the request took more than two seconds, and a DB transaction was held open for more than 500 ms.”
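To make the second query concrete, here is a minimal sketch of how it could be answered once per-request trace data exists. The `Span` and `RequestTrace` records, the `db.transaction` span name and the sample data are assumptions made for illustration; they are not an ELK or Graylog API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str           # hypothetical span name, e.g. "db.transaction"
    duration_ms: float  # how long this step took


@dataclass
class RequestTrace:
    request_id: str
    user_logged_in: bool
    duration_ms: float  # end-to-end request time
    spans: List[Span] = field(default_factory=list)


def slow_logged_in_requests(traces, min_request_ms=2000.0, min_db_tx_ms=500.0):
    """Logged-in requests slower than 2 s with a DB transaction open > 500 ms."""
    return [
        t for t in traces
        if t.user_logged_in
        and t.duration_ms > min_request_ms
        and any(s.name == "db.transaction" and s.duration_ms > min_db_tx_ms
                for s in t.spans)
    ]


# Made-up sample data: only r1 satisfies all three conditions.
traces = [
    RequestTrace("r1", True, 2500.0, [Span("db.transaction", 650.0)]),
    RequestTrace("r2", True, 2500.0, [Span("db.transaction", 120.0)]),
    RequestTrace("r3", False, 3000.0, [Span("db.transaction", 900.0)]),
    RequestTrace("r4", True, 800.0, [Span("db.transaction", 700.0)]),
]
matching_ids = [t.request_id for t in slow_logged_in_requests(traces)]
print(matching_ids)
```

With plain per-service logs, the same question requires correlating lines across more than ten services by hand, which is the difficulty this slide points at.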
11. Tracing Requirements
Operator Needs:
• Address the Data Explosion: Logs, Metrics, Events, Active/Passive Checks, …
• End-to-End Debugging: understand what the real issue is and what is affected when errors occur
• Visibility: deliver centralized intelligence for cloud operations at scale
• Resource Utilization: understand resource availability and utilization
Solution Requirements:
• Able to collect, store and access all types of data in one place
• Highly performant and scalable platform
• Flexible processing pipeline that can support multiple use cases: diagnostics, root cause analysis, SLA calculations, utilization reporting, …
• Extensible platform that can be extended to support new types of data and processing
12. Tracing Requirements
• Users need a centralized solution that provides enough
information for both the machine-centric (monitoring) and
workflow-centric (tracing) views.
– Provide a general picture of every workflow: the
communication steps and the req/resp time of each step,
for performance review.
– Show the monitoring metrics of the hardware/services for
each step at the time of investigation.
– Provide a general-purpose RCA method for quick
troubleshooting.
13. Workflow Centric solution quick survey
Many solutions aim at workflow-centric tracing. They fall into
3 categories: [1]
1. Explicit metadata propagation: inject tracing metadata into the current
system (Zipkin, Kieker, X-Trace, Tracelytics, Cloudera HTrace,
ExplorViz, OpenTracing - CNCF)
2. Schema-based: rely on the event semantics of the system and use a
temporal schema of custom log messages for tracing. (Magpie)
3. Black-box tracing: rely on log analysis to infer relationships among
events. (FChain, NetMedic)
[1]. HANSEL: Diagnosing Faults in OpenStack – IBM Research
14. Workflow centric solutions (1)
• [Figure: traditional workflow. A request (Req) flows across Service A, Service B, Service C and Service D.]
15. Workflow centric solutions (2)
• Explicit metadata propagation
[Figure: explicit metadata tracing workflow. Metadata is injected into each request/response between Service A, Service B, Service C and Service D and sent to a tracing mechanism (Zipkin, Dapper, ...).]
16. Workflow centric solutions (3)
• Explicit metadata propagation
Pros:
• Gives enough detail to trace problems.
• Highly scalable.
Cons:
• The code base must be modified to inject metadata into the header of each request and
response.
• Increases network packet size (by a small amount; with Zipkin, around 500 bytes).
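The propagation idea can be sketched in a few lines. This is a simplified illustration, not Zipkin's or Dapper's implementation: the header names `X-Trace-Id` and `X-Span-Id`, the `handle` function and the in-memory `collected_spans` list (standing in for the tracing mechanism) are all assumptions.

```python
import uuid

# Hypothetical header names; real tracers define their own.
TRACE_HEADER = "X-Trace-Id"
SPAN_HEADER = "X-Span-Id"

# Stand-in for the tracing mechanism that receives span records.
collected_spans = []


def new_id():
    """Random 16-hex-character identifier."""
    return uuid.uuid4().hex[:16]


def handle(chain, headers):
    """Handle a request in the first service of `chain`, then forward it.

    Each service reuses the incoming trace ID (or starts a new trace),
    records its own span, and injects both IDs into the outgoing headers.
    """
    if not chain:
        return
    service, rest = chain[0], chain[1:]
    trace_id = headers.get(TRACE_HEADER) or new_id()
    span_id = new_id()
    collected_spans.append({
        "trace_id": trace_id,
        "span_id": span_id,
        "parent_id": headers.get(SPAN_HEADER),  # None for the entry service
        "service": service,
    })
    handle(rest, {TRACE_HEADER: trace_id, SPAN_HEADER: span_id})


handle(["Service A", "Service B", "Service C", "Service D"], {})
for span in collected_spans:
    print(span["service"], span["trace_id"], span["parent_id"])
```

All four spans share one trace ID and each points at its caller's span, which is what lets the tracing mechanism rebuild the workflow. The cost is visible too: every service has to be changed to read and forward the headers.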
17. Workflow centric solutions (4)
• Schema-based: based on the semantics of events generated by the system
(including OS, services and applications), then joining all related event
schemas for the final inference.
[Figure: schema-based tracing. While a request (Req) is processed by Service A, Service B, Service C and Service D, events such as Authenticate, Get Image, Create port, IP and attach, and Read/Write DB are captured by an Event Listener.]
18. Workflow centric solutions (5)
• Schema-based
Pros:
• Less modification to the code base.
Cons:
• Low scalability. (The result is delayed until all events are collected.)
• Less detail than explicit metadata propagation. (The event semantics, the event list and
the way schemas are joined determine the success of this approach, so we need to build a
warehouse of event semantics.)
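A much simplified sketch of the schema-join idea follows. It is not Magpie's algorithm: the `SCHEMA` list, the event fields (`ts`, `type`, `resource`) and the sample events are assumptions chosen to mirror the VM-launch example.

```python
# Hypothetical temporal schema: event types expected for one VM launch, in order.
SCHEMA = ["authenticate", "get_image", "create_port", "db_write"]

# Made-up events as an event listener might capture them (no trace IDs).
events = [
    {"ts": 1.0, "type": "authenticate", "resource": "vm-1"},
    {"ts": 1.1, "type": "authenticate", "resource": "vm-2"},
    {"ts": 1.4, "type": "get_image", "resource": "vm-1"},
    {"ts": 1.5, "type": "get_image", "resource": "vm-2"},
    {"ts": 2.0, "type": "create_port", "resource": "vm-1"},
    {"ts": 2.3, "type": "db_write", "resource": "vm-1"},
]


def join_by_schema(events, schema, key="resource"):
    """Join events that share `key`, ordered by time, and list the schema
    steps that were never observed for each joined group."""
    groups = {}
    for event in sorted(events, key=lambda e: e["ts"]):
        groups.setdefault(event[key], []).append(event["type"])
    return {
        k: {"observed": types,
            "missing": [step for step in schema if step not in types]}
        for k, types in groups.items()
    }


result = join_by_schema(events, SCHEMA)
for resource, info in result.items():
    print(resource, info)
```

Here `vm-2` never reaches `create_port`, which points at where its workflow stopped. The join only works after all events have arrived and only for attributes the schema knows about, which matches the scalability and detail limits listed above.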
19. Workflow centric solutions (6)
• Black-box tracing: collect the logs of all services, then analyze them
to infer the root cause of the problem.
[Figure: black-box tracing. Service A, Service B, Service C, Service D and the DB send their logs to a Log Collector and Analyzer.]
20. Workflow centric solutions (7)
• Black-box tracing:
Pros:
• No modification to code base.
Cons:
• High error rate. (Most approaches are probabilistic data mining.)
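The sketch below illustrates why black-box inference is error-prone. It is a toy co-occurrence count, not the FChain or NetMedic algorithm; the log tuple layout `(timestamp, service, level)`, the one-second window and the sample lines are assumptions.

```python
from collections import Counter


def infer_dependencies(log_lines, window=1.0):
    """Count how often an ERROR in one service is followed, within `window`
    seconds, by an ERROR in a different service."""
    errors = sorted((ts, svc) for ts, svc, level in log_lines if level == "ERROR")
    pairs = Counter()
    for i, (ts_a, svc_a) in enumerate(errors):
        for ts_b, svc_b in errors[i + 1:]:
            if ts_b - ts_a > window:
                break  # later errors are even further away in time
            if svc_b != svc_a:
                pairs[(svc_a, svc_b)] += 1
    return pairs


# Made-up log lines: (timestamp in seconds, service, level).
logs = [
    (10.0, "D", "ERROR"),
    (10.2, "C", "ERROR"),
    (10.3, "B", "ERROR"),
    (20.0, "D", "ERROR"),
    (20.4, "C", "ERROR"),
    (30.0, "A", "INFO"),
    (35.0, "B", "ERROR"),
]
pairs = infer_dependencies(logs)
print(pairs.most_common())
```

The pair `("D", "C")` scores highest, but `("D", "B")` and `("C", "B")` also get a count from a single coincidence in time. Nothing in the logs says whether those links are real, so the result is a guess whose quality depends on the window and on how much data is available.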