In this talk I cover the use cases for machine learning and centralized logging when monitoring a distributed, multi-layered microservices architecture.
9. LOG ANALYTICS FOR
MICROSERVICES
• Service logs
10/01/17 00:53:51 INFO apollo i.l.c.b.c.b.MappedPageFactory: Page file /tmp/logzio-logback-buffer/listener-metrics/logzio-logback-appender/data/page-48.dat was just deleted.
• Service metrics
10/01/17 02:53:51 INFO apollo a.b.c.metrics: Account-Incoming, key: 126, value:
54321
11. THE CHALLENGES WITH LOGGING
MICROSERVICES
• Transient
• Distributed
• Independent
• Multilayered
12. LOGGING IN A DOCKERIZED
WORLD
$ docker logs
2016-06-02T13:05:22.614090Z 0 [Note] InnoDB: 5.7.12 started; log sequence number
2522067
13. LOGGING IN A DOCKERIZED
WORLD
$ docker stats
CONTAINER       CPU %   MEM USAGE / LIMIT    MEM %   NET I/O               BLOCK I/O
3747bd397456    0.01%   3.641 MB / 2.1 GB    0.17%   3.366 kB / 648 B      0 B / 0 B
396e42ba0d15    0.11%   1.638 MB / 2.1 GB    0.08%   9.79 kB / 648 B       348.2 kB / 0 B
468bf755240a    3.19%   45.67 MB / 2.1 GB    2.17%   25.19 MB / 17.95 MB   774.1 kB / 0 B
5f16814a3c0e    0.01%   495.6 kB / 2.1 GB    0.02%   8.564 kB / 648 B      0 B / 0 B
74cdfa7b8a0c    0.04%   3.908 MB / 2.1 GB    0.19%   2.028 kB / 648 B      0 B / 0 B
99bafb7600fc    0.00%   32.95 MB / 2.1 GB    1.57%   0 B / 0 B             2.093 MB / 20.48 kB
14. LOGGING IN A DOCKERIZED
WORLD
$ docker daemon
time="2016-06-05T12:03:49.716900785Z" level=debug msg="received containerd event:
&types.Event{Type:"exit",
Id:"3747bd397456cd28058bb40799cd0642f431849b5c43ce56536ab7f55a98114f",
Status:0x0,
Pid:"4120a7625a592f7c95eab4b1b442a45370f6dd95b63d284714dbb58f00d0a20d",
Timestamp:0x57541525}"
15. OH, AND THERE’S THIS…
• Large & complex application & operational logs
• Multiple different formats
• Multiple log files per component / instance
• Slow & labor-intensive
• Error-prone processing
• Relies on an individual’s skills
• Expensive
• Hard to find what is relevant and important in log data
• Scaling and securing an open-source implementation is expensive and almost impossible
16. CENTRALIZED LOGGING TO THE
RESCUE
• Centralized data collection and management
• Provides inferable context to logs
• Analysis, event correlation and visualization
17. OLD SCHOOL LOGGING
$ grep ' 30[1234] ' /var/log/apache2/access.log | grep -v baidu | grep -v Googlebot
173.230.156.8 - - [04/Sep/2015:06:10:10 +0000] "GET /morpht HTTP/1.0" 301 26
"-" "Mozilla/5.0 (pc-x86_64-linux-gnu)"
192.3.83.5 - - [04/Sep/2015:06:10:22 +0000] "GET /?q=node/add HTTP/1.0" 301
26 "http://morpht.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1)
AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5"
19. A BIT ABOUT ELK
• World’s most popular open source log
analysis platform
• 4.5M downloads a month!
• Centralized logging AND: search, BI, SEO,
IoT, and more
20. THE MARKET IS DOMINATED BY OPEN SOURCE SOLUTIONS
• Open source / flexible – over the past 3 years, the market shifted attention from proprietary to open source; a fast-growing community, no vendor lock-in and no license cost
• Simple and beautiful – it’s simple to get started and play with ELK, and the UI is just beautiful
• Fast. Very fast. – blazing quick responses even when searching through millions of documents
• ELK Stack: 500,000+ companies vs. 15K companies
21. TYPICAL ELK PIPELINE
• Log shipper
• Collecting and parsing (Logstash)
• Full-text search and analysis engine – scalable, fast, highly available, with a REST API (Elasticsearch)
• Visualizations and dashboards (Kibana)
26. • Configure Logstash (input, filter,
output)
filter {
if [type] == "dockerlogs" {
if ([message] =~ "^\tat ") {
drop {}
}
grok {
break_on_match => false
match => [ "message", " responded with %{NUMBER:status_code:int}" ]
tag_on_failure => []
}
}
}
STEP 3 – PARSING
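For intuition, here is a plain-Python sketch of what that grok match extracts. The regex mirrors the `%{NUMBER:status_code:int}` pattern above; `parse_status` is an illustrative helper, not part of Logstash:

```python
import re

# Mirrors the grok pattern: " responded with %{NUMBER:status_code:int}"
STATUS_RE = re.compile(r" responded with (?P<status_code>\d+)")

def parse_status(message):
    """Return the integer status code, or None when the pattern is absent
    (matching tag_on_failure => [] above: no failure tag, just no field)."""
    m = STATUS_RE.search(message)
    return int(m.group("status_code")) if m else None

print(parse_status("GET /health responded with 200"))  # 200
```

Messages without the phrase simply yield no field, which is why the filter sets `tag_on_failure` to an empty list.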
27. • DO NOT expose
Elasticsearch
(‘network.host’)
• Use proxies
• Isolate
Elasticsearch
• Change default
ports
STEP 4 – SECURITY
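A minimal sketch of the `network.host` advice above as an `elasticsearch.yml` fragment (values are illustrative; in a cluster you would bind to a private subnet address instead):

```yaml
# elasticsearch.yml – do NOT bind to 0.0.0.0 on a reachable host
network.host: 127.0.0.1   # loopback only; clients go through a proxy
http.port: 9200           # consider moving off the default port
```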
29. OTHER SOLUTIONS
• Hosted ELK (Logz.io, Elastic Cloud,
Sematext)
• Other logging/monitoring SaaS
(Datadog, Papertrail, Loggly)
30. THE BIG ELEPHANT (ELK) IN THE ROOM
• Not knowing what question to ask
• Needle in the haystack syndrome
• Logs cannot be analyzed by a human alone
• Anomaly detection does not work
31. ANOMALY DETECTION DOESN’T WORK
• Not every anomaly is an error
• Not every error represents itself in
an anomaly
• Apps run as step functions
34. WHAT IS MACHINE LEARNING?
“Machine learning is a type of artificial
intelligence that provides computers with
the ability to learn without being
explicitly programmed.” (TechTarget)
35. SUPERVISED MACHINE LEARNING (BY
EXAMPLE)
1. Labeling – gathering and labeling logs
• User behavior
• Inter-user similarities
• Public resources
2. Training a classifier – defining which logs are important
3. Integration within the system
36. ‘skb rides the rocket’
kernel: xen_netfront: xennet: skb rides the rocket: 19 slots
(http://serverfault.com/questions/647489/what-is-causing-
skb-rides-the-rocket-errors)
Syslog message, the result of packet loss due to a kernel bug in Linux.
Logs are a stream of aggregated, time-ordered events collected from the output streams of running processes and backing services
Does anyone not use logs?
When running builds to identify compile errors
When you’re running a system – for troubleshooting your system
For learning about the behavior of your system
So anyone creating, deploying or running software needs logs!
Service logs – service_id, request_id (for tracing across the architecture), type, timestamp
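Those service-log fields can be sketched as a structured (JSON) log line; the function and field names here are illustrative, not a prescribed schema:

```python
import json
import time
import uuid

def service_log(service_id, request_id, event_type, **fields):
    """Emit one structured log line. The request_id is what lets you trace
    a single request across services in the architecture."""
    record = {
        "service_id": service_id,
        "request_id": request_id,
        "type": event_type,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    record.update(fields)
    return json.dumps(record)

print(service_log("billing", str(uuid.uuid4()), "INFO", message="charge ok"))
```

Because every field is a named key, a pipeline like Logstash can index the line without guessing at its format.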
Metric collection - to measure improvements, new code
Resource utilizations (CPU, memory, Network, Filesystem)
Runtime metrics (Jenkins build times)
Microservices are stateless. That means that an instance of a service can be created, stopped, restarted, and destroyed at any time without impacting other services. Any logging functionality we implement can’t rely on the service persisting for any period of time.
Microservices are independent. With microservices, only the execution environment is aware of the context. Kubernetes is aware of pods for example but not the hosting machine.
Microservices are distributed. You’ll likely find yourself logging related data from two completely independent platforms. To log effectively, we need a way to correlate events across the infrastructure.
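Correlating events across independent platforms can be sketched by grouping records on a shared request_id (a toy example with made-up records, assuming every service logs that field as described above):

```python
from collections import defaultdict

def correlate_by_request(*streams):
    """Group log records from independent services by request_id so one
    request can be followed across the infrastructure."""
    grouped = defaultdict(list)
    for stream in streams:
        for record in stream:
            grouped[record["request_id"]].append(record)
    return dict(grouped)

api_logs = [{"request_id": "r1", "service": "api", "msg": "request received"}]
db_logs  = [{"request_id": "r1", "service": "db",  "msg": "query ok"}]
trace = correlate_by_request(api_logs, db_logs)  # trace["r1"] spans both services
```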
Let’s take the Docker execution environment for example. You have three different types of logs and metrics that can be extracted.
Multiply all of this – at Logz.io, for example, we’re running about 60 Docker hosts, each with 4-5 containers…
In modern environments, log analysis remains an extremely complicated and resource-consuming task for even the most experienced developer, DevOps or IT operations teams out there, despite all the sophisticated analytics and monitoring tools available.
That’s because, at the end of the day, behind these tools stands a human being who needs to connect the dots and make informed, timely decisions; they need to know how to extract signals and actionable meaning from millions of log messages.
In essence, centralized logging detaches logging from the containers running your microservices
Using parsing and filtering you can give your logs context
Structured logs plus a comfortable UI make analysis easier
All three services are started automatically
The image persists /var/lib/elasticsearch (the directory where Elasticsearch stores its data) as a volume.
Install a log forwarder to send to Logstash – this depends on the Docker driver used.
Logspout is a log router for Docker containers that runs inside Docker. It attaches to all containers on a host, then routes their logs wherever you want. It also has an extensible module system.
Logspout is a very small Docker container (15.2MB virtual)
docker inspect afaac897ab50 | grep LogPath
Each Docker image has its own logging format, so these filters will be very specific
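The file that LogPath points at (with Docker's default json-file driver) holds one JSON object per line, with `log`, `stream`, and `time` keys; a small sketch of reading it:

```python
import json

def parse_docker_json_log(lines):
    """Yield (time, stream, message) from Docker's json-file driver output,
    which writes {"log": ..., "stream": ..., "time": ...} per line."""
    for line in lines:
        rec = json.loads(line)
        yield rec["time"], rec["stream"], rec["log"].rstrip("\n")

sample = ['{"log":"InnoDB: started\\n","stream":"stdout","time":"2016-06-02T13:05:22Z"}']
for ts, stream, msg in parse_docker_json_log(sample):
    print(ts, stream, msg)
```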
Bind the nodes to localhost or private IP
Use proxies to communicate with clients – to add user control and to do request filtering, put in front of Kibana
False alarms and a low signal-to-noise ratio
Not every anomaly is an error
Developer introducing a new log line
Access usage
Seasonality changes
Not every error represents itself in an anomaly
Resource utilization
Memory leak
Applications run as a step function
Anomaly detection works on continuous functions
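Why a step change defeats such a detector can be shown with a toy z-score detector over made-up request counts: a healthy deploy that doubles steady traffic gets flagged even though nothing is wrong.

```python
import statistics

def zscore_anomalies(series, threshold=1.5):
    """Flag indices whose value is more than `threshold` standard
    deviations from the series mean (a deliberately naive detector)."""
    mean = statistics.mean(series)
    stdev = statistics.pstdev(series)
    if stdev == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mean) / stdev > threshold]

# Requests per minute: a deploy doubles steady traffic -- a step, not an error.
baseline = [100] * 20
after_deploy = [200] * 5
print(zscore_anomalies(baseline + after_deploy))  # flags the healthy new level
```

The detector flags every point at the new level: an "anomaly" that is not an error, exactly the false-alarm problem above.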
Enables you to train a self-improving system that asks the questions for us
Can sift through vast amounts of data and flag relevant events
Supervised machine learning is based on the idea of learning by example
Labeling – gathering and labeling logs – coloring the data in different colors
Opened/unopened
Error logs
Exceptions logs
Training a classifier – defining which logs are important. Simply put, a classifier is a formula that you build in order to answer a question. Using the labels, we build a mathematical representation of a log message, which is then fed into the formula; if the result passes a specific threshold, the log is relevant.
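A toy, stdlib-only sketch of that idea (the vocabulary, weights, labels, and function names are all made up for illustration; a real system would use a proper learning algorithm, not these hand-rolled weights):

```python
def featurize(message, vocabulary):
    """Bag-of-words representation: 1 per vocabulary word present."""
    tokens = set(message.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

def train(labeled_logs, vocabulary):
    """Crude weights: +1 for each vocabulary word seen in an 'important'
    log, -1 for each seen in an unimportant one."""
    weights = [0.0] * len(vocabulary)
    for message, important in labeled_logs:
        sign = 1 if important else -1
        for i, f in enumerate(featurize(message, vocabulary)):
            weights[i] += sign * f
    return weights

def is_important(message, vocabulary, weights, threshold=0.0):
    """The 'formula' from the note: score the log, compare to a threshold."""
    score = sum(w * f for w, f in zip(weights, featurize(message, vocabulary)))
    return score > threshold

vocab = ["error", "exception", "started", "ok"]
labeled = [("disk error on write", True),
           ("unhandled exception in worker", True),
           ("service started ok", False)]
weights = train(labeled, vocab)
print(is_important("error reading block device", vocab, weights))
```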
Integration within the system – using Hadoop and Spark
As IT operations become agile and dynamic, they are also getting immensely complex.
2 main challenges in logging microservices:
Logging in a distributed architecture
Finding the needle in the haystack
Proposed solutions:
Centralized logging
Machine learning approach
Turns manual Dev, DevOps and IT operations into an automated process
Poses the questions for you – revealing events that would otherwise go undetected