Observability is the ability to infer the internal state of a system from its external outputs. It is a property of the system, not an activity like monitoring. To be observable, a system must externalize its state through logs, metrics, and events. Improving observability means instrumenting every component of an application, from the front end through backend services to infrastructure. Common application metrics include requests processed, errors encountered, and response times; common infrastructure metrics include CPU usage, disk I/O, and network traffic. Observability extends monitoring: monitoring answers whether a system is working, and observability helps explain why it is not.
2. Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
— wikipedia.org/wiki/Observability
@MartinGross
3. MONITORING VS. OBSERVABILITY
▸ Monitoring: Something we do
▸ Observability: Property of a system
▸ Is the system observable?
▸ Systems need to externalise their state to be observable
4. What gets measured gets maximized.
Better be careful what you measure.
9. APPLICATION METRICS
▸ number of requests currently being processed
▸ number of requests handled per time period
▸ number of errors encountered when handling requests
▸ average time it took to serve requests
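The four application metrics above could be tracked in process roughly as follows. This is a minimal sketch using only the Python standard library; the `RequestMetrics` class and its `track` method are hypothetical names, and the tracker is not thread-safe.

```python
import time
from contextlib import contextmanager


class RequestMetrics:
    """Hypothetical in-process tracker for the four metrics listed above."""

    def __init__(self):
        self.in_flight = 0        # requests currently being processed
        self.handled = 0          # requests completed since start
        self.errors = 0           # requests that raised an exception
        self.total_seconds = 0.0  # summed duration of completed requests

    @contextmanager
    def track(self):
        """Wrap the handling of one request."""
        self.in_flight += 1
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors += 1
            raise
        finally:
            self.in_flight -= 1
            self.handled += 1
            self.total_seconds += time.perf_counter() - start

    def average_seconds(self):
        """Average time to serve a request; 0.0 before any request completes."""
        return self.total_seconds / self.handled if self.handled else 0.0
```

A handler would run inside `with metrics.track(): ...`. The "requests handled per time period" figure is then the change in `handled` between two samples divided by the time between them.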
10. INFRASTRUCTURE METRICS
▸ The CPU usage of individual processes or containers
▸ The disk I/O activity of nodes and servers
▸ The inbound and outbound network traffic of machines, clusters, or load balancers
11. OBSERVABILITY AND MONITORING
Observability extends Monitoring
Monitoring: Is the system working?
Observability: Why is it not working?
Observability Pipeline
Needs standard metrics format
Structured logging (e.g. JSON)
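As an illustration of structured logging, here is a minimal sketch that emits one JSON object per log line with Python's standard `logging` module. The `JsonFormatter` class, the `checkout` logger name, and the convention of passing extra fields in a `context` dict are assumptions made for this example, not something the slides prescribe.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: extra fields arrive in a `context` dict.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={"context": {"order_id": 42, "duration_ms": 118}})
```

Because every line is a self-describing object, a downstream pipeline can filter on fields such as `order_id` without parsing free-form text.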
14. R.E.D. PATTERN FOR SERVICES
Requests / Errors / Duration
Good for Services
Services are usually request-driven systems
▸ Requests: requests received per second
▸ Errors: percentage of requests that returned an error
▸ Duration: time taken to serve each request
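The Requests / Errors / Duration figures can be derived from per-request records collected over an observation window. The sketch below assumes a hypothetical `RequestRecord` type and uses the median as the duration statistic; percentiles or histograms are equally valid choices.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestRecord:
    """Hypothetical record of one handled request."""
    duration_seconds: float
    is_error: bool


def red_summary(records, window_seconds):
    """Summarise one observation window as Requests / Errors / Duration."""
    count = len(records)
    errors = sum(1 for r in records if r.is_error)
    return {
        # Requests: requests received per second over the window.
        "requests_per_second": count / window_seconds,
        # Errors: percentage of requests that returned an error.
        "error_percent": 100.0 * errors / count if count else 0.0,
        # Duration: median time to serve a request (one possible statistic).
        "median_duration_seconds": (
            median(r.duration_seconds for r in records) if count else 0.0
        ),
    }
```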
15. U.S.E. PATTERN FOR RESOURCES
Utilization / Saturation / Errors
Good for physical server components like
▸ CPU
▸ disks
▸ network interfaces
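A minimal sketch of how one sample of a resource might be reduced to Utilization / Saturation / Errors. The `ResourceSample` fields are hypothetical; how busy time, queue length, and error counts are actually obtained depends on the resource and the operating system.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """Hypothetical sample for one resource over one interval."""
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the sampling interval
    queued_items: int        # work waiting because the resource was busy
    error_count: int         # error events reported by the resource


def use_summary(sample):
    """Reduce one sample to the three U.S.E. figures."""
    return {
        # Utilization: share of the interval the resource was busy.
        "utilization_percent": 100.0 * sample.busy_seconds / sample.interval_seconds,
        # Saturation: amount of work that had to wait.
        "saturation": sample.queued_items,
        # Errors: count of error events in the interval.
        "errors": sample.error_count,
    }
```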
17. CLUSTER HEALTH
▸ Number of nodes
▸ Node health status
▸ Number of pods per node
▸ Number of pods overall
18. DEPLOYMENT METRICS
▸ Number of deployments
▸ Number of configured replicas per deployment
▸ Number of unavailable replicas per deployment
19. CONTAINER METRICS
▸ Number of containers/pods per node
▸ Number of containers overall
▸ Resource usage for each container against its requests / limits
▸ Liveness/readiness of containers
▸ Number of container/Pod restarts
20. RUNTIME METRICS
▸ Number of processes/threads
▸ Heap and stack usage
▸ Non-heap memory usage
▸ Network I/O buffers
▸ Garbage collector runs and pause duration
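Which runtime metrics are available depends on the language runtime. As one illustration, the sketch below collects a few of them for a Python process using the standard library. The `runtime_snapshot` function name is hypothetical, and it does not cover stack usage, network buffers, or garbage collector pause duration.

```python
import gc
import threading
import tracemalloc


def runtime_snapshot():
    """Collect a few runtime metrics for the current Python process."""
    current, peak = (
        tracemalloc.get_traced_memory() if tracemalloc.is_tracing() else (0, 0)
    )
    return {
        "threads": threading.active_count(),
        # Bytes allocated by Python code; only tracked while tracemalloc is tracing.
        "traced_heap_bytes": current,
        "traced_heap_peak_bytes": peak,
        # One entry per GC generation, each counting completed collection runs.
        "gc_collections": [gen["collections"] for gen in gc.get_stats()],
    }


tracemalloc.start()
print(runtime_snapshot())
```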