Our cloud-native environments are more complex than ever before! So how can we ensure that the applications we’re deploying to them are behaving as we intended them to? This is where effective observability is crucial. It enables us to monitor our applications in real-time and analyse and diagnose their behaviour in the cloud. However, until recently, we were lacking the standardization to ensure our observability solutions were applicable across different platforms and technologies. In this session, we’ll delve into what effective observability really means, exploring open source technologies and specifications, like OpenTelemetry, that can help us to achieve this while ensuring our applications remain flexible and portable.
1. Through The Looking Glass: Effective observability for cloud native applications
GRACE JANSEN, DEVELOPER ADVOCATE, IBM
@gracejansen27
2. • Phrase by Lewis Carroll
• The sequel to Alice's
Adventures in Wonderland
• Alice passes through a
mirror over a fireplace and
finds herself (once more) in
an enchanted land
Through the Looking-Glass...
3.
4.
5. Agenda:
• Why do we need observability?
• What do we mean by “Observability”?
• How can we do this in our own apps?
• OpenTelemetry
• MicroProfile Telemetry 1.0
• Demo
• Summary and Resources
14. What is observability?
• In general…
• observability is the extent to which you can understand the internal state or
condition of a complex system based only on knowledge of its external
outputs
15. What is observability?
• In general…
• observability is the extent to which you can understand the internal state or
condition of a complex system based only on knowledge of its external
outputs
• In IT and cloud computing…
• observability also refers to software tools and practices for aggregating,
correlating and analyzing a steady stream of performance data from a
distributed application along with the hardware and network it runs on, in
order to more effectively monitor, troubleshoot and debug the
application and the network
16. Observability vs Monitoring
• Monitoring consists of using tools and techniques that highlight that an
issue has occurred. A monitoring system could raise a warning when:
• average response time is getting slower and slower;
• a growing number of requests result in HTTP 500 – internal server error;
• the application crashes.
• Observability is the ability to measure the internal states of a system
by examining its outputs (Control theory definition).
• An application is “observable” when it provides detailed visibility into
its behavior and always allows identifying the root cause of an issue.
19. Implementing Observability
1. Instrument systems and applications to collect relevant
data (e.g. metrics, traces, and logs).
2. Send this data to a separate external system that can
store and analyze it.
20. Implementing Observability
1. Instrument systems and applications to collect relevant
data (e.g. metrics, traces, and logs).
2. Send this data to a separate external system that can
store and analyze it.
3. Provide visualizations and insights into systems as a
whole (including query capability for end users).
22. Instrumentation: The Three Pillars
Logs, Metrics, and Distributed Traces
(diagram also shows emerging signals: Profiling and End-user monitoring)
23. Instrumentation: Logs
• Logs
• a timestamped message emitted by services or other application
components, providing coarser-grained or higher-level information about
system behavior (like errors, warnings, etc.), typically stored in a set
of log files
• not necessarily associated with any particular user request or transaction
24. Instrumentation: Metrics
• Metrics
• aggregations of numeric data about infrastructure or an application over
a period of time. Examples include system error rates, CPU utilization,
and request rates for a given service.
26. Compatible Runtimes
Compatible with MicroProfile APIs 2.x/3.x, 4.x, 5.x, and 6.x (an “x” marks a supported API level):
Open Liberty x x x x
WebSphere Liberty x x x x
Quarkus x x
Payara Micro x x x
WildFly x x x
Payara Server x x x
TomEE x x
KumuluzEE x
Thorntail x
JBoss EAP XP x
Helidon x x
Apache Launcher x
https://microprofile.io/compatible
27. MicroProfile Metrics
“This specification aims at providing a unified way for
Microprofile servers to export Monitoring data
("Telemetry") to management agents and also a unified
Java API, that all (application) programmers can use to
expose their telemetry data.”
28. Instrumentation: Traces
• Distributed traces (i.e. traces)
• record the paths taken by requests (made by an application or end user)
as they propagate through multi-service architectures, like microservice,
macroservice, and serverless applications
29. Key Tracing Concepts
• Traces
• Traces represent requests and consist of multiple spans.
• Spans
• Spans represent single operations in a request. A span contains a name,
time-related data, log messages, and metadata that describe what occurs
during a transaction.
Image: https://blog.sentry.io/2021/08/12/distributed-tracing-101-for-full-stack-developers/
30. Key Tracing Concepts
• Context
• Context is an immutable object
contained in the span data to
identify the unique request
that each span is a part of. This
data is required for moving
trace information across
service boundaries, allowing
developers to follow a single
request through a potentially
complex distributed system.
Image: https://blog.sentry.io/2021/08/12/distributed-tracing-101-for-full-stack-developers/
32. OpenTelemetry
• High-quality, ubiquitous, and portable telemetry to enable
effective observability
• OpenTelemetry is a collection of tools, APIs, and SDKs. Use it
to instrument, generate, collect, and export telemetry data
(metrics, logs, and traces) to help you analyze your software’s
performance and behaviour.
• NB: OpenTelemetry ≠ observability back-end
https://opentelemetry.io
37. MicroProfile Telemetry 1.0
• Introduced in MicroProfile
6.0 release
• Adopts OpenTelemetry
Tracing
• Set of APIs, SDKs, tooling
and integrations
• Designed for the creation
and management of
telemetry data (traces)
https://github.com/eclipse/microprofile-telemetry
38. MP Telemetry Instrumentation
• Automatic Instrumentation:
• Jakarta RESTful Web Services and MicroProfile Rest Client
automatically enlisted in distributed tracing
• Manual Instrumentation:
• Manual instrumentation can be added via annotations @WithSpan or
via CDI injection @Inject Tracer or @Inject Span or
programmatic lookup Span.current()
• Agent Instrumentation:
• Use OpenTelemetry Java Instrumentation project to gather telemetry
data without any code modification
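As an illustrative sketch (check your runtime's documentation for exact names), on Open Liberty the automatic instrumentation described above can be switched on by enabling the mpTelemetry-1.0 feature in server.xml. Note that in MicroProfile Telemetry 1.0 the underlying OpenTelemetry SDK is disabled by default, so you also need to set `otel.sdk.disabled=false` (e.g. as a system property, MicroProfile Config property, or the OTEL_SDK_DISABLED environment variable):

```xml
<server>
  <featureManager>
    <!-- MicroProfile Telemetry 1.0: JAX-RS resources and MP Rest Client
         calls are then enlisted in distributed tracing automatically -->
    <feature>mpTelemetry-1.0</feature>
  </featureManager>
</server>
```

With the SDK enabled, further `otel.*` properties (such as the exporter endpoint) configure where spans are sent.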
40. Backend Exporter
• You can export the data that MicroProfile
Telemetry collects to multiple exporters.
• E.g.:
• Jaeger
• Zipkin
• OTel Collector
41. Visualization
• Prometheus
• Systems monitoring and alerting toolkit
• Grafana
• An open source analytics and interactive visualization platform
• Kibana
• provides users with a tool for exploring, visualizing, and building dashboards
on top of the log data stored in Elasticsearch clusters.
https://blog.sebastian-daschner.com/entries/openliberty-monitoring-prometheus-grafana
47. Summary:
• Entering a world of increased complexity
• Effective observability is critical to monitor and understand
how our applications are behaving and performing in this
complex environment
• Many open source tools are available to help us look through the
looking glass, including new standards like OpenTelemetry
• OSS Java tools like MicroProfile enable us to make use of this in our
own applications
48. Resources:
• What is observability? - https://www.ibm.com/uk-
en/topics/observability
• OpenTelemetry and MicroProfile: Enabling effective
observability for your cloud-native Java applications -
https://developer.ibm.com/articles/opentelemetry-effective-
observability-for-your-cloud-native-java-apps/
• Tracing your microservices made easy with MicroProfile
Telemetry 1.0 - https://openliberty.io/blog/2023/03/10/tracing-
with-microprofile-telemetry.html
Through the Looking-Glass, and What Alice Found There
A metaphor for any time the world suddenly appears unfamiliar, almost as if things were turned upside down – similar to looking out from inside the mirror to find a world both recognizable and yet turned inside-out.
Like a portal or gateway to that magical world – letting us see things that we wouldn’t otherwise see.
Not the only reference to being transported to a secret hidden world in literature/media… The Matrix has a similar storyline.
In a couple of scenes in the movie, there are specific references to Lewis Carroll’s fantastic novels Through the Looking-Glass and Alice in Wonderland.
At the beginning of the movie, Neo is woken up by his computer sending him a direct message instructing him to “Follow the white rabbit” – he then meets a lady with a white rabbit tattoo.
“You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill, you stay in Wonderland, and I show you how deep the rabbit hole goes.”
Innovations within cloud infrastructure, the evolution of cloud-native architecture toward microservices-based applications, and the distributed nature of cloud-native deployments have provided significant innovations and advantages but have brought with them new challenges. One such challenge is the ability to effectively observe and monitor an application’s behavior and performance in real time. Our applications are more complex than ever, often composed of tens to hundreds of distributed microservices, making them challenging to manage and monitor.
This increased complexity is precisely the reason that effective observability of our applications is so critical. We need to be able to identify and act upon failures, bottlenecks, resource constraints, and issues that may arise within our deployed applications. With many of these applications mission-critical, it is vital that they remain responsive and resilient. This can only occur if we are aware of how our applications are performing in real time and how our own services depend on or affect other services.
Used to be single core, we can now take advantage of multi-threading and multiple cores
Cloud usage – naturally distributed
Virtualization
Containerization
cloud is inherently distributed
Immediate response from apps
Impatient society
Reactive is perfect for this!
Link to analogies
Benefits of new modern improved infrastructure
Monolith like amoeba
Suitably simple, but as they developed, they needed to evolve
This new world makes root cause analysis and incident resolution potentially a lot harder, yet the same questions still need answering:
What is the overall health of my solution?
What is the root cause of errors and defects?
What are the performance bottlenecks?
Which of these problems could impact the user experience?
The new API first factor enables teams to work against each other’s public contracts without interfering with internal development processes. This is particularly important in cloud-native applications where it is highly likely that any services you build and deploy as part of this application will have to interact with other services in a large ecosystem of services. It is particularly well suited to cloud development in its ability to allow rapid prototyping, support a services ecosystem, and facilitate the automated deployment testing and continuous delivery pipelines that are some of the hallmarks of modern cloud-native application development.
Telemetry is also critically important for cloud-native applications. Telemetry and real-time app monitoring enable developers to monitor their application’s performance, health and key metrics in this complicated and highly distributed environment.
Authentication and authorization, essentially the security of an application, is a vital part of any application including those deployed in cloud environments. Deploying applications in a cloud environment means that applications can be transported across many data centres across the globe, executed within multiple containers and accessed by an almost unlimited number of clients. So, it’s vital that security is not an after-thought for cloud-native applications and is a very important factor to consider when designing and building applications.
Observability aims to provide highly granular insights into the behaviour of systems, along with rich context.
The more observable a system, the more quickly and accurately you can navigate from an identified performance problem to its root cause, without additional testing or coding.
The term “observability” comes from control theory, an area of engineering concerned with automating the control of a dynamic system (for example, the flow of water through a pipe, or the speed of an automobile over inclines and declines) based on feedback from the system.
…essentially asking the question, "How well can the internal state of a system be inferred from outside?"
A relatively new IT topic, observability is often mischaracterized as an overhyped buzzword, or a “rebranding” of system monitoring, application performance monitoring (APM), and network performance management (NPM). In fact, observability is a natural evolution of APM and NPM data collection methods that better addresses the increasingly rapid, distributed and dynamic nature of cloud-native application deployments.
What about monitoring?? Do we still need this?
Useful in production for troubleshooting; useful during performance tests to identify bottlenecks and areas of improvement.
Observability doesn’t replace monitoring — it enables better monitoring, and better APM and NPM.
Overall, implementing observability is conceptually fairly simple. To enable observability in your projects, you should…
The final step, the query and visualization capabilities, is key to the power of observability.
To make a system observable, it must first be instrumented. What do we mean by this? Instrumented means the code should emit traces, metrics, and logs.
Logs, metrics, and traces are often known as the three pillars of observability. While plainly having access to logs, metrics, and traces doesn’t necessarily make systems more observable, these are powerful tools that, if understood well, can unlock the ability to build better systems.
The hope is that, with time, OpenTelemetry may grow to be able to collect other types of telemetry data like profiling and end-user monitoring.
Logs:
They are found almost everywhere in software and have been heavily relied upon by developers and operators to help understand system behavior. Unfortunately, logs aren’t extremely useful for tracking code execution, as they typically lack contextual information, such as where they were called from. Thus, they become far more useful when they are included within a Trace.
Open Liberty has a unified logging component that handles messages written by applications and the runtime, and provides First Failure Data Capture (FFDC) capability. Logging data that applications write by using the System.out, System.err, or java.util.logging.Logger streams is combined into the server logs.
Simplify log parsing with JSON
With Open Liberty, you can write your logs to the console in JSON format and rely on the server management system to automatically consume and parse console logs. This flexibility enables servers to run in different environments without modifying the server log configuration.
The resulting logs can be viewed using the Elastic Stack (Elasticsearch, Logstash, Kibana).
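As a sketch of the configuration involved (property names follow the Open Liberty logging documentation, but verify against your server version), JSON console logging can be set in bootstrap.properties:

```properties
# Emit console logs as single-line JSON so a log aggregator can parse them
# without custom parsing rules
com.ibm.ws.logging.console.format=json
# Which log sources to include in the JSON console output
com.ibm.ws.logging.console.source=message,trace,accessLog,ffdc,audit
```

Because the format change is external to the application code, the same server can run in different environments without modifying the applications themselves.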
Logs are, by far, the easiest to generate. The fact that a log is just a string or a blob of JSON or typed key-value pairs makes it easy to represent any data in the form of a log line. Most languages, application frameworks, and libraries come with support for logging. Logs are also easy to instrument, since adding a log line is as trivial as adding a print statement. Logs perform really well at surfacing highly granular information rich in local context, so long as the search space is localized to events that occurred in a single service.
While log generation might be easy, the performance idiosyncrasies of various popular logging libraries leave a lot to be desired: excessive logging can adversely affect the performance of the application as a whole, and log messages can be lost unless a protocol like RELP is used to guarantee reliable delivery.
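To make the “a log is just a string or key-value pairs” point concrete, here is a minimal sketch in plain Java of rendering a log event as a JSON line; the field names (service, traceId) are illustrative, not any particular framework’s schema:

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

public class StructuredLog {
    // Render a log event as a single JSON line; keys are illustrative
    static String jsonLine(String level, String message, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"timestamp\":\"").append(Instant.now()).append("\",");
        sb.append("\"level\":\"").append(level).append("\",");
        sb.append("\"message\":\"").append(message).append("\"");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(",\"").append(e.getKey()).append("\":\"").append(e.getValue()).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> ctx = new LinkedHashMap<>();
        ctx.put("service", "checkout");
        // Including a trace identifier is what makes a log line joinable to a trace
        ctx.put("traceId", "4bf92f3577b34da6a3ce929d0e0e4736");
        System.out.println(jsonLine("WARN", "payment retry", ctx));
    }
}
```

In practice a logging framework does this for you; the point is that attaching context such as a trace id is what lifts a bare message into something an observability back-end can correlate.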
Metrics
Metrics, once collected, are more malleable to mathematical, probabilistic, and statistical transformations such as sampling, aggregation, summarization, and correlation. These characteristics make metrics better suited to report the overall health of a system.
Metrics are also better suited to trigger alerts, since running queries against an in-memory, time-series database is far more efficient, not to mention more reliable, than running a query against a distributed system like Elasticsearch and then aggregating the results before deciding if an alert needs to be triggered.
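The “metrics are better suited to trigger alerts” idea can be sketched with an in-memory sliding window over per-interval error rates; the 5% threshold and window size below are illustrative, not a recommendation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of an alert rule evaluated over an in-memory time series
public class ErrorRateAlert {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double threshold;

    ErrorRateAlert(int windowSize, double threshold) {
        this.windowSize = windowSize;
        this.threshold = threshold;
    }

    // Record the latest interval's error rate; return true when the
    // windowed average crosses the alert threshold
    boolean record(double errorRate) {
        window.addLast(errorRate);
        if (window.size() > windowSize) window.removeFirst();
        double avg = window.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        return avg > threshold;
    }

    public static void main(String[] args) {
        ErrorRateAlert alert = new ErrorRateAlert(3, 0.05);
        System.out.println(alert.record(0.01)); // false
        System.out.println(alert.record(0.02)); // false
        System.out.println(alert.record(0.20)); // true: average 0.23/3 > 0.05
    }
}
```

A time-series database like Prometheus evaluates exactly this kind of rule, but over persisted series and with a proper query language, which is why it is the usual home for alerting.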
The biggest drawback with both application logs and application metrics is that they are system scoped, making it hard to understand anything other than what’s happening inside a particular system.
With logs, without fancy joins, a single line doesn’t give much information about what happened to a request across all components of a system.
Distributed tracing is a technique that addresses the problem of bringing visibility into the lifetime of a request across several systems.
Eclipse MicroProfile is a set of cloud-native capabilities, implemented by our Java runtimes, that provide standard application annotations and implementations for applications to use in containers – building on a core of familiar Java EE technologies. LRA is one of the newest MicroProfile specifications, derived from WS-BA, providing a simple compensating-transaction model for Java microservices.
The blue specs are part of the MicroProfile 3.3 release. Reactive Messaging and LRA are standalone specs.
The LRA model provides a good balance between a loosely coupled transaction model that preserves independence of the application lifecycle of each microservice, and application simplicity over maintaining data consistency in the presence of failure. Simple Java annotations are used to register completion and compensation tasks, while the platform takes care of ensuring these get invoked appropriately, across any form of system failure.
MicroProfile Metrics can be accessed by monitoring tools, such as Prometheus, or by any client that can make REST requests.
By default, MicroProfile Metrics provides metrics data in Prometheus format, a representation of the metrics that is compatible with the Prometheus monitoring tool.
Prometheus metrics are a strict subset of OpenTelemetry metrics.
Prometheus collects rich metrics and provides a powerful querying language
Distributed tracing is a monitoring technique that links the operations and requests occurring between multiple services. This allows developers to “trace” the path of an end-to-end request as it moves from one service to another, letting them pinpoint errors or performance bottlenecks in individual services that are negatively affecting the overall system.
Without tracing, it can be challenging to pinpoint the cause of performance problems in a distributed system. Tracing helps to improve the visibility of an application’s health and enables debugging of behavior that can be difficult to reproduce locally.
A span describes an operation or “work” taking place on a service. Spans can describe broad operations – for example, the operation of a web server responding to an HTTP request – or as granular as a single invocation of a function.
A trace describes the end-to-end journey of one or more connected spans. A trace is considered to be a distributed trace if it connects spans (“work”) performed on multiple services.
The diagram above illustrates how a trace begins in one service – an application running on the browser – and continues through a call to an API web server, and even further to a background task worker. The spans in this diagram are the work performed within each service, and each span can be “traced” back to the initial work kicked off by the browser application. Lastly, since these operations occur on different services, this trace is considered to be distributed.
So far we’ve identified the components of a trace, but we haven’t described how those components are linked together.
This is where context is important
The trace context is typically composed of just two values:
Trace identifier (or trace_id): the unique identifier that is generated in the root span intended to identify the entirety of the trace. This is the same trace identifier we introduced in the last section; it is propagated unchanged to every downstream service.
Parent identifier (or parent_id): the span_id of the “parent” span that spawned the current operation.
With these two values, for any given operation, it is possible to determine the originating (root) service, and to reconstruct all parent/ancestor services in order that led to the current operation.
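These two values are exactly what the W3C Trace Context `traceparent` HTTP header carries between services, in the form `version-traceId-parentId-flags`. A minimal stdlib-only sketch of building and parsing that header (real tracers like OpenTelemetry do this for you):

```java
// Sketch of propagating trace context across a service boundary via the
// W3C Trace Context `traceparent` header: version-traceId-parentId-flags
public class TraceContext {
    final String traceId;  // 16 bytes hex: identifies the whole trace
    final String parentId; // 8 bytes hex: the span that spawned this operation

    TraceContext(String traceId, String parentId) {
        this.traceId = traceId;
        this.parentId = parentId;
    }

    // Serialize for an outgoing request ("00" = version, "01" = sampled flag)
    String toTraceparent() {
        return "00-" + traceId + "-" + parentId + "-01";
    }

    // What a downstream service does with the incoming header
    static TraceContext parse(String header) {
        String[] parts = header.split("-");
        return new TraceContext(parts[1], parts[2]);
    }

    public static void main(String[] args) {
        TraceContext ctx = new TraceContext("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7");
        String header = ctx.toTraceparent();
        // The downstream service recovers the same trace id, so its spans join the trace
        System.out.println(TraceContext.parse(header).traceId.equals(ctx.traceId)); // true
    }
}
```

Because the trace id travels unchanged while each hop contributes a new parent id, the back-end can stitch every span back into one end-to-end trace.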
In the past, the way in which code was instrumented would vary, as each Observability back-end would have its own instrumentation libraries and agents for emitting data to the tools.
This meant that there was no standardized data format for sending data to an Observability back-end. Furthermore, if a company chose to switch Observability back-ends, it meant that they would have to re-instrument their code and configure new agents just to be able to emit telemetry data to the new tool of choice.
OpenTelemetry has a much smaller scope. It collects metrics (as well as traces and logs) using a consolidated API via push or pull, potentially transforms them, and sends them onward to other systems for storage or query. By only focusing on the parts of Observability which applications interact with, OpenTelemetry is decoupling the creation of signals from the operational concerns of storing them and querying them. Ironically, this means OpenTelemetry metrics often end up back in Prometheus or a Prometheus-compatible system.
The OpenTelemetry community is growing. As a joint project between Splunk, Google, Microsoft, Amazon and many other organizations, it currently has the second-largest number of contributors in the Cloud Native Computing Foundation after Kubernetes.
OpenTelemetry is built around a smart idea:
• Every Java project uses a lot of standard (JDBC, Servlets…) and open source components (Apache Tomcat, Hibernate, OkHttp, Spring…)
• What if we could manipulate the bytecode of such libraries (instrumentation) to inject telemetry behavior?
OpenTelemetry is not an observability back-end like Jaeger or Prometheus. Instead, it supports exporting data to a variety of open source and commercial back-ends.
OpenTelemetry offers a pluggable architecture that enables you to add technology protocols and formats easily. Supported open source projects include the metrics format of Prometheus, which allows you to store and query metrics data in a variety of backends, and the trace protocols used, among others, by Jaeger and Zipkin for storing and querying tracing data.
The problems mentioned earlier were recognized by the community, and open source projects were initiated to tackle this lack of standardization, including OpenTracing, OpenCensus, and OpenTelemetry – tools we can make use of to enable effective observability.
OpenTracing provided a vendor-neutral API for sending telemetry data over to an Observability back-end; however, it relied on developers to implement their own libraries to meet the specification.
OpenCensus provided a set of language-specific libraries that developers could use to instrument their code and send to any one of their supported back-ends.
Still not ONE standard….
In the interest of having one single standard, OpenCensus and OpenTracing were merged to form OpenTelemetry (OTel for short) in May 2019. As a CNCF incubating project, OpenTelemetry takes the best of both worlds.
OTLP
Short for OpenTelemetry Protocol.
The OpenTelemetry Protocol (OTLP) specification describes the encoding, transport, and delivery mechanism of telemetry data between telemetry sources, intermediate nodes such as collectors and telemetry backends.
Collector
A vendor-agnostic implementation on how to receive, process, and export telemetry data. A single binary that can be deployed as an agent or gateway.
The OpenTelemetry Collector offers a vendor-agnostic implementation of how to receive, process and export telemetry data. It removes the need to run, operate, and maintain multiple agents/collectors, improves scalability, and supports open source observability data formats (e.g. Jaeger, Prometheus, Fluent Bit), sending to one or more open source or commercial back-ends. The local Collector agent is the default location to which instrumentation libraries export their telemetry data.
A receiver is how data gets into the OpenTelemetry Collector. Generally, a receiver accepts data in a specified format, translates it into the Collector’s internal format, and passes it to the processors and exporters defined in the applicable pipelines. An exporter is a component of the Collector configured to send data to different systems/back-ends.
You might wonder…. under what circumstances does one use a collector to send data, as opposed to having each service send directly to the backend?
For trying out and getting started with OpenTelemetry, sending your data directly to a backend is a great way to get value quickly. Also, in a development or small-scale environment you can get decent results without a collector.
However, in general we recommend using a collector alongside your service, since it allows your service to offload data quickly and the collector can take care of additional handling like retries, batching, encryption or even sensitive data filtering.
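Putting the receiver/processor/exporter pieces together, a Collector pipeline is declared in YAML along these lines (component names and the endpoint are illustrative placeholders; check the Collector documentation for the exporters available in your distribution):

```yaml
# Illustrative Collector config: receive OTLP from instrumented services,
# batch spans, and forward them to a Jaeger back-end over OTLP
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # placeholder back-end address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```

Services then point their exporters at the Collector instead of the back-end, which is what makes swapping back-ends a configuration change rather than a code change.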
Tracer
Responsible for creating Spans.
Baggage
A mechanism for propagating name/value pairs to help establish a causal relationship between events and services.
MicroProfile Telemetry 1.0 adopts OpenTelemetry tracing so your MicroProfile applications benefit from both manual and automatic traces!
In the future: evolve MicroProfile Telemetry to include OpenTelemetry Metrics and OpenTelemetry Logging.
Supports three types of instrumentation:
The following diagram shows a system that consists of two services that both have applications running on Open Liberty. With MicroProfile Telemetry 1.0 installed, the requests and responses generate spans and the paths between the services are monitored using their context. The spans are then exported to dedicated backend services where they can be viewed in a user interface or the console logs.
You can export the data that MicroProfile Telemetry 1.0 collects to Jaeger or Zipkin. OpenTelemetry also provides a simple logging exporter so that you can check whether spans are created by viewing the data in your console. This exporter can be helpful for debugging
Prometheus collects rich metrics and provides a powerful querying language; Grafana transforms metrics into meaningful visualizations. Both are compatible with many, if not most, data source types. In fact, it is very common for DevOps teams to run Grafana on top of Prometheus.
Prometheus - It is not intended as a dashboarding solution
Grafana
The open-source Prometheus monitoring tool offers a built-in dashboarding functionality even though it considers Grafana its visualization solution. Both Prometheus dashboard and Grafana allow users to query and graph time-series metrics stored in the Prometheus database, so deciding whether to use Grafana with Prometheus depends on your monitoring requirements.
Kibana
The key difference between Grafana and Kibana stems from their purpose. Grafana’s design caters to analyzing and visualizing metrics such as system CPU, memory, disk and I/O utilization. The platform does not allow full-text data querying. Kibana, on the other hand, runs on top of Elasticsearch and is used primarily for analyzing log messages.
Sebastian Video (old, 2019, but concepts are similar)
https://openliberty.io/blog/2020/01/23/Kibana-dashboard-visualizations.html
Liberty Grafana Dashboard - https://grafana.com/grafana/dashboards/11706-open-liberty/
Prometheus - https://openliberty.io/blog/2020/01/29/alerts-slack-prometheus-alertmanager-open-liberty.html
Jamie: a simple demo application to show how to collect traces from multiple services.