This document discusses service reliability monitoring strategies. It describes a service reliability hierarchy that focuses on preventing incidents rather than just responding to them. It also discusses using metrics and alerts to monitor services at different levels of granularity. Specifically, it recommends alerting on high-level service objectives while still allowing inspection of individual components. The document then provides examples of how AWS CloudWatch can be used to collect metrics, define alerts and monitor services. It also discusses the tradeoffs of white-box vs black-box monitoring approaches.
1. SRE Book Ch 10
KKStream SRE Study Group
Presenter: Chris Huang
2018/08/29
2. Service Reliability Hierarchy
● Ranges from the most basic requirements needed
for a system to function as a service
● Up to permitting self-actualization and taking
active control of the direction of the service
rather than reactively fighting fires
3. Service Reliability Hierarchy
Incident Response
● On-call support is a tool we use to achieve
our larger mission and remain in touch with
how distributed computing systems actually
work (and fail!).
● If we could find a way to relieve ourselves of
carrying a pager, we would.
4. Service Reliability Hierarchy
Postmortem and Root-Cause Analysis
● We aim to be alerted on and manually solve
only new and exciting problems presented
by our service; it’s woefully boring to "fix"
the same issue over and over.
● This mindset is one of the key differentiators
between the SRE philosophy and some
more traditional operations-focused
environments.
6. Practical Alerting
● Being alerted for single-machine failures is unacceptable because such data is too noisy to be
actionable.
● Instead we try to build systems that are robust against failures in the systems they depend on.
● A large system should be designed to aggregate signals and prune outliers.
● We need monitoring systems that allow us to alert for high-level service objectives, but retain the
granularity to inspect individual components as needed.
Consider which CloudWatch features are equivalent to
Borgmon.
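The alerting bullets above can be illustrated with a minimal Python sketch. It is not how Borgmon or CloudWatch evaluates rules. The instance names, the 1% error objective and the "3x the fleet median" outlier rule are all assumptions chosen for illustration. The point is that paging is decided on the aggregated service-level ratio, while per-instance data stays available for inspection.

```python
from statistics import median


def service_error_ratio(per_instance):
    """per_instance maps instance name -> (errors, requests)."""
    total_errors = sum(errors for errors, _ in per_instance.values())
    total_requests = sum(requests for _, requests in per_instance.values())
    return total_errors / total_requests if total_requests else 0.0


def outliers(per_instance, factor=3.0):
    """Instances whose own error ratio exceeds factor x the fleet median.

    The factor is a hypothetical tuning knob, not a recommended value.
    """
    ratios = {
        name: (errors / requests if requests else 0.0)
        for name, (errors, requests) in per_instance.items()
    }
    typical = median(ratios.values())
    return sorted(
        name for name, ratio in ratios.items()
        if ratio > 0 and ratio > factor * typical
    )


def should_page(per_instance, objective=0.01):
    """Page only when the whole service misses its (assumed) 1% objective."""
    return service_error_ratio(per_instance) > objective


# Hypothetical fleet: one unhealthy machine among four.
fleet = {
    "web-1": (2, 1000),
    "web-2": (3, 1000),
    "web-3": (20, 1000),
    "web-4": (1, 1000),
}

print(should_page(fleet))  # False: 26 errors / 4000 requests is below 1%
print(outliers(fleet))     # ['web-3']: still visible for inspection
```

A single bad machine shows up in the outlier list for later inspection, but it does not page anyone while the service as a whole stays within its objective.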
7. Getting Metrics - varz
● Every Google service has a built-in HTTP server to export internal metrics. Borgmon can
fetch all of a target’s metrics with a single HTTP fetch.
● A Borgmon can collect metrics from another Borgmon, so we can build hierarchies that follow the
topology of the service, aggregating and summarizing information and discarding some strategically
at each level.
chris@prod-server [~] $ curl http://webserver:80/varz
http_responses map:code 200:25 404:0 500:12
chris@prod-server [~] $ curl http://webserver:80/varz
http_requests 37
errors_total 12
● The /varz HTTP handler simply lists all the exported variables in plain text. A later extension added
a mapped variable, which allows the exporter to define several labels on a variable name, and then
export a table of values or a histogram.
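As a rough illustration of how a collector could consume this output, here is a small Python sketch that parses the two line shapes shown on this slide. The exact grammar is an assumption inferred from the sample lines only (`name value` and `name map:label key:value ...`). The function name `parse_varz` is hypothetical and is not part of Borgmon.

```python
def parse_varz(text):
    """Parse /varz-style plain text into a dict.

    Plain variables become floats. Mapped variables become
    {"label": <label name>, "values": {<label value>: float}}.
    The format is assumed from the slide's sample output.
    """
    variables = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, rest = parts[0], parts[1:]
        if rest[0].startswith("map:"):
            label = rest[0][len("map:"):]
            values = {}
            for pair in rest[1:]:
                key, _, value = pair.partition(":")
                values[key] = float(value)
            variables[name] = {"label": label, "values": values}
        else:
            variables[name] = float(rest[0])
    return variables


sample = """http_requests 37
errors_total 12
http_responses map:code 200:25 404:0 500:12
"""

varz = parse_varz(sample)
error_ratio = varz["errors_total"] / varz["http_requests"]  # 12 / 37
server_errors = varz["http_responses"]["values"]["500"]
print(error_ratio, server_errors)
```

In a real setup the text would come from an HTTP fetch of the target's /varz endpoint rather than a literal string.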
9. AWS CloudWatch
● AWS services (ELB, RDS) export
default metrics to CloudWatch.
The CloudWatch agent sends
instance metrics (CPU, disk,
memory) to CloudWatch.
● For user applications, AWS
requires the app to send its own
custom metrics.
11. CloudWatch Concepts
The following terminology and concepts
are central to your understanding and use
of Amazon CloudWatch:
● Namespaces
● Metrics
● Dimensions
● Statistics
● Percentiles
● Alarms
Metrics
● Metrics are the fundamental concept in CloudWatch. A
metric represents a time-ordered set of data points that are
published to CloudWatch.
● AWS services send metrics to CloudWatch, and you can
send your own custom metrics to CloudWatch.
● Metrics are uniquely defined by a name, a namespace, and
zero or more dimensions. Each data point has a time
stamp, and (optionally) a unit of measure. When you
request statistics, the returned data stream is identified by
namespace, metric name, dimension, and (optionally) the
unit.
12. Dimensions
● A dimension is a name/value pair that uniquely identifies a metric. You can assign up to 10 dimensions to a
metric (the limit at the time of this talk; AWS has since raised it to 30).
● Every metric has specific characteristics that describe it, and you can think of dimensions as categories for
those characteristics. Dimensions help you design a structure for your statistics plan.
● AWS services that send data to CloudWatch attach dimensions to each metric. You can use dimensions to
filter the results that CloudWatch returns. For example, you can get statistics for a specific EC2 instance by
specifying the InstanceId dimension when you search for metrics.
● For metrics produced by certain AWS services, such as Amazon EC2, CloudWatch can aggregate data
across dimensions.
● CloudWatch does not aggregate across dimensions for your custom metrics.
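The last bullet is easy to misread, so here is a toy in-memory model in Python of what "uniquely identified by namespace, name and dimensions" implies. It is not the CloudWatch API. `MetricStore` and `put_with_rollups` are hypothetical names, and publishing extra roll-up series is only one possible workaround, shown as an assumption.

```python
from collections import defaultdict


class MetricStore:
    """Toy model: each (namespace, name, dimension set) is its own series."""

    def __init__(self):
        self._series = defaultdict(list)

    @staticmethod
    def _key(namespace, name, dimensions):
        return (namespace, name, frozenset((dimensions or {}).items()))

    def put(self, namespace, name, value, dimensions=None):
        self._series[self._key(namespace, name, dimensions)].append(value)

    def get(self, namespace, name, dimensions=None):
        return list(self._series.get(self._key(namespace, name, dimensions), []))


def put_with_rollups(store, namespace, name, value, dimensions):
    """Also publish under each single dimension and under no dimension."""
    candidates = [dimensions] + [{k: v} for k, v in dimensions.items()] + [{}]
    seen = set()
    for dims in candidates:
        key = frozenset(dims.items())
        if key not in seen:  # avoid double counting identical dimension sets
            seen.add(key)
            store.put(namespace, name, value, dims)


full = {"Platform": "iOS", "Subscribe": "Freemium"}

store = MetricStore()
store.put("VP/API", "LoginCount", 1, full)
exact = store.get("VP/API", "LoginCount", full)                  # [1]
subset = store.get("VP/API", "LoginCount", {"Platform": "iOS"})  # []

rolled = MetricStore()
put_with_rollups(rolled, "VP/API", "LoginCount", 1, full)
put_with_rollups(rolled, "VP/API", "LoginCount", 1,
                 {"Platform": "Android", "Subscribe": "Freemium"})
ios_logins = sum(rolled.get("VP/API", "LoginCount", {"Platform": "iOS"}))
total_logins = sum(rolled.get("VP/API", "LoginCount"))
```

Querying with only a subset of the published dimensions finds nothing unless a series with exactly that dimension set was published as well.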
13. Metrics Statistics
● Statistics are metric data aggregations over specified periods of time. Aggregations are made using the
namespace, metric name, dimensions, and the data point unit of measure, within the time period you
specify.
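To make "aggregations over specified periods of time" concrete, the following Python sketch buckets raw data points into fixed periods and computes the five basic CloudWatch statistic names (SampleCount, Sum, Average, Minimum, Maximum). It is a local illustration only. The function name and the sample latencies are assumptions, and the real service applies its own alignment and retention rules.

```python
from collections import defaultdict


def statistics(datapoints, period_seconds):
    """Aggregate (unix_timestamp, value) pairs into fixed-length periods.

    Returns {period_start: {statistic name: value}}, ordered by period start.
    """
    buckets = defaultdict(list)
    for timestamp, value in datapoints:
        start = int(timestamp) - int(timestamp) % period_seconds
        buckets[start].append(value)
    return {
        start: {
            "SampleCount": len(values),
            "Sum": sum(values),
            "Average": sum(values) / len(values),
            "Minimum": min(values),
            "Maximum": max(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical login latencies in milliseconds, timestamps in seconds.
latencies = [(0, 200.0), (30, 100.0), (59, 300.0), (60, 400.0)]
stats = statistics(latencies, period_seconds=60)
print(stats[0]["Average"])       # 200.0: mean of 200, 100 and 300
print(stats[60]["SampleCount"])  # 1
```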
14. Publish Custom Metrics
You can publish your own metrics to CloudWatch using the AWS CLI or an API. You can view statistical graphs of
your published metrics with the AWS Management Console.
chris@prod-server [~] $ aws cloudwatch put-metric-data --namespace VP/API --metric-name LoginCount --unit Count --value 1 --dimensions Platform=iOS,Subscribe=Freemium
chris@prod-server [~] $ aws cloudwatch put-metric-data --namespace VP/API --metric-name LoginLatency --unit Milliseconds --value 200.0 --dimensions Platform=iOS,Subscribe=Freemium
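We can then aggregate and visualize these metrics on a CloudWatch dashboard, for example:
● Total login count in the last 6 hours
● iOS login count in the last 6 hours
● Average login latency for Android Freemium users in the last 6 hours
Because CloudWatch does not aggregate custom metrics across dimensions, totals that span several dimension values have to be combined explicitly, for example with metric math on the dashboard.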
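The same two data points can also be published from application code. The sketch below assumes Python with the third-party boto3 SDK, which the slides do not mention. The helper names `build_metric_datum` and `publish` are hypothetical. Only `put_metric_data(Namespace=..., MetricData=[...])` is the real boto3 CloudWatch client call, and it needs AWS credentials and a region to be configured.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit, dimensions, timestamp=None):
    """Build one MetricData entry in the shape put_metric_data expects."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def publish(namespace, data, client=None):
    """Send the data points; pass a stub client to avoid real AWS calls."""
    if client is None:
        import boto3  # third-party SDK, assumed to be installed and configured
        client = boto3.client("cloudwatch")
    return client.put_metric_data(Namespace=namespace, MetricData=data)


dims = {"Platform": "iOS", "Subscribe": "Freemium"}
batch = [
    build_metric_datum("LoginCount", 1, "Count", dims),
    build_metric_datum("LoginLatency", 200.0, "Milliseconds", dims),
]

# publish("VP/API", batch)  # uncomment to send to CloudWatch
```

Batching several data points into one call keeps the number of API requests down compared with one CLI invocation per value.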
15. Black-Box vs. White-Box Monitoring
● Borgmon (or CloudWatch) is a white-box
monitoring system: it inspects the internal state
of the target service, and the rules are written with
knowledge of the internals in mind. The transparent
nature of this model provides great power to identify
quickly which components are failing.
● But you only see the queries that arrive at the
target; the queries that never make it due to a DNS
error are invisible, while queries lost due to a server
crash never make a sound.
● Black-box monitoring, such as Pingdom, is a good way
to see the service from the user’s perspective.
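A black-box check can be as small as one external HTTP request. The Python sketch below uses only the standard library and is not how Pingdom works internally. The function name `probe`, the 5-second timeout and the "2xx or 3xx counts as healthy" rule are assumptions. It separates the two failure modes mentioned above: the server answered with an error, or no answer arrived at all.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Fetch url as an outside user would; return (ok, status, seconds).

    status is None when no HTTP response arrived at all, which is the
    case white-box metrics on the server cannot see.
    """
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        status = error.code  # the server answered, but with a 4xx/5xx
    except (urllib.error.URLError, OSError):
        status = None  # DNS failure, refused connection or timeout
    elapsed = time.monotonic() - started
    ok = status is not None and 200 <= status < 400
    return ok, status, elapsed


# Example usage against a hypothetical endpoint:
# ok, status, seconds = probe("https://example.com/health")
```

Run from outside the serving network on a schedule, a probe like this catches DNS errors and crashed servers that never show up in the service's own metrics.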