Scylla Monitoring is a fast-paced project that exposes the cream of the Scylla native metrics in easy-to-consume and easy-to-understand dashboards. Its goal is to make it as easy as possible for a user to understand how the system is behaving and what can be done in response to issues to make the system perform better. Over the past 12 months, Scylla Monitoring has changed a lot, adding entirely new dashboards and refactoring existing ones to better convey relevant information to users. In this talk, we will explore those changes, show how to make the best use of the project, and discuss what's in the oven for next year.
What's New in Scylla Monitoring 3.0
1. What’s new in Scylla Monitoring 3.0
Amnon Heiman, Software Developer
2. Presenter
Amnon Heiman, Software Developer at ScyllaDB
Over 15 years of experience in software development of large-scale systems.
Previously worked at Convergin, which was acquired by Oracle.
Holds a BA and MSc in Computer Science from the Technion-Machon Technologi Le’ Israel and an MBA from Tel Aviv University.
3. What is New
■ Stack Overview
■ Versions Update
■ New Dashboards
■ Alerts
■ New Features
■ Scylla Manager Integration
10. New Dashboard - Alternator (DynamoDB API)
■ Cluster overview
■ Data Plane Actions
■ Data Plane Latencies
■ Control Plane Actions
■ Cache
■ Timeouts
11. Additional Alerts
■ Alerts are shown in the dashboard and can connect to external systems
■ New alerts:
● Low disk size
● CQL connectivity
12. How to Add an Alert
■ Part of the Prometheus configuration (prometheus.rules.yml)
■ Structure
● Name
● What happened
● For how long
● What to report
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
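To illustrate the "for how long" part of the structure, here is a minimal Python sketch of how a duration delays an alert: the condition has to hold at every evaluation for at least that long before the alert moves from pending to firing. The names `AlertRule` and `alert_states` and the 15-second evaluation interval are illustrative assumptions, not part of Prometheus or Scylla Monitoring; the real evaluation happens inside Prometheus.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AlertRule:
    """Hypothetical, simplified stand-in for an alerting rule."""
    name: str
    for_seconds: float  # how long the condition must hold before firing


def alert_states(rule: AlertRule, samples: List[Tuple[float, bool]]) -> List[str]:
    """Return the alert state after each evaluation.

    `samples` holds (timestamp_in_seconds, condition_is_true) pairs,
    one per rule evaluation, in chronological order.
    """
    states: List[str] = []
    active_since: Optional[float] = None
    for timestamp, condition in samples:
        if not condition:
            # The condition stopped holding, so the timer resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= rule.for_seconds else "pending")
    return states


if __name__ == "__main__":
    rule = AlertRule(name="InstanceDown", for_seconds=30)
    # Assumed evaluation every 15 seconds.
    samples = [(0, True), (15, True), (30, True), (45, False), (60, True)]
    print(alert_states(rule, samples))
    # ['pending', 'pending', 'firing', 'inactive', 'pending']
```

A short blip that clears before the duration elapses never reaches the firing state, which is why a duration helps avoid noisy alerts.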
13. How to Add an Alert - Example
- alert: InstanceDown              # Name
  expr: up == 0                    # Prometheus expression
  for: 30s                         # Duration
  labels:                          # Labels set on the alert
    severity: "2"                  # Severity is important
  annotations:
    description: 'description...'  # Longer description
    summary: Instance is down      # Summary
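Before adding a rule like this to prometheus.rules.yml, it can help to sanity-check its fields. The following Python sketch is one possible way to do that, not a tool shipped with Scylla Monitoring: `check_alert_rule` and `wrap_in_group` are hypothetical helpers, the duration pattern is a simplified approximation of the Prometheus duration format, and treating a missing severity label as a problem is an assumption based on the slide's note that severity is important. In a Prometheus rules file, alerting rules sit in the `rules` list of a named group under a top-level `groups` key.

```python
import re
from typing import Any, Dict, Iterable, List

# Simplified pattern for durations such as "30s", "5m" or "1h30m".
DURATION = re.compile(r"^(?:\d+(?:ms|s|m|h|d|w|y))+$")


def check_alert_rule(rule: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a single alerting rule."""
    problems: List[str] = []
    # An alerting rule needs a name and an expression.
    for key in ("alert", "expr"):
        if not rule.get(key):
            problems.append(f"missing required field: {key}")
    # The duration is optional, but must be well formed when present.
    if "for" in rule and not DURATION.match(str(rule["for"])):
        problems.append(f"invalid duration: {rule['for']!r}")
    # Assumption: every alert should carry a severity label.
    if "severity" not in rule.get("labels", {}):
        problems.append("no severity label")
    return problems


def wrap_in_group(group_name: str, rules: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Place rules inside the groups/rules layout of a rules file."""
    return {"groups": [{"name": group_name, "rules": list(rules)}]}


if __name__ == "__main__":
    instance_down = {
        "alert": "InstanceDown",
        "expr": "up == 0",
        "for": "30s",
        "labels": {"severity": "2"},
        "annotations": {"summary": "Instance is down"},
    }
    print(check_alert_rule(instance_down))
    print(check_alert_rule({"expr": "up == 0", "for": "30 seconds"}))
```

The group name passed to `wrap_in_group` is up to you; the resulting dictionary mirrors the layout of the rules file and can be serialized with the YAML library of your choice.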
16. Thank you. Stay in touch
Any questions?
Amnon Heiman
amnon@scylladb.com
@amnonheiman
Editor's Notes
Let's start with an overview of our monitoring stack.
Our monitoring stack uses Prometheus for metrics collection and storage.
To display dashboards we use Grafana, which reads these metrics from Prometheus.
Prometheus can generate alerts; the Alertmanager receives these alerts and also serves as a data source for Grafana.
Now, let’s discuss the changes. The applications and frameworks listed here have all been upgraded. In particular, Grafana 6 comes with a new look and extended capabilities.
By the way, you no longer need Python.
The major change is the reorganization of the dashboards to make them clearer and easier to use.
The overview dashboard shows at a quick glance how well the cluster is operating.
The Detailed dashboard provides a drill-down look at a single Scylla node.
Let's look at the new dashboards.
The CQL dashboard is based on a talk given by Shlomi at last year's summit.
It has two parts, the first covers the CQL commands.
The second part is for CQL optimization.
When everything is functioning optimally, all gauges should be at zero. A gauge above zero indicates a potential problem.
We recently introduced Scylla’s Alternator, which is a DynamoDB-compatible API for Scylla.
The Alternator dashboard provides a picture of what Alternator is doing.
We now alert on low disk space and CQL connectivity problems.
Many of our users have been asking about adding alerts themselves.
Prometheus fires an alert when a condition holds continuously for a specified period of time.
The alert will contain additional text explaining what is happening.
This is what an alert configuration looks like.
To add an alert, you give it a name, write an expression, define a minimum duration, and typically add labels, a description and a summary.
Annotations are a new feature that highlights events, helping users understand the system's behaviour.
Finally, we have tighter integration with the Manager.
You can now set your monitoring to read its configuration directly from the manager instead of configuring it manually.