SpringOne Platform 2017
Alan Strader, Northern Trust; Jamie Christian, Northern Trust
"This presentation will cover Northern Trust's platform monitoring solution which is Grafana, Prometheus and Alertmanager. Specifically:
Enterprise need for monitoring of the platform
Options considered
Rationale for using this particular solution
Architecture of the solution and how it monitors our 5 foundations
A demo or screen captures
What we find valuable and what we look at daily to better manage the platform
Issues encountered as we deployed the solution (bosh/yml/forwarders)
Stories on how it saved us"
Learn More. Stay Connected.
Monitoring with MongoDB on PCF, Jordan Sumerlus
Wednesday 3:20
Introducing Spring Metrics, Jon Schneider
Thursday 10:30
Monitoring and Troubleshooting Spring Boot Microservices Architecture, Mukesh Gadiya
Thursday 11:50
Editor's Notes
Financial Services
Founded 125+ years ago
Primary businesses Wealth Mgmt/C&IS
Hope you’re in the right place!
Highlights that we are new operators on a learning curve
Greek God – gave fire to man
Open Source Software – shows fire(s) to man
Alcohol fires: in many cases things we would not have otherwise seen
Open-source systems monitoring and alerting toolkit originally built at SoundCloud
More time on slides than on Prometheus itself!
We already have an application monitoring solution (CA APM)
**Datadog, Sysdig, Pandora, Prometheus**
Cost
actual dollar amount to spend vs. time/resources put in
In use at NT?
Give preference to products already in-house where possible
is it an ADDITIONAL cost to NT?
Commercial vs. Open Source
Cost and support model considerations
On vs Off premise
Prefer on premise due to security concerns and direction
Unclear what data would be exposed due to lack of experience with product
Recommendations
PCF integration: we don’t have to build it!
Container awareness is likely on par with expectations; data makes sense
Ease of Use
Data presented in a useful way (ex: 1 metric per graph?)
Does PCF integration already exist?
Look and Feel
Visuals make sense
Can display similar data in multiple formats
POINT: how does the data actually travel? Where does it go?
transitions into what is in the bosh release
PCF Components: CC, MySQL, Bosh Director, …
Exporters: applications which harvest existing metrics from third-party systems as Prometheus metrics. This is useful for cases where it is not feasible to instrument a given system with Prometheus metrics directly (for example, HAProxy or Linux system stats).
Node_Exporter: One of the most widely used exporters. Added through the PCF runtime-config to every VM running on PCF.
I/O
Memory
Disk
CPU Pressure
Several Community exporters
Build your own!
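To make "build your own" concrete: an exporter ultimately just serves metrics as plain text for Prometheus to scrape. A minimal sketch of rendering one gauge in the Prometheus text exposition format follows. The metric name, help text, and values are hypothetical, and a real exporter would normally use a Prometheus client library and serve this text over HTTP.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value)


def render_gauge(name: str, help_text: str, samples: List[Sample]) -> str:
    """Render one gauge metric family in the Prometheus text exposition format.

    Simplification: label values are assumed to contain no quotes,
    backslashes, or newlines, so no escaping is done here.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric harvested from some third-party system.
text = render_gauge(
    "thirdparty_disk_used_bytes",
    "Bytes used on each volume (illustrative).",
    [({"volume": "data"}, 1024.0), ({"volume": "logs"}, 256.0)],
)
print(text)
```

Prometheus would scrape this text on its normal interval and store each labelled line as its own time series.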
Prometheus: scrapes and stores time series data
Time Series database:
Nginx: HTTP & reverse proxy server
Grafana: open source metric analytics & visualization suite used for visualizing time series data
Alert Manager: handles alerts sent by client applications such as the Prometheus server.
It takes care of deduplication, grouping, and routing them to the correct receiver integrations (email, Slack, PagerDuty, OpsGenie, etc)
Current state:
- 5 foundations, 3 non-prod (Sandbox, System, UAT) - one datacenter, 2 prod (live/warm) – two datacenters
- 1 grafana instance for each foundation
Future state:
- 8 foundations, 6 non-prod (Sandbox1, Sandbox2, System live/live, UAT live/live) – two datacenters, 2 prod (live/live) – two datacenters
- 4 grafana instances (non-prod DC1, non-prod DC2, prod DC1, prod DC2)
Future state:
- 8 foundations, 6 non-prod (Sandbox1, Sandbox2, System live/live, UAT live/live) – two datacenters, 2 prod (live/live) – two datacenters
- 2 grafana instances (non-prod, prod)
Potential talking point: recommendations for others; start more granular?? Start with spanned approach?
- Easy to install as most of what is needed for CF has been built and published to GitHub for you.
** Most time spent on this was on understanding configuration items**
ex: proxy, vm configurations
ex: exporters would often die until we increased RAM
Once it is set up, provides valuable data out of the box and does not require much care and feeding, depending on existing skillset.
Our pain points were more a result of being new to Prometheus/Bosh. Trying to learn many things at once.
Tile:
when we started with Prometheus, the tile didn’t exist
Does not give us the flexibility to span multiple foundations (as is our goal – future state), so chose not to explore this.
Grouping: categorizes alerts of similar nature into a single notification.
useful during larger outages when many systems fail at once and hundreds to thousands of alerts may be firing simultaneously.
Thus one can configure Alert Manager to group alerts by their cluster/alertname to send a single notification.
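A rough sketch of the grouping idea, not Alert Manager's actual implementation: alerts that share the same values for the chosen labels collapse into one notification. The alert labels and cluster name below are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # an alert is represented only by its labels here


def group_alerts(
    alerts: List[Alert], group_by: Tuple[str, ...]
) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts firing during an outage.
alerts = [
    {"alertname": "InstanceDown", "cluster": "prod-dc1", "instance": "cell-0"},
    {"alertname": "InstanceDown", "cluster": "prod-dc1", "instance": "cell-1"},
    {"alertname": "DiskLow", "cluster": "prod-dc1", "instance": "cell-0"},
]
groups = group_alerts(alerts, ("cluster", "alertname"))
for key, members in groups.items():
    print(key, "->", len(members), "alert(s) in one notification")
```

Three firing alerts become two notifications here; during a large outage the same rule would turn hundreds of alerts into a handful.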
Routing: send alerts to different receivers based on “match”;
I want to know when my app crashes, but don’t want to bombard everyone else in my Slack channel. Send all alerts with organization_name “arch-org” to our email distribution list.
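The "match" idea can be sketched as a simple lookup: the first route whose labels all match the alert wins, otherwise the alert goes to a default receiver. This is a simplification (real Alert Manager routes form a tree and also support regex matching), and the receiver names are hypothetical.

```python
from typing import Dict, List, Tuple

Route = Tuple[Dict[str, str], str]  # (labels that must match, receiver name)


def pick_receiver(labels: Dict[str, str], routes: List[Route], default: str) -> str:
    """Return the receiver of the first route whose labels all match."""
    for match, receiver in routes:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return default


# Hypothetical receivers: an email list for arch-org, Slack for everything else.
routes = [({"organization_name": "arch-org"}, "arch-org-email")]

ours = pick_receiver(
    {"alertname": "AppCrashed", "organization_name": "arch-org"},
    routes,
    default="platform-slack",
)
theirs = pick_receiver(
    {"alertname": "AppCrashed", "organization_name": "other-org"},
    routes,
    default="platform-slack",
)
print(ours, theirs)
```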
Silence: simply mute alerts for a given time.
Helpful during times like an upgrade when you expect a lot of activity and don’t care to be told your CPU usage is high!
Alternatively, can remove useless alerts entirely, so that muting them is not time-bound.
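A minimal sketch of what a silence amounts to: a set of label matchers plus a time window. An alert is muted only while the window is open and every matcher fits. The alert name and upgrade time are hypothetical, and real silences are created through the Alert Manager UI or API rather than in code like this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class Silence:
    matchers: Dict[str, str]
    starts_at: datetime
    ends_at: datetime

    def mutes(self, labels: Dict[str, str], now: datetime) -> bool:
        """True if the silence is active at `now` and all matchers fit."""
        active = self.starts_at <= now < self.ends_at
        matches = all(labels.get(k) == v for k, v in self.matchers.items())
        return active and matches


# Hypothetical two-hour upgrade window during which high-CPU alerts are muted.
upgrade_start = datetime(2017, 12, 5, 9, 0)
silence = Silence(
    matchers={"alertname": "HighCPU"},
    starts_at=upgrade_start,
    ends_at=upgrade_start + timedelta(hours=2),
)

during = upgrade_start + timedelta(minutes=30)
after = upgrade_start + timedelta(hours=3)
print(silence.mutes({"alertname": "HighCPU"}, during))  # muted during the upgrade
print(silence.mutes({"alertname": "HighCPU"}, after))   # fires again afterwards
```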
Not one size fits all, so customization is key.
EXAMPLE: We’ve had app teams request alerting on when their apps go down; test scenario.
>> if: query that triggers the alert
**>>details: gives value to alert notifications** important because….
By default, notifications provide alert name and labels.
Details make the alert meaningful at a glance.
Can also customize routes (alertmanager.yml) for notifications, so for these app teams, only they will get their alerts.
One thing we had to modify, in addition to the optional customizations, was the notification URL.
OOB, notifications use alertmanager hostname in URLs
not generally available outside of the VM
One concern: the URL takes you to Alert Manager.
Have not figured out how to isolate alert data based on app team.
Everyone can view any alert.
A LOT of information; need to learn what metrics mean, some not obvious.
Org Memory Quota Consumption: actual vs. reserved
GOOD! Help teams use resources wisely
INSTANCES: System Apps Dashboard vs. Space Summary Dashboard
Dashboards were showing us reserved resources, not actual values.
CONFUSING. We decided to change them so that they made more sense to us.
Now we can see all instance information in one dashboard.
Running vs. Crashed WHICH are crashed??
When we start using instance quotas, compare requested instances vs. instance quota
There are plenty of dashboards we loved OOB that required no modification… some we look at regularly to better manage the platform >>
“CF: Cells Capacity”
Situations where this solution filled a void
-THEN: Failed deployments
application memory limit?
org memory quota?
ERT memory status (percentage)
probably about time we add cells or change template…
-NOW: (proactive vs. reactive) alerted on cells with low resource or total available cell memory
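The proactive check can be sketched as two thresholds: one per cell and one for the total across cells. In practice this lives in Prometheus alert rules; the cell names, readings, and thresholds below are hypothetical and only show the logic.

```python
from typing import Dict, List


def capacity_alerts(
    available_mb: Dict[str, float],
    per_cell_floor_mb: float,
    total_floor_mb: float,
) -> List[str]:
    """Return warnings for low per-cell memory and low total cell memory."""
    warnings = [
        f"cell {cell} low on memory: {mb:.0f} MB available"
        for cell, mb in sorted(available_mb.items())
        if mb < per_cell_floor_mb
    ]
    total = sum(available_mb.values())
    if total < total_floor_mb:
        warnings.append(f"total available cell memory low: {total:.0f} MB")
    return warnings


# Hypothetical readings, as if taken from a cell capacity metric.
readings = {"cell-0": 6144.0, "cell-1": 1024.0, "cell-2": 2048.0}
warnings = capacity_alerts(readings, per_cell_floor_mb=2048.0, total_floor_mb=10240.0)
for line in warnings:
    print(line)
```

The point is the ordering of events: the warning arrives while there is still room to add cells, instead of after a deployment has already failed.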
Getting alerts for low storage…
Went to Prometheus to see Allocated vs. Available and were surprised by total cell disk: why 32GB when our template assigned 64GB?
SURPRISE SWAP
learned how Swap is allocated by Bosh
what configuration is needed to increase our usable storage
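The arithmetic behind the surprise can be sketched as follows. This assumes the default behavior usually described for the Bosh agent (swap sized to the VM's RAM, capped at half of the ephemeral disk). These notes do not spell out the rule or the cells' RAM size, so both the rule and the RAM figures below are assumptions.

```python
from typing import Tuple


def ephemeral_split_gb(disk_gb: float, ram_gb: float) -> Tuple[float, float]:
    """Return (swap_gb, usable_gb) for an ephemeral disk.

    Assumed rule: swap equals RAM, but never more than half the disk.
    """
    swap_gb = min(ram_gb, disk_gb / 2)
    return swap_gb, disk_gb - swap_gb


# A 64 GB ephemeral disk on a hypothetical cell with 64 GB of RAM:
swap, usable = ephemeral_split_gb(disk_gb=64, ram_gb=64)
print(f"swap={swap} GB, usable={usable} GB")

# The same disk on a hypothetical 16 GB VM keeps more usable space:
small_swap, small_usable = ephemeral_split_gb(disk_gb=64, ram_gb=16)
print(f"swap={small_swap} GB, usable={small_usable} GB")
```

Under that assumption, a cell with 32 GB of RAM or more would lose half of a 64 GB disk to swap, which matches the 32GB seen in Prometheus.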
When we upgrade our buildpacks or delete old ones…
Upgraded the platform and ran into a bug; the “push apps manager” errand failed;
Truncated this table and it worked
Rather than:
Jumpbox
ssh ops manager
set deployment
target and log in to mysql
set a table
read table size
NOW: Check here prior to upgrading ERT; simplifies prerequisite work for upgrades. Take that anywhere we can get it!
Plenty we love about the product, but some improvements we’d like to see…
Drilling down into different metrics
showed on instance dashboard
Config Server
Any changes require redeploy (to keep them)
Ex: alerts
Enterprise Ticketing System
Granularity: need individual notifications vs. aggregated
Searchable Dashboards
We have almost 70 dashboards, and some are very similar, though not entirely the same
Can only search on dashboard name, not the contents of it
User Provided Service Metrics
Can only see OOB (Rabbit, Redis, Autoscaler)
Haven’t found a way to query for UPS to create our own dashboards
Alert Manager Security
No authentication needed; can see all alerts for all receivers.
Happy with Prometheus and would recommend to others
Provided a huge amount of data in a short period of time;
we think we can customize the data more to better fit our needs.
Good experiences with customizations thus far.
Took time to understand the product and the data being collected to be able to meet our goals; but learning curve is made worth it by the agility and operational control we have w/ this tool