PCF Platform Monitoring with Prometheus and Grafana

By Alan Strader & Jamie Christian
NTAC:3NS-20

Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a
Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
Northern Trust
Founded in 1889, Northern Trust is a global leader in asset servicing, asset management, and banking for personal and institutional clients.
Wealth Management
• Individuals
• Families
• Family offices
• Foundations
• Endowments
• Privately held businesses
Corporate & Institutional Services
• Insurance companies
• Pensions
• Sovereign entities
• Fund managers
• Foundations and endowments

Agenda
• Monitoring Solution Requirements
• Options Considered
• What is Prometheus?
• Prometheus on Cloud Foundry
• Alerting
• Dashboarding
• How Prometheus Helps Us

PCF at Northern Trust
• January 2017: Began Prod and Non-Prod environment build
• March 2017: First application go-live
• June 2017: Started Prometheus journey
• Now: 750+ microservices executing across 5 foundations
4 Full Time PCF Platform Operators
250+ Spring Boot Developers

Prometheus

Prometheus at a Glance

Prometheus Value vs. Commercial Solution
• Upfront time/cost (soft dollars) higher
• Estimated ongoing cost significantly lower
• Initial implementation of new features takes time (learning curve), but is easily replicated across foundations
• 4 upgrades/foundation (20 deployments) to date!
• Time/value proposition greatly improved by input of the CF community

Monitoring Requirements
Application Monitoring
• CA APM is our enterprise implementation
Platform Monitoring (Gap identified as part of day 2 operations)
• Report health of opaque platform components
• Alert when we are approaching/exceeding capacity
• GoRouters/Diego Cells/Quota/Memory/Compute
• Forecast/project approximate dates for capacity increases
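One way to picture the forecasting requirement is a least-squares line fitted to recent usage samples and extended to the capacity limit, which is the idea behind the Prometheus predict_linear function. The sketch below is hypothetical: the function name, the sample values, and the capacity figure are invented for illustration and are not from the NT environment.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_capacity(
    samples: Sequence[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Fit a least-squares line to (timestamp, value) samples and return the
    seconds from the last sample until the line reaches `capacity`.

    Returns None when usage is flat or shrinking (no projected exhaustion)
    and 0.0 when the fitted line is already at or above capacity.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    hit_t = (capacity - intercept) / slope
    return max(0.0, hit_t - last_t)


# Hypothetical data: memory in use (GB) sampled once a day, growing 10 GB/day.
DAY = 86400.0
usage = [(0 * DAY, 400.0), (1 * DAY, 410.0), (2 * DAY, 420.0), (3 * DAY, 430.0)]
remaining = seconds_until_capacity(usage, capacity=500.0)
print(remaining / DAY)  # roughly 7 days of headroom for this made-up series
```

A straight line is the simplest possible model; real usage with steps or seasonality would need a longer window or a different fit.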

Options Considered – DataDog, Sysdig, Pandora FMS, Prometheus
• Cost (hard vs. soft dollars)
• Already in use at NT?
• Commercial vs. open source
• On premise vs. off premise
• Community & vendor recommendations
• Ease of use
• Look and feel
Prometheus:
Benefits:
• No hard dollar cost
• Existing bosh deployment
• Existing CF Dashboards
Limitations:
• No direct support
• Fragmented documentation
• Startup/Operational Learning Curve
• Desired features missing

Components of Prometheus & Grafana
Prometheus: scrapes/stores time series data
Exporters: applications that harvest existing metrics from third-party systems and expose them for Prometheus ingestion
Nginx: HTTP & reverse proxy server
Grafana: metric analytics & visualization suite
Alert Manager: provides notifications on alerts generated by the Prometheus server
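To make the exporter role concrete: an exporter serves its metrics over HTTP as plain-text lines of the form name{label="value",...} value, which Prometheus then scrapes. The sketch below only renders one such line; the helper names are hypothetical, and a real exporter would normally rely on a Prometheus client library rather than hand-built strings.

```python
from typing import Mapping


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: Mapping[str, str], value: float) -> str:
    """Render one sample as `name{label="value",...} value`."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


# Label values borrowed from the alert example later in the deck.
line = render_sample(
    "cf_application_instances",
    {"organization_name": "arch-org", "application_name": "ShowEnv"},
    2,
)
print(line)
```

Labels are sorted here only to make the output deterministic; the order carries no meaning to Prometheus.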

NT Environment – Current State

NT Environment – Future State

Installing Prometheus on PCF
• Download artifacts from GitHub
• Upload BOSH releases to BOSH director
• Create UAA clients for the firehose and a MySQL user
• Populate manifest with new credentials/Ops Manager settings
• BOSH deploy
• Note: There is a tile!

Alerting
• Prometheus:
• Rule-based alerts; conditional
• Prometheus Expression Language
• Answers the question “what is broken right now?”
• Requires additional notification solution…
• Alert Manager:
• Notification solution!
• Send summarized notifications to Slack, email, etc.
• De-duplication, grouping, routing
• Can “silence” noisy alerts (rule-based)
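De-duplication and grouping can be pictured as follows: alerts with identical label sets collapse into one, and the survivors are bucketed by a chosen set of labels so that each bucket produces a single summarized notification. This is a simplified, hypothetical model that ignores timing entirely; the function name and the "OtherApp" and "DiskFull" values are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple

Alert = Mapping[str, str]


def group_alerts(
    alerts: Iterable[Alert], group_by: Tuple[str, ...]
) -> Dict[Tuple[str, ...], List[Alert]]:
    """Drop alerts with identical label sets, then bucket the rest by the
    values of the `group_by` labels, so each bucket yields one notification."""
    seen = set()
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(sorted(alert.items()))
        if fingerprint in seen:
            continue  # same label set already recorded: a duplicate
        seen.add(fingerprint)
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


incoming = [
    {"alertname": "CFAppDown", "application_name": "ShowEnv", "severity": "warning"},
    {"alertname": "CFAppDown", "application_name": "ShowEnv", "severity": "warning"},
    {"alertname": "CFAppDown", "application_name": "OtherApp", "severity": "warning"},
    {"alertname": "DiskFull", "application_name": "n/a", "severity": "critical"},
]
grouped = group_alerts(incoming, ("alertname",))
for key, members in grouped.items():
    print(key, len(members))
```

Four incoming alerts become two notifications here: one for CFAppDown covering two applications, and one for DiskFull.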

Alerting: Custom Example
Prometheus custom.rules:
ALERT CFAppDown
  IF cf_application_instances{deployment="cf",environment="cf",organization_name="arch-org",application_name="ShowEnv"} < 2
  FOR 30s
  LABELS {service="cf", severity="warning"}
  ANNOTATIONS {details="`{{$labels.organization_name}}/{{$labels.space_name}}/{{$labels.application_name}}` has fewer application instances than expected; there has been {{$value}}/2 app instances running during the last 30s."}
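The FOR 30s clause in the CFAppDown rule means the condition must stay true for 30 seconds before the alert fires, so a brief dip does not notify anyone. The sketch below is a simplified, hypothetical model of that behavior (the function name, evaluation interval, and sample series are assumptions, not Prometheus internals).

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(
    samples: Iterable[Tuple[float, float]], threshold: float, hold_seconds: float
) -> Optional[float]:
    """Return the timestamp at which `value < threshold` has held continuously
    for `hold_seconds`, or None if that never happens.

    `samples` is an iterable of (timestamp, value) pairs in time order.
    """
    pending_since: Optional[float] = None
    for timestamp, value in samples:
        if value < threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            if timestamp - pending_since >= hold_seconds:
                return timestamp  # pending long enough: alert fires
        else:
            pending_since = None  # condition cleared: reset the timer
    return None


# Hypothetical instance counts evaluated every 15 seconds.
blip = [(0, 2), (15, 1), (30, 2), (45, 2)]
outage = [(0, 2), (15, 1), (30, 1), (45, 1), (60, 1)]
print(first_firing_time(blip, threshold=2, hold_seconds=30))    # None
print(first_firing_time(outage, threshold=2, hold_seconds=30))  # 45
```

The single low sample in the first series never fires, while the sustained drop in the second fires 30 seconds after it was first observed.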

Alerting: Notifications
• Set up a new receiver for an independent Slack channel
• Route particular alerts to the independent Slack channel
routes:
- receiver: 'slack-test'
  match:
    alertname: CFAppDown
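The routing fragment sends any alert whose alertname label equals CFAppDown to the slack-test receiver. A minimal sketch of that matching logic follows, assuming a flat list of child routes and a fallback receiver named "default"; both the function name and the fallback name are hypothetical.

```python
from typing import List, Mapping, TypedDict


class Route(TypedDict):
    receiver: str
    match: Mapping[str, str]


def pick_receiver(
    labels: Mapping[str, str], routes: List[Route], default_receiver: str
) -> str:
    """Return the receiver of the first route whose `match` labels all equal
    the alert's labels; fall back to the default receiver otherwise."""
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default_receiver


routes: List[Route] = [{"receiver": "slack-test", "match": {"alertname": "CFAppDown"}}]
print(pick_receiver({"alertname": "CFAppDown"}, routes, "default"))      # slack-test
print(pick_receiver({"alertname": "SomethingElse"}, routes, "default"))  # default
```

Every other alert keeps flowing to the fallback receiver, which is what keeps a test channel isolated from production notifications.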

Alerting: Notifications
• Customize notification links to contain a generally available URL instead of a hostname
Before: http://2a74212d-b53c-4dbe-b0c2-a60c031ee063:9093/#/alerts?receiver=slack-test
After: http://my.alerts.com:9093/#/alerts?receiver=slack-test
alertmanager_ctl:
exec alertmanager \
  -config.file="/var/vcap/jobs/alertmanager/config/alertmanager.yml" \
  -mesh.listen-address=10.1.1.1:6783 \
  -mesh.peer="10.1.1.1:6783" \
  -web.listen-address=":9093" \
  -web.external-url="http://my.alerts.com:9093"

Deciphering Metrics
• Grafana Dashboards can have unclear metric definitions
• Organization Memory Quota Consumption

Deciphering Metrics
• Grafana Dashboards can have unclear metric definitions
• Ex: Instances

Dashboarding

Use Cases
Aggregating data across Orgs/Spaces/Diego Cells
• Failed deployments → increase Diego Cells

Use Cases
Diego Cell Configuration

Use Cases
Identify where Buildpacks are being used

Use Cases
MySQL Table Statistics
• Bug during upgrade as a result of high record count

Improvements We’d Like to See
• Drilldown into dashboards
• Mechanism to pull configuration from a “config server”
• Integration mechanism for Enterprise Ticketing System
• Searchable Dashboards
• User Provided Service Metrics
• Alert Manager security

Overall…
• Would we do this again? Yes!
• Provided a significant amount of data in a short period of time
• Highly customizable
• Data
• Dashboards
• Alerts
• Notifications
• Learning curve worth the agility & operational control

Reference Links
Prometheus on PCF: https://github.com/pivotal-cf/prometheus-on-PCF
Prometheus Bosh Release: https://github.com/cloudfoundry-community/prometheus-boshrelease
Prometheus Concourse Pipeline: https://github.com/pivotal-cf/pcf-prometheus-pipeline
Prometheus Documentation: https://prometheus.io/docs/
Grafana Documentation: http://docs.grafana.org/

Learn More. Stay Connected.
Monitoring with MongoDB on PCF, Jordan Sumerlus
Wednesday 3:20
Introducing Spring Metrics, Jon Schneider
Thursday 10:30
Monitoring and Troubleshooting Spring Boot Microservices Architecture, Mukesh Gadiya
Thursday 11:50
#springone @s1p
