Join Allen Duet and Pieter Humphrey from Pivotal to learn how PCF Metrics enhances the developer experience on Pivotal Cloud Foundry with a simple and powerful way to troubleshoot app health and performance issues. You will see how, with a single, unified interface for events, logs, and metrics, app devs can navigate graphs to identify problems and then view the logs for that time slice.
Troubleshooting App Health and Performance with PCF Metrics 1.2
1. PCF Metrics – App Dev
Providing App Developers insight into app performance
Pieter Humphrey, Allen Duet
2. Gartner believes that more than 80% of all mission-critical IT service outages result from people and process errors and failures, and of those outages, more than 50% result from a lack of coordination between change, release and configuration management processes.
Source: Four Steps to Optimize Configuration Management Process and Tools, by Ronni J. Colville, Doc #G00258557, Oct 2013
3. Modern infrastructure is constantly changing
Methodologies: Linear / Sequential → Agile, DevOps, CI / CD Pipelines
Deployment: Sparingly at designated times → Ready for prod at any time
Architecture: Monolithic App → Microservices / Composite app
Technologies: App Server on Machine → Containers, Public / Private / Hybrid Cloud
Operations: Many tools, ad hoc automation → Manage services, not servers
5. Outages often preventable using automation (2015)
Service | Duration | Date | Cause
Facebook | 1 hour | Jan 26th | Config / app / net failures
Apple App Store | 11 hours | March 11th | Internal DNS error
NYSE, United, WSJ | 4 hr, 1.5 hr, 1 hr | July 8th | Software update, routing failure, server overload
UltraDNS | 2.5 hours | Oct 15th | Configuration errors
https://blog.thousandeyes.com/top-internet-outages-2015/
http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=2
http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=4
http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=8
6. Speed and Availability Matter
“25% of customers will abandon a web page that takes more than 4 seconds to load”
“47% of consumers expect a web page to load in < 2 seconds”
“Customers prefer a competitor’s website if it is 250ms faster”
“Increase revenue 1% for each 100ms improvement”
Sources: Gartner, Google, Amazon, Walmart
7. Speed Performance and Human Perception
Delay time | User reaction
0 - 100 ms | Instant
100 - 300 ms | Feels sluggish
300 - 1000 ms | Machine is working...
1 second + | Mental context switch
10 seconds + | I’ll come back later...
Stay under 250 ms to feel "fast".
Stay under 1000 ms to keep users’ attention.
Source: Breaking the 1000 ms Mobile Barrier - Velocity - Google Slides
https://docs.google.com/presentation/d/1wAxB5DPN-rcelwbGO6lCOus_S1rP24LMqA8m1eXEDRo/present?slide=id.p19
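The delay buckets on this slide can be expressed as a small lookup, which is handy when labelling measured response times. The following is a minimal sketch, not part of PCF Metrics: the function name `user_reaction` is hypothetical, and because the slide's ranges share their endpoints, the sketch assumes a boundary value (for example exactly 100 ms) falls into the slower bucket.

```python
from bisect import bisect_right

# Upper bounds (in ms) of the first four delay buckets from the slide.
BOUNDS_MS = [100, 300, 1000, 10000]

# One reaction per bucket; the last entry covers 10 seconds and above.
REACTIONS = [
    "Instant",
    "Feels sluggish",
    "Machine is working",
    "Mental context switch",
    "I'll come back later",
]


def user_reaction(delay_ms: float) -> str:
    """Map a response delay in milliseconds to the slide's reaction label.

    Assumption: a delay exactly on a boundary belongs to the slower bucket.
    """
    if delay_ms < 0:
        raise ValueError("delay_ms must be non-negative")
    return REACTIONS[bisect_right(BOUNDS_MS, delay_ms)]


if __name__ == "__main__":
    for sample in (40, 250, 800, 1500, 12000):
        print(sample, "ms:", user_reaction(sample))
```

A response time of 250 ms already lands in the "Feels sluggish" bucket, which is why the slide suggests staying under that mark.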
8. Changes to a single microservice or monolithic app can impact performance of downstream apps and services, or cause breakage
9. Troubleshooting apps and microservices is hard
Most platforms have:
Disparate permissions on different apps
Data silos across subsystems
Trouble reconciling time series data
11. 4 Levels of High Availability
Level 4: Availability Zone Fail
Level 3: VM Fail
Level 2: Process Fail
Level 1: App Instance Fail
12. Container Scheduler Handles Workloads
250,000 containers managed in a single environment
https://blog.pivotal.io/pivotal-cloud-foundry/products/250k-containers-in-production-a-real-test-for-the-real-world
15. Each Layer Upgradable with No Downtime
Container
App Runtime*
File system mapping
Application
Linux host & kernel
Blue-Green deploy
Canary style deploy
* e.g. embedded web server, app configurations, JRE, agents for services packaged as buildpacks
16. Our Charter
To provide App Devs with data points to assess overall solution performance and health
17. Immediate, Integrated, Automated
• Near real-time view
• Covers 80-90% of the problems
• One tool correlates events, logs, metrics
• Common set of facts for Dev+Ops
• Designed for PCF multi-tenancy
• Agentless, no install
• Enabled automatically for all applications
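To illustrate what correlating metrics with logs for a time slice means in practice, here is a minimal sketch. It does not use any PCF Metrics API: the `MetricPoint` and `LogLine` types, the `logs_around_spikes` function, the latency threshold, and the 30-second window are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class MetricPoint:
    """Hypothetical metric sample: one latency reading at a point in time."""
    timestamp: datetime
    latency_ms: float


@dataclass
class LogLine:
    """Hypothetical app log entry."""
    timestamp: datetime
    message: str


def logs_around_spikes(
    metrics: List[MetricPoint],
    logs: List[LogLine],
    threshold_ms: float,
    window: timedelta = timedelta(seconds=30),
) -> List[LogLine]:
    """Return log lines within `window` of any sample above `threshold_ms`."""
    spike_times = [m.timestamp for m in metrics if m.latency_ms > threshold_ms]
    return [
        line
        for line in logs
        if any(abs(line.timestamp - t) <= window for t in spike_times)
    ]


if __name__ == "__main__":
    base = datetime(2016, 1, 1, 12, 0, 0)
    metrics = [
        MetricPoint(base, 120.0),
        MetricPoint(base + timedelta(minutes=5), 2400.0),  # latency spike
    ]
    logs = [
        LogLine(base, "GET / 200"),
        LogLine(base + timedelta(minutes=5, seconds=10), "timeout calling backend"),
    ]
    for line in logs_around_spikes(metrics, logs, threshold_ms=1000.0):
        print(line.timestamp.isoformat(), line.message)
```

The point of the sketch is the shared timeline: once metrics and logs carry comparable timestamps, narrowing the logs to the slice around a spike is a simple filter.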
20. PCF Metrics v1.2.1
2 weeks of app log storage
2 weeks of detailed container and HTTP start stop metric storage
App Log distribution histogram
App Event UI improvements
Fault tolerance on all storage services
Testing and tuning for large ingestion loads
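The idea behind a log distribution histogram over a two-week retention window can be sketched as follows. This is an illustration under stated assumptions, not how PCF Metrics implements it: the `log_histogram` function, the one-hour bucket size, and the decision to align buckets to the retention cutoff are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable

# Two weeks of retention, as listed on the slide.
RETENTION = timedelta(weeks=2)


def log_histogram(
    timestamps: Iterable[datetime],
    now: datetime,
    bucket: timedelta = timedelta(hours=1),
) -> Dict[datetime, int]:
    """Count log entries per time bucket, ignoring entries outside retention.

    Buckets are aligned to the retention cutoff (now - RETENTION); the
    bucket size is an assumption made for this sketch.
    """
    cutoff = now - RETENTION
    counts: Counter = Counter()
    for ts in timestamps:
        if cutoff <= ts <= now:
            index = (ts - cutoff) // bucket  # whole buckets since the cutoff
            counts[cutoff + index * bucket] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    now = datetime(2016, 1, 15)
    stamps = [
        now - timedelta(days=15),      # older than two weeks, dropped
        now - timedelta(minutes=90),
        now - timedelta(minutes=30),
        now - timedelta(minutes=20),
    ]
    for start, count in log_histogram(stamps, now).items():
        print(start.isoformat(), count)
```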
23. Our Journey
PCF Metrics v1.0: Aggregate Container and HTTP metrics provided for Apps
PCF Metrics v1.1: Aggregate Container and HTTP metrics + App events and Logs (24 hour storage)
PCF Metrics v1.2.1: Aggregate Container and HTTP metrics + App events and Logs (2 weeks storage)
PCF Metrics v1.3: Aggregate Container and HTTP metrics + App events and Logs (2 weeks storage), TraceID capture and Trace Logs
24. v1.3+ App Developers
Spring Boot actuator support
Expanded event descriptions
Additional Log sources *
Data exposed as API
Continued UX improvements