This document discusses the importance of user experience, performance monitoring, error handling, and incident response for web applications. It recommends measuring key metrics like errors, response times, and availability; using tools like Graphite, ELK, and PagerDuty for logging, monitoring, and alerting; and implementing a DevOps culture and postmortem process for continuous improvement.
Aleksandr Makhomet, "Beyond the code, or how to monitor your PHP site"
1. Beyond the code. Keep your site healthy
and users satisfied
Aleksandr Makhomet
Upwork
https://www.facebook.com/amahomet
http://twitter.com/amahomet
2. What is Upwork.com
• Formerly odesk.com
• Upwork has 12+ million registered freelancers and
5+ million registered clients. Three million jobs are
posted annually, worth a total of $1+ billion USD,
making it the world's largest freelancer
marketplace.
• High load (Alexa rank 420). Microservice architecture
3. What I’m talking about
User Experience is extremely important
Things that matter:
➔ Low error rate
➔ High performance
➔ High site availability (no outages)
6. Importance of DevOps culture
DevOps (Developers + Operations)
Is a culture that emphasizes the cooperation of both software developers and
other information-technology (IT) professionals while automating the process of
software delivery. It aims at establishing a culture and environment where
building, testing, and releasing software can happen rapidly, frequently, and
more reliably
Leads to:
Faster time to market, lower failure rate of new releases, shortened lead time
between fixes, and faster mean time to recovery
8. Managing errors on production
Effective logs are important
➔ Follow PSR-3
➔ Write as many logs as possible
➔ Write full logs (user id, visitor id, stack trace, request details,
controller/action, instance info ...)
➔ Write Request Id
➔ Use meaningful log messages
➔ Do not write sensitive data
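The logging rules above can be sketched in a few lines. PSR-3 is a PHP standard, so this example only illustrates the same ideas with Python's standard logging module; the field names, the SENSITIVE_KEYS set, and the helper names redact and logger_for_request are all hypothetical.

```python
import logging
import uuid

# Hypothetical names of fields that must never reach the logs.
SENSITIVE_KEYS = {"password", "card_number", "token"}


def redact(details: dict) -> dict:
    """Mask the values of sensitive keys before they are logged."""
    return {k: ("***" if k in SENSITIVE_KEYS else v) for k, v in details.items()}


class RequestContext(logging.LoggerAdapter):
    """Prefix every message with the request id and other context fields."""

    def process(self, msg, kwargs):
        context = " ".join(f"{key}={value}" for key, value in self.extra.items())
        return f"[{context}] {msg}", kwargs


def logger_for_request(user_id, controller, action, request_id=None):
    """Build a logger that carries the same context for a whole request."""
    extra = {
        "request_id": request_id or uuid.uuid4().hex,
        "user_id": user_id,
        "route": f"{controller}/{action}",
    }
    return RequestContext(logging.getLogger("app"), extra)


logging.basicConfig(level=logging.INFO)
log = logger_for_request(42, "jobs", "create", request_id="req-1")
log.info("Job posting rejected: %s", redact({"title": "Logo design", "token": "abc"}))
```

Because the request ID is attached once per request, every line written while handling that request can be found with a single search.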
9. Logging tools are important (ELK)
Effective tools are important
ELK = Logstash + Elasticsearch + Kibana
➔ Logstash - collects, filters, and forwards logs for storage
➔ Elasticsearch - powerful full-text search on top of Apache Lucene
➔ Kibana - UI for searching logs
Write logs in JSON format
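One way to write logs in JSON is a custom formatter that emits one JSON object per line, which a log shipper can parse without extra patterns. This is a minimal sketch with Python's standard library, not the speaker's setup; the JsonFormatter class and the convention of passing fields through extra={"context": {...}} are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields passed as extra={"context": {...}} (assumed convention).
        entry.update(getattr(record, "context", {}))
        if record.exc_info:
            entry["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(entry, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app.json")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Payment failed", extra={"context": {"request_id": "req-1", "user_id": 42}})
```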
Demo
12. Performance
➔ Measure
➔ Group by controller/action or pageId
➔ Measure in detail
Any external service: database, Memcache, Redis, and so on
Any important component, such as navigation
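Measuring per controller/action and per dependency can be as simple as wrapping each piece of work in a timer. The sketch below is an assumption-laden illustration: the timed helper, the in-memory timings store, and the "pages.jobs.list.*" metric names are hypothetical, and the sleeps stand in for real database and cache calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Metric name -> list of durations in milliseconds (in-memory, for illustration).
timings = defaultdict(list)


@contextmanager
def timed(metric: str):
    """Record how long the wrapped block takes under the given metric name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[metric].append((time.perf_counter() - start) * 1000.0)


def handle_job_list():
    # Hypothetical controller/action "jobs/list" with two measured dependencies.
    with timed("pages.jobs.list.total"):
        with timed("pages.jobs.list.database"):
            time.sleep(0.002)  # stands in for a database query
        with timed("pages.jobs.list.redis"):
            time.sleep(0.001)  # stands in for a cache lookup


handle_job_list()
for name, values in sorted(timings.items()):
    print(f"{name}: {values[0]:.2f} ms")
```

Recording the total and each dependency separately shows whether a slow page is slow because of its own code or because of something it calls.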
13. Performance: Statistic
The mean can lie
Dataset of 10 requests, in ms: (2, 3, 5, 6, 6, 7, 9, 9, 26, 37)
Mean = (2+3+5+6+6+7+9+9+26+37) / 10 = 11 ms
Median = (6+7) / 2 = 6.5 ms
90th percentile dataset (the lowest 90% of values): (2, 3, 5, 6, 6, 7, 9, 9, 26)
Mean_90 = mean (90th percentile dataset) = 73 / 9 ≈ 8.1 ms
Upper_90 = max (90th percentile dataset) = 26 ms
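The figures on this slide can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name lowest_fraction is hypothetical, and it reads the "90th percentile dataset" as the lowest 90% of the values, as the slide does. The nine values kept sum to 73, so their mean is 73 / 9, roughly 8.1 ms.

```python
import statistics

samples_ms = [2, 3, 5, 6, 6, 7, 9, 9, 26, 37]


def lowest_fraction(values, percent):
    """Return the lowest `percent` percent of the values, sorted ascending."""
    ordered = sorted(values)
    keep = max(1, round(len(ordered) * percent / 100))
    return ordered[:keep]


within_90 = lowest_fraction(samples_ms, 90)

mean = statistics.mean(samples_ms)      # pulled up by the two slow requests
median = statistics.median(samples_ms)  # what a typical request looks like
mean_90 = statistics.mean(within_90)    # mean after dropping the slowest 10%
upper_90 = max(within_90)               # slowest request among the fastest 90%

print(f"mean={mean} median={median} mean_90={mean_90:.1f} upper_90={upper_90}")
```

The two outliers (26 and 37 ms) move the mean to 11 ms even though most requests finish in under 10 ms, which is why the median and percentile values are worth graphing next to it.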
15. Performance: Graphite
Graphite collects, stores, and displays time-series
data in real time.
➔ Carbon - a high-performance service that listens for time-series data
➔ Whisper - a simple database library for storing time-series data
➔ Graphite-web - Graphite's user interface & API for rendering graphs and
dashboards
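Metric format:
<metric path> <metric value> <metric timestamp>
fwdays-demo.performance.pages.index 1 5098232342
Data retention:
retentions = 10:6h,60:14d,600:400d
(one point every 10 seconds kept for 6 hours, every 60 seconds for 14 days, every 600 seconds for 400 days)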
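The metric format and retention string above can be illustrated with a short sketch. It assumes a Carbon daemon listening for the plaintext protocol on localhost:2003 (the usual default port); the function names are hypothetical, and send_metric is defined but not called here because it needs a running Carbon instance.

```python
import socket
import time

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def format_metric(path: str, value, timestamp: int) -> str:
    """Build one line of Carbon's plaintext protocol: path, value, timestamp."""
    return f"{path} {value} {timestamp}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Send a single data point to Carbon over TCP (host and port are assumptions)."""
    line = format_metric(path, value, int(timestamp or time.time()))
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))


def points_per_archive(retentions: str) -> list:
    """For each 'secondsPerPoint:duration' archive, return how many points it holds."""
    result = []
    for archive in retentions.split(","):
        step, duration = archive.strip().split(":")
        seconds = int(duration[:-1]) * UNIT_SECONDS[duration[-1]]
        result.append(seconds // int(step))
    return result


print(format_metric("fwdays-demo.performance.pages.index", 1, int(time.time())), end="")
print(points_per_archive("10:6h,60:14d,600:400d"))
```

The point counts show the trade-off in the retention string: fine resolution is kept only briefly, and older data is stored at coarser resolution.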
16. Performance: Graphite vs StatsD
Works better with StatsD
StatsD is a forwarder to Graphite
➔ Non-blocking UDP protocol
➔ Aggregates data, high performance
➔ Supports 4 useful metric types: counters, timers, gauges, sets
To integrate, write your own simple script or use one of the popular open-source clients
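A "simple script" for StatsD can be very small, because each metric is a one-line UDP datagram of the form name:value|type. The sketch below is a hypothetical minimal client, not an existing library; it assumes a StatsD daemon on 127.0.0.1:8125 (the default port) and the metric names are made up.

```python
import socket


def statsd_line(name: str, value, metric_type: str) -> str:
    """Format one StatsD datagram: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


class StatsdClient:
    """Fire-and-forget UDP client covering the four metric types."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.address = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, line: str) -> None:
        try:
            self.sock.sendto(line.encode("ascii"), self.address)
        except OSError:
            pass  # a metrics failure must never break the request

    def increment(self, name, count=1):
        self._send(statsd_line(name, count, "c"))   # counter

    def timing(self, name, milliseconds):
        self._send(statsd_line(name, milliseconds, "ms"))  # timer

    def gauge(self, name, value):
        self._send(statsd_line(name, value, "g"))   # gauge

    def unique(self, name, value):
        self._send(statsd_line(name, value, "s"))   # set


stats = StatsdClient()
stats.increment("fwdays-demo.pages.index.hits")
stats.timing("fwdays-demo.performance.pages.index", 320)
```

UDP is what makes this non-blocking in practice: the application sends the datagram and moves on, whether or not anything is listening.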
19. Performance: prevent degradation
➔ Make a performance degradation check part
of your definition of done
➔ Add a performance degradation check to your
code review checklist
➔ Use load testing
20. Performance: Alternatives
Google Analytics
➔ Keeps history for a long time
➔ Segments are great, get performance for different types of
users
New Relic
➔ Powerful performance analytics out of the box
➔ Sometimes relies on magic
➔ Has a free Lite account with 1-day data retention
Demo
21. Performance: Zipkin
Zipkin is a distributed tracing system. It helps gather timing data needed to
troubleshoot latency problems in microservice architectures
22. Alerting
Set up at least a simple health check
Application metrics
➔ 5xx / 4xx / 3xx / 2xx rate
➔ Error rate
➔ Response time
➔ Apdex
Server metrics
➔ CPU Usage
➔ Load Average
➔ Memory Usage
➔ Disk space
➔ Disk I/O
➔ Network I/O
Notification channels
➔ Chat
➔ Email
➔ SMS/push
➔ Phone call
Thresholds
➔ Warning
➔ Critical
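Two items from this slide lend themselves to a short sketch: the Apdex score and warning/critical thresholds. Apdex counts requests at or below a target time T as satisfied, requests between T and 4T as tolerating (worth half), and the rest as frustrated. The function names, the 8 ms target, and the threshold values below are assumptions chosen for illustration.

```python
def apdex(response_times_ms, target_ms):
    """Apdex = (satisfied + tolerating / 2) / total, with tolerating in (T, 4T]."""
    satisfied = sum(1 for t in response_times_ms if t <= target_ms)
    tolerating = sum(1 for t in response_times_ms if target_ms < t <= 4 * target_ms)
    return (satisfied + tolerating / 2) / len(response_times_ms)


def classify(value, warning, critical):
    """Map a metric where higher is worse onto ok / warning / critical."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


samples_ms = [2, 3, 5, 6, 6, 7, 9, 9, 26, 37]
score = apdex(samples_ms, target_ms=8)  # 6 satisfied, 3 tolerating, 1 frustrated
print(f"apdex={score}")

# Hypothetical thresholds: warn at a 2% error rate, page someone at 5%.
print(classify(0.03, warning=0.02, critical=0.05))
```

A warning level can go to a chat channel or email, while a critical level is the one worth an SMS or a phone call.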
23. Alerting: Best Practices
➔ Do not set thresholds too low; avoid
false positives
➔ Adjust your conditions over time
24. Alerting: Implementations
On top of Graphite
➔ List of free tools (Cabot)
New Relic
➔ Advanced in paid version
➔ Basic in free version
CloudWatch (if you are on Amazon)
Zabbix / Nagios / Icinga
25. Alerting: PagerDuty
PagerDuty is an alarm aggregation and
dispatching service for system administrators
and support teams. It collects alerts from your
monitoring tools, gives you an overall view of all
of your monitoring alarms, and alerts an on-duty
engineer if there’s a problem.
Demo
26. Incidents
An incident is a critical violation
➔ Create an #incident channel
➔ Define an incident escalation policy
◆ Define a person who can make decisions
◆ Define a duty officer
◆ Enable a moratorium on production changes until the incident is resolved
➔ Track metrics
◆ MTTR - Mean time to resolve
◆ MTTD - Mean time to detect
◆ MTTE - Mean time to escalate
◆ MTBF - Mean time between failures
➔ Do postmortems
➔ Keep visibility into production changes, especially with
microservices
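The incident metrics above can be computed from four timestamps per incident. The sketch below is one possible reading, not a standard: it measures time to resolve from the start of the incident (some teams measure from detection), time to escalate from detection, and time between failures from one resolution to the next start. The Incident class and the sample timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime
    detected: datetime
    escalated: datetime
    resolved: datetime


def mean_delta(deltas):
    """Average a sequence of timedelta values."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def incident_metrics(incidents):
    """Return mean time to detect, escalate, resolve, and between failures."""
    ordered = sorted(incidents, key=lambda incident: incident.started)
    metrics = {
        "mttd": mean_delta(i.detected - i.started for i in ordered),
        "mtte": mean_delta(i.escalated - i.detected for i in ordered),
        "mttr": mean_delta(i.resolved - i.started for i in ordered),
    }
    gaps = [later.started - earlier.resolved for earlier, later in zip(ordered, ordered[1:])]
    if gaps:
        metrics["mtbf"] = mean_delta(gaps)
    return metrics


# Hypothetical sample data, for illustration only.
history = [
    Incident(datetime(2016, 1, 1, 10, 0), datetime(2016, 1, 1, 10, 10),
             datetime(2016, 1, 1, 10, 15), datetime(2016, 1, 1, 11, 0)),
    Incident(datetime(2016, 1, 3, 10, 0), datetime(2016, 1, 3, 10, 20),
             datetime(2016, 1, 3, 10, 35), datetime(2016, 1, 3, 10, 40)),
]
for name, value in incident_metrics(history).items():
    print(name, value)
```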
27. Postmortems
Before
➔ What other parts of the site might have similar
issues?
➔ How can we determine the root cause faster?
➔ How can we prevent it in the future?
➔ Lessons learned
During
➔ Do not offend
➔ Do not feel offended
After
➔ Create a document with answers and share it
➔ File issues
28. A few more tools (homework)
➔ Prometheus
➔ Sentry
➔ Pinba
➔ Sensu
➔ DataDog