Aleksandr Makhomet, "Beyond the code, or how to monitor your PHP site"

Beyond the code. Keep your site healthy
and users satisfied
Aleksandr Makhomet
Upwork
https://www.facebook.com/amahomet
http://twitter.com/amahomet

What is Upwork.com
• Formerly odesk.com
• Upwork has 12+ million registered freelancers and
5+ million registered clients. Three million jobs are
posted annually, worth a total of $1+ billion USD,
making it the world's largest freelancer
marketplace.
• Highload (alexa=420). Microservice architecture

What I’m talking about
User Experience is extremely important
Things that matter:
➔ Low error rate
➔ High performance
➔ High site availability (no outages)

Apdex
Apdex (Application Performance Index)
[0 - 1]
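
In the standard Apdex definition, the score is the share of "satisfied" requests plus half of the "tolerating" ones. A minimal sketch, assuming a target threshold T of 0.5 s and an illustrative sample (neither value comes from the talk):

```python
def apdex(response_times, t=0.5):
    """Return the Apdex score in [0, 1] for response times given in seconds.

    satisfied:  response time <= T
    tolerating: T < response time <= 4T
    frustrated: everything slower (counts as zero)
    """
    if not response_times:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)


# Illustrative sample: 3 satisfied, 2 tolerating, 1 frustrated request.
samples = [0.1, 0.3, 0.4, 0.7, 1.5, 2.5]
score = apdex(samples)
print(f"Apdex = {score:.2f}")
```

A score of 1 means every request was fast enough; a score of 0 means every request was frustrating.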

Importance of DevOps culture
What breaks production? New Features!

Importance of DevOps culture
DevOps (Developers + Operations)
is a culture that emphasizes cooperation between software developers and
other information-technology (IT) professionals while automating the process of
software delivery. It aims to establish a culture and environment where
building, testing, and releasing software can happen rapidly, frequently, and
more reliably.
Leads to:
Faster time to market, lower failure rate of new releases, shortened lead time
between fixes, and faster mean time to recovery

Managing errors on production
Errare humanum est

Managing errors on production
Effective logs are important
➔ Follow PSR-3
➔ Write as many logs as possible
➔ Write full logs (user id, visitor id, stack trace, request details,
controller/action, instance info ...)
➔ Write Request Id
➔ Use meaningful log messages
➔ Do not write sensitive data
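
The points above can be sketched as a logger that emits one JSON object per line, carries a request id plus context, and drops sensitive fields. This is an illustration in Python rather than a PSR-3 implementation; the logger name, the context field names, and the `SENSITIVE_KEYS` list are all hypothetical:

```python
import io
import json
import logging
import uuid

# Hypothetical list of context keys that must never reach the logs.
SENSITIVE_KEYS = {"password", "token", "card_number"}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context is passed by the caller via extra={"context": {...}}.
        context = getattr(record, "context", {})
        entry.update({k: v for k, v in context.items() if k not in SENSITIVE_KEYS})
        if record.exc_info:
            entry["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(entry, sort_keys=True)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("app.demo")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
request_id = uuid.uuid4().hex  # generate once per request, reuse in every log line
log.error(
    "Payment failed",
    extra={"context": {
        "request_id": request_id,
        "user_id": 42,
        "action": "billing/pay",
        "password": "secret",  # filtered out by the formatter
    }},
)
line = json.loads(buffer.getvalue())
print(line)
```

Because every line shares the same request id, all log entries produced while serving one request can be found with a single search.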

Logging tools are important (ELK)
Effective tools are important
ELK = Logstash + ElasticSearch + Kibana
➔ Logstash - collects, filters, and ships logs
➔ ElasticSearch - stores logs and provides powerful full-text search on top of Apache Lucene
➔ Kibana - UI for searching logs
Write logs in JSON format
Demo

Logging: ELK Alternatives
➔ Graylog
➔ Loggly
➔ Papertrail

Error level
Monitor your error level
➔ Graphite
➔ Google Analytics

Performance
➔ Measure
➔ Group by controller/action or pageId
➔ Measure in detail:
any external service, database, Memcache, Redis, whatever;
any important component, such as navigation
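
One way to measure both a whole page and its components is a small timing helper. The sketch below is an assumption about how this could look, not the speaker's code; the `measure` helper, the metric naming scheme, and the sleeps that stand in for real work are all hypothetical:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Metric name -> list of measured durations in milliseconds.
timings = defaultdict(list)


@contextmanager
def measure(page, component="total"):
    """Time the wrapped block and record it under page + component."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings[f"performance.pages.{page}.{component}"].append(elapsed_ms)


# Simulated request for the "index" page: total time plus two components.
with measure("index"):
    with measure("index", "database"):
        time.sleep(0.01)   # stands in for a database query
    with measure("index", "navigation"):
        time.sleep(0.005)  # stands in for rendering the navigation

for name, values in sorted(timings.items()):
    print(f"{name}: {values[0]:.1f} ms")
```

Grouping by page and component makes it possible to see not only that a page became slower but also which dependency caused it.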

Performance: Statistic
Mean can lie
10 requests dataset, in ms (2, 3, 5, 6, 6, 7, 9, 9, 26, 37)
Mean = (2+3+5+6+6+7+9+9+26+37) / 10 = 11 ms
Median = (6+7) / 2 = 6.5 ms
90th percentile dataset (2, 3, 5, 6, 6, 7, 9, 9, 26)
Mean_90 = mean (90th percentile dataset) = 73 / 9 ≈ 8.1 ms
Upper_90 = max (90th percentile dataset) = 26 ms
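
The same statistics can be reproduced with a short script. This is a minimal sketch that follows the slide's approach of dropping the slowest 10% of requests before computing Mean_90 and Upper_90; the helper name `lower_slice` is hypothetical:

```python
import statistics

# Response times in ms, taken from the slide.
data = [2, 3, 5, 6, 6, 7, 9, 9, 26, 37]


def lower_slice(values, percent):
    """Return the fastest `percent` percent of the values, sorted ascending."""
    ordered = sorted(values)
    keep = max(1, len(ordered) * percent // 100)
    return ordered[:keep]


mean = statistics.mean(data)
median = statistics.median(data)
p90_dataset = lower_slice(data, 90)
mean_90 = statistics.mean(p90_dataset)
upper_90 = max(p90_dataset)

print(f"mean={mean} median={median} mean_90={mean_90:.2f} upper_90={upper_90}")
```

Two slow requests out of ten are enough to pull the mean well above what a typical user experiences, which is why the median and percentile-based values are more useful on a dashboard.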

Performance: Graphite stack

Performance: Graphite
Graphite collects, stores, and displays time-series
data in real time.
➔ Carbon - a high-performance service that listens for time-series data
➔ Whisper - a simple database library for storing time-series data
➔ Graphite-web - Graphite's user interface & API for rendering graphs and
dashboards
Metric format:
<metric path> <metric value> <metric timestamp>
fwdays-demo.performance.pages.index 1 5098232342
Data retention:
retentions = 10:6h,60:14d,600:400d
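
A minimal sketch of both formats, assuming Carbon listens for the plaintext protocol on its default TCP port 2003 on localhost. The function names are hypothetical, and the retention parser handles only the s/m/h/d suffixes:

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext line: '<path> <value> <timestamp>' plus newline."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metric(line, host="127.0.0.1", port=2003, timeout=2.0):
    """Send one line to Carbon over TCP (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


def parse_retentions(spec):
    """Turn '10:6h,60:14d' into a list of (seconds_per_point, number_of_points)."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    result = []
    for archive in spec.split(","):
        step, duration = archive.strip().split(":")
        seconds = int(duration[:-1]) * units[duration[-1]]
        result.append((int(step), seconds // int(step)))
    return result


line = format_metric("fwdays-demo.performance.pages.index", 1, 5098232342)
archives = parse_retentions("10:6h,60:14d,600:400d")
print(line, end="")
print(archives)
# send_metric(line)  # uncomment when a Carbon instance is actually reachable
```

Read this way, the retention rule on the slide keeps one point every 10 seconds for 6 hours, one per minute for 14 days, and one per 10 minutes for 400 days.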

Performance: Graphite vs StatsD
Graphite works better with StatsD
StatsD is a forwarder to Graphite
➔ Non-blocking UDP protocol
➔ Aggregates data, high performance
➔ Supports 4 useful metric types: Counters, Timers, Gauges, Sets
To integrate, build your own simple script or use any open source, most popular
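
A "simple script" for StatsD can be very small, because each metric is a single UDP datagram of the form `name:value|type`. A minimal sketch, assuming StatsD listens on its default port 8125 on localhost; apart from the `fwdays-demo` prefix, the metric names are hypothetical:

```python
import socket


def statsd_packet(name, value, metric_type):
    """Build a StatsD datagram; types: c (counter), ms (timer), g (gauge), s (set)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP: the application never waits for StatsD."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet, (host, port))
    finally:
        sock.close()


counter = statsd_packet("fwdays-demo.errors", 1, "c")
timer = statsd_packet("fwdays-demo.performance.pages.index", 320, "ms")
gauge = statsd_packet("fwdays-demo.queue.size", 12, "g")
unique = statsd_packet("fwdays-demo.visitors", "user-42", "s")

for packet in (counter, timer, gauge, unique):
    print(packet.decode("ascii"))
    # send(packet)  # uncomment when a StatsD daemon is actually running
```

Because UDP is connectionless, a missing or overloaded StatsD daemon does not slow down or break the page that reports the metric.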

Performance: Graphite graphs
You may combine, modify and filter data
to get graph that you need
Demo

Grafana
Grafana provides free, powerful,
good-looking dashboards on top
of Graphite
Demo

Performance: prevent degradation
➔ Make a performance degradation check part
of your definition of done
➔ Add a performance degradation check to your
code review checklist
➔ Use load testing

Performance: Alternatives
Google Analytics
➔ Keeps history for a long time
➔ Segments are great, get performance for different types of
users
New Relic
➔ Powerful performance analytics out of the box
➔ Sometimes uses magic
➔ Has a free lite account with 1 day of data retention
Demo

Performance: Zipkin
Zipkin is a distributed tracing system. It helps gather timing data needed to
troubleshoot latency problems in microservice architectures

Alerting
At a minimum, set up a simple health check
Application metrics
➔ 5xx / 4xx / 3xx / 2xx rate
➔ Error rate
➔ Response time
➔ Apdex
Server metrics
➔ CPU Usage
➔ Load Average
➔ Memory Usage
➔ Disk space
➔ Disk I/O
➔ Network I/O
Notification channels
➔ Chat
➔ Email
➔ SMS/push
➔ Phone call
Thresholds
➔ Warning
➔ Critical
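
The two threshold levels can be sketched as a small classifier that also picks notification channels. The threshold values and the mapping from severity to channels below are illustrative assumptions, not recommendations from the talk:

```python
# Hypothetical routing: quieter channels for warnings, louder ones for critical alerts.
CHANNELS = {
    "ok": [],
    "warning": ["chat", "email"],
    "critical": ["sms", "phone call"],
}


def classify(value, warning, critical):
    """Return 'ok', 'warning' or 'critical' for a metric where higher is worse."""
    if critical < warning:
        raise ValueError("critical threshold must not be below warning threshold")
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


# Example: 5xx rate in percent, with assumed thresholds of 1% and 5%.
for rate in (0.2, 1.5, 7.0):
    level = classify(rate, warning=1.0, critical=5.0)
    print(f"5xx rate {rate}% -> {level}, notify via {CHANNELS[level]}")
```

For metrics where lower is worse, such as Apdex or free disk space, the comparisons would need to be inverted.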

Alerting: Best Practices
➔ Avoid setting thresholds too low, so that you
avoid false positives
➔ Adjust your conditions over time

Alerting: Implementations
On top of Graphite
➔ List of free tools (Cabot)
New Relic
➔ Advanced in paid version
➔ Basic in free version
CloudWatch (if on Amazon)
Zabbix / Nagios / Icinga

Alerting: PagerDuty
PagerDuty is an alarm aggregation and
dispatching service for system administrators
and support teams. It collects alerts from your
monitoring tools, gives you an overall view of all
of your monitoring alarms, and alerts an on-duty
engineer if there's a problem.
Demo

Incidents
An incident is a critical violation
➔ Create an #incident channel
➔ Define incident escalation policy
◆ Define a person who can make decisions
◆ Define a duty officer
◆ Enable Moratorium for production changes until resolved
➔ Track metrics
◆ MTTR - Mean time to resolve
◆ MTTD - Mean time to detect
◆ MTTE - Mean time to escalate
◆ MTBF - Mean time between failures
➔ Do Postmortems
➔ Keep visibility into production changes, especially with
microservices
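
The incident metrics listed above can be computed from a simple incident log. In this sketch the two incidents are invented, and the exact definitions are assumptions: time to resolve is measured from detection, and time between failures from one resolution to the next start. Teams define these differently.

```python
from datetime import datetime
from statistics import mean

# Invented incident log with the four timestamps needed for the metrics.
incidents = [
    {
        "started": datetime(2016, 1, 10, 10, 0),
        "detected": datetime(2016, 1, 10, 10, 5),
        "escalated": datetime(2016, 1, 10, 10, 15),
        "resolved": datetime(2016, 1, 10, 11, 5),
    },
    {
        "started": datetime(2016, 1, 20, 14, 0),
        "detected": datetime(2016, 1, 20, 14, 15),
        "escalated": datetime(2016, 1, 20, 14, 20),
        "resolved": datetime(2016, 1, 20, 14, 45),
    },
]


def minutes(delta):
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60


# Mean time to detect: outage start -> first detection.
mttd = mean(minutes(i["detected"] - i["started"]) for i in incidents)
# Mean time to escalate: detection -> escalation.
mtte = mean(minutes(i["escalated"] - i["detected"]) for i in incidents)
# Mean time to resolve: detection -> resolution (assumed definition).
mttr = mean(minutes(i["resolved"] - i["detected"]) for i in incidents)
# Mean time between failures: resolution of one incident -> start of the next.
ordered = sorted(incidents, key=lambda i: i["started"])
mtbf = mean(
    minutes(later["started"] - earlier["resolved"])
    for earlier, later in zip(ordered, ordered[1:])
)

print(f"MTTD={mttd} min, MTTE={mtte} min, MTTR={mttr} min, MTBF={mtbf} min")
```

Tracking these values over time shows whether monitoring, alerting, and escalation are actually improving.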

Postmortems
During
➔ Do not offend
➔ Do not feel offended
After
➔ Create a document with answers and share it
➔ File issues
Before
➔ What other parts of the site might have similar
issues?
➔ How can we determine the root cause faster?
➔ How can we prevent it in the future?
➔ Lessons learned

A few more tools (Homework)
➔ Prometheus
➔ Sentry
➔ Pinba
➔ Sensu
➔ DataDog

Questions?
Aleksandr Makhomet
https://www.facebook.com/amahomet
http://twitter.com/amahomet
http://fwdays.com
http://ergo.place
Upwork is hiring. If you are looking for a remote senior PHP developer position, ping me.