Monitoring your API

WHAT IS THIS TALK ABOUT?
• Passive monitoring with graphite (collect statistics).
• What metrics to monitor.
• What tools.
• Graph examples.

ASSUMPTIONS
• You are using Nginx as a proxy for your API.
• You are using Ubuntu (the steps also work on other Linux distributions).
• You’ll be using graphite to store the metrics sent by collectl (system metrics) and logster (Nginx logs).

WHAT TO MONITOR?
“The 15 Essential Nginx Metrics to Monitor” by Scalyr https://www.scalyr.com/community/guides/how-to-monitor-nginx-the-essential-guide
•Requests per second
•Response time
•Active connections
•Connection backlog queue
•Response codes
•Process open file handles
•Process state*
•Server status*
•Server load average
•Server network usage
•Server disk space
•Hosting provider status*
•DNS expiration*
•SSL certificate expiration*
•User activity*
* Not the kind of thing you would measure as a metric, so they are not covered in this talk

WHAT TO MONITOR?
• “The USE Method” by Brendan Gregg http://www.brendangregg.com/usemethod.html
• Methodology for analyzing the performance of any system.
• Summarized as: “For every resource, check utilization, saturation, and errors.”
• Consider software a resource as well
• “USE Method: Rosetta Stone of Performance Checklists” by Brendan Gregg http://www.brendangregg.com/USEmethod/use-rosetta.html

WHAT TO MONITOR?
Resource               | Utilization                | Saturation                 | Errors
App Performance        | Response time, # Requests  | -                          | 5xx code
Nginx Connections      | Active                     | Accepted - Handled         | -
Open file descriptors  | # open files               | -                          | -
CPU                    | % Util                     | Run queue size             | -
Network                | Rx or Tx / Max             | Dropped                    | Errors
Memory                 | Used                       | Swap                       | -
Disk                   | % Util                     | Wait time and queue length | -

WHAT TOOLS? COLLECTL
• Created by HP
• Low overhead
• Available in all major Linux distributions
• Measures a rich set of metrics
• Stores data locally and exports to ganglia and graphite; custom imports and exports can be added
• Problem: doesn’t export all metrics to graphite

WHAT TOOLS? COLLECTL
• Install:
$ sudo apt-get install collectl libwww-curl-perl
• Patch the graphite export (to fix metrics that aren't included by default):
$ wget -O graphite.patch https://gist.githubusercontent.com/andphe/2a08eab7fb4148d33888/raw/5d416d8faa5a9ca535cd5e062622d712f74c6f11/graphite.patch
$ sudo patch -p0 /usr/share/collectl/graphite.ph graphite.patch
• Install the nginx import module:
$ git clone https://github.com/andphe/collectl-imports.git
$ cd collectl-imports
$ sudo cp nginx.ph /usr/share/collectl/

WHAT TOOLS? COLLECTL
• Configure (/etc/collectl.conf):
DaemonCommands = -i 10 -s+YZDN --netopts e --import nginx,s=http,h=localhost,p=80,u=nginx_status --export graphite,<ip address>,p=.collectl
• Enable Nginx status (/etc/nginx/sites-available/default):
location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
}
• Restart:
$ sudo /etc/init.d/nginx reload
$ sudo /etc/init.d/collectl restart
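
To make the Nginx rows of the USE table concrete, here is a minimal sketch of how the text served by stub_status could be turned into the "active" and "accepted - handled" values. It is not the code of the collectl nginx import module; the sample numbers and the function name parse_stub_status are made up for illustration.

```python
import re

# Sample text in the format returned by Nginx's stub_status module.
# The numbers are invented for this example.
SAMPLE_STATUS = """Active connections: 3
server accepts handled requests
 120 118 450
Reading: 0 Writing: 1 Waiting: 2
"""


def parse_stub_status(text):
    """Extract the connection counters from a stub_status response body."""
    active = int(re.search(r"Active connections:[ \t]+(\d+)", text).group(1))
    counters = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    accepted, handled, requests = map(int, counters.groups())
    return {
        "active": active,               # utilization
        "accepted": accepted,
        "handled": handled,
        "requests": requests,
        "dropped": accepted - handled,  # saturation: accepted but not handled
    }


stats = parse_stub_status(SAMPLE_STATUS)
print(stats)
```

The "dropped" value is the same difference that the Nginx saturation graph computes later with diffSeries.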

WHAT TOOLS? LOGSTER
• Created by Etsy
• Exports to ganglia, graphite, statsd, cloudwatch, nagios
• Few dependencies
• New parsers can be added
• 1 minute resolution
• Problem: it only sends requests / sec per response code

WHAT TOOLS? LOGSTER
• Nginx can log the request time via $request_time
• I created a parser for logster that takes advantage of $request_time
• Sends percentiles and max
• DOESN’T USE AVERAGES
• Sends the total number of requests per response code
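
As a rough illustration of "percentiles and max, no averages", the sketch below summarizes one interval of request times with nearest-rank percentiles. It is not the parser's actual code; the choice of p50/p90/p99 and the sample values are assumptions made for this example.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list.

    pct is an integer between 1 and 100.
    """
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = max(1, -(-pct * len(sorted_values) // 100))
    return sorted_values[rank - 1]


def summarize(request_times):
    """Summarize one interval of request times (seconds) without averaging."""
    values = sorted(request_times)
    return {
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
        "max": values[-1],
    }


# Invented request times for one interval: mostly fast, two slow requests.
times = [0.010, 0.012, 0.011, 0.250, 0.013, 0.009, 0.015, 0.014, 0.010, 2.500]
summary = summarize(times)
print(summary)
```

With only ten samples, p99 and max coincide; on a busy API the two diverge, which is why both are worth plotting.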

WHAT TOOLS? LOGSTER
• Why a new parser that doesn't use averages:

“#LatencyTipOfTheDay: Average (def): a random number that falls somewhere between the maximum and 1/2 the median. Most often used to ignore reality.” by Gil Tene http://latencytipoftheday.blogspot.com.co/2014/06/latencytipoftheday-average-random.html
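
A small made-up example of the point in the quote: with 99 fast requests and a single slow one, the average describes neither the typical request nor the worst one.

```python
import statistics

# Invented sample: 99 fast requests and one very slow one (seconds).
times = [0.1] * 99 + [10.0]

mean = statistics.mean(times)      # pulled up by the single outlier
median = statistics.median(times)  # what a typical request saw
worst = max(times)                 # what the unluckiest request saw

print(f"mean={mean:.3f}s median={median:.3f}s max={worst:.3f}s")
```

Here the mean lands at roughly twice the median while staying far below the 10 second maximum, so it hides the slow request entirely.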

WHAT TOOLS? LOGSTER

“#LatencyTipOfTheDay: If you are not measuring and/or plotting Max, what are you hiding (from)?” by Gil Tene http://latencytipoftheday.blogspot.com.co/2014/06/latencytipoftheday-if-you-are-not.html

WHAT TOOLS? LOGSTER

More resources about response times in web apps:
http://www.infoq.com/presentations/latency-pitfalls by Gil Tene
https://vimeo.com/104129953 by Andre Arko

WHAT TOOLS? LOGSTER
• Install:
$ sudo apt-get install logtail
$ git clone https://github.com/etsy/logster.git
$ cd logster && sudo python setup.py install
• Configure (add a cron job):
* * * * * logster --output=graphite --graphite-host=<ip address>:2003 -p "<hostname>.logster.api" NginxLogster /var/log/nginx/access.log > /tmp/logster_out.txt 2>&1
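
Port 2003 in the cron job is graphite's plaintext listener, which accepts one "path value timestamp" line per metric. The sketch below shows that wire format; it is not logster's own output code, and the metric name, host name and values are hypothetical.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Format one metric as a plaintext line: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host, port=2003, timeout=5.0):
    """Send pre-formatted metric lines to a plaintext listener over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


# Hypothetical metric name and values, for illustration only.
line = format_metric("web01.logster.api.latency.p99", 0.25, timestamp=1400000000)
print(line, end="")
# send_metrics([line], "graphite.example.com")  # uncomment with a real host
```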

WHAT TOOLS? LOGSTER
• Install NginxParser (copy it to the parsers folder):
$ git clone https://github.com/andphe/logster-parsers.git
$ cd logster-parsers
$ sudo cp NginxParser.py /usr/local/lib/python2.7/dist-packages/logster-0.0.1-py2.7.egg/logster/parsers/
• Configure Nginx to log the request time:
log_format request_time '$remote_addr - $remote_user [$time_local] '
                        '"$request" $status "$request_time" $bytes_sent '
                        '"$http_referer" "$http_user_agent"';
access_log /var/log/nginx/access.log request_time;
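
For reference, a minimal sketch of pulling the status code and request time out of a line written with the log_format above. This is not the NginxParser code; the regular expression, the function name parse_line and the sample log line are assumptions made for this example.

```python
import re

# Mirrors the log_format above, field by field.
LINE_RE = re.compile(
    r'^(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) "(?P<request_time>[\d.]+)" '
    r'(?P<bytes_sent>\d+) "(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"$'
)

# Invented access log line in that format.
SAMPLE_LINE = (
    '203.0.113.7 - - [10/Jun/2014:12:00:01 +0000] '
    '"GET /v1/users HTTP/1.1" 200 "0.042" 512 "-" "curl/7.35.0"'
)


def parse_line(line):
    """Return (status, request_time_seconds), or None if the line doesn't match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return int(match.group("status")), float(match.group("request_time"))


print(parse_line(SAMPLE_LINE))
```

Grouping the request times by interval gives the input for the percentile and max metrics; counting lines per status code gives the request totals.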

GRAPH EXAMPLES: APP PERFORMANCE
/render?from=-15minutes&until=now&width=400&height=300&target=aliasByNode(<hostname>.logster.api.requests.*%2C%204)&lineMode=staircase&areaAlpha=0.8&title=App%20Performance%20(%23%20Requests%2C%20HTTP%20Codes)&areaMode=all
GRAPH EXAMPLES: APP PERFORMANCE
/render?from=-15minutes&until=now&width=400&height=300&target=aliasByNode(<hostname>.logster.api.latency.*%2C4)&areaAlpha=0.8&title=App%20Performance%20(Response%20Time)

GRAPH EXAMPLES: CPU
/render?from=-1hours&until=now&width=400&height=300&target=exclude(aliasByNode(<hostname>.collectl.cputotals.*%2C%203)%2C%20'idle')&title=CPU%20(Utilization%20%25)&areaMode=all&areaAlpha=0.8

GRAPH EXAMPLES: CPU
/render?from=-1hours&until=now&width=400&height=300&target=alias(<hostname>.collectl.ctxint.run%2C%20'Run%20queue')&title=CPU%20(Saturation%20Tasks)&areaAlpha=0.8&areaMode=all

GRAPH EXAMPLES: MEMORY
/render?from=-1hours&until=now&width=400&height=300&target=aliasByNode(<hostname>.collectl.meminfo.used%2C%203)&title=Memory%20(Utilization%20KB)&vtitle=%20&areaMode=stacked&areaAlpha=0.8

GRAPH EXAMPLES: MEMORY
/render?from=-1hours&until=now&width=400&height=300&target=alias(<hostname>.collectl.swapinfo.used%2C%20'swap%20used')&title=Memory%20(Saturation%20KB)&areaMode=all&areaAlpha=0.8

GRAPH EXAMPLES: NETWORK
/render?from=-1hours&until=now&width=400&height=300&target=alias(scale(highestMax(<hostname>.collectl.netinfo.kb*.eth0%2C%201)%2C%200.00008)%2C%20'eth0')&title=Network%20(Utilization%20%25%2C%20100Mb)&areaMode=all&areaAlpha=0.8
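
The 0.00008 factor in scale() can be reproduced with a little arithmetic. The sketch below assumes that the netinfo kb* metrics are kilobytes per second and that decimal units apply (1 Mbit = 1,000,000 bits, 1 KB = 1,000 bytes); neither assumption is stated on the slide.

```python
LINK_MBIT_PER_SEC = 100  # link speed named in the graph title ("100Mb")

# 100 Mbit/s expressed in kilobytes per second, under the decimal-unit assumption.
link_kb_per_sec = LINK_MBIT_PER_SEC * 1_000_000 / 8 / 1_000


def utilization(kb_per_sec):
    """Fraction of the link in use (0.0 to 1.0) for a KB/s reading."""
    return kb_per_sec / link_kb_per_sec


scale_factor = 1 / link_kb_per_sec
print(link_kb_per_sec)    # capacity of the link in KB/s
print(scale_factor)       # the factor passed to scale()
print(utilization(6250))  # half of the link capacity
```

Under these assumptions the scaled series runs from 0 to 1; multiply by 100 to plot a percentage, and recompute the factor for links that are not 100 Mbit/s.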

/render?from=-1hours&until=now&width=400&height=300&target=alias(scale(<hostname>.collectl.netinfo.drpout.eth0%2C-1)%2C'eth0%20out')&target=alias(<hostname>.collectl.netinfo.drpin.eth0%2C'eth0%20in')&title=Network%20(%20Saturation%2C%20Drops)&areaMode=all&areaAlpha=0.8

/render?from=-1hours&until=now&width=400&height=300&target=alias(scale(<hostname>.collectl.netinfo.errout.eth0%2C-1)%2C'eth0%20out')&target=alias(<hostname>.collectl.netinfo.errin.eth0%2C'eth0%20in')&title=Network%20(%20Errors)&areaMode=all&areaAlpha=0.8

GRAPH EXAMPLES: DISK
/render?[...]l.diskinfo.util.sda%2C%204)&title=Disk%20(Utilization%20%25)&areaMode=all&areaAlpha=0.8

/render?[...]l.diskinfo.quelen.sda%2C%204)&title=Disk%20(Saturation%2C%20Queue%20Len%20%2F%20sec)&areaMode=all&areaAlpha=0.8

/render?[...]l.diskinfo.wait.sda%2C%204)&title=Disk%20(Saturation%2C%20Time%20wait%20%2F%20sec)&areaMode=all&areaAlpha=0.8

GRAPH EXAMPLES: NGINX
/render?[...]l.ngix.conn.active%2C%204)&title=Nginx%20(Utilization%2C%20Connections)&areaMode=all&areaAlpha=0.8

GRAPH EXAMPLES: NGINX
/render?from=-1hours&until=now&width=400&height=300&target=alias(diffSeries(<hostname>.collectl.ngix.conn.accepted%2C%20<hostname>.collectl.ngix.conn.handled)%2C%20'dropped')&title=Nginx%20(Saturation%2C%20Connections)&areaMode=all&areaAlpha=0.8

THANK YOU!
QUESTIONS & ANSWERS
@andphe
andphe@gmail.com
