Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013
1. Nick Galbreath http://client9.com/20130501 @ngalbreath
Care and Feeding of
Large Scale
Graphite
Installations
Nick Galbreath ★ IPONWEB
DevOpsDays ★ Austin Texas ★ 2013-04-30
Who cares?
• Making it easy to create, analyze and
share data can change your
organization
• Making a data-driven culture
• Empowering developers, operations,
qa, security and business to be more
confident in the changes they make.
What is it you say
you do here?
• Your job is likely invisible to the rest of the
organization
• invisible things aren't valued
• so make what you do visible
Why Graphite?
• Many innovations in each part of the
stack
• But, it's the Full Stack that really makes
it special.
• On-disk layout to UI to API to... the
community around it.
Sharing is Caring
• Allows data to be easily accessed
• And easily shared. This makes it
different than many monitoring
solutions.
• It's your own in-house mashup
generator.
Installation
• 3 python-twistd servers
• "carbon-cache"
• "carbon-aggregator"
• "carbon-relay"
• Apache / Django Web UI and API
• Uses SQLite3/MySQL for
dashboards / events
Be Current
• Don't use the OS default version.
• Newer point releases of graphite have
significant improvements in storage
engine and webui/api
• It's 100% Python so
"building it yourself" shouldn't be too hard.
• pip install works and is current.
The Documentation
• everyone complains about it
• historically bad, but getting a lot better
• Switched locations, but not all
search engines are updated to use:
http://graphite.readthedocs.org/
• Source code is quite good, so RTFS
What is it?
• Storage engine
• Handles reads, writes and creates of a
single metric to a fixed size file.
• One file, kinda dumb (good).
• Here's the API:
https://github.com/graphite-project/whisper/blob/master/whisper.py
Graphite Math
• About 12 bytes per point.
• Store 1 minute points for 1 month and
15 minutes for 11 months.
• (60×24×30 + 4×24×30×11) ×12 =
878kB
• If you can keep all your points in
memory, then magic!
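The arithmetic above can be checked with a short sketch. It assumes the slide's round figure of 12 bytes per point and ignores whisper's small file header; the function name `retention_bytes` is made up for illustration.

```python
BYTES_PER_POINT = 12  # the slide's approximate figure; file header overhead is ignored


def retention_bytes(archives):
    """archives: list of (seconds_per_point, retention_seconds) pairs."""
    points = sum(retention // step for step, retention in archives)
    return points * BYTES_PER_POINT


DAY = 24 * 60 * 60
archives = [
    (60, 30 * DAY),            # 1-minute points for 1 month
    (15 * 60, 11 * 30 * DAY),  # 15-minute points for 11 months
]

size = retention_bytes(archives)
print(size, "bytes, roughly", round(size / 1024), "KiB per metric")
```

Multiply the result by the number of metrics you expect to get a rough idea of whether the working set fits in memory.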
Disk Layout
• Each metric creates a directory tree
server123.myapp.logins.failed
• Makes 3 directories
• This creates a very branchy directory
structure
• This has good and bad points.
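As a rough illustration of the layout, here is how a dotted metric name could map to a file path. The storage root shown is an assumption (adjust it to your install), and `metric_to_path` is a hypothetical helper, not part of Graphite.

```python
import posixpath


def metric_to_path(metric, root="/opt/graphite/storage/whisper"):
    """Map a dotted metric name to its whisper file (root is an assumed default)."""
    *directories, leaf = metric.split(".")
    return posixpath.join(root, *directories, leaf + ".wsp")


# Three directories (server123/myapp/logins) plus one file (failed.wsp).
print(metric_to_path("server123.myapp.logins.failed"))
```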
Write Buffer
• Most important feature is write
buffering to protect the disk.
• Data is buffered and written out once
per minute (or so).
• But
The Cache
• It's a write cache.
• Once data is written,
it's out of the cache
• In other words, the cache is
metrics not on disk.
• If the cache dies, you lose metrics
• (btw: the read cache is the os disk
cache)
New Metrics
• New metrics are created automatically
• But, it is very expensive.
• MAX_CREATES_PER_MINUTE = 50
• Saves your disk, but new metrics will
"pile up" in cache.
• May take 10m+ for your metrics to start
flowing....
FALLOCATE
• WHISPER_FALLOCATE_CREATE = True
• Linux Kernel >= 2.6.23
• fallocate is used to preallocate blocks
to a file. For filesystems which
support the fallocate system call, this
is done quickly by allocating blocks
and marking them as uninitialized,
requiring no IO to the data blocks.
This is much faster than creating a
file by filling it with zeros.
• https://bugs.launchpad.net/whisper/+bug/957827
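To see the difference between preallocating and zero-filling, here is a minimal sketch using Python's standard `os.posix_fallocate`. This is not how whisper itself does it; `preallocate` is a hypothetical helper, and on filesystems without native support the C library may quietly fall back to writing the blocks anyway.

```python
import os
import tempfile


def preallocate(path, size):
    """Create a file of `size` bytes, preferring fallocate over zero-filling."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        try:
            # Reserve the blocks without writing data to them.
            os.posix_fallocate(fd, 0, size)
            return "fallocate"
        except (AttributeError, OSError):
            # Slow path: actually write every byte as zeros.
            os.write(fd, b"\0" * size)
            return "zero-fill"
    finally:
        os.close(fd)


path = os.path.join(tempfile.mkdtemp(), "example.wsp")
print(preallocate(path, 898560), os.path.getsize(path))
```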
Limit the Size
Limit the size of the cache to avoid
swapping or becoming CPU bound. Sorts
and serving cache queries gets more
expensive as the cache grows.
Use the value "inf" (infinity) for an
unlimited cache size.
MAX_CACHE_SIZE = inf
No! Infinity does not exist on your system!
Graphite for Graphite
By default, carbon itself will log
statistics (such as a count,
metricsReceived) with the top level prefix
of 'carbon' at an interval of 60
seconds. Set CARBON_METRIC_INTERVAL to 0 to
disable instrumentation
CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 60
Pre-Aggregation
• Sum or Average metrics based on
wildcards and regexps
• Helps eliminate very slow queries on
webui
• You can emit the final sum & all the
individual components or just the final
sum (via blacklists)
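The idea of pre-aggregation can be sketched in a few lines. This is only an illustration of summing metrics that match a wildcard, not carbon-aggregator's rule engine; the function names and the sample metric names are made up.

```python
from fnmatch import fnmatchcase


def matches(pattern, metric):
    """Match dotted names segment by segment, so '*' never crosses a dot."""
    pattern_parts, metric_parts = pattern.split("."), metric.split(".")
    return len(pattern_parts) == len(metric_parts) and all(
        fnmatchcase(part, pat) for pat, part in zip(pattern_parts, metric_parts)
    )


def pre_aggregate(points, pattern, output_name):
    """Sum every value whose metric name matches `pattern` into one metric."""
    total = sum(value for name, value in points.items() if matches(pattern, name))
    return {output_name: total}


points = {
    "server1.myapp.logins.failed": 3,
    "server2.myapp.logins.failed": 4,
    "server1.myapp.logins.ok": 10,
}
print(pre_aggregate(points, "*.myapp.logins.failed", "all.myapp.logins.failed"))
```

The web UI then reads one precomputed series instead of fanning out over every server at query time.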
Also...
• has support for broadcasting data to
multiple downstream caches
• but.. never used it.. and seems at odds
with the next middleware
It's a Router!
• Consistent Hashing (Sharding)
• Or more rule-based routing
• Output to multiple carbon servers
have not really used it much, but should
work similarly to scale outs of
memcache, redis
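For readers who have not met consistent hashing, here is a generic sketch of a hash ring. It is not carbon-relay's implementation; the class, the replica count, and the cache names are all assumptions chosen for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring; each node gets `replicas` points on the ring."""

    def __init__(self, nodes, replicas=100):
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, metric):
        # First ring point clockwise from the metric's hash, wrapping around.
        index = bisect.bisect(self._keys, self._hash(metric)) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("server123.myapp.logins.failed"))
```

The useful property is that adding a node only moves the metrics that now belong to the new node; everything else stays where it was.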
StatsD
• https://github.com/etsy/statsd/
• nodejs based but lots of other
implementations
• Receives UDP, sends graphite-compatible
output, flushed periodically.
• Aggregation for all by default
• Besides sum, it can also compute other
basic statistics (mean, 90th percentile),
do sampling, have counters, etc.
StatsD use case
• It's UDP based, so it excels at
embedding a client inside the
application
• UDP can't block or break the sending
application
• Not so good for bulk metrics
• Use both! Can work together with
aggregator.
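A fire-and-forget counter increment might look like the sketch below. The `name:count|c` payload is the StatsD counter format; the host, the port (8125 is the usual StatsD default) and the helper name are assumptions.

```python
import socket


def statsd_incr(name, count=1, host="127.0.0.1", port=8125):
    """Send one counter increment over UDP and never raise into the caller."""
    payload = f"{name}:{count}|c".encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    except OSError:
        pass  # metrics must not block or break the application
    finally:
        sock.close()
    return payload


statsd_incr("myapp.logins.failed")
```

If nothing is listening, the datagram is simply dropped, which is exactly the behavior you want inside application code.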
Of Note
• https://github.com/armon/statsite
• Need to look at this more
• C + libev based
• modern time series algorithms
• very flexible output
Backup
• Doing a naive backup causes graphite
performance to go to crap.
• File system cache is trashed
• Metrics are not written to disk (lag)
• If OOM occurs then you lose metrics.
ionice is better
IONICE(1) User Commands IONICE(1)
NAME
ionice - set or get process I/O scheduling class and priority
SYNOPSIS
ionice [-c class] [-n level] [-t] -p PID...
ionice [-c class] [-n level] [-t] command [argument...]
DESCRIPTION
This program sets or gets the I/O scheduling class and priority for a
program. If no arguments or just -p is given, ionice will query the current I/O
scheduling class and priority for that process.
When command is given, ionice will run this command with the given
arguments. If no class is specified, then command will be executed with the
"best-effort" scheduling class. The default priority level is 4.
NOTES
Linux supports I/O scheduling priorities and classes since 2.6.13 with the
CFQ I/O scheduler.
util-linux July 2011 IONICE(1)
Even Better
• Just write the metrics to two graphite
servers in your client
• Script to copy / resync "holes" when
restoring.
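Writing to two servers from the client can be as simple as the sketch below, which uses carbon's plaintext line format (`metric value timestamp`). The server names and port are placeholders, and both helper functions are hypothetical.

```python
import socket
import time


def format_line(metric, value, timestamp=None):
    """Build one line in carbon's plaintext format: 'metric value timestamp'."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{metric} {value} {ts}\n".encode("ascii")


def send_to_all(line, servers, timeout=2.0):
    """Send the same line to every server; return the ones that failed."""
    failed = []
    for host, port in servers:
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(line)
        except OSError:
            failed.append((host, port))
    return failed


# Placeholder hosts; 2003 is assumed to be the plaintext listener port.
SERVERS = [("graphite-a.example.com", 2003), ("graphite-b.example.com", 2003)]
line = format_line("server123.myapp.logins.failed", 3)
```

Calling `send_to_all(line, SERVERS)` returns the servers that missed the point, which is where the "holes" to resync later come from.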
WebUI
• Hey, it's a web server
• do all the usual stuff
• Ask for known stats,
• check for 200
• check for valid json output
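The checks above could be scripted roughly like this. The base URL and the target are placeholders, and the sketch assumes the render endpoint returns a JSON list of series, each with a `datapoints` field, when asked for `format=json`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def looks_healthy(status, body):
    """True if the response is a 200 carrying a JSON list of series."""
    if status != 200:
        return False
    try:
        series = json.loads(body)
    except ValueError:
        return False
    return isinstance(series, list) and all(
        isinstance(item, dict) and "datapoints" in item for item in series
    )


def check_render(base_url, target):
    """Ask for a known stat and validate the answer; any failure counts as unhealthy."""
    query = urlencode({"target": target, "from": "-5min", "format": "json"})
    try:
        with urlopen(f"{base_url}/render?{query}", timeout=5) as response:
            return looks_healthy(response.status, response.read())
    except OSError:
        return False
```

For example, `check_render("http://graphite.example.com", "carbon.agents.*.metricsReceived")` would ask for carbon's own statistics, assuming they are enabled.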
MySQL
• If you use SQLite3 -- uhh nothing to
monitor
• If you use MySQL -- use the regular
suspects
• And don't forget to back up!!
Tune Apache
• By default, your Apache install is likely
to be "unlimited" in CPU and Memory
usage.
• Selecting a wildcard metric over a long time
period can easily turn an httpd process
into 1GB. (this seems like a bug actually)
• OOM death.
/events/
• Ad-Hoc Events that don't deserve their
own metric type.
• has tags, time, and text
• Stored in SQLite3 by default by the
webapp.
• Rest UI is primitive
The WebUI
• it's "ok".. good for experiments
• You will want to make your own
dashboard.
• Good news! The API is a URL, so it's
very easy.
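Because a graph is just a URL, a home-grown dashboard mostly needs a URL builder. A minimal sketch, with a placeholder host and a hypothetical helper name:

```python
from urllib.parse import urlencode


def render_url(base_url, targets, params=None):
    """Build a render URL with one 'target' parameter per series."""
    query = [("target", target) for target in targets]
    query += list((params or {}).items())
    return f"{base_url}/render?{urlencode(query)}"


url = render_url(
    "http://graphite.example.com",
    ["server123.myapp.logins.failed"],
    {"from": "-1d", "width": 800, "height": 300},
)
print(url)
```

Drop the result into an `<img>` tag and you have the first panel of your own dashboard.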
WebUI Dashboards
• The WebUI has a dashboard feature for
loading and saving graphs
• It saves data in SQLite3 by default
• Since it's there people will use it
• So hack to remove it, or switch to
MySQL.
Granularity
• Like RRDTool, the resolution of the
graph depends on number of pixels
used. No sub-pixel rendering!
• Rapid spikes can be "averaged away"
in week-long views in small graphs.
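A toy example shows how a spike gets averaged away. This only imitates the effect of squeezing many points into few pixels; it is not Graphite's rendering code, and the numbers are invented.

```python
def consolidate(values, pixels, func=None):
    """Reduce `values` to at most `pixels` buckets (mean by default)."""
    if func is None:
        func = lambda chunk: sum(chunk) / len(chunk)
    per_bucket = -(-len(values) // pixels)  # ceiling division
    return [
        func(values[i:i + per_bucket])
        for i in range(0, len(values), per_bucket)
    ]


week = [1.0] * (7 * 24 * 60)  # one week of 1-minute points
week[5000] = 1000.0           # a single one-minute spike

averaged = consolidate(week, 400)    # a 400-pixel-wide graph
peaks = consolidate(week, 400, max)  # same width, keeping the maximum instead

print(max(week), max(averaged), max(peaks))
```

With averaging, the spike is diluted across its whole bucket; taking the maximum per bucket keeps it visible.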
Really Long URLs
• Making a graph, but the URL is so long
that browsers are clipping it?
• Send query string data as a POST
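One way to do that from a script is sketched below: the same parameters that would go in the query string are form-encoded into the request body. The host is a placeholder and `render_post_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request


def render_post_request(base_url, params):
    """Build a POST request carrying the query-string parameters in the body."""
    body = urlencode(params, doseq=True).encode("ascii")  # lists become repeated keys
    return Request(
        f"{base_url}/render",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = render_post_request(
    "http://graphite.example.com",
    {"target": ["server1.myapp.logins.failed", "server2.myapp.logins.failed"], "from": "-7d"},
)
print(request.get_method(), len(request.data), "bytes in the body")
```

Pass the request to `urllib.request.urlopen` to actually send it.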
Client Side
Rendering
• yeah...
• works ok with a small number of points
• crashes existing browsers with large
number of points
• Server side faster in many cases!
• We'll try again in 2014
Colors and
ChartJunk
• Default color scheme is gross
• Be kind to the handicapped (uhh, me)
http://colorbrewer2.org/
• Good overview here:
http://bit.ly/10Hu7zU
Accelerate with PyPy
• JIT for Python
• ~ 5.9x performance improvement
• Actually works and is stable
• Compatible with twisted and Django
Accelerate with numpy
• numpy provides fast vector
manipulation (C code)
• graphite web gui does a lot of vector
manipulation
• hmmmm.....
Ceres Storage Engine
• "Eventually Fixed Size" storage
• More space efficient == more
performance
• see
http://blog.sweetiq.com/2013/01/using-ceres-as-the-back-end-database-to-graphite/
OpenTSDB
• Not Graphite, but similar in spirit
• Has "collectors" for basic ops stats
• Used by StumbleUpon, Box.net,
Pinterest
• Good: Stores data in HBase/Hadoop
• Bad: Stores data in HBase/Hadoop
Add More Functions
• coarsen (I'm looking at you Ian Malpass,
that's useful for client-side rendering)
• Real vertical lines (our hacks are stupid)
• Better operators (would be nice to know
easily how many metrics you have, e.g.
select count(*))
Mine the Apache Log
• Which stats are used the most?
• What are really slow queries?
• Can you optimize them?
• What time frames are used?
• How much old data do you really need
to store?
it's in the
query string
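A first pass at mining the log could look like this sketch. It assumes the common/combined Apache log format and that graphs are served from a path starting with `/render`; the function name and the sample line are made up.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Pulls the request path out of a common/combined format access log line.
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')


def count_targets(log_lines):
    """Count which targets and which 'from' windows show up in render requests."""
    targets, windows = Counter(), Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        url = urlsplit(match.group(1))
        if not url.path.startswith("/render"):
            continue
        query = parse_qs(url.query)
        targets.update(query.get("target", []))
        windows.update(query.get("from", []))
    return targets, windows


sample = [
    '10.0.0.1 - - [30/Apr/2013:10:00:00 -0500] '
    '"GET /render?target=server123.myapp.logins.failed&from=-7d HTTP/1.1" 200 1234',
]
targets, windows = count_targets(sample)
print(targets.most_common(5), windows.most_common(5))
```

Requests sent as a POST will not show their parameters in the access log, so this only sees GET traffic.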
Add a TinyURL
Feature
• The URLs get really long and are hard
to put into email, etc.
• Build a tinyurl feature into the Django
app and integrate it into the dashboard.
Nick Galbreath
http://www.client9.com/
nickg@client9.com
http://www.iponweb.com/
ngalbreath@iponweb.net
Let's Make
Some Graphs!