This document discusses scaling a social game to support millions of daily users. It summarizes strategies for:
1. Simulating large numbers of users with Tsung, which can simulate meaningful game sessions.
2. Collecting per-request and per-call performance metrics from HAProxy and analyzing them with Python and R.
3. Digging into slow calls by profiling with Erlang's eprof to optimize a Redis query, and benchmarking a new Redis driver, eredis, with Basho Bench.
4. Measuring internal application performance in production using a homegrown system inspired by Basho Bench that generates latency statistics with minimal overhead.
10. Simulating users
• Must not be too synthetic (like apachebench)
• Must look like a meaningful game session
• Users must come online at a given rate and play
11. Tsung
• Multi-protocol (HTTP, XMPP) benchmarking tool
• Able to test non-trivial call sequences
• Can actually simulate a scripted gaming session
http://tsung.erlang-projects.org/
13. Tsung - configuration
• Not something you fancy writing
• We're in development: calls change and we constantly add new calls
• A session might contain hundreds of requests
• All the calls must refer to a consistent game state
14. Tsung - configuration
• From our Ruby test code:
user.resources(:column => 5, :row => 14)
• Same as:
<request subst="true">
  <http url="http://server.wooga.com/users/%%ts_user_server:get_unique_id%%/resources/column/5/row/14?%%_routing_key%%"
        method="POST" contents='{"parameter1":"value1"}'>
  </http>
</request>
15. Tsung - configuration
• Session: a session is a group of requests
• Arrival phase: sessions arrive in phases, each with a specific duration and arrival rate
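Sessions and arrival phases are declared in Tsung's XML configuration. A minimal sketch, following the schema in the Tsung documentation (server, rates, and URLs here are illustrative, not our real config):

```xml
<!-- Load definition: 5 new users per second arrive for 10 minutes -->
<load>
  <arrivalphase phase="1" duration="10" unit="minute">
    <users arrivalrate="5" unit="second"/>
  </arrivalphase>
</load>
<!-- A session: the group of requests each simulated user replays -->
<sessions>
  <session name="game-session" probability="100" type="ts_http">
    <request><http url="/login" method="POST"/></request>
    <request><http url="/resources/column/5/row/14" method="POST"/></request>
  </session>
</sessions>
```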
16. Tsung - setup
[diagram: a tsung master controls tsung workers over ssh; the workers send HTTP reqs to the app servers in the application cluster]
17. Tsung
• Generates ~2500 reqs/sec on an AWS m1.large
• Flexible but hard to extend
• Code base rather obscure
19. Tsung - metrics
• Tsung collects measurements and provides reports
• But these measurements include tsung's own network/CPU congestion
• Tsung machines aren't a good vantage point
20. HAproxy
[diagram: same setup, with haproxy inserted between the tsung workers and the app servers of the application cluster]
21. HAproxy
"The Reliable, High Performance TCP/HTTP Load Balancer"
• Placed in front of HTTP servers
• Load balancing
• Fail over
22. HAproxy - syslog
• Easy to set up
• Efficient (UDP)
• Provides 5 timings per request
23. HAproxy
• Time to receive request from client
24. HAproxy
• Time spent in HAproxy queue
25. HAproxy
• Time to connect to the server
26. HAproxy
• Time to receive response headers from server
27. HAproxy
• Total session duration time
28. HAproxy - syslog
• Application URLs directly identify the server call
• Application URLs are easy to parse
• Processing haproxy syslog gives per-call metrics
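A sketch of that parsing step, assuming HAProxy's default HTTP log format (field layout per the HAProxy docs; the call-name normalization is a hypothetical example, not our exact scheme). The five timings are Tq/Tw/Tc/Tr/Tt: time to receive the request, time in the haproxy queue, time to connect to the server, time to receive response headers, and total session duration:

```python
import re

# Match the "Tq/Tw/Tc/Tr/Tt status ... "METHOD /path" fields of an
# HAProxy HTTP log line (timings may be -1 on errors, hence -?).
LINE_RE = re.compile(
    r'(?P<Tq>-?\d+)/(?P<Tw>-?\d+)/(?P<Tc>-?\d+)/(?P<Tr>-?\d+)/(?P<Tt>\d+) '
    r'(?P<status>\d{3}) .* "(?P<method>\w+) (?P<path>\S+)'
)

def parse_line(line):
    """Return (call, timings) for one log line, or None if it doesn't match."""
    m = LINE_RE.search(line)
    if not m:
        return None
    # Derive a per-call key from the URL by dropping numeric path
    # segments, e.g. /users/42/resources -> users/resources
    segments = m.group("path").split("?")[0].split("/")
    call = "/".join(s for s in segments if s and not s.isdigit())
    timings = {k: int(m.group(k)) for k in ("Tq", "Tw", "Tc", "Tr", "Tt")}
    return call, timings
```

Grouping lines by the derived call name is what turns the raw syslog into the per-call metrics mentioned above.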
30. Reading/aggregating metrics
• Python to parse/normalize syslog
• R language to analyze/visualize data
• R console to interactively explore benchmarking results
31. R is a free software environment for statistical computing and graphics.
32. What you get
• Aggregate performance levels (throughput, latency)
• Detailed performance per call type
• Statistical analysis (outliers, trends, regression, correlation, frequency, standard deviation)
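As a sketch of the aggregation step (shown in Python rather than R, with hypothetical names and data), per-call latency summaries can be computed from the parsed (call, total_time) pairs like this:

```python
import math
import statistics
from collections import defaultdict

def aggregate(samples):
    """samples: iterable of (call, latency_ms) pairs, e.g. parsed from
    the haproxy syslog. Returns per-call summary statistics."""
    by_call = defaultdict(list)
    for call, ms in samples:
        by_call[call].append(ms)
    summary = {}
    for call, times in by_call.items():
        times.sort()
        summary[call] = {
            "count": len(times),
            "mean": statistics.mean(times),
            "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
            # 95th percentile by the nearest-rank method
            "p95": times[min(len(times) - 1,
                             math.ceil(0.95 * len(times)) - 1)],
        }
    return summary
```

In practice the deck's pipeline hands this kind of aggregation to R, which adds outlier detection, regression, and plotting on top.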
35. Digging into the data
• From HAproxy log analysis, one call emerged as exceptionally slow
• Using eprof we were able to determine that most of the time was spent in a Redis query fetching many keys (MGET)
36. Tracing the erldis query
• More than 60% of runtime is spent manipulating the socket
• gen_tcp:recv/2 is the culprit
• But why is it called so many times?
39. A different approach
• Two ways to use gen_tcp: active or passive
• In passive mode, you call gen_tcp:recv to explicitly ask for data, blocking
• In active mode, gen_tcp sends the controlling process a message when there is data
• Hybrid: {active, once}
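By analogy (sketched in Python, since the originals are Erlang gen_tcp calls): the passive style blocks inside the read call, while the active-once style waits for a single readiness notification and then reacts, roughly like receiving one {tcp, Socket, Data} message after setting {active, once}:

```python
import selectors
import socket

def recv_passive(sock, n=4096):
    # Passive style: block in the read call until data arrives,
    # like gen_tcp:recv/2 in Erlang's passive mode.
    return sock.recv(n)

def recv_active_once(sock, n=4096):
    # Active-once style: wait for one readiness event, then read,
    # analogous to {active, once} delivering a single message.
    sel = selectors.DefaultSelector()
    sel.register(sock, selectors.EVENT_READ)
    try:
        if sel.select(timeout=1.0):
            return sock.recv(n)
        return b""
    finally:
        sel.unregister(sock)
        sel.close()

if __name__ == "__main__":
    a, b = socket.socketpair()
    b.sendall(b"ping")
    print(recv_passive(a))       # blocks until data is available
    b.sendall(b"pong")
    print(recv_active_once(a))   # reacts to a readiness event
```

The per-call overhead difference between these two styles is what the proof of concept on the next slide measured.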
40. A different approach
• Are active sockets faster?
• A proof of concept proved active sockets faster
• Change erldis or write a new driver?
41. A different approach
• Radical change => new driver
• Keep erldis's queuing approach
• Think about error handling from the start
• Use active sockets
43. Circuit breaker
• eredis has a simple circuit breaker for when Redis is down/unreachable
• eredis returns immediately to clients if the connection is down
• Reconnecting is done outside request/response handling
• Robust error handling
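A minimal sketch of the idea (in Python, with hypothetical names; eredis implements this in Erlang): fail fast while the connection is down, trip the breaker on transport errors, and let a reconnect loop outside the request path reset it:

```python
class BrokenConnectionError(Exception):
    pass

class CircuitBreakerClient:
    """Fail fast while the backend is down; reconnection happens
    outside the request/response path (e.g. in a background task)."""

    def __init__(self, transport):
        self.transport = transport   # object with send(cmd) -> reply
        self.connected = True

    def request(self, cmd):
        if not self.connected:
            # Return an error immediately instead of blocking callers
            raise BrokenConnectionError("connection is down")
        try:
            return self.transport.send(cmd)
        except OSError:
            self.connected = False   # trip the breaker
            raise BrokenConnectionError("connection lost")

    def on_reconnected(self):
        # Called by the reconnect loop, never by request handlers
        self.connected = True
```

Keeping reconnection out of `request` is what makes clients return immediately instead of piling up behind a dead socket.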
44. Benchmarking eredis
• Redis driver critical for our application
• Must perform well
• Must be stable
• How do we test this?
45. Basho bench
• Basho produces the Riak KV store
• Basho built a tool to test KV servers: Basho bench
• We used Basho bench to test eredis
51. Measure internals
HAproxy's point of view is valid, but how do we measure the internals of our application, while we are live, without the overhead of tracing?
52. Think Basho bench
• Basho bench can benchmark a redis driver
• Redis is very fast, 100K ops/sec
• Basho bench overhead is acceptable
• The code is very simple
53. Cherry-pick ideas from Basho bench
• Creates a histogram of timings on the fly, reducing the number of data points
• Dumps to disk every N seconds
• Allows statistical tools to work on already-aggregated data
• Near real-time: from event to stats in N+5 seconds
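A sketch of that approach (Python, with hypothetical names; Basho bench does this in Erlang): bucket each latency sample as it arrives, so storage grows with the number of distinct buckets rather than the number of events, and flush the counts every N seconds for downstream tools:

```python
import time
from collections import Counter

class LatencyHistogram:
    """Aggregate latency samples into fixed-width buckets on the fly,
    flushing the counts every `interval` seconds."""

    def __init__(self, bucket_ms=1, interval=10):
        self.bucket_ms = bucket_ms
        self.interval = interval
        self.counts = Counter()
        self.last_flush = time.monotonic()

    def record(self, latency_ms):
        # One counter increment per event: the raw sample is discarded
        self.counts[int(latency_ms // self.bucket_ms)] += 1
        if time.monotonic() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        snapshot, self.counts = dict(self.counts), Counter()
        self.last_flush = time.monotonic()
        # In the real system the snapshot would be dumped to disk
        # for R to analyze a few seconds later
        return snapshot
```

Because the snapshot is already aggregated, the statistical tooling never has to touch per-event data, which is what keeps the overhead acceptable in production.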
54. Homegrown stats
• Measures latency from the edges of our system (excluding HTTP handling)
• And at interesting points inside the system
• Statistical analysis using R
• Correlated with HAproxy data
• Produces graphs and data specific to our application
56. Recap
Measure:
• From an external point of view (HAproxy)
• At the edge of the system (excluding HTTP handling)
• Internals of a single process (eprof)
57. Recap
Analyze:
• Aggregated measures
• Statistical properties of measures
• standard deviation
• distribution
• trends