While the worlds of ecommerce, search, and application platforms might seem as far from the gaming industry as one might imagine, lessons learned in those environments are surprisingly applicable to online games. Real-time games in particular face many of the same challenges faced -- and solved -- by companies like eBay and Google. They are extremely latency-sensitive, are subject to unpredictable growth and scalability curves, and exhibit extremely spiky load profiles. The real-time player experience is critical to the success of the company -- if a game is down or slow, players will leave and never come back. This session will discuss how experiences with large-scale websites like eBay and Google have informed our approach to building, testing, and operating real-time games at KIXEYE.
This session tells several war stories from eBay and Google about performance, consistency, iterative development, and autoscaling, and then connects those experiences to what we are now doing in our next-generation gaming platform at KIXEYE.
Everything I Learned About Scaling Online Games I Learned at Google and eBay [Part 1, QConSF 2013]
1. Everything I Learned About Scaling Online Games I Learned at Google and eBay
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
2. Background
CTO at KIXEYE
• Making awesome games awesomer (and scalabler and reliabler)
Director of Engineering for Google App Engine
• World’s largest Platform-as-a-Service
Chief Engineer at eBay
• Multiple generations of eBay’s real-time search infrastructure
3. Engineering “Fun”
Whole user / player experience
• Think holistically about the full end-to-end experience of the user
• UX, functionality, performance, bugs, etc.
All useful metrics are *proxies* for fun
• Performance: load time, frame rate, lag
• Technology: latency, availability
• Business: acquisition, retention, monetization
4. Real-Time Strategy Games are …
Real-time
Spiky
Diverse
Constantly evolving
Constantly pushing boundaries
Technically and operationally demanding
5. Know Your Requirements
Less is more
• More wood, fewer arrows
• Solve 100% of one problem rather than 50% of two
• Release one great feature instead of two iffy ones
Understand the requirements
• e.g., Battle replay
• Ephemeral combat
• Immutable recording
• Manageable storage footprint
6. Know Your Bottlenecks
Log everything
Monitor relentlessly
Measure bottlenecks and attack the first
• “When you solve problem one, problem two gets a promotion”
• Theory of Constraints: attacking *any* other problem yields no improvement
Accept that your intuition is WRONG (!)
7. Know Your Distributions
“Normal” distribution is *not* normal
• Only works for quantities physically constrained on both sides, clustered around a mean
• E.g., adult height or weight
Leads to invalid analysis and conclusions
• Removing outliers
• Ignoring real problems
• Your (trained) intuition is WRONG (!)
8. Know Your Distributions
Exponential (“Long Tail”) distribution *much* more common
• Income, latency, human connections, etc.
• Also easy to reason about – only single parameter
Percentiles are your best friends (!)
• Reasonably characterize any distribution
• Measure 90%ile, 99%ile, 99.9%ile
• Focus on the real problems
• Mean and Standard Deviation are useless
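The point about percentiles on slides 7 and 8 can be illustrated with a small sketch. The latency values below are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank definition; neither comes from the talk.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at
    least pct% of all samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds (assumed data):
# most requests are fast, a few are slow, one is very slow.
latencies_ms = [10] * 90 + [50] * 9 + [5000]

mean_ms = statistics.mean(latencies_ms)   # 63.5, a value no request actually had
p50 = percentile(latencies_ms, 50)        # 10
p90 = percentile(latencies_ms, 90)        # 10
p99 = percentile(latencies_ms, 99)        # 50
p999 = percentile(latencies_ms, 99.9)     # 5000
```

In this made-up data set the mean suggests a typical request takes about 64 ms, while the percentiles show that most requests take 10 ms and that the real problem sits in the last 0.1%.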
9. Layering and Responsibility
Multiple layers
• Client
• Game server
• Services
• Persistence
Clarify roles and responsibilities
• Client- vs. server-authoritative
• Google service layering (+)
10. Distribution of Data / Work
Load-balancing (for stateless work)
• Web servers, proxies
• Most services
Sharding (for stateful work)
• Combat servers
• Matchmaking
• Leaderboards
• Databases
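The distinction on slide 10 can be sketched as follows: stateless work can go to any server, while stateful work must always reach the server that owns the state. This is a minimal illustration, not KIXEYE's implementation; `RoundRobinBalancer`, `shard_for`, and the server names are hypothetical.

```python
import hashlib
import itertools


class RoundRobinBalancer:
    """Stateless work: any server will do, so simply rotate through them."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


def shard_for(key, shards):
    """Stateful work: the same key must always map to the same shard.

    Uses a stable hash (not Python's per-process randomized hash())
    so the mapping is identical across processes and restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# Hypothetical server names, for illustration only.
web_servers = ["web-1", "web-2", "web-3"]
db_shards = ["db-0", "db-1", "db-2", "db-3"]

balancer = RoundRobinBalancer(web_servers)
first_three = [balancer.pick() for _ in range(3)]   # each web server once
player_shard = shard_for("player:12345", db_shards)  # stable for this key
```

One caveat of this simple modulo scheme is that changing the number of shards remaps most keys; consistent hashing or an explicit lookup table are common alternatives when shards are added or removed often.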
12. Component Isolation
Combat server for TOME
• Highly “twitchy” real-time MOBA combat
• Very latency-sensitive
Real-time interactions isolated to a single, ephemeral component
• No coordination with any central service
Highly dynamic load distribution
• Router assigns battle to least-loaded server
• Requires latency-fairness between players
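The "router assigns battle to least-loaded server" idea might look roughly like the sketch below. It assumes load is simply the number of active battles per server, which the slides do not specify, and it ignores the latency-fairness requirement entirely. `BattleRouter` and the server names are hypothetical.

```python
class BattleRouter:
    """Assigns each new battle to the combat server with the fewest
    active battles (an assumed load metric, for illustration only)."""

    def __init__(self, servers):
        self.load = {server: 0 for server in servers}
        self.assignment = {}  # battle_id -> server

    def start_battle(self, battle_id):
        # min() returns the first server in insertion order on ties.
        server = min(self.load, key=self.load.get)
        self.load[server] += 1
        self.assignment[battle_id] = server
        return server

    def end_battle(self, battle_id):
        # Battles are ephemeral: when one ends, its slot is freed.
        server = self.assignment.pop(battle_id)
        self.load[server] -= 1


router = BattleRouter(["combat-a", "combat-b"])
s1 = router.start_battle("battle-1")  # combat-a (both idle, first wins the tie)
s2 = router.start_battle("battle-2")  # combat-b
s3 = router.start_battle("battle-3")  # combat-a
router.end_battle("battle-1")
s4 = router.start_battle("battle-4")  # combat-a again (tie at one battle each)
```

Because each battle lives entirely on one server, the router only needs to be consulted when a battle starts and ends, which fits the "no coordination with any central service" point during play.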
13. Asynchrony: Do Work Up Front
Custom asset pipeline
• Spriting, compression, etc
Pre-render “movies” instead of real-time particle effects
Tons of caching
14. Asynchrony: Client Liveness
Client continues seamlessly if disconnected
• Gameplay more important than immediate synchronization
Event loop for rendering
• Keep up with the frame rate (!)
Default to background processing
• Refresh assets
• Save client state
15. Asynchrony: Reactive Server
Minimize request latency
• Respond as rapidly as possible to client
• Queue events / messages for complex work
• Service interactions via reliable events
Functional Reactive programming
• Heavy use of Scala and Akka
• Never block (!)
• eBay, Google programming models (-)
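The "respond as rapidly as possible, queue events for complex work" pattern can be sketched in a few lines. The talk names Scala and Akka; the sketch below swaps in Python's `asyncio` purely for illustration, and `handle_request`, `worker`, and the event shape are hypothetical.

```python
import asyncio


def handle_request(queue, request):
    """Acknowledge immediately; defer the expensive part to the queue."""
    queue.put_nowait(request)  # never blocks on an unbounded asyncio.Queue
    return {"status": "accepted", "id": request["id"]}


async def worker(queue, processed):
    """Background consumer that performs the complex work off the request path."""
    while True:
        event = await queue.get()
        await asyncio.sleep(0)  # stand-in for real, slower processing
        processed.append(event["id"])
        queue.task_done()


async def main():
    queue = asyncio.Queue()
    processed = []
    task = asyncio.create_task(worker(queue, processed))

    # The client gets its acknowledgements before any work has been done.
    acks = [handle_request(queue, {"id": i}) for i in range(3)]

    await queue.join()  # wait until every queued event has been handled
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return acks, processed
```

Running `asyncio.run(main())` returns the three acknowledgements together with the ids the worker processed afterwards. An in-memory queue like this one is not a reliable event channel; a durable message system would be needed for the "reliable events" bullet.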
16. Small, Independent Teams
Studio System
• Full-stack, independent game teams
• Near-complete autonomy on technology choices, development processes
Vendor-customer discipline
• Google service teams (+)
Reduces contention and coherence
17. Hire and Retain Top People
Hire ‘A’ Players
• Difference between top and bottom performers is not 1.5x; it’s 10x (!)
• (+) Google hiring process
Virtuous Cycle
• A players bring A players
• B players bring C players
• Constantly raise the bar
Reduces contention and coherence
18. Play to People’s Strengths
People are not cogs, not fungible
• (-) eBay “Train seats”
• Destroyed incentives, personal pride, long-term ownership
Align work with skills and passion
• Symphony instead of Factory (!)
• Skills in Flash, Scala, etc.
• Build customizability for target developer, not builder (DSL >> code)
19. Small Details Matter
In the very large, the very small matters a *lot*
• Subatomic physics and cosmology
• eBay and variable-byte encoding (+)
• GAE and memcache slab memory allocation (+)
Discipline is *which* details matter
• Combat server and memory contention
• 40% improvement from six characters …
• “const ”
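For readers unfamiliar with the variable-byte encoding mentioned on slide 19, here is a minimal sketch of one common convention: 7 data bits per byte, low-order group first, with the high bit set on every byte except the last. The slides do not say which variant eBay used, so this is an illustration of the general technique only; the function names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 bits per byte.
    The high bit of each byte signals that more bytes follow."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # final byte, high bit clear
            return bytes(out)


def decode_varint(data, offset=0):
    """Decode one integer starting at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


small = encode_varint(127)   # b'\x7f'     (1 byte)
medium = encode_varint(300)  # b'\xac\x02' (2 bytes)
value, next_offset = decode_varint(medium)  # (300, 2)
```

The saving comes from small values being far more common than large ones: a value under 128 takes one byte instead of a fixed four or eight, which adds up quickly across a very large index.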