Riot Games uses Hadoop to analyze over 630 million minutes of gameplay data per day from League of Legends. They built a Hadoop cluster with Hive as the data warehouse to store massive tables with over 50 billion rows. This allows business analysts to run queries and gain insights into game balance, matchmaking effectiveness, and regional metrics. Riot transforms some data into cubes for direct consumption in Tableau since Hive is best for massive jobs and MySQL is not optimized for joining large tables. They have learned the importance of experimenting with different ETL tools and leveraging open source UDFs and libraries to enable new capabilities like region-based analysis.
4. THIS PRESENTATION IS ABOUT…
INTRO
• The history of Riot’s data warehouse
• Why we incorporated Hadoop
• Our high level architecture
• Usecases Hadoop has enabled
• Lessons learned
• Where we’re headed
5. WHO?
• Developer and publisher of League of Legends
• Founded 2006 by gamers for gamers
• Player experience focused – requires data
6.
• 4.2 MILLION DAILY
• 32.5 MILLION REGISTERED
• 1.3 MILLION CONCURRENT
• 11.5 MILLION MONTHLY
8. MEET ANDY HO
HISTORY
“With enough data, even simple questions become difficult questions”
9. SCRAPPY START-UP PHASE
• One initial beta environment for North America
• Queries done directly off production MySQL slaves
• This is obviously not a good practice
10. AROUND OUR INITIAL LAUNCH
• Moved to a dedicated, single MySQL instance for the DW
• Data ETL’d from production slaves into this instance (by Andy)
• Queries run in MySQL (by Andy)
• Reporting was done in Excel (by Andy)
This worked great!
11. THEN WE STARTED GROWING
• Resources were focused elsewhere
– We had competition
– Focused on producing features and scaling our systems
• Opened EU environment June 2010
• Needed something speedy – created parallel installation
– This was bad
– But we could still get the answers we wanted
12. AND THEN – CRAZY GROWTH!
[Chart: TOTAL ACTIVE PLAYERS (# unique logins) over time; labeled points: 1.5MM in JULY 2011 and 4.2M in NOV. 2011]
13. THE BREAKING POINT
• NA Data Warehouse reached a breaking point 9 months ago
– 24 hours of data took 24.5 hours to ETL
• We couldn’t handle…
– multiple environments in a vertical MySQL instance
– a single environment in a vertical MySQL instance
• We needed to change!
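To see why a 24.5-hour load for 24 hours of data is a breaking point, here is a minimal sketch of the arithmetic. It assumes loads run back to back and never catch up; the function name and parameters are illustrative, not from the talk.

```python
def etl_lag_hours(days, data_hours_per_day=24.0, etl_hours_per_day=24.5):
    """Hours the warehouse trails production after `days` consecutive daily loads.

    Assumption: each load starts when the previous one ends, so any
    shortfall accumulates day after day.
    """
    shortfall = max(0.0, etl_hours_per_day - data_hours_per_day)
    return days * shortfall


for days in (1, 7, 30):
    print(f"after {days:2d} days the warehouse is {etl_lag_hours(days)} hours behind")
```

Under these assumptions the lag only grows, which is why tuning a single vertical MySQL instance could not be a long-term fix.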
15. WHY HADOOP?
SOLUTION
• COST EFFECTIVE
– Expanding rapidly, so CAPEX was a concern
• SCALABLE
– Handles massive data sets and diverse data sets (both structured and unstructured)
• OPEN SOURCE
– Our engineers can dive into problems
• SPEED OF EXECUTION
– We needed to move fast!
16. HIGH LEVEL ARCHITECTURE – CURRENT
[Diagram: Audit Plat + LoL data from the NORTH AMERICA, EUROPE and KOREA environments → Pentaho + Custom ETL + Sqoop → Hive Data Warehouse → Pentaho → MySQL → Tableau → Business Analyst; Analysts also work against the Hive Data Warehouse]
18. WHAT MAKES UP OUR ETL
Pentaho + Custom ETL + Sqoop
• All of these orchestrated by Pentaho
• We use Sqoop for staging data only
• Then dynamically partition data into Hive tables
20. WHAT MAKES UP OUR ETL
[Diagram: Data → Temp Staging Area, inside the Hive Data Warehouse]
(1) Data written into temp staging area
• Prevents analysts from running queries out of partially written tables
• Helps us leverage Hive’s merging and compression settings
21. WHAT MAKES UP OUR ETL
[Diagram: Data → Temp Staging Area → Partitions A to E, inside the Hive Data Warehouse]
(2) Hive dynamically inserts data into appropriate partitions
• According to value generated for partition key in the target table
• Non-existent partitions will be created by Hive
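The dynamic insert described on this slide can be mimicked in a few lines of Python. This is only a sketch of the idea, not Hive's implementation: the warehouse is modeled as a dict of partitions, and the `dt` partition key and `player_id` column are hypothetical names.

```python
def dynamic_partition_insert(warehouse, staged_rows, partition_key):
    """Route each staged row to the partition named by its partition-key value.

    Partitions that do not exist yet are created on first use. Returns the
    list of partitions created by this insert.
    """
    created = []
    for row in staged_rows:
        value = row[partition_key]
        if value not in warehouse:
            warehouse[value] = []
            created.append(value)
        # The partition value is carried by the partition itself, so it is
        # dropped from the stored row.
        warehouse[value].append({k: v for k, v in row.items() if k != partition_key})
    return created


# Hypothetical data: one existing partition, two staged rows.
warehouse = {"2012-06-01": [{"player_id": 1}]}
staged = [
    {"player_id": 2, "dt": "2012-06-01"},
    {"player_id": 3, "dt": "2012-06-02"},
]
created = dynamic_partition_insert(warehouse, staged, "dt")
print(created)
```

The caller never names a target partition; the value in each row decides where it lands.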
22. WHAT MAKES UP OUR ETL
[Diagram: Data → Temp Staging Area → Partitions A to E, each split into sub-partitions (A1 to A3, B1 to B3, C1 to C3, D1 to D3, E1 to E3), inside the Hive Data Warehouse]
(3) Layered partitioning = very helpful for region-based partitioning
• Helps maintain one table definition across regions
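A small sketch of layered partitioning, assuming a two-level layout with a hypothetical `region` key above a hypothetical `dt` key. The `key=value/key=value` path form mirrors how Hive names partition directories; the rest is an illustration, not the speakers' schema.

```python
def partition_path(row, partition_keys):
    """Build a nested partition path such as 'region=NA/dt=2012-06-01'."""
    return "/".join(f"{key}={row[key]}" for key in partition_keys)


def load(rows, partition_keys):
    """Group rows of ONE logical table by their layered partition path."""
    table = {}
    for row in rows:
        table.setdefault(partition_path(row, partition_keys), []).append(row)
    return table


def prune(table, **wanted):
    """Keep only partitions whose path matches every key=value filter."""
    selected = {}
    for path, rows in table.items():
        parts = dict(segment.split("=", 1) for segment in path.split("/"))
        if all(parts.get(key) == value for key, value in wanted.items()):
            selected[path] = rows
    return selected


rows = [
    {"region": "NA", "dt": "2012-06-01", "player_id": 1},
    {"region": "EU", "dt": "2012-06-01", "player_id": 2},
    {"region": "EU", "dt": "2012-06-02", "player_id": 3},
]
table = load(rows, ["region", "dt"])
eu_only = prune(table, region="EU")
print(sorted(eu_only))
```

All regions share one table definition, and a region filter touches only that region's partitions.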
23. WHAT MAKES UP OUR ETL
TO OPTIMIZE DISK IO FOR USER QUERIES, WE ENABLED COMPRESSION
24. WHY COMPRESSION?
• We have 24 cores and disk IO is always the bottleneck, so compression is essential
WHY SNAPPY COMPRESSED SEQUENCEFILE BLOCKS?
• Lots of “why Snappy” discussion on the interwebs already
• SequenceFile can be split by Hadoop and can run multiple maps in parallel
• Block compression yields better compression ratio while keeping the file splittable; this block size is configurable
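The block-versus-record trade-off can be illustrated with a rough sketch. It uses `zlib` from the Python standard library as a stand-in for Snappy, and plain lists as a stand-in for SequenceFile, so the absolute numbers are not representative; the record layout is made up.

```python
import zlib


def compress_per_record(records):
    """Record-style compression: every record is compressed on its own."""
    return [zlib.compress(record) for record in records]


def compress_per_block(records, records_per_block):
    """Block-style compression: records are batched, each batch compressed as a unit.

    Each block can still be decompressed independently, which is what keeps
    the file splittable across parallel map tasks.
    """
    blocks = []
    for start in range(0, len(records), records_per_block):
        batch = b"".join(records[start:start + records_per_block])
        blocks.append(zlib.compress(batch))
    return blocks


# Hypothetical, highly repetitive game records.
records = [
    f"game_id={i},region=NA,champion=Shen,result=win\n".encode()
    for i in range(10_000)
]
record_bytes = sum(len(chunk) for chunk in compress_per_record(records))
blocks = compress_per_block(records, records_per_block=1_000)
block_bytes = sum(len(block) for block in blocks)
third_block = zlib.decompress(blocks[2])  # readable without touching other blocks

print(f"per-record: {record_bytes} bytes, per-block: {block_bytes} bytes")
```

Larger blocks give the compressor more repetition to exploit, at the cost of coarser split boundaries, which is the tuning knob the slide refers to as the configurable block size.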
25. WHAT WE DO IN HIVE
We ETL data from OLTP MySQL slaves daily
26. WHAT WE DO IN HIVE
Our analysts shoot Hive queries every day
Translating to 1000s of MR jobs daily
27. WHAT WE DO IN HIVE
We have some pretty large tables: e.g., one with 50,795,997,734 rows
We use metrics derived from Hive queries to improve our matchmaking system and player behavior
28. WHAT DID WE LEARN FROM ETL?
• If you use custom ETL, keep an eye out for block distribution
• DRY: Re-inventing the wheel is not a good idea
– Invest time in researching proper tools that suit your needs
– Tons of options for ETL and workflow management
– A particular ETL or workflow management tool working for company X does not mean it will work effectively for you
30. WHY TABLEAU?
• We needed to democratize access for non-technical folks
– Design
– Execs
– Player Support
• Great visualization capability
• Easy to work with
• Has a Hive connector*
31. LEAGUE OF LEGENDS GAMEPLAY BASICS
35. WAIT, SO WHAT’S A YORDLE?
• Yordles = very cute race of champions in League of Legends
• We track Yordles (and the rest of our champions) because game balance is exceptionally important
36. DESIGN BALANCE IS IMPORTANT
USECASE #1
• Highly competitive game
• Updated every 2-3 weeks
– New champions
– New items
• Game is a living, breathing service that’s always in motion
• Have to maintain a level playing field
37. QUICKLY REACTING TO CHANGES
[Chart: total plays over time, with wins marked]
38. HOW DID WE CREATE THAT?
*All logos are trademarks of respective owners
40. WHY NOT JUST HIVE?
HIVE IS FOR MASSIVE JOBS
41. HIVE TO MYSQL TRANSFORMATION
• Many of our stakeholders use Tableau
• Transformed required data into cubes for direct Tableau consumption using Pentaho
• Initially experimented with Hive-to-Tableau connector
– Had issues, e.g., triggering MR jobs for every change and non-persistent Hive-Server
42. WE WANTED TO KNOW MORE ABOUT…
• Which champions and skins are popular across all regions?
• What are the win-rates of champions across all regions?
• Are better players choosing different champions?
43. WE CREATED CUBES OF AGGREGATED DATA
[Chart: win rates by champion]
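As a rough illustration of what a cube of aggregated win rates contains, the sketch below pre-computes win rates for every (region, champion) pair plus the roll-ups across all regions and all champions. The field names and sample games are hypothetical; the actual cubes were built with Pentaho into MySQL, not with this code.

```python
from collections import defaultdict

ALL = "*"  # roll-up marker meaning "all regions" or "all champions"


def build_win_rate_cube(games):
    """Aggregate games into win rates per (region, champion), including roll-ups."""
    totals = defaultdict(lambda: [0, 0])  # (region, champion) -> [wins, plays]
    for game in games:
        for region in (game["region"], ALL):
            for champion in (game["champion"], ALL):
                cell = totals[(region, champion)]
                cell[0] += 1 if game["won"] else 0
                cell[1] += 1
    return {key: wins / plays for key, (wins, plays) in totals.items()}


# Hypothetical sample of per-game results.
games = [
    {"region": "NA", "champion": "Shen", "won": True},
    {"region": "NA", "champion": "Shen", "won": False},
    {"region": "EU", "champion": "Shen", "won": True},
    {"region": "EU", "champion": "Teemo", "won": False},
]
cube = build_win_rate_cube(games)
print(cube[("NA", "Shen")], cube[(ALL, "Shen")])
```

Because every cell is pre-aggregated, a dashboard can answer "win rate of Shen across all regions" with a single lookup instead of a scan over the raw game table.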
44. HOW WE DID IT: TRANSFORMATION++
Hive → MySQL → TABLEAU
• Massive tables reside in Hive
• Hive → MySQL: transformation creates dimension tables
• MySQL → TABLEAU: transformed into cubes for Tableau consumption
• Some dimension tables moved to join with other fact tables in Hive
45. WHY DID WE GO THIS ROUTE?
Hive: not good for slowly changing dimensions
• No automatic primary key generation
• Can’t regenerate dimension table quickly enough since it requires a full-table scan
MySQL: not awesome for joining massive tables
• Decided to use best of both worlds
• Also leveraged map-side joins and distributed cache
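The map-side join mentioned in the last bullet can be sketched as follows. The idea is that a small dimension table is shipped to every mapper (the role of the distributed cache) and held in memory, so the large fact table is joined while it streams through the map phase. Table and column names here are hypothetical.

```python
def load_dimension(rows, key):
    """Build the in-memory lookup that each mapper would load from the cache."""
    return {row[key]: row for row in rows}


def map_side_join(fact_rows, dimension, key):
    """Stream fact rows and attach dimension attributes (inner join).

    No shuffle or reduce step is needed because the dimension fits in memory.
    """
    for fact in fact_rows:
        match = dimension.get(fact[key])
        if match is not None:
            yield {**match, **fact}


# Hypothetical small dimension table and a slice of a large fact table.
champions = [
    {"champion_id": 1, "name": "Shen"},
    {"champion_id": 2, "name": "Teemo"},
]
games = [
    {"game_id": 10, "champion_id": 1, "won": True},
    {"game_id": 11, "champion_id": 2, "won": False},
    {"game_id": 12, "champion_id": 99, "won": True},  # no matching dimension row
]
joined = list(map_side_join(games, load_dimension(champions, "champion_id"), "champion_id"))
print(joined)
```

This only works when one side of the join is small enough to hold in memory on every mapper, which is typically true of dimension tables.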
47. FIRST, SOME CONTEXT
USECASE #2
• League of Legends is global in scale, with players logging in from >145 countries in a typical day
• No-fee play means very low barrier to play
• Players often play on multiple environments regularly (e.g. EU players on NA environments and vice versa)
• Same features and mechanics deployed in all territories
• It’s vitally important that we understand game performance metrics by geography and region
48. MATCHMAKING
• One of the most important features outside of gameplay
• Like a dating service, the objective is to match people up
• Number of different queues that players can line up in, depending on the type of match they’re looking for
Critical that this system is balanced and able to create good matches quickly
49. MATCHMAKING – IS IT WORKING?
• Matchmaking algorithm based on modified Elo system
• Inspecting the “curve” of these scores:
– Should show a similar distribution in all regions
– May show interesting trends, such as win/lose ratios
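For readers unfamiliar with Elo, here is a minimal sketch of the standard (unmodified) rating formulas. The slide says only that the matchmaking algorithm is a modified Elo system; those modifications are not described, so this shows the textbook version with an assumed K-factor of 32.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of player A against player B (standard Elo)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating, expected, actual, k=32):
    """New rating after one game; actual is 1.0 for a win, 0.0 for a loss.

    k=32 is an assumed K-factor, not a value from the presentation.
    """
    return rating + k * (actual - expected)


even_match = expected_score(1500, 1500)
new_rating = update_rating(1500, even_match, actual=1.0)
print(even_match, new_rating)
```

Because a well-matched game has an expected score near 0.5, a healthy system should produce broadly similar rating curves in every region, which is the check the slide describes.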
50. MATCHMAKING – IS IT WORKING?
[Chart: ELO DISTRIBUTION GRAPH, % players vs. ELO score]
51. WHAT WAS NEEDED TO GENERATE IT?
(1) Had to join massive tables with session and player data
[Diagram: MASSIVE TABLE WITH SESSION DATA, MASSIVE TABLE WITH PLAYER DATA, MASSIVE TABLE WITH GAME DATA]
(2) Needed to lookup and range-query IP-addresses in same join
Required for many region-based metrics
52. LIMITATIONS OF HIVE
• No good indexing mechanism in our version
• Not efficient for lookup and range queries
This made region-based queries computationally difficult
53. SOLUTION
• In Hive, leveraged open-source libraries available online: GeoIP UDFs
• UDFs = user-defined functions that one can add to the Hive interpreter
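The core of a GeoIP lookup is a range query: find the address range that contains a given IP. The sketch below shows one common way to do that with a sorted range table and binary search. It is not the UDF library the speakers used, and the range table maps reserved documentation addresses to made-up regions purely for illustration.

```python
import bisect
import ipaddress

# Hypothetical range table: (first address, last address, region).
RANGES = [
    ("192.0.2.0", "192.0.2.255", "NA"),
    ("198.51.100.0", "198.51.100.255", "EU"),
    ("203.0.113.0", "203.0.113.255", "KR"),
]


def build_index(ranges):
    """Convert ranges to integers and sort them so they can be binary-searched."""
    rows = sorted(
        (int(ipaddress.ip_address(first)), int(ipaddress.ip_address(last)), region)
        for first, last, region in ranges
    )
    starts = [row[0] for row in rows]
    return starts, rows


def lookup_region(ip, index):
    """Return the region whose range contains `ip`, or None if no range matches."""
    starts, rows = index
    value = int(ipaddress.ip_address(ip))
    pos = bisect.bisect_right(starts, value) - 1  # last range starting at or before ip
    if pos < 0:
        return None
    first, last, region = rows[pos]
    return region if first <= value <= last else None


index = build_index(RANGES)
print(lookup_region("198.51.100.7", index))
```

Wrapping a lookup like this in a function that is called once per row avoids expressing the range condition as a join, which is the part Hive handled poorly.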
61. OUR IMMEDIATE GOALS
THE FUTURE
• Shorten time to insight
• Increase depth of insight
• Enable data analysis for client-side features
• Log ingestion and analysis
• Flexible auditing framework
• International data infrastructure
62. CHALLENGE: MAKE IT GLOBAL
• Data centers across the globe since latency has huge effect on gameplay, so log data is scattered around the world
• Large presence in Asia -- some areas (e.g., PH) have bandwidth challenges or bandwidth is expensive
63. CHALLENGE: WE HAVE BIG DATA
• STRUCTURED DATA: 500G DAILY
• APPLICATION AND OPERATIONAL LOGS: 4.5TB DAILY
• OFFICIAL LOL SITE TRAFFIC: 6MM HITS DAILY
• RIOT YOUTUBE CHANNEL: 1.7MM SUBSCRIBERS, 270+MM VIEWS
+ chat logs
+ detailed gameplay event tracking
+ so on….
64. OUR AUDACIOUS GOALS
Build a world-class data and analytics organization
• Deeply understand players across the globe
• Apply that understanding to improve games for players
• Deeply understand our entire ecosystem, including social media
Have ability to identify, understand and react to meaningful trends in real time
Have deep, real-time understanding of our systems from player experience and operational standpoints
65. SHAMELESS HIRING PLUG
• Like most everybody else at this conference… we’re hiring!
• The Riot Manifesto
– Player experience first
– Challenge convention
– Focus on talent and team
– Take play seriously
– Stay hungry, stay humble
66. SHAMELESS HIRING PLUG
And yes, you can play games at work. It’s encouraged!
Andy was a designer and analyst who began our data warehouse. He was the only resource focused on building out the DW and our analytical capability for the first year of its existence. He’s also a really nice guy! He made an excellent point, and one that I want to carry through this presentation.
There were times with 20% month-over-month growth in a single environment. We went from 2 environments with ~200K CCU to 16 environments and 1.3 million CCU in the space of 12 months. Resources were focused on getting our operational systems to scale along with demand.
One table has hand-entered values and lives only in MySQL. Hive cannot generate primary keys out of the box, so we need to associate facts with dimensions in further steps. For instance, we introduce new champions and skins (MySQL); the Elo range expands, game types change, etc. (in Hive).
Before we talk about our first usecase, we need to give you a little bit of context about the game and gameplay (super high level). Session-based team play: the basic idea is like “capture the flag” – MOBA! If you die, you re-spawn after a certain amount of time (that time grows as the game progresses). Lots of strategy to the game.
Each player “summons” a Champion that he plays. Each champion has very different abilities.
All players begin at level 1 in a gameplay session and can progress to a maximum of level 18. Gain abilities. Gain gold and use that gold to equip your player.
Shen is not a Yordle. Shen is a ninja
Early this year, Shen was underpowered. We decided to fix him. However, we accidentally made him highly overpowered. We recognized this fact quickly, and a fix was in place within 2 days.
For international player populations on the North American environment