How we at Plumbee collect and process data at scale, and how we use that data to send relevant mobile push notifications that keep our players engaged.
Presented as part of a Tech Talk: http://engineering.plumbee.com/blog/2014/11/07/tech-talk-push-notifications-big-data/
Transforming Mobile Push Notifications with Big Data
1. Transforming Mobile Push Notifications with Big Data
Dennis Waldron, Data Engineering
Pablo Varela, Systems Engineering
2. Who is Plumbee?
● 12.8M Installs
● 209K Daily Active Users
● 818K Monthly Active Users
● Social Games Studio
● Mirrorball Slots & Bingo
● Facebook Canvas, iOS
3. Data Providers
In-house data = 99.9% of all data
In Total:
● 98TB (907 days of data)
● All stored in Amazon S3
Daily:
● 78GB compressed
● ~450M events/day
● 4,800 events/second (peak)
5. Game Data
[Diagram: end users (desktop & mobile) → application/game servers running on Amazon Web Services]
● Collect everything!
● RPC events intercepted by annotated endpoints (Requests)
● All mutating state changes recorded: DynamoDB, MySQL, Memcache (Blob Updates)
● Custom telemetry (Other):
○ Client: click tracking, loading time statistics, GPU data...
○ Server: promotions, transactions, Facebook user data...
[Chart: game data generated by source: RPC 77%, blob updates (DynamoDB, MySQL, Memcache) 9%, other 15%]
6. Game Data - Example RPC Endpoint Annotation
/**
 * Example annotation
 */
@SQSRequestLog(requestMessage = SpinRequest.class)
@RequestMapping("/spin")
public SpinResponse spin(SpinRequest spinRequest) {
    ...
}
7. Example Event - userStats
● All events are recorded in JSON.
● Structure:
○ Headers
○ Categorization Data (metadata)
○ Payload (message)
● Important Headers:
○ timestamp
○ testVariant
○ plumbeeUid
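A minimal sketch of what one of these JSON events might look like (only the header names come from this deck; the nesting, metadata fields, and values are illustrative assumptions):

{
  "headers": {
    "timestamp": 1415836800000,
    "testVariant": "control",
    "plumbeeUid": 466264
  },
  "metadata": {
    "type": "rpc",
    "subType": "rpc-spin"
  },
  "message": {
    "betAmount": 500
  }
}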
9. Data Collection (I) - PUT
[Diagram: application/game servers (producers) → events (JSON) → SQS queue → log aggregators (consumers)]
What is SQS (Simple Queue Service)?
A cloud-based message queue for transmitting
messages between producers and consumers
SQS Provides:
● ACK/FAIL semantics
● Unlimited number of messages
● Scales transparently
● Buffer zone
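On the producer side, pushing an event onto the queue is a single SDK call. A minimal sketch using the AWS SDK for Java (the class name, queue URL, and wiring are our assumptions; the deck does not show producer code):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class EventProducer {
    // Hypothetical queue URL, for illustration only
    private static final String QUEUE_URL =
        "https://sqs.us-east-1.amazonaws.com/123456789012/eventlog";

    private final AmazonSQS sqs = new AmazonSQSClient();

    public void send(String eventJson) {
        // SQS ACKs the send; a failure throws and can simply be retried
        sqs.sendMessage(new SendMessageRequest(QUEUE_URL, eventJson));
    }
}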
10. Data Collection (II) - GET
[Diagram: SQS queue → Apache Flume (consumers) → Amazon S3 (Simple Storage Service)]
What is Apache Flume?
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
S3 Data:
● Partitioned by: date / type / sub_type
● Compressed with: Snappy
● Aggregated in 512MB chunks
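Under that partitioning scheme, a day's spin events would land under a key such as the following (bucket and file names are hypothetical):

s3://plumbee-eventlog/2014-11-18/rpc/rpc-spin/events-00001.snappy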
11. Data Collection (III) - Flume
[Diagram: SQS queue → Flume agent (custom source → file-based channel → HDFS sink) → S3 bucket; source + channel + sink = a flow, with transactions at each hop]
● Pluggable component architecture (see the example agent configuration below)
● Durability via transactions
● File channels use Elastic Block Store (EBS) volumes (network-attached storage)
○ Protects against hardware failure
● SQS Flume Plugin: https://github.com/plumbee/flume-sqs-source
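A Flume agent is wired together in a properties file. A sketch of the flow above, assuming hypothetical class, path, and bucket names (the actual source class ships with the flume-sqs-source plugin linked above):

agent.sources = sqs
agent.channels = file
agent.sinks = s3

# Custom SQS source (class name assumed for illustration)
agent.sources.sqs.type = com.plumbee.flume.source.sqs.SQSSource
agent.sources.sqs.channels = file

# File-based channel persisted on EBS volumes for durability
agent.channels.file.type = file
agent.channels.file.checkpointDir = /mnt/ebs/flume/checkpoint
agent.channels.file.dataDirs = /mnt/ebs/flume/data

# HDFS sink writing Snappy-compressed ~512MB files to S3
agent.sinks.s3.type = hdfs
agent.sinks.s3.channel = file
agent.sinks.s3.hdfs.path = s3n://plumbee-eventlog/%Y-%m-%d/rpc/rpc-spin
agent.sinks.s3.hdfs.codeC = snappy
agent.sinks.s3.hdfs.rollSize = 536870912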
13. Extract, Transform, Load
● Daily activity
● Orchestrated by Amazon DataPipeline
● Includes generation of reports
● Configured with JSON
What is DataPipeline?
A cloud-based data workflow service that
helps you process and move data between
different AWS services
[Diagram: a pipeline definition combines a RESOURCE, a COMMAND, and a SCHEDULE]
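A heavily trimmed sketch of what such a JSON pipeline definition can look like (every id, path, and step argument here is invented for illustration):

{
  "objects": [
    { "id": "Daily", "type": "Schedule", "period": "1 day",
      "startDateTime": "2014-11-18T03:00:00" },
    { "id": "Cluster", "type": "EmrCluster",
      "schedule": { "ref": "Daily" } },
    { "id": "Transform", "type": "EmrActivity",
      "runsOn": { "ref": "Cluster" },
      "step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3://plumbee-eventlog/2014-11-18/rpc/rpc-spin,-output,s3://plumbee-aggregates/spins/2014-11-18,-mapper,mapper.py,-reducer,reducer.py" }
  ]
}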
14. Extract & Transform (I)
What is Elastic Map Reduce?
A cloud-based MapReduce implementation, built on top of the open-source Hadoop framework, for processing vast amounts of data.
Two phases:
● Map() procedure -> filtering & sorting
● Reduce() procedure -> summary operation
[Diagram: word count example]
● Raw data: Penguin, Horse, Cake, Cake, Penguin, Penguin, Penguin, Horse, Horse
● Map() sorts into queues: [Cake, Cake], [Horse, Horse, Horse], [Penguin, Penguin, Penguin, Penguin]
● Reduce() emits the result: Cake: 2, Horse: 3, Penguin: 4
15. Extract & Transform (II)
What is Hive?
An open-source Apache project which provides a SQL-like interface to summarize, query, and analyze large datasets by leveraging Hadoop's MapReduce infrastructure.
● Not really SQL: HQL (HiveQL)
● No transactions, no materialized views, limited subquery support, ...
SELECT plumbeeuid,
COUNT(*) AS spins
FROM eventlog
-- Partitioned data access
WHERE event_date = '2014-11-18'
AND event_type = 'rpc'
AND event_sub_type = 'rpc-spin'
-- Aggregation
GROUP BY plumbeeuid;
Table: Eventlog
● Mounted on top of raw data
● SerDe provides JSON parsing
● Target data via partition filters
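For illustration, mounting that table might look roughly like this (the column list, SerDe class, and bucket are assumptions; the deck only shows the query side):

CREATE EXTERNAL TABLE eventlog (
  plumbeeuid BIGINT,
  testvariant STRING
)
PARTITIONED BY (event_date STRING, event_type STRING, event_sub_type STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://plumbee-eventlog/';

New partitions then have to be registered (e.g. with ALTER TABLE eventlog ADD PARTITION ...) before queries like the one above can prune down to them.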
16. Extract & Transform (III)
● Hive has limitations!
○ Speed, JSON
● Most of our transformations use:
Streaming MapReduce Jobs
What is Streaming?
“A Hadoop utility that allows you to create
and run MapReduce jobs using any
executable script as a mapper or reducer”
import sys
import json

for line in sys.stdin:
    data = json.loads(line)
    print '%s\t1' % data['plumbeeUid']
Emits key-value pairs:
466264 => 1, 376166 => 1
983131 => 1, 466264 => 1
Hadoop sorts and shuffles the data making sure
matching keys are processed by a single reducer!
import sys
from collections import defaultdict

results = defaultdict(int)
for line in sys.stdin:
    plumbee_uid, count = line.split('\t')
    results[plumbee_uid] += int(count)
print dict(results)
[Diagram: JSON rpc-spin data → map() → sort & shuffle → reduce()]
Result: { 466264: 2, 376166: 1, 983131: 1 }
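Such a job is launched through the Hadoop streaming jar along these lines (jar location, buckets, and paths are placeholders):

hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -input s3://plumbee-eventlog/2014-11-18/rpc/rpc-spin/ \
  -output s3://plumbee-aggregates/spins/2014-11-18/ \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py -file reducer.py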
17. Load (I) - Problem
[Diagram: raw S3 JSON data → EMR transformation (Hive & streaming jobs) → 5.4TB of aggregated data]
EMR-transformed data:
● Referred to as aggregates
● Stored in S3
● Accessible via an EMR cluster
Problem:
● We don't run long-lived EMR clusters
● EMR requires specialist knowledge
● EMR is slow to boot and process ("offline")
Solution: use Amazon Redshift for fast "online" data access
18. What is Redshift?
A column-oriented database which uses Massively Parallel Processing (MPP) techniques to support analytics-style SQL workloads across large datasets.
Power comes from:
● Query parallelization
● Column-oriented design
Redshift Provides:
● Low latency JDBC and ODBC access
● Fault Tolerance
● Automated Backups
Load (II) - Redshift
● Redshift (3 nodes): 0.33s
● EMR (20 nodes): 135.46s
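Getting the aggregates into Redshift goes through its bulk-load path. A sketch of the kind of COPY statement involved (table name, bucket, delimiter, and the elided credentials are our placeholders):

COPY daily_spins
FROM 's3://plumbee-aggregates/spins/2014-11-18/'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER '\t';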
19. Load (II) - Column-Oriented Databases
Sample table:
ID | First Name | Last Name | Country
1  | Penguin    | Situation | GB
2  | Cheese     | Labs      | US
3  | Horse      | Barracks  | GB
Row-oriented database (MySQL) stores the data row by row:
● Easy to add/modify records
● Could read irrelevant data
● Great for fast lookups (OLTP)
Column-oriented database (Redshift) stores the data column by column:
● Only reads relevant data
● Adding rows requires multiple updates to column data
● Great for aggregation queries (OLAP)
32. User targeting
Run SQL queries directly against Redshift
[Diagram: SQL query → Amazon Redshift → user segment]
33. User targeting: Query example
-- Target all mobile users
SELECT plumbee_uid, arn
FROM mobile_user;
34. User targeting: Query example (II)
-- Target lapsed users (1 week lapse)
SELECT plumbee_uid, arn
FROM mobile_user
WHERE last_play_time < DATEADD(day, -7, GETDATE());
44. Amazon SNS: Mobile Push
private void publishMessage(UserData userData, String jsonPayload) {
    amazonSNS.publish(new PublishRequest()
        .withTargetArn(userData.getEndpoint())
        .withMessageStructure("json")
        .withMessage(jsonPayload));
}
Payload example
{"default": "The 5 day Halloween Challenge has started today! Touch to play NOW!"}