2. A LITTLE BIT ABOUT
US & APPBOY
(who we are and what we do)
Appboy is a mobile relationship
management platform for apps
Jon Hyman
CIO :: @jon_hyman
Harvard
Bridgewater
3. Appboy improves
engagement by helping you
understand your app users
• IDENTIFY - Understand demographics, social and behavioral data
• SEGMENT - Organize customers into groups based on behaviors, events, user attributes, and location
• ENGAGE - Message users through push notifications, emails, and multiple forms of in-app messages
4. Use Case: Customer engagement begins with onboarding
Urban Outfitters
textPlus
Shape Magazine
5. Agenda
• How to quickly store time series data in MongoDB using flexible schemas
• Learn how flexible schemas can easily provide breakdowns across dimensions
• Counting quickly: statistical analysis on top of MongoDB queries
6. What kinds of analytics does Appboy track?
• Lots of time series data
  • App opens over time
  • Events over time
  • Revenue over time
  • Marketing campaign stats and efficacy over time
7. What kinds of analytics does Appboy track?
• Breakdowns*
  • Device types
  • Device OS versions
  • Screen resolutions
  • Revenue by product
* We also care about this over time!
8. What kinds of analytics does Appboy track?
• User segment membership
  • How many users are in each segment?
  • How many can be emailed or reached via push notifications?
  • What is the average revenue per user in the segment?
    • Per paying user?
10. Typical time series collection
Log a new row for each open received
{
  timestamp: 2013-11-14 00:00:00 UTC,
  app_id: App identifier
}

db.app_opens.find({app_id: A, timestamp: {$gte: date}})
Pro: Really, really simple. Easy to add attribution to users.
Con: You need to aggregate the data before
drawing the chart; lots of documents read into
memory, lots of dirty pages
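To make the con concrete, here is a minimal sketch of the aggregation a chart needs when every open is its own document. It assumes the documents have already been fetched into an array and that `timestamp` is a JavaScript Date; `countOpensByHour` and the sample data are hypothetical, not part of Appboy or MongoDB.

```javascript
// Hypothetical helper: bucket raw "open" documents into per-hour counts (UTC).
// Every document has to be read before the chart can be drawn.
function countOpensByHour(opens) {
  const counts = new Array(24).fill(0);
  for (const open of opens) {
    counts[open.timestamp.getUTCHours()] += 1;
  }
  return counts;
}

// Made-up sample standing in for the result of db.app_opens.find(...).
const sampleOpens = [
  { app_id: "A", timestamp: new Date("2013-11-14T00:05:00Z") },
  { app_id: "A", timestamp: new Date("2013-11-14T00:40:00Z") },
  { app_id: "A", timestamp: new Date("2013-11-14T13:15:00Z") },
];

const hourly = countOpensByHour(sampleOpens);
console.log(hourly[0], hourly[13]); // 2 1
```

The work grows with the number of opens, which is what the pre-aggregated layouts on the following slides avoid.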
11. Fewer documents with pre-aggregation iteration 1
Create a document that groups by the time period
{
  app_id: App identifier,
  date: Date of the document,
  hour: 0-23 based hour this document represents,
  opens: Number of opens this hour
}

db.app_opens.update({date: D, app_id: A, hour: 0},
                    {$inc: {opens: 1}})
Pro: Really easy to draw histograms
Con: We never care about an hour by itself. We lose attribution.
12. Fewer documents with pre-aggregation iteration 2
Create a document by day and have each hour be a field
{
  app_id: App identifier,
  date: Date of the document,
  total_opens: Total number of opens this day,
  0: Number of opens at midnight,
  1: Number of opens at 1am,
  ...
  23: Number of opens at 11pm
}

db.app_opens.update(
  {date: D, app_id: A},
  {$inc: {"0": 1, total_opens: 1}}
)
Pro: Document count is low, easy to use aggregation framework
for longer spans, fast: document should be in working set
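As a rough sketch of how an incoming open could be turned into that update, the helper below derives the day document and the hour field from a timestamp. `dailyOpenUpdate` is a hypothetical name, hours are taken in UTC, and the `upsert` option in the closing comment is an assumption about how the day's document gets created.

```javascript
// Hypothetical helper: build the filter and update for one app open,
// following the one-document-per-day layout above.
function dailyOpenUpdate(appId, timestamp) {
  // Midnight UTC of the day the open happened identifies the document.
  const day = new Date(Date.UTC(
    timestamp.getUTCFullYear(),
    timestamp.getUTCMonth(),
    timestamp.getUTCDate()
  ));
  const hourField = String(timestamp.getUTCHours()); // "0" through "23"
  return {
    filter: { app_id: appId, date: day },
    update: { $inc: { [hourField]: 1, total_opens: 1 } },
  };
}

const op = dailyOpenUpdate("A", new Date("2013-11-14T13:15:00Z"));
console.log(JSON.stringify(op.update)); // {"$inc":{"13":1,"total_opens":1}}
// In the shell this might be applied as (assumption: upsert creates the
// day's document on its first open):
//   db.app_opens.update(op.filter, op.update, {upsert: true})
```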
13. Fewer documents with pre-aggregation iteration 2
• What about looking at different dimensions?
  • App opens by device type (e.g., how do iPads compare to iPhones?)
  • Demographics (gender, age group)
15. Fewer documents with pre-aggregation iteration 3
Dynamically add dimensions in the document
{
  app_id: App identifier,
  date: Date of the document,
  totals: {
    app_opens: Total number of opens this day,
    devices: {
      "iPad Air": Total number of opens on the iPad Air,
      "iPhone 4": Total number of opens on the iPhone 4,
    },
    genders: {
      male: Total number of opens from male users,
      female: Total number of opens from female users
    },
    ...
  },
  0: {
    app_opens: Number of opens at midnight,
    devices: {
      "iPad Air": Number of opens on the iPad Air at midnight,
      "iPhone 4": Number of opens on the iPhone 4 at midnight,
    },
    ...
  },
  ...
}

db.app_opens.update(
  {date: D, app_id: A},
  {$inc: {
    "totals.app_opens": 1, "totals.devices.iPad Air": 1,
    "0.app_opens": 1, "0.devices.iPad Air": 1
  }}
)
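One way to keep the update in step with the nested layout is to generate the `$inc` keys from the dimensions seen on each open. The sketch below assumes dot-notation paths into the `totals` and per-hour sub-documents; `buildOpenIncrement` is a hypothetical helper, and it assumes dimension values contain no "." and do not start with "$".

```javascript
// Hypothetical helper: build the $inc document for one app open under the
// nested layout above. `dimensions` maps a dimension name to the value seen
// on this open, e.g. { devices: "iPad Air", genders: "female" }.
function buildOpenIncrement(hour, dimensions) {
  const inc = {};
  // Update both the day-level totals and the hour-level sub-document.
  for (const prefix of ["totals", String(hour)]) {
    inc[prefix + ".app_opens"] = 1;
    for (const [dimension, value] of Object.entries(dimensions)) {
      inc[prefix + "." + dimension + "." + value] = 1;
    }
  }
  return { $inc: inc };
}

const update = buildOpenIncrement(0, { devices: "iPad Air", genders: "female" });
console.log(Object.keys(update.$inc));
```

Adding a new dimension then only means passing another key in `dimensions`; the document picks up the new fields on the next update.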
16. Pre-aggregated analytics
Pros
• Easily extensible to add other dimensions
• Still only using one document, therefore you can create charts very quickly
• You get breakdowns over a time period for free

Cons
• Pre-aggregated data has no attribution
• Have to know questions ahead of time
Follow up: What if we wanted to look at a graph by age group?
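The "breakdowns over a time period" point can be sketched as a merge over the daily documents. This assumes each document carries a `totals` sub-document keyed by dimension, as in iteration 3; `sumBreakdown` and the sample numbers are hypothetical.

```javascript
// Hypothetical helper: merge one dimension's day-level totals across
// several daily documents, e.g. device counts for a week.
function sumBreakdown(dailyDocs, dimension) {
  const combined = {};
  for (const doc of dailyDocs) {
    const counts = (doc.totals && doc.totals[dimension]) || {};
    for (const [value, count] of Object.entries(counts)) {
      combined[value] = (combined[value] || 0) + count;
    }
  }
  return combined;
}

// Made-up documents for two days.
const days = [
  { totals: { app_opens: 5, devices: { "iPad Air": 2, "iPhone 4": 3 } } },
  { totals: { app_opens: 4, devices: { "iPad Air": 4 } } },
];
console.log(JSON.stringify(sumBreakdown(days, "devices"))); // {"iPad Air":6,"iPhone 4":3}
```

A dimension that was never recorded, such as age group, simply comes back empty, which is the "have to know questions ahead of time" con in practice.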
17. Pre-aggregated analytics summary
• Get started tracking time series data quickly
• You get breakdowns for free
• Adding dimensions is super simple
• No attribution, need to know questions ahead of time
• Don’t just rely on pre-aggregated analytics
20. Counting quickly
Appboy shows you segment membership in real-time
as you add/edit/remove filters.

How do we do it quickly?

We estimate the population sizes of segments when
using our web UI.
21. Counting quickly
Goal: Quickly get the count() of an arbitrary query

Problem: MongoDB counts are slow, especially unindexed ones
23. Counting quickly
10 million documents that represent people:
{
  favorite_color: "blue",
  age: 27,
  gender: "M",
  favorite_food: "pizza",
  city: "NYC",
  shoe_size: 11,
  attractiveness: 10,
  ...
}
• How many people like blue?
• How many live in NYC and love pizza?
• How many men have a shoe size less than 10?
25. Counting quickly
Add a random number in a known range to each document. Say,
between 0 and 9999.
{
  random: 4583,
  favorite_color: "blue",
  age: 27,
  gender: "M",
  favorite_food: "pizza",
  city: "NYC",
  shoe_size: 11,
  attractiveness: 10,
  ...
}
Add an index on the random number:

db.users.ensureIndex({random:1})
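Assigning the random number could look like the sketch below, done once when the document is created. `withRandomBucket` is a hypothetical helper; the injectable `rng` parameter is only there to make the behavior testable.

```javascript
// Hypothetical helper: attach a random bucket in [0, numBuckets) to a document.
// `rng` defaults to Math.random and must return a number in [0, 1).
function withRandomBucket(doc, numBuckets = 10000, rng = Math.random) {
  return { random: Math.floor(rng() * numBuckets), ...doc };
}

const user = withRandomBucket({ favorite_color: "blue", age: 27 });
console.log(user.random >= 0 && user.random <= 9999); // true
```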
26. Counting quickly
Step 1: Get a random sample

I have 10 million documents. Of my 10,000 random “buckets”, I
should expect each “bucket” to hold about 1,000 users.

E.g.,

db.users.find({random: 123}).count()   // ~1,000
db.users.find({random: 9043}).count()  // ~1,000
db.users.find({random: 4982}).count()  // ~1,000
27. Counting quickly
Step 1: Get a random sample

Let’s take a random 100,000 users, which is about 100 buckets. Grab a
random range of buckets that “holds” those users. These all work:

db.users.find({random: {$gt: 0, $lt: 101}})
db.users.find({random: {$gt: 503, $lt: 604}})
db.users.find({random: {$gt: 8938, $lt: 9039}})
db.users.find({$or: [
  {random: {$gt: 9955}},
  {random: {$lt: 56}}
]})
Tip: Limit $maxScan to 100,000 just to be safe
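Picking the range can be sketched as choosing a starting bucket and wrapping past the last bucket when needed. `randomRangeQuery` is a hypothetical helper; it assumes buckets 0 to 9999, a width no larger than the number of buckets, and inclusive bounds (`$gte`/`$lte`) rather than the exclusive ones on the slide.

```javascript
// Hypothetical helper: build a query that selects `width` consecutive
// buckets starting at `start`, wrapping past the last bucket if needed.
function randomRangeQuery(start, width = 100, numBuckets = 10000) {
  const end = start + width - 1; // inclusive last bucket before wrapping
  if (end < numBuckets) {
    return { random: { $gte: start, $lte: end } };
  }
  // The range runs off the end, so continue from bucket 0.
  return {
    $or: [
      { random: { $gte: start } },
      { random: { $lte: end - numBuckets } },
    ],
  };
}

// Pick the starting bucket at random for each new sample.
const randomStart = Math.floor(Math.random() * 10000);
console.log(JSON.stringify(randomRangeQuery(randomStart)));
```

For example, a start of 9956 produces the same 100 buckets as the `$or` query above: 9956 through 9999, then 0 through 55.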
29. Counting quickly
Step 3: Do the math

Population: 10,000,000
Sample size: 100,000
Num matches: 11,302
Percentage of users who matched: 11.3%
Estimated total count: 1,130,000 +/- 20,000
(11.3% +/- 0.2 percentage points) with 95% confidence
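The arithmetic can be reproduced with the usual normal approximation for a sampled proportion. This is a sketch under that assumption (it ignores the finite population correction, which is small here); `estimateCount` is a hypothetical helper, not the formula confirmed by the slides.

```javascript
// Hypothetical helper: scale a sample match count up to the population and
// attach a 95% margin of error (normal approximation for a proportion).
function estimateCount(population, sampleSize, matches) {
  const p = matches / sampleSize;
  const standardError = Math.sqrt((p * (1 - p)) / sampleSize);
  const z = 1.96; // z-score for 95% confidence
  return {
    proportion: p,
    estimate: Math.round(p * population),
    margin: Math.round(z * standardError * population),
  };
}

const result = estimateCount(10000000, 100000, 11302);
console.log(result.estimate, result.margin);
```

With the slide's numbers the estimate comes out at 1,130,200 users with a margin just under 20,000, which is the 0.2 percentage points quoted above.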
30. Counting quickly
Step 4: Optimize

• Limit $maxScan to (100,000/numShards) to be even faster
• Cache the random range for a few hours
• Add more RAM (or shards)
• Cache results to not hit the database for the same query
31. Counting quickly
Step 5: Improve

• Get more than one count: run the aggregation framework over the population sample
• Work around all sorts of Mongo bugs :-(
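Getting several counts from one pass over the sample might look like the sketch below: a `$match` on the sampled bucket range followed by a `$group`, with each group's count scaled up afterwards. The helper names and the result rows are hypothetical; only the `$match`, `$group`, and `$sum` stages are standard MongoDB aggregation syntax.

```javascript
// Hypothetical helper: one aggregation pipeline that counts every value of
// `groupField` within the sampled bucket range.
function sampledGroupCounts(sampleQuery, groupField) {
  return [
    { $match: sampleQuery },
    { $group: { _id: "$" + groupField, count: { $sum: 1 } } },
  ];
}

// Scale each sampled count by population / sample size.
function scaleCounts(rows, population, sampleSize) {
  const factor = population / sampleSize;
  return rows.map((row) => ({
    _id: row._id,
    estimate: Math.round(row.count * factor),
  }));
}

const pipeline = sampledGroupCounts({ random: { $gte: 504, $lte: 603 } }, "gender");
// Rows shaped like the output of db.users.aggregate(pipeline); made-up numbers.
const rows = [{ _id: "M", count: 48000 }, { _id: "F", count: 52000 }];
const scaled = scaleCounts(rows, 10000000, 100000);
console.log(scaled);
```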
32. Summarize
• Pre-aggregated analytics
  • Create a document that represents event occurrences in some time period
  • Takes full advantage of MongoDB’s flexible schemas
  • Not a catch-all for analytics, you should still store event data
33. Summarize
• Counting quickly
  • Estimate results of arbitrary queries using population sample sizes
  • Depending on your app, this could be a great way to keep response time predictable as you scale