Appboy analytics - NYC MUG 11/19/13

Appboy Analytics
Jon Hyman
NY MongoDB User Group, November 19, 2013
eBay NYC

@appboy @jon_hyman

A LITTLE BIT ABOUT
US & APPBOY
(who we are and what we do)

Appboy is a mobile relationship
management platform for apps
Jon Hyman
CIO :: @jon_hyman

Harvard
Bridgewater

Appboy improves
engagement by helping you
understand your app users
• IDENTIFY - Understand demographics, social and behavioral data
• SEGMENT - Organize customers into groups based on behaviors, events, user attributes, and location
• ENGAGE - Message users through push notifications, emails, and multiple forms of in-app messages

Use Case: Customer engagement begins with onboarding

Urban Outfitters

textPlus

Shape Magazine

Agenda
• How to quickly store time series data in MongoDB using flexible schemas
• Learn how flexible schemas can easily provide breakdowns across dimensions
• Counting quickly: statistical analysis on top of MongoDB queries

What kinds of analytics does Appboy track?
• Lots of time series data
  • App opens over time
  • Events over time
  • Revenue over time
  • Marketing campaign stats and efficacy over time
• Breakdowns*
  • Device types
  • Device OS versions
  • Screen resolutions
  • Revenue by product

* We also care about this over time!

• User segment membership
  • How many users are in each segment?
  • How many can be emailed or reached via push notifications?
  • What is the average revenue per user in the segment?
    • Per paying user?

Pre-aggregated Analytics:

APP OPENS OVER TIME

Typical time series collection
Log a new row for each open received

{
  timestamp: 2013-11-14 00:00:00 UTC,
  app_id: App identifier
}

db.app_opens.find({app_id: A, timestamp: {$gte: date}})

Pro: Really, really simple. Easy to add attribution to users.
Con: You need to aggregate the data before
drawing the chart; lots of documents read into
memory, lots of dirty pages
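
To make the Con concrete, here is a minimal sketch in plain JavaScript of the aggregation step that has to run before a chart can be drawn. It assumes the rows were already fetched by a find() like the one above; `bucketOpensByHour` and the sample rows are hypothetical, made up for illustration.

```javascript
// Hypothetical helper: count raw open rows per UTC hour so they can be charted.
// `rows` stands in for the documents returned by the find() on the slide.
function bucketOpensByHour(rows) {
  const counts = new Array(24).fill(0);
  for (const row of rows) {
    counts[row.timestamp.getUTCHours()] += 1;
  }
  return counts;
}

// Three made-up rows: two opens in the midnight hour, one in the 1am hour (UTC).
const sampleRows = [
  { app_id: "A", timestamp: new Date("2013-11-14T00:05:00Z") },
  { app_id: "A", timestamp: new Date("2013-11-14T00:40:00Z") },
  { app_id: "A", timestamp: new Date("2013-11-14T01:15:00Z") },
];
const hourly = bucketOpensByHour(sampleRows);
console.log(hourly.slice(0, 3)); // [ 2, 1, 0 ]
```

Every matching row has to be read to produce those 24 numbers, which is the cost the pre-aggregated designs avoid.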

Fewer documents with pre-aggregation iteration 1
Create a document that groups by the time period

{
  app_id: App identifier,
  date: Date of the document,
  hour: 0-23 based hour this document represents,
  opens: Number of opens this hour
}

db.app_opens.update({date: D, app_id: A, hour: 0}, {$inc: {opens: 1}})

Pro: Really easy to draw histograms
Con: We never care about an hour by itself. We lose attribution.

Create a document by day and have each hour be a field

{
  app_id: App identifier,
  date: Date of the document,
  total_opens: Total number of opens this day,
  0: Number of opens at midnight,
  1: Number of opens at 1am,
  ...
  23: Number of opens at 11pm
}

db.app_opens.update(
  {date: D, app_id: A},
  {$inc: {"0": 1, total_opens: 1}}
)

Pro: Document count is low, easy to use aggregation framework
for longer spans, fast: document should be in working set
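
A minimal sketch of how one open event could be turned into the filter and $inc for the per-day document. `buildDailyOpenUpdate` is a hypothetical helper, and both the use of UTC hours and the upsert in the closing comment are assumptions; the slides do not say how the day document is first created.

```javascript
// Hypothetical helper: map one open event to the per-day document's
// filter and update (one field per UTC hour plus the daily total_opens).
function buildDailyOpenUpdate(appId, timestamp) {
  // Midnight UTC of the event's day identifies the document.
  const day = new Date(Date.UTC(
    timestamp.getUTCFullYear(), timestamp.getUTCMonth(), timestamp.getUTCDate()
  ));
  const inc = { total_opens: 1 };
  inc[String(timestamp.getUTCHours())] = 1; // e.g. "13" for an open at 13:37 UTC
  return { filter: { date: day, app_id: appId }, update: { $inc: inc } };
}

const op = buildDailyOpenUpdate("A", new Date("2013-11-14T13:37:00Z"));
console.log(JSON.stringify(op));
// Assumption: in the mongo shell this could be applied as an upsert, e.g.
// db.app_opens.update(op.filter, op.update, {upsert: true})
```

Because the hour is just a field name, a single $inc touches both the hourly counter and the daily total in one write.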

• What about looking at different dimensions?
  • App opens by device type (e.g., how do iPads compare to iPhones?)
  • Demographics (gender, age group)

Dynamically add dimensions in the document


{
  app_id: App identifier,
  date: Date of the document,
  totals: {
    app_opens: Total number of opens this day,
    devices: {
      "iPad Air": Total number of opens on the iPad Air,
      "iPhone 4": Total number of opens on the iPhone 4
    },
    genders: {
      male: Total number of opens from male users,
      female: Total number of opens from female users
    },
    ...
  },
  0: {
    app_opens: Number of opens at midnight,
    devices: {
      "iPad Air": Number of opens on the iPad Air at midnight,
      "iPhone 4": Number of opens on the iPhone 4 at midnight
    },
    ...
  },
  ...
}

db.app_opens.update(
  {date: D, app_id: A},
  {$inc: {
    "totals.app_opens": 1,
    "totals.devices.iPad Air": 1,
    "0.app_opens": 1,
    "0.devices.iPad Air": 1
  }}
)
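
With nested dimensions, one open has to increment several dotted paths at once. A minimal sketch of building that $inc, assuming the `totals` and per-hour sub-documents shown in the schema; `buildDimensionedInc` is a hypothetical helper, not Appboy's code.

```javascript
// Hypothetical helper: build the $inc for one open, touching the daily
// totals and the hour's sub-document for every dimension supplied.
// Assumes dimension values contain no "." and do not start with "$",
// since they are used as field names.
function buildDimensionedInc(hour, dimensions) {
  const inc = {};
  for (const prefix of ["totals", String(hour)]) {
    inc[prefix + ".app_opens"] = 1;
    for (const [dimension, value] of Object.entries(dimensions)) {
      // e.g. "totals.devices.iPad Air" or "0.genders.female"
      inc[prefix + "." + dimension + "." + value] = 1;
    }
  }
  return { $inc: inc };
}

const update = buildDimensionedInc(0, { devices: "iPad Air", genders: "female" });
console.log(Object.keys(update.$inc));
```

Adding a new dimension means passing one more key; no schema migration is needed because $inc creates missing counters.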

Pre-aggregated analytics
Pros
• Easily extensible to add other dimensions
• Still only using one document, therefore you can create charts very quickly
• You get breakdowns over a time period for free

Cons
• Pre-aggregated data has no attribution
• Have to know questions ahead of time

Follow up: What if we wanted to look at a graph by age group?

Pre-aggregated analytics summary
• Get started tracking time series data quickly
• You get breakdowns for free
• Adding dimensions is super simple
• No attribution, need to know questions ahead of time
• Don’t just rely on pre-aggregated analytics

Counting quickly:

USER SEGMENTATION &
STATISTICAL ANALYSIS

User Segmentation
• A group of users who match some set of filters

Counting quickly
Appboy shows you segment membership in real-time as you add/edit/remove filters.

How do we do it quickly?

We estimate the population sizes of segments when using our web UI.

Counting quickly

Goal: Quickly get the count() of an arbitrary query

Problem: MongoDB counts are slow, especially unindexed ones

Counting quickly
10 million documents that represent people:
{
  favorite_color: "blue",
  age: 27,
  gender: "M",
  favorite_food: "pizza",
  city: "NYC",
  shoe_size: 11,
  attractiveness: 10,
  ...
}

Counting quickly
10 million documents that represent people:
{
  age: 27,
  gender: "M",
  city: "NYC",
  shoe_size: 11,
  ...
}

• How many people like blue?
• How many live in NYC and love pizza?
• How many men have a shoe size less than 10?

Answer:

Big Question: How do you estimate counts?

The same way news networks do it.

With confidence.

Counting quickly
Add a random number in a known range to each document. Say, between 0 and 9999.

{
  random: 4583,
  age: 27,
  gender: "M",
  city: "NYC",
  shoe_size: 11,
  ...
}

Add an index on the random number:

db.users.ensureIndex({random: 1})
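
A minimal sketch of attaching the random number when a user document is created. `NUM_BUCKETS`, `randomBucket`, and `newUserDocument` are hypothetical names; the slides only say the value lies in a known range such as 0 to 9999.

```javascript
// Assumed bucket count, matching the 0-9999 range on the slide.
const NUM_BUCKETS = 10000;

// Hypothetical helper: pick a uniform integer bucket in [0, NUM_BUCKETS).
function randomBucket() {
  return Math.floor(Math.random() * NUM_BUCKETS);
}

// Attach the bucket once, when the user document is first built.
function newUserDocument(attributes) {
  return Object.assign({ random: randomBucket() }, attributes);
}

const user = newUserDocument({ age: 27, gender: "M", city: "NYC", shoe_size: 11 });
console.log(user.random >= 0 && user.random < NUM_BUCKETS); // true
```

The value is assigned independently of every other attribute, so any range of buckets behaves like a random sample of the users.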

Counting quickly
Step 1: Get a random sample

I have 10 million documents. Of my 10,000 random “buckets”, I should expect each “bucket” to hold about 1,000 users.

E.g.,

db.users.find({random: 123}).count()   // ~1000
db.users.find({random: 9043}).count()  // ~1000
db.users.find({random: 4982}).count()  // ~1000

Counting quickly
Step 1: Get a random sample

Let’s take a random 100,000 users. Grab a random range that “holds” those users. These all work:

db.users.find({random: {$gt: 0, $lt: 101}})
db.users.find({$or: [
  {random: {$gt: 9955}},
  {random: {$lt: 56}}
]})

Tip: Limit $maxScan to 100,000 just to be safe
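
A minimal sketch of generating such a range from a random starting bucket, including the wrap-around case. `sampleRangeQuery` is a hypothetical helper; it assumes 10,000 buckets and a width no larger than that.

```javascript
const NUM_BUCKETS = 10000; // buckets 0..9999, as on the earlier slide

// Hypothetical helper: build a query selecting `width` consecutive buckets
// starting at `start`, wrapping past 9999 back to 0 when needed.
function sampleRangeQuery(start, width) {
  const end = start + width - 1; // last bucket, before wrapping
  if (end < NUM_BUCKETS) {
    return { random: { $gte: start, $lte: end } };
  }
  return { $or: [
    { random: { $gte: start } },
    { random: { $lte: end - NUM_BUCKETS } },
  ] };
}

// 100 buckets of ~1,000 users each gives a sample of ~100,000 users.
const start = Math.floor(Math.random() * NUM_BUCKETS);
console.log(JSON.stringify(sampleRangeQuery(start, 100)));

// Starting at 9956 wraps around: buckets 9956..9999 plus 0..55,
// the same 100 buckets as the $or query on the slide.
console.log(JSON.stringify(sampleRangeQuery(9956, 100)));
```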

Counting quickly
Step 2: Learn about that random sample

db.users.find(
  {
    random: {$gt: 0, $lt: 101},
    gender: "M",
    shoe_size: {$gt: 10}
  }
)
._addSpecial("$maxScan", 100000)
.explain()

Explain Result:

{
  nscannedObjects: 100000,
  n: 11302,
  ...
}

Counting quickly
Step 3: Do the math

Population: 10,000,000
Sample size: 100,000
Num matches: 11,302
Percentage of users who matched: 11.3%
Estimated total count: 1,130,000 +/- 0.2 percentage points (roughly 20,000 users) with 95% confidence
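
The arithmetic can be sketched as follows. The slides do not say which formula is used; this assumes the standard normal approximation for a sample proportion (z = 1.96 for 95% confidence), and `estimateCount` is a hypothetical helper.

```javascript
// Hypothetical helper: scale a sample match count to the full population and
// attach a 95% margin of error (normal approximation for a proportion).
function estimateCount(population, sampleSize, matches) {
  const p = matches / sampleSize;                     // sample proportion
  const stdErr = Math.sqrt(p * (1 - p) / sampleSize); // standard error of p
  const margin = 1.96 * stdErr;                       // 95% confidence
  return {
    proportion: p,
    estimate: Math.round(p * population),
    marginPoints: margin * 100,                       // in percentage points
    marginUsers: Math.round(margin * population),
  };
}

// Numbers from the slide: n and nscannedObjects from the explain() output.
const result = estimateCount(10000000, 100000, 11302);
console.log(result.estimate);                // 1130200
console.log(result.marginPoints.toFixed(2)); // 0.20
```

The margin depends on the sample size rather than the population size, which is why the same 100,000-document scan stays useful as the user base grows.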

Counting quickly
Step 4: Optimize

• Limit $maxScan to (100,000/numShards) to be even faster
• Cache the random range for a few hours
• Add more RAM (or shards)
• Cache results to not hit the database for the same query

Counting quickly
Step 5: Improve

• Get more than one count: use the aggregation framework on top of the population’s sample size
• Work around all sorts of Mongo bugs :-(
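
One way to read "more than one count" is a $group over the sampled range, with each group's count scaled up afterwards. A minimal sketch under that assumption; `groupCountPipeline`, `scaleGroups`, and the sampled counts are hypothetical and only illustrate the idea.

```javascript
// Hypothetical helper: pipeline that groups the sampled users by one field.
// `sampleQuery` is the random-range filter from Step 1.
function groupCountPipeline(sampleQuery, field) {
  return [
    { $match: sampleQuery },
    { $group: { _id: "$" + field, count: { $sum: 1 } } },
  ];
}

// Scale each sampled group count up to the whole population.
function scaleGroups(groups, sampleSize, population) {
  const factor = population / sampleSize;
  return groups.map((g) => ({ _id: g._id, estimate: Math.round(g.count * factor) }));
}

const pipeline = groupCountPipeline({ random: { $gt: 0, $lt: 101 } }, "gender");
// In the mongo shell: db.users.aggregate(pipeline)
// Made-up output of that aggregation, for illustration only:
const sampled = [{ _id: "M", count: 48000 }, { _id: "F", count: 52000 }];
const scaled = scaleGroups(sampled, 100000, 10000000);
console.log(scaled);
```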

Summarize
• Pre-aggregated analytics
  • Create a document that represents event occurrences in some time period
  • Takes full advantage of MongoDB’s flexible schemas
  • Not a catch-all for analytics, you should still store event data

Summarize
• Counting quickly
  • Estimate results of arbitrary queries using population sample sizes
  • Depending on your app, this could be a great way to keep response time predictable as you scale

Thanks! Questions?
jon@appboy.com

@appboy @jon_hyman
