No BS Data Salon #3: Probabilistic Sketching

May 2012
Analytics + Attribution =
Actionable Insights

Outline

• What we do at AK
• What’s sketching?
• Our motivation for sketching
• Why should you sketch?
• Our case: unique counting
    - How it works
    - How well it works
    - How we use them

Here’s what we do at AK.

Online ad analytics
Compare the performance of different campaigns, inventory,
providers, creatives, etc.

Bottom Line:
Give the advertisers insight into the performance of their ads.

Motivation

• High throughput: 10s of K/s => 100s of K/s
• High dimensionality: 100M+ reporting keys
• Easy aggregates: counters, scalars
• Hard aggregates: unique user counting, set operations

• No cheap or effective “online” solutions
    - Streaming DBs (Truviso, Coral8, StreamBase) insufficient
    - Warehouse appliances (Aster, custom PG) same
    - Our data is immutable. Paying for unneeded ACID is silly.

• Offline solutions slow, operationally finicky.
• Not a bank. We don’t need to be perfect, just useful.

Why should you bother?

SELECT campaign_id, COUNT(DISTINCT user_id)
FROM access_logs
GROUP BY campaign_id;

What is probabilistic sketching?

• One-pass
• “Small” memory
• Probabilistic error

Our Case Study: unique counting

• Non-unique stream of ints
• Want to keep unique count, up to about a billion
• Want to do set operations (union, intersection, set difference)
• Straw Man #1: “Put them in a HashSet, and go away.”
• (Maybe) Straw Man #2: “Fine, keep a sample.”
• How we did it: HyperLogLog

How it works

The Papers:
• LogLog Counting of Large Cardinalities
    Marianne Durand and Philippe Flajolet (RIP 2011), 2003

• HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
    Flajolet, Fusy, Gandouet, Meunier, 2007

The (rudimentary, unrigorous) Intuition:

Flip fair coins
Longest streak of heads is length k, seen once
Probability of streak ≈ (½)^k
E[x] = n · p = 1 (x = times the streak shows up in n tries), p = (½)^k => n ≈ 2^k

How it works cont’d

1. Stream of int_64 => “good” hash => random {0,1}^64
2. Keep track of longest run of leading zeroes
3. Longest run of length k => cardinality ≈ 2^k
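
The three steps above can be sketched as a single-register estimator. This is a minimal illustration, not the implementation used at AK: the function names are made up for this example, and SHA-1 is only a stand-in for whatever “good” hash you pick.

```python
import hashlib


def hash64(value: int) -> int:
    """Map an integer to a pseudo-random 64-bit word (SHA-1 is a stand-in hash)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def leading_zeros(word: int, width: int = 64) -> int:
    """Count the leading zero bits of `word` viewed as a `width`-bit value."""
    return width - word.bit_length()


def naive_estimate(stream) -> int:
    """Estimate the distinct count as 2^k, where k is the longest run of leading zeroes."""
    longest = 0
    for value in stream:
        longest = max(longest, leading_zeros(hash64(value)))
    return 2 ** longest
```

Duplicates hash to the same word, so they never change the answer. On its own this estimator is very noisy (it can only return powers of two), which is what the corrections on the rest of this slide address.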

• Crazy math business
    - Correct systematic bias with a derived constant
    - Stochastic averaging
    - Balls and bins correction
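
A compact sketch of how these pieces fit together, following the 2007 HyperLogLog paper rather than AK's production code. Assumptions to note: the class and method names are invented for this example, SHA-1 stands in for the hash, the bias constant is the paper's formula for m >= 128, and the “balls and bins correction” is taken here to mean the paper's small-range (linear counting) correction.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog with m = 2^log2m registers (not production code)."""

    def __init__(self, log2m: int = 13):
        if log2m < 7:
            raise ValueError("the bias constant used here assumes m >= 128")
        self.log2m = log2m
        self.m = 1 << log2m
        self.registers = [0] * self.m
        # Derived constant that corrects the systematic bias (Flajolet et al., 2007).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    @staticmethod
    def _hash64(value) -> int:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value) -> None:
        h = self._hash64(value)
        rest_bits = 64 - self.log2m
        # Stochastic averaging: the first log2m bits choose a register.
        bucket = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        # Rank = leading zeroes in the remaining bits, plus one.
        rank = rest_bits - rest.bit_length() + 1
        if rank > self.registers[bucket]:
            self.registers[bucket] = rank

    def cardinality(self) -> float:
        # Normalized harmonic mean of 2^register across all registers.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * self.m and empty:
            # Small-range correction: linear counting on the empty registers.
            return self.m * math.log(self.m / empty)
        return raw

    def union(self, other: "HyperLogLog") -> "HyperLogLog":
        if self.log2m != other.log2m:
            raise ValueError("sketches must use the same log2m")
        merged = HyperLogLog(self.log2m)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged
```

Union is a register-wise max, so merging two sketches yields exactly the registers you would get by feeding both streams into one sketch.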

Here’s what you get

Native:
union, cardinality

Implies:
intersection (!!!), set difference (!!!)
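
One way the implied operations can be derived from the native ones is inclusion-exclusion on the three cardinality estimates. The helper names below are hypothetical, and exact Python sets stand in for sketches so the identities are easy to check.

```python
def intersection_size(card_a: float, card_b: float, card_union: float) -> float:
    """|A ∩ B| = |A| + |B| - |A ∪ B|"""
    return card_a + card_b - card_union


def difference_size(card_b: float, card_union: float) -> float:
    """|A \\ B| = |A ∪ B| - |B|"""
    return card_union - card_b


# Exact sets stand in for sketches here; with HLLs each input would be an estimate.
a = set(range(0, 100))
b = set(range(60, 150))
union_size = len(a | b)

overlap = intersection_size(len(a), len(b), union_size)
only_a = difference_size(len(b), union_size)
```

With sketches, every input carries its own error, so the derived numbers are only as good as the three estimates that feed them.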

Show me the money!

• Used in production at AK for a year
• Accurate: count to a billion with 1-3% error
• Small: a few KB each so we can keep 100s of M in memory
• Fast: benched at 2M inserts/s, used in production at 100s of K/s
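
A rough back-of-the-envelope check on the size and accuracy figures. The 1.04/sqrt(m) standard error comes from the HyperLogLog paper; the 5 bits per register is an assumption chosen because it reproduces the 5kB quoted for log2m=13, and the function names are made up for this example.

```python
import math


def hll_size_bytes(log2m: int, bits_per_register: int = 5) -> int:
    """Dense sketch size: m = 2^log2m registers, bit-packed (register width is assumed)."""
    total_bits = (1 << log2m) * bits_per_register
    return math.ceil(total_bits / 8)


def hll_standard_error(log2m: int) -> float:
    """Theoretical relative standard error, 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(1 << log2m)


size = hll_size_bytes(13)        # 8192 registers * 5 bits = 5120 bytes
error = hll_standard_error(13)   # 1.04 / sqrt(8192), a little over 1%
```

Each extra unit of log2m doubles the memory and shrinks the standard error by a factor of sqrt(2), which is the main tuning trade-off.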

Lies, damn lies, and boxplots!

[Figure: boxplots of HLL cardinality relative error (y-axis, −4% to 4%) vs. true cardinality (x-axis, 10^2 to 10^9), log2m=13 (5kB)]

But wait, there’s more!

[Figure: HLL intersection error (y-axis, −40% to 40%) vs. cardinality order-of-magnitude difference (x-axis, 0 to 3), colored by overlap fraction (0.1 to 1), log2m=13 (5kB)]

Implementation caveats

• If you store an HLL for each key, you’ll likely be wasting space whenever only
  some of the registers are set. Use a map-based HLL or use compression.
• Pick a good hash function!
• Test on your data!
• Tune parameters to suit your business needs!

How we use them, in production

• Original problem: fast, on-the-fly overlaps and unique counts
• Solution:
    - streaming, in-memory aggregations shipped to Postgres
    - Postgres module to do set operations on binary representations in the DB

• Freebie: PG analytics support like GROUP BY, sliding windows, etc…

UI example

To the browser, Robin!

How we use them, Ad Hoc

• Outside of production: amazing ad-hoc analysis tool
• Example: gathering more than a year’s worth of data for an RFP, at 20B
  impressions/month
    - painless and quick when we had the data as sketches
    - much more effort to put it through Hadoop

• Iterating on product and research is cheaper and faster.
    - Waiting minutes instead of seconds between iterations is painful.

“Soft” Caveats

• Fixed N% error is deceiving
• Additive error for set operations can balloon
• Unbounded error sneaks in now and again
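
To see how additive error can balloon, here is a worst-case bound for an intersection computed by inclusion-exclusion. It is a simplification for illustration only: the function name is hypothetical, and it assumes each of the three estimates (|A|, |B|, |A ∪ B|) can be off by up to the same relative error, with the errors stacking.

```python
def intersection_relative_error_bound(
    card_a: float, card_b: float, true_intersection: float, rel_err: float
) -> float:
    """Worst-case relative error of |A| + |B| - |A ∪ B| when every term is off by rel_err."""
    card_union = card_a + card_b - true_intersection
    # Absolute errors add up across the three estimates...
    worst_abs_err = rel_err * (card_a + card_b + card_union)
    # ...but are judged against the (possibly tiny) intersection.
    return worst_abs_err / true_intersection


# A billion-user set against a million-user set sharing 100K users, 1% error per estimate.
bound = intersection_relative_error_bound(1e9, 1e6, 1e5, 0.01)
```

In this example the bound is roughly 200 times the true intersection, even though each input is within 1%. The larger the size gap between the sets and the smaller the overlap, the worse it gets.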

Parting Advice

• Test these on your data rigorously
• Choose good hash functions
• Tuning parameters are particularly sensitive
• You’ll find all kinds of unexpected uses for them, so get building!
• Bibliography blog post will be up in a bit!

Questions?

@timonk
timon@aggregateknowledge.com
blog.aggregateknowledge.com

Credits

All the adorable cartoons you saw in this presentation were taken from
http://sureilldrawthat.com/ and http://sureilldrawthat.tumblr.com/ and belong
to their creator.
