1. No BS Data Salon #3:
Probabilistic Sketching
May 2012
Analytics + Attribution =
Actionable Insights
2. Outline
What we do at AK
What’s sketching?
Our motivation for sketching
Why should you sketch?
Our case: unique counting
How it works
How well it works
How we use them
3. Here’s what we do at AK.
Online ad analytics
Compare the performance of different campaigns, inventory,
providers, creatives, etc.
Bottom Line:
Give the advertisers insight into the performance of their ads.
4. Motivation
High throughput: 10s of K/s => 100s of K/s
High dimensionality: 100M+ reporting keys
Easy aggregates: counters, scalars
Hard aggregates: unique user counting, set operations
No cheap or effective “online” solutions
Streaming DBs (Truviso, Coral8, StreamBase) insufficient
Warehouse appliances (Aster, custom PG) same
Our data is immutable. Paying for unneeded ACID is silly.
Offline solutions slow, operationally finicky.
Not a bank. We don’t need to be perfect, just useful.
5. Why should you bother?
SELECT COUNT(DISTINCT user_id)
FROM access_logs
GROUP BY campaign_id
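For comparison, here is a minimal sketch of what an exact answer to this query costs in application code: one set of user ids per campaign, so memory grows with the number of distinct users per key. The function name `exact_uniques` and the sample rows are hypothetical, not from the talk.

```python
from collections import defaultdict

def exact_uniques(access_logs):
    """Exact distinct user count per campaign.

    access_logs is an iterable of (campaign_id, user_id) pairs.
    Every distinct user id is kept in memory, once per campaign.
    """
    users_by_campaign = defaultdict(set)
    for campaign_id, user_id in access_logs:
        users_by_campaign[campaign_id].add(user_id)
    return {campaign: len(users) for campaign, users in users_by_campaign.items()}

# Hypothetical rows: (campaign_id, user_id)
logs = [(1, 100), (1, 101), (1, 100), (2, 100), (2, 102), (2, 103)]
print(exact_uniques(logs))  # {1: 2, 2: 3}
```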
7. Our Case Study: unique counting
Non-unique stream of ints
Want to keep unique count, up to about a billion
Want to do set operations (union, intersection, set difference)
Straw Man #1: “Put them in a HashSet, and go away.”
(Maybe) Straw Man #2: “Fine, keep a sample.”
How we did it: HyperLogLog
8. How it works
The Papers:
LogLog Counting of Large Cardinalities
Marianne Durand and Philippe Flajolet (RIP 2011), 2003
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
Flajolet, Fusy, Gandouet, Meunier, 2007
The (rudimentary, unrigorous) Intuition:
Flip fair coins
Longest streak of heads is length k, seen once
Probability of such a streak: p ≈ (½)^k
Expected number of such streaks in n tries: n · p = 1, with p = (½)^k => n ≈ 2^k
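The coin-flip intuition above can be checked with a quick simulation. This is only an illustrative sketch: the function name and the seed are arbitrary choices, and a single longest streak is a very noisy estimator, which is why the real algorithm averages over many registers.

```python
import random

def longest_head_streak(n_trials, rng):
    """Run n_trials experiments; each flips a fair coin until the first tail.

    Returns the longest run of heads seen in any single experiment.
    """
    longest = 0
    for _ in range(n_trials):
        streak = 0
        while rng.random() < 0.5:  # heads with probability 1/2
            streak += 1
        longest = max(longest, streak)
    return longest

rng = random.Random(42)  # fixed seed, chosen arbitrarily for repeatability
for n in (1_000, 100_000):
    k = longest_head_streak(n, rng)
    print(f"n = {n:>7}  longest streak k = {k:>2}  2^k = {2 ** k}")
```

Expect 2^k to land within a small power of two of n, not on it.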
9. How it works cont’d
1. Stream of int_64 => “good” hash => random {0,1}^64
2. Keep track of longest run of leading zeroes
3. Longest run of length k => cardinality ≈ 2^k
Crazy math business
Correct systematic bias with a derived constant
Stochastic averaging
Balls and bins correction
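The steps above can be sketched in a few dozen lines. This is a minimal illustration following the published HyperLogLog paper, not AK's production code. Assumptions: BLAKE2b stands in for the “good” hash, p = 12 index bits (4096 registers) is an arbitrary choice, and the class and method names are made up for this example. The constant `alpha` is the bias correction, the registers plus harmonic mean are the stochastic averaging, and the linear-counting branch is the balls-and-bins correction for small counts.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper (valid for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    @staticmethod
    def _hash64(value):
        # Assumption: BLAKE2b truncated to 64 bits as the "good" hash.
        data = value.to_bytes(8, "little", signed=True)
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def add(self, value):
        h = self._hash64(value)
        width = 64 - self.p
        idx = h >> width                    # first p bits choose the register
        rest = h & ((1 << width) - 1)       # remaining bits feed the zero run
        rank = width - rest.bit_length() + 1  # leading zeroes in rest, plus one
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def cardinality(self):
        # Harmonic mean of 2^register across all registers (stochastic averaging).
        z = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: linear counting on the empty registers.
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog(p=12)
for user_id in range(100_000):
    hll.add(user_id)
    hll.add(user_id)  # duplicates never change the sketch
print(round(hll.cardinality()))
```

With 4096 registers the paper's standard error is about 1.04 / sqrt(4096), roughly 1.6%, so the printed value should be close to 100,000 but not exact.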
10. Here’s what you get
Native:
union, cardinality
Implies:
intersection (!!!), set difference (!!!)
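How the implied operations fall out of the native ones can be sketched as follows. Union of two HLLs with the same register count is a register-wise max; intersection and set difference then come from inclusion-exclusion on cardinalities. The function names are hypothetical, and exact Python sets stand in for sketches in the demo so the arithmetic is easy to check.

```python
def union_registers(a, b):
    """Union of two HLL register arrays of equal length: register-wise max."""
    if len(a) != len(b):
        raise ValueError("sketches must use the same number of registers")
    return [max(x, y) for x, y in zip(a, b)]

def intersection_size(card_a, card_b, card_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return card_a + card_b - card_union

def difference_size(card_b, card_union):
    """Elements in A but not in B: |A or B| - |B|."""
    return card_union - card_b

# Demo with exact sets standing in for sketches.
a = set(range(0, 1000))
b = set(range(600, 1500))
union_card = len(a | b)
print(intersection_size(len(a), len(b), union_card))  # 400
print(difference_size(len(b), union_card))            # 600
print(union_registers([0, 3, 1], [2, 1, 1]))          # [2, 3, 1]
```

With real sketches each of the three cardinalities is an estimate, so their errors add up in the result.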
11. Show me the money!
Used in production at AK for a year
Accurate: count to a billion with 1-3% error
Small: a few KB each so we can keep 100s of M in memory
Fast: benched at 2M inserts/s, used in production at 100s of K/s
14. Implementation caveats
If you store an HLL for each key, you’ll likely be wasting space whenever most of
the registers are still unset. Use a map-based HLL or use compression.
Pick a good hash function!
Test on your data!
Tune parameters to suit your business needs!
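One way to read the “map-based HLL” advice above: keep only the non-zero registers in a map while a sketch is nearly empty, and switch to a full array once the map stops paying for itself. The sketch below is an assumption about how that could look, not a description of AK's implementation; the class name and the `promote_at` threshold are hypothetical.

```python
class SparseRegisters:
    """HLL register storage that starts as a map and promotes to a dense list."""

    def __init__(self, m, promote_at=None):
        self.m = m
        # Hypothetical default: promote once a quarter of the registers are set.
        self.promote_at = m // 4 if promote_at is None else promote_at
        self.sparse = {}    # register index -> rank, only for non-zero registers
        self.dense = None   # full list of m registers after promotion

    def update(self, idx, rank):
        """Keep the maximum rank seen for register idx."""
        if self.dense is not None:
            if rank > self.dense[idx]:
                self.dense[idx] = rank
            return
        if rank > self.sparse.get(idx, 0):
            self.sparse[idx] = rank
        if len(self.sparse) > self.promote_at:
            self._promote()

    def _promote(self):
        dense = [0] * self.m
        for idx, rank in self.sparse.items():
            dense[idx] = rank
        self.dense = dense
        self.sparse = {}

    def to_list(self):
        """Return all m registers as a list, whatever the current mode."""
        if self.dense is not None:
            return list(self.dense)
        out = [0] * self.m
        for idx, rank in self.sparse.items():
            out[idx] = rank
        return out

regs = SparseRegisters(m=4096)
regs.update(17, 3)
regs.update(17, 2)  # lower rank is ignored
print(len(regs.sparse), regs.to_list()[17])  # 1 3
```

The right threshold depends on the per-entry overhead of the map in your language, so measure on your own data.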
15. How we use them, in production
Original problem: fast, on-the-fly overlaps and unique counts
Solution:
streaming, in-memory aggregations shipped to Postgres
Postgres module to do set operations on binary representations in the DB
Freebie: PG analytics support like GROUP BY, sliding windows, etc…
17. How we use them, Ad Hoc
Outside of production: amazing ad-hoc analysis tool
Example: gathering more than a year’s worth of data for an RFP, at 20B
impressions/month
painless and quick when we had the data as sketches
much more effort to put it through Hadoop
Iterating on product and research is cheaper and faster.
Waiting minutes instead of seconds between iterations is painful.
18. “Soft” Caveats
Fixed N% error is deceiving
Additive error for set operations can balloon
Unbounded error sneaks in now and again
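A rough worked example of how additive error balloons for intersections computed by inclusion-exclusion. The numbers are hypothetical, and the bound is a pessimistic worst case that assumes each of the three estimates (|A|, |B|, |A ∪ B|) can be off by the same relative error in the unlucky direction.

```python
def worst_case_intersection_error(card_a, card_b, card_intersection, rel_err):
    """Worst-case error of |A| + |B| - |A or B| when each term is off by rel_err.

    Returns (absolute error bound, bound relative to the true intersection).
    """
    card_union = card_a + card_b - card_intersection
    # The three absolute errors add up in the worst case.
    abs_err = rel_err * (card_a + card_b + card_union)
    return abs_err, abs_err / card_intersection

# Hypothetical case: two sets of a billion users sharing ten million, 2% error each.
abs_err, rel = worst_case_intersection_error(
    1_000_000_000, 1_000_000_000, 10_000_000, 0.02
)
print(f"absolute error up to ~{abs_err:,.0f}, about {rel:.1f}x the true intersection")
```

Here a 2% error on each large estimate can swamp a small intersection entirely, which is why a fixed N% figure is misleading for set operations.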
19. Parting Advice
Test these on your data rigorously
Choose good hash functions
Tuning parameters are particularly sensitive
You’ll find all kinds of unexpected uses for them, so get building!
Bibliography blog post will be up in a bit!
21. Credits
All the adorable cartoons you saw in this presentation were taken from
http://sureilldrawthat.com/ and http://sureilldrawthat.tumblr.com/ and belong
to their creator.