No BS Data Salon #3:
Probabilistic Sketching
May 2012
                          Analytics + Attribution =
                            Actionable Insights
Outline

     What we do at AK
     What’s sketching?
     Our motivation for sketching
     Why should you sketch?
     Our case: unique counting
       How it works
       How well it works
       How we use them




2
Here’s what we do at AK.


                   Online ad analytics
      Compare the performance of different campaigns, inventory,
                    providers, creatives, etc.




                        Bottom Line:
    Give the advertisers insight into the performance of their ads.



3
Motivation

     High throughput: 10s of K/s => 100s of K/s
     High dimensionality: 100M+ reporting keys
     Easy aggregates: counters, scalars
     Hard aggregates: unique user counting, set operations


     No cheap or effective “online” solutions
       Streaming DBs (Truviso, Coral8, StreamBase) insufficient
       Warehouse appliances (Aster, custom PG) same
       Our data is immutable. Paying for unneeded ACID is silly.

     Offline solutions slow, operationally finicky.
     Not a bank. We don’t need to be perfect, just useful.

4
Why should you bother?




    SELECT campaign_id, COUNT(DISTINCT user_id)
    FROM access_logs
    GROUP BY campaign_id




5
What is probabilistic sketching?




     One-pass
     “Small” memory
     Probabilistic error




6
Our Case Study: unique counting

     Non-unique stream of ints
     Want to keep unique count, up to about a billion
     Want to do set operations (union, intersection, set difference)
     Straw Man #1: “Put them in a HashSet, and go away.”
     (Maybe) Straw Man #2: “Fine, keep a sample.”
     How we did it: HyperLogLog
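
To make the two straw men concrete, here is a minimal Python sketch. It is not AK's code; the function names, the 10% sampling rate and the toy streams are all assumptions chosen for illustration. It shows that the hash set is exact but grows with the number of distinct values, and that scaling up a uniform sample of events misestimates the count as soon as users repeat.

```python
import random


def exact_distinct(stream):
    """Straw Man #1: keep every value in a hash set (memory grows with cardinality)."""
    seen = set()
    for value in stream:
        seen.add(value)
    return len(seen)


def sampled_distinct(stream, rate, rng):
    """Straw Man #2: keep a uniform sample of events, then scale the distinct count up."""
    seen = set()
    for value in stream:
        if rng.random() < rate:
            seen.add(value)
    return len(seen) / rate


rng = random.Random(42)
once = list(range(10_000))                               # 10,000 users, one event each
repeated = [u for u in range(1_000) for _ in range(10)]  # 1,000 users, ten events each

exact_once = exact_distinct(once)
exact_repeated = exact_distinct(repeated)
est_once = sampled_distinct(once, 0.1, rng)          # close to the truth
est_repeated = sampled_distinct(repeated, 0.1, rng)  # several times too large
```

Both streams contain 10,000 events, yet the scaled sample only works for the one without repeats. Because the sampling does not know how often each user appears, it cannot be corrected with a single scale factor.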




7
How it works

                                     The Papers:
     LogLog Counting of Large Cardinalities
       Marianne Durand and Philippe Flajolet (RIP 2011), 2003

     HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
       Flajolet, Fusy, Gandouet, Meunier, 2007

               The (rudimentary, unrigorous) Intuition:

        Flip fair coins
        Longest streak of heads is length k, seen once
        Probability of streak ≈ (½)^k
        E[x] = n·p = 1, p = (½)^k => n ≈ 2^k
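
The coin-flip intuition can be checked with a small simulation. This is only an illustrative sketch: the function name, the seed and the choice of n = 2^16 experiments are assumptions. Each experiment flips a fair coin until the first tail, and the longest run of heads k across all experiments gives the rough estimate 2^k.

```python
import random


def longest_heads_streak(n_experiments, rng):
    """Run n experiments; each flips a fair coin until the first tail.

    Returns the longest run of heads seen in any single experiment.
    """
    longest = 0
    for _ in range(n_experiments):
        streak = 0
        while rng.random() < 0.5:  # heads: keep flipping
            streak += 1
        longest = max(longest, streak)
    return longest


rng = random.Random(2012)
n = 1 << 16                       # true number of experiments
k = longest_heads_streak(n, rng)
estimate = 2 ** k                 # rough guess at n from the longest streak alone
```

A single streak is only good to within a few powers of two, which is why the real algorithm averages over many registers.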
8
How it works cont’d

    1.   Stream of int_64 => “good” hash => random {0,1}^64
    2.   Keep track of longest run of leading zeroes
    3.   Longest run of length k => cardinality ≈ 2^k

     Crazy math business
         Correct systematic bias with a derived constant
         Stochastic averaging
         Balls and bins correction
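
The three steps and the corrections above can be sketched as a small HyperLogLog in Python. This is a minimal sketch under stated assumptions, not AK's implementation: the class and method names are made up, BLAKE2b truncated to 64 bits stands in for the “good” hash, and the “balls and bins correction” is taken to mean the small-range (linear counting) correction from the 2007 paper. The constant 0.7213 / (1 + 1.079 / m) is the paper's bias correction for m >= 128.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch; names and layout are illustrative only."""

    def __init__(self, log2m=13):
        self.p = log2m                 # number of index bits
        self.m = 1 << log2m            # number of registers
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(value):
        # Assumption: BLAKE2b truncated to 64 bits stands in for the "good" hash.
        digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, value):
        h = self._hash64(value)
        width = 64 - self.p
        index = h >> width                    # stochastic averaging: top bits pick a register
        rest = h & ((1 << width) - 1)         # remaining bits feed the run-length test
        rank = width - rest.bit_length() + 1  # leading zeroes in `rest`, plus one
        if rank > self.registers[index]:
            self.registers[index] = rank

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)      # bias-correction constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # small-range correction: estimate from the number of empty registers
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog(log2m=13)
for user_id in range(100_000):
    hll.add(user_id)
hll.add(12345)                     # a duplicate leaves the sketch unchanged
estimate = hll.cardinality()

small = HyperLogLog(log2m=13)
for user_id in range(100):
    small.add(user_id)
small_estimate = small.cardinality()
```

With a 64-bit hash the large-range correction from the paper is omitted here, since hash collisions are negligible at the cardinalities discussed in this deck.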




9
Here’s what you get




                     Native:
                union, cardinality

                    Implies:
      intersection (!!!), set difference (!!!)
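
Why union is native and the other two are only implied can be sketched as follows. The function names are assumptions for illustration. The union of two sketches of the same size is the register-wise maximum, while intersection and set difference are derived from cardinalities by inclusion-exclusion. Exact sets are used in the example so the identities can be checked; with sketches, each input would be an estimate.

```python
def union_registers(a, b):
    """Union of two HLL register arrays of the same size: register-wise max."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of registers")
    return [max(x, y) for x, y in zip(a, b)]


def intersection_size(card_a, card_b, card_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return card_a + card_b - card_union


def difference_size(card_union, card_b):
    """|A minus B| = |A or B| - |B|."""
    return card_union - card_b


merged = union_registers([0, 3, 1, 0], [2, 1, 1, 0])     # [2, 3, 1, 0]

set_a = set(range(0, 1000))
set_b = set(range(600, 1500))
card_a, card_b, card_union = len(set_a), len(set_b), len(set_a | set_b)
overlap = intersection_size(card_a, card_b, card_union)  # 400
only_a = difference_size(card_union, card_b)             # 600
```

Because the derived operations subtract estimates, their errors add up, which is what the exclamation marks are warning about.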




10
Show me the money!

      Used in production at AK for a year
      Accurate: count to a billion with 1-3% error
      Small: a few KB each so we can keep 100s of M in memory
      Fast: benched at 2M inserts/s, used in production at 100s of K/s
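
The size and accuracy figures can be related with a little arithmetic. The sketch below assumes 5 bits per register, which is an assumption that happens to match the “log2m=13 [5kB]” label on the plots in this deck, and uses the standard error of about 1.04/sqrt(m) given in the HyperLogLog paper. The function names are made up for illustration.

```python
import math


def hll_dense_bytes(log2m, register_bits=5):
    """Bytes needed to store all registers densely (register_bits is an assumption)."""
    return (1 << log2m) * register_bits // 8


def hll_standard_error(log2m):
    """Relative standard error of HyperLogLog with m = 2**log2m registers."""
    return 1.04 / math.sqrt(1 << log2m)


size = hll_dense_bytes(13)          # 5120 bytes, i.e. 5 kB
err = hll_standard_error(13)        # about 0.0115, i.e. roughly 1.15%
dense_total = 100_000_000 * size    # 512,000,000,000 bytes for 100M dense sketches
```

The last line shows why a dense layout alone would not fit 100s of millions of sketches in memory, and why sparse or compressed representations matter for keys with few users.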




11
Lies, damn lies, and boxplots!

     [Figure: Cardinality Relative Error vs True Cardinality, log2m=13 (5kB).
      x-axis: True Cardinality, 10^2 to 10^9. y-axis: HLL Cardinality RE, −4% to 4%.]

12
But wait, there’s more!
     [Figure: Intersection Error vs Magnitude Difference, log2m=13 (5kB).
      x-axis: Cardinality Order of Magnitude Difference, 0 to 3. y-axis: HLL Intersection Error, −40% to 40%.
      Series: factor(overlap_fraction), 0.1 to 1.]

13
Implementation caveats

      If you store an HLL for each key, you’ll likely be wasting space when not all of the
       registers are set. Use a map-based HLL or compression.
      Pick a good hash function!
      Test on your data!
      Tune parameters to suit your business needs!
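
One way to read “map-based HLL” is to store only the registers that have been set. The class below is a hypothetical sketch of that idea, not AK's storage format: the name SparseRegisters, its methods and the example values are all assumptions.

```python
class SparseRegisters:
    """Hypothetical map-based register store: only non-zero registers are kept."""

    def __init__(self, log2m=13):
        self.m = 1 << log2m
        self.nonzero = {}              # register index -> rank

    def update(self, index, rank):
        if not 0 <= index < self.m:
            raise IndexError("register index out of range")
        if rank > self.nonzero.get(index, 0):
            self.nonzero[index] = rank

    def to_dense(self):
        """Expand to a full register array, e.g. once most registers are set."""
        dense = [0] * self.m
        for index, rank in self.nonzero.items():
            dense[index] = rank
        return dense


sparse = SparseRegisters(log2m=13)
for index, rank in [(17, 3), (4096, 1), (17, 2), (8191, 5)]:
    sparse.update(index, rank)
stored = len(sparse.nonzero)           # 3 entries instead of 8192 slots
dense = sparse.to_dense()
```

A key that has seen only a handful of users then costs a few map entries rather than the full array. Once most registers are set the map costs more than the array, so switching to the dense form at some threshold is a tuning decision.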




14
How we use them, in production

      Original problem: fast, on-the-fly overlaps and unique counts
      Solution:
        streaming, in-memory aggregations shipped to Postgres
        Postgres module to do set operations on binary representations in the DB

      Freebie: PG analytics support like GROUP BY, sliding windows, etc…




15
UI example




              To the browser, Robin!




16
How we use them, Ad Hoc

      Outside of production: amazing ad-hoc analysis tool
      Example: gathering more than a year’s worth of data for an RFP, at 20B
       impressions/month
         painless and quick when we had the data as sketches
        much more effort to put it through Hadoop

      Iterating on product and research is cheaper and faster.
        Waiting minutes instead of seconds between iterations is painful.




17
“Soft” Caveats



      Fixed N% error is deceiving
      Additive error for set operations can balloon
      Unbounded error sneaks in now and again
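
A rough worked example of how additive error balloons, assuming the intersection is computed by inclusion-exclusion and that each of the three cardinality estimates can be off by the same relative error. The function name and the numbers are assumptions, and the result is a worst-case bound, not a prediction of typical error.

```python
def intersection_error_bound(card_a, card_b, card_intersection, rel_error):
    """Worst-case relative error of |A| + |B| - |A or B| when each term
    may be off by rel_error of its own size."""
    card_union = card_a + card_b - card_intersection
    absolute = rel_error * (card_a + card_b + card_union)
    return absolute / card_intersection


# Two sets of similar size, half overlapping: the bound stays modest.
similar = intersection_error_bound(1_000_000, 1_000_000, 500_000, 0.01)   # 0.07

# Three orders of magnitude apart: 1% per estimate can swamp the answer.
lopsided = intersection_error_bound(1_000_000, 1_000, 500, 0.01)          # about 40
```

In the second case the absolute error of the large set alone (about 10,000) is bigger than the whole small set, so a “1% sketch” says very little about a 500-element overlap. This matches the trend in the intersection plot, where error grows with the order-of-magnitude difference.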




18
Parting Advice

      Test these on your data rigorously
      Choose good hash functions
      Tuning parameters are particularly sensitive
      You’ll find all kinds of unexpected uses for them, so get building!
      Bibliography blog post will be up in a bit!




19
Questions?


                  @timonk
     timon@aggregateknowledge.com
      blog.aggregateknowledge.com




20
Credits

     All the adorable cartoons you saw in this presentation were taken from
     http://sureilldrawthat.com/ and http://sureilldrawthat.tumblr.com/ and belong
     to their creator.




21
