No BS Data Salon #3:
Probabilistic Sketching
May 2012
                          Analytics + Attribution =
                            Actionable Insights
Outline

     What we do at AK
     What’s sketching?
     Our motivation for sketching
     Why should you sketch?
     Our case: unique counting
       How it works
       How well it works
       How we use them




2
Here’s what we do at AK.


                   Online ad analytics
      Compare the performance of different campaigns, inventory,
                    providers, creatives, etc.




                        Bottom Line:
    Give the advertisers insight into the performance of their ads.



3
Motivation

     High throughput: 10s of K/s => 100s of K/s
     High dimensionality: 100M+ reporting keys
     Easy aggregates: counters, scalars
     Hard aggregates: unique user counting, set operations


     No cheap or effective “online” solutions
       Streaming DBs (Truviso, Coral8, StreamBase) insufficient
       Warehouse appliances (Aster, custom PG) same
       Our data is immutable. Paying for unneeded ACID is silly.

     Offline solutions slow, operationally finicky.
     Not a bank. We don’t need to be perfect, just useful.

4
Why should you bother?




    SELECT COUNT(DISTINCT user_id)
    FROM access_logs
    GROUP BY campaign_id




5
What is probabilistic sketching?




     One-pass
     “Small” memory
     Probabilistic error




6
Our Case Study: unique counting

     Non-unique stream of ints
     Want to keep unique count, up to about a billion
     Want to do set operations (union, intersection, set difference)
     Straw Man #1: “Put them in a HashSet, and go away.”
     (Maybe) Straw Man #2: “Fine, keep a sample.”
     How we did it: HyperLogLog
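
The two straw men can be made concrete with a small sketch. The function names, the synthetic stream, and the 1% event-sampling scheme below are illustrative assumptions, not something taken from the talk. The exact set is correct but grows with the number of distinct values, and one naive reading of "keep a sample" (sample events, then scale up) badly misjudges the unique count when users repeat.

```python
import random


def exact_distinct(stream):
    """Straw Man #1: exact answer, but memory grows with the distinct count."""
    return len(set(stream))


def naive_sampled_distinct(stream, rate, rng):
    """Straw Man #2 (one naive reading): sample events, then scale up."""
    sample = {x for x in stream if rng.random() < rate}
    return len(sample) / rate


# Hypothetical stream: 1,000 users, each seen 100 times, in random order.
rng = random.Random(0)
stream = [user for user in range(1000) for _ in range(100)]
rng.shuffle(stream)

print("exact:", exact_distinct(stream))
print("scaled 1% sample:", naive_sampled_distinct(stream, 0.01, rng))
```

Because most users survive a 1% event sample at least once when they appear 100 times, scaling the sampled distinct count by 1/rate overshoots by a large factor, while leaving it unscaled undershoots.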




7
How it works

                                     The Papers:
     LogLog Counting of Large Cardinalities
       Marianne Durand and Philippe Flajolet (RIP 2011), 2003

     HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
       Flajolet, Fusy, Gandouet, Meunier, 2007

               The (rudimentary, unrigorous) Intuition:

        Flip fair coins
        Longest streak of heads is length k, seen once
        Probability of streak ≈ (½)^k
        E[x] = n·p = 1, p = (½)^k => n ≈ 2^k
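
The coin-flip intuition can be checked with a quick simulation. This is a rough sketch under stated assumptions: `longest_streak` is a hypothetical helper, each "item" is modeled as a run of fair coin flips, and k is the longest run of heads seen before the first tail across n items.

```python
import math
import random


def longest_streak(n_items, rng):
    """Longest run of heads before the first tail, over n_items trials."""
    longest = 0
    for _ in range(n_items):
        streak = 0
        while rng.random() < 0.5:  # heads with probability 1/2
            streak += 1
        longest = max(longest, streak)
    return longest


rng = random.Random(42)
n = 1024
ks = [longest_streak(n, rng) for _ in range(200)]
print("log2(n) =", math.log2(n))
print("mean k  =", sum(ks) / len(ks))
```

Averaged over many runs, k lands close to log2(n), which is the sense in which n ≈ 2^k. A single run is noisy, and 2^k from one run can easily be off by a factor of two or more, which is what the corrections on the next slide address.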
8
How it works cont’d

    1.   Stream of int_64 => “good” hash => random {0,1}^64
    2.   Keep track of longest run of leading zeroes
    3.   Longest run of length k => cardinality ≈ 2^k

     Crazy math business
         Correct systematic bias with a derived constant
         Stochastic averaging
         Balls and bins correction
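
The steps above can be sketched as a minimal HyperLogLog. This is an illustration, not AK's production code. Assumptions: truncated SHA-1 stands in for the "good" hash, the constant `alpha` and the 2.5·m threshold follow the HyperLogLog paper, and linear counting on empty registers is our reading of the "balls and bins correction".

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HLL sketch: 2**log2m registers, each holding a max rank."""

    def __init__(self, log2m=13):
        self.log2m = log2m
        self.m = 1 << log2m
        self.registers = [0] * self.m
        # Bias-correction constant from the HLL paper (valid for m >= 128).
        self.alpha = 0.7213 / (1.0 + 1.079 / self.m)

    def add(self, value):
        # Assumption: SHA-1 truncated to 64 bits plays the "good" hash.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        tail_bits = 64 - self.log2m
        # Stochastic averaging: the top log2m bits pick a register.
        index = h >> tail_bits
        tail = h & ((1 << tail_bits) - 1)
        # Rank = number of leading zeros in the tail, plus one.
        rank = tail_bits - tail.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def cardinality(self):
        # alpha * m * (harmonic mean of 2**register over all registers).
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-range correction: linear counting on empty registers.
            return self.m * math.log(self.m / zeros)
        return raw
```

Usage is `hll = HyperLogLog(13)`, then `hll.add(x)` for each item and `hll.cardinality()` at any point. Re-adding an item never changes the registers, which is why duplicates in the stream do not inflate the count.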




9
Here’s what you get




                     Native:
                union, cardinality

                    Implies:
      intersection (!!!), set difference (!!!)
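
A sketch of how the "implied" operations follow from the native ones. Assumptions: registers are plain lists of equal length, the union of two sketches is the register-wise maximum (standard for HLL), and the helper names are hypothetical. Intersection and set difference then come from inclusion-exclusion on three cardinalities.

```python
def union_registers(a, b):
    """Union of two HLLs with the same parameters: register-wise max."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of registers")
    return [max(x, y) for x, y in zip(a, b)]


def intersection_size(card_a, card_b, card_union):
    """|A ∩ B| = |A| + |B| - |A ∪ B|"""
    return card_a + card_b - card_union


def difference_size(card_union, card_b):
    """|A \\ B| = |A ∪ B| - |B|"""
    return card_union - card_b


# Check the identities against exact sets.
A = set(range(0, 100))
B = set(range(60, 150))
print(intersection_size(len(A), len(B), len(A | B)), len(A & B))
print(difference_size(len(A | B), len(B)), len(A - B))
```

With sketches, each of the three cardinalities is an estimate, so the errors of all three end up in the derived number.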




10
Show me the money!

      Used in production at AK for a year
      Accurate: count to a billion with 1-3% error
      Small: a few KB each so we can keep 100s of M in memory
      Fast: benched at 2M inserts/s, used in production at 100s of K/s
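
A back-of-the-envelope check of the size and accuracy figures. Assumptions: 5 bits per register (consistent with the "log2m=13 [5kB]" label on the plots in this deck, but not stated in the slides) and the 1.04/√m standard error from the HyperLogLog paper. The function names are hypothetical.

```python
import math


def hll_size_bytes(log2m, bits_per_register=5):
    """Dense register array size, ignoring any header or padding."""
    return (1 << log2m) * bits_per_register // 8


def hll_standard_error(log2m):
    """Asymptotic relative standard error from the HLL paper: 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(1 << log2m)


print(hll_size_bytes(13))       # bytes for log2m = 13
print(hll_standard_error(13))   # relative standard error for log2m = 13
```

For log2m=13 this gives 5,120 bytes and a standard error of about 1.15%, in line with "a few KB" and "1-3% error".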




11
Lies, damn lies, and boxplots!

     [Figure: boxplots of HLL cardinality relative error (y-axis, −4% to 4%)
      vs true cardinality (x-axis, 10^2 to 10^9), log2m=13 [5kB]]

12
But wait, there’s more!

     [Figure: HLL intersection error (y-axis, −40% to 40%) vs cardinality
      order-of-magnitude difference (x-axis, 0 to 3), grouped by
      factor(overlap_fraction) from 0.1 to 1, log2m=13 [5kB]]

13
Implementation caveats

      If you store a dense HLL for each key, you’ll likely waste space whenever most
       registers are still unset. Use a map-based HLL or compression.
      Pick a good hash function!
      Test on your data!
      Tune parameters to suit your business needs!
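
One way to read "map-based HLL" is to store only the registers that have been set. The class below is a hypothetical sketch of that idea, not the representation AK used: a dict from register index to rank, with a conversion to the dense array for when the map stops paying off.

```python
class SparseRegisters:
    """Stores only nonzero HLL registers as {index: rank}."""

    def __init__(self, m):
        self.m = m          # total number of registers in the dense form
        self.nonzero = {}   # register index -> max rank seen

    def update(self, index, rank):
        # Same max-update rule as a dense HLL register.
        if rank > self.nonzero.get(index, 0):
            self.nonzero[index] = rank

    def to_dense(self):
        # Materialize the full register array, zeros included.
        dense = [0] * self.m
        for index, rank in self.nonzero.items():
            dense[index] = rank
        return dense


regs = SparseRegisters(m=8192)
regs.update(3, 2)
regs.update(3, 1)   # lower rank, ignored
regs.update(7, 5)
print(len(regs.nonzero), "of", regs.m, "registers stored")
```

For keys that only ever see a handful of users, the map holds a few entries instead of thousands of mostly empty registers.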




14
How we use them, in production

      Original problem: fast, on-the-fly overlaps and unique counts
      Solution:
        streaming, in-memory aggregations shipped to Postgres
        Postgres module to do set operations on binary representations in the DB

      Freebie: PG analytics support like GROUP BY, sliding windows, etc…




15
UI example




              To the browser, Robin!




16
How we use them, Ad Hoc

      Outside of production: amazing ad-hoc analysis tool
      Example: gathering more than a year’s worth of data for an RFP, at 20B
       impressions/month
         painless and quick when we had the data as sketches
        much more effort to put it through Hadoop

      Iterating on product and research is cheaper and faster.
        Waiting minutes instead of seconds between iterations is painful.




17
“Soft” Caveats



      Fixed N% error is deceiving
      Additive error for set operations can balloon
      Unbounded error sneaks in now and again
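
A small worked example of how additive error can balloon. This is a worst-case illustration under a stated assumption (each of the three estimates in the inclusion-exclusion formula is off by the same relative error, in the unlucky direction), not the analysis behind the plots. The function name and the numbers are hypothetical.

```python
def worst_case_intersection_error(card_a, card_b, card_union, rel_err):
    """Worst-case relative error of |A| + |B| - |A ∪ B|.

    Each term may be off by rel_err times its own size, and the absolute
    errors add up, while the true intersection may be tiny by comparison.
    """
    abs_err = rel_err * (card_a + card_b + card_union)
    true_intersection = card_a + card_b - card_union
    return abs_err / true_intersection


# Similar sizes: 1M and 1M users, 500K shared.
print(worst_case_intersection_error(1_000_000, 1_000_000, 1_500_000, 0.01))

# Three orders of magnitude apart: 1M and 1K users, 500 shared.
print(worst_case_intersection_error(1_000_000, 1_000, 1_000_500, 0.01))
```

With 1% error per sketch, the first case stays around 7% while the second can reach roughly 40 times the true intersection, because 1% of a million dwarfs an overlap of 500.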




18
Parting Advice

      Test these on your data rigorously
      Choose good hash functions
      Tuning parameters are particularly sensitive
      You’ll find all kinds of unexpected uses for them, so get building!
      Bibliography blog post will be up in a bit!




19
Questions?


                  @timonk
     timon@aggregateknowledge.com
      blog.aggregateknowledge.com




20
Credits

     All the adorable cartoons you saw in this presentation were taken from
     http://sureilldrawthat.com/ and http://sureilldrawthat.tumblr.com/ and belong
     to their creator.




21
