Bloom Filter

xuanzi.wp@taobao.com
      2011-11-18




Agenda

• A Membership Query Problem

• What is Bloom Filter

• Bloom Filter Math Theory

• Compression

• Application Scenario
Membership Query Problem

Problem Description

 Given an element E, query whether it
 belongs to a large set S.
  – As fast as possible

  – As small as possible
Membership Query Problem

Some Solutions
  • Hash table

    fast, but a large data structure
  • Bitmap index

    can it be made smaller?
Membership Query Problem

Tradeoff Solutions
   To obtain speed and size improvements,
   allow some probability of error.



         Bloom Filter

What is Bloom Filter
• Supports approximate set membership.
• Given a set S = {x_1, x_2, …, x_n}, construct a data
  structure to answer queries of the form “Is
  y in S?”
• The data structure should be:

     –Fast (faster than searching through S).
     –Small (smaller than an explicit representation).
• To obtain speed and size improvements,
  allow some probability of error.
     –False positives: y ∉ S but we report y ∈ S
     –False negatives: y ∈ S but we report y ∉ S
      (a standard Bloom filter never produces these)
What is Bloom Filter
Start with an m-bit array B, filled with 0s.

B   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Hash each item x_j in S k times. If H_i(x_j) = a, set B[a] = 1.

B   0  1  0  0  1  0  1  0  0  1  1  1  0  1  1  0

To check whether y is in S, check B at H_i(y) for i = 1..k. All k values must be 1.

B   0  1  0  0  1  0  1  0  0  1  1  1  0  1  1  0

A false positive is possible: all k values are 1, but y is not in S.

B   0  1  0  0  1  0  1  0  0  1  1  1  0  1  1  0

n items, m = cn bits, k hash functions
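The insert and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the one-byte-per-bit storage, and the use of salted SHA-256 digests to stand in for the k hash functions H_1..H_k are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array B and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        # One byte per bit keeps the sketch readable; a real filter packs bits.
        self.bits = bytearray(m)

    def _positions(self, item: str):
        # Assumption: H_i(x) is simulated by salting SHA-256 with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # If H_i(x) = a, set B[a] = 1.
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item: str) -> bool:
        # All k positions must be 1; a single 0 proves the item was never added.
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=64, k=3)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)

print("apple" in bf)   # True: added items are always reported as present
print("durian" in bf)  # usually False, but a false positive is possible
```

An added item is always found, because its k bits were set at insert time and bits are never cleared. Only items that were never added can be misreported.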
What is Bloom Filter
False Positive

[Figure: items A and B, hash functions hash1, hash2, and hash3, and the bit array 0 0 1 0 1 0 0 0 0 1 0.]
Bloom Filter Math Theory
• Pr(specific bit of filter is 0) is
            $p' \equiv (1 - 1/m)^{kn} \approx e^{-kn/m} \equiv p$
• If $\rho$ is the fraction of 0 bits in the filter, then the false
  positive probability is
    $(1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k = (1 - e^{-k/c})^k$
• The approximations are valid because $\rho$ is concentrated
  around $E[\rho]$.
     –A martingale argument suffices.
• Find the optimum at $k = (\ln 2)\,m/n$ by calculus.
     –So the optimal false positive probability is about $(0.6185)^{m/n}$.
n items, m = cn bits, k hash functions
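The calculus step behind the optimal k can be written out as follows. This is a sketch that treats k as a continuous variable (an assumption; in practice k must be an integer) and uses the approximation $p = e^{-kn/m}$ from above.

```latex
\begin{align*}
f(k) &= \left(1 - e^{-kn/m}\right)^k,
  \qquad \ln f(k) = k \ln\left(1 - e^{-kn/m}\right) \\
\frac{d}{dk} \ln f(k)
  &= \ln\left(1 - e^{-kn/m}\right)
   + \frac{kn}{m} \cdot \frac{e^{-kn/m}}{1 - e^{-kn/m}} \\
  &= \ln(1 - p) - \frac{p \ln p}{1 - p}
  \qquad \text{using } p = e^{-kn/m},\ \tfrac{kn}{m} = -\ln p \\
\frac{d}{dk} \ln f(k) = 0
  &\iff (1 - p)\ln(1 - p) = p \ln p,
  \quad \text{which holds at } p = \tfrac{1}{2} \\
p = \tfrac{1}{2}
  &\implies k = \frac{m}{n} \ln 2,
  \qquad f = \left(\tfrac{1}{2}\right)^{k}
    = \left(2^{-\ln 2}\right)^{m/n} \approx (0.6185)^{m/n}
\end{align*}
```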
Bloom Filter Math Theory

[Figure: false positive rate (vertical axis, 0 to 0.1) versus number of hash functions k (horizontal axis, 0 to 10) for m/n = 8. Optimal k = 8 ln 2 = 5.545...]
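The curve for m/n = 8 can be tabulated directly from the approximation f ≈ (1 − e^{−kn/m})^k. The sketch below is illustrative only; the function name `false_positive_rate` and the choice of k = 1..10 are assumptions matching the range of the plot.

```python
import math


def false_positive_rate(bits_per_item: float, k: int) -> float:
    """Approximate false positive rate (1 - e^{-k/c})^k for c = m/n bits per item."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


# Evaluate the approximation for m/n = 8 and k = 1..10 hash functions.
rates = {k: false_positive_rate(8.0, k) for k in range(1, 11)}
best_k = min(rates, key=rates.get)

for k, f in rates.items():
    print(f"k={k:2d}  f={f:.4f}")
print("best integer k:", best_k)
```

Because k must be an integer, the best choice is one of the two integers next to 8 ln 2; evaluating both is cheap.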
Bloom Filter Compression

Use BF on Network Transmission
  • A Bloom filter sent as a message should be small enough
    to be transmitted over the network.
  • Compressing a bit vector is easy:
    arithmetic coding gets close to the entropy.
  • Can Bloom filters be compressed?
Bloom Filter Compression
• Optimize to minimize the false positive rate.
    $p = \Pr[\text{cell is empty}] = (1 - 1/m)^{kn} \approx e^{-kn/m}$
    $f = \Pr[\text{false pos}] = (1 - p)^k \approx (1 - e^{-kn/m})^k$
    $k = (m \ln 2)/n$
• At $k = m(\ln 2)/n$, $p = 1/2$.
• The Bloom filter looks like a random string.
  – Can’t compress it.
  – $H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$, which equals 1 bit per cell at $p = 1/2$.
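A short numeric check shows why the optimally tuned filter is incompressible. The sketch assumes m/n = 8 purely for illustration; the helper names `entropy` and `empty_cell_probability` are made up for this example.

```python
import math


def entropy(p: float) -> float:
    """Binary entropy H(p) in bits; H(0) = H(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def empty_cell_probability(bits_per_item: float, k: float) -> float:
    """p = e^{-kn/m}, written in terms of bits per item m/n."""
    return math.exp(-k / bits_per_item)


bits_per_item = 8.0                    # assumed m/n for this example
k_opt = bits_per_item * math.log(2)    # k = (m ln 2) / n
p = empty_cell_probability(bits_per_item, k_opt)

print(p)           # very close to 0.5
print(entropy(p))  # very close to 1 bit per cell: nothing left to compress
```

With fewer hash functions than the optimum, p moves above 1/2, the entropy per cell drops below 1, and the bit vector becomes compressible.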
Bloom Filter Compression
• With a larger uncompressed size (storage),
  we can achieve compression.
• Assumption: optimal compressor, $z = mH(p)$.
    – $H(p)$ is the entropy function; optimally we get
      $H(p)$ compressed bits per original table bit.
    – Arithmetic coding is close to optimal.
• Optimization: given z bits for the compressed
  filter and n elements, choose the table size m
  and the number of hash functions k to
  minimize f, where
    $p \approx e^{-kn/m}$;  $f \approx (1 - e^{-kn/m})^k$;  $z \approx mH(p)$
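The optimization above can be explored numerically. The sketch below fixes a compressed budget z/n and a number of hash functions k, then searches for the uncompressed table size m/n at which an ideal compressor would produce exactly z bits. The bisection search and all function names are assumptions for illustration; the slides do not prescribe a solution method.

```python
import math


def entropy(p: float) -> float:
    """Binary entropy H(p) in bits; H(0) = H(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def compressed_bits_per_item(x: float, k: int) -> float:
    """z/n = (m/n) * H(p) with p = e^{-kn/m}, where x = m/n."""
    return x * entropy(math.exp(-k / x))


def table_size_for_budget(z_per_n: float, k: int) -> float:
    """Find x = m/n >= z/n with x * H(p) equal to the budget, by bisection."""
    lo = z_per_n                 # H(p) <= 1, so m can never be smaller than z
    hi = 2.0 * z_per_n
    while compressed_bits_per_item(hi, k) < z_per_n:
        hi *= 2.0                # grow the bracket until it contains a root
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if compressed_bits_per_item(mid, k) < z_per_n:
            lo = mid
        else:
            hi = mid
    return hi


def compressed_false_positive_rate(z_per_n: float, k: int) -> float:
    """f = (1 - e^{-kn/m})^k at the table size that compresses to z bits."""
    x = table_size_for_budget(z_per_n, k)
    return (1.0 - math.exp(-k / x)) ** k


for k in range(1, 7):
    x = table_size_for_budget(8.0, k)
    f = compressed_false_positive_rate(8.0, k)
    print(f"k={k}  m/n={x:6.2f}  f={f:.4f}")
```

For z/n = 8, a small k combined with a much larger uncompressed table gives a lower false positive rate than the uncompressed optimum of about (0.6185)^8 at the same transmitted size.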
Bloom Filter Compression

[Figure: false positive rate (vertical axis, 0 to 0.1) versus number of hash functions (horizontal axis, 0 to 10) for z/n = 8, with two curves: Original and Compressed.]
Bloom Filter Compression

Conclusion

• At k = m (ln 2) /n, false positives are
  maximized with a compressed Bloom
  filter.
  – Best case without compression is worst case
    with compression; compression always
    helps.
  – Side benefit: Use fewer hash functions with
    compression; possible speedup.
Application Scenario

• Speed up answers in a key-value-like system

    key     filter (memory)    storage (memory)
    key1    no
    key2    yes                disk access, success
    key3    yes                disk access, fail (false positive)
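The lookup path in this scenario can be sketched as follows. Everything here is hypothetical: `FilteredStore`, the in-memory dictionary standing in for slow storage, and the salted SHA-256 hashing are assumptions chosen only to make the three cases (key1, key2, key3) concrete.

```python
import hashlib
from typing import Optional


class BloomFilter:
    """Compact Bloom filter used as the in-memory guard."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Assumption: k hash functions simulated by salting SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for a in self._positions(key):
            self.bits[a] = 1

    def __contains__(self, key: str) -> bool:
        return all(self.bits[a] for a in self._positions(key))


class FilteredStore:
    """Key-value store that consults the filter before touching storage."""

    def __init__(self, data: dict, m: int, k: int) -> None:
        self._disk = dict(data)       # stand-in for slow on-disk storage
        self._filter = BloomFilter(m, k)
        for key in data:
            self._filter.add(key)
        self.disk_reads = 0           # counts simulated disk accesses

    def get(self, key: str) -> Optional[str]:
        if key not in self._filter:
            return None               # filter says "no": skip the disk entirely
        self.disk_reads += 1          # filter says "yes": pay for a disk access
        return self._disk.get(key)    # None here means a false positive


store = FilteredStore({"key2": "value2"}, m=1024, k=3)
print(store.get("key2"))   # value2, found after one disk access
print(store.get("key1"))   # None, normally without any disk access
print(store.disk_reads)
```

A miss reported by the filter is always correct, so the only wasted disk accesses are the false positives.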
Application Scenario

• Web Cache

      cache1    cache2    ……    cache3

                   Web Server
Q&A



