Andrii Gakhov, PhD
Exceeding Classical:
Probabilistic Data Structures
in Data-Intensive Applications
EuroSciPy 2019

Bilbao, Spain
Andrii Gakhov
Senior Software Engineer

at Ferret Go GmbH, Germany
Ph.D. in Mathematical Modelling, 

M.Sc. in Applied Mathematics
Twitter: @gakhov | Website: gakhov.com
Probabilistic Data Structures
and Algorithms

for Big Data Applications
ISBN: 9783748190486

https://pdsa.gakhov.com
0. Motivation
Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications
EuroSciPy 2019Andrii Gakhov @gakhov
Bioinformatics: Counting k-mers in DNA
Counting substrings of length k in DNA sequence data (k-mers) is
essential in bioinformatics, for instance, for metagenomic sequencing.
A large fraction of the storage is spent on k-mers that contain sequencing
errors and are observed only once in the data*.
Can we efficiently avoid persisting such invalid substrings?
Can we efficiently count valid substrings?
* Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinformatics 12(1), 333, 2011
For example, the team that sequenced the giant panda genome needed to
count 8.62 billion 27-mers, where 68% were low-coverage k-mers.
1. Data-Intensive Applications in the Big Data Epoch
What is Big Data?
In 2001, Doug Laney described Big Data as datasets that contain greater
variety, arrive in increasing volumes, and come with ever-higher velocity.
Today this is known as the famous 3 V's of Big Data.
Volume: expresses the amount of data
Velocity: describes the speed at which data is arriving
Variety: refers to the number of types of data
What is Big Data?
Big Data is more than simply a
matter of size.
Big Data does not refer to data, it
refers to technology.
The datasets of Big Data are larger, more complex, and
generated more rapidly than our current resources can handle.
Image: https://www.freepngimg.com/electronics/technology
2. Probabilistic Data Structures and Algorithms
Probabilistic Data Structures and Algorithms (PDSA)
A family of advanced approaches that are optimized to
use sublinear memory and constant execution time.
They cannot provide exact answers and have some probability of error.
[Figure: error vs. resources]
The tradeoff between the error and the resources is another feature
that distinguishes the algorithms and data structures of this family.
PDSA in Big Data Ecosystem
Membership (keep track of indexed elements): Bloom Filter, Quotient Filter, Cuckoo Filter
Counting (find the number of unique elements): Linear Counting, FM Sketch, LogLog, HyperLogLog
Frequency (estimate frequencies of elements): Count-Min Sketch, Count Sketch
Rank (approximate percentiles and quantiles): Random Sampling, t-digest, q-digest, Greenwald-Khanna
Similarity (find similar documents): MinHash, SimHash, LSH
PDSA in Apache Spark SQL (PySpark interface)
q-quantile estimation (Greenwald-Khanna)
    # pyspark.sql.DataFrameStatFunctions(df).approxQuantile
    # arguments: column, list of probabilities, relative error
    df.approxQuantile("language", [0.5], 0.25)
Approximate number of distinct elements (HyperLogLog++)
    # pyspark.sql.functions.approx_count_distinct
    from pyspark.sql.functions import approx_count_distinct
    df.agg(approx_count_distinct(df.language).alias('lang')).collect()
Spark SQL is Apache Spark's module for working with structured data.
PDSA in Production
3. Frequency
Frequency: Challenge
A hashtag is used to index a topic on Twitter and allows people to easily follow
items they are interested in. Hashtags are usually written with a # symbol in front.
Find the most trending hashtags on Twitter
About 6,000 tweets are created on Twitter every second,
that is roughly 500 million tweets daily
Most tweets are linked with one or more hashtags
https://www.internetlivestats.com/twitter-statistics/
Frequency: Traditional Approach
Build a table that lists all elements seen so far
with their corresponding counters
Increment the counter when an element arrives again,
or add a new element to the table and
initialize its counter
Return the value of the counter that
corresponds to the element as its frequency
requires linear memory
requires O(n) lookup time (worst case)
huge overhead for heavy hitters search
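The traditional approach can be sketched with an ordinary hash table. This is an illustration rather than code from the talk; the function name `exact_frequencies` and the sample hashtags are made up for the example.

```python
from collections import Counter


def exact_frequencies(stream):
    """Count every element exactly; memory grows with the number of unique elements."""
    counts = Counter()
    for item in stream:
        counts[item] += 1  # adds a new element or increments its existing counter
    return counts


# Hypothetical sample data standing in for a stream of hashtags.
hashtags = ["python", "java", "python", "rust", "python", "java"]
counts = exact_frequencies(hashtags)
print(counts["python"])       # frequency of a single element
print(counts.most_common(2))  # heavy hitters: requires looking at all counters
```

Every unique hashtag keeps its own counter here, which is exactly the linear memory cost that the sketches in this section try to avoid.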
Frequency: Challenges for Big Data streams
Continuous data streams
potentially unbounded number of unique elements ➡ sublinear (polylogarithmic at most) space
not feasible to re-process data streams ➡ one-pass algorithms preferred
high throughput ➡ fast updates
Image: https://www.pngfind.com
Count-Min Sketch
a simple space-efficient probabilistic data structure that is used to estimate
frequencies of elements in data streams and can address the Heavy hitters problem
presented by Graham Cormode and Shan Muthukrishnan in 2003
Frequency: Estimation with a single counter
[Figure: a single array of m counters, all initialized to 0. The hash function h(x) maps each incoming element to one position, and the counter at that position is incremented by 1. A second element sets another counter to 1; when the first element arrives again, its counter goes from 1 to 2.]
Frequency: Estimation with Count-Min Sketch
[Figure: a CMSketch consists of k counter arrays (counter 1, counter 2, …, counter k) with m counters each and hash functions h1, h2, …, hk. To add an element x, the counter at position hi(x) is incremented by 1 in every array i.]
Frequency: Estimation with Count-Min Sketch
[Figure: to estimate the frequency of x, read the counters at positions h1(x), h2(x), …, hk(x), here 1, 3, …, 5, and take the minimum.]
$f(x) = \min(1, 3, \ldots, 5) = 1$
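The update and query steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pdsa implementation: the class name `SimpleCountMinSketch` is hypothetical, and the k hash functions are simulated by salting BLAKE2b with the row number.

```python
import hashlib


class SimpleCountMinSketch:
    """k counter arrays of m counters each, one hash function per array."""

    def __init__(self, num_of_counters, length_of_counter):
        self.k = num_of_counters
        self.m = length_of_counter
        self.counters = [[0] * self.m for _ in range(self.k)]

    def _index(self, row, element):
        # Assumption for this sketch: hash function h_row is BLAKE2b
        # salted with the row number, reduced modulo m.
        digest = hashlib.blake2b(
            element.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.m

    def add(self, element):
        # Increment one counter in each of the k arrays.
        for row in range(self.k):
            self.counters[row][self._index(row, element)] += 1

    def frequency(self, element):
        # Collisions can only inflate counters, so the minimum is the best estimate.
        return min(
            self.counters[row][self._index(row, element)]
            for row in range(self.k)
        )


cms = SimpleCountMinSketch(5, 2000)
for tag in ["python"] * 3 + ["java"] * 2 + ["rust"]:
    cms.add(tag)
print(cms.frequency("python"))  # never below the true count
```

Because counters are only ever incremented, the estimate can overcount but never undercount, which is why the minimum across the k arrays is returned.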
Frequency: Invoking Count-Min Sketch from Python
    import json
    from pdsa.frequency.count_min_sketch import CountMinSketch

    cms = CountMinSketch(5, 2000)  # 5 counter arrays of 2000 counters each
    with open('tweets.txt') as f:
        for line in f:
            hashtag = json.loads(line)['hashtag']
            cms.add(hashtag)
    print('Frequency of #Python', cms.frequency("Python"))
    size_in_bytes = cms.sizeof()
    print('Size in bytes', size_in_bytes)  # ~40 KB with 32-bit counters
4. Counting
Counting: Challenge
Count the number of unique visitors
Amazon and eBay had about 3.375 billion* visitors in June 2019
Assume 337 million unique IP addresses (128 bits per IPv6 record)
5.4 GB of memory just to store them all
*SimilarWeb.com data for June 2019
What if we could count them using only 12 KB?
Image: https://www.cleanpng.com
Counting: Traditional Approach
Build a list of all unique elements
Sort / search to avoid listing elements twice
Count the elements in the list
requires linear memory
requires O(n·log n) time
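For comparison with the sketches that follow, the sort-based variant of the traditional approach might look like this. The function name `count_unique_sorted` and the sample addresses are hypothetical.

```python
def count_unique_sorted(items):
    """Exact distinct count: sort, then skip equal neighbours."""
    ordered = sorted(items)  # O(n log n) time, O(n) memory
    unique = 0
    previous = object()      # sentinel that never equals a real element
    for item in ordered:
        if item != previous:  # first occurrence of this value
            unique += 1
            previous = item
    return unique


# Hypothetical sample data standing in for visitor IP addresses.
visitors = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
print(count_unique_sorted(visitors))
```

The answer is exact, but the whole input has to be held in memory and sorted first.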
Counting: Approximate Counting
@katyperry has 107,287,629 followers
Would you really care if she has 107.2, 108.0, or 106.7 million followers?
HyperLogLog
a hash-based probabilistic algorithm for counting the number of distinct
values in the presence of duplicates
proposed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007
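Counting: Estimation with a single counter (Flajolet, Martin)
[Figure: each element x is hashed with h(x) and the hash value is written in binary (LSB-0). rank(x) is the position of the least significant 1-bit: 0 0 0 0 1 1 … gives rank(x) = 4, and 1 1 0 0 0 1 … gives rank(x) = 0. The FM Sketch is a bit array with positions 0, 1, 2, …, m-1 in which bit rank(x) is set to 1 for every indexed element. In the example the array holds 1 0 0 0 1 … 0, so the lowest unset position is R = 1.]
$n \approx \frac{2^{R}}{0.77351}$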
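The single-counter estimate can be sketched as follows. This is a minimal illustration, not the author's implementation: the names `hash32`, `rank` and `SimpleFMSketch` are hypothetical, and the first 4 bytes of SHA-256 stand in for the hash function h.

```python
import hashlib

PHI = 0.77351  # correction factor from the formula above


def hash32(element):
    # Assumption: the first 4 bytes of SHA-256 serve as a 32-bit hash.
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def rank(value, width=32):
    """Position of the least significant 1-bit (number of trailing zeros)."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1


class SimpleFMSketch:
    def __init__(self, width=32):
        self.width = width
        self.bitmap = 0  # bit i is set once some element had rank i

    def add(self, element):
        r = rank(hash32(element), self.width)
        if r < self.width:
            self.bitmap |= 1 << r

    def count(self):
        # R is the position of the lowest bit that is still 0.
        R = 0
        while R < self.width and (self.bitmap >> R) & 1:
            R += 1
        return (2 ** R) / PHI


fm = SimpleFMSketch()
for i in range(1000):
    fm.add(f"visitor-{i}")
print(fm.count())
```

A single bitmap only yields estimates of the form 2^R / 0.77351, so the result is coarse; combining many counters, as HyperLogLog does, is what brings the error down.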
Counting: Estimation with HyperLogLog
[Figure: the HLL Sketch is an array of k registers (1, 2, …, k) with hash functions h1, h2, …, hk. For an element x, each hash value hi(x) is written in binary (LSB-0) and its rank is computed: 1 1 0 0 0 1 … gives rank1(x) = 0, 0 0 0 0 0 1 … gives rank2(x) = 5, and 0 0 1 1 0 1 … gives rankk(x) = 2. Register i is updated with ranki(x) iff it is bigger than the existing value; the example registers hold 0, 5, …, 2.]
$n \approx \alpha \cdot k \cdot 2^{\mathrm{AVG}(HLL_i)}$
Counting: HyperLogLog Algorithm
Based on a single 32-bit hash function
Simulates k hash functions using the stochastic averaging approach
hash(x) = 32-bit hash value, split into p addressing bits and (32 - p) rank computation bits
Stores only $k = 2^{p}$ counters (registers), about 4 bytes each
Memory is always fixed, regardless of the number of unique elements
More counters provide less error (memory/accuracy trade-off)
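The stochastic averaging idea above can be sketched in plain Python. This is an illustration, not the pdsa implementation: the class name `SimpleHyperLogLog` is hypothetical, the first 4 bytes of SHA-256 stand in for the 32-bit hash, and ranks are counted from 1. It also uses the harmonic-mean estimator and small-range correction from the original HyperLogLog paper rather than the simplified averaging formula shown in the talk, and it omits the large-range correction.

```python
import hashlib
import math


class SimpleHyperLogLog:
    HASH_BITS = 32  # a single 32-bit hash function

    def __init__(self, precision):
        if not 4 <= precision <= 16:
            raise ValueError("precision must be between 4 and 16")
        self.p = precision
        self.k = 1 << precision  # k = 2^p registers
        self.registers = [0] * self.k

    @staticmethod
    def _hash32(element):
        # Assumption: the first 4 bytes of SHA-256 serve as the 32-bit hash.
        digest = hashlib.sha256(element.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")

    def add(self, element):
        h = self._hash32(element)
        rest_bits = self.HASH_BITS - self.p
        index = h >> rest_bits             # p addressing bits
        rest = h & ((1 << rest_bits) - 1)  # (32 - p) rank computation bits
        if rest == 0:
            rank = rest_bits + 1
        else:
            rank = (rest & -rest).bit_length()  # 1-based position of the lowest 1-bit
        if rank > self.registers[index]:        # update iff bigger than existing value
            self.registers[index] = rank

    def _alpha(self):
        # Bias-correction constants from the HyperLogLog paper.
        if self.k == 16:
            return 0.673
        if self.k == 32:
            return 0.697
        if self.k == 64:
            return 0.709
        return 0.7213 / (1 + 1.079 / self.k)

    def count(self):
        # Harmonic-mean estimator over all registers.
        raw = self._alpha() * self.k ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.k and zeros:
            # Small-range correction (linear counting).
            return self.k * math.log(self.k / zeros)
        return raw


hll = SimpleHyperLogLog(precision=10)
for i in range(20000):
    hll.add(f"visitor-{i}")
print(round(hll.count()))
```

The register array never grows: with precision 10 it stays at 1024 entries no matter how many elements are added, and adding duplicates leaves it unchanged.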
Counting: Invoking HyperLogLog from Python
    import json
    from pdsa.cardinality.hyperloglog import HyperLogLog

    hll = HyperLogLog(precision=10)  # 2^10 = 1024 counters
    with open('visitors.txt') as f:
        for line in f:
            ip = json.loads(line)['ip']
            hll.add(ip)
    num_of_unique_visitors = hll.count()
    print('Unique visitors', num_of_unique_visitors)
    size_in_bytes = hll.sizeof()
    print('Size in bytes', size_in_bytes)  # ~4 KB
Counting: Distinct Count in Redis
Redis uses the HyperLogLog data structure to count unique elements in a set
requires a small, constant amount of memory: 12 KB per data structure
approximates the exact cardinality with a standard error of 0.81%
redis> PFADD hll python java ruby
(integer) 1
redis> PFADD hll python python python
(integer) 0
redis> PFADD hll java ruby
(integer) 0
redis> PFCOUNT hll
(integer) 3
http://antirez.com/news/75
5. Final Notes
Final Notes
Think about Big Data as a technology challenge
Instead of buying new servers, learn new algorithms
Believe in hashing! Sampling vs. hashing.
Probabilistic Data Structures and Algorithms become useful when your problem fits
Image: https://longfordpc.com/
Read More
[book] Probabilistic Data Structures and Algorithms for Big Data Applications 

https://pdsa.gakhov.com
[repo] Probabilistic Data Structures and Algorithms in Python

https://github.com/gakhov/pdsa
Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure 

https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
Redis new data structure: the HyperLogLog

http://antirez.com/news/75
Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles 

https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
Big Data with Sketchy Structures 

https://towardsdatascience.com/b73fb3a33e2a
Count-Min Sketch 

http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf
Website: www.gakhov.com
Twitter: @gakhov
Probabilistic Data Structures and
Algorithms for Big Data Applications
pdsa.gakhov.com
Eskerrik asko! (Thank you!)
6. Additional Slides
(for those who want more)
Counting: Interactive Presentation of HyperLogLog
Counting: Accuracy vs Memory Tradeoff in HyperLogLog
More counters require more memory (4 bytes per counter)
More counters need more bits for addressing them ($m = 2^{p}$)
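The tradeoff can be tabulated with a short script. The 4 bytes per counter come from the slide; the standard error of about 1.04/√m is the figure from the original HyperLogLog paper and is an assumption added here, as is the helper name `hll_tradeoff`.

```python
import math


def hll_tradeoff(precision, bytes_per_counter=4):
    """Return (counters, standard error, memory in bytes) for m = 2^p counters."""
    m = 2 ** precision
    # Assumption: standard error of HyperLogLog is about 1.04 / sqrt(m).
    standard_error = 1.04 / math.sqrt(m)
    memory_bytes = m * bytes_per_counter
    return m, standard_error, memory_bytes


for p in (4, 8, 10, 12, 14):
    m, err, mem = hll_tradeoff(p)
    print(f"p={p:2d}  counters={m:6d}  error~{err:.2%}  memory={mem} bytes")
```

With p = 14 (16384 counters) the estimated error is about 0.81%, which matches the standard error quoted for Redis earlier in the talk.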
Counting: HyperLogLog++ Algorithm
HyperLogLog++ is an improved version of HyperLogLog
developed at Google and proposed in 2013
uses a 64-bit hash function, which allows counting more values
better bias correction using pre-trained data
a sparse representation of the counters
(registers) to reduce memory requirements

Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 

Último (20)

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 

Exceeding Classical: Probabilistic Data Structures in Data Intensive Applications

  • 1. Andrii Gakhov, PhD. Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications. EuroSciPy 2019, Bilbao, Spain
  • 2. Andrii Gakhov, Senior Software Engineer at Ferret Go GmbH, Germany. Ph.D. in Mathematical Modelling, M.Sc. in Applied Mathematics. Twitter: @gakhov | Website: gakhov.com. Probabilistic Data Structures and Algorithms for Big Data Applications (ISBN: 9783748190486), https://pdsa.gakhov.com
  • 3. 0. Motivation
  • 4. Bioinformatics: Counting k-mers in DNA. Counting substrings of length k in DNA sequence data (k-mers) is essential in bioinformatics, for instance, for metagenomic sequencing. A large fraction of the storage is spent on k-mers that contain sequencing errors and are observed only a single time in the data*. For example, the team that sequenced the giant panda genome needed to count 8.62 billion 27-mers, where 68% were low-coverage k-mers. Can we efficiently avoid persisting such invalid substrings? Can we efficiently count valid substrings? * Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 12(1), 333, 2011
  • 5. 1. Data-Intensive Applications in Big Data epoch
  • 6. What is Big Data? Doug Laney in 2001 described Big Data datasets as those that contain greater variety, arriving in increasing volumes and with ever-higher velocity. Today this is known as the famous 3V’s of Big Data. Volume expresses the amount of data; Velocity describes the speed at which data is arriving; Variety refers to the number of types of data.
  • 7. What is Big Data? Big Data is more than simply a matter of size. Big Data does not refer to data, it refers to technology. The datasets of Big Data are larger, more complex, and generated more rapidly than our current resources can handle. Image: https://www.freepngimg.com/electronics/technology
  • 8. 2. Probabilistic Data Structures and Algorithms
  • 9. Probabilistic Data Structures and Algorithms (PDSA). A family of advanced approaches that are optimized to use sublinear memory and constant execution time. They cannot provide exact answers and have some probability of error. The tradeoff between the error and the resources is another feature that distinguishes the algorithms and data structures of this family.
  • 10. PDSA in Big Data Ecosystem. Membership (keep track of indexed elements): Bloom Filter, Quotient Filter, Cuckoo Filter. Counting (find the number of unique elements): Linear Counting, FM Sketch, LogLog, HyperLogLog. Frequency (estimate frequencies of elements): Count-Min Sketch, Count Sketch. Rank (approximate percentiles and quantiles): Random Sampling, t-digest, q-digest, Greenwald-Khanna. Similarity (find similar documents): MinHash, SimHash, LSH.
  • 11. PDSA in Apache Spark SQL (PySpark interface). Spark SQL is Apache Spark's module for working with structured data.

        from pyspark.sql.functions import approx_count_distinct

        # df is an existing DataFrame with a "language" column

        # q-quantile estimation (Greenwald-Khanna)
        # pyspark.sql.DataFrameStatFunctions(df).approxQuantile
        df.approxQuantile("language", [0.5], 0.25)

        # Approximate number of distinct elements (HyperLogLog++)
        # pyspark.sql.functions.approx_count_distinct
        df.agg(approx_count_distinct(df.language).alias('lang')).collect()
  • 13. 3. Frequency
  • 14. Frequency: Challenge. Find the most trending hashtags on Twitter. A hashtag is used to index a topic on Twitter and allows people to easily follow items they are interested in. Hashtags are usually written with a # symbol in front. Every second about 6000 tweets are created on Twitter, that is roughly 500 million items daily. Most tweets are linked with one or more hashtags. Source: https://www.internetlivestats.com/twitter-statistics/
  • 15. Frequency: Traditional Approach. Build a table that lists all elements seen thus far with corresponding counters. When a new element comes, increment its counter, or add the element to the table and initialize its counter. Return the value of the counter that corresponds to the element as its frequency. Drawbacks: requires linear memory; requires O(n) lookup time (worst case); huge overhead for heavy hitters search.
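The traditional approach on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `exact_frequencies` and the sample hashtags are assumptions, not part of the talk.

```python
from collections import Counter


def exact_frequencies(stream):
    """Exact frequency table: one counter per distinct element (linear memory)."""
    table = Counter()
    for element in stream:
        # An unseen element is added with count 1, a seen one is incremented.
        table[element] += 1
    return table


# Hypothetical sample data standing in for a stream of hashtags.
hashtags = ["python", "scipy", "python", "bilbao", "python", "scipy"]
table = exact_frequencies(hashtags)

print(table["python"])        # frequency of a single element
print(table.most_common(2))   # heavy hitters require scanning the whole table
```

The table grows with the number of distinct elements, which is exactly the cost the sketches in the following slides avoid.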
  • 16. Frequency: Challenges for Big Data. Continuous data streams:
 potentially unbounded number of unique elements ➡ sublinear (polylogarithmic at most) space;
 not feasible to re-process data streams ➡ one-pass algorithms preferred;
 high-frequency throughput ➡ fast updates.
 Image: https://www.pngfind.com
  • 17. Count-Min Sketch: a simple space-efficient probabilistic data structure that is used to estimate frequencies of elements in data streams and can address the heavy hitters problem. Presented by Graham Cormode and Shan Muthukrishnan in 2003.
  • 18. Frequency: Estimation with a single counter. [Diagram: a single array of counters indexed 0 … m; each incoming element is hashed with h( ) and the counter at that index is incremented (+1), so a repeated element raises the same counter from 1 to 2.]
  • 19. Frequency: Estimation with Count-Min Sketch. [Diagram: the CMSketch consists of k counter arrays (counter 1, counter 2, …, counter k), each indexed 0 … m; an incoming element is hashed with h1( ), h2( ), …, hk( ) and the corresponding counter in each array is incremented (+1).]
  • 20. Frequency: Estimation with Count-Min Sketch. [Diagram: the element is hashed with h1( ), h2( ), …, hk( ) and the corresponding counters, here 1, 3, …, 5, are read from the k arrays of the CMSketch.] The frequency estimate is the minimum of these counters: f( ) = min(1, 3, …, 5) = 1
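The update and query steps of Count-Min Sketch can be sketched as below. This is a minimal illustration under stated assumptions, not the pdsa library's implementation: the class name `TinyCountMinSketch` is hypothetical, and salted BLAKE2b digests stand in for the hash functions h1 … hk.

```python
import hashlib


class TinyCountMinSketch:
    """Illustrative Count-Min Sketch: k counter arrays of m counters each."""

    def __init__(self, k, m):
        self.k = k
        self.m = m
        self.counters = [[0] * m for _ in range(k)]

    def _index(self, row, element):
        # Assumption: one salted BLAKE2b digest per row plays the role of h_row.
        digest = hashlib.blake2b(
            element.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.m

    def add(self, element):
        # Update: increment one counter in every array.
        for row in range(self.k):
            self.counters[row][self._index(row, element)] += 1

    def frequency(self, element):
        # Query: the minimum over the k counters is the least inflated one.
        return min(
            self.counters[row][self._index(row, element)]
            for row in range(self.k)
        )


cms = TinyCountMinSketch(k=5, m=2000)
for tag in ["python", "scipy", "python", "python"]:
    cms.add(tag)

print(cms.frequency("python"))  # never below the true count of 3
```

Collisions can only add to a counter, so the estimate may overestimate a frequency but never underestimates it; taking the minimum limits the damage from collisions.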
  • 21. Frequency: Invoking Count-Min Sketch from Python

        import json

        from pdsa.frequency.count_min_sketch import CountMinSketch

        # 5 counter arrays with 2000 counters each
        cms = CountMinSketch(5, 2000)

        with open('tweets.txt') as f:
            for line in f:
                # each line is a JSON record with a 'hashtag' field
                hashtag = json.loads(line)['hashtag']
                cms.add(hashtag)

        print('Frequency of #Python', cms.frequency("Python"))

        size_in_bytes = cms.sizeof()
        print('Size in bytes', size_in_bytes)  # ~40Kb / 32-bit counters

  • 22. 4. Counting
  • 23. Counting: Challenge. Count the number of unique visitors. Amazon and eBay had about 3.375 billion* visitors in June 2019. Assume 337 million unique IP addresses (128 bits per IPv6 record): that is 5.4 GB of memory just to store them all. What if we can count them with 12 KB only? * SimilarWeb.com data for June 2019. Image: https://www.cleanpng.com
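The 5.4 GB figure follows directly from the numbers on the slide. The short calculation below reproduces it; it counts raw address bytes only and assumes no overhead from the container that holds them.

```python
unique_ips = 337_000_000   # unique IP addresses assumed on the slide
bytes_per_ip = 128 // 8    # one IPv6 address is 128 bits = 16 bytes

exact_bytes = unique_ips * bytes_per_ip
sketch_bytes = 12 * 1024   # the 12 KB budget mentioned on the slide

print(f"exact storage: {exact_bytes / 1e9:.1f} GB")
print(f"ratio to a 12 KB sketch: {exact_bytes / sketch_bytes:,.0f}x")
```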
  • 24. Counting: Traditional Approach. Build a list of all unique elements. Sort / search to avoid listing elements twice. Count the elements in the list. Requires linear memory; requires O(n·log n) time.
  • 25. Counting: Approximate Counting. @katyperry has 107,287,629 followers. Would you really care if she has 107.2, 108.0, or 106.7 million followers?
  • 26. HyperLogLog: a hash-based probabilistic algorithm for counting the number of distinct values in the presence of duplicates. Proposed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007.
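  • 27. Counting: Estimation with a single counter (Flajolet, Martin). [Diagram: each element is hashed with h( ); the rank is the index of the least significant 1-bit of the hash in binary (LSB-0), e.g. rank( ) = 4 for 0 0 0 0 1 1 … and rank( ) = 0 for 1 1 0 0 0 1 …; the bit at that index is set in the FM Sketch, giving 1 0 0 0 1 … 0 0 over indices 0 … m.] With R the index of the lowest unset bit, here R = 1, the estimate is n ≈ 2^R / 0.77351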
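A minimal sketch of the Flajolet-Martin estimate with a single bit array, assuming a 32-bit BLAKE2b digest as the hash function h. The helper names `hash32`, `rank`, and `fm_estimate` are hypothetical and chosen for illustration only.

```python
import hashlib


def hash32(element):
    # Assumption: a 4-byte BLAKE2b digest stands in for the hash function h.
    digest = hashlib.blake2b(element.encode("utf-8"), digest_size=4).digest()
    return int.from_bytes(digest, "little")


def rank(value, width=32):
    """Index of the least significant 1-bit (LSB-0); `width` if value is 0."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1


def fm_estimate(stream, width=32):
    sketch = 0  # bit i is set once some element has rank i
    for element in stream:
        r = rank(hash32(element), width)
        if r < width:
            sketch |= 1 << r
    # R is the index of the lowest bit that is still unset.
    R = 0
    while (sketch >> R) & 1:
        R += 1
    return 2 ** R / 0.77351


visitors = [f"10.0.0.{i}" for i in range(200)] * 3  # duplicates do not matter
print(fm_estimate(visitors))
```

A single bit array gives a very coarse estimate (always a power of two divided by 0.77351), which is the motivation for averaging over many counters in the following slides.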
  • 28. Counting: Estimation with HyperLogLog. [Diagram: each element is hashed with h1( ), h2( ), …, hk( ); the rank of each hash in binary (LSB-0) is computed, e.g. rank1( ) = 0, rank2( ) = 5, rankk( ) = 0; register i of the HLL Sketch (registers 1, 2, …, k) is updated with ranki( ) iff it is bigger than the existing value.] n ≈ α ⋅ k ⋅ 2^AVG(HLL_i)
  • 29. Counting: HyperLogLog Algorithm. Based on a single 32-bit hash function. Simulates k hash functions using the stochastic averaging approach: hash(x) is a 32-bit hash value whose first p bits are addressing bits and whose remaining (32 - p) bits are rank computation bits. Stores only k = 2^p counters (registers), about 4 bytes each. The memory is always fixed, regardless of the number of unique elements. More counters provide less error (memory/accuracy trade-off).
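The split of one 32-bit hash into addressing bits and rank bits can be sketched as follows. This is an illustration under several assumptions, not the pdsa implementation: the class name `TinyHyperLogLog` is hypothetical, BLAKE2b stands in for the hash function, the rank is 1-based as in the original HyperLogLog paper, and the estimate is the paper's raw harmonic-mean formula without the small-range and large-range corrections.

```python
import hashlib


class TinyHyperLogLog:
    """Illustrative HyperLogLog with 2^p registers and a 32-bit hash."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                # number of registers, m = 2^p
        self.registers = [0] * self.m
        # Bias constant approximation from the paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, element):
        digest = hashlib.blake2b(element.encode("utf-8"), digest_size=4).digest()
        h = int.from_bytes(digest, "big")
        index = h >> (32 - self.p)               # first p bits: register address
        rest = h & ((1 << (32 - self.p)) - 1)    # remaining 32 - p bits
        # 1-based position of the leftmost 1-bit among the remaining bits.
        rho = (32 - self.p) - rest.bit_length() + 1
        # Keep the value only if it is bigger than the existing one.
        self.registers[index] = max(self.registers[index], rho)

    def count(self):
        # Raw estimate: alpha * m^2 / sum(2^-register), no range corrections.
        return self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)


hll = TinyHyperLogLog(p=10)
for i in range(50_000):
    hll.add(f"visitor-{i}")

print(round(hll.count()))  # close to 50000, within a few percent
```

The memory stays at 2^p registers no matter how many elements are added. Without the small-range correction, the raw estimate is unreliable for cardinalities that are small relative to the number of registers.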
  • 30. Counting: Invoking HyperLogLog from Python

        import json

        from pdsa.cardinality.hyperloglog import HyperLogLog

        hll = HyperLogLog(precision=10)  # 2^{10} = 1024 counters

        with open('visitors.txt') as f:
            for line in f:
                # each line is a JSON record with an 'ip' field
                ip = json.loads(line)['ip']
                hll.add(ip)

        num_of_unique_visitors = hll.count()
        print('Unique visitors', num_of_unique_visitors)

        size_in_bytes = hll.sizeof()
        print('Size in bytes', size_in_bytes)  # ~ 4Kb
  • 31. Counting: Distinct Count in Redis. Redis uses the HyperLogLog data structure to count unique elements in a set. It requires a small constant amount of memory, 12KB for every data structure, and approximates the exact cardinality with a standard error of 0.81%. Source: http://antirez.com/news/75

        redis> PFADD hll python java ruby
        (integer) 1
        redis> PFADD hll python python python
        (integer) 0
        redis> PFADD hll java ruby
        (integer) 0
        redis> PFCOUNT hll
        (integer) 3
  • 32. 5. Final Notes
  • 33. Final Notes. Think about Big Data as a technology challenge. Instead of buying new servers, learn new algorithms. Believe in hashing! Sample vs Hashing. Probabilistic Data Structures and Algorithms become useful when your problem fits. Image: https://longfordpc.com/
  • 34. Read More
 [book] Probabilistic Data Structures and Algorithms for Big Data Applications: https://pdsa.gakhov.com
 [repo] Probabilistic Data Structures and Algorithms in Python: https://github.com/gakhov/pdsa
 Sketch of the Day: HyperLogLog - Cornerstone of a Big Data Infrastructure: https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
 Redis new data structure: the HyperLogLog: http://antirez.com/news/75
 Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles: https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
 Big Data with Sketchy Structures: https://towardsdatascience.com/b73fb3a33e2a
 Count-Min Sketch: http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf
  • 35. Eskerrik asko! (Thank you!) Website: www.gakhov.com. Twitter: @gakhov. Probabilistic Data Structures and Algorithms for Big Data Applications: pdsa.gakhov.com
  • 36. 6. Additional Slides (for that person who wants more)
  • 38. Counting: Accuracy vs Memory Tradeoff in HyperLogLog. More counters require more memory (4 bytes per counter). More counters need more bits for addressing them (m = 2^p).
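The trade-off can be tabulated with a short calculation. It assumes the standard error of roughly 1.04 / sqrt(m) reported in the HyperLogLog paper and the 4 bytes per counter stated on this slide; the function name `hll_tradeoff` is hypothetical.

```python
import math


def hll_tradeoff(p, bytes_per_counter=4):
    """Return (counters, memory in bytes, relative standard error) for p bits."""
    m = 2 ** p
    return m, m * bytes_per_counter, 1.04 / math.sqrt(m)


for p in (4, 8, 10, 14):
    m, memory, error = hll_tradeoff(p)
    print(f"p={p:2d}  counters={m:6d}  memory={memory:6d} B  std. error={error:.2%}")
```

With p = 14 the error is about 0.81%, the figure quoted for Redis, which reaches 12 KB by packing its 16384 registers into 6 bits each rather than 4 bytes.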
  • 39. Counting: HyperLogLog++ Algorithm. HyperLogLog++ is an improved version of HyperLogLog developed at Google and proposed in 2013. It uses a 64-bit hash function, which allows counting more values; provides better bias correction using pre-trained data; and proposes a sparse representation of the counters (registers) to reduce memory requirements.