We interact with an increasing amount of data, but classical data structures and algorithms can no longer meet our requirements. This talk presents probabilistic algorithms and data structures and describes the main areas of their application.
Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications
1. Andrii Gakhov, PhD
Exceeding Classical:
Probabilistic Data Structures
in Data-Intensive Applications
EuroSciPy 2019
Bilbao, Spain
2. Andrii Gakhov
Senior Software Engineer
at Ferret Go GmbH, Germany
Ph.D. in Mathematical Modelling,
M.Sc. in Applied Mathematics
Twitter: @gakhov | Website: gakhov.com
Probabilistic Data Structures
and Algorithms
for Big Data Applications
ISBN: 9783748190486
https://pdsa.gakhov.com
4. Bioinformatics: Counting k-mers in DNA
Counting substrings of length k in DNA sequence data (k-mers) is
essential in bioinformatics, for instance, for metagenomic sequencing.
A large fraction of the storage is spent on storing k-mers that contain sequencing errors and are observed only a single time in the data*.
Can we efficiently avoid persisting such invalid substrings?
Can we efficiently count valid substrings?
* Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 12(1), 333, 2011
For example, the team that sequenced the giant panda genome needed to
count 8.62 billion 27-mers, where 68% were low-coverage k-mers.
5. 1. Data-Intensive Applications in the Big Data Epoch
Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications
EuroSciPy 2019 | Andrii Gakhov @gakhov
6. What is Big Data?
In 2001, Doug Laney described Big Data datasets as those that contain greater variety, arriving in increasing volumes and with ever-higher velocity. Today this is known as the famous 3Vs of Big Data.
Volume expresses the amount of data.
Velocity describes the speed at which data is arriving.
Variety refers to the number of types of data.
7. What is Big Data?
Big Data is more than simply a matter of size.
Big Data does not refer to the data itself; it refers to technology.
The datasets of Big Data are larger, more complex, and generated more rapidly than our current resources can handle.
8. 2. Probabilistic Data Structures
and Algorithms
9. Probabilistic Data Structures and Algorithms (PDSA)
A family of advanced approaches that are optimized to use sublinear memory and constant execution time.
They cannot provide exact answers and have some probability of error.
The trade-off between the error and the resources is another feature that distinguishes the algorithms and data structures of this family.
10. PDSA in the Big Data Ecosystem
Membership (keep track of indexed elements): Bloom Filter, Quotient Filter, Cuckoo Filter
Counting (find the number of unique elements): Linear Counting, FM Sketch, LogLog, HyperLogLog
Frequency (estimate frequencies of elements): Count-Min Sketch, Count Sketch
Rank (approximate percentiles and quantiles): Random Sampling, q-digest, t-digest, Greenwald-Khanna
Similarity (find similar documents): MinHash, SimHash, LSH
11. PDSA in Apache Spark SQL (PySpark interface)
q-quantile estimation (Greenwald-Khanna)
# pyspark.sql.DataFrameStatFunctions(df).approxQuantile
df.approxQuantile("language", [0.5], 0.25)
Approximate number of distinct elements (HyperLogLog++)
# pyspark.sql.functions.approx_count_distinct
df.agg(approx_count_distinct(df.language).alias('lang')).collect()
Spark SQL is Apache Spark's module for working with structured data.
14. Frequency: Challenge
A hashtag is used to index a topic on Twitter and allows people to easily follow
items they are interested in. Hashtags are usually written with a # symbol in front.
Find the most trending hashtags on Twitter
every second, about 6,000 tweets are created on Twitter, which is roughly 500 million items daily
most tweets are linked with one or more hashtags
https://www.internetlivestats.com/twitter-statistics/
15. Frequency: Traditional Approach
Build a table that lists all elements seen so far with their corresponding counters.
Increment the counter when a known element arrives, or add the new element to the table and initialize its counter.
Return the value of the counter that corresponds to the element as its frequency.
requires linear memory
requires O(n) lookup time (worst case)
huge overhead for heavy-hitters search
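The traditional approach above can be sketched with Python's standard library (a toy illustration with made-up hashtags, not code from the talk):

```python
from collections import Counter

# exact frequency table: one counter per distinct element seen so far
counts = Counter()
for tag in ["python", "python", "scipy", "python", "numpy"]:
    counts[tag] += 1  # increment, or implicitly insert and initialize

print(counts["python"])  # 3
# the table grows linearly with the number of distinct elements
```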
16. Frequency: Challenges of Big Data Streams
Continuous data streams
potentially unbounded number of unique elements
➡ sublinear (polylogarithmic at most) space
not feasible to re-process data streams
➡ one-pass algorithms preferred
high frequency throughput
➡ fast updates
17. Count-Min Sketch
a simple space-efficient probabilistic data structure that is used to estimate frequencies of elements in data streams and can address the heavy hitters problem
presented by Graham Cormode and S. Muthukrishnan in 2003
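To make the idea concrete, here is a from-scratch sketch of a Count-Min Sketch: a depth × width table of counters, where each of `depth` hash functions maps an element to one column of its row. This is an illustrative toy (the class name and md5-based hashing are my own choices, not the structure's canonical implementation):

```python
import hashlib

class SimpleCountMinSketch:
    """Illustrative Count-Min Sketch: a depth x width table of counters."""

    def __init__(self, depth=5, width=2000):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # derive `depth` hash values by seeding md5 differently per row
        for seed in range(self.depth):
            digest = hashlib.md5(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._indexes(item)):
            self.table[row][col] += count

    def frequency(self, item):
        # collisions can only inflate counters, so the row-wise minimum
        # upper-bounds (and approximates) the true frequency
        return min(self.table[row][col]
                   for row, col in enumerate(self._indexes(item)))

cms = SimpleCountMinSketch()
for tag in ["python"] * 5 + ["java"] * 2:
    cms.add(tag)
print(cms.frequency("python"))  # 5 (the estimate never underestimates)
```

Note the space usage is fixed at depth × width counters, independent of how many distinct elements the stream contains.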
21. Frequency: Invoking Count-Min Sketch from Python
import json
from pdsa.frequency.count_min_sketch import CountMinSketch

cms = CountMinSketch(5, 2000)
with open('tweets.txt') as f:
    for line in f:
        hashtag = json.loads(line)['hashtag']
        cms.add(hashtag)

print('Frequency of #Python', cms.frequency("Python"))
size_in_bytes = cms.sizeof()
print('Size in bytes', size_in_bytes)  # ~40 KB with 32-bit counters
22. 4. Counting
23. Counting: Challenge
Count the number of unique visitors
Amazon and eBay had about 3.375 billion* visitors in June 2019.
Assume 337 million unique IP addresses (128 bits per IPv6 record):
that is 5.4 GB of memory just to store them all.
*SimilarWeb.Com Data for June, 2019
What if we can count them with 12 KB only?
24. Counting: Traditional Approach
Build a list of all unique elements.
Sort / search to avoid listing elements twice.
Count the elements in the list.
requires linear memory
requires O(n·log n) time
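In Python the traditional exact approach is simply a hash set, whose memory grows with every new unique element (a scaled-down toy with synthetic addresses, not data from the talk):

```python
import sys

# exact distinct counting: store every unique element explicitly
visitors = set()
for i in range(100_000):
    ip = f"2001:db8::{i:x}"  # synthetic IPv6-like addresses
    visitors.add(ip)
    visitors.add(ip)         # duplicates are absorbed by the set

print(len(visitors))            # 100000 -- the exact answer
print(sys.getsizeof(visitors))  # several MB of table alone, before the strings
```

The count is exact, but the memory footprint scales linearly with cardinality, which is precisely what a fixed-size structure like HyperLogLog avoids.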
26. HyperLogLog
a hash-based probabilistic algorithm for counting the number of distinct
values in the presence of duplicates
proposed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007
29. Counting: HyperLogLog Algorithm
Based on a single 32-bit hash function.
Simulates k hash functions using the stochastic averaging approach.
hash(x) = 32-bit hash value: p addressing bits followed by (32 − p) rank computation bits.
Stores only k = 2^p counters (registers), about 4 bytes each.
The memory is always fixed, regardless of the number of unique elements.
More counters provide less error (memory/accuracy trade-off).
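The scheme above can be sketched from scratch as follows. This is an illustrative toy following the original 32-bit formulation (the class name, md5-based hash, and the small-range correction are my own additions, not the pdsa implementation):

```python
import hashlib
import math

class SimpleHyperLogLog:
    """Illustrative HyperLogLog with m = 2^p registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def _hash32(self, item):
        # 32-bit hash derived from md5 (a stand-in for a proper hash)
        return int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")

    def add(self, item):
        x = self._hash32(item)
        j = x >> (32 - self.p)              # p addressing bits -> register index
        w = x & ((1 << (32 - self.p)) - 1)  # remaining bits for rank computation
        rank = (32 - self.p) - w.bit_length() + 1  # position of the leftmost 1-bit
        self.registers[j] = max(self.registers[j], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias-correction constant
        z = sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:  # small-range (linear counting) correction
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)

hll = SimpleHyperLogLog(p=12)
for i in range(100_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the registers

print(hll.count())  # close to 100000 (about 1.6% standard error for p=12)
```

Whatever the stream's cardinality, the structure holds exactly m = 2^p small registers, which is the fixed-memory property the slide describes.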
30. Counting: Invoking HyperLogLog from Python
import json
from pdsa.cardinality.hyperloglog import HyperLogLog

hll = HyperLogLog(precision=10)  # 2^10 = 1024 counters
with open('visitors.txt') as f:
    for line in f:
        ip = json.loads(line)['ip']
        hll.add(ip)

num_of_unique_visitors = hll.count()
print('Unique visitors', num_of_unique_visitors)
size_in_bytes = hll.sizeof()
print('Size in bytes', size_in_bytes)  # ~4 KB
31. Counting: Distinct Count in Redis
Redis uses the HyperLogLog data structure to count unique elements in a set.
It requires a small constant amount of memory: 12 KB for every data structure.
It approximates the exact cardinality with a standard error of 0.81%.
redis> PFADD hll python java ruby
(integer) 1
redis> PFADD hll python python python
(integer) 0
redis> PFADD hll java ruby
(integer) 0
redis> PFCOUNT hll
(integer) 3
http://antirez.com/news/75
32. 5. Final Notes
33. Final Notes
Think about Big Data as a technology challenge.
Instead of buying new servers, learn new algorithms.
Believe in hashing! Sampling vs. hashing.
Probabilistic data structures and algorithms become useful when your problem fits.
34. Read More
[book] Probabilistic Data Structures and Algorithms for Big Data Applications
https://pdsa.gakhov.com
[repo] Probabilistic Data Structures and Algorithms in Python
https://github.com/gakhov/pdsa
Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure
https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
Redis new data structure: the HyperLogLog
http://antirez.com/news/75
Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles
https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
Big Data with Sketchy Structures
https://towardsdatascience.com/b73fb3a33e2a
Count-Min Sketch
http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf
35. Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications
EuroSciPy 2019 | Andrii Gakhov @gakhov
Website: www.gakhov.com
Twitter: @gakhov
Probabilistic Data Structures and
Algorithms for Big Data Applications
pdsa.gakhov.com
Eskerrik asko! ("Thank you very much!" in Basque)
36. 6. Additional Slides
(for that person who wants more)
38. Counting: Accuracy vs Memory Trade-off in HyperLogLog
More counters require more memory (4 bytes per counter).
More counters need more bits for addressing them (m = 2^p).
39. Counting: HyperLogLog++ Algorithm
HyperLogLog++ is an improved version of HyperLogLog, developed at Google and proposed in 2013.
Uses a 64-bit hash function, so it allows counting more values.
Provides better bias correction using empirically pre-trained data.
Proposes a sparse representation of the counters (registers) to reduce memory requirements.