SlideShare uma empresa Scribd logo
1 de 41
Baixar para ler offline
Approximate methods for scalable
data mining
Andrew Clegg
Data Analytics & Visualization Team
Pearson Technology
Twitter: @andrew_clegg
Approximate methods for scalable data mining l 08/03/132
Outline
1. Intro
2. What are approximate methods and why are they cool?
3. Set membership (finding non-unique items)
4. Cardinality estimation (counting unique items)
5. Frequency estimation (counting occurrences of items)
6. Locality-sensitive hashing (finding similar items)
7. Further reading and sample code
Approximate methods for scalable data mining l 08/03/133
Intro
Me and the team
London
Dario Villanueva
Ablanedo
Data Analytics Engineer
Hubert Rogers
Data Scientist
Andrew Clegg
Technical Manager
Kostas Perifanos
Data Analytics
Engineer
Andreas
Galatoulas
Product Manager
Approximate methods for scalable data mining l 08/03/134
Intro
Motivation for getting into approximate methods
Counting unique terms across ElasticSearch shards
Cluster
nodes Master
node
Distinct
terms per
shard
Globally
distinct
terms
Client
Number of
globally
distinct
terms
Icons from Dropline Neu! http://findicons.com/pack/1714/dropline_neu
Approximate methods for scalable data mining l 08/03/135
Intro
Motivation for getting into approximate methods
But what if each term-set is BIG?
Memory cost
CPU cost to
serialize
Network
transfer cost
CPU cost to
deserialize
CPU & memory cost to
merge & count sets
… and what if they’re too big to fit in memory?
Approximate methods for scalable data mining l 08/03/136
But we’ll come back to that later.
Approximate methods for scalable data mining l 08/03/137
What are approximate methods?
Trading accuracy for scalability
• Often use probabilistic data structures
– a.k.a. “Sketches”
• Mostly stream-friendly
– Allow you to query data you haven’t even kept!
• Generally simple to parallelize
• Predictable error rate (can be tuned)
Approximate methods for scalable data mining l 08/03/138
What are approximate methods?
Trading accuracy for scalability
• Represent characteristics or summary of data
• Use much less space than full dataset (often via hashing)
– Can alleviate disk, memory, network bottlenecks
• Generally incur more CPU load than exact methods
– This may not be true in a distributed system, overall:
[de]serialization for example
– Many data-centric systems have CPU to spare anyway
Approximate methods for scalable data mining l 08/03/139
Set membership
Have I seen this item before?
Approximate methods for scalable data mining l 08/03/1310
Set membership
Naïve approach
• Put all items in a hash table in memory
– e.g. HashSet in Java, set in Python
• Checking whether item exists is very cheap
• Not so good when items don’t fit in memory any more
• Merging big sets (to increase query speed) can be expensive
– Especially if they are on different cluster nodes
Approximate methods for scalable data mining l 08/03/1311
Set membership
Bloom filter
A probabilistic data structure for testing set membership
Real-life example:
BigTable and HBase use these to avoid wasted lookups for non-
existent row and column IDs.
Approximate methods for scalable data mining l 08/03/1312
Set membership
Bloom filter: creating and populating
• Bitfield of size n (can be quite large but << total data size)
• k independent hash functions with integer output in [0, n-1]
• For each input item:
– For each hash:
○ Hash item to get an index into the bitfield
○ Set that bit to 1
i.e. Each item yields a unique pattern of k bits.
These are ORed onto the bitfield when the item is added.
Approximate methods for scalable data mining l 08/03/1313
Set membership
Bloom filter: querying
• Hash the query item with all k hash functions
• Are all of the corresponding bits set?
– No = we have never seen this item before
– Yes = we have probably seen this item before
• Probability of false positive depends on:
– n (bitfield size)
– number of items added
• k has an optimum value also based on these
– Must be picked in advance based on what you expect, roughly
Approximate methods for scalable data mining l 08/03/1314
Set membership
Bloom filter
Example (3 elements, 3 hash functions, 18 bits)
Image from Wikipedia http://en.wikipedia.org/wiki/File:Bloom_filter.svg
Approximate methods for scalable data mining l 08/03/1315
Set membership
Bloom filter
Cool properties
• Union/intersection = bitwise OR/AND
• Add/query operations stay at O(k) time (and they’re fast)
• Filter takes up constant space
– Can be rebuilt bigger once saturated, if you still have the data
Extensions
• BFs supporting “remove”, scalable (growing) BFs, stable BFs, …
Approximate methods for scalable data mining l 08/03/1316
Cardinality estimation
How many distinct items have I seen?
Approximate methods for scalable data mining l 08/03/1317
Cardinality calculation
Naïve approach
• Put all items in a hash table in memory
– e.g. HashSet in Java, set in Python
– Duplicates are ignored
• Count the number remaining at the end
– Implementations typically track this -- fast to check
• Not so good when items don’t fit in memory any more
• Merging big sets can be expensive
– Especially if they are on different cluster nodes
Approximate methods for scalable data mining l 08/03/1318
Cardinality estimation
Probabilistic counting
An approximate method for counting unique items
Real-life example:
Implementation of parallelizable distinct counts in ElasticSearch.
https://github.com/ptdavteam/elasticsearch-approx-plugin
Approximate methods for scalable data mining l 08/03/1319
Cardinality estimation
Probabilistic counting
Intuitive explanation
Long runs of trailing 0s in random bit strings are rare.
But the more bit strings you look at, the more likely you
are to see a long one.
So “longest run of trailing 0s seen” can be used as an
estimator of “number of unique bit strings seen”.
01110001
11101010
00100101
11001100
11110100
11101100
00010100
00000001
00000010
10001110
01110100
01101010
01111111
00100010
00110000
00001010
01000100
01111010
01011101
00000100
Approximate methods for scalable data mining l 08/03/1320
Cardinality estimation
Probabilistic counting: basic algorithm
• Let n = 0
• For each input item:
– Hash item into bit string
– Count trailing zeroes in bit string
– If this count > n:
○ Let n = count
Approximate methods for scalable data mining l 08/03/1321
Cardinality estimation
Probabilistic counting: calculating the estimate
• n = longest run of trailing 0s seen
• Estimated cardinality (“count distinct”) = 2^n … that’s it!
This is an estimate, but not actually a great one.
Improvements
• Various “fudge factors”, corrections for extreme values, etc.
• Multiple hashes in parallel, average over results (LogLog algorithm)
• Harmonic mean instead of geometric (HyperLogLog algorithm)
Approximate methods for scalable data mining l 08/03/1322
Cardinality estimation
Probabilistic counting and friends
Cool properties
• Error rates are predictable
– And tunable, for multi-hash methods
• Can be merged easily
– max(longest run counters from all shards)
• Add/query operations are constant time (and fast too)
• Data structure is just counter[s]
Approximate methods for scalable data mining l 08/03/1323
Frequency estimation
How many occurences of each item have I seen?
Approximate methods for scalable data mining l 08/03/1324
Frequency calculation
Naïve approach
• Maintain a key-value hash table from item -> counter
– e.g. HashMap in Java, dict in Python
• Not so good when items don’t fit in memory any more
• Merging big maps can be expensive
– Especially if they are on different cluster nodes
Approximate methods for scalable data mining l 08/03/1325
Frequency estimation
Count-min sketch
A probabilistic data structure for counting occurences of
items
Real-life example:
Keeping track of traffic volume by IP address in a firewall, to
detect anomalies.
Approximate methods for scalable data mining l 08/03/1326
Frequency estimation
Count-min sketch: creating and populating
• k integer arrays, each of length n
• k hash functions yielding values in [0, n-1]
– These values act as indexes into the arrays
• For each input item:
– For each hash:
○ Hash item to get index into corresponding array
○ Increment the value at that position by 1
Approximate methods for scalable data mining l 08/03/1327
Frequency estimation
Count-min sketch: creating and populating
+2A3
+1+1A2
+1+1A1
“foo”
h1 h2 h3
h1 h2 h3
“bar”
Approximate methods for scalable data mining l 08/03/1328
Frequency estimation
Count-min sketch: querying
• For each hash function:
– Hash query item to get index into corresponding array
– Get the count at that position
• Return the lowest of these counts
This minimizes the effect of hash collisions.
(Collisions can only cause over-counting, not under-counting)
Approximate methods for scalable data mining l 08/03/1329
Frequency estimation
Count-min sketch: querying
0200000000A3
0001001000A2
0000100010A1
“foo”
h1 h2 h3
min(1, 1, 2) = 1
Caveat: You can’t iterate through the items, they’re not stored at all.
Approximate methods for scalable data mining l 08/03/1330
Frequency estimation
Count-min sketch
Cool properties
• Fast adding and querying in O(k) time
• As with Bloom filter: more hashes = lower error
• Mergeable by cellwise addition
• Better accuracy for higher-frequency items (“heavy hitters”)
• Can also be used to find quantiles
Approximate methods for scalable data mining l 08/03/1331
Similarity search
Which items are most similar to this one?
Approximate methods for scalable data mining l 08/03/1332
Similarity search
Naïve approach
Nearest-neighbour search
• For each stored item:
– Compare to query item via appropriate distance metric*
– Keep if closer than previous closest match
• Distance metric calculation can be expensive
– Especially if items are many-dimensional records
• Can be slow even if data small enough to fit in memory
*Cosine distance, Hamming distance, Jaccard distance etc.
Approximate methods for scalable data mining l 08/03/1333
Similarity search
Locality-sensitive hashing
A probabilistic method for nearest-neighbour search
Real-life example:
Finding users with similar music tastes in an online radio service.
Approximate methods for scalable data mining l 08/03/1334
Similarity search
Locality-sensitive hashing
Intuitive explanation
Typical hash functions:
• Similar inputs yield very different outputs
Locality-sensitive hash functions:
• Similar inputs yield similar or identical outputs
So: Hash each item, then just compare hashes.
– Can be used to pre-filter items before exact comparisons
– You can also index the hashes for quick lookup
Approximate methods for scalable data mining l 08/03/1335
Similarity search
Locality-sensitive hashing: random hyperplanes
Treat n-valued items as vectors in n-dimensional space.
Draw k random hyperplanes in that space.
For each hyperplane:
Is each vector above it (1) or below it (0)?
Item1
h1 h2
h3
Item2
Hash(Item1) = 011
Hash(Item2) = 001
In this example: n=2, k=3
As the cosine distance decreases, the
probability of a hash match increases.
Approximate methods for scalable data mining l 08/03/1336
Similarity search
Locality-sensitive hashing: random hyperplanes
Cool properties
• Hamming distance between hashes approximates cosine distance
• More hyperplanes (higher k) -> bigger hashes -> better estimates
• Can use to narrow search space before exact nearest-neighbours
• Various ways to combine sketches for better separation
Approximate methods for scalable data mining l 08/03/1337
Similarity search
Locality-sensitive hashing
Other approaches
• Bit sampling: approximates Hamming distance
• MinHashing: approximates Jaccard distance
• Random projection: approximates Euclidean distance
Approximate methods for scalable data mining l 08/03/1338
Wrap-up
Approximate methods for scalable data mining l 08/03/1339
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
Approximate methods for scalable data mining l 08/03/1340
Resources
Ebook available free from:
http://infolab.stanford.edu/~ullman/mmds.html
stream-lib:
https://github.com/clearspring/stream-lib
Java libraries for cardinality, set membership,
frequency and top-N items (not covered here)
No canonical source of multiple LSH algorithms,
but plenty of separate implementations
Wikipedia is pretty good on these topics too

Mais conteúdo relacionado

Mais procurados

5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streamsKrish_ver2
 
Mining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTMining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTDavide Gallitelli
 
Mining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataMining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataAnsgar Scherp
 
A Comparison of Different Strategies for Automated Semantic Document Annotation
A Comparison of Different Strategies for Automated Semantic Document AnnotationA Comparison of Different Strategies for Automated Semantic Document Annotation
A Comparison of Different Strategies for Automated Semantic Document AnnotationAnsgar Scherp
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Raja Chiky
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Jimmy Lai
 
Stacks
StacksStacks
StacksAcad
 
Mining Data Streams
Mining Data StreamsMining Data Streams
Mining Data StreamsSujaAldrin
 
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudSchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudAnsgar Scherp
 
Approximate methods for scalable data mining
Approximate methods for scalable data miningApproximate methods for scalable data mining
Approximate methods for scalable data miningAndrew Clegg
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonWes McKinney
 
Data structure and algorithms
Data structure and algorithmsData structure and algorithms
Data structure and algorithmstechnologygyan
 

Mais procurados (20)

5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
09 searching[1]
09 searching[1]09 searching[1]
09 searching[1]
 
Hashing
HashingHashing
Hashing
 
Mining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTMining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDT
 
Extensible hashing
Extensible hashingExtensible hashing
Extensible hashing
 
Algorithm and Programming (Searching)
Algorithm and Programming (Searching)Algorithm and Programming (Searching)
Algorithm and Programming (Searching)
 
Hashing gt1
Hashing gt1Hashing gt1
Hashing gt1
 
Mining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataMining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open Data
 
Hash tables
Hash tablesHash tables
Hash tables
 
A Comparison of Different Strategies for Automated Semantic Document Annotation
A Comparison of Different Strategies for Automated Semantic Document AnnotationA Comparison of Different Strategies for Automated Semantic Document Annotation
A Comparison of Different Strategies for Automated Semantic Document Annotation
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
 
Data Structures 6
Data Structures 6Data Structures 6
Data Structures 6
 
Stacks
StacksStacks
Stacks
 
Mining Data Streams
Mining Data StreamsMining Data Streams
Mining Data Streams
 
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudSchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
 
Approximate methods for scalable data mining
Approximate methods for scalable data miningApproximate methods for scalable data mining
Approximate methods for scalable data mining
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in Python
 
Data structure and algorithms
Data structure and algorithmsData structure and algorithms
Data structure and algorithms
 
Linear search-and-binary-search
Linear search-and-binary-searchLinear search-and-binary-search
Linear search-and-binary-search
 

Semelhante a Approximate methods for scalable data mining (long version)

Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisHiye Biniam
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAtner Yegorov
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeDataWorks Summit
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Codemotion
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer visionEran Shlomo
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesSơn Còm Nhom
 
Basic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsBasic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsRajendran
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"NUS-ISS
 
ADS Introduction
ADS IntroductionADS Introduction
ADS IntroductionNagendraK18
 
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...BRNSSPublicationHubI
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Skillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet appSkillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet appSkillwise Group
 
Ledingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lkLedingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lkMukesh Singh
 
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineApache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineDataWorks Summit
 
Data science chapter-7,8,9
Data science chapter-7,8,9Data science chapter-7,8,9
Data science chapter-7,8,9varshakumar21
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithmsc.titus.brown
 
Ts project Hash based inventory system
Ts project Hash based inventory systemTs project Hash based inventory system
Ts project Hash based inventory systemDADITIRUMALATARUN
 
Algorithms.pptx
Algorithms.pptxAlgorithms.pptx
Algorithms.pptxjohn6938
 

Semelhante a Approximate methods for scalable data mining (long version) (20)

Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysis
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysis
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
 
DB
DBDB
DB
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and Techniques
 
Basic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsBasic terminologies & asymptotic notations
Basic terminologies & asymptotic notations
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
 
ADS Introduction
ADS IntroductionADS Introduction
ADS Introduction
 
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Skillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet appSkillwise - Enhancing dotnet app
Skillwise - Enhancing dotnet app
 
t10_part1.pptx
t10_part1.pptxt10_part1.pptx
t10_part1.pptx
 
Ledingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lkLedingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lk
 
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineApache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
 
Data science chapter-7,8,9
Data science chapter-7,8,9Data science chapter-7,8,9
Data science chapter-7,8,9
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
Ts project Hash based inventory system
Ts project Hash based inventory systemTs project Hash based inventory system
Ts project Hash based inventory system
 
Algorithms.pptx
Algorithms.pptxAlgorithms.pptx
Algorithms.pptx
 

Último

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 

Último (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 

Approximate methods for scalable data mining (long version)

  • 1.
  • 2. Approximate methods for scalable data mining Andrew Clegg Data Analytics & Visualization Team Pearson Technology Twitter: @andrew_clegg
  • 3. Approximate methods for scalable data mining l 08/03/132 Outline 1. Intro 2. What are approximate methods and why are they cool? 3. Set membership (finding non-unique items) 4. Cardinality estimation (counting unique items) 5. Frequency estimation (counting occurrences of items) 6. Locality-sensitive hashing (finding similar items) 7. Further reading and sample code
  • 4. Approximate methods for scalable data mining l 08/03/133 Intro Me and the team London Dario Villanueva Ablanedo Data Analytics Engineer Hubert Rogers Data Scientist Andrew Clegg Technical Manager Kostas Perifanos Data Analytics Engineer Andreas Galatoulas Product Manager
  • 5. Approximate methods for scalable data mining l 08/03/134 Intro Motivation for getting into approximate methods Counting unique terms across ElasticSearch shards Cluster nodes Master node Distinct terms per shard Globally distinct terms Client Number of globally distinct terms Icons from Dropline Neu! http://findicons.com/pack/1714/dropline_neu
  • 6. Approximate methods for scalable data mining l 08/03/135 Intro Motivation for getting into approximate methods But what if each term-set is BIG? Memory cost CPU cost to serialize Network transfer cost CPU cost to deserialize CPU & memory cost to merge & count sets … and what if they’re too big to fit in memory?
  • 7. Approximate methods for scalable data mining l 08/03/136 But we’ll come back to that later.
  • 8. Approximate methods for scalable data mining l 08/03/137 What are approximate methods? Trading accuracy for scalability • Often use probabilistic data structures – a.k.a. “Sketches” • Mostly stream-friendly – Allow you to query data you haven’t even kept! • Generally simple to parallelize • Predictable error rate (can be tuned)
  • 9. Approximate methods for scalable data mining l 08/03/138 What are approximate methods? Trading accuracy for scalability • Represent characteristics or summary of data • Use much less space than full dataset (often via hashing) – Can alleviate disk, memory, network bottlenecks • Generally incur more CPU load than exact methods – This may not be true in a distributed system, overall: [de]serialization for example – Many data-centric systems have CPU to spare anyway
  • 10. Approximate methods for scalable data mining l 08/03/139 Set membership Have I seen this item before?
  • 11. Approximate methods for scalable data mining l 08/03/1310 Set membership Naïve approach • Put all items in a hash table in memory – e.g. HashSet in Java, set in Python • Checking whether item exists is very cheap • Not so good when items don’t fit in memory any more • Merging big sets (to increase query speed) can be expensive – Especially if they are on different cluster nodes
  • 12. Approximate methods for scalable data mining l 08/03/1311 Set membership Bloom filter A probabilistic data structure for testing set membership Real-life example: BigTable and HBase use these to avoid wasted lookups for non- existent row and column IDs.
  • 13. Approximate methods for scalable data mining l 08/03/1312 Set membership Bloom filter: creating and populating • Bitfield of size n (can be quite large but << total data size) • k independent hash functions with integer output in [0, n-1] • For each input item: – For each hash: ○ Hash item to get an index into the bitfield ○ Set that bit to 1 i.e. Each item yields a unique pattern of k bits. These are ORed onto the bitfield when the item is added.
  • 14. Approximate methods for scalable data mining l 08/03/1313 Set membership Bloom filter: querying • Hash the query item with all k hash functions • Are all of the corresponding bits set? – No = we have never seen this item before – Yes = we have probably seen this item before • Probability of false positive depends on: – n (bitfield size) – number of items added • k has an optimum value also based on these – Must be picked in advance based on what you expect, roughly
  • 15. Approximate methods for scalable data mining l 08/03/1314 Set membership Bloom filter Example (3 elements, 3 hash functions, 18 bits) Image from Wikipedia http://en.wikipedia.org/wiki/File:Bloom_filter.svg
  • 16. Approximate methods for scalable data mining l 08/03/1315 Set membership Bloom filter Cool properties • Union/intersection = bitwise OR/AND • Add/query operations stay at O(k) time (and they’re fast) • Filter takes up constant space – Can be rebuilt bigger once saturated, if you still have the data Extensions • BFs supporting “remove”, scalable (growing) BFs, stable BFs, …
  • 17. Approximate methods for scalable data mining l 08/03/1316 Cardinality estimation How many distinct items have I seen?
  • 18. Approximate methods for scalable data mining l 08/03/1317 Cardinality calculation Naïve approach • Put all items in a hash table in memory – e.g. HashSet in Java, set in Python – Duplicates are ignored • Count the number remaining at the end – Implementations typically track this -- fast to check • Not so good when items don’t fit in memory any more • Merging big sets can be expensive – Especially if they are on different cluster nodes
  • 19. Approximate methods for scalable data mining l 08/03/1318 Cardinality estimation Probabilistic counting An approximate method for counting unique items Real-life example: Implementation of parallelizable distinct counts in ElasticSearch. https://github.com/ptdavteam/elasticsearch-approx-plugin
  • 20. Approximate methods for scalable data mining l 08/03/1319 Cardinality estimation Probabilistic counting Intuitive explanation Long runs of trailing 0s in random bit strings are rare. But the more bit strings you look at, the more likely you are to see a long one. So “longest run of trailing 0s seen” can be used as an estimator of “number of unique bit strings seen”. 01110001 11101010 00100101 11001100 11110100 11101100 00010100 00000001 00000010 10001110 01110100 01101010 01111111 00100010 00110000 00001010 01000100 01111010 01011101 00000100
  • 21. Approximate methods for scalable data mining l 08/03/1320 Cardinality estimation Probabilistic counting: basic algorithm • Let n = 0 • For each input item: – Hash item into bit string – Count trailing zeroes in bit string – If this count > n: ○ Let n = count
  • 22. Approximate methods for scalable data mining l 08/03/1321 Cardinality estimation Probabilistic counting: calculating the estimate • n = longest run of trailing 0s seen • Estimated cardinality (“count distinct”) = 2^n … that’s it! This is an estimate, but not actually a great one. Improvements • Various “fudge factors”, corrections for extreme values, etc. • Multiple hashes in parallel, average over results (LogLog algorithm) • Harmonic mean instead of geometric (HyperLogLog algorithm)
  • 23. Approximate methods for scalable data mining l 08/03/1322 Cardinality estimation Probabilistic counting and friends Cool properties • Error rates are predictable – And tunable, for multi-hash methods • Can be merged easily – max(longest run counters from all shards) • Add/query operations are constant time (and fast too) • Data structure is just counter[s]
  • 24. Approximate methods for scalable data mining l 08/03/1323 Frequency estimation How many occurences of each item have I seen?
  • 25. Approximate methods for scalable data mining l 08/03/1324 Frequency calculation Naïve approach • Maintain a key-value hash table from item -> counter – e.g. HashMap in Java, dict in Python • Not so good when items don’t fit in memory any more • Merging big maps can be expensive – Especially if they are on different cluster nodes
  • 26. Approximate methods for scalable data mining l 08/03/1325 Frequency estimation Count-min sketch A probabilistic data structure for counting occurences of items Real-life example: Keeping track of traffic volume by IP address in a firewall, to detect anomalies.
  • 27. Approximate methods for scalable data mining l 08/03/1326 Frequency estimation Count-min sketch: creating and populating • k integer arrays, each of length n • k hash functions yielding values in [0, n-1] – These values act as indexes into the arrays • For each input item: – For each hash: ○ Hash item to get index into corresponding array ○ Increment the value at that position by 1
  • 28. Approximate methods for scalable data mining l 08/03/1327 Frequency estimation Count-min sketch: creating and populating +2A3 +1+1A2 +1+1A1 “foo” h1 h2 h3 h1 h2 h3 “bar”
  • 29. Approximate methods for scalable data mining l 08/03/1328 Frequency estimation Count-min sketch: querying • For each hash function: – Hash query item to get index into corresponding array – Get the count at that position • Return the lowest of these counts This minimizes the effect of hash collisions. (Collisions can only cause over-counting, not under-counting)
  • 30. Approximate methods for scalable data mining l 08/03/1329 Frequency estimation Count-min sketch: querying 0200000000A3 0001001000A2 0000100010A1 “foo” h1 h2 h3 min(1, 1, 2) = 1 Caveat: You can’t iterate through the items, they’re not stored at all.
  • 31. Approximate methods for scalable data mining l 08/03/1330 Frequency estimation Count-min sketch Cool properties • Fast adding and querying in O(k) time • As with Bloom filter: more hashes = lower error • Mergeable by cellwise addition • Better accuracy for higher-frequency items (“heavy hitters”) • Can also be used to find quantiles
  • 32. Approximate methods for scalable data mining l 08/03/1331 Similarity search Which items are most similar to this one?
  • 33. Approximate methods for scalable data mining l 08/03/1332 Similarity search Naïve approach Nearest-neighbour search • For each stored item: – Compare to query item via appropriate distance metric* – Keep if closer than previous closest match • Distance metric calculation can be expensive – Especially if items are many-dimensional records • Can be slow even if data small enough to fit in memory *Cosine distance, Hamming distance, Jaccard distance etc.
  • 34. Approximate methods for scalable data mining l 08/03/1333 Similarity search Locality-sensitive hashing A probabilistic method for nearest-neighbour search Real-life example: Finding users with similar music tastes in an online radio service.
  • 35. Approximate methods for scalable data mining l 08/03/1334 Similarity search Locality-sensitive hashing Intuitive explanation Typical hash functions: • Similar inputs yield very different outputs Locality-sensitive hash functions: • Similar inputs yield similar or identical outputs So: Hash each item, then just compare hashes. – Can be used to pre-filter items before exact comparisons – You can also index the hashes for quick lookup
  • 36. Approximate methods for scalable data mining l 08/03/1335 Similarity search Locality-sensitive hashing: random hyperplanes Treat n-valued items as vectors in n-dimensional space. Draw k random hyperplanes in that space. For each hyperplane: Is each vector above it (1) or below it (0)? Item1 h1 h2 h3 Item2 Hash(Item1) = 011 Hash(Item2) = 001 In this example: n=2, k=3 As the cosine distance decreases, the probability of a hash match increases.
  • 37. Approximate methods for scalable data mining l 08/03/1336 Similarity search Locality-sensitive hashing: random hyperplanes Cool properties • Hamming distance between hashes approximates cosine distance • More hyperplanes (higher k) -> bigger hashes -> better estimates • Can use to narrow search space before exact nearest-neighbours • Various ways to combine sketches for better separation
  • 38. Approximate methods for scalable data mining l 08/03/1337 Similarity search Locality-sensitive hashing Other approaches • Bit sampling: approximates Hamming distance • MinHashing: approximates Jaccard distance • Random projection: approximates Euclidean distance
  • 39. Approximate methods for scalable data mining l 08/03/1338 Wrap-up
  • 40. Approximate methods for scalable data mining l 08/03/1339 http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
  • 41. Approximate methods for scalable data mining l 08/03/1340 Resources Ebook available free from: http://infolab.stanford.edu/~ullman/mmds.html stream-lib: https://github.com/clearspring/stream-lib Java libraries for cardinality, set membership, frequency and top-N items (not covered here) No canonical source of multiple LSH algorithms, but plenty of separate implementations Wikipedia is pretty good on these topics too