Going Beyond k-means
Developments in the ≈60 years since its publication
J Singh and Teresa Brooks
March 17, 2015
Hello Bulgaria
• A website with thousands of pages...
– Some pages identical to other pages
– Some pages nearly identical to other pages
• We want smart indexing of the collection
© DataThinks 2013-15
– Save just one copy of the duplicate pages
– Save one copy of the nearly duplicate pages
– Filter out similar documents when returning search results
• And we want to keep the index up to date
– Detect content changes quickly, possibly without reading
old copies from slow storage
The Naïve Way to Address this Challenge
• Represent each document as a dot in d-dimensional space
• Run a k-means algorithm on the document set
– Resulting in k clusters
• When presented with a new document
– Find the “nearest cluster”
– Find the documents within the nearest
cluster that are nearest to the document in
question
• Can be skipped if the cluster is small enough
• i.e., k is large enough that everything in the
cluster is close!
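The lookup step above can be sketched in a few lines. This is only an illustration under assumptions: the 2-d points, the `centroids` list, and the `clusters` mapping are made up, and the centroids are taken as given from an earlier k-means run.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_cluster(doc, centroids):
    """Index of the centroid closest to doc."""
    return min(range(len(centroids)), key=lambda i: euclidean(doc, centroids[i]))

def nearest_in_cluster(doc, members, top=3):
    """Members of one cluster, ranked by distance to doc."""
    return sorted(members, key=lambda m: euclidean(doc, m))[:top]

# Hypothetical 2-d "documents", already grouped by a prior k-means run.
centroids = [(0.0, 0.0), (10.0, 10.0)]
clusters = {
    0: [(0.5, 0.2), (-1.0, 0.3)],
    1: [(9.0, 11.0), (10.0, 9.5), (12.0, 12.0)],
}

new_doc = (9.5, 10.0)
c = nearest_cluster(new_doc, centroids)              # first find the nearest cluster
print(c, nearest_in_cluster(new_doc, clusters[c], top=2))  # then rank inside it
```

Note that every lookup still compares the new document against all k centroids, which is part of why a large k hurts.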
The Naïve Way has conceptual problems
• No good way to decide optimal k
• All documents have to be re-clustered if we want to
change k
• A document may “belong” to multiple clusters
• All clusters are roughly the same size
– In practice, this terrain is lumpy – some documents are
one-of-a-kind and others are similar to many others.
The Naïve Way has technical problems
• End result is subject to initial choice of centroids
– Leads to results not being repeatable
• Performance is O(nk) per iteration, or worse!
– Especially unfortunate because we want k to be large
• Algorithm is not easily adapted to map/reduce
– We need a pipeline of map/reduce jobs to compute it
Any Evolutionary Alternatives?
• Clustering has been picked over quite well
due to its combination of interesting math
and wide applicability
• Two dominant types have emerged:
– Hierarchical clustering
– Partitional clustering (e.g., k-means)
• k-Means Variations based on
– Choice of Initial Centroids
– Choice of k
– Parameters at each iteration
Another line of inquiry: Nearest Neighbor
• Based on partitioning the search space
– Quad Trees
– kd-Trees
– Locality-Sensitive Hashing
• Hash functions are locality-sensitive, if, for a
random hash function h, for any pair of points p,q :
– Pr[h(p)=h(q)] is “high” if p is “close” to q
– Pr[h(p)=h(q)] is “low” if p is “far” from q
More on Nearest Neighbor…
• Locality-Sensitive Hashing†
– Hash functions are locality-sensitive, if, for a random hash
function h, for any pair of points p,q we have:
• Pr[h(p)=h(q)] is “high” if p is “close” to q
• Pr[h(p)=h(q)] is “low” if p is “far” from q
†Indyk-Motwani’98
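The informal “high” and “low” can be made precise. One common way of writing it, in notation that is not on the slide (distance d, radius r, approximation factor c > 1, probabilities p_1 > p_2, hash family H), is:

```latex
% A family H is (r, cr, p_1, p_2)-sensitive if, for every pair of points p, q:
\[
\begin{aligned}
d(p,q) \le r  \;&\Longrightarrow\; \Pr_{h \in \mathcal{H}}[\,h(p) = h(q)\,] \ge p_1 \\
d(p,q) \ge cr \;&\Longrightarrow\; \Pr_{h \in \mathcal{H}}[\,h(p) = h(q)\,] \le p_2
\end{aligned}
\]
```

Pairs at distances between r and cr get no guarantee either way; that gap is where false positives and false negatives come from.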
The LSH Idea
• Treat items as vectors in d-dimensional space.
• Draw k random hyperplanes in that space.
• For each hyperplane:
– Is each vector on the (0) side of the hyperplane or the (1) side?
• Hash(Item1) = 000
• Hash(Item3) = 101
• Hashes each item into a number
• The magic is in choosing h1, h2, …
[Figure: items 1 to 7 plotted as points, separated by hyperplanes h1, h2, and h3]
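A minimal sketch of the hyperplane idea, under assumptions that are not on the slide: the hyperplanes pass through the origin, each is represented by a random Gaussian normal vector, and the side is read off the sign of a dot product. The function names are made up for illustration.

```python
import random

def random_hyperplanes(k, d, seed=0):
    """Draw k random normal vectors, one per hyperplane through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

def hyperplane_hash(vector, planes):
    """One bit per hyperplane: '1' on the positive side, '0' on the other."""
    bits = ""
    for normal in planes:
        dot = sum(v * n for v, n in zip(vector, normal))
        bits += "1" if dot >= 0 else "0"
    return bits

planes = random_hyperplanes(k=3, d=2)
item_a = [1.0, 0.9]
item_b = [2.0, 1.8]    # same direction as item_a: always lands on the same sides
item_c = [-1.0, -0.9]  # opposite direction: lands on the opposite sides

for name, item in [("a", item_a), ("b", item_b), ("c", item_c)]:
    print(name, hyperplane_hash(item, planes))
```

With k hyperplanes the hash is a k-bit number, so there are at most 2^k buckets. This particular family tracks the angle between vectors, not their Euclidean distance.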
The LSH Hash Code Idea…
• …Breaks d-dimensional space into proximity-polyhedra.
• Each purple block represents a document
– Each bucket represents a group of alike docs
• Docs within each bucket
still need to be compared
to see which ones are the
“closest”
A Brief History of LSH
• Origins at Stanford (1998)
• Continuing research in universities
– Stanford, MIT, Rutgers, Cornell, …
• Continuing research in Industry
– Intel, Microsoft, Google, …
• Textbook:
– A. Rajaraman and J. Ullman (2010). (http://goo.gl/8AJDgI)
• Our contribution:
– An extensible implementation for large datasets
Choosing hash functions
• Introducing minhash
1. Sample each document to get its “shingles” – small
fragments
• “Mary had a ” → “mary”, “ary ”, “ry h”, “y ha”, “ had”, …
• “CTAGTATAAA” → “CTAGTATA”, “TAGTATAA”, “AGTATAAA”
• “now is the time” → “now is”, “is the”, “the time”
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with different hash algorithms 199
more times to get a total of 200 minhash values.
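Steps 1 to 4 can be sketched as follows. The choices here are assumptions for illustration, not necessarily what OpenLSH does: 4-character shingles, and a family of 200 hash functions obtained by salting SHA-256 with a seed.

```python
import hashlib

def shingles(text, size=4):
    """Step 1: all overlapping character fragments of the given size."""
    text = text.lower()
    return {text[i:i + size] for i in range(len(text) - size + 1)}

def seeded_hash(shingle, seed):
    """One member of a hash family, selected by seed (salted SHA-256)."""
    digest = hashlib.sha256(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(text, num_hashes=200, size=4):
    """Steps 2-4: for each hash function, keep the minimum over all shingles.

    Raises ValueError if the text is shorter than one shingle.
    """
    grams = shingles(text, size)
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(num_hashes)]

print(sorted(shingles("Mary had a ")))
sig = minhash_signature("Mary had a little lamb")
print(len(sig), sig[:3])
```

The signature has a fixed length no matter how long the document is, which is what makes it cheap to store and compare.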
Interesting thing about minhashes
• The resulting minhashes are 200 integer values
representing a random selection of shingles.
– Property of minhashes: If the minhashes for two docs
are the same, their shingles are likely to be the same
– If the shingles for two docs are the same, the docs
themselves are likely to be the same
• Beware…
– Minhash is specific to a particular similarity measure –
Jaccard similarity
– Other hash families exist for other similarity measures
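The link to Jaccard similarity can be checked on a toy example: the fraction of minhash positions on which two documents agree is an estimate of the Jaccard similarity of their shingle sets. The sketch below assumes word-pair shingles and a salted SHA-256 hash family; both are illustrative choices.

```python
import hashlib

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

def signature(items, num_hashes=200):
    """Minhash signature of a set of shingles (salted SHA-256 as hash family)."""
    def h(item, seed):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return [min(h(x, seed) for x in items) for seed in range(num_hashes)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

doc_a = {"now is", "is the", "the time", "time for"}
doc_b = {"now is", "is the", "the time", "time to"}

print(jaccard(doc_a, doc_b))  # 3 shared shingles out of 5 distinct ones
print(estimated_similarity(signature(doc_a), signature(doc_b)))  # an estimate of the same quantity
```

The estimate gets tighter as the number of minhashes grows; with 200 values it is typically within a few percentage points of the true similarity.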
All 200 minhashes must match?
• If all minhashes match, it implies a strong similarity
between docs.
• To catch most cases with weaker similarity
– Don’t compare all minhashes at once, compare them in
bands. Candidate pairs are those that hash to the same
bucket for ≥ 1 band.
– Sometimes one band will reject a pair and another band
will consider it a candidate.
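A sketch of the banding step, using toy signatures of length 6 instead of 200. Assumptions for illustration: the bucket key is simply the band number plus the band's values, and an in-memory dictionary stands in for the bucket store.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, bands, rows):
    """Pairs of doc ids whose signatures agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].append(doc_id)   # same band + same values = same bucket
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

# Toy signatures: 6 minhashes, split into 3 bands of 2 rows each.
signatures = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 9, 9, 9, 9],   # agrees with "a" in band 0 only
    "c": [7, 7, 7, 7, 7, 7],   # agrees with nobody
}
print(candidate_pairs(signatures, bands=3, rows=2))
```

Documents "a" and "b" become a candidate pair even though bands 1 and 2 reject them; one matching band is enough.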
LSH Involves a Tradeoff
• Pick the number of minhashes, the number of bands, and
the number of rows per band to balance false
positives/negatives.
– False positives ⇒ need to examine more pairs that are not
really similar. More processing resources, more time.
– False negatives ⇒ failed to examine pairs that were similar,
didn’t find all similar results. But got done faster!
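The tradeoff can be put in numbers with a standard back-of-envelope model, assuming the minhash values behave independently: two docs with Jaccard similarity s agree on one whole band of r rows with probability s^r, so they become candidates in at least one of b bands with probability 1 − (1 − s^r)^b. The band and row counts below are example settings, not recommendations.

```python
def candidate_probability(s, bands, rows):
    """Chance that two docs with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

# Three ways to split 200 minhashes into bands x rows.
for bands, rows in [(20, 10), (40, 5), (100, 2)]:
    probs = [round(candidate_probability(s, bands, rows), 3) for s in (0.2, 0.5, 0.8)]
    print(f"bands={bands:3d} rows={rows:2d} -> P(candidate) at s=0.2, 0.5, 0.8: {probs}")
```

More rows per band makes each band stricter (fewer false positives, more false negatives); more bands gives each pair more chances to match (the reverse).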
Summary
• Mine the data and place
members into hash buckets
• When you need to find a
match, hash it and possible
nearest neighbors will be in
one of b buckets.
• Algorithm performance O(n)
Going Beyond k-means
Demo
J Singh and Teresa Brooks
March 17, 2015
Peerbelt Results Example
MUhNgZKlQ5qWKlSzlQ4auA
_UOkeHgLQn2HaLM5AdJBcw
fjw99kNgSBSNLXDjBijKIQ
6sp57Uq1TjCCb6ozoHlcEw
P2WtoTI0ROiwMu-KqcFmrg
f6aRJpmmQZWshgUJY6ddiA
o_CEANscQM-IC2VX-kk6Ag
mzfgajvrRNyJGYcr0d7i5g
Knvo0RsrRWeE-75QfeYRAQ
cllezWovQay6ZA1Ubxqbzw
oLXkAmQ5RIOM4svywmynbQ
TTWu2oleRcuHcNKqQqL_9Q
2is32hFhRACt-qAAg15eSQ
KnOpBza6TQO2lNHo45i08A
PebEFSHLQmGxI4aMAP-Pmw
T8TFg700R-WCACYPceCRfg
BnM7ETiXQAywiFYEenzGfw
q6DSgUlOTVuro67PY2zpOQ
YZP6Tk7ZTBKPZnLTSctZEQ
yoTVRL8jSDyJHtS-Vcgkgw
xhA8UNjOTBuDt-VRnMTTnw
BQNIVz_5TxSlXZMJYV9lhA
S6FG_NUaQU-UyIoez_k2zg
_KJHmfuzQtKiCGHVT45JPg
AEWdkJ3QTAiaFRwOsbTcsA
MsVBW3oKT6yZNP9J8-2jKw
c7bRvt-dQse7n4tmFkuQCQ
K4DcDglWS3OdUXTGqTX1LA
lWgrETQwQsmmTDitHstIiQ
eAOq-w3pRJq1T0mdEeYBJA
OfTond3JRjCmaNaHJc5pcw
Wv4BFePCR0SSvotcfTbI-A
62p0zfd2SZOhH0niF90QcA
AxNLgwmBS1uK-QivL3bKWw
BcYtpGdbTtazQCp7ez7nCw
UpOP24JMSJuP58TAHkvc4w
K7fSX7v0Qcy4PAbGl7ZFFw
Zwc1YB8SSeSrcALscMfDNQ
mpmoIZY6S4Si89wdEyX9IA
3YhvLB30QJiFQXBA1vIqsA
=-xm8tkdTRN6i18BkP-EF4Q
YQ9K2Ka2TGic_7FZFb7pJg
Database Architecture Requirements
• Need a very large range of bucket numbers
– Bucket Numbers in our implementation are -2^31 to +2^31-1
• Most buckets are empty
– Empty buckets must not take any space in the database
– Some buckets have a lot of documents in them; we need to
be able to locate all of them
• To find documents similar to a given document,
– Bucketize the document, then find other documents in the
same buckets
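These requirements amount to a sparse map from bucket number to documents. The sketch below uses an in-memory dictionary as a stand-in for the database; the `BucketStore` class and its method names are hypothetical and are not the OpenLSH API.

```python
from collections import defaultdict

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

class BucketStore:
    """Sparse bucket index: only non-empty buckets take any space."""

    def __init__(self):
        self._buckets = defaultdict(set)   # bucket number -> doc ids
        self._doc_buckets = {}             # doc id -> its bucket numbers

    def add(self, doc_id, bucket_numbers):
        """Bucketize a document: record it under each of its bucket numbers."""
        numbers = list(bucket_numbers)
        for b in numbers:
            if not INT32_MIN <= b <= INT32_MAX:
                raise ValueError(f"bucket number out of 32-bit range: {b}")
            self._buckets[b].add(doc_id)
        self._doc_buckets[doc_id] = set(numbers)

    def similar_to(self, doc_id):
        """All other documents that share at least one bucket with doc_id."""
        found = set()
        for b in self._doc_buckets.get(doc_id, ()):
            found |= self._buckets.get(b, set())
        found.discard(doc_id)
        return found

    def bucket_count(self):
        """Number of non-empty buckets actually stored."""
        return len(self._buckets)

store = BucketStore()
store.add("doc1", [5, -7])
store.add("doc2", [5, 100])
store.add("doc3", [42])
print(store.similar_to("doc1"), store.bucket_count())
```

Out of roughly four billion possible bucket numbers, only the four that are in use occupy any storage here.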
Implementation: OpenLSH
• We started OpenLSH to provide a framework for LSH
• Factor out the database
– Started on Google App Engine
– Virtualized interface to make it work on Cassandra
• Factor out the calculation engine
– Started on Google App Engine
– Can plug in Google MapReduce
– Ported to run in Batch mode on Cassandra
Using OpenLSH
• We’re looking for one or two interesting use cases
– Application areas:
• Near de-duplication (covered with Peerbelt’s data)
• Stocks that move independently of the herd
• Filtering “unique stories” from the News
• Contact us to discuss
What you can do
• For more information: http://openlsh.datathinks.org/
– Links to code and data set are included
• Run on App Engine
– Minimum setup required
• Adapt it to your environment and need
• If you need help, send email or create a GitHub issue.
• Send us a pull request for any improvements you make.
Thank you
• J Singh
– Principal, DataThinks
• Algorithms for big data
• @datathinks, @singh_j
• j . singh @ datathinks . org
– Adj. Prof, Computer Science, WPI
• Teresa Brooks
– Senior Software Engineer @ Xero
• teresa.brooks@xero.com
• @VaderGirl13

OpenLSH - a framework for locality sensitive hashing

  • 1. Going Beyond k-meansGoing Beyond k-means Developments in the ≈60 years since its publication J Singh and Teresa Brooks March 17, 2015
  • 2. Hello Bulgaria • A website with thousands of pages... – Some pages identical to other pages – Some pages nearly identical to other pages • We want smart indexing of the collection 2 © DataThinks 2013-15 2 • We want smart indexing of the collection – Save just one copy of the duplicate pages – Save one copy of the nearly duplicate pages – Filter out similar documents when returning search results • And we want to keep the index up to date – Detect content changes quickly, possibly without reading old copies from a slow storage
  • 3. The Naïve Way to Address this Challenge • Represent each document as a dot in d-dimensional space • Run a k-means algorithm on the document set – Resulting in k clusters • When presented with a new document 3 © DataThinks 2013-15 3 • When presented with a new document – Find the “nearest cluster” – Find the documents within the nearest cluster that are nearest to the document in question • Can be skipped if the cluster is small enough • i.e., k is large enough that everything in the cluster is close!
  • 4. The Naïve Way has conceptual problems • No good way to decide optimal k • All documents have to be re-clustered if we want to change k • A document may “belong” to multiple clusters • All clusters are roughly the same size 4 © DataThinks 2013-15 4 • All clusters are roughly the same size – In practice, this terrain is lumpy – some documents are one-of-a-kind and others are similar to many others.
  • 5. The Naïve Way has technical problems • End result is subject to initial choice of centroids – Leads to results not being repeatable • Performance is O(nk), or worse! – Especially unfortunate because we want k to be large • Algorithm is not easily adapted to map/reduce 5 © DataThinks 2013-15 5 • Algorithm is not easily adapted to map/reduce – We need a pipeline of map/reduce jobs to compute it
  • 6. Any Evolutionary Alternatives? • Clustering has been picked over quite well due to its combination of interesting math and wide applicability • Two dominant types have emerged: – Hierarchical clustering 6 © DataThinks 2013-15 6 – Hierarchical clustering – Partitional clustering (e.g., k-means) • k-Means Variations based on – Choice of Initial Centroids – Choice of k – Parameters at each iteration
  • 7. Another line of inquiry: Nearest Neighbor • Based on partitioning the search space – Quad Trees – kd-Trees 7 © DataThinks 2013-15 7 – Locality-Sensitive Hashing • Hash functions are locality-sensitive, if, for a random hash function h, for any pair of points p,q : – Pr[h(p)=h(q)] is “high” if p is “close” to q – Pr[h(p)=h(q)] is “low” if p is “far” from q
  • 8. More on Nearest Neighbor… • Locality-Sensitive Hashing† – Hash functions are locality-sensitive, if, for a random hash random function h, for any pair of points p,q we have: • Pr[h(p)=h(q)] is “high” if p is “close” to q • Pr[h(p)=h(q)] is “low” if p is”far” from q 8 © DataThinks 2013-15 8 †Indyk-Motwani’98
  • 9. The LSH Idea • Treat items as vectors in d- dimensional space. • Draw k random hyper-planes in that space. • For each hyper-plane: 4 5 9 © DataThinks 2013-15 9 – Is each vector on the (0) side of the hyperplane or the (1) side? • Hash(Item1) = 000 • Hash(Item3) = 101 • Hashes each item into a number • The magic is in choosing h1, h2, … 2 13 6 7 h3 h1 h2
  • 10. The LSH Hash Code Idea… • …Breaks d-dimensional space into proximity-polyhedra. • Each purple block represents a document Buckets 10 © DataThinks 2013-15 represents a document – Each Bucket represents a group of alike docs • Docs within each bucket still need to be compared to see which ones are the “closest”
  • 11. A Brief History of LSH • Origins at Stanford (1998) • Continuing research in universities – Stanford, MIT, Rutgers, Cornell, … • Continuing research in Industry – Intel, Microsoft, Google, … 11 © DataThinks 2013-15 11 – Intel, Microsoft, Google, … • Textbook: – A. Rajaraman and J. Ullman (2010). (http://goo.gl/8AJDgI) • Our contribution: – An extensible implementation for large datasets
  • 12. Choosing hash functions • Introducing minhash 1. Sample each document to get its “shingles” – small fragments • “Mary had a “ “mary”, “ary “, “ry h”, “y ha”, “ had”, … • “CTAGTATAAA” “CTAGTATA”, “TAGTATAA”, “AGTATAAA”, • “now is the time” “now is”, “is the”, “the time” 12 © DataThinks 2013-15 12 • “now is the time” “now is”, “is the”, “the time” 2. Calculate the hash value for every shingle. 3. Store the minimum hash value found in step 2. 4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 minhash values.
  • 13. Interesting thing about minhashes • The resulting minhashes are 200 integer values representing a random selection of shingles. – Property of minhashes: If the minhashes for two docs are the same, their shingles are likely to be the same – If the shingles for two docs are the same, the docs themselves are likely to be the same • Beware… – Minhash is specific to a particular similarity measure: Jaccard similarity – Other hash families exist for other similarity measures
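The link between minhashes and Jaccard similarity mentioned on this slide can be written as a formula. The symbols are assumptions introduced here: S_A and S_B are the shingle sets of two docs, h_min is a single minhash computed with an idealized random hash function, and J is the Jaccard similarity.

```latex
\Pr\left[\, h_{\min}(S_A) = h_{\min}(S_B) \,\right]
  \;=\; J(S_A, S_B)
  \;=\; \frac{|S_A \cap S_B|}{|S_A \cup S_B|}
```

Under that assumption, the fraction of the 200 minhashes on which two docs agree is an estimate of the Jaccard similarity of their shingle sets.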
  • 14. All 200 minhashes must match? • If all minhashes match, it implies a strong similarity between docs. • To catch most cases with weaker similarity – Don’t compare all minhashes at once; compare them in bands. Candidate pairs are those that hash to the same bucket for ≥ 1 band. – Sometimes one band will reject a pair and another band will consider it a candidate.
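A small sketch of the banding step follows. It is illustrative only: signatures are shortened to 6 values for readability, a band's bucket is represented directly by the tuple of its values rather than by a hashed bucket number, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def band_keys(signature, bands):
    """Split a minhash signature into equal bands; each (band index, values) pair is a bucket key."""
    rows = len(signature) // bands
    if rows * bands != len(signature):
        raise ValueError("signature length must be divisible by the number of bands")
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


def candidate_pairs(signatures, bands):
    """Return pairs of doc ids that share a bucket in at least one band."""
    buckets = defaultdict(set)
    for doc_id, signature in signatures.items():
        for key in band_keys(signature, bands):
            buckets[key].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs


signatures = {
    "doc1": [1, 2, 3, 4, 5, 6],
    "doc2": [1, 2, 9, 9, 5, 6],   # differs from doc1 only in the middle band
    "doc3": [7, 7, 7, 7, 7, 7],
}
print(candidate_pairs(signatures, bands=3))
```

Here the middle band rejects the doc1/doc2 pair while the first and last bands accept it, which is the situation described in the last bullet. With a single band covering the whole signature, the same pair would be missed.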
  • 15. LSH Involves a Tradeoff • Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives. – False positives ⇒ need to examine more pairs that are not really similar. More processing resources, more time. – False negatives ⇒ failed to examine pairs that were similar, didn’t find all similar results. But got done faster!
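The tradeoff can be made concrete with the standard banding analysis: with b bands of r rows each, a pair whose Jaccard similarity is s becomes a candidate with probability 1 - (1 - s^r)^b. The sketch below evaluates that expression for a few ways of splitting 200 minhashes; the particular (b, r) choices and the function name are illustrative assumptions, not settings taken from the slides.

```python
def candidate_probability(similarity, bands, rows):
    """Probability that a pair with the given Jaccard similarity shares a bucket in >= 1 band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


# Three ways to split 200 minhashes into bands * rows (illustrative choices).
for bands, rows in [(100, 2), (50, 4), (20, 10)]:
    line = ", ".join(
        f"s={s:.1f}: {candidate_probability(s, bands, rows):.3f}"
        for s in (0.2, 0.5, 0.8)
    )
    print(f"b={bands:3d} r={rows:2d} -> {line}")
```

More bands with fewer rows make weakly similar pairs more likely to become candidates (more false positives); fewer bands with more rows do the opposite (more false negatives).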
  • 16. Summary • Mine the data and place members into hash buckets • When you need to find a match, hash it and possible nearest neighbors will be in one of b buckets. • Algorithm performance O(n)
  • 17. Going Beyond k-means Demo J Singh and Teresa Brooks March 17, 2015
  • 18. Peerbelt Results Example [Screenshot: list of result document IDs; contents not recoverable from the transcript]
  • 19. Database Architecture Requirements • Need a very large range of bucket numbers – Bucket numbers in our implementation are −2^31 to +2^31 − 1 • Most buckets are empty – Empty buckets must not take any space in the database – Some buckets have a lot of documents in them; we need to be able to locate all of them • To find documents similar to a given document, – Bucketize the document, then find other documents in the same buckets
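These requirements can be illustrated with an in-memory stand-in for the database. This is a sketch under assumptions, not the OpenLSH schema: `bucket_number` and `SparseBucketStore` are hypothetical names, CRC-32 is used only as a convenient way to get a signed 32-bit bucket number, and a Python dict plays the role of a sparse table in which empty buckets occupy no space.

```python
import zlib
from collections import defaultdict


def bucket_number(band_index, band_values):
    """Map one band of a signature to a signed 32-bit bucket number (hypothetical scheme)."""
    raw = zlib.crc32(repr((band_index, tuple(band_values))).encode("utf-8"))
    return raw - 2**32 if raw >= 2**31 else raw


class SparseBucketStore:
    """Only non-empty buckets are stored; each maps a bucket number to its doc ids."""

    def __init__(self):
        self._buckets = defaultdict(set)

    def add(self, doc_id, bucket_numbers):
        for number in bucket_numbers:
            self._buckets[number].add(doc_id)

    def similar_to(self, doc_id, bucket_numbers):
        """Bucketize, then collect every other document found in the same buckets."""
        found = set()
        for number in bucket_numbers:
            found |= self._buckets.get(number, set())
        found.discard(doc_id)
        return found


store = SparseBucketStore()
doc1 = [bucket_number(0, [1, 2]), bucket_number(1, [3, 4])]
doc2 = [bucket_number(0, [1, 2]), bucket_number(1, [9, 9])]
store.add("doc1", doc1)
store.add("doc2", doc2)
print(store.similar_to("doc1", doc1))
```

In a real deployment the dict would be replaced by a keyed table in the datastore, with the bucket number as the key so that all documents in a bucket can be located with one lookup.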
  • 20. Implementation: OpenLSH • We started OpenLSH to provide a framework for LSH • Factor out the database – Started on Google App Engine – Virtualized interface to make it work on Cassandra • Factor out the calculation engine – Started on Google App Engine – Can plug in Google MapReduce – Ported to run in batch mode on Cassandra
  • 21. Using OpenLSH • We’re looking for one or two interesting use cases – Application areas: • Near de-duplication (covered with Peerbelt’s data) • Stocks that move independently of the herd • Filtering “unique stories” from the news • Contact us to discuss
  • 22. What you can do • For more information: http://openlsh.datathinks.org/ – Links to code and data set are included • Run on App Engine – Minimum setup required • Adapt it to your environment and needs • If you need help, send email or create a GitHub issue. • Send us a pull request for any improvements you make.
  • 23. Thank you • J Singh – Principal, DataThinks • Algorithms for big data • @datathinks, @singh_j • j . singh @ datathinks . org – Adj. Prof, Computer Science, WPI • Teresa Brooks – Senior Software Engineer @ Xero • teresa.brooks@xero.com • @VaderGirl13