Copyright (c) 2014 Scale Unlimited.
Similarity at Scale
Fuzzy matching and recommendations
using Hadoop, Solr, and heuristics
Ken Krugler
Scale Unlimited
The Twitter Pitch
Wide class of problems that rely on "good" similarity
Fast
Accurate
Scalable
Benefit from my mistakes
Scale Unlimited - consulting & training
Talking about solutions to real problems
What are similarity problems?
Clustering
Grouping similar advertisers
Deduplication
Joining noisy sets of POI data
Recommendations
Suggesting pages to users
Entity resolution
Fuzzy matching of people and companies
What is "Similarity"?
Exact matching is easy(er)
Accuracy is a given
Fast and scalable can still be hard
Lots of key/value systems like Cassandra, HBase, etc.
Fuzzy matching is harder
Two "things" aren't exactly the same
Similarity is based on comparing features
Between two articles?
Features could be a bag of words
Are these two articles the same?
Article 1: Bosnia is the largest geographic region of the modern state with a moderate continental climate, marked by hot summers and cold, snowy winters.
Article 2: The inland is a geographically larger region and has a moderate continental climate, bookended by hot summers and cold and snowy winters.
What about now?
Easy to create challenging situations for a person
Which is an impossible problem for a computer
Need to distinguish between "conceptually similar" and "derived from"
Article 1: Bosnia is the largest geographic region of the modern state with a moderate continental climate, marked by hot summers and cold, snowy winters.
Article 2: Bosnia has a warm European climate, though the summers can be hot and the winters are often cold and wet.
Between two records?
Features could be field values
Are these two people the same?
Name: Bob Bogus / Robert Bogus
Address: 220 3rd Avenue / 220 3rd Avenue
City: Seattle / Seattle
State: WA / WA
Zip: 98104-2608 / 98104
What about now?
Need to get rid of false differences caused by abbreviations
How does a computer know what's a "significant" difference?
Name: Bob Bogus / Robert H. Bogus
Address: Apt 102, 3220 3rd Ave / 220 3rd Avenue South
City: Seattle / Seattle
State: Washington / WA
Zip: 98104
Between two users?
Features could be...
Items a user has bought
Are these two users the same?
[Figure: items bought by User 1 and User 2]
What about now?
Need more generic features
E.g. product categories
[Figure: items bought by User 1 and User 2]
How to measure similarity?
Assuming you have some features for two "things"
How does a program determine their degree of similarity?
You want a number that represents their "closeness"
Typically 1.0 means exactly the same
And 0.0 means completely different
Jaccard Coefficient
Ratio of the number of items in common to the total number of distinct items
Where "items" typically means unique values (sets of things)
So 1.0 is exactly the same, and 0.0 is completely different
$$\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
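As a minimal sketch of the ratio above, assuming the features of each "thing" are already available as a collection of values. The purchase histories are hypothetical, and treating two empty sets as identical is an assumption, not something the slides specify.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections, treated as sets of unique items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets count as exactly the same.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical purchase histories for two users.
user_1 = {"tent", "stove", "lantern", "boots"}
user_2 = {"stove", "boots", "kayak"}

# 2 items in common out of 5 distinct items overall.
print(jaccard(user_1, user_2))
```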
Cosine Similarity
Assume a document only has three unique words
cat, dog, goldfish
Set x = frequency of cat
Set y = frequency of dog
Set z = frequency of goldfish
The result is a "term vector" with 3 dimensions
Calculate cosine of angle between term vectors
This is their "cosine similarity"
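The cat/dog/goldfish example can be sketched as follows. This assumes whitespace tokenization and raw term frequencies; the two sample documents are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Map each unique word to its frequency (one dimension per word)."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * vec_b.get(term, 0) for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# x = cat, y = dog, z = goldfish
doc_a = term_vector("cat cat dog")        # (2, 1, 0)
doc_b = term_vector("cat dog goldfish")   # (1, 1, 1)
print(cosine_similarity(doc_a, doc_b))    # 3 / (sqrt(5) * sqrt(3))
```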
Why is scalability hard?
Assume you have 8.5 million businesses in the US
There are ≈ N^2/2 pairs to evaluate
That's 36 trillion comparisons
Sometimes you can quickly trim this problem
E.g. if you assume the ZIP code exists, and must match
Then this becomes about 4 billion comparisons
But often you don't have a "magic" field
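The arithmetic behind these numbers, as a rough sketch. The exact pair count is n(n-1)/2, which is close to N^2/2 for large N. The toy block sizes are hypothetical and only illustrate how a "magic" field such as ZIP code shrinks the problem.

```python
def pair_count(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocked_pair_count(block_sizes):
    """Pairs to compare when only records in the same block are compared."""
    return sum(pair_count(size) for size in block_sizes)


businesses = 8_500_000
print(f"{pair_count(businesses):,}")  # about 36 trillion

# Hypothetical toy case: 1,000 records split evenly across 10 blocks.
print(pair_count(1000), blocked_pair_count([100] * 10))
```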
Copyright (c) 2011-2014 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
DataStax Web Site Page Recommender
How to recommend pages?
Besides manually adding a bunch of links...
Which is tedious, doesn't scale well, and gets busy
Can we exploit other users?
Classic shopping cart analysis
"Users who bought X also bought Y"
Based on actual activity, versus (noisy, skewed) ratings
What's the general approach?
We have web logs with IP addresses, time, path to page
157.55.33.39 - - [18/Mar/2014:00:01:00 -0500] "GET /solutions/nosql HTTP/1.1"
A browsing session is a series of requests from one IP address
With some maximum time gap between requests
Find sessions "similar to" the current user's session
Recommend pages from these similar sessions
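Splitting a log into sessions could look roughly like this. The 30-minute maximum gap, the regular expression, and every log line except the first are assumptions made for illustration; the slides do not give the actual values used.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: IP, two ignored fields, [timestamp], "METHOD path ..."
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST|HEAD) (\S+)')


def parse_line(line):
    """Return (ip, timestamp, path), or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ip, stamp, path = match.groups()
    return ip, datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z"), path


def sessionize(lines, max_gap=timedelta(minutes=30)):
    """Group requests per IP, starting a new session after a long gap."""
    events = sorted(e for e in map(parse_line, lines) if e is not None)
    sessions, open_session, last_seen = [], {}, {}
    for ip, when, path in events:
        if ip in last_seen and when - last_seen[ip] <= max_gap:
            open_session[ip].append(path)
        else:
            open_session[ip] = [path]
            sessions.append(open_session[ip])
        last_seen[ip] = when
    return sessions


lines = [
    '157.55.33.39 - - [18/Mar/2014:00:01:00 -0500] "GET /solutions/nosql HTTP/1.1"',
    '157.55.33.39 - - [18/Mar/2014:00:05:00 -0500] "GET /products HTTP/1.1"',
    '157.55.33.39 - - [18/Mar/2014:02:00:00 -0500] "GET /company HTTP/1.1"',
]
print(sessionize(lines))
```

The last request arrives almost two hours after the previous one, so it starts a second session.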
How to find similar sessions?
Create a Lucene search index with one document per session
Each indexed document contains the page paths for one session
session-1 /path/to/page1, /path/to/page2, /path/to/page3
session-2 /path/to/pageX, /path/to/pageY
Search for paths from the current user's session
Why is this a search issue?
Solr (search in general) is all about similarity
Find documents similar to the words in my query
Cosine similarity is used to calculate similarity
Between the term vector for my query
and the term vector of each document
What's the algorithm?
Find sessions similar to the target (current user's) session
Calculate similarity between these sessions and the target session
Aggregate similarity scores for all paths from these sessions
Remove paths that are already in the target session
Recommend the highest scoring path(s)
Why do you sum similarities?
Give more weight to pages from sessions that are more similar
Pages from more similar sessions are assumed to be more interesting
[Figure: Venn diagrams comparing the pages (A-F) in Session 1 and Session 2 with the target session]
Session 1 vs Target Session: Jaccard = 0.4 (2 / 5)
Session 2 vs Target Session: Jaccard = 0.2 (1 / 5)
Page   Score
D      0.6 (0.4 + 0.2)
E      0.4
F      0.2
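The summing step can be sketched with plain sets. The session contents below are assumptions chosen to reproduce the Jaccard values and page scores shown above (target = A, B, C); Jaccard stands in here for whatever similarity the search engine actually returns.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard coefficient of two non-empty sets."""
    return len(a & b) / len(a | b)


def recommend(target, sessions, top_n=3):
    """Sum session similarities per page, skipping pages already viewed."""
    target = set(target)
    scores = defaultdict(float)
    for session in sessions:
        session = set(session)
        similarity = jaccard(target, session)
        if similarity == 0:
            continue
        for page in session - target:
            scores[page] += similarity
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


target = {"A", "B", "C"}
session_1 = {"B", "C", "D", "E"}   # shares 2 of 5 distinct pages with target
session_2 = {"C", "D", "F"}        # shares 1 of 5 distinct pages with target

for page, score in recommend(target, [session_1, session_2]):
    print(page, round(score, 2))
```

Page D appears in both similar sessions, so it collects both similarity scores and ranks first.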
What are some problems?
The classic problem is that we recommend "common" pages
E.g. if you haven't viewed the top-level page in your session
But this page is very common in most of the other sessions
So then it becomes one of the top recommended pages
But that generally stinks as a recommendation
Can RowSimilarityJob help?
Part of the Mahout open source project
Takes as input a table of users (one per row) with lists of items
Generates an item-item co-occurrence matrix
Values are weights calculated using log-likelihood ratio (LLR)
Unsurprising (common) items get low weights
If we run it on our data, where users = sessions and items = pages
We get a page-page co-occurrence matrix
         Page 1   Page 2   Page 3
Page 1            2.1      0.8
Page 2   2.1               4.5
Page 3   0.8      4.5
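To show why common pages get low weights, here is a sketch of the log-likelihood ratio for one pair of pages, using a standard entropy-based formulation of the G² statistic. It is not Mahout's code, and the session counts are hypothetical.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 table of session counts.

    k11: sessions with both pages    k12: page A without page B
    k21: page B without page A       k22: sessions with neither page
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


# Page B is in 90% of all sessions, with or without page A: unsurprising.
print(log_likelihood_ratio(90, 10, 810, 90))
# Pages A and B are both rare but almost always appear together: surprising.
print(log_likelihood_ratio(100, 5, 5, 10000))
```

The first pair co-occurs 90 times yet scores close to zero, because that is exactly what independence predicts for such a common page.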
How to use co-occurrence?
Convert the matrix into an index
Each row is one Lucene document
Drop any low-scoring entries
Create list of "related" pages
Search in Related Pages field
Using pages from current session
So Page 2 recommends Page 1 & 3
         Page 1   Page 2   Page 3
Page 1            2.1      0.8
Page 2   2.1               4.5
Page 3   0.8      4.5

Page     Related Pages
Page 1   Page 2
Page 2   Page 1, Page 3
Page 3   Page 2
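The steps above can be sketched with a dictionary standing in for the Lucene index. The cutoff of 1.0 is an assumed threshold that happens to drop the 0.8 entry, and counting overlapping pages is a stand-in for the relevance score a real search would compute.

```python
# Upper triangle of the co-occurrence matrix shown above.
cooccurrence = {
    ("Page 1", "Page 2"): 2.1,
    ("Page 1", "Page 3"): 0.8,
    ("Page 2", "Page 3"): 4.5,
}


def build_related_index(weights, min_weight):
    """One 'document' per page, holding its related pages above the cutoff."""
    index = {}
    for (page_a, page_b), weight in weights.items():
        index.setdefault(page_a, set())
        index.setdefault(page_b, set())
        if weight >= min_weight:
            index[page_a].add(page_b)
            index[page_b].add(page_a)
    return index


def recommend(index, session_pages):
    """Find documents whose related-pages field matches the session's pages."""
    session = set(session_pages)
    hits = {}
    for page, related in index.items():
        if page in session:
            continue  # already viewed in this session
        overlap = len(related & session)
        if overlap:
            hits[page] = overlap
    return sorted(hits, key=lambda page: (-hits[page], page))


index = build_related_index(cooccurrence, min_weight=1.0)
print(recommend(index, ["Page 2"]))
```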
EWS Entity Resolution
What is Early Warning?
Early Warning helps banks fight fraud
It's owned by the top 5 US banks
And gets data from 800+ financial institutions
So they have details on most US bank accounts
When somebody signs up for an account
They need to quickly match the person to "known entities"
And derive a risk score based on related account details
Why do they need similarity?
Assume you have information on 100s of millions of entities
Name(s), address(es), phone number(s), etc.
And often a unique ID (Social Security Number, EIN, etc.)
Why is this a similarity problem?
Data is noisy - typos, abbreviations, partial data
People lie - much fraud starts with opening an account using bad data
How does search help?
We can quickly build a list of candidate entities, using search
Query contains field data provided by the client bank
Significantly less than 1 second for 30 candidate entities
Then do more precise, sophisticated and CPU-intensive scoring
The end result is a ranked list of entities with similarity scores
Which then is used to look up account status, fraud cases, etc.
What's the data pipeline?
Incoming data is cleaned up/normalized in Hadoop
Simple things like space stripping
Also phone number formatting
ZIP+4 expansion into just ZIP plus full
Other normalization happens inside of Solr
This gets loaded into Cassandra tables
And automatically indexed by Solr, via DataStax Enterprise
Field   Value        Terms
ZIP+4   95014-2127   95014, 2127
Phone   4805551212   480, 5551212
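The two term-splitting examples above might be implemented along these lines. This is a sketch of the idea only, not the production pipeline; the function names and the handling of malformed input (returning no terms) are assumptions.

```python
import re


def zip_terms(raw):
    """Split a ZIP or ZIP+4 value into separately matchable terms."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 9:
        return [digits[:5], digits[5:]]
    if len(digits) == 5:
        return [digits]
    return []  # assumption: unusable values produce no terms


def phone_terms(raw):
    """Split a 10-digit US phone number into area code and local number."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        return []
    return [digits[:3], digits[3:]]


print(zip_terms("95014-2127"))    # ['95014', '2127']
print(phone_terms("4805551212"))  # ['480', '5551212']
```

Splitting the values this way lets a record that only has the 5-digit ZIP, or only the local phone number, still produce a partial match.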
What's the Solr setup?
Each field in the index has very specific analysis
Simple things like normalization
Synonym expansion for names, abbreviations
Split up fields so partial matches work
At query time we can weight the importance of each field
Which helps order the top N candidates similar to their real match scores
E.g. an SSN matching means much more than a first name matching
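The effect of per-field weights can be illustrated without a search engine. Everything here is hypothetical: the field names, the weights, the sample records, and the use of exact equality in place of Solr's analyzed matching. The default of 30 candidates echoes the candidate list size mentioned earlier.

```python
# Hypothetical weights: a matching SSN counts far more than a first name.
FIELD_WEIGHTS = {"ssn": 10.0, "last_name": 3.0, "zip": 2.0, "first_name": 1.0}


def candidate_score(query, candidate, weights=FIELD_WEIGHTS):
    """Sum the weights of the fields where query and candidate agree."""
    return sum(
        weight
        for field, weight in weights.items()
        if query.get(field) is not None and query.get(field) == candidate.get(field)
    )


def top_candidates(query, candidates, n=30):
    """Return the n best-scoring candidates, best first."""
    ranked = sorted(candidates, key=lambda c: candidate_score(query, c), reverse=True)
    return ranked[:n]


query = {"ssn": "000-00-0001", "first_name": "bob", "last_name": "bogus"}
candidates = [
    {"id": "e1", "ssn": "000-00-0002", "first_name": "bob", "last_name": "smith"},
    {"id": "e2", "ssn": "000-00-0001", "first_name": "robert", "last_name": "bogus"},
]
print([c["id"] for c in top_candidates(query, candidates)])
```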
Batch Similarity
Can we do batch similarity?
Search works well for real-time similarity
But batch processing at scale maxes out the search system
We can use two different techniques with Hadoop for batch
SimHash - good for text document similarity
Parallel Set-Similarity Joins - good for record similarity
What is SimHash?
Assume a document is a set of (unique) words
Calculate a hash for each word
Probability that the minimum hash is the same for two documents...
...is magically equal to the Jaccard Coefficient
Term         Hash
bosnia       78954874223
is           53466156768
the          5064199193
largest      3193621783
geographic   -5718349925
What is a SimHash workflow?
Calculate N hash values
Easy way is to use the N smallest hash values
Calculate number of matching hash values between doc pairs (M)
Then the Jaccard Coefficient is ≈ M/N
Only works if N is much smaller than # of unique words in docs
Implementation of this in cascading.utils open source project
https://github.com/ScaleUnlimited/cascading.utils
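A small sketch of this workflow, independent of the cascading.utils implementation. The minimum-hash scheme described here is commonly called MinHash in the literature. The hash function (MD5 truncated to 64 bits) is an assumption and will not reproduce the hash values shown earlier; the sample documents are hypothetical.

```python
import hashlib


def word_hash(word):
    """Stable 64-bit hash of a word (assumed hash function)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(text, n):
    """The n smallest hash values among the document's unique words."""
    hashes = {word_hash(word) for word in text.lower().split()}
    return set(sorted(hashes)[:n])


def estimated_jaccard(sig_a, sig_b, n):
    """M / N, where M is the number of hash values the signatures share."""
    return len(sig_a & sig_b) / n


N = 4
doc_a = "bosnia is the largest geographic region of modern state with"
doc_b = "bosnia has a warm european climate though summers can be hot"

print(estimated_jaccard(signature(doc_a, N), signature(doc_b, N), N))
```

With N this small the estimate is coarse; it only illustrates the mechanics. Signatures are fixed-size, so they are much cheaper to store and compare than full word sets.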
What is Set-Similarity Join?
Joining records in two sets that are "close enough"
aka "fuzzy join"
Requires generation of "tokens" from record field(s)
Typically words from text
Simple implementation has three phases
First calculate counts for each unique token value
Then output <token, record> for N most common tokens of each record
Group by token, compare records in each group
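The three phases can be sketched in memory as follows; on Hadoop each phase would be its own job. Token selection follows the wording above (the N most common tokens per record). The records, N = 2, the Jaccard threshold of 0.5, and the alphabetical tie-breaking are all assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a, b):
    return len(a & b) / len(a | b)


def fuzzy_join(records, n_tokens=2, threshold=0.5):
    token_sets = {rid: set(text.lower().split()) for rid, text in records.items()}

    # Phase 1: count how often each unique token occurs across all records.
    counts = Counter(t for tokens in token_sets.values() for t in tokens)

    # Phase 2: emit <token, record> for the N most common tokens of each record.
    groups = defaultdict(set)
    for rid, tokens in token_sets.items():
        ranked = sorted(tokens, key=lambda t: (-counts[t], t))
        for token in ranked[:n_tokens]:
            groups[token].add(rid)

    # Phase 3: group by token and compare the records within each group.
    seen, matches = set(), []
    for rids in groups.values():
        for a, b in combinations(sorted(rids), 2):
            if (a, b) in seen:
                continue  # the pair already met in another token's group
            seen.add((a, b))
            score = jaccard(token_sets[a], token_sets[b])
            if score >= threshold:
                matches.append((a, b, score))
    return sorted(matches)


records = {
    "r1": "bob bogus 220 3rd avenue seattle",
    "r2": "robert bogus 220 3rd avenue seattle",
    "r3": "alice smith 9 main street portland",
}
print(fuzzy_join(records))
```

Records r1 and r2 land in the same groups and are compared once; r3 shares no selected token with them, so it is never compared at all.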
How does fuzzy join work?
For two records to be "similar enough"...
They need to share one of their common tokens
Generalization of the ZIP code "magic field" approach
Basic implementation has a number of issues
Passing around copies of full record is inefficient
Too-common tokens create huge groups for comparison
Two records compared multiple times
Summary
The Net-Net
Similarity is a common requirement for many applications
Recommendations
Entity matching
Combining Hadoop with search is a powerful combination
Scalability
Performance
Flexibility
Questions?
Feel free to contact me
http://www.scaleunlimited.com/contact/
Take a look at Pat Ferrel's Hadoop + Solr recommender
http://github.com/pferrel/solr-recommender
Check out Mahout
http://mahout.apache.org
Read paper & code for fuzzyjoin project
http://asterix.ics.uci.edu/fuzzyjoin/
40

Mais conteúdo relacionado

Mais procurados

DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsDC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsChris Fregly
 
Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015Chris Fregly
 
Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Chris Fregly
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016 Chris Fregly
 
Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Chris Fregly
 
Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Chris Fregly
 
Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015Chris Fregly
 
Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Chris Fregly
 
Budapest Big Data Meetup Nov 26 2015
Budapest Big Data Meetup Nov 26 2015Budapest Big Data Meetup Nov 26 2015
Budapest Big Data Meetup Nov 26 2015Chris Fregly
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Chris Fregly
 
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Chris Fregly
 
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Chris Fregly
 
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...Chris Fregly
 
Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Chris Fregly
 
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Chris Fregly
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Chris Fregly
 
Augury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Augury and Omens Aside, Part 1:
 The Business Case for Apache MesosAugury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Augury and Omens Aside, Part 1:
 The Business Case for Apache MesosPaco Nathan
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 

Mais procurados (20)

DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsDC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
 
Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015
 
Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016
 
Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015
 
Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015
 
Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015
 
Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015
 
Budapest Big Data Meetup Nov 26 2015
Budapest Big Data Meetup Nov 26 2015Budapest Big Data Meetup Nov 26 2015
Budapest Big Data Meetup Nov 26 2015
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
 
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
 
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
 
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
 
Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015
 
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
 
Augury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Augury and Omens Aside, Part 1:
 The Business Case for Apache MesosAugury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Augury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Tech
TechTech
Tech
 
Sparksee overview
Sparksee overviewSparksee overview
Sparksee overview
 

Semelhante a Similarity at Scale

Similarity at scale
Similarity at scaleSimilarity at scale
Similarity at scaleKen Krugler
 
Respond to Discussion minimum 150 WordsThroughout my life I have.docx
Respond to Discussion minimum 150 WordsThroughout my life I have.docxRespond to Discussion minimum 150 WordsThroughout my life I have.docx
Respond to Discussion minimum 150 WordsThroughout my life I have.docxronak56
 
Calin Constantinov - Neo4j - Keyboards and Mice - Craiova 2016
Calin Constantinov - Neo4j - Keyboards and Mice - Craiova 2016Calin Constantinov - Neo4j - Keyboards and Mice - Craiova 2016
Calin Constantinov - Neo4j - Keyboards and Mice - Craiova 2016Calin Constantinov
 
Critical Friends Brief
Critical Friends BriefCritical Friends Brief
Critical Friends BriefNoel Hatch
 
Being data driven - our data journey
Being data driven - our data journeyBeing data driven - our data journey
Being data driven - our data journeyIstván Rechner
 
How to Build an Attribution Solution in 1 Day
How to Build an Attribution Solution in 1 DayHow to Build an Attribution Solution in 1 Day
How to Build an Attribution Solution in 1 DayPhillip Law
 
How to Build an Attribution Solution in 1 Day
How to Build an Attribution Solution in 1 DayHow to Build an Attribution Solution in 1 Day
How to Build an Attribution Solution in 1 DayPhillip Law
 
Big Data Science - hype?
Big Data Science - hype?Big Data Science - hype?
Big Data Science - hype?BalaBit
 
Giving Good Whiteboard
Giving Good WhiteboardGiving Good Whiteboard
Giving Good WhiteboardBill Branson
 
Recommendation engines : Matching items to users
Recommendation engines : Matching items to usersRecommendation engines : Matching items to users
Recommendation engines : Matching items to usersjobinwilson
 
Recommendation engines matching items to users
Recommendation engines matching items to usersRecommendation engines matching items to users
Recommendation engines matching items to usersFlytxt
 
mm-ADT: A Virtual Machine/An Economic Machine
mm-ADT: A Virtual Machine/An Economic Machinemm-ADT: A Virtual Machine/An Economic Machine
mm-ADT: A Virtual Machine/An Economic MachineMarko Rodriguez
 
Datapedia Analysis Report
Datapedia Analysis ReportDatapedia Analysis Report
Datapedia Analysis ReportAbanoub Amgad
 
Search Analytics for Fun and Profit
Search Analytics for Fun and ProfitSearch Analytics for Fun and Profit
Search Analytics for Fun and ProfitLouis Rosenfeld
 
Map Reduce amrp presentation
Map Reduce amrp presentationMap Reduce amrp presentation
Map Reduce amrp presentationrenjan131
 
Open source vs. open data
Open source vs. open dataOpen source vs. open data
Open source vs. open datadata publica
 
Nuts and bolts
Nuts and boltsNuts and bolts
Nuts and boltsNBER
 
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICSUSING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICSHCL Technologies
 

Semelhante a Similarity at Scale (20)

Similarity at scale
Similarity at scaleSimilarity at scale
Similarity at scale
 
AWS re:Invent Hackathon
AWS re:Invent HackathonAWS re:Invent Hackathon
AWS re:Invent Hackathon
 
Similarity at Scale

  • 1. Copyright (c) 2014 Scale Unlimited. Similarity at Scale: Fuzzy matching and recommendations using Hadoop, Solr, and heuristics. Ken Krugler, Scale Unlimited.
  • 2. The Twitter Pitch. Wide class of problems that rely on "good" similarity: fast, accurate, scalable. Benefit from my mistakes. Scale Unlimited: consulting & training. Talking about solutions to real problems.
  • 3. What are similarity problems? Clustering: grouping similar advertisers. Deduplication: joining noisy sets of POI data. Recommendations: suggesting pages to users. Entity resolution: fuzzy matching of people and companies.
  • 4. What is "Similarity"? Exact matching is easy(er): accuracy is a given, but fast and scalable can still be hard. Lots of key/value systems like Cassandra, HBase, etc. Fuzzy matching is harder: two "things" aren't exactly the same, so similarity is based on comparing features.
  • 5. Between two articles? Features could be a bag of words. Are these two articles the same? Article 1: "Bosnia is the largest geographic region of the modern state with a moderate continental climate, marked by hot summers and cold, snowy winters." Article 2: "The inland is a geographically larger region and has a moderate continental climate, bookended by hot summers and cold and snowy winters."
  • 6. What about now? Easy to create challenging situations for a person, which is an impossible problem for a computer. Need to distinguish between "conceptually similar" and "derived from". Article 1: "Bosnia is the largest geographic region of the modern state with a moderate continental climate, marked by hot summers and cold, snowy winters." Article 2: "Bosnia has a warm European climate, though the summers can be hot and the winters are often cold and wet."
  • 7. Between two records? Features could be field values. Are these two people the same? Name: Bob Bogus / Robert Bogus. Address: 220 3rd Avenue / 220 3rd Avenue. City: Seattle / Seattle. State: WA / WA. Zip: 98104-2608 / 98104.
  • 8. What about now? Need to get rid of false differences caused by abbreviations. How does a computer know what's a "significant" difference? Name: Bob Bogus / Robert H. Bogus. Address: Apt 102, 3220 3rd Ave / 220 3rd Avenue South. City: Seattle / Seattle. State: Washington / WA. Zip: 98104.
  • 9. Between two users? Features could be... items a user has bought. Are these two users the same? [Figure: User 1, User 2]
  • 10. What about now? Need more generic features, e.g. product categories. [Figure: User 1, User 2]
  • 11. How to measure similarity? Assuming you have some features for two "things", how does a program determine their degree of similarity? You want a number that represents their "closeness": typically 1.0 means exactly the same, and 0.0 means completely different.
  • 12. Jaccard Coefficient. Ratio of the number of items in common to the total number of items, where "items" typically means unique values (sets of things). So 1.0 is exactly the same, and 0.0 is completely different. Jaccard(A, B) = |A ∩ B| / |A ∪ B|
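The Jaccard coefficient on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `jaccard`, the choice to score two empty sets as 1.0, and the sample word lists are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections, treated as sets of unique values."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty feature sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical bag-of-words features for two short texts.
doc1 = "hot summers and cold snowy winters".split()
doc2 = "hot summers and cold wet winters".split()

# 5 shared words out of 7 unique words overall.
print(jaccard(doc1, doc2))
```

Because the inputs are converted to sets, repeated words do not change the score; only the presence or absence of each unique value matters.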
  • 13. Cosine Similarity. Assume a document only has three unique words: cat, dog, goldfish. Set x = frequency of cat, y = frequency of dog, z = frequency of goldfish. The result is a "term vector" with 3 dimensions. Calculate the cosine of the angle between term vectors; this is their "cosine similarity".
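As a rough sketch of the term-vector idea above, the following Python computes cosine similarity from raw term frequencies. The function name and the two sample documents are assumptions for illustration; real search engines such as Lucene add further weighting (for example by term rarity), which this sketch leaves out.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between the term-frequency vectors of two word lists."""
    tf_a, tf_b = Counter(words_a), Counter(words_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents over the slide's vocabulary (cat, dog, goldfish).
doc_a = ["cat", "cat", "dog"]       # term vector (x=2, y=1, z=0)
doc_b = ["cat", "dog", "goldfish"]  # term vector (x=1, y=1, z=1)

print(cosine_similarity(doc_a, doc_b))  # 3 / sqrt(15)
```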
  • 14. Why is scalability hard? Assume you have 8.5 million businesses in the US. There are ≈ N^2/2 pairs to evaluate; that's 36 trillion comparisons. Sometimes you can quickly trim this problem, e.g. if you assume the ZIP code exists and must match, then this becomes about 4 billion comparisons. But often you don't have a "magic" field.
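The arithmetic on this slide, and the idea of trimming comparisons with a field that must match, can be sketched as follows. The function names, the record layout, and the sample records are hypothetical; grouping by a required-match key is commonly called "blocking", a term the slide does not use.

```python
from collections import defaultdict
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


print(pair_count(8_500_000))  # 36124995750000, i.e. about 36 trillion


def candidate_pairs(records, key):
    """Yield only pairs of records that share the same value for `key`."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


# Hypothetical records; only the two sharing a ZIP code get compared.
records = [
    {"name": "Bob Bogus", "zip": "98104"},
    {"name": "Robert Bogus", "zip": "98104"},
    {"name": "Alice Example", "zip": "10001"},
]
pairs = list(candidate_pairs(records, "zip"))
print(len(pairs), "pair(s) instead of", pair_count(len(records)))
```

The savings depend on how evenly records spread across key values: a few very large blocks can still dominate the total number of comparisons.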
  • 15. Copyright (c) 2011-2014 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. DataStax Web Site Page Recommender
  • 16. How to recommend pages? Besides manually adding a bunch of links... which is tedious, doesn't scale well, and gets busy.
  • 17. Can we exploit other users? Classic shopping cart analysis: "Users who bought X also bought Y". Based on actual activity, versus (noisy, skewed) ratings.
  • 18. What's the general approach? We have web logs with IP addresses, time, and path to page, e.g. 157.55.33.39 - - [18/Mar/2014:00:01:00 -0500] "GET /solutions/nosql HTTP/1.1". A browsing session is a series of requests from one IP address, with some maximum time gap between requests. Find sessions "similar to" the current user's session, then recommend pages from these similar sessions.
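The session definition above (requests from one IP address, split at some maximum time gap) could be sketched like this. Only the first log line comes from the slide; the other two lines, the 30-minute gap, the regular expression, and the function names are assumptions for illustration. Parsing the month abbreviation with `%b` assumes an English or C locale.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed layout: ip, two ignored fields, [timestamp], "METHOD path ..."
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)')


def parse_line(line):
    """Return (ip, timestamp, path) for a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ip, stamp, path = match.groups()
    return ip, datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z"), path


def sessionize(lines, max_gap=timedelta(minutes=30)):
    """Group requests by IP, starting a new session after a gap longer than max_gap."""
    by_ip = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            ip, when, path = parsed
            by_ip[ip].append((when, path))
    sessions = []
    for requests in by_ip.values():
        requests.sort()
        current, last_time = [], None
        for when, path in requests:
            if last_time is not None and when - last_time > max_gap:
                sessions.append(current)
                current = []
            current.append(path)
            last_time = when
        sessions.append(current)
    return sessions


log_lines = [
    '157.55.33.39 - - [18/Mar/2014:00:01:00 -0500] "GET /solutions/nosql HTTP/1.1"',
    '157.55.33.39 - - [18/Mar/2014:00:05:00 -0500] "GET /docs HTTP/1.1"',      # hypothetical
    '157.55.33.39 - - [18/Mar/2014:02:00:00 -0500] "GET /download HTTP/1.1"',  # hypothetical
]
print(sessionize(log_lines))
```

With these three lines the first two requests form one session and the third, almost two hours later, starts a second one.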
  • 19. How to find similar sessions? Create a Lucene search index with one document per session. Each indexed document contains the page paths for one session, e.g. session-1: /path/to/page1, /path/to/page2, /path/to/page3; session-2: /path/to/pageX, /path/to/pageY. Search for paths from the current user's session.
  • 20. Why is this a search issue? Solr (search in general) is all about similarity: find documents similar to the words in my query. Cosine similarity is used to calculate similarity between the term vector for my query and the term vector of each document.
  • 21. What's the algorithm? (1) Find sessions similar to the target (current user's) session. (2) Calculate similarity between these sessions and the target session. (3) Aggregate similarity scores for all paths from these sessions. (4) Remove paths that are already in the target session. (5) Recommend the highest scoring path(s).
  • 22. Why do you sum similarities? Give more weight to pages from sessions that are more similar; pages from more similar sessions are assumed to be more interesting. Example: Session 1 vs. target session (pages A, B, C, D, E), Jaccard = 0.4 (2 / 5); Session 2 vs. target session (pages A, B, C, D, F), Jaccard = 0.2 (1 / 5). Page scores: D = 0.6 (0.4 + 0.2), E = 0.4, F = 0.2.
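The five-step algorithm and the summing of similarities can be sketched as below. This is a simplified stand-in, not the talk's implementation: it scans every session in memory instead of querying a Lucene index, and it uses Jaccard where Lucene would use its own scoring. The exact session memberships are an assumption; they are one assignment of pages A to F that reproduces the slide's Jaccard values and page scores.

```python
from collections import defaultdict


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def recommend(target, sessions, top_n=3):
    """Score unseen pages by summing the similarity of each session they appear in."""
    target = set(target)
    scores = defaultdict(float)
    for session in sessions:
        similarity = jaccard(target, session)
        if similarity == 0.0:
            continue  # dissimilar sessions contribute nothing
        for page in set(session) - target:  # skip pages the user has already seen
            scores[page] += similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# Assumed memberships consistent with the slide's numbers.
target = {"A", "B", "C"}
session_1 = {"B", "C", "D", "E"}  # shares 2 of 5 pages with the target: 0.4
session_2 = {"C", "D", "F"}       # shares 1 of 5 pages with the target: 0.2

for page, score in recommend(target, [session_1, session_2]):
    print(page, round(score, 2))
```

Page D ranks first because it appears in both similar sessions, so it collects both similarity scores.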
  • 27. What are some problems? The classic problem is that we recommend "common" pages. E.g. if you haven't viewed the top-level page in your session, but this page is very common in most of the other sessions, then it becomes one of the top recommended pages. But that generally stinks as a recommendation.
  • 35. Can RowSimilarityJob help? Part of the Mahout open source project. Takes as input a table of users (one per row) with lists of items, and generates an item-item co-occurrence matrix. Values are weights calculated using log-likelihood ratio (LLR), so unsurprising (common) items get low weights. If we run it on our data, where users = sessions and items = pages, we get a page-page co-occurrence matrix. Example matrix (symmetric): Page 1 / Page 2 = 2.1, Page 1 / Page 3 = 0.8, Page 2 / Page 3 = 4.5.
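To make the LLR weighting more concrete, here is a sketch of the standard log-likelihood ratio (G-test) for a 2x2 co-occurrence table, written in its entropy form. It is not taken from Mahout's source, and whether it matches RowSimilarityJob's output exactly is an assumption. The session counts in the example are invented to show why a very common page scores near zero.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy: N*log(N) minus the sum of k*log(k) over the counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table.

    k11: sessions with both pages     k12: sessions with page A only
    k21: sessions with page B only    k22: sessions with neither page
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    matrix = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - matrix))


# Hypothetical counts over 1000 sessions.
# Page A (100 sessions) vs. a home page seen in 900 sessions: the 90
# co-occurrences are exactly what chance predicts, so the weight is ~0.
print(llr(90, 10, 810, 90))
# Page A vs. page C (100 sessions each) co-occurring 80 times: far more
# than chance predicts, so the weight is large.
print(llr(80, 20, 20, 880))
```

This is the property the slide relies on: raw co-occurrence counts would favor the home page (90 co-occurrences versus 80), while LLR favors the pairing that is surprising.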
  • 44. How to use co-occurrence? Convert the matrix into an index: each row is one Lucene document. Drop any low-scoring entries and create a list of "related" pages for each row. Then search in the Related Pages field, using pages from the current session. Co-occurrence matrix (rows and columns are Page 1, Page 2, Page 3; the diagonal is empty): Page 1: -, 2.1, 0.8; Page 2: 2.1, -, 4.5; Page 3: 0.8, 4.5, -. Related Pages: Page 1: Page 2; Page 2: Page 1, Page 3; Page 3: Page 2. So Page 2 recommends Page 1 & 3.
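The steps on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a dictionary stands in for the Lucene index, the `min_score` cutoff of 1.0 is a made-up threshold, and results are ranked by a simple match count rather than by Lucene's scoring. The names `build_related_index` and `recommend` are hypothetical.

```python
# Co-occurrence scores from the slide; the diagonal (a page with itself) is omitted.
COOCCURRENCE = {
    "Page 1": {"Page 2": 2.1, "Page 3": 0.8},
    "Page 2": {"Page 1": 2.1, "Page 3": 4.5},
    "Page 3": {"Page 1": 0.8, "Page 2": 4.5},
}


def build_related_index(matrix, min_score):
    """One 'document' per matrix row: page -> related pages scoring >= min_score."""
    return {
        page: sorted(other for other, score in row.items() if score >= min_score)
        for page, row in matrix.items()
    }


def recommend(index, session_pages):
    """Return documents whose related-pages field mentions any session page."""
    session = set(session_pages)
    hits = {}
    for page, related in index.items():
        matches = len(session.intersection(related))
        if matches and page not in session:
            hits[page] = matches
    # More matching session pages first; ties are broken by page name.
    return sorted(hits, key=lambda page: (-hits[page], page))


# min_score=1.0 is an assumed cutoff that drops the 0.8 entries.
index = build_related_index(COOCCURRENCE, min_score=1.0)
print(index)
print(recommend(index, ["Page 2"]))
```

With the slide's numbers, a session containing Page 2 matches the documents for Page 1 and Page 3, which is the recommendation the slide describes.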
  • 45. EWS Entity Resolution
  • 46. What is Early Warning? Early Warning helps banks fight fraud. It's owned by the top 5 US banks and gets data from 800+ financial institutions, so they have details on most US bank accounts. When somebody signs up for an account, they need to quickly match the person to "known entities" and derive a risk score based on related account details.
  • 47. Why do they need similarity? Assume you have information on 100s of millions of entities: name(s), address(es), phone number(s), etc., and often a unique ID (Social Security Number, EIN, etc.). Why is this a similarity problem? Data is noisy: typos, abbreviations, partial data. People lie: much fraud starts with opening an account using bad data.
  • 48. How does search help? We can quickly build a list of candidate entities using search. The query contains field data provided by the client bank. Getting 30 candidate entities takes significantly less than 1 second. Then we do more precise, sophisticated and CPU-intensive scoring. The end result is a ranked list of entities with similarity scores, which is then used to look up account status, fraud cases, etc.
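A minimal sketch of the second stage, assuming the candidate list has already come back from search. The slide does not say how the "more precise" scoring works, so `difflib.SequenceMatcher` is used here purely as a stand-in, and the field names, weights, and the `rank_candidates` helper are all hypothetical.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """String similarity in [0, 1]; a missing value on either side scores 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_candidates(query, candidates, weights):
    """Re-score search candidates with a weighted average of per-field similarity."""
    total_weight = sum(weights.values())
    scored = []
    for candidate in candidates:
        score = sum(
            weight * field_similarity(query.get(field, ""), candidate.get(field, ""))
            for field, weight in weights.items()
        ) / total_weight
        scored.append((score, candidate["id"]))
    # Highest similarity first; ties are broken by entity id.
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Hypothetical query record and candidates (as if returned by the search stage).
query = {"name": "John Smith", "phone": "4805551212"}
candidates = [
    {"id": "e1", "name": "Jon Smith", "phone": "4805551212"},
    {"id": "e2", "name": "Jane Doe", "phone": "2125550000"},
    {"id": "e3", "name": "John Smith", "phone": "4805551212"},
]
weights = {"name": 1.0, "phone": 2.0}  # assumed field weights

ranked = rank_candidates(query, candidates, weights)
print(ranked)
```

The point of the split is cost: the expensive comparison runs against about 30 candidates instead of hundreds of millions of entities.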
  • 49. What's the data pipeline? Incoming data is cleaned up/normalized in Hadoop: simple things like space stripping, plus phone number formatting and ZIP+4 expansion into just ZIP plus full. Other normalization happens inside of Solr. The cleaned data gets loaded into Cassandra tables and is automatically indexed by Solr, via DataStax Enterprise. Example terms: ZIP+4 95014-2127 yields 95014, 2127; phone 4805551212 yields 480, 5551212.
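The term splitting in the slide's examples could look roughly like the following. This is a plain-Python illustration only; in the talk this work is done in Hadoop and Solr, and the function names and the handling of malformed input are assumptions.

```python
import re


def zip_terms(zip_code):
    """Split a ZIP or ZIP+4 into index terms, e.g. 95014-2127 -> 95014, 2127."""
    digits = re.sub(r"\D", "", zip_code)
    if len(digits) == 9:
        return [digits[:5], digits[5:]]
    if len(digits) == 5:
        return [digits]
    return []  # assumed behaviour: malformed values produce no terms


def phone_terms(phone):
    """Split a 10-digit US phone number into area code and local number."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        return []
    return [digits[:3], digits[3:]]


print(zip_terms("95014-2127"))
print(phone_terms("4805551212"))
```

Splitting a value into separate terms is what lets a record match when only part of the value agrees, for example the same 5-digit ZIP with a different or missing +4 suffix.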
  • 50. What's the Solr setup? Each field in the index has very specific analysis: simple things like normalization; synonym expansion for names and abbreviations; splitting up fields so partial matches work. At query time we can weight the importance of each field, which helps order the top N candidates similar to their real match scores. E.g. an SSN matching means much more than a first name matching.
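One way to express query-time field weighting is the `field:"value"^boost` form of the Lucene/Solr query syntax. The sketch below only builds the query string and request parameters; the field names and boost values are hypothetical, and the talk does not show the actual queries used.

```python
def build_candidate_query(record, boosts):
    """Build a boosted OR query from the fields present in the incoming record."""
    clauses = []
    for field, value in record.items():
        if not value or field not in boosts:
            continue
        # Escape backslashes and double quotes so the value is a valid quoted phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"^{boosts[field]}')
    return " OR ".join(clauses)


# Hypothetical field names and weights: an SSN match counts far more than a first name.
boosts = {"ssn": 10, "last_name": 3, "first_name": 1}
record = {"ssn": "123456789", "first_name": "John", "last_name": ""}

query = build_candidate_query(record, boosts)
params = {"q": query, "rows": 30}  # ask Solr for the top 30 candidates
print(params)
```

Empty fields are skipped, so partial data from the client bank still produces a usable query.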
  • 51. Batch Similarity
  • 52. Can we do batch similarity? Search works well for real-time similarity, but batch processing at scale maxes out the search system. We can use two different techniques with Hadoop for batch: SimHash, good for text document similarity; and Parallel Set-Similarity Joins, good for record similarity.
  • 53. What is SimHash? Assume a document is a set of (unique) words. Calculate a hash for each word. The probability that the minimum hash is the same for two documents is "magically" equal to the Jaccard Coefficient (this minimum-hash property is commonly known as MinHash). Example term hashes: bosnia: 78954874223; is: 53466156768; the: 5064199193; largest: 3193621783; geographic: -5718349925.
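The "magic" on this slide (the property commonly known as MinHash) can be stated compactly. The symbols are introduced here for illustration: A and B are the word sets of the two documents and h is the word hash, assumed to behave like a random ordering of the words.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
\Pr\!\left[\, \min_{w \in A} h(w) = \min_{w \in B} h(w) \,\right] = J(A, B)
```

The reasoning is short: under that assumption, every word in A ∪ B is equally likely to have the smallest hash, and the two minima agree exactly when that word belongs to both documents, that is, when it lies in A ∩ B.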
  • 54. What is a SimHash workflow? Calculate N hash values; an easy way is to use the N smallest hash values. Calculate the number of matching hash values between doc pairs (M). Then the Jaccard Coefficient is ≈ M/N. This only works if N is much smaller than the number of unique words in the docs. An implementation of this is in the cascading.utils open source project: https://github.com/ScaleUnlimited/cascading.utils
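A small sketch of this workflow, assuming whitespace-free word tokens and using MD5 only as a convenient deterministic hash. It is not the cascading.utils implementation; the function names and the choice of N are illustrative, and M/N is an approximation of the Jaccard Coefficient rather than an exact value.

```python
import hashlib


def word_hash(word):
    """Deterministic 64-bit hash of a word (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(words, n):
    """The n smallest hash values over the document's unique words."""
    return set(sorted({word_hash(w) for w in words})[:n])


def estimate_jaccard(sig_a, sig_b, n):
    """M/N, where M is the number of hash values the two signatures share."""
    return len(sig_a & sig_b) / n


N = 20  # assumed signature size, much smaller than the 200 unique words per doc
doc_a = [f"w{i}" for i in range(200)]
doc_b = [f"w{i}" for i in range(100, 300)]  # shares half of its words with doc_a
doc_c = [f"x{i}" for i in range(200)]       # shares no words with doc_a

same = estimate_jaccard(signature(doc_a, N), signature(doc_a, N), N)
overlap = estimate_jaccard(signature(doc_a, N), signature(doc_b, N), N)
disjoint = estimate_jaccard(signature(doc_a, N), signature(doc_c, N), N)
print(same, overlap, disjoint)
```

Only the N-value signatures need to be compared between document pairs, which is what makes the approach practical for batch jobs over large document sets.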
  • 55. What is Set-Similarity Join? Joining records in two sets that are "close enough", aka "fuzzy join". Requires generation of "tokens" from record field(s), typically words from text. A simple implementation has three phases: first, calculate counts for each unique token value; then output <token, record> for the N most common tokens of each record; finally, group by token and compare the records in each group.
  • 56. How does fuzzy join work? For two records to be "similar enough", they need to share one of their common tokens. This is a generalization of the ZIP code "magic field" approach. The basic implementation has a number of issues: passing around copies of the full record is inefficient; too-common tokens create huge groups for comparison; and two records can be compared multiple times.
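The three phases of the simple implementation can be sketched in memory as follows. This is an illustration, not the fuzzyjoin project's code: it is a self-join over a single record set, tokens are lowercase whitespace-separated words, and the similarity measure (Jaccard over token sets), `n_tokens`, and `threshold` are assumptions. It passes record ids rather than full records and remembers compared pairs, which touches two of the listed issues; it does nothing about too-common tokens.

```python
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(text):
    """Tokens are lowercase whitespace-separated words (an assumption)."""
    return set(text.lower().split())


def jaccard(a, b):
    return len(a & b) / len(a | b)


def fuzzy_join(records, n_tokens, threshold):
    tokens = {rid: tokenize(text) for rid, text in records.items()}

    # Phase 1: count how many records contain each unique token.
    counts = Counter(t for toks in tokens.values() for t in toks)

    # Phase 2: emit <token, record id> for the N most common tokens of each record.
    groups = defaultdict(list)
    for rid, toks in tokens.items():
        top = sorted(toks, key=lambda t: (-counts[t], t))[:n_tokens]
        for token in top:
            groups[token].append(rid)

    # Phase 3: group by token and compare the records in each group, once per pair.
    seen = set()
    matches = []
    for rids in groups.values():
        for a, b in combinations(sorted(rids), 2):
            if (a, b) in seen:
                continue
            seen.add((a, b))
            score = jaccard(tokens[a], tokens[b])
            if score >= threshold:
                matches.append((a, b, score))
    return sorted(matches)


# Hypothetical records keyed by id.
records = {
    1: "Acme Corp San Jose",
    2: "Acme Corporation San Jose",
    3: "Zebra Widgets Boston",
}
print(fuzzy_join(records, n_tokens=2, threshold=0.5))
```

In a Hadoop job the grouping in phase 3 would be done by the shuffle, with the token as the key, so each reducer only sees records that share that token.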
  • 57. Summary
  • 65. The Net-Net: Similarity is a common requirement for many applications (recommendations, entity matching). Combining Hadoop with search is powerful: it gives scalability, performance and flexibility.
  • 66. Questions? Feel free to contact me (http://www.scaleunlimited.com/contact/). Take a look at Pat Ferrel's Hadoop + Solr recommender (http://github.com/pferrel/solr-recommender). Check out Mahout (http://mahout.apache.org). Read the paper & code for the fuzzyjoin project (http://asterix.ics.uci.edu/fuzzyjoin/).