Building a virtual data assistant
Kavitha Srinivas

kavitha@rivetlabs.io
Poster presentation at the CAIM Meetup, NYC, Feb 15, 2018
Why do we need one?
A typical business user MANUALLY:

1. Aggregates data across different datasets 

2. Applies business logic to the data

3. Shares a filtered view of it with other co-workers

Rinse and repeat as source data changes.

The process is tedious and error-prone.
Example today
Gets weekly usage of service:

usage       organization
48 hours    ABC news
24 hours    CBS
20 hours    NBC

Merges with data from CRM:

organization     salesperson
ABC News Corp    John Doe
CBS news         Mary Peters
NBC Studios      Bob Market

Computes revenue:

organization    usage       terms           monthly revenue
ABC News        48 hours    2.0 per hour    usage * terms ….
CBS             24 hours    1.0 per hour    usage * terms…

John in Finance:

• Merges data, where merge is often hard and needs fuzzy matching
• Applies a lot of business logic with cut/paste
• Has no access to functions such as revenue forecasting, spatial distance analysis, etc.
• Has no knowledge of what has been done by others with the same data
• Repeats the process when source data changes
A virtual data assistant

usage       terms           organization
48 hours    2.0 per hour    ABC news
24 hours    1.0 per hour    CBS
20 hours    2.0 per hour    NBC

organization     salesperson
ABC News Corp    John Doe
CBS news         Mary Peters
NBC Studios      Bob Market

organization    salesperson    usage       terms           monthly revenue
ABC News        John Doe       48 hours    2.0 per hour    usage * terms ….
CBS             Mary Peters    24 hours    1.0 per hour    usage * terms…

• Recommends higher level functions such as time series forecasting based on data characteristics
• Notices semantic types of data, handles merges appropriately
• Suggests prior user actions on similar data, applies business logic to new data
• Recommends functions based on other users’ use of similar data
Technology to make this possible…
• Semantic classifiers to classify data to power
recommendations (e.g., identify time series data).

• Algorithms to help with those difficult merges (fuzzy
joins) across columns (e.g. Ted Williams with T.
Williams) based on their semantic type.

• Algorithms to help users define complex business logic
(e.g., synthesis of SQL expressions from examples
provided by users).

• Algorithms to compute similarity across different
datasets.
Focus of poster: fuzzy joins.
Fuzzy join algorithms
Standard approach for fuzzy joins: string matching algorithms.

Alternatively use specific rules to match entities. E.g., 

Ted is a short name for Theodore

Ted Kennedy -> Ted is a first name, Kennedy is a last name

Ted Kennedy is the same as Theodore Kennedy…

Can deep learning be used to ‘learn’ the rules for specific entity types? This would be more scalable than having to code entity-specific rules.
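To make the two conventional options concrete, here is a minimal sketch of a string matching score and of a hand-coded nickname rule. It uses Python's standard `difflib`; the `NICKNAMES` table and the function names are hypothetical and only illustrate the idea, they are not taken from the poster's code.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two case-folded strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical nickname table; a real rule set would need many more entries,
# which is the scalability problem the poster points out.
NICKNAMES = {"ted": "theodore"}

def expand_nicknames(name: str) -> str:
    """Replace each known short name with its long form."""
    return " ".join(NICKNAMES.get(word, word) for word in name.lower().split())

def rule_match(a: str, b: str) -> bool:
    """Entity-specific rule: two names match if they agree after expansion."""
    return expand_nicknames(a) == expand_nicknames(b)

score = similarity("Ted Kennedy", "Theodore Kennedy")   # below 1.0
same = rule_match("Ted Kennedy", "Theodore Kennedy")    # True
```

String matching alone only gives a score that needs a threshold, while the rule gives an exact answer but only for nicknames someone has already listed.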
Fuzzy join Approach 1
Standard approach to avoid N² comparisons of columnar values is to perform blocking.

To join:

organization       organization
Bob Knuth          Ruth Knuth
Ruth Rutherford    Eric Satie
Robert Bailey      Marion Bailey
                   Robert Knuth

Block on cells having at least one word in common:

• Bob Knuth, Ruth Knuth, Robert Knuth
• Ruth Rutherford, Ruth Knuth
• Marion Bailey, Robert Bailey

Compare items within a block (Bob Knuth, Ruth Knuth, Robert Knuth): Bob Knuth = Ruth Knuth?

Build a deep learning model for comparison within a block.
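The blocking step can be sketched as an inverted index from words to cell values. This is an illustrative in-memory version under my own assumptions (whitespace tokenization, case folding, a hypothetical function name); the poster says its blocking was implemented in the database.

```python
from collections import defaultdict

def block_on_shared_words(left, right):
    """Return candidate (left, right) pairs that share at least one word."""
    index = defaultdict(set)  # word -> right-column values containing it
    for value in right:
        for word in value.lower().split():
            index[word].add(value)

    candidates = set()
    for value in left:
        for word in value.lower().split():
            # Only values indexed under a shared word become candidates.
            for other in index.get(word, ()):
                candidates.add((value, other))
    return candidates

left = ["Bob Knuth", "Ruth Rutherford", "Robert Bailey"]
right = ["Ruth Knuth", "Eric Satie", "Marion Bailey", "Robert Knuth"]
pairs = block_on_shared_words(left, right)
```

On the two example columns this yields 5 candidate pairs instead of the 12 produced by comparing every cell with every other cell, and "Eric Satie" is never compared at all.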
Fuzzy Join Approach 1
Similar or not? Estimate distance.

Use a “Siamese network” architecture to map same entities closer in vector space, maximize distance to different entities.
http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf

[Figure: inputs X1 and X2 feed net1 and net2, with shared weights across net1 and net2.]

Network learns a function that maps entities into a vector space where vectors for positive pairs are mapped closer than vectors for negative pairs.

Positive pair: Cal Christensen, Calvin L. Christensen
Negative pair: Cal Christensen, Cal Hubbard

[Figure: vector space with Cal Christensen, Calvin L. Christensen, and Cal Hubbard.]
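The paper linked on this slide trains such a network with a contrastive loss. A minimal sketch of that loss on two already-computed embedding vectors follows; the margin value and function names are assumptions, and the poster does not state which exact loss its model used.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, same, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    Positive pairs (same=True) are penalized by their squared distance.
    Negative pairs are penalized only while they sit closer than `margin`
    (an assumed hyperparameter), which pushes them apart.
    """
    d = euclidean(u, v)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

Because both inputs pass through the same weights, minimizing this loss over many pairs shapes a single embedding function rather than two separate ones.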
Results of Approach 1

[Figure: Siamese network. Each of the two towers carries the labels character embedding, …, dense layer (twice), and dropout; the inputs shown are Bob Knuth and Robert Alan Knuth; the output is a distance estimate for positive and negative pairs.]

183027 positive pairs + 183027 negative pairs of people’s names drawn from DBpedia. Negative pairs always shared at least 1 name word in common, with blocking implemented in the database.

F score: .89 (~9K test pairs) vs .70 word matching baseline

Example pairs and distances (the slide labels some as “Examples where the model discriminates well” and others as “Examples where the model has trouble”):

'sergei ivanov' 'sergei borisovich ivanov' - 0.059
'sergei ivanov' 'sergei anatolevich bozhenov' - 1.013
'joseph josiah dodd' 'j j dodd' - 0.109
'joseph josiah dodd' 'joseph' - 1.127
'sir richard stydolph 1st baronet' 'richard stydolph' - 0.157
'sir richard stydolph 1st baronet' 'sir harry chilcott' - 1.409
'nahum sokolow' 'nachum sokolov' - 0.666
'nahum sokolow' 'nahum gutman' - 0.531
'robin chan' 'rabin sophonpanich' - 0.199
'robin chan' 'eric robin bell' - 1.259
'richard russell' 'richard b russell jr' - 0.275
'richard russell' 'richard paul wesley cresswell' - 2.793
'kiyoshi kawasaki' 'kawasaki kiyoshi' - 0.29039219
'kiyoshi kawasaki' 'matsura kiyoshi' - 0.40613374
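One way to turn distance estimates into an F score is to call a pair a match when its distance falls below a threshold. The sketch below assumes that decision rule; the threshold, the toy distances, and the function name are hypothetical, and the poster does not say how its .89 was computed.

```python
def f_score(distances, labels, threshold):
    """F1 score when a pair is predicted a match if distance < threshold."""
    predicted = [d < threshold for d in distances]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data: three true matches and two non-matches (made-up values).
toy_distances = [0.1, 0.2, 0.9, 0.3, 1.5]
toy_labels = [True, True, True, False, False]
toy_f = f_score(toy_distances, toy_labels, threshold=0.5)
```

In practice the threshold would be tuned on held-out pairs, since it trades precision against recall.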
Are we done?
Fuzzy join approach 1 is clearly promising: it is more scalable and more effective than a rule-based approach.

But… we can do better
Fuzzy Join Approach 2
Do away with blocking entirely but still eliminate N² comparisons of values.

1. Build a siamese network for a certain entity type.

2. Perform a linear pass over the column data and get vector embeddings of the last layer.

3. Find nearest neighbors using an approximate nearest neighbors (ANN) algorithm to find elements to join with.

If the network computes a vector embedding where positive pairs are closer than negative pairs, then finding the nearest neighbors directly from vector embeddings should specify the elements to join with.

Should be more efficient than blocking because blocking is ineffective for common words (e.g. John).

[Figure: vector space with Cal Christensen, Calvin L. Christensen, and Cal Hubbard; Cal Christensen and Calvin L. Christensen are marked “Only join these 2”.]
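The join-by-nearest-neighbor idea can be sketched as follows. For clarity this uses an exact brute-force scan rather than a real ANN index, so it does not itself avoid the N² cost; an ANN library would replace the inner scan. The embeddings, the `max_distance` cutoff, and the function names are all hypothetical.

```python
import math

def nearest_neighbors(query, candidates, k=1):
    """Return the k (distance, name) entries closest to the query vector.

    `candidates` maps each name to its embedding vector. This exact scan
    stands in for an approximate nearest neighbors index.
    """
    scored = [(math.dist(query, vec), name) for name, vec in candidates.items()]
    return sorted(scored)[:k]

def fuzzy_join(left, right, max_distance):
    """Join each left name to its nearest right name if it is close enough."""
    joined = []
    for name, vec in left.items():
        dist, match = nearest_neighbors(vec, right, k=1)[0]
        if dist <= max_distance:  # assumed cutoff; would need tuning
            joined.append((name, match))
    return joined

# Made-up 2-d embeddings, standing in for the network's last layer.
left = {"Cal Christensen": [0.0, 0.0]}
right = {"Calvin L. Christensen": [0.1, 0.0], "Cal Hubbard": [2.0, 1.5]}
result = fuzzy_join(left, right, max_distance=0.5)
```

The approach only works if the embedding space really places matches closer than non-matches for arbitrary names, which is what the following slides examine.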
Can we test this idea from the model we just built?
Likely not, because negative pairs generated from blocking shared at least one word of the name (e.g. Bob Knuth - Bob Ricketts).

Running the nearest neighbor algorithm on embeddings from the last layer confirms it:

[Figure: vector space around Bob Knuth. Names shown: Edith Chick, William Cooke, Robert Alan Knuth, Bob Ricketts. Distances shown: .038, .035, .102, 1.6.]

• Test pairs like the ones the model was trained on are mapped correctly.
• Test pairs with no words in common are mapped incorrectly as ‘near neighbors’ by the model.

It is surprising that the model mistakenly maps seemingly dissimilar pairs as ‘neighbors’, because character input embeddings for those should have nothing in common. Why?
What does the vector space for input embeddings look like?
Take the character embeddings for each element of a pair in the siamese network and feed them to the nearest neighbor algorithm.

[Figure: Siamese network with inputs Cal Christensen and Cal L. Christensen, each passing through a character embedding.]

One might expect to see names with common words clustered together, and those with no words in common far apart.

[Figure: vector space with Cal Christensen, Calvin L. Christensen, Cal Hubbard, Mary Christensen, and John Williams.]
It's not that simple…

Names with no components in common have input embeddings that are closer than positives:

query = katri helena
neighbors = 'katya medvedeva', 'natali pronina'
positive = katri helena kalaoja - 5.265079498291016
negative = aino katri kurki suonio - 6.6487250328063965
katya medvedeva - 2.918647050857544
natali pronina - 2.9291510581970215

Positives have higher distances than negatives:

query = ricardo oscar vanni
neighbors = 'pablo oscar cavallero', 'mariano salvador maella'
positive = vanni - 5.223626613616943
negative = ricardo porro hidalgo - 3.843609094619751
pablo oscar cavallero - 3.6668620109558105
mariano salvador maella - 3.8342466354370117

Input embeddings sometimes contain the positive as nearest neighbor:

query = aleksei konakh
neighbors = 'aliaksei konakh', 'aleksei vanyushin'
positive = aliaksei konakh - 1.567959189414978
negative = aleksei anatolyevich yushkov - 4.771781921386719
aliaksei konakh - 1.567959189414978
aleksei vanyushin - 2.4045419692993164
How to build a better model?
Use ANN of character embeddings to drive selection of triplets instead of pairs, picking a positive and a negative example for a given anchor (e.g., Cal Christensen) (https://tinyurl.com/ycrw58ap).

Pick a triplet for an anchor:

[Figure: vector space with Cal Christensen, Calvin L. Christensen, Cal Hubbard, Mary Christensen, and John Williams.]

Build a model to minimize distance to the positive element of a triplet and maximize distance to the negative element.

[Figure: three towers, each labeled character embedding, …, dense layer, dropout, with inputs Cal Christensen, Calvin L. Christensen, and Mary Christensen, producing a distance estimate; one distance is marked “minimize” and the other “maximize”.]
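A minimal sketch of the triplet idea follows, assuming a hinge-style triplet loss on Euclidean distances and a simple selection rule that takes the nearest neighbor that is not a known match as the negative. The margin, the selection rule, and all names here are my assumptions; the poster does not specify them.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss on embedding vectors.

    Zero once the negative is at least `margin` farther from the anchor
    than the positive; otherwise it grows with the violation.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def pick_triplet(anchor_name, known_matches, neighbors):
    """Pick (anchor, positive, negative) for one anchor, or None.

    `known_matches` maps a name to the set of names labeled as the same
    entity; `neighbors` lists nearest-neighbor names, nearest first.
    """
    positives = known_matches.get(anchor_name, set())
    negative = next(
        (n for n in neighbors if n not in positives and n != anchor_name),
        None,
    )
    if not positives or negative is None:
        return None
    return anchor_name, sorted(positives)[0], negative

triplet = pick_triplet(
    "Cal Christensen",
    {"Cal Christensen": {"Calvin L. Christensen"}},
    ["Calvin L. Christensen", "Mary Christensen", "Cal Hubbard"],
)
```

Choosing negatives from the neighbor list means the model trains on the names it currently confuses, instead of only on pairs that share a word through blocking.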
Summary
The merge problem can be solved, at the very least with blocking, provided we have enough data.

It remains to be seen whether we can improve the merge to eliminate blocking entirely.

Acknowledgements:

This work was done in conjunction with Yehuda Gale, a senior who
is conducting this work as part of a senior thesis at Yeshiva
University.

https://github.com/yehudagale/fuzzyJoiner has the source code.
