1. Building a virtual data assistant
Kavitha Srinivas
kavitha@rivetlabs.io
Poster presentation at the CAIM Meetup, NYC, Feb 15, 2018
2. Why do we need one?
A typical business user MANUALLY:
1. Aggregates data across different datasets
2. Applies business logic to the data
3. Shares a filtered view of it with other co-workers
Rinse and repeat as source data changes.
The process is tedious and error-prone.
3. Example today
John in Finance gets the weekly usage of the service, merges it with data from the CRM, and computes revenue:

usage | organization
48 hours | ABC news
24 hours | CBS
20 hours | NBC

organization | salesperson
ABC News Corp | John Doe
CBS news | Mary Peters
NBC Studios | Bob Market

organization | usage | terms | monthly revenue
ABC News | 48 hours | 2.0 per hour | usage * terms
CBS | 24 hours | 1.0 per hour | usage * terms
...

Along the way, John:
- Merges the data, where the merge is often hard and needs fuzzy matching
- Applies a lot of business logic with cut/paste
- Has no access to functions such as revenue forecasting, spatial distance analysis, etc.
- Has no knowledge of what has been done by others with the same data
- Repeats the process when the source data changes
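As a hypothetical sketch, here is the manual workflow above expressed in pandas; the table values mirror the example, and the code is illustrative, not part of the poster:

import pandas as pd

usage = pd.DataFrame({"organization": ["ABC news", "CBS", "NBC"],
                      "usage_hours": [48, 24, 20]})
crm = pd.DataFrame({"organization": ["ABC News Corp", "CBS news", "NBC Studios"],
                    "salesperson": ["John Doe", "Mary Peters", "Bob Market"]})
terms = pd.DataFrame({"organization": ["ABC news", "CBS", "NBC"],
                      "rate_per_hour": [2.0, 1.0, 2.0]})

# Exact keys match, so this merge and the business logic work fine.
merged = usage.merge(terms, on="organization")
merged["monthly_revenue"] = merged.usage_hours * merged.rate_per_hour
# A plain merge with the CRM finds no matches, since the names differ
# ("ABC news" vs. "ABC News Corp"): every salesperson comes back NaN.
print(merged.merge(crm, on="organization", how="left"))

The exact-key merge with the terms table works, but the CRM merge fails precisely because the organization names do not match exactly; that is the fuzzy join problem the rest of the poster addresses.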
4. A virtual data assistant
The assistant:
- Notices the semantic types of the data and handles merges appropriately
- Suggests prior user actions on similar data, and applies that business logic to new data
- Recommends higher-level functions, such as time series forecasting, based on data characteristics
- Recommends functions based on other users' use of similar data

usage | terms | organization
48 hours | 2.0 per hour | ABC news
24 hours | 1.0 per hour | CBS
20 hours | 2.0 per hour | NBC

organization | salesperson
ABC News Corp | John Doe
CBS news | Mary Peters
NBC Studios | Bob Market

organization | salesperson | usage | terms | monthly revenue
ABC News | John Doe | 48 hours | 2.0 per hour | usage * terms
CBS | Mary Peters | 24 hours | 1.0 per hour | usage * terms
...
5. Technology to make this possible…
• Semantic classifiers to classify data to power recommendations (e.g., identify time series data).
• Algorithms to help with those difficult merges (fuzzy joins) across columns (e.g., "Ted Williams" with "T. Williams") based on their semantic type. (Focus of this poster.)
• Algorithms to help users define complex business logic (e.g., synthesis of SQL expressions from examples provided by users).
• Algorithms to compute similarity across different datasets.
6. Fuzzy join algorithms
The standard approach for fuzzy joins is string matching algorithms. Alternatively, use entity-specific rules to match entities. E.g.:
Ted is a short name for Theodore
Ted Kennedy -> Ted is a first name, Kennedy is a last name
Ted Kennedy is the same as Theodore Kennedy…
Can deep learning be used to 'learn' the rules for specific entity types? That would be more scalable than having to code entity-specific rules.
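For contrast, a minimal sketch of the rule-based alternative; the nickname table and the normalize helper are hypothetical:

# Hand-coded rule: expand nicknames, assuming the first token is a first name.
NICKNAMES = {"ted": "theodore", "bob": "robert", "cal": "calvin"}

def normalize(name):
    first, *rest = name.lower().split()
    return " ".join([NICKNAMES.get(first, first)] + rest)

assert normalize("Ted Kennedy") == normalize("Theodore Kennedy")

Every entity type (people, organizations, addresses, ...) would need its own such rules, which is exactly the scalability problem a learned approach avoids.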
7. Fuzzy join Approach 1
The standard approach to avoid N² comparisons of columnar values is to perform blocking: put cells having at least one word in common into the same block, then compare items only within a block.

Columns to join:
organization: Bob Knuth, Ruth Rutherford, Robert Bailey
organization: Ruth Knuth, Eric Satie, Marion Bailey, Robert Knuth

Blocks (at least one word in common):
Bob Knuth, Ruth Knuth, Robert Knuth
Ruth Rutherford, Ruth Knuth
Marion Bailey, Robert Bailey

Within a block, compare items (Bob Knuth = Ruth Knuth?) and build a deep learning model for that comparison.
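A minimal sketch of word-overlap blocking (function names are illustrative):

from collections import defaultdict

def block_by_shared_word(left, right):
    # Index right-hand names by word, then emit candidate pairs sharing at
    # least one word; only these pairs go to the learned comparator,
    # avoiding the full N^2 cross product.
    index = defaultdict(set)
    for name in right:
        for word in name.lower().split():
            index[word].add(name)
    pairs = set()
    for name in left:
        for word in name.lower().split():
            for candidate in index[word]:
                pairs.add((name, candidate))
    return pairs

left = ["Bob Knuth", "Ruth Rutherford", "Robert Bailey"]
right = ["Ruth Knuth", "Eric Satie", "Marion Bailey", "Robert Knuth"]
print(sorted(block_by_shared_word(left, right)))
# e.g. ('Bob Knuth', 'Robert Knuth'), ('Bob Knuth', 'Ruth Knuth'), ...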
8. Similar or not? Estimate distance (Fuzzy Join Approach 1)
Use a "Siamese network" architecture (http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf): two networks, net1 and net2, with shared weights take inputs X1 and X2 and output a distance. The network learns a function that maps entities into a vector space where vectors for positive pairs (e.g., "Cal Christensen" / "Calvin L. Christensen") are mapped closer together than vectors for negative pairs (e.g., "Cal Christensen" / "Cal Hubbard").
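A minimal sketch of this setup in TensorFlow/Keras with the contrastive loss from the paper linked above; layer sizes and input lengths are illustrative and only loosely mirror the poster's layer stack:

import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB, EMB_DIM = 40, 128, 32   # illustrative sizes

def make_tower():
    # One shared tower: character embedding -> dense -> dropout -> dense.
    inp = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB, EMB_DIM)(inp)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    return Model(inp, layers.Dense(64)(x))

tower = make_tower()                     # net1 and net2 share these weights
x1 = layers.Input(shape=(MAX_LEN,), dtype="int32")
x2 = layers.Input(shape=(MAX_LEN,), dtype="int32")
dist = layers.Lambda(lambda t: tf.norm(t[0] - t[1], axis=1, keepdims=True))(
    [tower(x1), tower(x2)])
siamese = Model([x1, x2], dist)

def contrastive_loss(y_true, d, margin=1.0):
    # y_true = 1 for positive pairs, 0 for negatives: positives are pulled
    # together, negatives pushed at least `margin` apart.
    y_true = tf.cast(y_true, d.dtype)
    return tf.reduce_mean(y_true * tf.square(d) +
                          (1.0 - y_true) * tf.square(tf.maximum(margin - d, 0.0)))

siamese.compile(optimizer="adam", loss=contrastive_loss)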
9. Results of Approach 1
The Siamese network (character embedding -> dense layer -> dropout -> dense layer in each tower) outputs a distance estimate for positive and negative pairs (e.g., "Bob Knuth" vs. "Robert Alan Knuth").

Training data: 183,027 positive pairs + 183,027 negative pairs of people's names drawn from DBpedia. Negative pairs always shared at least one name word in common, with the blocking implemented in the database.

F score: .89 (~9K test pairs) vs. .70 for a word-matching baseline.

Examples where the model discriminates well (positive pair, then negative pair):
'sergei ivanov' / 'sergei borisovich ivanov' - 0.059
'sergei ivanov' / 'sergei anatolevich bozhenov' - 1.013
'joseph josiah dodd' / 'j j dodd' - 0.109
'joseph josiah dodd' / 'joseph' - 1.127
'sir richard stydolph 1st baronet' / 'richard stydolph' - 0.157
'sir richard stydolph 1st baronet' / 'sir harry chilcott' - 1.409
'robin chan' / 'rabin sophonpanich' - 0.199
'robin chan' / 'eric robin bell' - 1.259
'richard russell' / 'richard b russell jr' - 0.275
'richard russell' / 'richard paul wesley cresswell' - 2.793

Examples where the model has trouble:
'nahum sokolow' / 'nachum sokolov' - 0.666
'nahum sokolow' / 'nahum gutman' - 0.531
'kiyoshi kawasaki' / 'kawasaki kiyoshi' - 0.29039219
'kiyoshi kawasaki' / 'matsura kiyoshi' - 0.40613374
10. Are we done?
Fuzzy join Approach 1 is clearly promising: it is more scalable and more effective than the rule-based approach.
But… we can do better.
11. Fuzzy Join Approach 2
Do away with blocking entirely, but still eliminate N² comparisons of values:
1. Build a Siamese network for a given entity type.
2. Perform a linear pass over the column data and get the vector embeddings from the last layer.
3. Find nearest neighbors using an approximate nearest neighbor (ANN) algorithm to find the elements to join with.

If the network computes a vector embedding where positive pairs are closer than negative pairs, then finding the nearest neighbors directly from the vector embeddings should identify the elements to join with. For example, "Cal Christensen" and "Calvin L. Christensen" land close together in vector space, far from "Cal Hubbard", so only those two are joined. This should be more efficient than blocking, because blocking is ineffective for common words (e.g., John).
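A minimal sketch of the lookup step; scikit-learn's exact NearestNeighbors stands in for a true ANN library (e.g., Annoy or FAISS), and `encode` is a hypothetical helper that runs names through the trained tower:

from sklearn.neighbors import NearestNeighbors

def join_by_embedding(left_names, right_names, encode, max_dist=0.5):
    # One linear pass per column to embed, then one nearest-neighbor query
    # per left value instead of comparing every pair of values.
    left_vecs, right_vecs = encode(left_names), encode(right_names)
    index = NearestNeighbors(n_neighbors=1).fit(right_vecs)
    dists, idxs = index.kneighbors(left_vecs)
    return [(l, right_names[j], float(d))
            for l, d, j in zip(left_names, dists[:, 0], idxs[:, 0])
            if d <= max_dist]   # join only when the match is close enough

The distance threshold `max_dist` is also an assumption; in practice it would be tuned on held-out pairs.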
12. Can we test this idea with the model we just built?
Likely not, because the negative pairs generated from blocking shared at least one word of the name (e.g., Bob Knuth - Bob Ricketts). Running the nearest neighbor algorithm on embeddings from the last layer confirms it:

Distances from "Bob Knuth": Edith Chick - .038, William Cooke - .035, Robert Alan Knuth - .102, Bob Ricketts - 1.6

Test pairs like the ones the model was trained on are mapped correctly, but test pairs with no words in common are mapped incorrectly as 'near neighbors' by the model.

It is surprising that the model mistakenly maps seemingly dissimilar pairs as 'neighbors', because the character input embeddings for those pairs should have nothing in common. Why?
13. What does the vector space for input embeddings look like?
Take the character embeddings for each element of a pair in the Siamese network and feed them to the nearest neighbor algorithm. One might expect to see names with common words (Cal Christensen, Calvin L. Christensen, Mary Christensen) clustered together, and those with no words in common (Cal Hubbard, John Williams) far apart.
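A self-contained sketch of this probe; the random `char_emb` table stands in for the network's trained character embeddings, and averaging per-character vectors is one simple way to get a fixed-size input embedding per name:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
char_emb = rng.normal(size=(128, 16))    # stand-in embedding table

def input_embedding(name):
    # Average the per-character embeddings into one vector per name.
    return char_emb[[ord(c) % 128 for c in name]].mean(axis=0)

names = ["cal christensen", "calvin l. christensen", "cal hubbard",
         "mary christensen", "john williams"]
vecs = np.stack([input_embedding(n) for n in names])
dists, idxs = NearestNeighbors(n_neighbors=2).fit(vecs).kneighbors(vecs)
for name, row in zip(names, idxs):
    print(name, "->", names[row[1]])     # nearest other name

With the trained weights in place of the random table, the same query exposes the surprises shown on the next slide.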
14. It's not that simple…
Names with no components in common can have input embeddings that are closer than the positives:
query = katri helena; nearest neighbors = 'katya medvedeva', 'natali pronina'
positive = katri helena kalaoja - 5.265079498291016
negative = aino katri kurki suonio - 6.6487250328063965
katya medvedeva - 2.918647050857544
natali pronina - 2.9291510581970215

Positives can have higher distances than negatives:
query = ricardo oscar vanni; nearest neighbors = 'pablo oscar cavallero', 'mariano salvador maella'
positive = vanni - 5.223626613616943
negative = ricardo porro hidalgo - 3.843609094619751
pablo oscar cavallero - 3.6668620109558105
mariano salvador maella - 3.8342466354370117

But the input embeddings sometimes do contain the positive as the nearest neighbor:
query = aleksei konakh; nearest neighbors = 'aliaksei konakh', 'aleksei vanyushin'
positive = aliaksei konakh - 1.567959189414978
negative = aleksei anatolyevich yushkov - 4.771781921386719
aliaksei konakh - 1.567959189414978
aleksei vanyushin - 2.4045419692993164
15. How to build a better model?
Use ANN over the character embeddings to drive the selection of triplets instead of pairs, picking a positive and a negative example for a given anchor, e.g., anchor "Cal Christensen", positive "Calvin L. Christensen", negative "Mary Christensen" (https://tinyurl.com/ycrw58ap).

Build a model with three towers (character embedding -> dense layer -> dropout), one each for the anchor, positive, and negative, to minimize the distance to the positive element of a triplet and maximize the distance to the negative element.
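A minimal sketch of the triplet objective, assuming the same shared tower as in the earlier Siamese sketch; the margin value is illustrative:

import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Pull the anchor toward the positive and push it at least `margin`
    # farther from the negative (hinge on the distance difference).
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=1)
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))

# e.g., loss = triplet_loss(tower(a_ids), tower(p_ids), tower(n_ids))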
16. Summary
The merge problem can be solved provided we have enough data, at the very least with blocking. It remains to be seen whether we can improve the merge to eliminate blocking entirely.

Acknowledgements: This work was done in conjunction with Yehuda Gale, who is conducting it as part of his senior thesis at Yeshiva University.
The source code is at https://github.com/yehudagale/fuzzyJoiner.