Big Graph Data Science 
Lise Getoor 
University of California, Santa Cruz 
SF MLConf 
November 14, 2014
BIG Data is not flat 
©2004-2013 lonnitaylor
Data is multi-modal, multi-relational, 
spatio-temporal, multi-media
NEED: Data Science for Graphs 
New V: Vinculate
Patterns | Key Ideas | Tools
☐ Collective Classification
☐ Link Prediction
☐ Entity Resolution
Collective Classification: inferring the labels of nodes in a graph
Collective Classification
[Figure: a social network whose nodes are connected by spouse, colleague, and friend edges. Question: which label does each unlabeled (?) node take?]
✔ Collective Classification
☐ Link Prediction
☐ Entity Resolution
Link Prediction: inferring the existence of edges in a graph
Link Prediction
• Entities
  – People, Emails
• Observed relationships
  – communications, co-location
• Predict relationships
  – Supervisor, subordinate, colleague
✔ Collective Classification
✔ Link Prediction
☐ Entity Resolution
Entity Resolution: determining which nodes refer to the same underlying entity
Ironically, Entity Resolution has many duplicate names 
Identity uncertainty
Doubles
Record linkage
Duplicate detection
Deduplication
Reference reconciliation
Object identification
Object consolidation
Coreference resolution
Entity clustering
Household matching
Householding
Reference matching
Fuzzy match
Approximate match
Merge/purge
Hardening soft databases
Entity Resolution & Network Analysis 
Tutorial on Entity Resolution in Big Data w/ Ashwin Machanavajjhala @ KDD13 (slides available)
[Figure: a network before and after entity resolution]
✔ Collective Classification
✔ Link Prediction
✔ Entity Resolution
Graph Identification
• Goal:
  – Given an input graph, infer an output graph
• Three major components:
  – Entity Resolution (ER): Infer the set of nodes
  – Link Prediction (LP): Infer the set of edges
  – Collective Classification (CC): Infer the node labels
• Challenge: The components are intra- and inter-dependent
Namata, Kok, Getoor, KDD 2011
Patterns | Key Ideas | Tools
☐ Feature Construction
☐ Collective Reasoning
☐ Blocking
Key Idea #1: Relational Feature Construction
Key Idea: Feature Construction
• Feature informativeness is key to the success of a relational classifier
• Relational feature construction
  – Node-specific measures
    • Aggregates: summarize attributes of relational neighbors
    • Structural properties: capture characteristics of the relational structure
  – Node-pair measures: summarize properties of (potential) edges
    • Attribute-based measures
    • Edge-based measures
    • Neighborhood similarity measures
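The measures listed above can be sketched in a few lines of Python. This is only an illustration: the toy graph, the `age` attribute, and every function name are assumptions made for the example, not taken from the talk.

```python
from statistics import mean

# Toy undirected graph: node -> set of neighbors (hypothetical data).
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob"},
    "dan": {"bob"},
}
# One numeric attribute per node (hypothetical).
age = {"ann": 30, "bob": 40, "cat": 50, "dan": 20}


def neighbor_mean(node, attribute):
    """Aggregate: mean of an attribute over a node's relational neighbors."""
    return mean(attribute[n] for n in graph[node])


def degree(node):
    """Structural property: number of neighbors."""
    return len(graph[node])


def jaccard(a, b):
    """Neighborhood similarity for a (potential) edge a-b."""
    union = graph[a] | graph[b]
    return len(graph[a] & graph[b]) / len(union) if union else 0.0


def pair_features(a, b):
    """Feature dictionary for a node pair, usable by any standard classifier."""
    return {
        "age_diff": abs(age[a] - age[b]),  # attribute-based measure
        "linked": b in graph[a],           # edge-based measure
        "jaccard": jaccard(a, b),          # neighborhood similarity measure
    }
```

Because every feature depends only on observed attributes and edges, the whole table can be computed once, ahead of training.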
Relational Classifiers
• Pros:
  – Efficient: Can handle large amounts of data
    • Features can often be pre-computed ahead of time
  – Flexible: Can take advantage of well-understood classification/regression algorithms
  – One of the most commonly-used ways of incorporating relational information
• Cons:
  – Features are based on observed values/evidence, cannot be based on attributes or relations that are being predicted
  – Makes incorrect independence assumptions
  – Hard to impose global constraints on joint assignments
    • For example, when inferring a hierarchy of individuals, we may want to enforce the constraint that it is a tree
✔ Feature Construction
☐ Collective Reasoning
☐ Blocking
Collective Classification
[Figure: a person whose vote is unknown (?), with donations ($), tweets, and status updates as evidence]
Donates(A, ) => Votes(A, ) : 5.0
Mentions(A, “Affordable Health”) => Votes(A, ) : 0.3
Collective Classification
vote(A,P) ∧ friend(B,A) → vote(B,P) : 0.3
vote(A,P) ∧ spouse(B,A) → vote(B,P) : 0.8
[Figure: a social network whose nodes are connected by spouse, colleague, and friend edges]
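To give a feel for how labels can spread along weighted relations, here is a minimal sketch of a weighted-vote style label propagation. It is a swapped-in, simpler technique, not the inference procedure used in the talk. The network, the observed beliefs, and the function name `propagate` are hypothetical; only the weights 0.3 and 0.8 come from the rules.

```python
# Edge weights taken from the friend and spouse rules.
RULE_WEIGHT = {"friend": 0.3, "spouse": 0.8}

# Hypothetical toy network: (person, person, relation), treated as symmetric.
edges = [
    ("ann", "bob", "spouse"),
    ("bob", "cat", "friend"),
    ("cat", "dan", "friend"),
]
# Observed labels: degree of belief that the person votes for party P.
observed = {"ann": 1.0, "dan": 0.0}


def propagate(edges, observed, rounds=50):
    """Repeatedly set each unlabeled node to the weighted mean of its neighbors."""
    neighbors = {}
    for a, b, rel in edges:
        neighbors.setdefault(a, []).append((b, RULE_WEIGHT[rel]))
        neighbors.setdefault(b, []).append((a, RULE_WEIGHT[rel]))
    belief = {n: observed.get(n, 0.5) for n in neighbors}  # 0.5 = undecided
    for _ in range(rounds):
        for node in neighbors:
            if node in observed:
                continue  # evidence stays fixed
            total = sum(w for _, w in neighbors[node])
            belief[node] = sum(w * belief[m] for m, w in neighbors[node]) / total
    return belief


belief = propagate(edges, observed)
```

In this toy network `bob` ends up closer to `ann` than `cat` does, because the spouse edge carries more weight than the friend edges.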
Link Prediction
• People, emails, words, communication, relations
• Use model to express dependencies
  – “If email content suggests type X, it is of type X”
    HasWord(E, “due”) => Type(E, deadline) : 0.6
  – “If A sends deadline emails to B, then A is the supervisor of B”
    Sends(A,B,E) ^ Type(E,deadline) => Supervisor(A,B) : 0.8
  – “If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues”
    Supervisor(A,B) ^ Supervisor(A,C) => Colleague(B,C) : ∞
[Figure: email network among people; an example email contains the words “complete by” and “due”]
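The way the three rules chain can be sketched as follows. This is a deliberate simplification: all rules are treated as hard if-then rules and the weights 0.6 and 0.8 are ignored, so it shows only the dependency structure, not probabilistic inference. The email data and the function name `infer_relations` are hypothetical.

```python
from itertools import combinations

# Hypothetical observations: (sender, recipient, words in the email).
emails = [
    ("ann", "bob", {"report", "due", "friday"}),
    ("ann", "cat", {"draft", "due", "monday"}),
    ("bob", "cat", {"lunch", "today"}),
]


def infer_relations(emails):
    """Chain the three rules as hard if-then rules (weights ignored)."""
    # HasWord(E, "due") => Type(E, deadline)
    deadline = [(a, b) for a, b, words in emails if "due" in words]
    # Sends(A,B,E) ^ Type(E,deadline) => Supervisor(A,B)
    supervisor = set(deadline)
    # Supervisor(A,B) ^ Supervisor(A,C) => Colleague(B,C)
    reports = {}
    for boss, sub in supervisor:
        reports.setdefault(boss, set()).add(sub)
    colleague = set()
    for subs in reports.values():
        for b, c in combinations(sorted(subs), 2):
            colleague.add((b, c))
    return supervisor, colleague


supervisor, colleague = infer_relations(emails)
```

Note that the colleague link between `bob` and `cat` is never observed directly; it follows only from two inferred supervisor links, which is why the predictions depend on one another.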
Entity Resolution
• Entities
  – People References
• Attributes
  – Name
• Relationships
  – Friendship
• Goal: Identify references that denote the same person
[Figure: references A through H connected by name and friend edges; the names shown are “John Smith” and “J. Smith”; “=” marks references that denote the same person]
Entity Resolution
• References, names, friendships
• Use model to express dependencies
  – “If two people have similar names, they are probably the same”
    A.name ≈{str_sim} B.name => A≈B : 0.8
  – “If two people have similar friends, they are probably the same”
    {A.friends} ≈{} {B.friends} => A≈B : 0.6
  – “If A=B and B=C, then A and C must also denote the same person”
    A≈B ^ B≈C => A≈C : ∞
[Figure: references A through H connected by name and friend edges; the names shown are “John Smith” and “J. Smith”; “=” marks references that denote the same person]
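The hard transitivity rule can be enforced after pairwise matching with a union-find structure. The sketch below assumes the pairwise match decisions are already made; it is one common way to obtain the transitive closure, not necessarily how the talk's system handles it, and the function name `resolve` is hypothetical.

```python
def resolve(references, matches):
    """Group references into entities so that A≈B and B≈C imply A≈C."""
    parent = {r: r for r in references}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for r in references:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(c) for c in clusters.values())


entities = resolve(["A", "B", "C", "D", "E"], [("A", "B"), ("B", "C")])
```

Here A and C land in the same entity even though no direct A≈C match was supplied.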
Challenges
• Collective Classification: labeling nodes in graph
  – irregular structure, not a chain, not a grid
  – Challenge: One large partially labeled graph
• Link prediction: predicting edges in graph
  – Dependencies among edges
  – Don’t want to reason about all possible edges
  – Challenge: scaling & extremely skewed probabilities
• Entity resolution: determine nodes that refer to same entities in a graph
  – Dependencies between clusters
  – Challenge: enforcing constraints, e.g. transitive closure
✔ Feature Construction
✔ Collective Reasoning
☐ Blocking
Blocking: Motivation
• Naïve pairwise: |N|² pairwise comparisons
  – 1,000 business listings each from 1,000 different cities across the world
  – 1 trillion comparisons
  – 11.6 days (if each comparison is 1 μs)
• Mentions from different cities are unlikely to be matches
  – Blocking Criterion: City
  – 1 billion comparisons
  – 16 minutes (if each comparison is 1 μs)
Blocking: Motivation
• Mentions from different cities are unlikely to be matches
  – May miss potential matches
[Figure: the set of all pairs of nodes, containing the matching pairs of nodes and the pairs of nodes satisfying the blocking criterion]
Blocking Algorithms 1
• Hash based blocking
  – Each block C_i is associated with a hash key h_i.
  – Mention x is hashed to C_i if hash(x) = h_i.
  – Within a block, all pairs are compared.
  – Each hash function results in disjoint blocks.
• What hash function?
  – Deterministic function of attribute values
  – Boolean Functions over attribute values [Bilenko et al ICDM’06, Michelson et al AAAI’06, Das Sarma et al CIKM’12]
  – minHash (min-wise independent permutations) [Broder et al STOC’98]; locality sensitive hashing
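A minimal sketch of hash based blocking, using the simplest choice of hash: a deterministic function of one attribute value (the city). The mentions and the names `block_key` and `candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical mentions: (id, name, city).
mentions = [
    (1, "Joe's Pizza", "Austin"),
    (2, "Joes Pizza", "Austin"),
    (3, "Joe's Pizza", "Boston"),
    (4, "Blue Cafe", "Boston"),
]


def block_key(mention):
    """Deterministic function of attribute values: here, the city."""
    _, _, city = mention
    return city.lower()


def candidate_pairs(mentions, key=block_key):
    """Hash mentions into disjoint blocks and compare all pairs within each."""
    blocks = defaultdict(list)
    for m in mentions:
        blocks[key(m)].append(m[0])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


pairs = candidate_pairs(mentions)
```

Only 2 of the 6 possible pairs are compared. Mentions 1 and 3 share a name but sit in different blocks, so that potential match is never examined.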
Blocking Algorithms 2
• Pairwise Similarity/Neighborhood based blocking
  – Nearby nodes according to a similarity metric are clustered together
  – Results in non-disjoint canopies.
• Techniques
  – Sorted Neighborhood Approach [Hernandez et al SIGMOD’95]
  – Canopy Clustering [McCallum et al KDD’00]
✔ Feature Construction
✔ Collective Reasoning
✔ Blocking
Patterns | Key Ideas | Tools
http://psl.umiacs.umd.edu 
Matthias Broecheler, Stephen Bach, Alex Memory, Lily Mihalkova, Stanley Kok, Bert Huang, Angelika Kimmig, Ben London, Arti Ramesh, Jay Pujara, Shobeir Fakhraei, Hui Miao
Probabilistic Soft Logic (PSL)
Declarative language based on logics to express collective probabilistic inference problems
- Predicate = relationship or property 
- Atom = (continuous) random variable 
- Rule = capture dependency or constraint 
- Set = define aggregates 
PSL Program = Rules + Input DB
Collective Classification
vote(A,P) ∧ friend(B,A) → vote(B,P) : 0.3
vote(A,P) ∧ spouse(B,A) → vote(B,P) : 0.8
[Figure: a social network whose nodes are connected by spouse, colleague, and friend edges]
PSL Foundations
• PSL makes large-scale reasoning scalable by mapping logical rules to convex functions
• Three principles justify this mapping:
  – LP programs for MAX SAT with approximation guarantees [Goemans and Williamson, ’94]
  – Pseudomarginal LP relaxations of Boolean Markov random fields [Wainwright, et al., ’02]
  – Łukasiewicz logic, a logic for reasoning about continuous values [Klir and Yuan, ’95]
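For readers unfamiliar with Łukasiewicz logic, its connectives over truth values in [0,1] are short enough to write out. The function names are illustrative; the formulas are the standard Łukasiewicz operators.

```python
def l_and(a, b):
    """Łukasiewicz t-norm (soft conjunction)."""
    return max(0.0, a + b - 1.0)


def l_or(a, b):
    """Łukasiewicz t-conorm (soft disjunction)."""
    return min(1.0, a + b)


def l_not(a):
    """Soft negation."""
    return 1.0 - a


def l_implies(body, head):
    """Truth of body -> head; equals 1 exactly when head >= body."""
    return min(1.0, 1.0 - body + head)
```

On the Boolean values 0 and 1 these agree with classical logic; in between they are piecewise linear, which is what allows rules to be mapped to convex functions.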
Hinge-loss Markov Random Fields
$$P(Y \mid X) = \frac{1}{Z} \exp\left[ -\sum_{j=1}^{m} w_j \max\{\ell_j(Y, X), 0\}^{p_j} \right]$$
• Continuous variables in [0,1]
• Potentials are hinge-loss functions
• Subject to arbitrary linear constraints
• Log-concave!
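A small numeric sketch of one hinge-loss potential, assuming the standard HL-MRF form in which the density is the exponential of the negated weighted sum of hinge terms. The truth values and the function names are hypothetical; the rule and its weight 0.8 are the spouse rule from the collective classification example.

```python
import math


def hinge_potential(linear_value, weight, exponent=1):
    """One term w_j * max(l_j(Y, X), 0) ** p_j of the HL-MRF energy."""
    return weight * max(linear_value, 0.0) ** exponent


def unnormalized_density(groundings):
    """exp(-sum of potentials); dividing by Z would give P(Y | X)."""
    energy = sum(hinge_potential(l, w, p) for l, w, p in groundings)
    return math.exp(-energy)


# One grounding of vote(A,P) ∧ spouse(B,A) → vote(B,P) : 0.8.
# Under Łukasiewicz logic the distance to satisfaction of body → head is
# max(0, body - head), so l = vote(A,P) + spouse(B,A) - 1 - vote(B,P).
vote_a, spouse_ba, vote_b = 0.9, 1.0, 0.4  # hypothetical truth values
l_value = vote_a + spouse_ba - 1.0 - vote_b
density = unnormalized_density([(l_value, 0.8, 1)])
```

Each `l_value` is linear in the unknown truth values, so every potential is convex and the energy can be minimized with convex optimization.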
PSL in a Slide
• MAP inference in PSL translates into a convex optimization problem: inference is really fast!
• Inference is further enhanced with state-of-the-art optimization and distributed processing paradigms such as ADMM and GraphLab: inference is even faster!
• Outperforms discrete MRFs in terms of speed, and (very) often accuracy
• PSL is flexible: applied to image segmentation, activity recognition, stance detection, sentiment analysis, document classification, drug target prediction, latent social groups and trust, engagement modeling, ontology alignment, and looking for more!
Discussion
Patterns | Key Ideas | Tools
NEED: Data Science for Graphs
Closing Comments
• Need new statistical and ML theory for very large graphs
  – Better understand bias, identifiability, and mechanism design in graph domains
• Important topics not touched on
  – Generative mechanisms, causal reasoning, dynamic modeling
  – Privacy, Users, Visualization
• Many exciting opportunities to develop new theory and algorithms and apply them to compelling business, intelligence, social, and societal problems!
http://psl.umiacs.umd.edu 
Thank You! 
psl.umiacs.umd.edu 
Contact information: 
getoor@soe.ucsc.edu
Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF
Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF

Mais conteúdo relacionado

Semelhante a Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF

Private Distributed Collaborative Filtering
Private Distributed Collaborative FilteringPrivate Distributed Collaborative Filtering
Private Distributed Collaborative FilteringNeal Lathia
 
Semantic Relations
Semantic RelationsSemantic Relations
Semantic RelationsJennifer Lee
 
Chapter 2 Relational Data Model-part1
Chapter 2 Relational Data Model-part1Chapter 2 Relational Data Model-part1
Chapter 2 Relational Data Model-part1Eddyzulham Mahluzydde
 
Designing, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural NetworksDesigning, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural Networksconnectbeubax
 
Grouping search engine returned citations for person-name queries
Grouping search engine returned citations for person-name queriesGrouping search engine returned citations for person-name queries
Grouping search engine returned citations for person-name queriesAvelin Huo
 
Pertemuan-4------------------------------------------------
Pertemuan-4------------------------------------------------Pertemuan-4------------------------------------------------
Pertemuan-4------------------------------------------------keishaangelina2
 
Statistical Regression With Python
Statistical Regression With PythonStatistical Regression With Python
Statistical Regression With PythonMosky Liu
 
06 Regression with Networks – EGO Networks and Randomization (2017)
06 Regression with Networks – EGO Networks and Randomization (2017)06 Regression with Networks – EGO Networks and Randomization (2017)
06 Regression with Networks – EGO Networks and Randomization (2017)Duke Network Analysis Center
 
Inconsistencies of Connection for Heterogeneity and a New Rela,on Discovery M...
Inconsistencies of Connection for Heterogeneity and a New Rela,on Discovery M...Inconsistencies of Connection for Heterogeneity and a New Rela,on Discovery M...
Inconsistencies of Connection for Heterogeneity and a New Rela,on Discovery M...Takafumi Nakanishi
 
Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jWilliam Lyon
 
F Meeting 10 24 07 Rrr
F Meeting 10 24 07 RrrF Meeting 10 24 07 Rrr
F Meeting 10 24 07 Rrrcwatters
 
Recsys Presentation
Recsys PresentationRecsys Presentation
Recsys PresentationNeal Lathia
 
Introduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCAIntroduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCADilum Bandara
 
2015 ESWC FrameBase presentation
2015 ESWC FrameBase presentation2015 ESWC FrameBase presentation
2015 ESWC FrameBase presentationJacobo Rouces
 

Semelhante a Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF (20)

PSL Overview
PSL OverviewPSL Overview
PSL Overview
 
Private Distributed Collaborative Filtering
Private Distributed Collaborative FilteringPrivate Distributed Collaborative Filtering
Private Distributed Collaborative Filtering
 
Semantic Relations
Semantic RelationsSemantic Relations
Semantic Relations
 
Chapter 2 Relational Data Model-part1
Chapter 2 Relational Data Model-part1Chapter 2 Relational Data Model-part1
Chapter 2 Relational Data Model-part1
 
01 Network Data Collection
01 Network Data Collection01 Network Data Collection
01 Network Data Collection
 
Designing, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural NetworksDesigning, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural Networks
 
Grouping search engine returned citations for person-name queries
Grouping search engine returned citations for person-name queriesGrouping search engine returned citations for person-name queries
Grouping search engine returned citations for person-name queries
 
Pertemuan-4------------------------------------------------
Pertemuan-4------------------------------------------------Pertemuan-4------------------------------------------------
Pertemuan-4------------------------------------------------
 
Statistical Regression With Python
Statistical Regression With PythonStatistical Regression With Python
Statistical Regression With Python
 
DBMS UNIT1
DBMS UNIT1DBMS UNIT1
DBMS UNIT1
 
Dependency-Based Word Embeddings
Dependency-Based Word EmbeddingsDependency-Based Word Embeddings
Dependency-Based Word Embeddings
 
Data modeling
Data modelingData modeling
Data modeling
 
06 Regression with Networks – EGO Networks and Randomization (2017)
06 Regression with Networks – EGO Networks and Randomization (2017)06 Regression with Networks – EGO Networks and Randomization (2017)
06 Regression with Networks – EGO Networks and Randomization (2017)
 
10287 lecture5(2)
10287 lecture5(2)10287 lecture5(2)
10287 lecture5(2)
 
Inconsistencies of Connection for Heterogeneity and a New Rela,on Discovery M...
Inconsistencies of Connection for Heterogeneity and a New Rela,on Discovery M...Inconsistencies of Connection for Heterogeneity and a New Rela,on Discovery M...
Inconsistencies of Connection for Heterogeneity and a New Rela,on Discovery M...
 
Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4j
 
F Meeting 10 24 07 Rrr
F Meeting 10 24 07 RrrF Meeting 10 24 07 Rrr
F Meeting 10 24 07 Rrr
 
Recsys Presentation
Recsys PresentationRecsys Presentation
Recsys Presentation
 
Introduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCAIntroduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCA
 
2015 ESWC FrameBase presentation
2015 ESWC FrameBase presentation2015 ESWC FrameBase presentation
2015 ESWC FrameBase presentation
 

Mais de MLconf

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...MLconf
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingMLconf
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...MLconf
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushMLconf
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceMLconf
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...MLconf
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...MLconf
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMLconf
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionMLconf
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLMLconf
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksMLconf
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...MLconf
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldMLconf
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...MLconf
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...MLconf
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...MLconf
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeMLconf
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...MLconf
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareMLconf
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesMLconf
 

Mais de MLconf (20)

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious Experience
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the Cheap
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data Collection
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of ML
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 

Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF

  • 1. Big Graph Data Science Lise Getoor University of California, Santa Cruz SF MLConf November 14, 2014
  • 2. BIG Data is not flat ©2004-2013 lonnitaylor
  • 3. Data is multi-modal, multi-relational, spatio-temporal, multi-media
  • 5. NEED: Data Science for Graphs New V: Vinculate
  • 8. ☐ Collective Classification ☐ Link Prediction ☐ Entity Resolution
  • 9. Collective Classification: inferring the labels of nodes in a graph
  • 10. Collective Classification [figure: people connected by spouse, friend, and colleague edges] Question: which of two labels should a node get?
  • 11. Collective Classification [figure: the same graph of spouse, friend, and colleague edges]
  • 12. Collective Classification [figure: the same graph with three nodes marked "?"]
  • 13. ✔ Collective Classification ☐ Link Prediction ☐ Entity Resolution
  • 14. Link Prediction: inferring the existence of edges in a graph
  • 15. Link Prediction • Entities – People, Emails • Observed relationships – communications, co-location • Predict relationships – Supervisor, subordinate, colleague
  • 16. ✔ Collective Classification ✔ Link Prediction ☐ Entity Resolution
  • 17. Entity Resolution: determining which nodes refer to the same underlying entity
  • 18. Ironically, Entity Resolution has many duplicate names: identity uncertainty, doubles, record linkage, duplicate detection, deduplication, reference reconciliation, object identification, object consolidation, coreference resolution, entity clustering, household matching, householding, reference matching, fuzzy match, approximate match, merge/purge, hardening soft databases
  • 19. Entity Resolution & Network Analysis Tutorial on Entity Resolution in Big Data w/ Ashwin Machanavajjhala @ KDD13 (slides available) before after
  • 20. ✔ Collective Classification ✔ Link Prediction ✔ Entity Resolution
  • 21. Graph Identification • Goal: – Given an input graph, infer an output graph • Three major components: – Entity Resolution (ER): Infer the set of nodes – Link Prediction (LP): Infer the set of edges – Collective Classification (CC): Infer the node labels • Challenge: The components are intra- and inter-dependent [Namata, Kok, Getoor, KDD 2011]
  • 23. ☐ Feature Construction ☐ Collective Reasoning ☐ Blocking
  • 24. Key Idea #1: Relational Feature Construction
  • 25. Key Idea: Feature Construction • Feature informativeness is key to the success of a relational classifier • Relational feature construction – Node-specific measures: aggregates (summarize attributes of relational neighbors) and structural properties (capture characteristics of the relational structure) – Node-pair measures, which summarize properties of (potential) edges: attribute-based measures, edge-based measures, neighborhood similarity measures
  • 26. Relational Classifiers • Pros: – Efficient: can handle large amounts of data; features can often be pre-computed – Flexible: can take advantage of well-understood classification/regression algorithms – One of the most commonly used ways of incorporating relational information • Cons: – Features are based on observed values/evidence and cannot be based on attributes or relations that are being predicted – Makes incorrect independence assumptions – Hard to impose global constraints on joint assignments; for example, when inferring a hierarchy of individuals, we may want to enforce the constraint that it is a tree
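
The node-specific and node-pair measures named on slide 25 can be illustrated with a small sketch. The toy graph, the label values, and the helper names (`build_neighbors`, `neighbor_label_counts`, `jaccard`) are assumptions made for this example and do not come from the talk; Jaccard overlap is used here as one possible neighborhood similarity measure.

```python
from collections import Counter

# Hypothetical toy graph: undirected edges plus the labels that are observed.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
labels = {"a": "x", "b": "y", "d": "x"}  # node "c" is unlabeled


def build_neighbors(edge_list):
    """Map each node to the set of its neighbors."""
    nbrs = {}
    for u, v in edge_list:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def neighbor_label_counts(node, nbrs, observed):
    """Aggregate feature: count each observed label among a node's neighbors."""
    return Counter(observed[n] for n in nbrs[node] if n in observed)


def jaccard(u, v, nbrs):
    """Node-pair feature: overlap of the two neighborhoods, in [0, 1]."""
    union = nbrs[u] | nbrs[v]
    return len(nbrs[u] & nbrs[v]) / len(union) if union else 0.0


neighbors = build_neighbors(edges)
features_c = {
    "degree": len(neighbors["c"]),  # structural property
    "label_counts": neighbor_label_counts("c", neighbors, labels),  # aggregate
}
```

Because these features read only observed labels, they can be computed once up front and handed to any standard classifier, which matches the efficiency argument on slide 26 and also its limitation: the labels being predicted never feed back into the features.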
  • 27. ✔ Feature Construction ☐ Collective Reasoning ☐ Blocking
  • 29. Collective Classification Donates(A, ) => Votes(A, ) : 5.0 Mentions(A, “Affordable Health”) => Votes(A, ) : 0.3 [figure: donations, a tweet, and a status update as evidence for an unknown vote]
  • 30. Collective Classification [figure: graph of spouse, friend, and colleague edges]
  • 31. Collective Classification vote(A,P) ∧ friend(B,A) → vote(B,P) : 0.3 vote(A,P) ∧ spouse(B,A) → vote(B,P) : 0.8 [figure: graph of spouse, friend, and colleague edges]
  • 32. Collective Classification [figure: graph of spouse, friend, and colleague edges]
  • 33. Link Prediction • People, emails, words, communication, relations • Use model to express dependencies – “If email content suggests type X, it is of type X” – “If A sends deadline emails to B, then A is the supervisor of B” – “If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues”
  • 34. Link Prediction (dependencies as on slide 33) Rule for the first dependency: HasWord(E, “due”) => Type(E, deadline) : 0.6 [figure: email containing the words “complete by” and “due”]
  • 35. Link Prediction (dependencies as on slide 33) Rule for the second dependency: Sends(A,B,E) ^ Type(E,deadline) => Supervisor(A,B) : 0.8
  • 36. Link Prediction (dependencies as on slide 33) Rule for the third dependency: Supervisor(A,B) ^ Supervisor(A,C) => Colleague(B,C) : ∞
  • 37. Entity Resolution • Entities – People References • Attributes – Name • Relationships – Friendship • Goal: Identify references that denote the same person [figure: references A through H with name attributes “John Smith” and “J. Smith”, friend edges, and “=” links]
  • 38. Entity Resolution • References, names, friendships • Use model to express dependencies – “If two people have similar names, they are probably the same” – “If two people have similar friends, they are probably the same” – “If A=B and B=C, then A and C must also denote the same person”
  • 39. Entity Resolution (dependencies as on slide 38) Rule for the first dependency: A.name ≈{str_sim} B.name => A≈B : 0.8
  • 40. Entity Resolution (dependencies as on slide 38) Rule for the second dependency: {A.friends} ≈{} {B.friends} => A≈B : 0.6
  • 41. Entity Resolution (dependencies as on slide 38) Rule for the third dependency: A≈B ^ B≈C => A≈C : ∞
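
The transitivity dependency (“if A=B and B=C, then A and C must also denote the same person”) can be illustrated with a union-find structure. This is a simplified sketch, not how PSL enforces the constraint: it assumes the pairwise matches have already been decided as hard yes/no answers, whereas the slides treat A≈B as a soft value. The function name `resolve` and the sample references are made up for the example.

```python
def resolve(references, matched_pairs):
    """Group references into entities by taking the transitive closure of matches."""
    parent = {r: r for r in references}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for r in references:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(c) for c in clusters.values())


# A=B and B=C were matched pairwise; A=C follows from the closure.
entities = resolve(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

One consequence of a hard closure like this is that a single wrong pairwise match merges two whole clusters, which is part of why the slides list constraint enforcement as a challenge for entity resolution.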
  • 42. Challenges • Collective classification: labeling nodes in a graph – Irregular structure: not a chain, not a grid – Challenge: one large, partially labeled graph • Link prediction: predicting edges in a graph – Dependencies among edges – Don’t want to reason about all possible edges – Challenge: scaling and extremely skewed probabilities • Entity resolution: determining which nodes refer to the same entities in a graph – Dependencies between clusters – Challenge: enforcing constraints, e.g. transitive closure
  • 43. ✔ Feature Construction ✔ Collective Reasoning ☐ Blocking
  • 44. Blocking: Motivation • Naïve pairwise: |N|^2 pairwise comparisons – 1,000 business listings from each of 1,000 different cities across the world – 1 trillion comparisons – 11.6 days (if each comparison is 1 μs) • Mentions from different cities are unlikely to be matches – Blocking criterion: City – 1 billion comparisons – 16 minutes (if each comparison is 1 μs)
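
The figures on slide 44 can be reproduced with a few lines of arithmetic. The variable names are chosen for this sketch; the counts follow the slide's |N|^2 convention, which counts ordered pairs (counting each unordered pair once would roughly halve both totals without changing the ratio).

```python
cities = 1_000
listings_per_city = 1_000
seconds_per_comparison = 1e-6  # 1 microsecond, as assumed on the slide

total_listings = cities * listings_per_city

# Naive: compare every listing with every listing.
naive_comparisons = total_listings ** 2

# Blocked by city: compare listings only within the same city.
blocked_comparisons = cities * listings_per_city ** 2

naive_days = naive_comparisons * seconds_per_comparison / 86_400
blocked_minutes = blocked_comparisons * seconds_per_comparison / 60

print(f"naive:   {naive_comparisons:.0e} comparisons, {naive_days:.1f} days")
print(f"blocked: {blocked_comparisons:.0e} comparisons, {blocked_minutes:.1f} minutes")
```

Blocking on city cuts the work by a factor equal to the number of cities (1,000 here), since each listing is compared against one city's listings instead of all of them.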
  • 45. Blocking: Motivation • Mentions from different cities are unlikely to be matches – May miss potential matches
  • 46. Blocking: Motivation [figure: three sets: all pairs of nodes, matching pairs of nodes, and pairs of nodes satisfying the blocking criterion]
  • 47. Blocking Algorithms 1 • Hash-based blocking – Each block C_i is associated with a hash key h_i – Mention x is hashed to C_i if hash(x) = h_i – Within a block, all pairs are compared – Each hash function results in disjoint blocks • What hash function? – Deterministic function of attribute values – Boolean functions over attribute values [Bilenko et al ICDM’06, Michelson et al AAAI’06, Das Sarma et al CIKM’12] – minHash (min-wise independent permutations) [Broder et al STOC’98]; locality-sensitive hashing
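
Hash-based blocking with a deterministic function of attribute values, as described on slide 47, might look like the sketch below. The records, the `city_key` normalization, and the function names are hypothetical; the point is only that each record lands in exactly one block and that pairs are generated within blocks.

```python
from collections import defaultdict
from itertools import combinations


def hash_blocks(records, key_fn):
    """Assign each record to the block named by its key (blocks are disjoint)."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_fn(rec)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Compare all pairs inside each block and nothing across blocks."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def city_key(rec):
    """Deterministic key: the city attribute, trimmed and lower-cased."""
    return rec["city"].strip().lower()


# Hypothetical business listings.
records = {
    1: {"name": "Joe's Pizza", "city": "Austin"},
    2: {"name": "Joes Pizza", "city": "austin "},
    3: {"name": "Joe's Pizza", "city": "Boston"},
    4: {"name": "Pizza Joe", "city": "Boston"},
}
pairs = candidate_pairs(hash_blocks(records, city_key))
```

Only two of the six possible pairs survive here. The cost is the one noted on slide 45: if the same business were listed under two different cities, that pair would never be compared.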
  • 48. Blocking Algorithms 2 • Pairwise similarity/neighborhood-based blocking – Nearby nodes according to a similarity metric are clustered together – Results in non-disjoint canopies • Techniques – Sorted Neighborhood Approach [Hernandez et al SIGMOD’95] – Canopy Clustering [McCallum et al KDD’00]
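
A minimal sketch of the sorted-neighborhood idea: sort records by a key, slide a fixed-size window over the sorted order, and compare only records that share a window. The sorting key (the lower-cased name), the window size, and the sample names are assumptions for illustration, and details of the cited method such as multiple passes with different keys are left out.

```python
def sorted_neighborhood_pairs(records, key_fn, window):
    """Candidate pairs: records at most window - 1 positions apart in sorted order."""
    ordered = sorted(records, key=lambda rid: key_fn(records[rid]))
    pairs = set()
    for i, rid in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.add(tuple(sorted((rid, other))))
    return pairs


# Hypothetical name records keyed by id.
names = {1: "john smith", 2: "jon smith", 3: "mary jones", 4: "j smith"}
pairs = sorted_neighborhood_pairs(names, str.lower, window=2)
```

Unlike hash blocks, the windows overlap: record 1 is paired with both record 4 and record 2, which is the non-disjoint behavior the slide attributes to neighborhood-based blocking.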
  • 49. ✔ Feature Construction ✔ Collective Reasoning ✔ Blocking
  • 51. http://psl.umiacs.umd.edu Stephen Bach, Matthias Broecheler, Alex Memory, Lily Mihalkova, Stanley Kok, Bert Huang, Angelika Kimmig, Ben London, Arti Ramesh, Jay Pujara, Shobeir Fakhraei, Hui Miao
  • 52. http://psl.umiacs.umd.edu Probabilistic Soft Logic (PSL): declarative language based on logics to express collective probabilistic inference problems – Predicate = relationship or property – Atom = (continuous) random variable – Rule = capture dependency or constraint – Set = define aggregates • PSL Program = Rules + Input DB
  • 53. Collective Classification vote(A,P) ∧ friend(B,A) → vote(B,P) : 0.3 vote(A,P) ∧ spouse(B,A) → vote(B,P) : 0.8 [figure: graph of spouse, friend, and colleague edges]
  • 54. PSL Foundations • PSL makes large-scale reasoning scalable by mapping logical rules to convex functions • Three principles justify this mapping: – LP programs for MAX SAT with approximation guarantees [Goemans and Williamson, ’94] – Pseudomarginal LP relaxations of Boolean Markov random fields [Wainwright et al., ’02] – Łukasiewicz logic, a logic for reasoning about continuous values [Klir and Yuan, ’95]
  • 55. Hinge-loss Markov Random Fields $P(Y \mid X) = \frac{1}{Z} \exp\left[ -\sum_{j=1}^{m} w_j \max\{\ell_j(Y, X), 0\}^{p_j} \right]$ • Continuous variables in [0,1] • Potentials are hinge-loss functions • Subject to arbitrary linear constraints • Log-concave!
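
To make the link between a weighted rule and a hinge-loss potential concrete, the sketch below relaxes a rule of the form body → head using Łukasiewicz logic, which slide 54 names as one foundation. The function names and the truth values are assumptions for illustration; the weight 0.8 is the one attached to the spouse rule on slide 31. This only evaluates a single potential and is not the PSL inference procedure.

```python
def luk_and(*values):
    """Łukasiewicz conjunction of soft truth values in [0, 1]."""
    return max(0.0, sum(values) - (len(values) - 1))


def distance_to_satisfaction(body, head):
    """Hinge: how far the rule body -> head is from being satisfied."""
    return max(0.0, luk_and(*body) - head)


def weighted_potential(weight, body, head, p=1):
    """One term of the sum in the exponent: weight * hinge ** p."""
    return weight * distance_to_satisfaction(body, head) ** p


# vote(A,P) ∧ spouse(B,A) → vote(B,P) : 0.8, with made-up truth values:
# vote(A,P) = 0.9, spouse(B,A) = 1.0, vote(B,P) = 0.4
penalty = weighted_potential(0.8, body=[0.9, 1.0], head=0.4)
```

The body evaluates to 0.9, so the rule falls short of satisfaction by about 0.5 and contributes a penalty of about 0.4. Raising vote(B,P) to 0.9 or above drives the penalty to zero, and because each term is a hinge of a linear function of the variables, the overall objective stays convex.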
  • 56. PSL in a Slide • MAP inference in PSL translates into a convex optimization problem, so inference is really fast! • Inference is further enhanced with state-of-the-art optimization and distributed processing paradigms such as ADMM and GraphLab, so inference is even faster! • Outperforms discrete MRFs in terms of speed, and (very) often accuracy • PSL is flexible: applied to image segmentation, activity recognition, stance detection, sentiment analysis, document classification, drug target prediction, latent social groups and trust, engagement modeling, ontology alignment, and looking for more!
  • 59. Patterns, Key Ideas, Tools. NEED: Data Science for Graphs
  • 60. Closing Comments • Need new statistical and ML theory for very large graphs – Better understand bias, identifiability, and mechanism design in graph domains • Important topics not touched on – Generative mechanisms, causal reasoning, dynamic modeling – Privacy, users, visualization • Many exciting opportunities to develop new theory and algorithms and apply them to compelling business, intelligence, social, and societal problems!
  • 61. Thank You! http://psl.umiacs.umd.edu Contact information: getoor@soe.ucsc.edu