SlideShare uma empresa Scribd logo
1 de 43
7/2/19 Heiko Paulheim 1
Machine Learning & Embeddings
for Large Knowledge Graphs
Heiko Paulheim
7/2/19 Heiko Paulheim 2
Crossing the Bridge from the Other Side
7/2/19 Heiko Paulheim 3
Crossing the Bridge from the Other Side
• There are plenty of established ML and DM toolkits...
– Weka
– RapidMiner
– scikit-learn
– R
• ...implementing all your favorite algorithms...
– Naive Bayes
– Random Forests
– SVMs
– (Deep) Neural Networks
– ...
• ...but they all work on feature vectors, not graphs!
7/2/19 Heiko Paulheim 4
Typical Tasks
• Knowledge Graph Internal
– Type prediction
– Link prediction
– Link validation
• Knowledge Graph External
– i.e., using the KG as background knowledge in some other task
– e.g., content-based recommender systems
– e.g., predictive modeling
●
who is the next nobel prize winner?
Gao et al.: Link Prediction Methods and Their Accuracy for Different Social Networks and Network Metrics.
Scientific Programming, 2014
Xu et al.: Explainable Reasoning over Knowledge Graphs for Recommendation. ebay tech blog, 2019
7/2/19 Heiko Paulheim 5
Example: Knowledge Graph Internal
• Type prediction
– Many instances in KGs are not typed or have very abstract types
– e.g., many actors are just typed as persons
• Classic approach
– Exploit ontology
– Shown to be rather sensitive to noise
• Example: ontology-based typing of Germany in DBpedia
– Airport, Award, Building, City, Country, Ethnic Group, Genre,
Language, Military Conflict, Mountain, Mountain Range, Person
Function, Place, Populated Place, Race, Route of Transportation,
Settlement, Stadium, Wine Region
Paulheim & Bizer: Type Inference on Noisy RDF Data. ISWC, 2013
Melo et al.: Type Prediction in Noisy RDF Knowledge Bases using Hierarchical Multilabel Classification with
Graph and Latent Features. IJAIT, 2017
7/2/19 Heiko Paulheim 6
Example: Knowledge Graph Internal
• Alternative: learn model for type prediction
– Train classifier to predict types (binary or hierarchical)
– More noise tolerant
Paulheim & Bizer: Improving the quality of linked data using statistical distributions. IJSWIS, 2014
7/2/19 Heiko Paulheim 7
Example: Knowledge Graph External
• Example machine learning task: predicting book sales
ISBN City Sold
3-2347-3427-1 Darmstadt 124
3-43784-324-2 Mannheim 493
3-145-34587-0 Roßdorf 14
...
ISBN City Population ... Genre Publisher ... Sold
3-2347-3427-1 Darm-
stadt
144402 ... Crime Bloody
Books
... 124
3-43784-324-2 Mann-
heim
291458 … Crime Guns Ltd. … 493
3-145-34587-0 Roß-
dorf
12019 ... Travel Up&Away ... 14
...
→ Crime novels sell better in larger cities
Paulheim & Fürnkranz: Unsupervised Generation of Data Mining Features from Linked Open Data.
WIMS, 2012
7/2/19 Heiko Paulheim 8
Example: The FeGeLOD Framework
IS B N
3 -2 3 4 7 -3 4 2 7 -1
C ity
D a r m s ta d t
# s o ld
1 2 4
N a m e d E n t it y
R e c o g n it io n
IS B N
3 -2 3 4 7 -3 4 2 7 - 1
C ity
D a r m s ta d t
# s o ld
1 2 4
C ity _ U R I
h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t
F e a t u r e
G e n e r a t io n
IS B N
3 - 2 3 4 7 -3 4 2 7 -1
C ity
D a r m s ta d t
# s o ld
1 2 4
C ity _ U R I
h ttp : / / d b p e d ia .o r g / r e s o u r c e / D a r m s ta d t
C ity _ U R I_ d b p e d ia -o w l: p o p u la tio n T o ta l
1 4 1 4 7 1
C ity _ U R I_ ...
...
F e a t u r e
S e le c t io n
IS B N
3 -2 3 4 7 -3 4 2 7 - 1
C ity
D a r m s ta d t
# s o ld
1 2 4
C ity _ U R I
h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t
C ity _ U R I_ d b p e d ia -o w l:p o p u la tio n T o ta l
1 4 1 4 7 1
Paulheim & Fürnkranz: Unsupervised Generation of Data Mining Features from Linked Open Data.
WIMS, 2012
7/2/19 Heiko Paulheim 9
The FeGeLOD Framework
• Entity Recognition
– Simple approach: guess DBpedia URIs
– Hit rate >95% for cities and countries (by English name)
• Feature Generation
– augmenting the dataset with additional attributes from KG
• Feature Selection
– Filter noise: >95% unknown, identical, or different nominals
Paulheim & Fürnkranz: Unsupervised Generation of Data Mining Features from Linked Open Data.
WIMS, 2012
7/2/19 Heiko Paulheim 10
Propositionalization
• Bridge Problem: Knowledge Graphs
vs. ML algorithms expecting Feature Vectors
→ wanted: a transformation from nodes to sets of features
?
Ristoski & Paulheim: A Comparison of Propositionalization Strategies for Creating Features from Linked
Open Data. LD4KD, 2014
7/2/19 Heiko Paulheim 11
Propositionalization
• Bridge Problem: Knowledge Graphs
vs. ML algorithms expecting Feature Vectors
→ wanted: a transformation from nodes to sets of features
• Basic strategies:
– literal values (e.g., population) are used directly
– instance types become binary features
– relations are counted (absolute, relative, TF-IDF)
– combinations of relations and object types are counted
(absolute, relative, TF-IDF)
– ...
Ristoski & Paulheim: A Comparison of Propositionalization Strategies for Creating Features from Linked
Open Data. LD4KD, 2014
7/2/19 Heiko Paulheim 12
Propositionalization ctd.
• Observations
– Even simple features (e.g., add all numbers and types)
can help on many problems
– More sophisticated features often bring additional improvements
●
Combinations of relations and individuals
– e.g., movies directed by Steven Spielberg
●
Combinations of relations and types
– e.g., movies directed by Oscar-winning directors
●
…
– But
●
The search space is enormous!
●
Generate first, filter later does not scale well
Ristoski & Paulheim: A Comparison of Propositionalization Strategies for Creating Features from Linked
Open Data. LD4KD, 2014
7/2/19 Heiko Paulheim 13
From Naive Propositionalization to
Knowledge Graph Embeddings
• Reconsidering the previous examples:
– We want to predict some attribute of a KG entity
●
e.g., types
●
e.g., sales figures of books
– ...given the entity’s vector representation
• How do we get a “good” vector representation for an entity?
– ...and: what is “good” in the first place?
7/2/19 Heiko Paulheim 14
From Naive Propositionalization to
Knowledge Graph Embeddings
• How do we get a “good” vector representation for an entity?
– ...and: what is “good” in the first place?
• “good” for machine learning means separable
– similar entities are close together
– different entities are further away
https://appliedmachinelearning.blog/2017/03/09/understanding-support-vector-machines-a-primer/
7/2/19 Heiko Paulheim 15
A Brief Excursion to word2vec
• A vector space model for words
• Introduced in 2013
• Each word becomes a vector
– similar words are close
– relations are preserved
– vector arithmetics are possible
https://www.adityathakker.com/introduction-to-word2vec-how-it-works/
7/2/19 Heiko Paulheim 16
A Brief Excursion to word2vec
• Assumption:
– Similar words appear in similar contexts
{Bush,Obama,Trump} was elected president of the United States
United States president {Bush,Obama,Trump} announced…
…
• Idea
– Train a network that can predict a word from its context (CBOW)
or the context from a word (Skip Gram)
Mikolov et al.: Efficient Estimation of Word Representations in Vector Space. 2013
7/2/19 Heiko Paulheim 17
A Brief Excursion to word2vec
• Skip Gram: train a neural network with one hidden layer
• Use output values at hidden layer as vector representation
• Observation:
– Bush, Obama, Trump will activate similar context words
– i.e., their output weights at the projection layer have to be similar
Mikolov et al.: Efficient Estimation of Word Representations in Vector Space. 2013
7/2/19 Heiko Paulheim 18
From word2vec to RDF2vec
• Word2vec operates on sentences, i.e., sequences of words
• Idea of RDF2vec
– First extract “sentences” from a graph
– Then train embedding using RDF2vec
• “Sentences” are extracted by performing random graph walks:
Year Zero Nine Inch Nails Trent
Reznor
• Experiments
– RDF2vec can be trained on large KGs (DBpedia, Wikidata)
– 300-500 dimensional vectors outperform
other propositionalization strategies
artist member
Ristoski & Paulheim: RDF2vec: RDF Graph Embeddings for Data Mining. ISWC, 2016
7/2/19 Heiko Paulheim 19
From word2vec to RDF2vec
• RDF2vec example
– similar instances form clusters
– direction of relations is stable
Ristoski & Paulheim: RDF2vec: RDF Graph Embeddings for Data Mining. ISWC, 2016
7/2/19 Heiko Paulheim 20
From word2vec to RDF2vec
• RecSys example: using proximity in latent RDF2vec feature space
Ristoski et al.: RDF2Vec: RDF Graph Embeddings and their Applications. SWJ 10(4), 2019
7/2/19 Heiko Paulheim 21
Extensions of RDF2vec
• Maybe random walks are not such a good idea
– They may give too much weight on less-known entities and facts
●
Strategies:
– Prefer edges with more frequent predicates
– Prefer nodes with higher indegree
– Prefer nodes with higher PageRank
– …
– They may cover less-known entities and facts too little
●
Strategies:
– The opposite of all of the above strategies
• Bottom line of experimental evaluation:
– Not one strategy fits all
Cochez et al.: Biased Graph Walks for RDF Graph Embeddings. WIMS, 2017
7/2/19 Heiko Paulheim 22
Other Word Embedding Methods
• GloVe (Global Word Embedding Vectors)
• Computes embeddings out of co-occurence statistics
– Using matrix factorization
• Has been applied to random RDF walks as well
• Experimental evaluation:
– In some cases, RDFGloVe outperforms RDF2vec
https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-
glove.html
Cochez et al.: Global RDF Vector Space Embeddings, ISWC, 2017
7/2/19 Heiko Paulheim 23
Other Word Embedding Methods
• There is a lot of promising stuff not yet tried
– e.g., biasing walks based on human factors
– e.g., more recent word embedding methods such as ELMo and BERT
https://www.nbcnews.com/feature/nbc-out/bert-ernie-are-gay-couple-sesame-street-writer-claims-n910701
7/2/19 Heiko Paulheim 24
TransE and its Descendants
• In RDF2vec, relation preservation is a by-product
• TransE: direct modeling
– Formulates RDF embedding as an optimization problem
– Find mapping of entities and relations to Rn
so that
●
across all triples <s,p,o>
Σ ||s+p-o|| is minimized
●
try to obtain a smaller error
for existing triples
than for non-existing ones
Bordes et al: Translating Embeddings for Modeling Multi-relational Data. NIPS 2013.
Fan et al.: Learning Embedding Representations for Knowledge Inference on Imperfect and Incomplete
Repositories. WI 2016
7/2/19 Heiko Paulheim 25
Limitations of TransE
• Symmetric properties
– we have to minimize
||Barack + spouse – Michelle|| and ||Michelle + spouse – Barack||
simultaneously
– ideally, Barack + spouse = Michelle and Michelle + spouse = Barack
●
Michelle and Barack become infinitely close
●
spouse becomes 0 vector
Michelle
Barack
7/2/19 Heiko Paulheim 26
Limitations of TransE
• Transitive Properties
– we have to minimize
||Miami + partOf – Florida|| and ||Florida + partOf – USA||, but also
||Miami + partOf – USA||
– ideally, Miami + partOf = Florida, Florida + partOf = USA,
Miami + partOf = USA
●
Again: all three become infinitely close
●
partOf becomes 0 vector
Florida
Miami
USA
7/2/19 Heiko Paulheim 27
Limitations of TransE
• One to many properties
– we have to minimize
||New York + partOf – USA||, ||Florida + partOf – USA||,
||Ohio + partOf – USA||, …
– ideally, NewYork + partOf = USA, Florida + partOf = USA,
Ohio + partOf = USA
●
all the subjects become infinitely close
Florida
USA
New York
Ohio
7/2/19 Heiko Paulheim 28
Limitations of TransE
• Reflexive properties
– we have to minimize
||Tom + knows - Tom||
– ideally, Tom + knows = Tom
●
Knows becomes 0 vector
Tom
7/2/19 Heiko Paulheim 29
TransE
RDF2Vec
HolE
DistMult
RESCAL
NTN
TransR
TransH
TransD
KG2E
ComplEx
Limitations of TransE
• Numerous variants of TransE have been proposed
to overcome limitations (e.g., TransH, TransR, TransD, …)
• Plus: embedding approaches based on tensor factorization etc.
7/2/19 Heiko Paulheim 30
Are we Driving on the Wrong Side of the Road?
7/2/19 Heiko Paulheim 31
Are we Driving on the Wrong Side of the Road?
• Original ideas:
– Assign meaning to data
– Allow for machine inference
– Explain inference results to the user
Berners-Lee et al: The Semantic Web. Scientific American, May 2001
7/2/19 Heiko Paulheim 32
Running Example: Recommender Systems
• Content based recommender systems backed by Semantic Web
data
– (today: knowledge graphs)
• Advantages
– use rich background information about recommended items (for free)
– justifications can be generated (e.g., you like movies by that director)
https://lazyprogrammer.me/tutorial-on-collaborative-filtering-and-matrix-factorization-in-python/
7/2/19 Heiko Paulheim 33
The 2009 Semantic Web Layer Cake
7/2/19 Heiko Paulheim 34
The 2019 Semantic Web Layer Cake
Embeddings
7/2/19 Heiko Paulheim 35
Towards Semantic Vector Space Embeddings
cartoon
superhero
Ristoski et al.: RDF2Vec: RDF Graph Embeddings and their Applications. SWJ 10(4), 2019
7/2/19 Heiko Paulheim 36
The Holy Grail
• Combine semantics and embeddings
– e.g., directly create meaningful dimensions
– e.g., learn interpretation of dimensions a posteriori
– ...
7/2/19 Heiko Paulheim 37
A New Design Space
quantitative
performance
semantic
interpretability
7/2/19 Heiko Paulheim 38
Software to Check Out
• http://openke.thunlp.org/
– Implements many embedding approaches
– Pre-trained vectors available, e.g., for Wikidata
7/2/19 Heiko Paulheim 39
Software to Check Out
• Loading RDF in Python: https://github.com/RDFLib/rdflib
7/2/19 Heiko Paulheim 40
RapidMiner Linked Open Data Extension
caution: works
only until RM6! :-(
7/2/19 Heiko Paulheim 41
References (1)
• Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific american, 284(5),
28-37.
• Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating
embeddings for modeling multi-relational data. In NIPS (pp. 2787-2795).
• Cochez, M., Ristoski, P., Ponzetto, S. P., & Paulheim, H. (2017). Biased graph walks for RDF
graph embeddings. In WIMS (p. 21). ACM.
• Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805.
• Melo, A., Völker, J., & Paulheim, H. (2017). Type prediction in noisy RDF knowledge bases using
hierarchical multilabel classification with graph and latent features. IJAIT, 26(02).
• Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781.
• Paulheim, H., & Fümkranz, J. (2012). Unsupervised generation of data mining features from
linked open data. In WIMS (p. 31). ACM.
• Paulheim, H., & Bizer, C. (2013). Type inference on noisy RDF data. In International semantic
web conference (pp. 510-525). Springer, Berlin, Heidelberg.
7/2/19 Heiko Paulheim 42
References (2)
• Paulheim, H., & Bizer, C. (2014). Improving the quality of linked data using statistical
distributions. IJSWIS, 10(2), 63-86.
• Paulheim, H. (2017). Knowledge graph refinement: A survey of approaches and evaluation
methods. Semantic web, 8(3), 489-508.
• Paulheim, H. (2018). Make Embeddings Semantic Again! ISWC (Blue Sky Track)
• Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018).
Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
• Ristoski, P., & Paulheim, H. (2014). A comparison of propositionalization strategies for creating
features from linked open data. Linked Data for Knowledge Discovery, 6.
• Ristoski, P., Bizer, C., & Paulheim, H. (2015). Mining the web of linked data with rapidminer. Web
Semantics: Science, Services and Agents on the World Wide Web, 35, 142-151.
• Ristoski, P., & Paulheim, H. (2016). Semantic Web in data mining and knowledge discovery: A
comprehensive survey. Web semantics, 36, 1-22.
• Ristoski, P., & Paulheim, H. (2016). RDF2vec: RDF graph embeddings for data mining. In
International Semantic Web Conference (pp. 498-514). Springer, Cham.
• Ristoski, P., Rosati, J., Di Noia, T., De Leone, R., & Paulheim, H. (2019). RDF2Vec: RDF graph
embeddings and their applications. Semantic Web, 10(4), 1-32.
7/2/19 Heiko Paulheim 43
Machine Learning & Embeddings
for Large Knowledge Graphs
Heiko Paulheim

Mais conteúdo relacionado

Mais procurados

Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
Liang Xiang
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engine
NYC Predictive Analytics
 

Mais procurados (20)

MapReduce
MapReduceMapReduce
MapReduce
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural search
 
Link Analysis
Link AnalysisLink Analysis
Link Analysis
 
Ranking algorithms
Ranking algorithmsRanking algorithms
Ranking algorithms
 
Building a Knowledge Graph using NLP and Ontologies
Building a Knowledge Graph using NLP and OntologiesBuilding a Knowledge Graph using NLP and Ontologies
Building a Knowledge Graph using NLP and Ontologies
 
Graph databases
Graph databasesGraph databases
Graph databases
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 
Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4j
 
Property graph vs. RDF Triplestore comparison in 2020
Property graph vs. RDF Triplestore comparison in 2020Property graph vs. RDF Triplestore comparison in 2020
Property graph vs. RDF Triplestore comparison in 2020
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Tutorial on Question Answering Systems
Tutorial on Question Answering Systems Tutorial on Question Answering Systems
Tutorial on Question Answering Systems
 
Syntactic analysis in NLP
Syntactic analysis in NLPSyntactic analysis in NLP
Syntactic analysis in NLP
 
WEB Scraping.pptx
WEB Scraping.pptxWEB Scraping.pptx
WEB Scraping.pptx
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdf
 
Session-Based Recommender Systems
Session-Based Recommender SystemsSession-Based Recommender Systems
Session-Based Recommender Systems
 
Web mining
Web miningWeb mining
Web mining
 
Social Network Analysis (SNA)
Social Network Analysis (SNA)Social Network Analysis (SNA)
Social Network Analysis (SNA)
 
Data mining
Data miningData mining
Data mining
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engine
 

Semelhante a Machine Learning & Embeddings for Large Knowledge Graphs

Ariadne's Thread -- Exploring a world of networked information built from fre...
Ariadne's Thread -- Exploring a world of networked information built from fre...Ariadne's Thread -- Exploring a world of networked information built from fre...
Ariadne's Thread -- Exploring a world of networked information built from fre...
Shenghui Wang
 
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talkDistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
Gezim Sejdiu
 
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter BonczFOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
Ioan Toma
 

Semelhante a Machine Learning & Embeddings for Large Knowledge Graphs (20)

New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vec
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vec
 
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
 
Ariadne's Thread -- Exploring a world of networked information built from fre...
Ariadne's Thread -- Exploring a world of networked information built from fre...Ariadne's Thread -- Exploring a world of networked information built from fre...
Ariadne's Thread -- Exploring a world of networked information built from fre...
 
Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUSSemantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
 
Exploiting Linked Open Data as Background Knowledge in Data Mining
Exploiting Linked Open Data as Background Knowledge in Data MiningExploiting Linked Open Data as Background Knowledge in Data Mining
Exploiting Linked Open Data as Background Knowledge in Data Mining
 
BDS14 Big Data Analytics to the masses
BDS14 Big Data Analytics to the massesBDS14 Big Data Analytics to the masses
BDS14 Big Data Analytics to the masses
 
DS2014: Feature selection in hierarchical feature spaces
DS2014: Feature selection in hierarchical feature spacesDS2014: Feature selection in hierarchical feature spaces
DS2014: Feature selection in hierarchical feature spaces
 
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talkDistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
 
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfSparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
 
Machine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge GraphsMachine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge Graphs
 
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
Can Deep Learning Techniques Improve Entity Linking?
Can Deep Learning Techniques Improve Entity Linking?Can Deep Learning Techniques Improve Entity Linking?
Can Deep Learning Techniques Improve Entity Linking?
 
Linked Open Data Utrecht University Library
Linked Open Data Utrecht University LibraryLinked Open Data Utrecht University Library
Linked Open Data Utrecht University Library
 
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter BonczFOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
 
Data-mining the Semantic Web @TCD
Data-mining the Semantic Web @TCDData-mining the Semantic Web @TCD
Data-mining the Semantic Web @TCD
 
NISO Bibliographic Roadmap Meeting Proposal
NISO Bibliographic Roadmap Meeting ProposalNISO Bibliographic Roadmap Meeting Proposal
NISO Bibliographic Roadmap Meeting Proposal
 
Quality, Relevance and Importance in Information Retrieval with Fuzzy Semanti...
Quality, Relevance and Importance in Information Retrieval with Fuzzy Semanti...Quality, Relevance and Importance in Information Retrieval with Fuzzy Semanti...
Quality, Relevance and Importance in Information Retrieval with Fuzzy Semanti...
 

Mais de Heiko Paulheim

Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
Beyond DBpedia and YAGO – The New Kids  on the Knowledge Graph BlockBeyond DBpedia and YAGO – The New Kids  on the Knowledge Graph Block
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
Heiko Paulheim
 

Mais de Heiko Paulheim (20)

Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
 
Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI SystemsKnowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
 
From Wikis to Knowledge Graphs
From Wikis to Knowledge GraphsFrom Wikis to Knowledge Graphs
From Wikis to Knowledge Graphs
 
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
Beyond DBpedia and YAGO – The New Kids  on the Knowledge Graph BlockBeyond DBpedia and YAGO – The New Kids  on the Knowledge Graph Block
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
 
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge GraphFrom Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!
 
How much is a Triple?
How much is a Triple?How much is a Triple?
How much is a Triple?
 
Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on Twitter
 
Towards Knowledge Graph Profiling
Towards Knowledge Graph ProfilingTowards Knowledge Graph Profiling
Towards Knowledge Graph Profiling
 
Knowledge Graphs on the Web
Knowledge Graphs on the WebKnowledge Graphs on the Web
Knowledge Graphs on the Web
 
Data-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and OntologyData-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and Ontology
 
Fast Approximate A-box Consistency Checking using Machine Learning
Fast Approximate  A-box Consistency Checking using Machine LearningFast Approximate  A-box Consistency Checking using Machine Learning
Fast Approximate A-box Consistency Checking using Machine Learning
 
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopServing DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
 
Combining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionCombining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly Detection
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia Entities
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open Data
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge Discovery
 
Mining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMiner
 

Último

Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
SayantanBiswas37
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 

Último (20)

Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 

Machine Learning & Embeddings for Large Knowledge Graphs

  • 1. 7/2/19 Heiko Paulheim 1 Machine Learning & Embeddings for Large Knowledge Graphs Heiko Paulheim
  • 2. 7/2/19 Heiko Paulheim 2 Crossing the Bridge from the Other Side
  • 3. 7/2/19 Heiko Paulheim 3 Crossing the Bridge from the Other Side • There are plenty of established ML and DM toolkits... – Weka – RapidMiner – scikit-learn – R • ...implementing all your favorite algorithms... – Naive Bayes – Random Forests – SVMs – (Deep) Neural Networks – ... • ...but they all work on feature vectors, not graphs!
  • 4. 7/2/19 Heiko Paulheim 4 Typical Tasks • Knowledge Graph Internal – Type prediction – Link prediction – Link validation • Knowledge Graph External – i.e., using the KG as background knowledge in some other task – e.g., content-based recommender systems – e.g., predictive modeling ● who is the next nobel prize winner? Gao et al.: Link Prediction Methods and Their Accuracy for Different Social Networks and Network Metrics. Scientific Programming, 2014 Xu et al.: Explainable Reasoning over Knowledge Graphs for Recommendation. ebay tech blog, 2019
  • 5. 7/2/19 Heiko Paulheim 5 Example: Knowledge Graph Internal • Type prediction – Many instances in KGs are not typed or have very abstract types – e.g., many actors are just typed as persons • Classic approach – Exploit ontology – Shown to be rather sensitive to noise • Example: ontology-based typing of Germany in DBpedia – Airport, Award, Building, City, Country, Ethnic Group, Genre, Language, Military Conflict, Mountain, Mountain Range, Person Function, Place, Populated Place, Race, Route of Transportation, Settlement, Stadium, Wine Region Paulheim & Bizer: Type Inference on Noisy RDF Data. ISWC, 2013 Melo et al.: Type Prediction in Noisy RDF Knowledge Bases using Hierarchical Multilabel Classification with Graph and Latent Features. IJAIT, 2017
  • 6. 7/2/19 Heiko Paulheim 6 Example: Knowledge Graph Internal • Alternative: learn model for type prediction – Train classifier to predict types (binary or hierarchical) – More noise tolerant Paulheim & Bizer: Improving the quality of linked data using statistical distributions. IJSWIS, 2014
  • 7. 7/2/19 Heiko Paulheim 7 Example: Knowledge Graph External • Example machine learning task: predicting book sales ISBN City Sold 3-2347-3427-1 Darmstadt 124 3-43784-324-2 Mannheim 493 3-145-34587-0 Roßdorf 14 ... ISBN City Population ... Genre Publisher ... Sold 3-2347-3427-1 Darm- stadt 144402 ... Crime Bloody Books ... 124 3-43784-324-2 Mann- heim 291458 … Crime Guns Ltd. … 493 3-145-34587-0 Roß- dorf 12019 ... Travel Up&Away ... 14 ... → Crime novels sell better in larger cities Paulheim & Fürnkranz: Unsupervised Generation of Data Mining Features from Linked Open Data. WIMS, 2012
  • 8. 7/2/19 Heiko Paulheim 8 Example: The FeGeLOD Framework IS B N 3 -2 3 4 7 -3 4 2 7 -1 C ity D a r m s ta d t # s o ld 1 2 4 N a m e d E n t it y R e c o g n it io n IS B N 3 -2 3 4 7 -3 4 2 7 - 1 C ity D a r m s ta d t # s o ld 1 2 4 C ity _ U R I h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t F e a t u r e G e n e r a t io n IS B N 3 - 2 3 4 7 -3 4 2 7 -1 C ity D a r m s ta d t # s o ld 1 2 4 C ity _ U R I h ttp : / / d b p e d ia .o r g / r e s o u r c e / D a r m s ta d t C ity _ U R I_ d b p e d ia -o w l: p o p u la tio n T o ta l 1 4 1 4 7 1 C ity _ U R I_ ... ... F e a t u r e S e le c t io n IS B N 3 -2 3 4 7 -3 4 2 7 - 1 C ity D a r m s ta d t # s o ld 1 2 4 C ity _ U R I h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t C ity _ U R I_ d b p e d ia -o w l:p o p u la tio n T o ta l 1 4 1 4 7 1 Paulheim & Fürnkranz: Unsupervised Generation of Data Mining Features from Linked Open Data. WIMS, 2012
  • 9. 7/2/19 Heiko Paulheim 9 The FeGeLOD Framework • Entity Recognition – Simple approach: guess DBpedia URIs – Hit rate >95% for cities and countries (by English name) • Feature Generation – augmenting the dataset with additional attributes from KG • Feature Selection – Filter noise: >95% unknown, identical, or different nominals Paulheim & Fürnkranz: Unsupervised Generation of Data Mining Features from Linked Open Data. WIMS, 2012
  • 10. 7/2/19 Heiko Paulheim 10 Propositionalization • Bridge Problem: Knowledge Graphs vs. ML algorithms expecting Feature Vectors → wanted: a transformation from nodes to sets of features ? Ristoski & Paulheim: A Comparison of Propositionalization Strategies for Creating Features from Linked Open Data. LD4KD, 2014
  • 11. 7/2/19 Heiko Paulheim 11 Propositionalization • Bridge Problem: Knowledge Graphs vs. ML algorithms expecting Feature Vectors → wanted: a transformation from nodes to sets of features • Basic strategies: – literal values (e.g., population) are used directly – instance types become binary features – relations are counted (absolute, relative, TF-IDF) – combinations of relations and object types are counted (absolute, relative, TF-IDF) – ... Ristoski & Paulheim: A Comparison of Propositionalization Strategies for Creating Features from Linked Open Data. LD4KD, 2014
  • 12. 7/2/19 Heiko Paulheim 12 Propositionalization ctd. • Observations – Even simple features (e.g., add all numbers and types) can help on many problems – More sophisticated features often bring additional improvements ● Combinations of relations and individuals – e.g., movies directed by Steven Spielberg ● Combinations of relations and types – e.g., movies directed by Oscar-winning directors ● … – But ● The search space is enormous! ● Generate first, filter later does not scale well Ristoski & Paulheim: A Comparison of Propositionalization Strategies for Creating Features from Linked Open Data. LD4KD, 2014
  • 13. 7/2/19 Heiko Paulheim 13 From Naive Propositionalization to Knowledge Graph Embeddings • Reconsidering the previous examples: – We want to predict some attribute of a KG entity ● e.g., types ● e.g., sales figures of books – ...given the entity’s vector representation • How do we get a “good” vector representation for an entity? – ...and: what is “good” in the first place?
  • 14. 7/2/19 Heiko Paulheim 14 From Naive Propositionalization to Knowledge Graph Embeddings • How do we get a “good” vector representation for an entity? – ...and: what is “good” in the first place? • “good” for machine learning means separable – similar entities are close together – different entities are further away https://appliedmachinelearning.blog/2017/03/09/understanding-support-vector-machines-a-primer/
  • 15. 7/2/19 Heiko Paulheim 15 A Brief Excursion to word2vec • A vector space model for words • Introduced in 2013 • Each word becomes a vector – similar words are close – relations are preserved – vector arithmetics are possible https://www.adityathakker.com/introduction-to-word2vec-how-it-works/
  • 16. 7/2/19 Heiko Paulheim 16 A Brief Excursion to word2vec • Assumption: – Similar words appear in similar contexts {Bush,Obama,Trump} was elected president of the United States United States president {Bush,Obama,Trump} announced… … • Idea – Train a network that can predict a word from its context (CBOW) or the context from a word (Skip Gram) Mikolov et al.: Efficient Estimation of Word Representations in Vector Space. 2013
  • 17. 7/2/19 Heiko Paulheim 17 A Brief Excursion to word2vec • Skip Gram: train a neural network with one hidden layer • Use output values at hidden layer as vector representation • Observation: – Bush, Obama, Trump will activate similar context words – i.e., their output weights at the projection layer have to be similar Mikolov et al.: Efficient Estimation of Word Representations in Vector Space. 2013
  • 18. 7/2/19 Heiko Paulheim 18 From word2vec to RDF2vec • Word2vec operates on sentences, i.e., sequences of words • Idea of RDF2vec – First extract “sentences” from a graph – Then train embedding using RDF2vec • “Sentences” are extracted by performing random graph walks: Year Zero Nine Inch Nails Trent Reznor • Experiments – RDF2vec can be trained on large KGs (DBpedia, Wikidata) – 300-500 dimensional vectors outperform other propositionalization strategies artist member Ristoski & Paulheim: RDF2vec: RDF Graph Embeddings for Data Mining. ISWC, 2016
  • 19. 7/2/19 Heiko Paulheim 19 From word2vec to RDF2vec • RDF2vec example – similar instances form clusters – direction of relations is stable Ristoski & Paulheim: RDF2vec: RDF Graph Embeddings for Data Mining. ISWC, 2016
  • 20. 7/2/19 Heiko Paulheim 20 From word2vec to RDF2vec • RecSys example: using proximity in latent RDF2vec feature space Ristoski et al.: RDF2Vec: RDF Graph Embeddings and their Applications. SWJ 10(4), 2019
  • 21. 7/2/19 Heiko Paulheim 21 Extensions of RDF2vec • Maybe random walks are not such a good idea – They may give too much weight on less-known entities and facts ● Strategies: – Prefer edges with more frequent predicates – Prefer nodes with higher indegree – Prefer nodes with higher PageRank – … – They may cover less-known entities and facts too little ● Strategies: – The opposite of all of the above strategies • Bottom line of experimental evaluation: – Not one strategy fits all Cochez et al.: Biased Graph Walks for RDF Graph Embeddings. WIMS, 2017
  • 22. 7/2/19 Heiko Paulheim 22 Other Word Embedding Methods • GloVe (Global Word Embedding Vectors) • Computes embeddings out of co-occurence statistics – Using matrix factorization • Has been applied to random RDF walks as well • Experimental evaluation: – In some cases, RDFGloVe outperforms RDF2vec https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data- glove.html Cochez et al.: Global RDF Vector Space Embeddings, ISWC, 2017
  • 23. 7/2/19 Heiko Paulheim 23 Other Word Embedding Methods • There is a lot of promising stuff not yet tried – e.g., biasing walks based on human factors – e.g., more recent word embedding methods such as ELMo and BERT https://www.nbcnews.com/feature/nbc-out/bert-ernie-are-gay-couple-sesame-street-writer-claims-n910701
  • 24. 7/2/19 Heiko Paulheim 24 TransE and its Descendants • In RDF2vec, relation preservation is a by-product • TransE: direct modeling – Formulates RDF embedding as an optimization problem – Find mapping of entities and relations to Rn so that ● across all triples <s,p,o> Σ ||s+p-o|| is minimized ● try to obtain a smaller error for existing triples than for non-existing ones Bordes et al: Translating Embeddings for Modeling Multi-relational Data. NIPS 2013. Fan et al.: Learning Embedding Representations for Knowledge Inference on Imperfect and Incomplete Repositories. WI 2016
  • 25. 7/2/19 Heiko Paulheim 25 Limitations of TransE • Symmetric properties – we have to minimize ||Barack + spouse – Michelle|| and ||Michelle + spouse – Barack|| simultaneously – ideally, Barack + spouse = Michelle and Michelle + spouse = Barack ● Michelle and Barack become infinitely close ● spouse becomes 0 vector Michelle Barack
  • 26. 7/2/19 Heiko Paulheim 26 Limitations of TransE • Transitive Properties – we have to minimize ||Miami + partOf – Florida|| and ||Florida + partOf – USA||, but also ||Miami + partOf – USA|| – ideally, Miami + partOf = Florida, Florida + partOf = USA, Miami + partOf = USA ● Again: all three become infinitely close ● partOf becomes 0 vector Florida Miami USA
  • 27. 7/2/19 Heiko Paulheim 27 Limitations of TransE • One to many properties – we have to minimize ||New York + partOf – USA||, ||Florida + partOf – USA||, ||Ohio + partOf – USA||, … – ideally, NewYork + partOf = USA, Florida + partOf = USA, Ohio + partOf = USA ● all the subjects become infinitely close Florida USA New York Ohio
  • 28. 7/2/19 Heiko Paulheim 28 Limitations of TransE • Reflexive properties – we have to minimize ||Tom + knows - Tom|| – ideally, Tom + knows = Tom ● Knows becomes 0 vector Tom
  • 29. 7/2/19 Heiko Paulheim 29 TransE RDF2Vec HolE DistMult RESCAL NTN TransR TransH TransD KG2E ComplEx Limitations of TransE • Numerous variants of TransE have been proposed to overcome limitations (e.g., TransH, TransR, TransD, …) • Plus: embedding approaches based on tensor factorization etc.
  • 30. 7/2/19 Heiko Paulheim 30 Are we Driving on the Wrong Side of the Road?
  • 31. 7/2/19 Heiko Paulheim 31 Are we Driving on the Wrong Side of the Road? • Original ideas: – Assign meaning to data – Allow for machine inference – Explain inference results to the user Berners-Lee et al: The Semantic Web. Scientific American, May 2001
  • 32. 7/2/19 Heiko Paulheim 32 Running Example: Recommender Systems • Content based recommender systems backed by Semantic Web data – (today: knowledge graphs) • Advantages – use rich background information about recommended items (for free) – justifications can be generated (e.g., you like movies by that director) https://lazyprogrammer.me/tutorial-on-collaborative-filtering-and-matrix-factorization-in-python/
  • 33. 7/2/19 Heiko Paulheim 33 The 2009 Semantic Web Layer Cake
  • 34. 7/2/19 Heiko Paulheim 34 The 2019 Semantic Web Layer Cake Embeddings
  • 35. 7/2/19 Heiko Paulheim 35 Towards Semantic Vector Space Embeddings cartoon superhero Ristoski et al.: RDF2Vec: RDF Graph Embeddings and their Applications. SWJ 10(4), 2019
  • 36. 7/2/19 Heiko Paulheim 36 The Holy Grail • Combine semantics and embeddings – e.g., directly create meaningful dimensions – e.g., learn interpretation of dimensions a posteriori – ...
  • 37. 7/2/19 Heiko Paulheim 37 A New Design Space quantitative performance semantic interpretability
  • 38. 7/2/19 Heiko Paulheim 38 Software to Check Out • http://openke.thunlp.org/ – Implements many embedding approaches – Pre-trained vectors available, e.g., for Wikidata
  • 39. 7/2/19 Heiko Paulheim 39 Software to Check Out • Loading RDF in Python: https://github.com/RDFLib/rdflib
  • 40. 7/2/19 Heiko Paulheim 40 RapidMiner Linked Open Data Extension caution: works only until RM6! :-(
  • 41. 7/2/19 Heiko Paulheim 41 References (1) • Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific american, 284(5), 28-37. • Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In NIPS (pp. 2787-2795). • Cochez, M., Ristoski, P., Ponzetto, S. P., & Paulheim, H. (2017). Biased graph walks for RDF graph embeddings. In WIMS (p. 21). ACM. • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. • Melo, A., Völker, J., & Paulheim, H. (2017). Type prediction in noisy RDF knowledge bases using hierarchical multilabel classification with graph and latent features. IJAIT, 26(02). • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. • Paulheim, H., & Fümkranz, J. (2012). Unsupervised generation of data mining features from linked open data. In WIMS (p. 31). ACM. • Paulheim, H., & Bizer, C. (2013). Type inference on noisy RDF data. In International semantic web conference (pp. 510-525). Springer, Berlin, Heidelberg.
  • 42. 7/2/19 Heiko Paulheim 42 References (2) • Paulheim, H., & Bizer, C. (2014). Improving the quality of linked data using statistical distributions. IJSWIS, 10(2), 63-86. • Paulheim, H. (2017). Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic web, 8(3), 489-508. • Paulheim, H. (2018). Make Embeddings Semantic Again! ISWC (Blue Sky Track) • Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365. • Ristoski, P., & Paulheim, H. (2014). A comparison of propositionalization strategies for creating features from linked open data. Linked Data for Knowledge Discovery, 6. • Ristoski, P., Bizer, C., & Paulheim, H. (2015). Mining the web of linked data with rapidminer. Web Semantics: Science, Services and Agents on the World Wide Web, 35, 142-151. • Ristoski, P., & Paulheim, H. (2016). Semantic Web in data mining and knowledge discovery: A comprehensive survey. Web semantics, 36, 1-22. • Ristoski, P., & Paulheim, H. (2016). RDF2vec: RDF graph embeddings for data mining. In International Semantic Web Conference (pp. 498-514). Springer, Cham. • Ristoski, P., Rosati, J., Di Noia, T., De Leone, R., & Paulheim, H. (2019). RDF2Vec: RDF graph embeddings and their applications. Semantic Web, 10(4), 1-32.
  • 43. 7/2/19 Heiko Paulheim 43 Machine Learning & Embeddings for Large Knowledge Graphs Heiko Paulheim