What knowledge bases know
(and what they don't)
Simon Razniewski
Free University of Bozen-Bolzano, Italy
Max Planck Institute for Informatics
(starting November 2017)
About myself
• Assistant professor at FU Bozen-Bolzano, South Tyrol, Italy (since 2014)
• PhD from FU Bozen-Bolzano (2014)
• Diplom from TU Dresden, Germany (2010)
• Research visits at UCSD (2012), AT&T Labs-Research (2013),
UQ (2015), MPII (2016)
Trilingual
The Alps’ oldest
criminal case: Ötzi
1/8th of EU apples
What do knowledge bases know?
What is a knowledge base?
→ A collection of general world knowledge
• Common sense:
• Apples are sweet or sour,
• Cats are smaller than cars
• Activities:
• “whisper” and “shout” are implementations of “talk”
• Facts:
• Saarbrücken is the capital of the Saarland
• Ötzi has blood type O
Factual KBs: An old dream of AI
• Early manual efforts (CYC, 1980s)
• Structured extraction (YAGO, DBpedia, 2000s)
• Text mining and extraction (NELL, Prospera,
Textrunner, 2000s)
• Back to the roots: Wikidata (2012)
KBs are useful (1/2): QA
What is the capital of the Saarland?
Try yourself:
• When was Trump born?
• What is the nickname of Ronaldo?
• Who invented the light bulb?
KBs are useful (2/2): Language Generation
• Wikipedia in the world’s most spoken language has 1/10 as many articles as English Wikipedia
• In the world’s fourth most spoken language: 1/100
→ Wikidata is intended to help resource-poor languages
KB construction: Current state
• More than 2300 papers with titles containing
“information extraction” in the last 4 years [Google Scholar]
• Large KBs at Google, Microsoft, Alibaba, Bloomberg, …
• Progress visible downstream
• IBM Watson beats humans in trivia game in 2011
• Entity linking systems close to human performance on
popular news corpora
• Systems pass 8th grade science tests
in the AllenAI Science challenge in 2016
• But how good are KBs themselves?
How good are the KBs that we build?
Is what they know true?
(precision or correctness)
→ Do they know what is true?
(recall or completeness)
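To make the two questions concrete, here is a minimal Python sketch that treats both the KB and the real world as sets of facts. The function name and the toy triples are illustrative assumptions, not part of the talk. In practice the full set of true facts is rarely available, which is why recall is the hard part.

```python
def precision_recall(kb_facts, true_facts):
    """Return (precision, recall) of a KB against a reference set of true facts."""
    correct = kb_facts & true_facts
    # Precision: share of KB facts that are true.
    precision = len(correct) / len(kb_facts) if kb_facts else 1.0
    # Recall: share of true facts that the KB contains.
    recall = len(correct) / len(true_facts) if true_facts else 1.0
    return precision, recall


# Hypothetical toy data: facts are (subject, relation, object) triples.
true_facts = {
    ("Saarbrücken", "capitalOf", "Saarland"),
    ("Ötzi", "bloodType", "O"),
    ("Obama", "child", "Malia"),
    ("Obama", "child", "Sasha"),
}
kb_facts = {
    ("Saarbrücken", "capitalOf", "Saarland"),
    ("Obama", "child", "Malia"),
    ("Ötzi", "bloodType", "A"),  # deliberately wrong fact: lowers precision
}

precision, recall = precision_recall(kb_facts, true_facts)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```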
KBs know much of what is true
• Google Knowledge Graph: 39 out of 48 Tarantino movies
• DBpedia: 167 out of 204 Nobel laureates in Physics
• Wikidata: 2 out of 2 children of Obama
Affiliations
https://query.wikidata.org/
SELECT (COUNT(?p) AS ?result)
WHERE { ?p wdt:P108 wd:Q700758. }  # ?p worksFor Saarland_University
Results:
• Saarland University: 325
• MPI-INF: 2
• MPI-SWS: 0
KBs know little of what is true
• DBpedia: contains 6 out of 35 Dijkstra Prize winners
• Google Knowledge Graph: “Points of Interest” – completeness?
• Wikidata does not know much about employees here
So, how complete are KBs?
What previous work says
[Dong et al., KDD 2014]
There are known knowns; there are
things we know we know. We also
know there are known unknowns;
that is to say we know there are some
things we do not know. But there are
also unknown unknowns – the ones
we don't know we don't know.
KB engineers have only tried to
make KBs bigger. The point,
however, is to understand what
they are trying to approximate.
Outline – Assessing KB recall
1. Logical foundations
2. Rule mining
3. Information extraction
4. Data presence heuristic
Closed and open-world assumption
worksIn
Name    Department
John    D1
Mary    D2
Bob     D3
Query                  Closed-world assumption    Open-world assumption
worksIn(John, D1)?     Yes                        Yes
worksIn(Ellen, D3)?    No                         Maybe
• (Relational) databases traditionally employ the closed-world assumption
• KBs necessarily operate under the open-world assumption
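The difference between the two assumptions can be sketched in a few lines of Python. The function names are illustrative assumptions; the data is the worksIn table from this slide.

```python
# The worksIn relation from the slide, as a set of (name, department) pairs.
works_in = {("John", "D1"), ("Mary", "D2"), ("Bob", "D3")}


def ask_closed_world(fact, kb):
    """Closed-world assumption: whatever is not stored is false."""
    return "Yes" if fact in kb else "No"


def ask_open_world(fact, kb):
    """Open-world assumption: whatever is not stored is unknown."""
    return "Yes" if fact in kb else "Maybe"


for fact in [("John", "D1"), ("Ellen", "D3")]:
    print(fact, ask_closed_world(fact, works_in), ask_open_world(fact, works_in))
```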
Open-world assumption
• Q: Hamlet written by Goethe?
KB: Maybe
• Q: Schwarzenegger lives in Dudweiler?
KB: Maybe
• Q: Trump brother of Kim Jong Un?
KB: Maybe
→ Open-world assumption is often too cautious
Teaching KBs to say “no”
• Need power to express both maybe and no
= Partial-closed world assumption
• Approach: Completeness statements [Motro 1989]
Completeness statement:
worksIn is complete for employees of D1
worksIn
Name    Department
John    D1
Mary    D2
Bob     D3
Query                  Answer
worksIn(John, D1)?     Yes
worksIn(Ellen, D1)?    No
worksIn(Ellen, D3)?    Maybe
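A minimal Python sketch of query answering under the partial-closed world assumption, using the same table. Representing a completeness statement as a set of department names is a simplifying assumption made for this example only.

```python
works_in = {("John", "D1"), ("Mary", "D2"), ("Bob", "D3")}

# Completeness statement: worksIn is complete for employees of D1.
complete_departments = {"D1"}


def ask_partial_closed_world(name, department):
    """Answer Yes / No / Maybe for worksIn(name, department)."""
    if (name, department) in works_in:
        return "Yes"
    if department in complete_departments:
        # The KB is asserted to list every employee of this department.
        return "No"
    # No completeness guarantee: the fact may simply be missing.
    return "Maybe"


for query in [("John", "D1"), ("Ellen", "D1"), ("Ellen", "D3")]:
    print(query, ask_partial_closed_world(*query))
```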
Completeness statements
• Assertions about the available database containing
all information on a certain topic
“worksIn is complete for employees of D1”
• Form constraints between an ideal database and
the available database
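$\forall x:\ \mathit{worksIn}^{i}(x, D1) \rightarrow \mathit{worksIn}^{a}(x, D1)$
(superscript $i$ = ideal database, $a$ = available database)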
• Can have expressivity ranging from simple
selections up to first-order-logic
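Read as a constraint, a completeness statement says that every matching fact of the ideal database also appears in the available one. The sketch below checks this for the simple selection case. The ideal database is of course not observable in practice; the one used here is a hypothetical stand-in, and the function name is an assumption.

```python
def statement_holds(ideal, available, department):
    """Check: every worksIn(x, department) fact of the ideal DB is in the available DB."""
    return all(fact in available for fact in ideal if fact[1] == department)


# Hypothetical ideal database: the real world also has Ellen working in D3.
ideal = {("John", "D1"), ("Mary", "D2"), ("Bob", "D3"), ("Ellen", "D3")}
available = {("John", "D1"), ("Mary", "D2"), ("Bob", "D3")}

print(statement_holds(ideal, available, "D1"))
print(statement_holds(ideal, available, "D3"))
```

With this toy data the statement for D1 is satisfied, while a statement for D3 would be violated because Ellen is missing from the available database.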
If you have completeness statements
you can do wonderful things…
• Develop techniques for deciding whether a
conjunctive query answer is complete [VLDB 2011]
• Assign unambiguous semantics to SQL nulls
[CIKM 2012]
• Create an algebra for propagating completeness
[SIGMOD 2015]
• Ensure the soundness of queries with negation
[ICWE 2016]
• ….
Where would completeness
statements come from?
• Data creators should pass them along as metadata
• Or editors should add them in curation steps
• Developed plugin and external tool COOL-WD
(Completeness tool for Wikidata)
But…
• Requires human effort
• Editors are lazy
• Automatically created KBs do not even have editors
Remainder of this talk:
How to automatically acquire information
about KB completeness/recall
Rule mining: Idea (1/2)
Certain patterns in data hint at completeness/incompleteness
• People with a death date but no death place are incomplete for death place
• Movies with a producer are complete for directors
• People with fewer than two parents are incomplete for parents
Rule mining: Idea (2/2)
• Examples can be expressed as Horn rules:
dateOfDeath(X, Y) ∧ lessThan1(X, placeOfDeath)
⇒ incomplete(X, placeOfDeath)
movie(X) ∧ producer(X, Z) ⇒ complete(X, director)
lessThan2(X, hasParent) ⇒ incomplete(X, hasParent)
Can such patterns be discovered
with association rule mining?
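To show what such rules do once they exist, the sketch below applies the three example rules to a toy KB. It only applies hand-written rules; it does not mine them. The entity names and the "type" relation are hypothetical, and the person guard on the parent rule follows the prose version of the example rather than the rule as written.

```python
def count_values(kb, entity, relation):
    """Number of objects stored for (entity, relation)."""
    return sum(1 for (s, r, _o) in kb if s == entity and r == relation)


def is_a(kb, entity, cls):
    return (entity, "type", cls) in kb


def predict(kb, entity):
    """Apply the example rules; return {relation: 'complete' | 'incomplete'}."""
    predictions = {}
    # dateOfDeath(X, Y) ∧ lessThan1(X, placeOfDeath) ⇒ incomplete(X, placeOfDeath)
    if count_values(kb, entity, "dateOfDeath") >= 1 and count_values(kb, entity, "placeOfDeath") < 1:
        predictions["placeOfDeath"] = "incomplete"
    # movie(X) ∧ producer(X, Z) ⇒ complete(X, director)
    if is_a(kb, entity, "movie") and count_values(kb, entity, "producer") >= 1:
        predictions["director"] = "complete"
    # lessThan2(X, hasParent) ⇒ incomplete(X, hasParent), restricted to persons
    if is_a(kb, entity, "person") and count_values(kb, entity, "hasParent") < 2:
        predictions["hasParent"] = "incomplete"
    return predictions


# Hypothetical toy KB of (subject, relation, object) triples.
kb = {
    ("alice", "type", "person"),
    ("alice", "dateOfDeath", "1990-01-01"),
    ("alice", "hasParent", "bob"),
    ("film1", "type", "movie"),
    ("film1", "producer", "carol"),
    ("film1", "director", "dave"),
}

print(predict(kb, "alice"))
print(predict(kb, "film1"))
```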
Rule mining: Implementation
• We extended the AMIE association rule mining system
with predicates on
• Complete/incomplete: complete(X, director)
• Object counts: lessThan2(X, hasParent)
• Popularity: popular(X)
• Negated classes: person(X) ∧ ¬adult(X)
• Then mined rules with complete/incomplete in the head
for 20 YAGO/Wikidata relations
• Result: Can predict (in-)completeness
with 46-100% F-score
[Galarraga et al., WSDM 2017]
Rule mining: Challenges
• Consensus:
human(x) ⇒ Complete(x, graduatedFrom)
schoolteacher(x) ⇒ Incomplete(x, graduatedFrom)
professor(x) ⇒ Complete(x, graduatedFrom)
John ∈ (human, schoolteacher, professor)
→ Complete(John, graduatedFrom)?
• Rare properties require very large training data
• E.g., monks being complete for spouses
• Annotated ~3000 rows at 10ct/row → 0 monks
Information extraction: Idea
Text: “… Barack and Michelle have two children …”
KB: 0 → Recall: 0%
KB: 1 → Recall: 50%
KB: 2 → Recall: 100%
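The arithmetic behind this idea is simple: divide the number of objects the KB holds by the number stated in the text. A minimal sketch follows; the function name and the capping at 100% (for KBs that hold more objects than the text mentions) are assumptions of this example.

```python
def recall_from_cardinality(kb_count, stated_count):
    """Estimate recall as (# objects in the KB) / (# objects stated in the text)."""
    if stated_count <= 0:
        raise ValueError("stated_count must be positive")
    # Cap at 1.0: the text may understate the true cardinality.
    return min(kb_count / stated_count, 1.0)


# "Barack and Michelle have two children" gives a stated cardinality of 2.
for kb_count in (0, 1, 2):
    print(f"KB: {kb_count}  Recall: {recall_from_cardinality(kb_count, 2):.0%}")
```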
Information extraction: Implementation
• Developed a CRF-based classifier for identifying
numbers that express relation cardinalities
• Works for a variety of topics such as
• Family relations: “has 2 siblings”
• Geopolitics: “is composed of seven boroughs”
• Artwork: “consists of three episodes”
• Finds the existence of 178% more children than
currently in Wikidata
[Mirza et al, ISWC 2016 + ACL 2017]
Information extraction: Challenges
• Cardinalities are frequently expressed non-numerically:
• Nouns: “has twins”, “is a trilogy”
• Indefinite articles: “They have a daughter”
• Negation/adjectives: “Have no children”, “is childless”
• Often requires reasoning
“Has 3 children from Ivana and one from Marla”
• Training (distant supervision) struggles with false positives
• KBs used for training are themselves incomplete
President Garfield: Wikidata knows of only 4 out of 7 children
Vision: Make IE recall-aware
Textual information extraction usually gives precision estimates
“John was born in Malmö, Sweden.” → citizenship(John, Sweden) – precision 95%
“John grew up in Malmö, Sweden.” → citizenship(John, Sweden) – precision 70%
Can we also produce recall estimates?
“John has a son, Tom, and a daughter, Susan.” → child(John, Tom), child(John, Susan) – recall 90%
“John brought his children Susan and Tom to school.” → child(John, Tom), child(John, Susan) – recall 30%
Data presence heuristic: Idea
KB: dateOfBirth(John, 17.5.1983)
Q: dateOfBirth(John, 31.12.1999)?
A: Probably not
Single-value properties:
• Having one value → property is complete
• Looking at the data alone suffices
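For single-value properties the heuristic fits in one small function. This is a sketch under the assumption that the property really is single-valued; the function name and the three answer strings are illustrative.

```python
def ask_with_presence_heuristic(kb, subject, prop, value):
    """Answer a fact query, treating a property as complete once it has a value."""
    known = {o for (s, p, o) in kb if s == subject and p == prop}
    if value in known:
        return "Yes"
    if known:
        # A different value is present, so the property is assumed complete.
        return "Probably not"
    # No value at all: nothing can be concluded.
    return "Maybe"


kb = {("John", "dateOfBirth", "17.5.1983")}

print(ask_with_presence_heuristic(kb, "John", "dateOfBirth", "31.12.1999"))
print(ask_with_presence_heuristic(kb, "Mary", "dateOfBirth", "31.12.1999"))
```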
What are single-value properties?
Extreme case, but…
• Multiple citizenships
• More parents due to adoption
• Several Twitter accounts due to the presidency
All hopes lost?
• Presence of a value is better than nothing
• Even better: For non-functional attributes,
data is still frequently added in batches
• All clubs Diego Maradona played for
• All ministers of Merkel’s new cabinet
• …
• Checking data presence is a common heuristic
among Wikidata editors
Value presence heuristic - example
[https://www.wikidata.org/wiki/Wikidata:Wikivoyage/Lists/Embassies]
Data presence heuristic: Challenges
4.1: Which properties to look at?
4.2: How to quantify data presence?
4.1: Which properties to look at? (1/2)
• Complete(Wikidata for Putin)?
• There are more than 3000 properties one can assign to Putin…
• Not all properties are relevant to everyone.
(Think of goals scored or monastic order)
• Are at least all relevant properties there?
• What do you mean by relevant?
4.1: Which properties to look at? (2/2)
We used crowdsourcing to annotate 350 random (person, property1, property2) triples with human perception of interestingness
State-of-the-art approach gets 61% of high-agreement triples right
• Mistakes frequency for interestingness
Our method, which also uses linguistic similarity, achieves 75%
[Razniewski et al., ADMA 2017]
4.2: How to quantify data presence?
We have values for 46 out of 77 relevant properties for Putin
→ Hard to interpret
Proposal: Quantify based on comparison
with other similar entities
Ingredients:
• Similarity metric: Who is similar to Trump?
• Data quantification: How much data is good/bad?
• Deployed on Wikidata, but evaluation difficult
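One possible way to turn the comparison into a number is sketched below: a property counts as relevant if enough similar entities have it, and the score is the share of relevant properties the entity has. This is an illustration only, not the deployed tool's method; the threshold `min_share`, the function name, and the toy property sets are all assumptions.

```python
from collections import Counter


def relative_completeness(entity_props, peer_props, min_share=0.5):
    """Share of properties common among peers that the entity also has.

    entity_props: set of properties present for the entity.
    peer_props:   list of property sets, one per similar entity.
    min_share:    a property is relevant if at least this share of peers has it.
    Returns (score, missing_relevant_properties).
    """
    counts = Counter(p for props in peer_props for p in props)
    relevant = {p for p, c in counts.items() if c / len(peer_props) >= min_share}
    if not relevant:
        return 1.0, set()
    missing = relevant - entity_props
    return 1 - len(missing) / len(relevant), missing


# Hypothetical property sets of four entities similar to the one being assessed.
peers = [
    {"dateOfBirth", "placeOfBirth", "position", "party"},
    {"dateOfBirth", "position", "party", "spouse"},
    {"dateOfBirth", "placeOfBirth", "position"},
    {"dateOfBirth", "party", "goalsScored"},
]
score, missing = relative_completeness({"dateOfBirth", "position"}, peers)
print(score, sorted(missing))
```

Rare properties such as goalsScored fall below the threshold and are ignored, which matches the observation that not all properties are relevant to everyone.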
[Ahmeti et al., ESWC 2017]
https://www.wikidata.org/wiki/User:Ls1g/Recoin
Quantifying groups
Outline – Assessing KB recall
1. Logical foundations
2. Rule mining
3. Information extraction
4. Data presence heuristic
5. Summary
Summary (1/3)
• Increasing KB quality can to some extent
be noticed downstream
• Precision easy to evaluate
• Recall largely unknown
Summary (2/3)
• Ideal is human-curated completeness information
• Created in conjunction with data (COOL-WD tool)
• Not really scalable
• Automated alternatives:
• Association rule mining
• Information extraction
• Looking at existence of data is a useful start
Summary (3/3)
• Recall-aware information extraction an open
challenge
• Concepts of relevance and relative completeness
in KBs little understood to date
• I look forward to fruitful collaborations with UdS,
MPI-SWS and MPI-INF

What knowledge bases know (and what they don't)

  • 1. What knowledge bases know (and what they don't) Simon Razniewski Free University of Bozen-Bolzano, Italy Max Planck Institute for Informatics (starting November 2017)
  • 2. About myself • Assistant professor at FU Bozen-Bolzano, South Tyrol, Italy (since 2014) • PhD from FU Bozen-Bolzano (2014) • Diplom from TU Dresden, Germany (2010) • Research visits at UCSD (2012), AT&T Labs-Research (2013), UQ (2015), MPII (2016) Trilingual The Alps’ oldest criminal case: Ötzi 1/8th of EU apples
  • 3. What do knowledge bases know? What is a knowledge base?  A collection of general world knowledge • Common sense: • Apples are sweet or sour, • Cats are smaller than cars • Activities: • “whisper” and “shout” are implementations of “talk” • Facts: • Saarbrücken is the capital of the Saarland • Ötzi has blood type O 3
  • 4. Factual KBs: An old dream of AI • Early manual efforts (CYC, 1980s) • Structured extraction (YAGO, DBpedia, 2000s) • Text mining and extraction (NELL, Prospera, Textrunner, 2000s) • Back to the roots: Wikidata (2012) 4
  • 5.
  • 6. KBs are useful (1/2): QA What is the capital of the Saarland? Try yourself: • When was Trump born? • What is the nickname of Ronaldo? • Who invented the light bulb? Q: What is the capital of the Saarland?
  • 7. KBs are useful (2/2): Language Generation 7 • Wikipedia in world’s most spoken language: 1/10 as many articles as English Wikipedia • World’s fourth most spoken language: 1/100  Wikidata intended to help resource-poor languages
  • 8. KB construction: Current state • More than 2300 papers with titles containing “information extraction” in the last 4 years [Google Scholar] • Large KBs at Google, Microsoft, Alibaba, Bloomberg, … • Progress visible downstream • IBM Watson beats humans in trivia game in 2011 • Entity linking systems close to human performance on popular news corpora • Systems pass 8th grade science tests in the AllenAI Science challenge in 2016 • But how good are KBs themself? 8
  • 9. How good are the KBs that we build? Is what they know true? (precision or correctness)  Do they know what is true? (recall or completeness) 9
  • 10. KBs know much of what is true • Google Knowledge Graph: 39 out of 48 Tarantino movies ✓ • DBpedia: 167 out of 204 Nobel laureates in Physics ✓ • Wikidata: 2 out of 2 children of Obama ✓
  • 11. Affiliations https://query.wikidata.org/ SELECT (COUNT(?p) as ?result) WHERE {?p worksFor Saarland_University.} with worksFor = wdt:P108 and Saarland_University = wd:Q700758 • Saarland University: 325 • MPI-INF: 2 • MPI-SWS: 0
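The placeholder query on this slide can be spelled out with the two identifiers the slide gives (wdt:P108 for worksFor, wd:Q700758 for Saarland University). The following is a minimal sketch that only builds the query string; the helper name `count_employees_query` is hypothetical, and results from the live endpoint will likely differ from the counts reported in the talk.

```python
def count_employees_query(employer_qid: str) -> str:
    """Build a SPARQL query counting items whose employer (P108) is the given item."""
    # Hypothetical helper for illustration; the wd: and wdt: prefixes are
    # predefined on https://query.wikidata.org/, so no PREFIX lines are added.
    return (
        "SELECT (COUNT(?p) AS ?result) WHERE { "
        f"?p wdt:P108 wd:{employer_qid} . "
        "}"
    )


# Q700758 is the identifier shown on the slide for Saarland University.
print(count_employees_query("Q700758"))
```

The printed string can be pasted into the query service; swapping in the identifier of another institution gives the corresponding count.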
  • 12. KBs know little of what is true • DBpedia: contains 6 out of 35 Dijkstra Prize winners ☹ • Google Knowledge Graph: “Points of Interest” – Completeness? ☹ • Wikidata knows little about the employees here ☹
  • 13. So, how complete are KBs?
  • 14. What previous work says [Dong et al., KDD 2014] • “There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know.” • KB engineers have only tried to make KBs bigger. The point, however, is to understand what they are trying to approximate.
  • 15. Outline – Assessing KB recall 1. Logical foundations 2. Rule mining 3. Information extraction 4. Data presence heuristic
  • 16. Outline – Assessing KB recall 1. Logical foundations 2. Rule mining 3. Information extraction 4. Data presence heuristic
  • 17. Closed and open-world assumption • Table worksIn (Name, Department): (John, D1), (Mary, D2), (Bob, D3) • worksIn(John, D1)? Closed-world assumption → Yes; open-world assumption → Yes • worksIn(Ellen, D3)? Closed-world assumption → No; open-world assumption → Maybe • (Relational) databases traditionally employ the closed-world assumption • KBs necessarily operate under the open-world assumption
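The difference between the two assumptions can be illustrated on the slide's worksIn table. This is a minimal sketch; the function names and the string answers are illustrative choices, not part of the talk.

```python
# Toy worksIn table from the slide, stored as a set of (name, department) facts.
WORKS_IN = {("John", "D1"), ("Mary", "D2"), ("Bob", "D3")}


def answer_closed_world(fact, kb):
    """Closed world: anything that is not stored is taken to be false."""
    return "yes" if fact in kb else "no"


def answer_open_world(fact, kb):
    """Open world: anything that is not stored is merely unknown."""
    return "yes" if fact in kb else "maybe"


for fact in [("John", "D1"), ("Ellen", "D3")]:
    print(fact, answer_closed_world(fact, WORKS_IN), answer_open_world(fact, WORKS_IN))
```

Both functions agree on stored facts and differ only in how they treat missing ones, which is exactly where the two assumptions part ways.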
  • 18. Open-world assumption • Q: Hamlet written by Goethe? KB: Maybe • Q: Schwarzenegger lives in Dudweiler? KB: Maybe • Q: Trump brother of Kim Jong Un? KB: Maybe → Open-world assumption often too cautious
  • 19. Teaching KBs to say “no” • Need power to express both maybe and no = Partial-closed world assumption • Approach: Completeness statements [Motro 1989] • Table worksIn (Name, Department): (John, D1), (Mary, D2), (Bob, D3) • Completeness statement: worksIn is complete for employees of D1 • worksIn(John, D1)? → Yes • worksIn(Ellen, D1)? → No • worksIn(Ellen, D3)? → Maybe
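The three answers on this slide follow from combining the stored facts with the completeness statement. A minimal sketch, assuming completeness statements are restricted to whole departments (the slide allows more expressive statements); all names are illustrative.

```python
WORKS_IN = {("John", "D1"), ("Mary", "D2"), ("Bob", "D3")}

# Completeness statement from the slide: worksIn is complete for employees of D1.
COMPLETE_DEPARTMENTS = {"D1"}


def answer_partial_closed_world(fact, kb, complete_departments):
    """Answer yes / no / maybe under the partial-closed world assumption."""
    _name, department = fact
    if fact in kb:
        return "yes"
    if department in complete_departments:
        # The KB is declared complete here, so a missing fact is false.
        return "no"
    # No completeness guarantee: fall back to open-world behaviour.
    return "maybe"


for fact in [("John", "D1"), ("Ellen", "D1"), ("Ellen", "D3")]:
    print(fact, answer_partial_closed_world(fact, WORKS_IN, COMPLETE_DEPARTMENTS))
```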
  • 20. Completeness statements • Assertions about the available database containing all information on a certain topic: “worksIn is complete for employees of D1” • Form constraints between an ideal database and the available database: ∀x: worksIn^i(x, D1) → worksIn^a(x, D1), where i marks the ideal and a the available database • Can have expressivity ranging from simple selections up to first-order logic
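Read as a constraint, the statement says that every D1 fact of the ideal database must also appear in the available one. The sketch below checks this on toy data. The ideal database is not observable in practice, so `IDEAL` here is an assumption made up for illustration, including the extra fact about Ellen.

```python
# Hypothetical ideal database (unknown in practice) and the available one.
IDEAL = {("John", "D1"), ("Mary", "D2"), ("Bob", "D3"), ("Ellen", "D3")}
AVAILABLE = {("John", "D1"), ("Mary", "D2"), ("Bob", "D3")}


def completeness_statement_holds(ideal, available, department):
    """Check: for all x, worksIn^i(x, department) implies worksIn^a(x, department)."""
    return all(fact in available for fact in ideal if fact[1] == department)


print(completeness_statement_holds(IDEAL, AVAILABLE, "D1"))  # statement is satisfied
print(completeness_statement_holds(IDEAL, AVAILABLE, "D3"))  # Ellen is missing
```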
  • 21. If you have completeness statements you can do wonderful things… • Develop techniques for deciding whether a conjunctive query answer is complete [VLDB 2011] • Assign unambiguous semantics to SQL nulls [CIKM 2012] • Create an algebra for propagating completeness [SIGMOD 2015] • Ensure the soundness of queries with negation [ICWE 2016] • …
  • 22. Where would completeness statements come from? • Data creators should pass them along as metadata • Or editors should add them in curation steps • Developed plugin and external tool COOL-WD (Completeness tool for Wikidata)
  • 23.
  • 24. But… • Requires human effort • Editors are lazy • Automatically created KBs do not even have editors Remainder of this talk: How to automatically acquire information about KB completeness/recall
  • 25. Outline – Assessing KB recall 1. Logical foundations 2. Rule mining 3. Information extraction 4. Data presence heuristic
  • 26. Rule mining: Idea (1/2) Certain patterns in data hint at completeness/incompleteness • People with a death date but no death place are incomplete for death place • Movies with a producer are complete for directors • People with fewer than two parents are incomplete for parents
  • 27. Rule mining: Idea (2/2) • Examples can be expressed as Horn rules: • dateOfDeath(X, Y) ∧ lessThan1(X, placeOfDeath) ⇒ incomplete(X, placeOfDeath) • movie(X) ∧ producer(X, Z) ⇒ complete(X, director) • lessThan2(X, hasParent) ⇒ incomplete(X, hasParent) • Can such patterns be discovered with association rule mining?
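To make the three Horn rules concrete, the sketch below applies them by hand to a toy set of triples. It only evaluates given rules and does not mine them, so it is not the AMIE-based system of the talk. The entities and the `type` predicate used to encode movie(X) are invented for illustration.

```python
# Toy KB as (subject, predicate, object) triples; all entities are made up.
TRIPLES = [
    ("alice", "dateOfDeath", "1990-01-01"),
    ("film1", "type", "movie"),
    ("film1", "producer", "bob"),
    ("carol", "hasParent", "dave"),
]


def object_count(triples, subject, predicate):
    """Number of objects stored for (subject, predicate)."""
    return sum(1 for s, p, _ in triples if s == subject and p == predicate)


def apply_rules(triples):
    """Evaluate the three example rules for every subject in the KB."""
    labels = set()
    for x in {s for s, _, _ in triples}:
        # dateOfDeath(X, Y) and lessThan1(X, placeOfDeath) => incomplete(X, placeOfDeath)
        if object_count(triples, x, "dateOfDeath") >= 1 and object_count(triples, x, "placeOfDeath") < 1:
            labels.add(("incomplete", x, "placeOfDeath"))
        # movie(X) and producer(X, Z) => complete(X, director)
        if (x, "type", "movie") in triples and object_count(triples, x, "producer") >= 1:
            labels.add(("complete", x, "director"))
        # lessThan2(X, hasParent) => incomplete(X, hasParent)
        # As written on the slide the rule has no type guard, so it fires for
        # every subject with fewer than two parents, movies included.
        if object_count(triples, x, "hasParent") < 2:
            labels.add(("incomplete", x, "hasParent"))
    return labels


for label in sorted(apply_rules(TRIPLES)):
    print(label)
```

The last rule firing for `film1` shows why mined rules usually need class constraints in the body, such as person(X).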
  • 28. Rule mining: Implementation • We extended the AMIE association rule mining system with predicates on • Complete/incomplete: complete(X, director) • Object counts: lessThan2(X, hasParent) • Popularity: popular(X) • Negated classes: person(X) ∧ ¬adult(X) • Then mined rules with complete/incomplete in the head for 20 YAGO/Wikidata relations • Result: Can predict (in-)completeness with 46-100% F-score [Galarraga et al., WSDM 2017]
  • 29. Rule mining: Challenges • Consensus: human(x) ⇒ Complete(x, graduatedFrom); schoolteacher(x) ⇒ Incomplete(x, graduatedFrom); professor(x) ⇒ Complete(x, graduatedFrom); John ∈ (human, schoolteacher, professor) ⇒ Complete(John, graduatedFrom)? • Rare properties require very large training data • E.g., monks being complete for spouses • Annotated ~3000 rows at 10ct/row → 0 monks
  • 30. Outline – Assessing KB recall 1. Logical foundations 2. Rule mining 3. Information extraction 4. Data presence heuristic
  • 31. Information extraction: Idea • Text: “… Barack and Michelle have two children …” • KB contains 0 children → Recall: 0% • KB contains 1 child → Recall: 50% • KB contains 2 children → Recall: 100%
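The arithmetic behind the three cases is the ratio between the number of objects in the KB and the cardinality stated in the text. A minimal sketch; capping the value at 1.0 is an assumption for the case where the KB holds more objects than the text mentions.

```python
def recall_estimate(kb_count: int, text_cardinality: int) -> float:
    """Estimate recall as objects in the KB divided by the cardinality found in text."""
    if text_cardinality <= 0:
        raise ValueError("text_cardinality must be positive")
    # Assumption: cap at 1.0 if the KB lists more objects than the text states.
    return min(kb_count, text_cardinality) / text_cardinality


# "Barack and Michelle have two children" gives a text cardinality of 2.
for kb_count in (0, 1, 2):
    print(kb_count, f"{recall_estimate(kb_count, 2):.0%}")
```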
  • 32. Information extraction: Implementation • Developed a CRF-based classifier for identifying numbers that express relation cardinalities • Works for a variety of topics such as • Family relations: has 2 siblings • Geopolitics: is composed of seven boroughs • Artwork: consists of three episodes • Finds the existence of 178% more children than currently in Wikidata [Mirza et al., ISWC 2016 + ACL 2017]
  • 33. Information extraction: Challenges • Cardinalities are frequently expressed non-numerically: • Nouns: has twins, is a trilogy • Indefinite articles: They have a daughter • Negation/adjectives: Have no children/is childless • Often requires reasoning: Has 3 children from Ivana and one from Marla • Training (distant supervision) struggles with false positives • KBs used for training are themselves incomplete: for President Garfield, Wikidata knows only 4 out of 7 children
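The non-numeric cases listed above can be illustrated with a crude lexicon lookup. This is a deliberately naive stand-in, not the CRF-based classifier from the talk: the two lexicons are tiny, hypothetical, and easy to fool (any stray “a” counts as 1), and summing all number mentions is only the simplest form of the reasoning the slide calls for.

```python
import re

# Tiny illustrative lexicons; a real system would need far broader coverage.
NUMBER_WORDS = {"no": 0, "a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "seven": 7}
CARDINAL_NOUNS = {"twins": 2, "trilogy": 3, "childless": 0}


def cardinality_hint(phrase):
    """Return a rough cardinality for the phrase, or None if no cue is found."""
    tokens = re.findall(r"[a-z]+|\d+", phrase.lower())
    # Nouns and adjectives that carry a cardinality themselves take precedence.
    for token in tokens:
        if token in CARDINAL_NOUNS:
            return CARDINAL_NOUNS[token]
    # Otherwise sum every digit or number-word mention in the phrase.
    counts = [int(t) if t.isdigit() else NUMBER_WORDS[t]
              for t in tokens if t.isdigit() or t in NUMBER_WORDS]
    return sum(counts) if counts else None


for phrase in ["has twins", "is a trilogy", "They have a daughter",
               "Have no children", "Has 3 children from Ivana and one from Marla"]:
    print(phrase, "->", cardinality_hint(phrase))
```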
  • 34. Vision: Make IE recall-aware • Textual information extraction usually gives precision estimates • “John was born in Malmö, Sweden.” → citizenship(John, Sweden) – precision 95% • “John grew up in Malmö, Sweden.” → citizenship(John, Sweden) – precision 70% • Can we also produce recall estimates? • “John has a son, Tom, and a daughter, Susan.” → child(John, Tom), child(John, Susan) – recall 90% • “John brought his children Susan and Tom to school.” → child(John, Tom), child(John, Susan) – recall 30%
  • 35. Outline – Assessing KB recall 1. Logical foundations 2. Rule mining 3. Information extraction 4. Data presence heuristic
  • 36. Data presence heuristic: Idea • KB: dateOfBirth(John, 17.5.1983) • Q: dateOfBirth(John, 31.12.1999)? • A: Probably not • Single-value properties: • Having one value → Property is complete • Looking at the data alone suffices
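The heuristic for single-value properties can be sketched as follows. Which properties count as single-valued is an assumption supplied by hand here, and the answer strings are illustrative.

```python
# KB maps (subject, property) to the set of stored values.
KB = {("John", "dateOfBirth"): {"17.5.1983"}}

# Assumption: the set of single-value properties is given by hand.
SINGLE_VALUE_PROPERTIES = {"dateOfBirth"}


def answer_with_presence_heuristic(subject, prop, value, kb, single_value_props):
    """Use the presence of a value to reject other values for single-value properties."""
    known = kb.get((subject, prop), set())
    if value in known:
        return "yes"
    if prop in single_value_props and known:
        # A different value is already stored, so this one is probably wrong.
        return "probably not"
    return "maybe"


print(answer_with_presence_heuristic("John", "dateOfBirth", "31.12.1999", KB, SINGLE_VALUE_PROPERTIES))
```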
  • 37. What are single-value properties? Extreme case, but… • Multiple citizenships • More parents due to adoption • Several Twitter accounts due to presidency
  • 38. All hopes lost? • Presence of a value is better than nothing • Even better: For non-functional attributes, data is still frequently added in batches • All clubs Diego Maradona played for • All ministers of Merkel’s new cabinet • … • Checking data presence is a common heuristic among Wikidata editors
  • 39. Value presence heuristic - example [https://www.wikidata.org/wiki/Wikidata:Wikivoyage/Lists/Embassies]
  • 40. Data presence heuristic: Challenges 4.1: Which properties to look at? 4.2: How to quantify data presence?
  • 41. 4.1: Which properties to look at? (1/2) • Complete(Wikidata for Putin)? • There are more than 3000 properties one can assign to Putin… • Not all properties are relevant to everyone. (Think of goals scored or monastic order) • Are at least all relevant properties there? • What do you mean by relevant?
  • 42. 4.1: Which properties to look at? (2/2) • We used crowdsourcing to annotate 350 random (person, property1, property2) triples with human perception of interestingness • State-of-the-art approach gets 61% of high-agreement triples right • Mistakes frequency for interestingness • Our method, which also uses linguistic similarity, achieves 75% [Razniewski et al., ADMA 2017]
  • 43. 4.2: How to quantify data presence? We have values for 46 out of 77 relevant properties for Putin → Hard to interpret • Proposal: Quantify based on comparison with other similar entities • Ingredients: • Similarity metric: Who is similar to Trump? • Data quantification: How much data is good/bad? • Deployed on Wikidata, but evaluation difficult [Ahmeti et al., ESWC 2017]
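One way to make “46 out of 77” interpretable is to place it among the same ratio for similar entities. The sketch below is an illustration of that idea only and is not claimed to be the method of Ahmeti et al.; the peer ratios are invented, and finding the peers (the similarity metric) is assumed to be solved elsewhere.

```python
def presence_ratio(present: int, relevant: int) -> float:
    """Fraction of relevant properties that have at least one value."""
    if relevant <= 0:
        raise ValueError("relevant must be positive")
    return present / relevant


def relative_rank(ratio: float, peer_ratios: list) -> float:
    """Share of similar entities whose presence ratio is at most the given one."""
    if not peer_ratios:
        raise ValueError("need at least one peer")
    return sum(1 for r in peer_ratios if r <= ratio) / len(peer_ratios)


putin = presence_ratio(46, 77)            # figure from the slide
hypothetical_peers = [0.2, 0.4, 0.6, 0.8]  # made-up ratios of similar entities
print(f"ratio {putin:.2f}, at or above {relative_rank(putin, hypothetical_peers):.0%} of peers")
```

As the editor's notes point out, such a comparison says little when very few peers are clearly good or clearly bad.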
  • 46. Outline – Assessing KB recall 1. Logical foundations 2. Rule mining 3. Information extraction 4. Data presence heuristic 5. Summary
  • 47. Summary (1/3) • Increasing KB quality can to some extent be noticed downstream • Precision easy to evaluate • Recall largely unknown
  • 48. Summary (2/3) • Ideal is human-curated completeness information • Created in conjunction with data (COOL-WD tool) • Not really scalable • Automated alternatives: • Association rule mining • Information extraction • Looking at existence of data is a useful start
  • 49. Summary (3/3) • Recall-aware information extraction an open challenge • Concepts of relevance and relative completeness in KBs little understood to date • I look forward to fruitful collaborations with UdS, MPI-SWS and MPI-INF

Editor's Notes

  1. O-like letter - otto
  2. 350 man years to complete, estimate 1986
  3. Google launched 1998 (1995 other name)
  4. First Chinese, fourth Hindi
  5. Marx point: see what you are actually trying to approximate
  6. -> rule mining with constraints?
  7. Here multiple claims, but so when do we have all?
  8. Sl – sitelink yes or no, www yes or no, img yes or no Coordinate yes or no Phone yes or no
  9. What is good/bad: Problem could be that very few are good/bad
  10. Question: What are/how to find interesting facets?
  11. Much work on entity and fact ranking, little on predicate ranking