06/02/14 Dominik Wienand, Heiko Paulheim 1
Detecting Incorrect Numerical Data
in DBpedia
Dominik Wienand, Heiko Paulheim
Motivation
• Challenge
– Wikipedia is made for humans, not machines
– Input format in Wikipedia is not constrained
• The following are all valid representations of the same height value
(and perfectly understandable by humans)
– 6 ft 6 in, 6ft 6in, 6'6'', 6'6”, 6´6´´, …
– 1.98m, 1,98m, 1m 98, 1m 98cm, 198cm, 198 cm, …
– 6 ft 6 in (198 cm), 6ft 6in (1.98m), 6'6'' (1.98 m), …
– 6 ft 6 in[1], 6 ft 6 in [citation needed], …
– ...
Motivation
• Challenge
– We're (hopefully) slowly stepping out of the research labs
• e.g., applications in Emergency Management, Finance, …
– so we need reliable information
• i.e., DBpedia has to be able to deal with all of those variants
• But
– it is hard to cover each and every case
– if the case is rare, we may not even know it
• Idea
– A posteriori plausibility checking
– Find values that are likely to be wrongly extracted
Idea
• Use outlier detection to find unlikely values
– e.g., extremely large or small values
• Outlier Detection
– “An outlying observation, or outlier, is one that appears to deviate
markedly from other members of the sample in which it occurs.”
(Grubbs, 1969)
– Outliers are not necessarily wrong!
Approach
• Basic approach
– use all values of a numerical property (e.g., height) as a population
– find outliers in that population
• Outlier detection approaches used
– Median Absolute Deviation (Dispersion)
– Interquartile Range
– Kernel Density Estimation
– Kernel Density Estimation iterative
• i.e., remove found outliers and repeat
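A minimal sketch of two of the ideas listed above: an interquartile-range check and the "remove found outliers and repeat" loop. The fence factor k=1.5 is the conventional Tukey choice and an assumption here, not a value from the slides; the slides apply the iterative step to KDE, so pairing it with IQR below is only for illustration. The function names and sample values are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def iterative_outliers(values, detect, max_rounds=10):
    """Run `detect`, drop what it found, and repeat until nothing new turns up."""
    remaining = list(values)
    found = []
    for _ in range(max_rounds):
        if len(remaining) < 2:  # quantiles need at least two data points
            break
        new = detect(remaining)
        if not new:
            break
        found.extend(new)
        dropped = set(new)
        remaining = [v for v in remaining if v not in dropped]
    return found


# Made-up heights in metres; 19.8 mimics a misplaced decimal point.
heights = [1.60, 1.65, 1.70, 1.72, 1.75, 1.78, 1.80, 1.85, 1.90, 19.8]
print(iqr_outliers(heights))                      # [19.8]
print(iterative_outliers(heights, iqr_outliers))  # [19.8]
```

The iterative variant matters when one extreme value inflates the spread enough to hide milder outliers in the first pass.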
Median Absolute Deviation (MAD)
• MAD is the median of the absolute deviations from the sample median, i.e.
$\mathrm{MAD} := \operatorname{median}_i \left( \left| X_i - \operatorname{median}_j (X_j) \right| \right)$
• MAD can be used for outlier detection
– all values that are more than k*MAD away from the median
are considered to be outliers
– e.g., k=3
[Image: Carl Friedrich Gauss]
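The MAD rule can be sketched in a few lines using only the standard library. The function names and the sample values are hypothetical; k=3 follows the example on the slide.

```python
import statistics


def mad(values):
    """Median of the absolute deviations from the sample median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def mad_outliers(values, k=3.0):
    """Return the values that lie more than k*MAD away from the median."""
    med = statistics.median(values)
    threshold = k * mad(values)
    return [v for v in values if abs(v - med) > threshold]


# Made-up heights in metres; 19.8 mimics a misplaced decimal point.
heights = [1.60, 1.65, 1.70, 1.72, 1.75, 1.78, 1.80, 1.85, 1.90, 19.8]
print(mad_outliers(heights))  # [19.8]
```

Note that if more than half of the values are identical, MAD becomes 0 and every value different from the median is flagged, so a real implementation would probably need a guard for that case.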
Approach
• Observation
– some properties are used on a variety of different things
• Example: dbpedia-owl:height
– persons, vehicles
• Example: dbpedia-owl:population
– villages, cities, countries, continents
• Finding outliers in those mixed sets might be hard
– refined approach: preprocess data
– divide into subpopulations
Approach
• Preprocessing A: single type
– split by single type
– one data population per type (in the DBpedia ontology)
– only the most specific type is used
• Preprocessing B: cluster by type vectors
– each instance represented by vector of types
– cluster instances with similar type vectors
– EM algorithm (Weka)
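Preprocessing A can be sketched as follows: split the values by type, then run outlier detection inside each subpopulation. This is an illustration only; the record layout, the resource names, and the choice of a MAD detector with k=3 are assumptions, and the type labels merely stand in for most specific DBpedia ontology types.

```python
import statistics
from collections import defaultdict


def mad_outliers(values, k=3.0):
    """Values more than k*MAD away from the median (stand-in detector)."""
    med = statistics.median(values)
    spread = statistics.median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) > k * spread]


def outliers_by_type(records, detect):
    """records: iterable of (resource, most_specific_type, value) tuples."""
    groups = defaultdict(list)
    for resource, rdf_type, value in records:
        groups[rdf_type].append((resource, value))
    flagged = {}
    for rdf_type, members in groups.items():
        bad = set(detect([value for _, value in members]))
        flagged[rdf_type] = [res for res, value in members if value in bad]
    return flagged


# Hypothetical height values in metres.
records = [
    ("P1", "Person", 1.60), ("P2", "Person", 1.70), ("P3", "Person", 1.75),
    ("P4", "Person", 1.80), ("P5", "Person", 1.90), ("P6", "Person", 4.0),
    ("V1", "MeanOfTransportation", 3.8), ("V2", "MeanOfTransportation", 3.9),
    ("V3", "MeanOfTransportation", 4.0), ("V4", "MeanOfTransportation", 4.1),
    ("V5", "MeanOfTransportation", 4.2),
]
print(outliers_by_type(records, mad_outliers))
# {'Person': ['P6'], 'MeanOfTransportation': []}
```

The value 4.0 is flagged for a person but not for a vehicle, which is the point of splitting the population first.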
Evaluation
• Two-fold evaluation
– Pre-study on three attributes (height, population, elevation)
– Most promising approaches tested on random sample from DBpedia
• Evaluation strategy
– a posteriori evaluation
– each identified outlier is checked
– bulk checking possible due to obvious clusters/patterns in outliers
Evaluation: Pre-Study
• Sample sizes:
– height: 52,522
– population: 237,700
– elevation: 206,977
• Different distributions
– e.g., height: approximate normal distribution
– e.g., population: power law distribution
Evaluation: Pre-Study
• Grouping and clustering improve the precision
• IQR and KDE iterative are best
Evaluation: Random Sample
• Findings from Pre-Study:
– IQR and KDE iterative work well
– clustering is too slow (>24h)
– KDE FFT has poor precision
• Building a random sample
– select 50 random resources
– get all their datatype properties
– retrieve all triples that use those properties
– remove those properties for which fewer than 50% of the objects, or fewer than 100 objects, are numbers
• Resulting sample:
– 12,054,727 triples
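The last filtering step of the sample construction might look roughly like this. It is a sketch under assumptions: triples are plain (subject, property, object) tuples with objects given as strings or numbers, "number" means anything Python's float() accepts (which also lets through "nan" and "inf"), and the function names are hypothetical.

```python
from collections import defaultdict


def is_number(obj):
    """True if the object can be read as a float."""
    try:
        float(obj)
        return True
    except (TypeError, ValueError):
        return False


def keep_numeric_properties(triples, min_share=0.5, min_count=100):
    """Keep properties with at least min_count numeric objects that also
    make up at least min_share of all objects of that property."""
    objects_by_property = defaultdict(list)
    for _subject, prop, obj in triples:
        objects_by_property[prop].append(obj)
    kept = {}
    for prop, objects in objects_by_property.items():
        numbers = [float(o) for o in objects if is_number(o)]
        if len(numbers) >= min_count and len(numbers) / len(objects) >= min_share:
            kept[prop] = numbers
    return kept
```

The function returns a mapping from each surviving property to its numeric values, which is the population that the outlier detection would then run on.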
Evaluation: Random Sample
• Tried the best-performing configuration from the pre-study
– further explored parameter settings from there
– IQR provides best trade-off between runtime and quality
• Best configuration for IQR:
– 1,703 values marked as outliers
– manually checked
• shortcuts, e.g., all three-digit ZIP codes for places in the US
– Precision of outlier detection: 81%
• i.e., we identify roughly 1 wrong value in 1,000
Systematic Errors Found
• The footprint of an error in the extraction framework code
– here: imperial measures after 5' are truncated
→ observation: suspiciously many height values of 1.524m (=5')
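The arithmetic behind this footprint, as a small sketch: 1 ft = 0.3048 m and 1 in = 0.0254 m, so a height whose inches are dropped after 5' collapses to exactly 1.524 m. The helper name and the toy list of heights are hypothetical; counting repeated values is just one simple way such a spike could be spotted.

```python
from collections import Counter

FOOT_IN_METRES = 0.3048
INCH_IN_METRES = 0.0254


def imperial_to_metres(feet, inches=0):
    """Convert a feet-and-inches height to metres."""
    return feet * FOOT_IN_METRES + inches * INCH_IN_METRES


print(round(imperial_to_metres(5, 10), 3))  # 1.778  (5'10'' converted in full)
print(round(imperial_to_metres(5), 3))      # 1.524  (inches lost after 5')
print(round(imperial_to_metres(6, 6), 3))   # 1.981  (6 ft 6 in)

# A value that occurs far more often than its neighbours hints at a
# systematic extraction error rather than at real data.
heights = [1.524, 1.78, 1.524, 1.91, 1.524]
print(Counter(heights).most_common(1))      # [(1.524, 3)]
```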
Systematic Errors Found
• Imperial conversion is the most severe problem
– causes almost 90% of all outliers in our sample
Systematic Errors Found
• Additional numbers cause problems
• Example: village of Semaphore
– population: 28,322,006
(all of Australia: 23,379,555!)
– a clear outlier among villages
Limitations
• Telling natural outliers from errors
– hard without additional evidence
• e.g., an adult person 58 cm tall
• e.g., a 7.4 m high vehicle
Beyond DBpedia
• So far, this work has been carried out only on DBpedia
• But it can be transferred to any LOD dataset
• Particularly useful for crowdsourced/heuristic approaches
– Information extraction from text (e.g., NELL, ReVerb)
– Automatic fact completion
– Datasets heuristically integrated from diverse sources
Ongoing Work
• Handling natural outliers
– cross checking with other sources
– for DBpedia: other language editions
• Preprocessing techniques
– e.g., dynamically building a tree of meaningful subpopulations
• Pinpointing errors
– text pattern induction on outliers found
– e.g., [0-9.,]* ([0-9]{4}) (years in parentheses cause problems)
– could also help identifying natural outliers
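A sketch of how such an induced pattern could be applied to raw values. It assumes the parentheses in the pattern above are meant literally, so they are escaped in Python's regex syntax; the function name and the input strings are made up for illustration.

```python
import re

# A number, a space, then a four-digit year in literal parentheses.
YEAR_IN_PARENTHESES = re.compile(r"[0-9.,]* \([0-9]{4}\)")


def has_year_in_parentheses(raw_value):
    """True if the raw text contains a parenthesised four-digit number."""
    return YEAR_IN_PARENTHESES.search(raw_value) is not None


print(has_year_in_parentheses("28,322 (2006)"))       # True
print(has_year_in_parentheses("198 cm"))              # False
print(has_year_in_parentheses("6 ft 6 in (198 cm)"))  # False
```

Values matching the pattern could then be routed to manual review or to a stricter parser instead of being extracted as one number.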
Questions?
http://xkcd.com/539/
06/02/14 Dominik Wienand, Heiko Paulheim 24
Detecting Incorrect Numerical Data
in DBpedia
Dominik Wienand, Heiko Paulheim

Mais conteúdo relacionado

Mais procurados

Open access for business phd students
Open access for business phd studentsOpen access for business phd students
Open access for business phd studentsRosemary Russell
 
MSc DEMM Oct 2013 Finding Research Evidence
MSc DEMM Oct 2013 Finding Research EvidenceMSc DEMM Oct 2013 Finding Research Evidence
MSc DEMM Oct 2013 Finding Research EvidenceEISLibrarian
 
The undergrads are coming, the undergrads are coming! Managing large-scale st...
The undergrads are coming, the undergrads are coming! Managing large-scale st...The undergrads are coming, the undergrads are coming! Managing large-scale st...
The undergrads are coming, the undergrads are coming! Managing large-scale st...Rebecca Goldman
 
UKSG Conference 2016 Breakout Session - Measuring the research impact of digi...
UKSG Conference 2016 Breakout Session - Measuring the research impact of digi...UKSG Conference 2016 Breakout Session - Measuring the research impact of digi...
UKSG Conference 2016 Breakout Session - Measuring the research impact of digi...UKSG: connecting the knowledge community
 
Beyond the Pale: grey literature as a method of publication
Beyond the Pale: grey literature as a method of publicationBeyond the Pale: grey literature as a method of publication
Beyond the Pale: grey literature as a method of publicationariadnenetwork
 
Beyond the Pay Wall!: Repositories as sources of supply
Beyond the Pay Wall!: Repositories as sources of supplyBeyond the Pay Wall!: Repositories as sources of supply
Beyond the Pay Wall!: Repositories as sources of supplyGaz Johnson
 
Library Services & Finding Information for M.Sc (DL) Students
Library Services & Finding Information for M.Sc (DL) StudentsLibrary Services & Finding Information for M.Sc (DL) Students
Library Services & Finding Information for M.Sc (DL) StudentsGaz Johnson
 

Mais procurados (8)

Open access for business phd students
Open access for business phd studentsOpen access for business phd students
Open access for business phd students
 
MSc DEMM Oct 2013 Finding Research Evidence
MSc DEMM Oct 2013 Finding Research EvidenceMSc DEMM Oct 2013 Finding Research Evidence
MSc DEMM Oct 2013 Finding Research Evidence
 
The undergrads are coming, the undergrads are coming! Managing large-scale st...
The undergrads are coming, the undergrads are coming! Managing large-scale st...The undergrads are coming, the undergrads are coming! Managing large-scale st...
The undergrads are coming, the undergrads are coming! Managing large-scale st...
 
UKSG Conference 2016 Breakout Session - Measuring the research impact of digi...
UKSG Conference 2016 Breakout Session - Measuring the research impact of digi...UKSG Conference 2016 Breakout Session - Measuring the research impact of digi...
UKSG Conference 2016 Breakout Session - Measuring the research impact of digi...
 
Beyond the Pale: grey literature as a method of publication
Beyond the Pale: grey literature as a method of publicationBeyond the Pale: grey literature as a method of publication
Beyond the Pale: grey literature as a method of publication
 
PSY1020 Better than Google (2021)
PSY1020 Better than Google (2021)PSY1020 Better than Google (2021)
PSY1020 Better than Google (2021)
 
Beyond the Pay Wall!: Repositories as sources of supply
Beyond the Pay Wall!: Repositories as sources of supplyBeyond the Pay Wall!: Repositories as sources of supply
Beyond the Pay Wall!: Repositories as sources of supply
 
Library Services & Finding Information for M.Sc (DL) Students
Library Services & Finding Information for M.Sc (DL) StudentsLibrary Services & Finding Information for M.Sc (DL) Students
Library Services & Finding Information for M.Sc (DL) Students
 

Semelhante a Detecting Incorrect Numerical Data in DBpedia

lecture1.ppt
lecture1.pptlecture1.ppt
lecture1.pptbayhehua
 
Bim based process mining master thesis presentation
Bim based process mining master thesis presentation Bim based process mining master thesis presentation
Bim based process mining master thesis presentation Stijn van Schaijk
 
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...Heiko Paulheim
 
Born print, reborn digital - the Hoppenstedt Data Archive
Born print, reborn digital - the Hoppenstedt Data ArchiveBorn print, reborn digital - the Hoppenstedt Data Archive
Born print, reborn digital - the Hoppenstedt Data ArchiveSebastian Weindel
 
rsec2a-2016-jheaton-morning
rsec2a-2016-jheaton-morningrsec2a-2016-jheaton-morning
rsec2a-2016-jheaton-morningJeff Heaton
 
Systems and Services: Adding Value For Research Data Assets
Systems and Services: Adding Value For Research Data AssetsSystems and Services: Adding Value For Research Data Assets
Systems and Services: Adding Value For Research Data AssetsLIBER Europe
 
How is research conducted in my field
How is research conducted in my fieldHow is research conducted in my field
How is research conducted in my fieldCristian Klein
 
Introduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's GuideIntroduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's Guidegokulprasath06
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptxXanGwaps
 
Data science workflow v1.1
Data science workflow v1.1Data science workflow v1.1
Data science workflow v1.1Jessie_N
 
Module 5 - Data Science Methodology.pdf
Module 5 - Data Science Methodology.pdfModule 5 - Data Science Methodology.pdf
Module 5 - Data Science Methodology.pdffathiah5
 
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]ssuser23e4f31
 
Bda life cycle slideshare
Bda life cycle   slideshareBda life cycle   slideshare
Bda life cycle slideshareSathyaseelanK1
 
H2O intro at Dallas Meetup
H2O intro at Dallas MeetupH2O intro at Dallas Meetup
H2O intro at Dallas MeetupSri Ambati
 

Semelhante a Detecting Incorrect Numerical Data in DBpedia (20)

lecture1.ppt
lecture1.pptlecture1.ppt
lecture1.ppt
 
Lecture1
Lecture1Lecture1
Lecture1
 
Bim based process mining master thesis presentation
Bim based process mining master thesis presentation Bim based process mining master thesis presentation
Bim based process mining master thesis presentation
 
Daming
DamingDaming
Daming
 
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...
 
Born print, reborn digital - the Hoppenstedt Data Archive
Born print, reborn digital - the Hoppenstedt Data ArchiveBorn print, reborn digital - the Hoppenstedt Data Archive
Born print, reborn digital - the Hoppenstedt Data Archive
 
data mining
data miningdata mining
data mining
 
Part1
Part1Part1
Part1
 
DBMS
DBMSDBMS
DBMS
 
rsec2a-2016-jheaton-morning
rsec2a-2016-jheaton-morningrsec2a-2016-jheaton-morning
rsec2a-2016-jheaton-morning
 
Systems and Services: Adding Value For Research Data Assets
Systems and Services: Adding Value For Research Data AssetsSystems and Services: Adding Value For Research Data Assets
Systems and Services: Adding Value For Research Data Assets
 
How is research conducted in my field
How is research conducted in my fieldHow is research conducted in my field
How is research conducted in my field
 
Introduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's GuideIntroduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's Guide
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
 
Future se oct15
Future se oct15Future se oct15
Future se oct15
 
Data science workflow v1.1
Data science workflow v1.1Data science workflow v1.1
Data science workflow v1.1
 
Module 5 - Data Science Methodology.pdf
Module 5 - Data Science Methodology.pdfModule 5 - Data Science Methodology.pdf
Module 5 - Data Science Methodology.pdf
 
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
 
Bda life cycle slideshare
Bda life cycle   slideshareBda life cycle   slideshare
Bda life cycle slideshare
 
H2O intro at Dallas Meetup
H2O intro at Dallas MeetupH2O intro at Dallas Meetup
H2O intro at Dallas Meetup
 

Mais de Heiko Paulheim

Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...Heiko Paulheim
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfHeiko Paulheim
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vecHeiko Paulheim
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vecHeiko Paulheim
 
Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI SystemsKnowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI SystemsHeiko Paulheim
 
From Wikis to Knowledge Graphs
From Wikis to Knowledge GraphsFrom Wikis to Knowledge Graphs
From Wikis to Knowledge GraphsHeiko Paulheim
 
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Heiko Paulheim
 
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
Beyond DBpedia and YAGO – The New Kids  on the Knowledge Graph BlockBeyond DBpedia and YAGO – The New Kids  on the Knowledge Graph Block
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph BlockHeiko Paulheim
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Heiko Paulheim
 
Machine Learning & Embeddings for Large Knowledge Graphs
Machine Learning & Embeddings  for Large Knowledge GraphsMachine Learning & Embeddings  for Large Knowledge Graphs
Machine Learning & Embeddings for Large Knowledge GraphsHeiko Paulheim
 
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge GraphFrom Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge GraphHeiko Paulheim
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Heiko Paulheim
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Heiko Paulheim
 
Machine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge GraphsMachine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge GraphsHeiko Paulheim
 
Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterHeiko Paulheim
 
Towards Knowledge Graph Profiling
Towards Knowledge Graph ProfilingTowards Knowledge Graph Profiling
Towards Knowledge Graph ProfilingHeiko Paulheim
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesHeiko Paulheim
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataHeiko Paulheim
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge DiscoveryHeiko Paulheim
 

Mais de Heiko Paulheim (20)

Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdf
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vec
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vec
 
Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI SystemsKnowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
 
From Wikis to Knowledge Graphs
From Wikis to Knowledge GraphsFrom Wikis to Knowledge Graphs
From Wikis to Knowledge Graphs
 
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
 
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
Beyond DBpedia and YAGO – The New Kids  on the Knowledge Graph BlockBeyond DBpedia and YAGO – The New Kids  on the Knowledge Graph Block
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
 
Machine Learning & Embeddings for Large Knowledge Graphs
Machine Learning & Embeddings  for Large Knowledge GraphsMachine Learning & Embeddings  for Large Knowledge Graphs
Machine Learning & Embeddings for Large Knowledge Graphs
 
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge GraphFrom Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!
 
How much is a Triple?
How much is a Triple?How much is a Triple?
How much is a Triple?
 
Machine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge GraphsMachine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge Graphs
 
Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on Twitter
 
Towards Knowledge Graph Profiling
Towards Knowledge Graph ProfilingTowards Knowledge Graph Profiling
Towards Knowledge Graph Profiling
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia Entities
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open Data
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge Discovery
 

Último

Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 

Último (20)

Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 

Detecting Incorrect Numerical Data in DBpedia

  • 1. 06/02/14 Dominik Wienand, Heiko Paulheim 1 Detecting Incorrect Numerical Data in DBpedia Dominik Wienand, Heiko Paulheim
  • 2.
  • 3. Motivation
      • Challenge
        – Wikipedia is made for humans, not machines
        – Input format in Wikipedia is not constrained
          • The following are all valid representations of the same height value (and perfectly understandable by humans)
            – 6 ft 6 in, 6ft 6in, 6'6'', 6'6”, 6´6´´, …
            – 1.98m, 1,98m, 1m 98, 1m 98cm, 198cm, 198 cm, …
            – 6 ft 6 in (198 cm), 6ft 6in (1.98m), 6'6'' (1.98 m), …
            – 6 ft 6 in[1], 6 ft 6 in [citation needed], …
            – ...
  • 4. Motivation
      • Challenge
        – We're (hopefully) slowly stepping out of the research labs
          • e.g., applications in Emergency Management, Finance, …
        – so we need reliable information
          • i.e., DBpedia has to be able to deal with all of those variants
      • But
        – it is hard to cover each and every case
        – if the case is rare, we may not even know it
      • Idea
        – A posteriori plausibility checking
        – Find values that are likely to be wrongly extracted
  • 5. Idea
      • Use outlier detection to find unlikely values
        – e.g., extremely large or small values
      • Outlier Detection
        – “An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.” (Grubbs, 1969)
        – Outliers are not necessarily wrong!
  • 6. Approach
      • Basic approach
        – use all values of a numerical property (e.g., height) as a population
        – find outliers in that population
      • Outlier detection approaches used
        – Median Absolute Deviation (Dispersion)
        – Interquartile Range
        – Kernel Density Estimation
        – Kernel Density Estimation iterative
          • i.e., remove found outliers and repeat
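
The transcript does not spell out the interquartile range rule the authors used. As a minimal sketch, assuming the common formulation in which values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged, it could look like the following. The function name `iqr_outliers`, the `factor` default, and the sample numbers are illustrative assumptions, not taken from the slides.

```python
from statistics import quantiles


def iqr_outliers(values, factor=1.5):
    """Return values outside [Q1 - factor*IQR, Q3 + factor*IQR].

    factor=1.5 is a conventional default, assumed here; the slides do not
    state which setting was used.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [x for x in values if x < low or x > high]


# Made-up population of one numerical property.
sample = [10, 12, 13, 14, 15, 16, 18, 500]
flagged = iqr_outliers(sample)
print(flagged)
```

An iterative variant, in the spirit of the "remove found outliers and repeat" bullet, would call such a function again on the remaining values until nothing new is flagged.
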
  • 7. Median Absolute Deviation (MAD)
      • MAD is the median absolute deviation from the median of a sample, i.e.
        $\mathrm{MAD} := \operatorname{median}_i \left| X_i - \operatorname{median}_j(X_j) \right|$
      • MAD can be used for outlier detection
        – all values that are more than k*MAD away from the median are considered to be outliers
        – e.g., k=3
      [Image: Carl Friedrich Gauss]
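
The MAD rule on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `mad_outliers` and the sample heights are made up, and k=3 follows the example value on the slide.

```python
from statistics import median


def mad_outliers(values, k=3.0):
    """Return the values lying more than k * MAD away from the median."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    return [x for x in values if abs(x - med) > k * mad]


# Made-up heights in metres; 198.0 mimics a centimetre value read as metres.
heights_m = [1.70, 1.75, 1.80, 1.82, 1.85, 1.90, 198.0]
flagged = mad_outliers(heights_m)
print(flagged)
```

Note that if more than half of the values are identical, MAD is 0 and every value that differs from the median is flagged, so a real pipeline would likely need a guard for that case.
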
  • 8.
  • 9.
  • 10. Approach
      • Observation
        – some properties are used on a variety of different things
      • Example: dbpedia-owl:height
        – persons, vehicles
      • Example: dbpedia-owl:population
        – villages, cities, countries, continents
      • Finding outliers in those mixed sets might be hard
        – refined approach: preprocess data
        – divide into subpopulations
  • 11. Approach
      • Preprocessing A: single type
        – split by single type
        – one data population per type (in the DBpedia ontology)
        – only the most specific type is used
      • Preprocessing B: cluster by type vectors
        – each instance represented by vector of types
        – cluster instances with similar type vectors
        – EM algorithm (Weka)
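
Preprocessing A can be sketched as grouping values by type and running the outlier test once per group. The sketch below assumes each record already carries its most specific type, uses the MAD rule as the per-group detector purely for brevity, and works on made-up records; the name `outliers_by_type` and the record layout are hypothetical.

```python
from collections import defaultdict
from statistics import median


def outliers_by_type(records, k=3.0):
    """records: iterable of (resource, most_specific_type, value) tuples.

    Builds one population per type and flags values more than k * MAD
    away from that population's median.
    """
    groups = defaultdict(list)
    for resource, rdf_type, value in records:
        groups[rdf_type].append((resource, value))

    flagged = []
    for rdf_type, members in groups.items():
        values = [v for _, v in members]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        flagged += [(r, rdf_type, v) for r, v in members if abs(v - med) > k * mad]
    return flagged


# Made-up height values in metres for two types sharing one property.
records = [
    ("person_a", "Person", 1.70), ("person_b", "Person", 1.75),
    ("person_c", "Person", 1.80), ("person_d", "Person", 1.85),
    ("person_e", "Person", 18.0),  # hypothetical extraction error
    ("vehicle_a", "Vehicle", 3.0), ("vehicle_b", "Vehicle", 3.2),
    ("vehicle_c", "Vehicle", 3.4), ("vehicle_d", "Vehicle", 3.6),
    ("vehicle_e", "Vehicle", 3.8),
]
flagged = outliers_by_type(records)
print(flagged)
```

Splitting first means vehicle heights are judged only against other vehicles, so they are not flagged merely for being taller than persons.
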
  • 12. Evaluation
      • Two-fold evaluation
        – Pre-study on three attributes (height, population, elevation)
        – Most promising approaches tested on random sample from DBpedia
      • Evaluation strategy
        – a posteriori evaluation
        – each identified outlier is checked
        – bulk checking possible due to obvious clusters/patterns in outliers
  • 13. Evaluation: Pre-Study
      • Sample sizes:
        – height: 52,522
        – population: 237,700
        – elevation: 206,977
      • Different distributions
        – e.g., height: approximate normal distribution
        – e.g., population: power law distribution
  • 14. Evaluation: Pre-Study
      • Grouping and clustering improves the precision
      • IQR and KDE iterative are best
  • 15. Evaluation: Random Sample
      • Findings from Pre-Study:
        – IQR and KDE iterative work well
        – clustering is too slow (>24h)
        – KDE FFT has poor precision
      • Building a random sample
        – select 50 random resources
        – get all their datatype properties
        – retrieve all triples that use those properties
        – remove those properties that have <50% or <100 numbers as objects
      • Resulting sample:
        – 12,054,727 triples
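
The last filtering step of the sample construction (drop properties with fewer than 50% or fewer than 100 numerical objects) can be sketched as a simple predicate. How the authors decided whether an object counts as a number is not stated; the sketch assumes objects have already been parsed into Python values, and the name `keep_property` and its parameters are hypothetical.

```python
def keep_property(objects, min_share=0.5, min_count=100):
    """Keep a property only if it has at least min_count numerical objects
    and they make up at least min_share of all its objects."""
    numeric = sum(
        isinstance(o, (int, float)) and not isinstance(o, bool) for o in objects
    )
    # The count check runs first, so an empty list never reaches the division.
    return numeric >= min_count and numeric / len(objects) >= min_share


# Made-up object lists for three hypothetical properties.
mostly_numeric = [1.0] * 150 + ["n/a"] * 50    # 150 numbers, 75% share
too_few = [1] * 99                             # 100% share, but only 99 numbers
mostly_text = [1] * 100 + ["text"] * 200       # 100 numbers, about 33% share

print(keep_property(mostly_numeric), keep_property(too_few), keep_property(mostly_text))
```
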
  • 16. Evaluation: Random Sample
      • Tried best performing configuration in pre-study
        – further explored parameter settings from there
        – IQR provides best trade-off between runtime and quality
      • Best configuration for IQR:
        – 1,703 values marked as outliers
        – manually checked
          • shortcuts, e.g., all three digit ZIP codes for places in the US
        – Precision of outlier detection: 81%
          • i.e., we identify roughly 1 wrong value in 1,000
  • 17. Systematic Errors Found
      • The footprint of an error in the extraction framework code
        – here: imperial measures after 5' are truncated
          → observation: suspiciously many height values of 1.524m (=5')
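
The arithmetic behind the 1.524 m footprint is easy to check: 5 ft times 0.3048 m/ft is 1.524 m, so any height whose inches are dropped after 5' collapses onto that value. The sketch below illustrates this and how a frequency count makes such a spike visible. It is not the extraction framework's code; the helper name and the height list are made up.

```python
from collections import Counter

FT_TO_M = 0.3048  # metres per foot
IN_TO_M = 0.0254  # metres per inch


def feet_inches_to_m(feet, inches=0):
    """Convert a feet/inches height to metres."""
    return feet * FT_TO_M + inches * IN_TO_M


correct = round(feet_inches_to_m(5, 10), 3)  # 5'10'' converted in full
truncated = round(feet_inches_to_m(5), 3)    # inches lost: only 5' survives
print(correct, truncated)

# Made-up extracted heights; a truncation bug shows up as one repeated value.
heights_m = [1.524, 1.78, 1.524, 1.65, 1.524, 1.91]
value, count = Counter(heights_m).most_common(1)[0]
print(value, count)
```
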
  • 18. Systematic Errors Found
      • Imperial conversion is the most severe problem
        – causes almost 90% of all outliers in our sample
  • 19. Systematic Errors Found
      • Additional numbers cause problems
      • Example: village of Semaphore
        – population: 28,322,006 (all of Australia: 23,379,555!)
        – a clear outlier among villages
  • 20. Limitations
      • Telling natural outliers from errors
        – hard without additional evidence
          • e.g., an adult person 58 cm tall
          • e.g., a 7.4 m high vehicle
  • 21. Beyond DBpedia
      • So far, this work has been carried out only on DBpedia
      • But can be transferred to any LOD dataset
      • Particularly useful for crowdsourced/heuristic approaches
        – Information extraction from text (e.g., NELL, ReVerb)
        – Automatic fact completion
        – Datasets heuristically integrated from diverse sources
  • 22. Ongoing Work
      • Handling natural outliers
        – cross checking with other sources
        – for DBpedia: other language editions
      • Preprocessing techniques
        – e.g., dynamically building a tree of meaningful subpopulations
      • Pinpointing errors
        – text pattern induction on outliers found
        – e.g., [0-9.,]* ([0-9]{4}) (years in parentheses cause problems)
        – could also help identifying natural outliers
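
The pattern on the Ongoing Work slide can be tried out with Python's `re` module. The sketch assumes the parentheses around the year are meant literally (the slide mentions "years in parentheses"), so they are escaped here and an inner group captures the year; that reading, the helper name `trailing_year`, and the sample strings are assumptions.

```python
import re

# Slide pattern "[0-9.,]* ([0-9]{4})", with the parentheses treated as literal
# characters and an extra group around the four-digit year.
YEAR_IN_PARENS = re.compile(r"[0-9.,]* \(([0-9]{4})\)")


def trailing_year(raw):
    """Return the parenthesised year if raw looks like 'number (YYYY)', else None."""
    match = YEAR_IN_PARENS.fullmatch(raw.strip())
    return match.group(1) if match else None


# Made-up raw values as they might appear in source text.
print(trailing_year("1,234 (2006)"))
print(trailing_year("1,234"))
```

Source strings matching such a pattern could then be inspected to see whether the trailing year ended up inside the extracted number.
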
  • 23. Questions? http://xkcd.com/539/
  • 24. Detecting Incorrect Numerical Data in DBpedia. Dominik Wienand, Heiko Paulheim