Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

WWW.LEDS-PROJEKT.DE
LEDS
KNOWLEDGE EXTRACTION
FROM HETEROGENEOUS
SEMI-STRUCTURED DATA SOURCES
MARTIN SEIDEL, MICHAEL KRUG, FRANK BURIAN, MARTIN GAEDKE
12. September 2016

CURRENT SITUATION
• knowledge in the Web often only available as weakly
interlinked, heterogeneous, semi-structured data
→ no semantic classification
• how to link or merge data?
• how to do semantic queries?
→ not usable in a meaningful way

GOAL
Extraction of knowledge from semi-structured data
• knowledge in terms of semantic metadata
• semantically enriched data can then utilize
the potential of Linked Data
→ provide an automatic process

THE KESEDA APPROACH
• Especially designed to work on JSON data
• Challenges when working with JSON data
→ no schema, only name-value pairs
→ any structure and depth possible

{
"id": "krug",
"firstName": "Michael",
"lastName": "Krug",
"title": "Dipl.-Inf.",
"phone": "+49 371 531 39929",
"email": "michael.krug@informatik.tu-chemnitz.de",
[...]
}

{
"id": "2015-007",
"title": "SmartComposition: ...",
"author": [ "Michael Krug", "Martin Gaedke"],
"year": "2015",
"type": "Conference Paper",
"event": {
"name": "24th International World Wide Web Conference",
"url": "http://www.www2015.it/"
},
[...]
}
Slide callouts: Arrays ("author"), Objects ("event")

• multi-step algorithm
• work in existing JSON structure
• find and store various matches with different weights
• use additional information sources like API descriptions
• assign classes to objects with multiple properties
• link detected entities

1. Differentiation of input sources / formats
2. Preparation of data structure
3. Analysis of property labels
4. Analysis of property values
5. Mapping of classes
6. Generate JSON-LD document
7. Evaluation of results

PROTOTYPE
• prototype implemented in Node.js
• working with properties and classes from:
  • schema.org
  • FOAF
  • Dublin Core
  • GoodRelations
  • Music Ontology
• dictionaries for: first & last names, cities, streets, languages
• list of manually curated synonyms
• option to provide pre-defined mappings

PROTOTYPE
• Web interface for
  • pre-configuration
    • mappings, synonyms, dictionaries
  • data upload
  • result analysis
    • statistics and browsing

PROTOTYPE: CONFIGURATION

PROTOTYPE: RESULTS

EVALUATION
Algorithm applied to datasets of
1) JSON array of people
2) JSON array of publications
a) Without custom pre-configuration
b) With custom pre-configuration

EVALUATION
Initial Setup
• dictionary and structure pattern matching
• label → predicate string matching
• classes and properties: schema.org, FOAF, Dublin Core, GoodRelations
Custom Pre-Configuration
• set of label → predicate mappings (hand-picked for data context)
• list of known synonyms
• more structure patterns

1A) PEOPLE W/O CONFIG

1B) PEOPLE W/ CONFIG

2A) PUBLICATIONS W/O CONFIG

2B) PUBLICATIONS W/ CONFIG

SUMMARY
➙ Approach for extracting knowledge from semi-structured data
➙ by applying a multi-step algorithm
➙ to convert JSON data to RDF
➙ that assigns known classes to objects and maps
their properties to S-P-O triples

OPEN CHALLENGES
• detect and reuse JSON structure patterns
• disambiguate values
• apply quality control to results
• improve scalability for large datasets
• research application of machine learning

THANK YOU!
MICHAEL.KRUG@INFORMATIK.TU-CHEMNITZ.DE
VSR.INFORMATIK.TU-CHEMNITZ.DE
WWW.LEDS-PROJEKT.DE

1. Differentiation of input sources / formats
• text, file, URL, API
• check for format
• optional conversion of XML to JSON
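
The format check and the optional XML conversion could look roughly like the sketch below. It assumes the input has already been read into a string; the function names `detect_format` and `xml_to_json` are illustrative and not taken from the prototype, and the conversion deliberately ignores XML attributes.

```python
import json
import xml.etree.ElementTree as ET


def detect_format(text):
    """Return 'json', 'xml', or 'unknown' for a raw input string."""
    try:
        json.loads(text)
        return "json"
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass
    try:
        ET.fromstring(text)
        return "xml"
    except ET.ParseError:
        return "unknown"


def xml_to_json(element):
    """Naive conversion (assumption): leaf elements become strings,
    repeated child tags become lists, attributes are ignored."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = xml_to_json(child)
        if child.tag not in result:
            result[child.tag] = value
        elif isinstance(result[child.tag], list):
            result[child.tag].append(value)
        else:
            # Second occurrence of a tag: turn the entry into a list.
            result[child.tag] = [result[child.tag], value]
    return result
```

Trying JSON first keeps the common case cheap; anything that parses as neither format is reported as unknown rather than guessed.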

2. Preparation of data structure
• pre-process JSON tree to store matches and mappings
• keep original structure to preserve hierarchy for later
relations
• detect arrays and objects for separate processing
• clean up: remove empty entries
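
One possible way to prepare the tree is sketched below: every JSON value is wrapped in a node that keeps its position in the hierarchy and has room for matches. The node layout (`kind`, `matches`, `children`, `items`) and the definition of "empty" are assumptions made for illustration.

```python
def is_empty(value):
    """Treat null, empty strings, empty arrays and empty objects as empty."""
    return value is None or value == "" or value == [] or value == {}


def prepare(value):
    """Wrap a parsed JSON value in an annotated node, dropping empty entries.

    The original nesting is preserved so that later steps can derive
    relations between entities from it.
    """
    if isinstance(value, dict):
        children = {k: prepare(v) for k, v in value.items() if not is_empty(v)}
        return {"kind": "object", "matches": [], "children": children}
    if isinstance(value, list):
        items = [prepare(v) for v in value if not is_empty(v)]
        return {"kind": "array", "matches": [], "items": items}
    return {"kind": "value", "matches": [], "value": value}
```

Arrays and objects get distinct `kind` tags so that they can be processed separately in later steps.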

3. Analysis of property labels
• string matching (substrings, prefixes, …)
• synonyms
• pre-defined mappings
• use metadata from API description, if available
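
The label analysis might be sketched as follows. The vocabulary excerpt, the synonym list, and in particular the weight values are assumptions chosen for illustration; the slides do not state the weights used by the prototype.

```python
# Hypothetical, abbreviated vocabulary (schema.org property names).
PREDICATES = ["givenName", "familyName", "email", "telephone", "name"]
# Hypothetical synonym list, keyed by lower-cased label.
SYNONYMS = {"firstname": "givenName", "lastname": "familyName", "phone": "telephone"}
# Assumed weight per matching component.
WEIGHTS = {"mapping": 1.0, "exact": 0.9, "synonym": 0.8, "substring": 0.5}


def match_label(label, mappings=None):
    """Return candidate (predicate, weight) pairs for a label, best first."""
    key = label.lower()
    best = {}

    def record(predicate, component):
        # Keep only the strongest evidence per predicate.
        best[predicate] = max(best.get(predicate, 0.0), WEIGHTS[component])

    if mappings and label in mappings:
        record(mappings[label], "mapping")
    if key in SYNONYMS:
        record(SYNONYMS[key], "synonym")
    for predicate in PREDICATES:
        candidate = predicate.lower()
        if key == candidate:
            record(predicate, "exact")
        elif key in candidate or candidate in key:
            record(predicate, "substring")
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

For the label `firstName` this yields `givenName` via the synonym list ahead of the weaker substring match on `name`; all candidates are kept so that the class mapping step can still choose between them.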

4. Analysis of property values
• dictionaries
• structure patterns (uri, date, address, color…)
• data types (date, time, number, boolean…)
• value-based matches receive a lower weight
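
A minimal sketch of the value analysis, combining a dictionary lookup, structure patterns, and data type checks. The regular expressions and the tiny name dictionary are stand-ins invented for this example, not the prototype's actual resources.

```python
import re

# Stand-in dictionary; the prototype uses full dictionaries of names, cities, etc.
FIRST_NAMES = {"michael", "martin", "frank"}

# Simplified structure patterns (assumptions, not exhaustive validators).
PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "year": re.compile(r"(1[89]|20)\d{2}"),
}


def analyze_value(value):
    """Return a list of hints describing what a property value looks like."""
    hints = []
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        hints.append("boolean")
    elif isinstance(value, (int, float)):
        hints.append("number")
    elif isinstance(value, str):
        for name, pattern in PATTERNS.items():
            if pattern.fullmatch(value):
                hints.append(name)
        if value.lower() in FIRST_NAMES:
            hints.append("firstName")
    return hints
```

Hints of this kind are weaker evidence than a matching label, which is why they would enter the later scoring with a lower weight.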

5. Mapping of classes
• find class by number of matched properties
• select match that is most appropriate for chosen class
• take different weights into account
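
The class mapping could be approximated by a weighted vote, as in the sketch below. The property sets per class are abbreviated excerpts of schema.org chosen for illustration, and summing the best weight per label is an assumed scoring rule, not necessarily the one the prototype applies.

```python
# Abbreviated, illustrative property sets per schema.org class.
CLASS_PROPERTIES = {
    "Person": {"givenName", "familyName", "email", "telephone", "honorificPrefix"},
    "ScholarlyArticle": {"name", "author", "datePublished"},
}


def choose_class(candidates):
    """Pick the class whose properties collect the highest total weight.

    candidates maps each JSON label to a list of (predicate, weight) pairs.
    """
    scores = {}
    for cls, props in CLASS_PROPERTIES.items():
        score = 0.0
        for matches in candidates.values():
            weights = [w for p, w in matches if p in props]
            if weights:
                score += max(weights)
        scores[cls] = score
    return max(scores, key=scores.get), scores


def select_predicates(candidates, cls):
    """For each label, keep the best-weighted predicate that fits the class."""
    props = CLASS_PROPERTIES[cls]
    chosen = {}
    for label, matches in candidates.items():
        fitting = [(w, p) for p, w in matches if p in props]
        if fitting:
            chosen[label] = max(fitting)[1]
    return chosen
```

Selecting predicates only after the class is fixed is what lets a weaker match win when it is the one that fits the chosen class.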

6. Generate JSON-LD document
• use matches and mappings
• link entities depending on JSON tree structure
• validation of output
• optional conversion to various RDF formats
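
Generating the JSON-LD document from the chosen mappings might look like the following sketch. The `SPEC` structure is a hypothetical representation of the results of the earlier steps, and using `recordedAt` to link the publication to its event is purely an illustrative choice. Validation and conversion to other RDF formats are left out.

```python
# Hypothetical result of the mapping steps for the publication example.
# A tuple (predicate, nested_spec) marks a nested object that becomes
# a linked entity of its own.
SPEC = {
    "type": "ScholarlyArticle",
    "properties": {
        "title": "name",
        "author": "author",
        "year": "datePublished",
        "event": ("recordedAt", {
            "type": "Event",
            "properties": {"name": "name", "url": "url"},
        }),
    },
}


def build_node(data, spec):
    """Build one JSON-LD node; unmapped properties are dropped in this sketch."""
    node = {"@type": spec["type"]}
    for label, target in spec["properties"].items():
        if label not in data:
            continue
        if isinstance(target, tuple):
            predicate, nested_spec = target
            node[predicate] = build_node(data[label], nested_spec)
        else:
            node[target] = data[label]
    return node


def to_jsonld(data, spec, context="https://schema.org"):
    """Wrap the root node with an @context."""
    return {"@context": context, **build_node(data, spec)}
```

Because nested JSON objects become nested nodes, the links between entities follow the original tree structure.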

7. Evaluation of results
• manual or automatic comparison of actual vs. desired
result to reweight matching components
• store correctly applied mappings for later reuse
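
The reweighting idea can be illustrated with a simple additive update, shown below. The update rule, the `rate` parameter, and the data shapes are assumptions for this sketch; the slides do not describe how the prototype adjusts its weights.

```python
def reweight(weights, actual, desired, rate=0.1):
    """Compare actual against desired mappings and adjust component weights.

    weights: {component: weight}
    actual:  {label: (predicate, component)}, i.e. what was mapped and which
             matching component produced it
    desired: {label: predicate}, the expected result
    Returns the updated weights and the confirmed mappings to store for reuse.
    """
    updated = dict(weights)
    confirmed = {}
    for label, (predicate, component) in actual.items():
        if label not in desired:
            continue  # no expectation available for this label
        if desired[label] == predicate:
            updated[component] = min(1.0, updated[component] + rate)
            confirmed[label] = predicate
        else:
            updated[component] = max(0.0, updated[component] - rate)
    return updated, confirmed
```

The confirmed mappings could then be fed back as pre-defined mappings for later runs on similar data.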
