Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources
2. CURRENT SITUATION
• knowledge on the Web is often only available as weakly
interlinked, heterogeneous, semi-structured data
→ no semantic classification
• how to link or merge data?
• how to do semantic queries?
→ not usable in a meaningful way
2 12. September 2016
3. GOAL
Extraction of knowledge from semi-structured data
• knowledge in terms of semantic metadata
• semantically enriched data can then utilize
the potential of Linked Data
→ provide an automatic process
5. THE KESEDA APPROACH
• Specifically designed to work on JSON data
• Challenges when working with JSON data
→ no schema, only name-value pairs
→ any structure and depth possible
7. THE KESEDA APPROACH
{
  "id": "2015-007",
  "title": "SmartComposition: ...",
  "author": ["Michael Krug", "Martin Gaedke"],
  "year": "2015",
  "type": "Conference Paper",
  "event": {
    "name": "24th International World Wide Web Conference",
    "url": "http://www.www2015.it/"
  },
  [...]
}
Slide annotations: arrays (e.g. "author"), objects (e.g. "event")
8. THE KESEDA APPROACH
• multi-step algorithm
• work within the existing JSON structure
• find and store various matches with different weights
• use additional information sources like API descriptions
• assign classes to objects with multiple properties
• link detected entities
9. THE KESEDA APPROACH
1. Differentiation of input sources / formats
2. Preparation of data structure
3. Analysis of property labels
4. Analysis of property values
5. Mapping of classes
6. Generate JSON-LD document
7. Evaluation of results
11. PROTOTYPE
• prototype implemented in Node.js
• working with properties and classes from:
  • schema.org
  • FOAF
  • Dublin Core
  • GoodRelations
  • Music Ontology
• dictionaries for: first & last names, cities, streets, languages
• list of manually curated synonyms
• option to provide pre-defined mappings
12. PROTOTYPE
• Web interface for
  • pre-configuration
    • mappings, synonyms, dictionaries
  • data upload
  • result analysis
    • statistics and browsing
16. EVALUATION
Algorithm applied to datasets of
1) JSON array of people
2) JSON array of publications
a) Without custom pre-configuration
b) With custom pre-configuration
17. EVALUATION
Initial Setup
• dictionary and structure pattern matching
• label → predicate string matching
• classes and properties: schema.org, FOAF, Dublin Core, GoodRelations
Custom Pre-Configuration
• set of label → predicate mappings (hand-picked for data context)
• list of known synonyms
• more structure patterns
27. SUMMARY
➙ Approach for extracting knowledge from semi-structured data
➙ by applying a multi-step algorithm
➙ to convert JSON data to RDF
➙ that assigns known classes to objects and maps
their properties to S-P-O triples
28. OPEN CHALLENGES
• detect and reuse JSON structure patterns
• disambiguate values
• apply quality control to results
• improve scalability for large datasets
• research application of machine learning
31. THE KESEDA APPROACH
1. Differentiation of input sources / formats
• text, file, URL, API
• check for format
• optional conversion of XML to JSON
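A minimal Python sketch of this step, assuming the input has already been read into a string (from text, file, URL or API). The function names and the naive XML-to-JSON convention are illustrative assumptions; the deck does not show how the Node.js prototype does this.

```python
import json
import xml.etree.ElementTree as ET


def detect_format(text):
    """Return 'json', 'xml' or 'unknown' for a raw text payload."""
    stripped = text.strip()
    try:
        json.loads(stripped)
        return "json"
    except ValueError:
        pass
    try:
        ET.fromstring(stripped)
        return "xml"
    except ET.ParseError:
        return "unknown"


def xml_to_dict(element):
    """Naive conversion (an assumption): attributes and child tags become keys,
    repeated child tags are collected into a list, leaf text becomes a string."""
    node = dict(element.attrib)
    for child in element:
        if len(child) or child.attrib:
            value = xml_to_dict(child)
        else:
            value = (child.text or "").strip()
        if child.tag in node:
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    return node


def load_input(text):
    """Check the format and hand back a JSON-like Python structure."""
    fmt = detect_format(text)
    if fmt == "json":
        return json.loads(text)
    if fmt == "xml":
        return xml_to_dict(ET.fromstring(text.strip()))
    raise ValueError("unsupported input format")
```

The XML root tag is dropped in this sketch; a real converter would need a rule for it, as well as for mixed content.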
32. THE KESEDA APPROACH
2. Preparation of data structure
• pre-process JSON tree to store matches and mappings
• keep original structure to preserve hierarchy for later
relations
• detect arrays and objects for separate processing
• clean up: remove empty entries
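One way to picture the preparation step is a wrapper tree that mirrors the JSON hierarchy and gives every node a slot for matches. The node layout below ("kind", "children", "items", "matches") is an assumption for illustration, not the prototype's internal format.

```python
def is_empty(value):
    """Entries treated as empty in this sketch: null, '', [] and {}."""
    return value is None or value == "" or value == [] or value == {}


def prepare(value):
    """Wrap a JSON value in an annotated node, keeping the original hierarchy.

    Objects and arrays get their own node kinds so they can be processed
    separately; empty entries are dropped (one level at a time).
    """
    if isinstance(value, dict):
        children = {k: prepare(v) for k, v in value.items() if not is_empty(v)}
        return {"kind": "object", "children": children, "matches": []}
    if isinstance(value, list):
        items = [prepare(v) for v in value if not is_empty(v)]
        return {"kind": "array", "items": items, "matches": []}
    return {"kind": "value", "value": value, "matches": []}


doc = {
    "title": "SmartComposition: ...",
    "author": ["Michael Krug", "Martin Gaedke"],
    "note": "",  # hypothetical empty entry, removed during clean-up
    "event": {"name": "24th International World Wide Web Conference"},
}
tree = prepare(doc)
```

Later steps can append candidate matches to the "matches" list of each node without touching the original values.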
33. THE KESEDA APPROACH
3. Analysis of property labels
• string matching (substrings, prefixes, …)
• synonyms
• pre-defined mappings
• use metadata from API description, if available
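The label analysis can be sketched as a function that returns weighted candidate predicates for one property label. The lookup tables and the weight values below are hypothetical; the deck only states that matches are stored with different weights.

```python
import re

# Hypothetical configuration; none of these tables come from the slides.
PREDEFINED = {"id": "schema:identifier"}
SYNONYMS = {"year": "datePublished", "title": "name"}
VOCABULARY = ["schema:name", "schema:author", "schema:datePublished", "schema:url"]

# Illustrative weights: explicit mappings outrank exact, synonym and substring matches.
W_PREDEFINED, W_EXACT, W_SYNONYM, W_SUBSTRING = 1.0, 0.9, 0.7, 0.4


def normalize(label):
    """Lower-case a label and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def match_label(label):
    """Return candidate (predicate, weight) pairs for a label, best first."""
    key = normalize(label)
    candidates = {}

    def add(predicate, weight):
        candidates[predicate] = max(weight, candidates.get(predicate, 0.0))

    if key in PREDEFINED:
        add(PREDEFINED[key], W_PREDEFINED)
    target = normalize(SYNONYMS.get(key, key))
    for predicate in VOCABULARY:
        local = normalize(predicate.split(":", 1)[1])
        if local == key:
            add(predicate, W_EXACT)
        elif local == target:
            add(predicate, W_SYNONYM)
        elif key and (local.startswith(key) or key in local):
            add(predicate, W_SUBSTRING)
    return sorted(candidates.items(), key=lambda item: -item[1])
```

Metadata from an API description, where available, could feed the same function as additional pre-defined mappings.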
34. THE KESEDA APPROACH
4. Analysis of property values
• dictionaries
• structure patterns (uri, date, address, color…)
• data types (date, time, number, boolean…)
• (lower weighted)
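A rough sketch of the value analysis, combining a dictionary lookup, structure patterns and data-type checks. The tiny name dictionary, the regular expressions and the weights are assumptions; in particular, the sketch reads the "(lower weighted)" note as applying to plain data-type hints.

```python
import re
from datetime import datetime

# Tiny stand-in for the dictionaries mentioned in the deck (names, cities, ...).
FIRST_NAMES = {"michael", "martin", "frank"}

URI_PATTERN = re.compile(r"^https?://\S+$")
HEX_COLOR_PATTERN = re.compile(r"^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})$")
NUMBER_PATTERN = re.compile(r"-?\d+(\.\d+)?")


def is_iso_date(text):
    """True if text is a date of the form YYYY-MM-DD."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def analyze_value(value):
    """Return (hint, weight) pairs for one JSON value."""
    hints = []
    if isinstance(value, bool):
        hints.append(("boolean", 0.2))  # data-type hint, weighted low
    elif isinstance(value, (int, float)):
        hints.append(("number", 0.2))
    elif isinstance(value, str):
        text = value.strip()
        if URI_PATTERN.match(text):
            hints.append(("uri", 0.6))  # structure pattern
        if HEX_COLOR_PATTERN.match(text):
            hints.append(("color", 0.6))
        if is_iso_date(text):
            hints.append(("date", 0.6))
        first_token = text.split(" ", 1)[0].lower() if text else ""
        if first_token in FIRST_NAMES:
            hints.append(("person-name", 0.5))  # dictionary match
        if NUMBER_PATTERN.fullmatch(text):
            hints.append(("number", 0.2))  # numeric string, weighted low
    return hints
```

For the publication example, `analyze_value("Michael Krug")` yields a person-name hint and `analyze_value("http://www.www2015.it/")` a URI hint, which can then support or contradict the label-based candidates.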
35. THE KESEDA APPROACH
5. Mapping of classes
• find class by number of matched properties
• select match that is most appropriate for chosen class
• take different weights into account
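The class mapping can be approximated by scoring each known class against the weighted property candidates of an object. The class-to-property table below is a hypothetical excerpt, and summing weights is only one possible scoring rule.

```python
# Hypothetical excerpt: which properties each class is expected to carry.
CLASS_PROPERTIES = {
    "schema:ScholarlyArticle": {"schema:name", "schema:author", "schema:datePublished"},
    "schema:Event": {"schema:name", "schema:url", "schema:startDate"},
    "schema:Person": {"schema:name", "schema:givenName", "schema:familyName"},
}


def choose_class(candidates):
    """Pick the class whose properties best cover the candidates.

    candidates maps each JSON label to a list of (predicate, weight) pairs.
    Returns (best_class, mapping), where mapping assigns each label the
    highest-weighted candidate predicate that belongs to the chosen class.
    """
    best_class, best_score, best_mapping = None, 0.0, {}
    for cls, allowed in CLASS_PROPERTIES.items():
        mapping, score = {}, 0.0
        for label, options in candidates.items():
            fitting = [(p, w) for p, w in options if p in allowed]
            if fitting:
                predicate, weight = max(fitting, key=lambda pw: pw[1])
                mapping[label] = predicate
                score += weight
        if score > best_score:
            best_class, best_score, best_mapping = cls, score, mapping
    return best_class, best_mapping


publication = {
    "title": [("schema:name", 0.7)],
    "author": [("schema:author", 0.9)],
    "year": [("schema:datePublished", 0.7)],
}
event = {"name": [("schema:name", 0.9)], "url": [("schema:url", 0.9)]}
```

With these inputs, `choose_class(publication)` selects `schema:ScholarlyArticle` and `choose_class(event)` selects `schema:Event`, because those classes accumulate the highest total weight.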
36. THE KESEDA APPROACH
6. Generate JSON-LD document
• use matches and mappings
• link entities depending on JSON tree structure
• validation of output
• optional conversion to various RDF formats
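The sketch below shows how matches and mappings might be turned into a JSON-LD document, with nested JSON objects becoming nested (linked) nodes. The mapping tables are hand-written for the publication example, and `ex:presentedAt` is a made-up predicate in an example namespace; neither is taken from the prototype.

```python
import json

# "ex" is a placeholder namespace for predicates invented for this sketch.
CONTEXT = {"schema": "http://schema.org/", "ex": "http://example.org/vocab#"}


def to_jsonld(obj, class_of, predicate_of, path=()):
    """Convert a nested dict into a JSON-LD node.

    class_of maps a path tuple to a class; predicate_of maps a path tuple
    to a predicate. Unmapped labels are skipped in this sketch.
    """
    node = {}
    if path in class_of:
        node["@type"] = class_of[path]
    for label, value in obj.items():
        child_path = path + (label,)
        predicate = predicate_of.get(child_path)
        if predicate is None:
            continue
        if isinstance(value, dict):
            # Nested object: becomes a nested node linked via the predicate.
            node[predicate] = to_jsonld(value, class_of, predicate_of, child_path)
        else:
            node[predicate] = value  # scalars and arrays of scalars pass through
    return node


def build_document(obj, class_of, predicate_of):
    doc = {"@context": CONTEXT}
    doc.update(to_jsonld(obj, class_of, predicate_of))
    return doc


source = {
    "id": "2015-007",
    "title": "SmartComposition: ...",
    "author": ["Michael Krug", "Martin Gaedke"],
    "event": {
        "name": "24th International World Wide Web Conference",
        "url": "http://www.www2015.it/",
    },
}
class_of = {(): "schema:ScholarlyArticle", ("event",): "schema:Event"}
predicate_of = {
    ("title",): "schema:name",
    ("author",): "schema:author",
    ("event",): "ex:presentedAt",
    ("event", "name"): "schema:name",
    ("event", "url"): "schema:url",
}
document = build_document(source, class_of, predicate_of)
serialized = json.dumps(document, indent=2)
```

Validation and conversion to other RDF serializations are left out here; both would typically be delegated to a JSON-LD or RDF library.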
37. THE KESEDA APPROACH
7. Evaluation of results
• manual or automatic comparison of actual vs. desired
result to reweight matching components
• store correctly applied mappings for later reuse
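A possible shape for the evaluation step: compare the produced mappings with a desired result, nudge the weight of the matching component that produced each mapping, and keep confirmed mappings for reuse. The fixed step size and the component names are assumptions; the deck does not describe the actual reweighting rule.

```python
def evaluate(actual, desired, sources, weights, step=0.1):
    """Compare actual vs. desired label -> predicate mappings.

    sources maps each label to the matching component that produced its
    mapping (e.g. "synonym"). Returns (accuracy, new_weights, confirmed).
    """
    new_weights = dict(weights)
    confirmed = {}
    correct = 0
    for label, expected in desired.items():
        component = sources.get(label)
        hit = actual.get(label) == expected
        if hit:
            correct += 1
            confirmed[label] = expected  # stored for later reuse
        if component is not None:
            current = new_weights.get(component, 0.5)
            delta = step if hit else -step
            # Clamp to [0, 1] and round to avoid floating-point drift.
            new_weights[component] = round(min(1.0, max(0.0, current + delta)), 2)
    accuracy = correct / len(desired) if desired else 0.0
    return accuracy, new_weights, confirmed


accuracy, new_weights, confirmed = evaluate(
    actual={"title": "schema:name", "date": "schema:dateCreated"},
    desired={"title": "schema:name", "date": "schema:datePublished"},
    sources={"title": "synonym", "date": "substring"},
    weights={"synonym": 0.7, "substring": 0.4},
)
```

The confirmed mappings could be fed back as pre-defined mappings in a later run, which matches the idea of storing correctly applied mappings.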