Linked Open Data (LOD) has emerged as one of the largest collections of interlinked structured datasets on the Web. Although the adoption of such datasets for applications is increasing, identifying relevant datasets for a specific task or topic is still challenging. As an initial step to make such identification easier, we provide an approach to automatically identify the topic domains of given datasets. Our method utilizes existing knowledge sources, more specifically Freebase, and we present an evaluation which validates the topic domains we can identify with our system. Furthermore, we evaluate the effectiveness of identified topic domains for the purpose of finding relevant datasets, thus showing that our approach improves reusability of LOD datasets.
Domain Identification for Linked Open Data
1. Domain Identification for Linked Open Data
Sarasi Lalithsena
Pascal Hitzler
Amit Sheth
Kno.e.sis Center
Wright State University, Dayton, OH
Prateek Jain
IBM T.J. Watson Research Center
Yorktown, NY, USA
WI 2013 Atlanta, GA, USA
2. Motivation
LOD cloud
262 datasets
870 alive datasets
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lodcloud.net/”
4. Problem
• How do we identify the relevant datasets from this structured
knowledge space?
– How do we create a registry of topics which describe the
domain of a dataset?
5. State of the Art - CKAN
• To organize this large cloud, CKAN encourages users to
tag their datasets with the following domains:
- media
- geography
- life sciences
- publications
- government
- e-commerce
- social web
- user generated content
- schemata
- cross-domain
• CKAN administrators then manually review these tags
and organize the diagram
• CKAN provides dataset search based on these manual
tags and keywords
6. State of the Art - CKAN
But,
• A fixed set of tags cannot cope with the increasing diversity of the
datasets
– For example, what would the tags be for the Lingvoj dataset?
• The manual review process will soon be unsustainable
• Classification is subjective
7. State of the Art - LODStats
• Stream-based approach to collecting dataset statistics
• Allows searching for datasets based on the keywords and
metadata provided by data publishers
8. State of the Art – Other
• Semantic Search Engine (SSE)
– SSEs such as Sigma, Swoogle and Watson allow searching
for instances and return the related instance URI
– But are not designed for dataset search
• Federated Querying systems on LOD datasets
– Need to know seed URIs to find the relevant datasets
9. State of the Art – Existing Problems with Dataset Lookup
• Rely on manual tagging provided by users and the manual
reviewing process
• Rely on keywords and metadata provided by users
• Need to know seed URIs to find the relevant datasets
• Need to know instances to start exploring the datasets
10. What do we propose?
• Introduce a systematic and sophisticated way to identify
possible domains, topics, tags (Topic Domain) to better describe
these datasets
• What can these topic domains be?
– A predefined list
– The schema types of each dataset
12. How do we address the previous
problems?
• Use the category system of existing knowledge sources as the
vocabulary to describe the domain
– Does not need to rely on a predefined set of tags
– Does not need to rely on metadata and keywords
• Automatic way to identify the topic domains
• Vocabulary can be used to search the datasets and organize the
datasets
13. Our approach - Freebase
• Use Freebase as our knowledge source to identify the topic
domains
• Why Freebase?
– Wide Coverage
Has 39 million topics
– Simple Category Hierarchy System
• The Freebase category system categorizes each topic into types, and
types are grouped into domains
– Example: music (domain), Artist (type)
• We use Freebase types and domains as our topic domains
15. Our Approach
STEP 1 Instance Identification
– Extract the instances of the dataset with their types
– Extract the human-readable values of each instance and its type,
e.g. Granite and its type Rock
– Identify the most closely related Freebase instance for
each instance in our dataset
Example instances with type: Ignimbrite, Rock; Slate, Rock; Granite, Rock
Freebase URIs: http://www.freebase.com/m/01tx7r, http://www.freebase.com/m/01c_9j, http://www.freebase.com/m/03fcm
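A minimal sketch of extracting a human-readable value from an instance or type URI, assuming the label can be read from the URI's local name. The helper name `local_name` and the example.org URIs are hypothetical; the original system may instead read explicit label properties.

```python
from urllib.parse import urlparse, unquote


def local_name(uri: str) -> str:
    """Return a readable label taken from the last part of a URI (assumed heuristic)."""
    parsed = urlparse(uri)
    # Prefer the fragment (...#Rock); otherwise use the last path segment.
    raw = parsed.fragment or parsed.path.rstrip("/").rsplit("/", 1)[-1]
    # Decode percent-escapes and turn underscores into spaces.
    return unquote(raw).replace("_", " ")


# Hypothetical URIs, for illustration only.
instance_label = local_name("http://example.org/resource/Granite")
type_label = local_name("http://example.org/ontology#Rock")
print(instance_label, type_label)
```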
16. Our Approach
• Instance Identification
We also attach the type information to the query string
– The query Apple alone is ambiguous: Apple Company or Apple Fruit
– The query Apple Fruit resolves to Apple Fruit
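A minimal sketch of building such a query string, assuming the instance label and type label have already been extracted. The function name `build_query` is hypothetical, and the lookup against Freebase itself is left out.

```python
from typing import Optional


def build_query(instance_label: str, type_label: Optional[str] = None) -> str:
    """Append the type label to the instance label to disambiguate the lookup."""
    parts = [instance_label.strip()]
    if type_label and type_label.strip():
        parts.append(type_label.strip())
    return " ".join(parts)


ambiguous = build_query("Apple")           # could match the company or the fruit
specific = build_query("Apple", "Fruit")   # carries the type as extra context
print(ambiguous, "|", specific)
```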
17. Our Approach
• STEP 2 Category Hierarchy Creation
Freebase type identifiers have the form {domain/type}, e.g. /geology/rock_type for Ignimbrite
[Figure: category hierarchy created for each instance]
– Ignimbrite: geology (rock type); geography (mountain, mountain range)
– slate: geology (rock type); geography (mountain); music (release track, recording)
– granite: geology (rock type); geography (mountain)
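A minimal sketch of turning `/domain/type` identifiers into a small domain-to-types hierarchy. Only `/geology/rock_type` appears in the slides; the other identifiers and the function name `build_hierarchy` are assumptions made for illustration.

```python
from collections import defaultdict


def build_hierarchy(type_ids):
    """Group '/domain/type' identifiers into {domain: [types]}."""
    hierarchy = defaultdict(set)
    for type_id in type_ids:
        parts = type_id.strip("/").split("/")
        if len(parts) != 2:
            continue  # skip identifiers that do not follow /domain/type
        domain, type_name = parts
        hierarchy[domain].add(type_name.replace("_", " "))
    return {domain: sorted(types) for domain, types in hierarchy.items()}


# Illustrative identifiers for one instance (assumed, apart from the first).
ignimbrite = build_hierarchy(
    ["/geology/rock_type", "/geography/mountain", "/geography/mountain_range"]
)
print(ignimbrite)
```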
18. Our Approach
• Category Hierarchy Merging
[Figure: the category hierarchies of the three instances, to be merged]
– Ignimbrite: geology (rock type); geography (mountain, mountain range)
– slate: geology (rock type); geography (mountain); music (release track, recording)
– granite: geology (rock type); geography (mountain)
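One way the merge might look, as a sketch under assumptions: per-instance hierarchies are combined into a single structure that records which instances support each domain and type. The function name `merge_hierarchies` and the data layout are hypothetical; the slides do not specify the actual merging procedure.

```python
def merge_hierarchies(per_instance):
    """Merge {instance: {domain: [types]}} into one hierarchy with supporting instances."""
    merged = {}
    for instance, hierarchy in per_instance.items():
        for domain, types in hierarchy.items():
            node = merged.setdefault(domain, {"instances": set(), "types": {}})
            node["instances"].add(instance)
            for type_name in types:
                node["types"].setdefault(type_name, set()).add(instance)
    return merged


# Hierarchies taken from the slide example.
per_instance = {
    "Ignimbrite": {"geology": ["rock type"], "geography": ["mountain", "mountain range"]},
    "slate": {
        "geology": ["rock type"],
        "geography": ["mountain"],
        "music": ["release track", "recording"],
    },
    "granite": {"geology": ["rock type"], "geography": ["mountain"]},
}
merged = merge_hierarchies(per_instance)
print(sorted(merged))
```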
19. Our Approach
• Candidate Category Hierarchy Selection
Filter out insignificant category hierarchies using a simple
heuristic
[Figure: category hierarchies of the three instances]
– Ignimbrite: geology (rock type); geography (mountain, mountain range)
– slate: geology (rock type); geography (mountain); music (release track, recording)
– granite: geology (rock type); geography (mountain)
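The slides do not state which heuristic is used. Purely as an illustration, the sketch below keeps a domain only if a minimum share of the instances supports it; the function name `select_candidates` and the `min_share` threshold are assumptions, not the authors' method.

```python
def select_candidates(domain_support, total_instances, min_share=0.5):
    """Keep domains supported by at least `min_share` of the instances (assumed heuristic)."""
    return {
        domain: count
        for domain, count in domain_support.items()
        if count / total_instances >= min_share
    }


# Support counts read off the slide example: three instances in total.
support = {"geology": 3, "geography": 3, "music": 1}
candidates = select_candidates(support, total_instances=3)
print(sorted(candidates))
```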
20. Our Approach
• Frequency Count Generation
Count the number of occurrences of each category (the number of
instances having the given category)
Term | Frequency | Parent Node
geology | 3 |
rock type | 3 | geology
mountain range | 1 | geography
… | … | …
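The counting step can be sketched as follows, assuming each instance carries a domain-to-types hierarchy as in the slide example. The function name `count_categories` and the `(term, parent)` key layout are hypothetical choices for this sketch.

```python
from collections import Counter


def count_categories(per_instance):
    """Count, for each (term, parent) pair, how many instances have that category."""
    counts = Counter()
    for hierarchy in per_instance.values():
        for domain, types in hierarchy.items():
            counts[(domain, None)] += 1          # domains have no parent node
            for type_name in types:
                counts[(type_name, domain)] += 1  # types point to their domain
    return counts


# Hierarchies taken from the slide example.
hierarchies = {
    "Ignimbrite": {"geology": ["rock type"], "geography": ["mountain", "mountain range"]},
    "slate": {
        "geology": ["rock type"],
        "geography": ["mountain"],
        "music": ["release track", "recording"],
    },
    "granite": {"geology": ["rock type"], "geography": ["mountain"]},
}
counts = count_categories(hierarchies)
for (term, parent), frequency in counts.most_common():
    print(term, frequency, parent or "")
```

Keying on the pair rather than on the term alone keeps two types with the same name under different domains apart.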
21. Implementation
• Map Reduce Deployment
[Figure: <Inst, type> pairs are distributed to mappers map1 … Map n, which perform STEP 2 and 3; reducers 1 … m perform STEP 4; post processing performs STEP 5]
Instances belonging to the same type go to a
single reducer
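The routing of instances to reducers can be illustrated with a small in-memory sketch. It is not Hadoop code and omits the per-step processing; it only shows the assumed idea that the mapper emits the type as the key, so all instances of one type reach the same reducer. All function names are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit the type as key so instances of one type are grouped together."""
    instance, type_label = record
    yield type_label, instance


def shuffle(pairs):
    """Group mapper output by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(type_label, instances):
    """Receive every instance of one type; here we simply collect them."""
    return type_label, sorted(instances)


records = [("Ignimbrite", "Rock"), ("Slate", "Rock"), ("Granite", "Rock"), ("Apple", "Fruit")]
pairs = [pair for record in records for pair in map_phase(record)]
results = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(results)
```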
22. Evaluation
• We ran our experiments with 30 LOD datasets
Evaluation
– Appropriateness of the identified domain: User Study
– Effectiveness in finding the datasets
23. Evaluation : Appropriateness of the
identified domain
• Selected the four most frequent domains and types from our results
• Mixed them with four other randomly selected domains and types
• Asked 20 users to select the terms that best represent the
higher-level domains for the dataset
• 50% of the users agreed on 73% of the terms (88 out of 120)
24. Evaluation : Appropriateness of the
identified domain
Table: terms with the highest user agreement for each dataset. A star (*) indicates
that the term was also the highest ranked by our system (for 22 datasets).
26. Evaluation – Effectiveness in finding the
datasets
• Developed a search application using the normalized frequency
count
• User study against three existing state-of-the-art systems
– CKAN, LODStats and Sigma
• Term selection
• Top ten results are retrieved
• Asked users to rank the sets of results by preference
– 1 (best) to 4 (worst)
• Calculated a user preference score using a weighted average
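The slides do not give the weighting scheme. As a sketch under assumptions, the score below weights each rank linearly (rank 1 gets weight 4, rank 4 gets weight 1) and averages over all votes; the function name `preference_score`, the weights, and the vote counts are made up for illustration.

```python
def preference_score(rank_counts, weights=None):
    """Weighted average of ranks: higher means users preferred the system more."""
    if weights is None:
        # Assumed linear weights: rank 1 (best) -> 4, rank 4 (worst) -> 1.
        weights = {1: 4, 2: 3, 3: 2, 4: 1}
    total_votes = sum(rank_counts.values())
    if total_votes == 0:
        return 0.0
    weighted_sum = sum(weights[rank] * votes for rank, votes in rank_counts.items())
    return weighted_sum / total_votes


# Illustrative votes for one system: how many users gave it each rank.
votes = {1: 10, 2: 5, 3: 3, 4: 2}
score = preference_score(votes)
print(round(score, 2))
```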
28. Evaluation
Evaluation
– Appropriateness of the identified domain: User Study
– Effectiveness in finding the datasets
1. User Study with three other search engines (SE)
2. Evaluate CKAN as the baseline
30. Evaluation
Evaluation
– Appropriateness of the identified domain: User Study
– Effectiveness in finding the datasets
1. User Study with three other search engines (SE)
2. Evaluate CKAN as the baseline
3. Evaluate both CKAN and our approach using a manually curated
gold standard
32. Conclusion and Future Work
• Our approach helps to categorize the datasets
systematically
• We demonstrate the potential of using the categorization for finding
relevant datasets
• We utilize a diverse classification hierarchy such as Freebase
• This work may also be important for other potential applications
such as browsing, interlinking and querying
• We plan to improve the domain coverage by using knowledge
sources such as Wikipedia
• We plan to compare the interpretations given by multiple knowledge sources
to see which one gives the better interpretation