Towards Integration of Web Data into a coherent Educational Data Graph
1. Motivation
Data on the Web
18/06/13 LILE 2013 – Rio de Janeiro
An eye-catching opener illustrating the growth and diversity of web data
LILE 2013 : 3rd International Workshop on Learning and Education with the Web of Data
14 May 2013, Rio de Janeiro, Brazil
Davide Taibi – Besnik Fetahu – Stefan Dietze
(CNR – ITD, IT) (L3S Research Center, DE)
2. Outline
• Linked Open Data serving data-intensive applications
• Heterogeneity of datasets and schemas
• Is Linked Open Data really that easy to use, and what is it all about?
– Interlinking of datasets only at a superficial level
– Different schemas for similar resource classes across datasets
– Non-structured resource descriptions
– Best-case scenario: very abstract topic definitions
– Difficult to query for a subset of resources and datasets for a specific topic
• Our approach
– Schema level integration
– Enhanced dataset & resource descriptions
– Instance level integration
– Scalable annotation extraction
– Clustering and correlation of datasets
3. Introduction
• Large amounts of publicly available Linked Open Data of educational relevance
• Difficulties in achieving large-scale integration
• Dataset and resource description annotation
• Clustering and dataset interlinking
Educational Data
4. Steps towards a Linked Education Data Graph
6. Schema Level Integration
http://data.linkededucation.org/ns/linked-education.rdf
LinkedUniversities Dataset
7. Schema Level Integration
• VoID-based schema:
– http://data.linkededucation.org/ns/linked-education.rdf
– Dataset cataloging and classification
– Mappings (types, properties)
• Datasets:
– LinkedUniversities Dataset
– mEducator
– Europeana
• Imported resources for clustering experiments:
– 6 million distinct resources
– 97 million RDF triples
– 21.6 GB of data
• SPARQL endpoint:
– http://okkam.l3s.uni-hannover.de:8880/openrdf-workbench/repositories/linked-learning-rdf
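As an illustration of how the endpoint above could be queried, here is a minimal sketch that encodes a SPARQL query as an HTTP GET URL. The endpoint path is taken from the slide; the query itself is an illustrative assumption, not part of the deck.

```python
from urllib.parse import urlencode

# Endpoint path as listed on the slide.
ENDPOINT = ("http://okkam.l3s.uni-hannover.de:8880/"
            "openrdf-workbench/repositories/linked-learning-rdf")

def build_query_url(endpoint: str, sparql: str) -> str:
    """Encode a SPARQL query string as an HTTP GET request URL."""
    return endpoint + "?" + urlencode({"query": sparql})

# Hypothetical query: count the RDF triples in the repository.
query = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"
url = build_query_url(ENDPOINT, query)
```

In practice the request would be sent with an `Accept` header selecting the desired result format (e.g. SPARQL/JSON).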
DBLP-L3S
BBC programmes
ACM publications
8. Instance-level integration
<http://dbpedia.org/page/Gravitation>
<http://dbpedia.org/page/Strong>
<http://dbpedia.org/page/Dense>
• DBpedia Spotlight as NER & NED tool
• Annotation of unstructured content
• Selective & Scalable annotation
• Annotate tokens of different sizes
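The annotation step above relies on DBpedia Spotlight's REST service. A minimal post-processing sketch is shown below; the sample payload is fabricated for illustration, but the `Resources` / `@URI` / `@surfaceForm` keys follow Spotlight's JSON response format.

```python
import json

# Fabricated sample of a Spotlight /annotate JSON response.
sample_response = json.dumps({
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/Gravitation",
         "@surfaceForm": "gravitation", "@similarityScore": "0.93"},
        {"@URI": "http://dbpedia.org/resource/Earth",
         "@surfaceForm": "earth", "@similarityScore": "0.88"},
    ]
})

def extract_annotations(payload: str, min_score: float = 0.5):
    """Return (surface form, entity URI) pairs above a confidence threshold."""
    doc = json.loads(payload)
    return [(r["@surfaceForm"], r["@URI"])
            for r in doc.get("Resources", [])
            if float(r.get("@similarityScore", 0)) >= min_score]

annotations = extract_annotations(sample_response)
```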
9. Instance-level integration
Characteristics of enrichments
• Disambiguation
• Acronym detection (e.g. “dns”, “gmt”)
• Synonym detection (e.g. “globe”, “earth”)
• Context detection (e.g. “apple” the fruit vs. “apple” the computer company)
<http://dbpedia.org/page/Gravitation>
10. Correlation and Clustering
Gravitation
Equations
Earth
• Annotations are used to construct a network of resources, with edges based on common resource annotations
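The network construction above can be sketched as follows; the resource IDs and annotation sets are toy data, and edge weights are assumed to be the number of shared annotations.

```python
from itertools import combinations

# Toy data: each resource mapped to its set of entity annotations.
annotations = {
    "res1": {"Gravitation", "Earth", "Equations"},
    "res2": {"Gravitation", "Earth"},
    "res3": {"Equations"},
}

def build_network(ann):
    """Edges between resources, weighted by the number of shared annotations."""
    edges = {}
    for a, b in combinations(sorted(ann), 2):
        shared = len(ann[a] & ann[b])
        if shared:  # only connect resources with at least one common annotation
            edges[(a, b)] = shared
    return edges

edges = build_network(annotations)
# res1/res2 share two annotations, res1/res3 one, res2/res3 none
```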
11. Correlation and Clustering
• Methods used for clustering
• Based on the shared enrichments
• Naïve
• Based on the EF-IRF (Enrichment Frequency–Inverse Resource Frequency) index
• Jaccard
• Cosine
Different thresholds were used to generate the clusters
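The slides do not give the exact EF-IRF formula; assuming it mirrors TF-IDF (an enrichment's frequency within a resource, scaled by the log-inverse of the number of resources carrying that enrichment), a minimal sketch with toy data:

```python
import math

# Toy data: resources with their (possibly repeated) enrichments.
resources = {
    "res1": ["Gravitation", "Gravitation", "Earth"],
    "res2": ["Gravitation", "Equations"],
    "res3": ["Earth", "Earth", "Equations"],
}

def ef_irf(resources):
    """Assumed TF-IDF-style weighting of enrichments per resource."""
    n = len(resources)
    df = {}  # number of resources containing each enrichment
    for enrichments in resources.values():
        for e in set(enrichments):
            df[e] = df.get(e, 0) + 1
    weights = {}
    for rid, enrichments in resources.items():
        total = len(enrichments)
        weights[rid] = {
            e: (enrichments.count(e) / total) * math.log(n / df[e])
            for e in set(enrichments)
        }
    return weights

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

w = ef_irf(resources)
sim = cosine(w["res1"], w["res2"])
```

Clusters would then be formed by keeping resource pairs whose similarity exceeds a chosen threshold.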
12. Evaluation
Three evaluation stages:
• Quantitative & Qualitative
  • Assess annotation accuracy for the exhaustive and scalable approaches
  • Measure standard precision/recall metrics
  • 250 resources from each dataset were used for assessment
• Performance
  • Gains in terms of scalability
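The precision/recall metrics mentioned above can be computed over annotation sets as sketched below; the gold-standard and predicted annotations are toy data, not the deck's evaluation data.

```python
def precision_recall(predicted, gold):
    """Standard precision and recall over two annotation sets."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # correctly predicted annotations
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall(
    predicted={"Gravitation", "Earth", "Apple"},
    gold={"Gravitation", "Earth", "Equations", "Moon"},
)
# p = 2/3 (two of three predictions correct), r = 2/4
```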
13. Quantitative Evaluation
Context             #Resources  #Annotations  #Entity Types
ACM                        249           200            239
mEducator                  250           495            355
BBC                        250          1364            769
LinkedUniversities         243           166            283
DBLP                       250           295            161
Europeana                  249           938            672
Total                     1491          3458            937
• The number of extracted entities correlates with the length of a resource's textual description
• For long texts, up to 87 distinct entities and more than 200 entity-type associations were extracted
14. Qualitative Evaluation
• Human evaluators to measure annotation accuracy
• 2000 annotations were assessed for each of the two approaches (exhaustive and scalable)
• The exhaustive approach had 32 evaluators with an average of 63 tasks per user; the scalable approach had 23 users with an average of 87 completed tasks
            Precision  Recall
Exhaustive       0.82   0.429
Scalable         0.77   0.687
∆[E-S]          -0.05   +0.26
15. Performance Evaluation
Size-k  No Filtering  Filtered: resource level  Filtered: dataset level
1              53089                     24850                     7464
2              51346                     17919                    13281
3              49603                     11800                     9607
4              47871                      7793                     6432
5              46153                      5184                     4289
6              44480                      3529                     2922
• Reduction of the textual content to be analyzed in the annotation phase:
  • Keeping only terms with the POS tags {NN, NNP, NNPS} reduces the amount of text by almost 40%
  • Across the various token sizes, the reduction reaches up to 86%
• Reduced complexity of the NER task performed with DBpedia Spotlight:
• Reduction of HTTP requests.
• Avoid annotating similar chunks of text.
• Significant gains in terms of execution time: 3.5hrs vs. 20mins
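The POS-tag filtering step can be sketched as below: only nouns and proper nouns ({NN, NNP, NNPS}) are kept before sending text for annotation. The pre-tagged token list is assumed to come from an upstream POS tagger (e.g. NLTK); the sentence is toy data.

```python
# Penn Treebank tags to keep: common nouns, proper nouns, plural proper nouns.
KEEP_TAGS = {"NN", "NNP", "NNPS"}

def filter_tokens(tagged):
    """Return only the tokens whose POS tag is in the keep set."""
    return [tok for tok, tag in tagged if tag in KEEP_TAGS]

tagged = [
    ("Gravitation", "NNP"), ("is", "VBZ"), ("a", "DT"),
    ("natural", "JJ"), ("phenomenon", "NN"), ("on", "IN"), ("Earth", "NNP"),
]
kept = filter_tokens(tagged)
reduction = 1 - len(kept) / len(tagged)
# 3 of 7 tokens are kept, so roughly 57% of the text is dropped here
```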
16. Conclusion
• Large-scale educational data-graph
• Well-interlinked datasets at schema and instance level
• Enhanced dataset and resource description
• Scalable annotation procedure
• EF-IRF clustering approach
• Clusters and correlated datasets