1. Linked Data in Linguistics
Representing and Connecting Language Data and Language Metadata
Sebastian Hellmann, Christian Chiarcos, Sebastian Nordhoff
34th Annual Meeting of the German Linguistic Society (DGfS), AG 2
Frankfurt/M., Germany, March 7th – 9th, 2012
Unless otherwise noted, content is licensed under CC BY
2. Overview
Technological Background (SH)
Linked Open Data and Collaborative Research (SH)
Linked Data for Linguistics (CC)
Building a Linguistic Linked Open Data Cloud
Prospects of Linked Data in Linguistics (CC)
Annotated Corpora (CC)
Lexical-Semantic Resource (SH)
Linguistic Databases (SN)
What to Expect from LDL-2012
March 7th, 2012 Linked Data in Linguistics 2012 2
3. From Excel to RDF and Linked Data
4. From Excel to RDF and Linked Data
A data collection about sailing ships:
Source http://en.wikipedia.org/wiki/File:Bounty_modified_photo.jpg
5. From Excel to RDF and Linked Data
Add the Gorch Fock
Source http://en.wikipedia.org/wiki/File:Gorch_Fock_unter_Segeln_Kieler_Foerde_2006.jpg
6. From Excel to RDF and Linked Data
Add the auxiliary propulsion of the Gorch Fock
The field is now irregular
7. From Excel to RDF and Linked Data
A first empty field is introduced
8. From Excel to RDF and Linked Data
Entity Attribute Value, data represented in triples
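As a rough illustration of the Entity Attribute Value idea, the following Python sketch turns table rows into triples. The rows, attribute names, and values are illustrative placeholders (the table on the slide is not reproduced in the text), and `to_triples` is a hypothetical helper.

```python
# Illustrative rows only; these are assumptions, not the data from the slide.
rows = [
    {"name": "Bounty", "masts": 3},
    {"name": "Gorch Fock", "masts": 3, "propulsion": "diesel engine"},
]


def to_triples(rows, key="name"):
    """Turn every non-empty cell into one (entity, attribute, value) triple."""
    triples = []
    for row in rows:
        entity = row[key]
        for attribute, value in row.items():
            if attribute != key and value is not None:
                triples.append((entity, attribute, value))
    return triples


triples = to_triples(rows)
for triple in triples:
    print(triple)
```

A missing attribute simply produces no triple, so an irregular field does not force an empty cell on the other rows.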
9. From Excel to RDF and Linked Data
XML also does not produce sparsity or anomalies, but what about:
1. Automatically infer rows (reduces size)
2. Check consistency (not validity)
3. Merge two tables (not only syntactically, but semantically)
4. Enrich with external data (also retrieve updates)
5. Query
11. From Excel to RDF and Linked Data
Description Logics (DL) are a family of formal knowledge representation languages
Fragments of first-order logic
Usually decidable inference problems
Well-researched complexity
Basis for the Web Ontology Language (OWL)
Reasoner implementations available
Franz Baader, Ian Horrocks, and Ulrike Sattler. Chapter 3: Description Logics. In Frank van Harmelen, Vladimir Lifschitz, and Bruce Porter, editors, Handbook of Knowledge Representation. Elsevier, 2007.
12. From Excel to RDF and Linked Data
Description Logic inference
13. From Excel to RDF and Linked Data
Description Logic constraints
Inconsistencies can be detected, e.g., Gorch Fock must not be a SailingShip
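A very small Python sketch of the two ideas, inference and constraint checking. It is not a Description Logic reasoner; the class names `Ship` and `MotorShip`, the subclass axioms, and the disjointness axiom are assumptions made for illustration, since the axioms on the slide are not reproduced in the text.

```python
# Hypothetical axioms: subclass relations and one pair of disjoint classes.
SUBCLASS_OF = {"SailingShip": "Ship", "MotorShip": "Ship"}
DISJOINT = [frozenset({"SailingShip", "MotorShip"})]

# Hypothetical asserted types.
asserted = {
    "Bounty": {"SailingShip"},
    "Gorch Fock": {"SailingShip", "MotorShip"},
}


def infer_types(types):
    """Add all superclasses that follow from the subclass axioms."""
    inferred = set(types)
    frontier = list(types)
    while frontier:
        parent = SUBCLASS_OF.get(frontier.pop())
        if parent is not None and parent not in inferred:
            inferred.add(parent)
            frontier.append(parent)
    return inferred


def inconsistencies(entity_types):
    """Report entities that belong to two classes declared disjoint."""
    found = []
    for entity, types in entity_types.items():
        for pair in DISJOINT:
            if pair <= infer_types(types):
                found.append((entity, sorted(pair)))
    return found


print(infer_types({"SailingShip"}))
print(inconsistencies(asserted))
```

Inferred types need not be stored as rows, and the disjointness check flags the Gorch Fock under the assumed axioms.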
15. Uniform Resource Identifiers (URIs)
Agree on a common vocabulary and names for entities
On the schema level, coherence of properties and types
is required for data integration
URIs allow for globally unique identifiers:
“Gorch Fock”
vs.
http://en.wikipedia.org/wiki/Gorch_Fock_(1958)
vs.
http://dbpedia.org/resource/Gorch_Fock_(1958)
dbpedia:Gorch_Fock_(1958)
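The prefixed form is shorthand for the full URI. A minimal Python sketch of the expansion, where `PREFIXES` and `expand` are hypothetical names and only the `dbpedia` namespace from the slide is registered:

```python
# Prefix table; only the namespace shown on the slide is included.
PREFIXES = {"dbpedia": "http://dbpedia.org/resource/"}


def expand(prefixed_name):
    """Expand a prefixed name such as dbpedia:X into a full URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local_name


print(expand("dbpedia:Gorch_Fock_(1958)"))
```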
16. From Excel to RDF and Linked Data
Last table before we get more technical
4 types of object
17. From Excel to RDF and Linked Data
my:Gorch_Fock owl:sameAs dbpedia:Gorch_Fock_(1958)
dbpedia:Gorch_Fock_(1958) owl:sameAs (other datasets)
my:Gorch_Fock my:owner my:German_Navy
my:German_Navy owl:sameAs dbpedia:German_Navy
dbpedia:Gorch_Fock_(1958) dbprop:shipLength “81.2”^^xsd:double (more data)
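One way to picture what the owl:sameAs links buy us is to merge statements about linked identifiers. The Python sketch below is a simplification under stated assumptions: the triples are a reading of the diagram, the typed literal is reduced to a plain float, and `merge_same_as` is a hypothetical helper, not OWL reasoning.

```python
# Triples as read from the diagram (an interpretation, not authoritative).
triples = [
    ("my:Gorch_Fock", "owl:sameAs", "dbpedia:Gorch_Fock_(1958)"),
    ("my:German_Navy", "owl:sameAs", "dbpedia:German_Navy"),
    ("my:Gorch_Fock", "my:owner", "my:German_Navy"),
    ("dbpedia:Gorch_Fock_(1958)", "dbprop:shipLength", 81.2),
]


def merge_same_as(triples):
    """Rewrite all triples to use one representative per sameAs group."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            node = parent[node]
        return node

    # First pass: join identifiers connected by owl:sameAs.
    for s, p, o in triples:
        if p == "owl:sameAs":
            parent[find(s)] = find(o)

    # Second pass: rewrite the remaining triples; literals stay untouched.
    merged = set()
    for s, p, o in triples:
        if p != "owl:sameAs":
            merged.add((find(s), p, find(o) if isinstance(o, str) else o))
    return merged


merged = merge_same_as(triples)
for triple in sorted(merged, key=str):
    print(triple)
```

After merging, the owner from the local dataset and the ship length from DBpedia are attached to the same identifier.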
18. RDF and OWL - recap
RDF – Resource Description Framework
Entity Attribute Value + URIs
Triples
Shared Vocabularies
Graphs
OWL – Web Ontology Language
Based on Description Logic and extends RDF
OWL-DL Reasoning
Consistency checks
Both are W3C standards
19. Syntax training
Presenters will probably show you some code over the next few days
On the next slide you will see some syntax examples
22. SPARQL
Ability to merge data and query it using the W3C
standard SPARQL (SPARQL Protocol and Query
Language)
SPARQL is the SQL of the Semantic Web
SELECT ?ship WHERE {
  ?ship rdf:type my:SailingShip .
  ?ship my:propulsion ?engine .
  ?engine my:fuelType my:Diesel .
  ?ship dbprop:shipLength ?length .
  FILTER (xsd:double(?length) >= 80.0)
}
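To give a feeling for how the triple patterns of the query above are matched, here is a minimal Python sketch. It is not a SPARQL engine: it only handles conjunctive patterns plus one filter written in plain Python. The data, including `my:engine1` and the Bounty length, is made up for illustration.

```python
# Made-up data; identifiers and lengths are assumptions.
triples = [
    ("my:Gorch_Fock", "rdf:type", "my:SailingShip"),
    ("my:Gorch_Fock", "my:propulsion", "my:engine1"),
    ("my:engine1", "my:fuelType", "my:Diesel"),
    ("my:Gorch_Fock", "dbprop:shipLength", 81.2),
    ("my:Bounty", "rdf:type", "my:SailingShip"),
    ("my:Bounty", "dbprop:shipLength", 27.7),
]

patterns = [
    ("?ship", "rdf:type", "my:SailingShip"),
    ("?ship", "my:propulsion", "?engine"),
    ("?engine", "my:fuelType", "my:Diesel"),
    ("?ship", "dbprop:shipLength", "?length"),
]


def match(pattern, triple, binding):
    """Extend a variable binding so that pattern equals triple, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if isinstance(term, str) and term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(patterns, triples):
    """Join all patterns, keeping only bindings consistent across them."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings


# The FILTER step, written as an ordinary Python condition.
ships = [b["?ship"] for b in query(patterns, triples) if b["?length"] >= 80.0]
print(ships)
```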
24. Linked Open Data cloud
32. Linked Open Data cloud
Image of a table with some data
Source http://lod-cloud.net
33. 4 Rules of Linked Data
Use URIs as names for things
Use HTTP URIs so that people can look up those
names.
When someone looks up a URI, provide useful
information, using the standards (RDF*, SPARQL)
Include links to other URIs, so that they can discover more things.
http://www.w3.org/DesignIssues/LinkedData.html
34. Linked Data - Content Negotiation
Different views for different data consumers: Browser
35. Linked Data - Content Negotiation
Different views for different data consumers:
Applications
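Content negotiation works through the HTTP Accept header: the same URI is requested, but the client states which representation it prefers. The Python sketch below only builds the two requests and sends nothing over the network; `build_request` is a hypothetical helper, and which media types a given server actually offers is an assumption.

```python
from urllib.request import Request

URI = "http://dbpedia.org/resource/Gorch_Fock_(1958)"


def build_request(uri, media_type):
    """Prepare (but do not send) a request asking for one representation."""
    return Request(uri, headers={"Accept": media_type})


# A browser typically asks for HTML, an application for an RDF serialization.
browser_request = build_request(URI, "text/html")
application_request = build_request(URI, "application/rdf+xml")

print(browser_request.get_header("Accept"))
print(application_request.get_header("Accept"))
```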
36. Linked Data
A dataset is a set of RDF triples that is published,
maintained or aggregated by a single provider.
An RDF link is an RDF triple whose subject and object are described in different datasets.
A linkset is a collection of such RDF links between two datasets.
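These definitions can be sketched in a few lines of Python. As a simplifying assumption, a URI is taken to be described in a dataset if it starts with that dataset's namespace; the `http://example.org/my/` namespace and the helper names are hypothetical.

```python
# Hypothetical namespaces standing in for two datasets.
DATASETS = {
    "my": "http://example.org/my/",
    "dbpedia": "http://dbpedia.org/resource/",
}

triples = [
    ("http://example.org/my/Gorch_Fock",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Gorch_Fock_(1958)"),
    ("http://example.org/my/Gorch_Fock",
     "http://example.org/my/owner",
     "http://example.org/my/German_Navy"),
]


def dataset_of(uri, datasets):
    """Return the name of the dataset whose namespace the URI falls under."""
    for name, namespace in datasets.items():
        if uri.startswith(namespace):
            return name
    return None


def linkset(triples, datasets, source, target):
    """Collect the RDF links that lead from the source to the target dataset."""
    return [
        (s, p, o)
        for s, p, o in triples
        if dataset_of(s, datasets) == source and dataset_of(o, datasets) == target
    ]


links = linkset(triples, DATASETS, "my", "dbpedia")
print(links)
```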
38. Why go for the fifth star?
Central Contractor
Registration (CCR)
Geonames
Source: http://webofdata.wordpress.com/2011/05/22/why-we-link/
39. Open licences allow republishing and reuse
Motivation for collaboration:
High potential that invested efforts can be reused, i.e.
data, links, vocabularies, schemas
(Effortful) feedback: Users complement data, extend
vocabularies and contribute changes. VoCamps for
achieving coherence.
Source: Chiarcos, Hellmann, Nordhoff, Towards a Linguistic Linked Open Data cloud: The
Open Linguistics Working Group, Traitement Automatique des Langues, to appear
40. Example DBpedia
Data is extracted from Wikipedia
Wikipedia just publishes the
unstructured data
Small DBpedia team creates RDF
Community of stakeholders cleans
the data and creates links
Estimate:
10-20% to consolidate
community effort
42. Linked Data for Linguistics
43. Linked Data for Linguistics
Representation and modelling
Structural interoperability
Integrating distributed resources
Conceptual interoperability
Dynamic Import
45. Representation and modelling
Different linguistic subcommunities have developed
representation standards, e.g.,
LMF: Lexical Markup Framework (Francopoulo et al. 2009)
lexical-semantic resources
GrAF: Graph Annotation Framework (Ide and Suderman 2007)
for annotated corpora
based on labelled directed acyclic graphs (feature structures)
RDF data model: labelled directed (multi-)graphs
Uniform formalism for different resource types
Sublanguages (e.g., RDFS, OWL) allow defining domain-specific vocabularies
46. Structural interoperability
With different language resources represented in RDF,
we can combine both sources of information freely
cross-resource queries with RDF query languages (e.g.,
SPARQL)
Given a corpus with WordNet sense annotations (e.g., the
Manually Annotated Sub-Corpus MASC) (Ide et al. 2010)
“Retrieve all sentences that describe locations”
i.e., sentences containing a token annotated with a
WordNet sense that is a hyponym of “location”
Difficult to realize with GrAF or LMF
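The example query combines a corpus with a lexical-semantic resource. The Python sketch below mimics that combination on toy data: the sense labels are invented placeholders, not real WordNet identifiers, and the sentences are not taken from MASC.

```python
# Toy hypernym relation (child sense -> parent sense); labels are invented.
HYPERNYM = {
    "capital": "city",
    "city": "region",
    "region": "location",
    "dog": "animal",
}

# Toy corpus: sentence id -> list of (token, sense annotation or None).
corpus = [
    ("s1", [("Berlin", "city"), ("is", None), ("big", None)]),
    ("s2", [("The", None), ("dog", "dog"), ("barks", None)]),
]


def is_hyponym_of(sense, ancestor):
    """Walk up the hypernym chain and check whether it reaches the ancestor."""
    while sense in HYPERNYM:
        sense = HYPERNYM[sense]
        if sense == ancestor:
            return True
    return False


def sentences_about(ancestor):
    """Return ids of sentences with a token whose sense is below the ancestor."""
    return [
        sentence_id
        for sentence_id, tokens in corpus
        if any(sense and is_hyponym_of(sense, ancestor) for _, sense in tokens)
    ]


print(sentences_about("location"))
```

With both resources in RDF, the same join could be expressed as one cross-resource query instead of custom code per format.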
47. Integrating distributed resources
SPARQL supports nested subqueries to run on different
repositories
No physical integration of resources in a single data
base required
Easy to link to centralized repositories of reference
terminology, etc.
48. Conceptual interoperability
Resources should specify which vocabulary (e.g., for
annotation) they use and how it is defined
By reference to community-maintained terminology
repositories, e.g.,
GOLD (Farrar and Langendoen 2010)
ISOcat (Windhouwer and Wright @ LDL-2012)
Can be used, e.g., for disambiguation
If a lexeme in a lexicon has a certain morphosyntactic
categorization, we can retrieve all sentences from a
corpus with corresponding annotations
e.g., land as a noun, but not as a verb
49. Dynamic import
Linking resources implemented with URIs, which can be
resolved on-the-fly to update and enrich data sets
For a token in a corpus, additional information can be
aggregated from different repositories by resolving
links (retrieving senses from a lexical-semantic
repository or concepts from a terminology
repository)
If the information in the target resource was updated
since the original annotation was performed, then the
updates are available at query time
Inconsistencies can be avoided through versioning
50. Ecosystem, infrastructure and community
RDF and related standards are maintained by an active
and relatively large community
Different fields of application
Libraries, GeoData, BioMed, ...
Established W3C standard and technological
infrastructure
Linguistically relevant resources already provided
lexical-semantic resources (e.g., WordNet)
RDF facilitates distributed development, re-using data,
and, indirectly, interdisciplinary cooperation
51. Building a Linguistic Linked Open Data cloud
52. Building a Linguistic Linked Open Data cloud
In LOD cloud
Lexical Semantic
resources
Linguistic meta data
Further relevant types
for linguistic research:
Annotated corpora
Input and output of NLP
tools
Linguistic data bases
Repositories of linguistic
terminology
53. Building a Linguistic Linked Open Data cloud
Each single provider has different incentives to use
Linked Data and/or RDF
Concepts of RDF and Linked Data have been brought
up to solve open problems in different subcommunities
of linguistics and neighboring fields
As an illustration, we briefly introduce three examples
54. Building a Linguistic Linked Open Data cloud
Annotated corpora
Underlying problem: structural and conceptual
interoperability
Natural Language Processing for the semantic web
Underlying problem: NLP output represented in
idiosyncratic formalisms, results to be represented in
RDF
Typological data bases
Underlying problem: globally unique identifiers (not just for
“languages”, but for dialects, language families, etc.)
55. Annotated corpora
Linked Data and Corpus Interoperability
56. Linked Data and Corpus Interoperability
Linked Data can be used to address interoperability
issues of annotated corpora
Corpus: collection of texts developed to analyze
language and to develop tools for this purpose
=> Annotated corpora
Different types of annotations, different communities
involved, different languages
=> Interoperability challenge
57. Linked Data and Corpus Interoperability
Linked Data can be used to address interoperability
issues of annotated corpora
Corpus: collection of texts developed to analyze
language and to develop tools for this purpose
=> Annotation
Structural interoperability
Interoperable representation form
Different types of annotations, different communities
involved, different languages
Conceptual interoperability
Reference definitions for linguistic categories and features
=> Interoperability challenge
60. Structural Interoperability
Analyses produced by different researchers / NLP tools
use different representation formalisms
word annotations (‘tokens’)
span annotations (‘markables’)
tree-like annotations
61. Structural Interoperability
Analyses produced by different researchers / NLP tools
use different representation formalisms
relational annotations
63. Structural Interoperability
State-of-the-art approaches
Graph-based data model
Represent data in standoff XML
(Ide and Suderman 2007, Chiarcos et al. 2008, Eckart et al. @ LDL)
Presentation of Nancy Ide @ LDL 2012
64. XML standoff
MASC corpus, GrAF format
65. Working with XML standoff
How to store, retrieve and query XML standoff data efficiently?
Direct use with XML data bases inefficient (Eckart 2008)
Inline XML (e.g., Dipper et al. 2007)
Relational DB formats (e.g., Eckart et al. @ LDL)
RDF as another possibility (e.g., Chiarcos 2012)
Databases are optimized for graph querying
Extensive (open source) infrastructure available
Conceptual interoperability
Integration with Linked Data resources
66. Corpus Interoperability with RDF
Structural Interoperability
e.g. POWLA - http://purl.org/powla
lossless transformation to RDF from standoff XML
Linking to lexical-semantic resources (WordNet)
Conceptual Interoperability
Cross-Linking to terminology repositories (OLiA,
GOLD, ISOcat)
Entity-Linking to metadata (Geodata, LOD cloud)
68. NLP Interchange Format (NIF)
NIF is an RDF/OWL-based format
Achieve interoperability for:
Output of NLP tools
Linguistic data in RDF
Text documents
Web of Data (LOD cloud)
71. A Transparent Formalization of Text for Machines
Not transparent to machines
72. A Transparent Formalization of Text for Machines
Universe of discourse is defined as the words over the alphabet of Unicode characters
URI: http://example.org/sample#offset_0_42
Text: “The city Berlin is the capital of Germany.”
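The offset-based URI can be derived from character positions in the text. The Python sketch below only reproduces the URI shape shown on the slide; `offset_uri` is a hypothetical helper, and the NIF specification may define further parts of the fragment that are not covered here.

```python
TEXT = "The city Berlin is the capital of Germany."
DOCUMENT_URI = "http://example.org/sample"


def offset_uri(document_uri, text, begin=0, end=None):
    """Build a URI for the substring text[begin:end], identified by its offsets."""
    if end is None:
        end = len(text)
    return f"{document_uri}#offset_{begin}_{end}"


# URI for the whole string.
print(offset_uri(DOCUMENT_URI, TEXT))

# URI for the word "Berlin", located by its character offsets.
begin = TEXT.index("Berlin")
print(offset_uri(DOCUMENT_URI, TEXT, begin, begin + len("Berlin")))
```

Because every substring gets its own URI, annotations from different tools can be attached to the same stretch of text.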
73. NLP Interchange Format
Specification for NIF 1.0 (http://nlp2rdf.org/nif-1-0/)
different implementations (alpha/beta) are available as
Open Source (UIMA, Gate Annie, Stanford Parser,
DBpedia Spotlight)
Mailing list available at http://nlp2rdf.org
Demo: http://nlp2rdf.lod2.eu/demo.php
Poster during the poster session
Thursday 13:00-14:30
74. Typological databases
Glottolog/Langdoc
75. Glottolog/Langdoc
Two subprojects
Glottolog provides identifiers and additional
information for 100k languoids (languages, dialects,
families)
main competitor projects:
ISO 639-3/Ethnologue
Multitree
Langdoc provides identifiers and additional
information for 180k references
main competitor project: OLAC
76. Problems to address
existing identifiers are not granular enough (ISO 639-3: 7k)
existing identifiers have unclear reference (multitree
altc refers to both Micro-Altaic and Macro-Altaic)
existing identifiers have no verifiable empirical basis
Solutions
21k identifiers for main tree
total of 104k identifiers for all nodes of multitree trees
77. RDF
glottolog:12345 gl:sublanguoid glottolog:41202 .
glottolog:12345 gl:superlanguoid glottolog:94211 .
78. Langdoc
180k references to literature treating (mostly) lesser-
known languages
annotated for language, document type, macro-area
limited full text indexing
“give me any grammar or grammar sketch from an Afro-
Asiatic language spoken in Eurasia where the word
'dual' occurs in the text”
79. RDF
glottolog:12345 gl:immediatelydescribedin langdoc:23456 .
80. Position of G/L in the LLOD cloud
82. Outlook
83. From OWLG to DGfS
The Open Knowledge Foundation Working Group for
Open Data in Linguistics (OKFN-OWLG) was founded
in late 2010
We first established a series of meetings and a mailing
list
Build the structure, create momentum
Two workshops: OKCon 2011 in Berlin, this workshop
This afternoon: Christian Kreutz presents the OKFN
84. Building the Linguistic Linked Data Cloud
85. This workshop
Exploratory workshop
Chart domains as to the amount and kind of data which
can be integrated into the LLOD-cloud
increase coverage
more domains
increase density
more links between resources
increase discussion between independent
subcommunities
87. Spread the word
http://linguistics.okfn.org/
open-linguistics@lists.okfn.org
poster at DGfS-CL session on Thursday
Start of this workshop, first talk:
Declerck et al., “Towards Linked Language Data (LLD) for Digital Humanities”
88. We would like to thank
MPI
Springer
LOD2