In this tutorial we explain the basics of a 'Linked Data and Ontology' approach to combining data, in particular for the study of rare diseases. The approach is motivated by a case study provided by health care researcher Ulrike Braisch. The main take-home lesson is that this approach can substantially lower the effort needed for data integration, and thereby shorten the path to new treatments for (rare) diseases.
The presentation is based on a tutorial given at the RD-Connect/NeurOmics/EURenOmics plenary meeting in Heidelberg, Germany, February 26, 2014. It was made possible by RD-Connect, a European project to support Rare Disease research (http://www.rd-connect.eu).
Linked Data and Ontology Tutorial (for RD-Connect)
1. LINKED DATA AND ONTOLOGY
TUTORIAL
RD-Connect Tutorial, Heidelberg 2014
Marco Roos, Pedro Lopes, Mark Thompson, Rajaram Kaliyaperumal
Acknowledgements: Ulrike Braisch (ULM), Paul Groth and Frank van Harmelen (VU Amsterdam), BioSemantics group LUMC
RD-Connect Linked Data & Ontology Task Force, 2013-2014
2. 2
Agenda
1. Basic introduction to Linked Data
   1. The problem
   2. Linked Data Approach
   3. Linked Data Architecture
   4. Nanopublication
3. Introduction to Linked Data
Marco Roos (1), Pedro Lopes (2), Mark Thompson (1), Rajaram Kaliyaperumal (1)
(1) BioSemantics Group, Human Genetics Department, Leiden University Medical Center, The Netherlands – http://biosemantics.org
(2) Bioinformatics & Computational Biology Group, University of Aveiro, Portugal – http://bioinformatics.ua.pt
Acknowledgements: Ulrike Braisch (ULM), Paul Groth (VU Amsterdam), BioSemantics group EMC/LUMC, RD-Connect Linked Data & Ontology Task Force
4. 4
Ulrike Braisch’ Problem
Field | C (USA) | R2 (EU) | R3 (EU)
Education level | C_EDUC: 7 levels | Edlevel: 9 levels | Isced: 7 levels
Marital status | C_MARSTAT: never, now, separated, divorced, divorced | Maristat: single, married, partnership, divorced, widowed | Maristat: single, married, partnership, divorced, widowed
Age/date of birth | Age at baseline in years | Exact age at visit | Exact age at visit
I wish to correlate
patient characteristics
5. 5
Ulrike’s Problem: the data in the
fields pertain to very similar things,
but not exactly the same. How
similar she does not know a priori.
I wish to correlate
patient characteristics
6. 6
Ulrike Braisch’ Problem
Registry 1: A B C
Registry 2: A’ B’ C’
Registry 3: A’’ B’’ C’’
A ≠ A’ ≠ A’’, B ≠ B’ ≠ B’’, C ≠ C’ ≠ C’’
Can I rely on what I think the headers mean?
How to align the data?
I wish to correlate patient characteristics
7. 7
Solution 1: Ulrike solves the problem
Registry 1: A B C
Registry 2: A’ B’ C’
My ‘Registry’: A’’’ B’’’ C’’’
Ulrike has to do the alignment herself. She has to do the heavy lifting for data integration
8. 8
Not just Ulrike’s problem
I wish to...
correlate patient characteristics with CAG repeat length (Ulrike)
correlate clinical data with genome data (Bob)
compare Huntington data with Alzheimer data (Alice)
study social aspects of clinical surveys (Christian)
compute the commonalities between all diseases (Don)
9. 9
The data are valuable for many
people; they all face the same
problem
10. 10
Solution 1: Bob, Alice, Ulrike,
Christian, Don solve the problem
Registry 1
A B C
Registry 2
A’ B’ C’
They all
do the
heavy
lifting
11. 11
Can computers help? – NO!
Registry 1
A B C
Registry 2
A’ B’ C’
Computers
cannot help;
not for
alignment
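Why plain string matching cannot do the alignment can be sketched with the marital status categories from the registry table. The following is a minimal Python sketch, assuming the category labels are compared as literal strings; the variable and function names are illustrative and not part of any registry software.

```python
# Category labels as listed for the marital status fields of two registries.
c_marstat = ["never", "now", "separated", "divorced", "divorced"]
maristat = ["single", "married", "partnership", "divorced", "widowed"]


def shared_labels(left, right):
    """Return the labels that occur, as identical strings, in both lists."""
    return set(left) & set(right)


def unmatched_labels(left, right):
    """Return the labels of `left` that have no identical string in `right`."""
    return set(left) - set(right)


common = shared_labels(c_marstat, maristat)
only_c = unmatched_labels(c_marstat, maristat)
only_r = unmatched_labels(maristat, c_marstat)

print("shared:", sorted(common))
print("only in C_MARSTAT:", sorted(only_c))
print("only in Maristat:", sorted(only_r))
```

Only one label matches literally. Whether "never" and "single" describe the same situation is not something the strings can tell a computer; that knowledge has to be supplied by people.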
12. 12
Effort for data integration
[Diagram: the steps Experiment, Data generation, Data Integration, Analysis, Application, from Data to Knowledge, with a “Gain” label]
The (simplified) steps of data integration. How is the pain for data integration distributed?
13. 13
Effort for data integration
[Diagram: the steps Experiment, Data generation, Data Integration, Analysis, Application, from Data to Knowledge, now annotated with several “Pain” labels and one “Gain” label]
14. 14
Data are not explicitly prepared for data integration (apart from storing them in tables/files/databases). The pain of data integration is with Ulrike. Computers cannot help her with that.
15. 15
Linked Data = Redistribution of pain to enable computers to help us
[Diagram: the steps Experiment, Data generation, Integration, Analysis, Application, from Data to Knowledge, with the “Pain” and “Gain” labels redistributed over the steps]
“Linked Data” moves the pain and enables computers
16. 16
The goal of “Linked Data”
Take home message: “Linked Data” does not take the pain of data integration away; alignment remains necessary. But it moves the pain to data experts, making the overall workflow more efficient. And it enables computers to help.
Next we explain how…
17. Linked Data and Ontology approach
The three layers of data “harmonization”
The key role of “Uniform Resource Identifiers”
Saying things with Linked Data
Linked Data Infrastructure
20. 20
Disentangling harmonization
Harmonize what is measured and how
Harmonize classification and relations (meaning)
Harmonize how we make it computable
21. 21
Disentangling harmonization
1) Harmonize what is measured and how: Consensus
2) Harmonize classification and relations (meaning): Ontologies
3) Harmonize how we make it computable: Linked Data
(1) is about agreement between people, (2) is about how to call things in our data, (3) is about enabling computers to help
22. 22
Disentangling harmonization
Harmonize what is measured and how: Consensus (Agreement)
Harmonize classification and relations (meaning): Ontologies (Semantics)
Harmonize how we make it computable: Linked Data (Syntax)
Ontologies have 2 roles: (i) enforce compliance with the consensus, (ii) convey meaning to computers; they have a human and computer-readable representation
23. 23
Use of ontologies, but not Linked Data
Field | C (USA) | R2 (EU) | R3 (EU) | Ontology
Education level | C_EDUC: 7 levels | Edlevel: 9 levels | Isced: 7 levels | Onto:1234
Marital status | C_MARSTAT: never, now, separated, divorced, divorced | Maristat: single, married, partnership, divorced, widowed | Maristat: single, married, partnership, divorced, widowed | Onto:2345
Age/date of birth | Age at baseline in years | Exact age at visit | Exact age at visit | Onto:3456
Perhaps confusing, but ontology identifiers (like GO or HPO
IDs) are often not readily readable for computers...
24. 24
For a computer these ontology identifiers are but a string of symbols; adding these IDs to a table is good, but it is not Linked Data yet.
25. 25
Uniform Resource Identifier
Linked Data: unique computer-readable identifiers
<URI> <URI> <URI> <URI>
<URI> <URI> <URI> <URI> <URI>
<URI> <URI> <URI> <URI> <URI>
<URI> <URI> <URI> <URI> <URI>
This is more like it for
computers!
26. 26
‘Uniform Resource Identifiers’ are identifiers for computers
The URI is an Internet standard (IETF RFC 3986), developed in coordination with the World Wide Web Consortium (W3C)
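The step from an ontology ID in a table cell to a URI can be sketched as a simple prefix expansion. This is a minimal Python sketch under stated assumptions: the prefix "Onto" comes from the placeholder IDs in the table, while its base URI (http://example.org/onto/) and the function names are made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical namespace: "Onto" and its base URI are placeholders,
# not a real ontology.
PREFIXES = {"Onto": "http://example.org/onto/"}


def expand(compact_id, prefixes=PREFIXES):
    """Turn 'Prefix:local' into a full URI using a prefix table."""
    prefix, sep, local = compact_id.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {compact_id!r}")
    return prefixes[prefix] + local


def is_absolute_uri(text):
    """Rough check for web-style URIs: a scheme and a host must be present."""
    parts = urlparse(text)
    return bool(parts.scheme) and bool(parts.netloc)


education_uri = expand("Onto:1234")
print(education_uri)
print(is_absolute_uri("Onto:1234"), is_absolute_uri(education_uri))
```

The bare ID fails the check and the expanded form passes it, which is the difference between a string of symbols and a reference a computer can follow.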
31. 31
Reuse of technology:
world wide web hyperlinks
<a href="http://www.ncbi.nlm.nih.gov/pubmed/18927111">
For Linked Data we simply
reuse what made the World
Wide Web such a success:
the hyperlink…
What is different?...
32. 32
Documents for human consumption
Document 1
Document 2
http://www.ncbi.nlm.nih.gov/pubmed/18927111
Hyperlinks (URIs) link documents
The Web as we know it links
documents for humans
33. 33
Data for computer consumption
http://www.ncbi.nlm.nih.gov/pubmed/18927111
Hyperlinks (URIs) can link data
‘Linked Data’ links data for computers
(enabling them to support us)
37. 37
Can we say things with URIs?
Subject | Predicate | Object
<HDAC1> | <interacts with> | <ParvB>
<malaria> | <is transmitted by> | <mosquitos>
<mutation X> | <has frequency> | <0.25%>
Subject, Predicate, and Object are each URIs
URIs are not for humans, but they are often supplied with a web page for humans…
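The subject, predicate, object pattern can be sketched as a list of three-part tuples. This is an illustrative Python sketch, not a real triple store: only the UniProt URI for HDAC1 appears in these slides; every http://example.org/ URI and the helper objects_of are hypothetical placeholders.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Only the UniProt URI is taken from the slides (it is shown for HDAC1);
# every http://example.org/ URI below is a made-up placeholder.
EX = "http://example.org/"
HDAC1 = "http://purl.uniprot.org/uniprot/Q13547"

statements = [
    Triple(HDAC1, EX + "interacts_with", EX + "ParvB"),
    Triple(EX + "malaria", EX + "is_transmitted_by", EX + "mosquitos"),
    Triple(EX + "mutation_X", EX + "has_frequency", EX + "0.25_percent"),
]


def objects_of(subject, predicate, triples):
    """All objects stated for a given subject and predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


print(objects_of(HDAC1, EX + "interacts_with", statements))
```

Because every part is a URI, a question such as "what does HDAC1 interact with?" becomes a matter of comparing identifiers rather than interpreting column headers.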
43. 43
Things we can say
http://purl.uniprot.org/uniprot/Q13547
We said all that by just this reference
URIs are references. No need to download a whole ontology or all of UniProt in your own knowledge base
What kind of things can we say?
44. 44
http://purl.uniprot.org/uniprot/Q13547
<URI for a type of relation>
<URI for object of relation>
Things we can say: relation
http://purl.uniprot.org/uniprot/Q13547
http://conceptwiki.org/index.php/Concept:e6559...
http://bio2rdf.org/geneid:29780
“HDAC1”
We already saw the
(biological) relation
47. 47
http://purl.uniprot.org/uniprot/Q13547
<URI for “is of type”>
<URI for class Protein>
<URI for “has label”>
“Protein”
Things we can say: classify + human
readable labels
“HDAC1”
…and we add a
label for this class.
48. 48
Classification is special:
here is where Linked
Data and Ontologies
meet
49. 49
http://purl.uniprot.org/uniprot/Q13547
<URI for “is of type”>
<URI for class Protein>
<URI for “label”>
“Protein”
Things we can say: human readable
labels
This is
from an
ontology!
Good ontologies have a
“URI” representation
(format: OWL/RDF)
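Classification and labelling can be written out as concrete statements. The sketch below prints them in N-Triples style using plain string formatting. The rdf:type and rdfs:label URIs are the standard W3C ones; the class URI for Protein is an assumed placeholder, and the two helper functions are illustrative only.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

HDAC1 = "http://purl.uniprot.org/uniprot/Q13547"
# Assumed class URI, used here only for illustration.
PROTEIN_CLASS = "http://example.org/ontology/Protein"


def uri_statement(subject, predicate, obj):
    """One statement whose object is another URI."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def label_statement(subject, predicate, text):
    """One statement whose object is a human-readable string."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


lines = [
    uri_statement(HDAC1, RDF_TYPE, PROTEIN_CLASS),     # "is of type"
    label_statement(HDAC1, RDFS_LABEL, "HDAC1"),       # label for the thing
    label_statement(PROTEIN_CLASS, RDFS_LABEL, "Protein"),  # label for the class
]
print("\n".join(lines))
```

The first line is where data and ontology meet: the subject comes from a data resource, the object from an ontology.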
50. 50
Knowledge and data represented by graphs
[Graph: nodes “parvb”, “HDAC1”, “genome location <…>”, “Homo Sapiens”, “Species”, “Genome Location”, “Protein”, “Gene” and “Biological Entity”; edges labelled “Interacts with”, “has genome location”, “in species”, “encodes”, “instance of” and “subclass of”]
With Linked Data we build knowledge graphs. NB we decide what to include per application.
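What a computer can do with such a graph can be sketched with a handful of edges. This is a minimal Python sketch: the node and edge labels are taken from the slide, but which nodes each edge connects is an assumption, because the extracted slide text does not preserve that. Readable labels are used instead of URIs to keep the example short.

```python
# A few edges in the spirit of the graph on the slide. Which node each
# edge connects is an assumption; the extracted slide text does not say.
edges = {
    ("HDAC1", "instance of", "Protein"),
    ("parvb", "instance of", "Protein"),
    ("HDAC1", "interacts with", "parvb"),
    ("Protein", "subclass of", "Biological Entity"),
    ("Gene", "subclass of", "Biological Entity"),
}


def superclasses(cls, graph):
    """All classes reachable from `cls` by following 'subclass of' edges."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in graph:
            if s == current and p == "subclass of" and o not in found:
                found.add(o)
                todo.append(o)
    return found


def types_of(node, graph):
    """Direct types of a node plus everything those types are a subclass of."""
    direct = {o for s, p, o in graph if s == node and p == "instance of"}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls, graph)
    return direct | inferred


print(sorted(types_of("HDAC1", edges)))
```

Nobody stated that HDAC1 is a Biological Entity; it follows from the class hierarchy. This kind of simple inference is what ontologies contribute to a knowledge graph.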
53. 53
Knowledge and data represented by graphs
myNanopub:myAssertion
Our name is
on this now
55. 55
http://purl.uniprot.org/uniprot/Q13547
<URI for “also referred to as”>
<URI in other resource>
Things we can say: mappings
Vocabularies exist for sophisticated mapping (also as URIs)
We can do that in a precise and subtle way
56. 56
By using these URIs
Ulrike Braisch’ Problem
Field | <URI for C> | <URI for R2> | <URI for R3>
<URI for Education level> | <URI for C_EDUC>: <URIs for 7 levels> | <URI for Edlevel>: <URIs for 9 levels> | <URI for Isced>: <URIs for 7 levels>
<URI for Marital status> | <URI for C_MARSTAT>: <URIs for never, now, separated, divorced, divorced> | <URI for Maristat>: <URIs for single, married, partnership, divorced, widowed> | <URI for Maristat>: <URIs for single, married, partnership, divorced, widowed>
<URI for Age/date of birth> | <URI for Age at baseline in years> | <URI for Exact age at visit> | <URI for Exact age at visit>
I wish to correlate
patient characteristics
with CAG repeat length
If Ulrike’s table were
Linked Data…
57. 57
Linked Data for Ulrike
Subject | Predicate | Object
<URI for C>, <URI for R2>, <URI for R3> | <URI for “is of type”> | <URI for RD resource>
<URI for Edlevel level 3> | <URI for “is narrower than”> | <URI for C_EDUC level 2>
<URI for Isced level 3> | <URI for “is same as”> | <URI for C_EDUC level 2>
<URI for C_MARSTAT:divorced> | <URI for “is same as”> | <URI for Maristat:divorced>
<URI for C_MARSTAT:never> | <URI for “is related to”> | <URI for Maristat:single>
<URI for C_MARSTAT>, <URI for Maristat> | <URI for “subclass of”> | <URI for Marital status>
We also
say…
Remember:
URI = ID + Reference
+ Computable
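How such mapping statements let a computer decide which values may be pooled can be sketched as follows. This is a minimal Python sketch under stated assumptions: short local names stand in for the URIs, the prefixes "C:", "R2:" and "R3:" are invented for readability, and Maristat is attributed to R2 only to keep the example small. A real deployment would more likely use an existing mapping vocabulary than these ad hoc predicate strings.

```python
# Short local names stand in for the URIs on the slide; the prefixes
# ("C:", "R2:", "R3:") and the predicate spellings are illustrative.
mappings = [
    ("R2:Edlevel/3", "is narrower than", "C:C_EDUC/2"),
    ("R3:Isced/3", "is same as", "C:C_EDUC/2"),
    ("C:C_MARSTAT/divorced", "is same as", "R2:Maristat/divorced"),
    ("C:C_MARSTAT/never", "is related to", "R2:Maristat/single"),
]


def equivalents(term, triples):
    """Terms linked to `term` by 'is same as', followed in both directions."""
    same, todo = {term}, [term]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if p != "is same as":
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in same:
                    same.add(b)
                    todo.append(b)
    return same


def comparable(left, right, triples):
    """True only when the mappings state that both terms mean the same."""
    return right in equivalents(left, triples)


print(comparable("R3:Isced/3", "C:C_EDUC/2", mappings))
print(comparable("R2:Edlevel/3", "C:C_EDUC/2", mappings))
print(comparable("C:C_MARSTAT/never", "R2:Maristat/single", mappings))
```

Only the "is same as" pair is treated as interchangeable. The "is narrower than" and "is related to" statements record a weaker relation, so an analysis can choose explicitly whether to accept them. This is the precise and subtle mapping the earlier slide refers to.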
58. 58
Conclusions (1/2)
Linked Data is not painless data integration and computer reasoning:
Harmonization moved up to early data management
More efficient, modelling effort is reused
Pain: semantic model for new data
Early days for reasoning: we need your Linked Data first!
59. 59
Conclusions (2/2)
Linked Data is a way to enable computers to help harmonize:
Everything has a unique reference
Ontologies say what data means
Mappings specify the relation between datasets
Data integration (almost) trivial
Enable computing with knowledge
60. Linked Data Architecture
25 April 2014
In the next few slides we
show (simplified) how
Linked Data systems work
61. 61
Most common use: common reference
[Diagram labels: Smoker, Heavy smoker, Light smoker; Gene Expression Database; Clinical Registry; Linked Data Exchange]
62. 62
Ontologies in
Linked Data
provide a
reference for
systems
whatever internal
structures they use
63. 63
Systems do not have to
agree on one fixed
schema
One common link suffices
to connect resources
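The idea that one common link suffices can be sketched with two systems that keep their own internal structure and share only a term URI. This is an illustrative Python sketch: the URI, the local code "S2", the record layouts and all identifiers are hypothetical.

```python
# Placeholder URI for a shared ontology term; not a real identifier.
HEAVY_SMOKER = "http://example.org/onto/HeavySmoker"

# System 1: a clinical registry with its own local codes.
registry_rows = [
    {"patient": "P1", "smoking_code": "S2"},
    {"patient": "P2", "smoking_code": "S0"},
]
registry_code_to_uri = {"S2": HEAVY_SMOKER}

# System 2: a gene expression database with a different internal layout.
expression_samples = [
    ("sample-17", "P1", {"phenotype": HEAVY_SMOKER}),
    ("sample-18", "P3", {"phenotype": HEAVY_SMOKER}),
]


def registry_patients(term_uri):
    """Registry patients whose local code maps to the shared term."""
    return {row["patient"] for row in registry_rows
            if registry_code_to_uri.get(row["smoking_code"]) == term_uri}


def expression_sample_ids(term_uri):
    """Expression samples annotated with the shared term."""
    return {sample_id
            for sample_id, _patient, annotations in expression_samples
            if annotations.get("phenotype") == term_uri}


print(sorted(registry_patients(HEAVY_SMOKER)))
print(sorted(expression_sample_ids(HEAVY_SMOKER)))
```

Neither system had to adopt the other's schema. Both only had to point at the same reference.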
64. 64
Typical Linked Data architecture for data integration applications
[Diagram labels: Source 1, Source 2, Source 3, each with Exposed Linked Data; Linked Data Cache (e.g. running COEUS); Interface (user dependent); Case Study]
65. 65
Linked Data can be integrated in a cache
Integration is trivial when sources are well-formed Linked Data: when the same URIs were used for the same things, integration is instant
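Why integration becomes instant when sources use the same URIs can be sketched with Python sets. This is a toy model of a cache, not how COEUS or any triple store is implemented; the namespace, the property names and all values are made up for illustration.

```python
EX = "http://example.org/"  # placeholder namespace, not a real resource

# Two sources that happen to use the same URI for the same patient.
# All values are invented for the example.
source_1 = {
    (EX + "patient/1", EX + "hasMaritalStatus", EX + "divorced"),
    (EX + "patient/1", EX + "hasEducationLevel", EX + "level/2"),
}
source_2 = {
    (EX + "patient/1", EX + "hasCagRepeatLength", "42"),
    (EX + "patient/1", EX + "hasMaritalStatus", EX + "divorced"),
}

# The "cache": integration is a plain set union, duplicates collapse.
cache = source_1 | source_2


def describe(subject, triples):
    """Collect what the merged data says about one subject.

    Assumes at most one value per predicate, which holds in this example.
    """
    return {p: o for s, p, o in triples if s == subject}


merged = describe(EX + "patient/1", cache)
print(len(cache))
print(sorted(merged))
```

No alignment code was needed at merge time. The alignment effort was spent earlier, when each source chose its URIs.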
66. [Diagram labels: Applications; Linked Data API (RDF/XML, TTL, JSON); Core Platform with Semantic Workflow Engine, Data Cache (Virtuoso Triple Store), Domain Specific Services, Identity Resolution Service, Identifier Management Service, Chemistry Registration, Normalisation & Q/C, Data Import; sources marked Nanopub, Db and VoID, grouped as Public Content, Commercial, Public Ontologies and User Annotations; example identifiers P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”]
OpenPHACTS uses Linked Data for drug discovery
67. Claim your findings as Nanopublications
Nanopublication
Mark Thompson, Rajaram Kaliyaperumal
It was me, me, me!
Finally, a word about Nanopublication, because in our opinion your data contributions should be acknowledged
68. 68
What do you say with a Nanopublication?
Minimal statement for which you deserve credit
How you came to say it (provenance)
Who should be cited
Preferred Format: Linked Data!
Nanopublication
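The three things a nanopublication says can be sketched as a small data structure. This is a hedged Python sketch, not the nanopublication schema itself: the class, its field names, the example.org URIs, the creator and the date are all hypothetical, and a real nanopublication would express each part as Linked Data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Nanopublication:
    """Three parts, following the bullets on the slide."""
    assertion: tuple   # the minimal statement: (subject, predicate, object)
    provenance: dict   # how you came to say it
    publication_info: dict = field(default_factory=dict)  # who should be cited

    def citation(self):
        """A one-line, human-readable credit for the assertion."""
        creator = self.publication_info.get("creator", "unknown")
        created = self.publication_info.get("created", "n.d.")
        subject, predicate, obj = self.assertion
        return f"{creator} ({created[:4]}): <{subject}> <{predicate}> <{obj}>"


EX = "http://example.org/"  # placeholder namespace
nanopub = Nanopublication(
    assertion=("http://purl.uniprot.org/uniprot/Q13547",
               EX + "interacts_with", EX + "ParvB"),
    provenance={"derived_from": EX + "experiment/1"},
    publication_info={"creator": "A. Researcher", "created": "2014-02-26"},
)
print(nanopub.citation())
```

Keeping the assertion separate from its provenance and publication information is what makes the statement both reusable and attributable.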
69. 69
Science
Good Science
Acknowledged Good Science
Digital
70. 70
Fame and glory (and reproducibility): Nanopublication!
[Diagram: the steps Experiment, Data generation, Integration, Analysis, Application, from Data to Knowledge, with “Pain” and “Gain” labels and two added “Gain: Nanopublications” labels]
71. 71
A new type of gain is the
credit you can get for data
publication
72. Acknowledgements
Ulrike Braisch (University of Ulm, Germany)
RD-Connect (EU-FP7)
Leiden University Medical Center
Dutch Tech Centre for Life Sciences
RD-Connect Linked Data and Ontology Task Force, in particular: Pedro Lopes, Rachel Thompson, David Salgado, Peter Robinson, Manuel Posada, Estrella Lopez Martin, Mark Thompson, Michael Orth, David van Enckevort
BioSemantics team LUMC: Kristina Hettne, Eleni Mina, Tareq Malas, Herman van Haagen, Peter-Bram 't Hoen, Rajaram Kaliyaperumal, Zuotian Tatum, Eelke van der Horst, Mark Thompson, Barend Mons
These slides are partly based on input and inspiration from Frank van Harmelen, Paul Groth, Scott Marshall, Andrew Gibson, Katy Wolstencroft, Jun Zhao, Robert Stevens, Carole Goble, W3C Health Care and Life Science Interest Group
Thank you for your attention…