Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System
1. Towards A Mashup To Build
Bioinformatics Knowledge System
François Belleau, Marc-Alexandre Nolin,
Nicole Tourigny, Philippe Rigault, Jean Morissette
Département d'informatique et de génie logiciel
Université Laval
2. Presentation Plan
Knowledge integration vision
Bio2RDF architecture
RDFization of knowledge
Normalization of URIs
Parkinson Example Demo
Conclusion
Banff, May 8, 2007 CHUL research center - Laval University 2
3. From the RDF inventor:
"Wouldn't it be great if you were able to
organize all this information based on
your own terms, instead of based on the
application you use to access the
information?" (1999)
Ramanathan V. Guha
From Wikipedia:
Mashup (web application hybrid)
A mashup is a website or application that
combines content from more than one source
into an integrated experience. (2007)
4. Sir Tim Berners-Lee's vision of the semantic web
« The Semantic Web is not a separate
Web but an extension of the current
one, in which information is given
well-defined meaning, better enabling
computers and people to work in
cooperation. »
Scientific American, 2001
Tim Berners-Lee
http://www.w3.org/2006/Talks/0404-mit-tbl/
5. Bio2RDF starting vision at ISMB 2005
Too many knowledge sources
available for life science scientists
Too many formats (text, XML, HTML)
New sources each day, each with a
specialized tool or web interface
Integration problem recognized by
the global community
Thanks to Christopher Baker, Eric
Neumann, Kei Cheung and
Joanne Luciano for their ideas.
6. The knowledge integration problem in
bioinformatics
From the BioPAX group (2004)
From Carol Goble at ISWC 2005
7. Integration methods in bioinformatics
1) Davidson 1995
“Transform data to the federated database on
demand”
2) Köhler 2003
“In different databases the same things can be
given different names”
3) Stein 2003
“link integration, view integration and data
warehousing”
8. Data warehouse approaches
http://www.ncbi.nlm.nih.gov/Database/
http://www.genome.jp/dbget/dbget.links.html
9. Bio2RDF's approach
to knowledge integration:
"Solve the problem of knowledge
integration in biology by applying
a semantic web approach."
10. Other semantic web projects
11. Bio2RDF's design rules
2. Convert documents to RDF format;
3. Use a triplestore technology (Sesame,
Virtuoso, Oracle);
4. Normalize URIs;
5. Build a mashup as needed to answer a specific
question (Elmo);
6. Query the mashup with SeRQL or SPARQL.
12. Bio2RDF's architecture
(Architecture diagram; its components are labelled #1 to #6.)
13. Bio2RDF's knowledge sources
14. RDF conversion statistics
Data source  LSID example              Number of RDF documents  Size of data converted
go           go:0000001                                 22 961             507 963 321
kegg         path:aae00010                              35 257           1 038 593 137
kegg         cpd:c00001                                 14 292               8 902 205
mgi          mgi:96103                                 438 724             210 458 897
ncbi         omim:100050                                17 359             573 639 380
ncbi         geneid:1                                2 744 786          67 225 535 082
obo          obo's 59 name spaces                      279 720             216 007 267
pdb          pdb:100d                                   34 421          16 309 651 935
uniprot      uniprot:A0A000                          4 177 176          29 453 203 064
uniprot      enzyme:1.-.-.-                              5 020               2 844 058
uniprot      pubmed:100133                             191 664             364 728 083
uniprot      taxonomy:10                               337 564             125 630 659
uniprot      uniref:UniRef100_A0A000                 7 990 452          14 865 490 144
…            …                                               …                       …
15. OpenRDF's software
http://www.openrdf.org/
16. RDF of geneid:15275
• rdf:about
• rdfs:label
• dc:identifier, title, created
• bio2rdf:lsid
• bio2rdf:url
• bio2rdf:synonym
• bio2rdf:xRef
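The properties listed on this slide can be pictured as a handful of triples about one subject. Below is a minimal Python sketch, not the Bio2RDF code itself. It assumes that the `bio2rdf:` prefix expands to `http://bio2rdf.org/bio2rdf#` (the namespace used in the SeRQL query of slide 32) and that `dc:` is the Dublin Core elements namespace. The helper `describe` and all label, synonym and cross-reference values are hypothetical placeholders, since the actual content of geneid:15275 is not reproduced here.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DC = "http://purl.org/dc/elements/1.1/"      # assumed expansion of dc:
BIO2RDF = "http://bio2rdf.org/bio2rdf#"      # assumed expansion of bio2rdf:
BASE = "http://bio2rdf.org/"


def literal(value):
    """Quote a plain string as an N-Triples literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def describe(record_id, label, synonyms=(), xrefs=()):
    """Return N-Triples lines describing one record (hypothetical helper)."""
    subject = f"<{BASE}{record_id}>"
    lines = [
        f"{subject} <{RDFS}label> {literal(label)} .",
        f"{subject} <{DC}identifier> {literal(record_id)} .",
    ]
    for synonym in synonyms:
        lines.append(f"{subject} <{BIO2RDF}synonym> {literal(synonym)} .")
    for xref in xrefs:
        # Cross-references point at other normalized Bio2RDF URIs.
        lines.append(f"{subject} <{BIO2RDF}xRef> <{BASE}{xref}> .")
    return lines


# Placeholder values only; they do not describe the real record.
for line in describe(
    "geneid:15275",
    "placeholder label",
    synonyms=["placeholder synonym"],
    xrefs=["other:0002"],
):
    print(line)
```

Because cross-references are emitted as URIs rather than literals, two documents that mention the same identifier end up sharing a node once they are loaded into the same triplestore.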
17. RDFizer
To rdfize: to convert an existing
document into RDF format.
efetch rdfizer
18. How to rdfize
• From HTML pages (prosite:ps00101)
• From XML documents using XSLT (path:mmu00010)
• From XML documents using XPath and JSTL (geneid:15275)
• From direct SQL access (ensembl:ensmusg00000025875)
• From RDF documents (uniprot:p26838)
• From text files (cpd:c00001)
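As a rough illustration of the text-file case, the sketch below pulls fields out of a flat "KEY value" record with a regular expression and emits triples as Python tuples. The record layout, the field names `ENTRY` and `NAME`, the sample text and the function `rdfize_text` are all assumptions made for the example; they are not taken from the Bio2RDF rdfizer programs.

```python
import re

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DC_IDENTIFIER = "http://purl.org/dc/elements/1.1/identifier"
BASE = "http://bio2rdf.org/"

# Matches lines such as "ENTRY       C00001" (layout assumed for illustration).
FIELD = re.compile(r"^(?P<key>[A-Z]+)[ \t]+(?P<value>\S.*)$", re.MULTILINE)


def rdfize_text(namespace, text):
    """Convert one flat-text record into (subject, predicate, object) tuples."""
    fields = {m.group("key"): m.group("value").strip() for m in FIELD.finditer(text)}
    # Identifiers are lowercased so the resulting URI is already normalized.
    record_id = f"{namespace}:{fields['ENTRY'].lower()}"
    subject = BASE + record_id
    triples = [(subject, DC_IDENTIFIER, record_id)]
    if "NAME" in fields:
        triples.append((subject, RDFS_LABEL, fields["NAME"]))
    return triples


# Hypothetical record text, not copied from any real database file.
sample = "ENTRY       C00001\nNAME        sample compound name\n"
for triple in rdfize_text("cpd", sample):
    print(triple)
```

The other source types on this slide (HTML, XML, SQL, RDF) would differ only in the extraction step; the output is the same kind of triples in each case.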
19. 1) prosite:ps00101 from HTML using a regex
20. 2) Kegg's path:mmu00010 from XML using XSL
22. 4) uniprot:p26838 from RDF using SeRQL
23. One reality, many names
● Different namespace identifier
  pubmed:11992264 vs pmid:11992264
● Uppercase and lowercase
  uniprot:p26838 vs uniprot:P26838
● Version number
  genbank:ac008393 vs genbank:ac008393.7
● Total id length
  go:0032283 vs go:32283
24. RDFizing documents is not enough,
we also need normalized URIs.
http://bio2rdf.org/namespace:id
http://bio2rdf.org/pubmed:11992264
http://bio2rdf.org/uniprot:p26838
http://bio2rdf.org/genbank:ac008393
http://bio2rdf.org/go:0032283
25. URI Normalization rules
● Different namespace identifier
  We resolve namespace synonymy with a urlrewrite rule, for
  example pubmed and pmid.
● Uppercase and lowercase
  We write every URI in lowercase.
● Version number
  An owl:sameAs predicate is used to link the different versions
  of a document.
● Total id length
  A fixed length is determined for each id.
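The four rules can be sketched as one small function. This is only an illustration under stated assumptions: Bio2RDF applies the namespace rule with a urlrewrite rule on the server, whereas this sketch uses a Python dictionary. The tables `NAMESPACE_SYNONYMS`, `VERSIONED_NAMESPACES` and `FIXED_ID_LENGTH` contain only the cases shown on the slides, and treating the unversioned URI as the canonical one is an assumption.

```python
BASE = "http://bio2rdf.org/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

NAMESPACE_SYNONYMS = {"pmid": "pubmed"}  # synonym pair from the slide
VERSIONED_NAMESPACES = {"genbank"}       # assumed for this sketch
FIXED_ID_LENGTH = {"go": 7}              # go:32283 becomes go:0032283


def normalize(identifier):
    """Return (canonical_uri, same_as_triples) for a 'namespace:id' string."""
    # Rule 2: everything is written in lowercase.
    namespace, _, local_id = identifier.strip().lower().partition(":")
    # Rule 1: namespace synonyms collapse to a single name.
    namespace = NAMESPACE_SYNONYMS.get(namespace, namespace)
    same_as = []
    # Rule 3: a versioned id is linked to the unversioned one with owl:sameAs.
    if namespace in VERSIONED_NAMESPACES and "." in local_id:
        unversioned = local_id.rsplit(".", 1)[0]
        versioned_uri = f"{BASE}{namespace}:{local_id}"
        same_as.append((f"{BASE}{namespace}:{unversioned}", OWL_SAME_AS, versioned_uri))
        local_id = unversioned
    # Rule 4: ids are left-padded with zeros to a fixed length.
    if namespace in FIXED_ID_LENGTH:
        local_id = local_id.zfill(FIXED_ID_LENGTH[namespace])
    return f"{BASE}{namespace}:{local_id}", same_as


for raw in ("pmid:11992264", "uniprot:P26838", "genbank:ac008393.7", "go:32283"):
    print(raw, "->", normalize(raw)[0])
```

Applied to the four variant identifiers of slide 23, the function yields the four normalized URIs listed on slide 24.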
27. URL vs LSID
http://bio2rdf.org/uniprot:p26838
owl:sameAs
urn:lsid:uniprot.org:uniprot:p26838
http://bio2rdf.org/uniprot:p26838
http://bio2rdf.org/urn:lsid:uniprot.org:uniprot:p26838
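The relation between the two identifier forms can be sketched as follows. The function names and the `LSID_AUTHORITY` table are hypothetical; only the uniprot entry and the URI patterns come from the slide, and the `urn:lsid:authority:namespace:object` layout is the general LSID form.

```python
BASE = "http://bio2rdf.org/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical lookup table; only the uniprot entry appears on the slide.
LSID_AUTHORITY = {"uniprot": "uniprot.org"}


def to_url(identifier):
    """Build the Bio2RDF URL for a normalized 'namespace:id' string."""
    return BASE + identifier


def to_lsid(identifier):
    """Build the LSID for a normalized 'namespace:id' string."""
    namespace, _, local_id = identifier.partition(":")
    return f"urn:lsid:{LSID_AUTHORITY[namespace]}:{namespace}:{local_id}"


def same_as_triple(identifier):
    """State that the URL form and the LSID form name the same resource."""
    return (to_url(identifier), OWL_SAME_AS, to_lsid(identifier))


def lsid_resolver_url(identifier):
    """URL under which the LSID form itself is served, as on the slide."""
    return BASE + to_lsid(identifier)


print(same_as_triple("uniprot:p26838"))
print(lsid_resolver_url("uniprot:p26838"))
```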
28. Our method to answer questions
To answer a very specialized
question, we build a specific
knowledge base (the mashup
stored in an RDF triplestore)
and then query it with SeRQL.
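The idea of "load several sources, then query the merged graph" can be illustrated without a real triplestore. The sketch below swaps in a plain Python set for Sesame and a pattern-matching method for SeRQL, so it shows the principle only. The class `Mashup`, the function `labels_of_xrefs` and the `urn:example:` data are hypothetical; the two predicate URIs are the ones used in the SeRQL query of slide 32.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
XREF = "http://bio2rdf.org/bio2rdf#xRef"


class Mashup:
    """In-memory stand-in for a triplestore holding the merged sources."""

    def __init__(self):
        self.triples = set()

    def load(self, triples):
        """Merge one source into the mashup; shared URIs join automatically."""
        self.triples.update(triples)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


def labels_of_xrefs(mashup, target):
    """Labels of every resource that cross-references `target`."""
    labels = []
    for subject, _, _ in mashup.match(p=XREF, o=target):
        labels.extend(obj for _, _, obj in mashup.match(s=subject, p=RDFS_LABEL))
    return labels


# Two hypothetical sources that mention the same normalized URI.
source_a = [
    ("urn:example:a", XREF, "urn:example:x"),
    ("urn:example:a", RDFS_LABEL, "record A"),
]
source_b = [
    ("urn:example:b", XREF, "urn:example:x"),
    ("urn:example:b", RDFS_LABEL, "record B"),
    ("urn:example:c", XREF, "urn:example:y"),
    ("urn:example:c", RDFS_LABEL, "record C"),
]

mashup = Mashup()
mashup.load(source_a)
mashup.load(source_b)
print(labels_of_xrefs(mashup, "urn:example:x"))
```

The join across sources works only because both of them use the same URI for the shared resource, which is why URI normalization comes before the mashup step.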
29. Parkinson examples
1. What is the semantic network of
OMIM records describing Parkinson’s
disease?
2. Which MeSH terms are most often cited
in Parkinson’s disease publications?
3. What genes related to Parkinson’s
disease are involved in pathways
according to Kegg?
30. Time for demo!
31. The big everything about Parkinson
http://localhost:8080/bio2rdf/search:parkinson@omim
http://localhost:8080/bio2rdf/search:parkinson@geneid
http://localhost:8080/bio2rdf/search:parkinson@uniprot
http://localhost:8080/bio2rdf/search:parkinson@kegg
http://localhost:8080/bio2rdf/load:pubmed
http://localhost:8080/bio2rdf/sameas:hsa-geneid
http://localhost:8080/bio2rdf/learn:geneid
http://localhost:8080/bio2rdf/load:cpd
http://localhost:8080/bio2rdf/load:reactome
http://localhost:8080/bio2rdf/load:biopax-xref
http://localhost:8080/bio2rdf/load:chebi
http://localhost:8080/bio2rdf/load:obo-xref
http://localhost:8080/bio2rdf/sameas:keggcompound-cpd
1,700 K triples
97 MB in Turtle format
in 90 minutes
32. Third example SeRQL query
What genes related to Parkinson’s disease are involved in
pathways according to Kegg?
SELECT
GeneticDisorder-label, Gene-label, pathway-label
FROM
{GeneticDisorder} rdf:type {<http://bio2rdf.org/omim#GeneticDisorder>},
{GeneticDisorder} rdfs:label {GeneticDisorder-label},
{GeneticDisorder} <http://www.w3.org/2002/07/owl#sameAs> {sameAs},
{Gene} <http://bio2rdf.org/bio2rdf#xRef> {sameAs},
{Gene} rdfs:label {Gene-label},
{Gene2} <http://www.w3.org/2000/01/rdf-schema#seeAlso> {Gene},
{xobject} <http://bio2rdf.org/kegg#xobject> {Gene2},
{xentry1} <http://bio2rdf.org/kegg#xentry1> {xobject},
{pathway} <http://bio2rdf.org/kegg#xrelation> {xentry1},
{pathway} rdfs:label {pathway-label}
WHERE
GeneticDisorder-label like "*PARKINSON*"
36. Our main results
● RDF is a framework that enables a very simple
thing: scalability of the knowledge base complexity.
● The Bio2RDF project proposes to keep complexity
in the bioinformatics knowledge space under
control by applying this proven semantic web
approach.
37. Now with Bio2RDF semantic integration
38. Bio2RDF's vision of knowledge map
39. Bio2RDF's map of distributed
bioinformatics knowledge
http://bio2rdf.org/bio2rdf-2007-02.owl
40. Map of semantic resource
41. Montreal’s subway map
42. Bio2RDF's actual knowledge map
43. Achievement
Public data + open source software + RDF
technology + rdfizer + normalized URIs =
Bio2RDF knowledge integration;
A bioinformatics integration ontology won't exist if
it is not adopted by the community; bio2rdf.owl is
just a proposed starting point;
46 million RDF documents are now available at
http://bio2rdf.org.
44. The Bio2RDF project provides open
source RDFizers to the community.
So much still needs to be rdfized; if
you are interested in contributing,
join us!
Now let's build the big knowledge
map of bioinformatics…
45. Final words
Please tell Sir Tim Berners-Lee that he was right:
‘semantic web in bioinformatics’ is a killer app
to illustrate all the potential of the semantic web.
And also tell Mark Wilkinson that the semantic web
in bioinformatics won’t be full of creeps if we
organize it like we did…
46. Thanks
Jean Morissette
Nicole Tourigny
Philippe Rigault
Bioinformatics lab’s team at CHUL Research Center
Many open source communities
(OpenRDF, Simile’s project, Tomcat, JSTL and many more)
W3C Bio-RDF Group
Génome Québec
Génome Canada
47. Visit http://bio2rdf.org
Download http://sourceforge.net/projects/bio2rdf/
Discover http://bio2rdf.org/bio2rdf-2007-02.owl
Contact us at bio2rdf@gmail.com