Towards A Mashup To Build
Bioinformatics Knowledge System

François Belleau, Marc-Alexandre Nolin,
Nicole Tourigny, Philippe Rigault, Jean Morissette

Département d'informatique et de génie logiciel
Université Laval

Presentation Plan
 ● Knowledge integration vision
 ● Bio2RDF architecture
 ● RDFization of knowledge
 ● Normalization of URI
 ● Parkinson Example Demo
 ● Conclusion

Banff, May 8, 2007     CHUL research center - Laval University   2
From the RDF inventor:
  “Wouldn't it be great if you were able to
  organize all this information based on
  your own terms, instead of based on the
  application you use to access the
  information?” (1999)
                                                                 Ramanathan V. Guha

From Wikipedia:
Mashup (web application hybrid)

A mashup is a website or application that
combines content from more than one source
into an integrated experience. (2007)

Sir Berners-Lee’s vision of the semantic web
 « The Semantic Web is not a separate
 Web but an extension of the current
 one, in which information is given well-
 defined meaning, better enabling
 computers and people to work in
 cooperation. »
 Scientific American, 2001
                                                                                  Tim Berners-Lee

                     http://www.w3.org/2006/Talks/0404-mit-tbl/

Bio2RDF starting vision at ISMB 2005

 ● Too many knowledge sources available for life science scientists
 ● Too many formats (text, XML, HTML)
 ● New source each day with specialized tool or web interface
 ● Integration problem recognized by global community

    Thanks to Christopher Baker, Eric
    Neumann, Kei Cheung and
    Joanne Luciano for their ideas.

The knowledge integration problem in bioinformatics

     From the BioPAX group (2004)                        From Carol Goble at ISWC 2005

Integration methods in bioinformatics
     1) Davidson 1995
      “Transform data to the federated database on
        demand”

     2) Köhler 2003
      “In different databases the same things can be
         given different names”

     3) Stein 2003
      “link integration, view integration and data
         warehousing”

Data warehouse approaches

   http://www.ncbi.nlm.nih.gov/Database/                       http://www.genome.jp/dbget/dbget.links.html

Bio2RDF’s approach
                     to knowledge integration:

       “Solve the problem of knowledge
      integration in biology by applying
           a semantic web approach.”

Other semantic web projects




Bio2RDF’s design rules

 2. Convert document to RDF format;
 3. Use of a triplestore technology (Sesame,
    Virtuoso, Oracle);
 4. Normalize URIs;
 5. Build a mashup as needed to answer specific
    question (Elmo);
 6. Query the mashup with SeRQL or SPARQL.

Bio2RDF’s architecture

     [Architecture diagram; components are labelled #1 to #6]

Bio2RDF’s knowledge sources

RDF conversion statistics

     Data source   LSID example               Number of RDF documents   Size of data converted
     go            go:0000001                                  22 961              507 963 321
     kegg          path:aae00010                               35 257            1 038 593 137
     kegg          cpd:c00001                                  14 292                8 902 205
     mgi           mgi:96103                                  438 724              210 458 897
     ncbi          omim:100050                                 17 359              573 639 380
     ncbi          geneid:1                                 2 744 786           67 225 535 082
     obo           obo's 59 name spaces                       279 720              216 007 267
     pdb           pdb:100d                                    34 421           16 309 651 935
     uniprot       uniprot:A0A000                           4 177 176           29 453 203 064
     uniprot       enzyme:1.-.-.-                               5 020                2 844 058
     uniprot       pubmed:100133                              191 664              364 728 083
     uniprot       taxonomy:10                                337 564              125 630 659
     uniprot       uniref:UniRef100_A0A000                  7 990 452           14 865 490 144
     …             …                                                …                        …

OpenRDF’s software
                     http://www.openrdf.org/

RDF of geneid:15275

     • rdf:about
     • rdfs:label
     • dc:identifier, title, created
     • bio2rdf:lsid
     • bio2rdf:url
     • bio2rdf:synonym
     • bio2rdf:xRef

RDFizer
To rdfize: to convert an existing
document into RDF format.

                      efetch                              rdfizer

How to rdfize

     • From HTML pages (prosite:ps00101)
     • From XML documents using XSLT (path:mmu00010)
     • From XML documents using XPath and JSTL (geneid:15275)
     • From direct SQL access (ensembl:ensmusg00000025875)
     • From RDF document (uniprot:p26838)
     • From text files (cpd:c00001)

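As a rough illustration of the first route in the list (an HTML page rdfized with a regex), here is a minimal Python sketch. The HTML fragment, the assumed page layout, the regex and the function name `rdfize_html` are all invented for illustration; this is not the real PROSITE page nor the Bio2RDF rdfizer. Only the `http://bio2rdf.org/namespace:id` URI pattern and the use of `rdfs:label` are taken from the slides.

```python
import re

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical HTML fragment standing in for a database entry page.
SAMPLE_HTML = """
<html><body>
<h1>Entry PS00101</h1>
<p class="description">Example description of the entry</p>
</body></html>
"""

# Assumed layout: the id sits in <h1>, the description in <p class="description">.
ENTRY_RE = re.compile(
    r'<h1>Entry\s+(?P<id>\w+)</h1>.*?'
    r'<p class="description">(?P<desc>[^<]+)</p>',
    re.DOTALL,
)


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rdfize_html(html, namespace):
    """Return N-Triples lines extracted from the page, or [] if nothing matches."""
    match = ENTRY_RE.search(html)
    if match is None:
        return []
    # Lowercase the id so the subject follows the bio2rdf.org URI convention.
    subject = "http://bio2rdf.org/%s:%s" % (namespace, match.group("id").lower())
    label = escape_literal(match.group("desc").strip())
    return ['<%s> <%s> "%s" .' % (subject, RDFS_LABEL, label)]


triples = rdfize_html(SAMPLE_HTML, "prosite")
```

A regex is brittle against layout changes in the source page, which is presumably why the other routes in the list rely on structured inputs (XML, SQL, RDF) where they are available.
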
1) prosite:ps00101 from html using a regex




2) Kegg’s path:mmu00010 from XML using XSL

3) ensembl:ensmusg00000025875 from SQL




4) uniprot:p26838 from RDF using SeRQL




One reality, many names
 ● Different namespace identifier
           pubmed:11992264 vs pmid:11992264
 ● Uppercase and lowercase
           uniprot:p26838 vs uniprot:P26838
 ● Version number
           genbank:ac008393 vs genbank:ac008393.7
 ● Total id length
           go:0032283 vs go:32283

RDFizing documents is not enough,
       we also need normalized URIs.

   http://bio2rdf.org/namespace:id
             http://bio2rdf.org/pubmed:11992264
               http://bio2rdf.org/uniprot:p26838
             http://bio2rdf.org/genbank:ac008393
                 http://bio2rdf.org/go:0032283

URI Normalization rules
 ● Different namespace identifier
           We resolve namespace synonymy with a urlrewrite rule, for
           example pubmed and pmid.
 ● Uppercase and lowercase
           We write every URI in lowercase.
 ● Version number
           An owl:sameAs predicate is used to link the different versions
           of a document.
 ● Total id length
           A fixed length is determined for each id.

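The four normalization rules can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Bio2RDF implementation: the function name `normalize`, the lookup tables, the 7-digit width for `go` (inferred from the go:0032283 example), the choice of the unversioned URI as the canonical one, and the restriction of version handling to `genbank` are all assumptions made for illustration.

```python
import re

BASE = "http://bio2rdf.org/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Rule 1: namespace synonyms (pmid is treated as pubmed, as in the slides).
NAMESPACE_SYNONYMS = {"pmid": "pubmed"}
# Rule 4: assumed fixed id widths; 7 digits matches go:0032283.
FIXED_ID_LENGTH = {"go": 7}
# Rule 3: assumed set of namespaces whose ids may carry a ".<version>" suffix.
VERSIONED_NAMESPACES = {"genbank"}


def normalize(identifier):
    """Return (normalized_uri, same_as_links) for a 'namespace:id' string."""
    # Rule 2: everything is written in lowercase.
    namespace, _, local_id = identifier.strip().lower().partition(":")
    namespace = NAMESPACE_SYNONYMS.get(namespace, namespace)

    # Rule 4: left-pad purely numeric ids to the fixed width.
    width = FIXED_ID_LENGTH.get(namespace)
    if width is not None and local_id.isdigit():
        local_id = local_id.zfill(width)

    uri = BASE + namespace + ":" + local_id
    links = []

    # Rule 3: link a versioned id to its unversioned form with owl:sameAs.
    versioned = re.fullmatch(r"(.+)\.(\d+)", local_id)
    if namespace in VERSIONED_NAMESPACES and versioned:
        unversioned_uri = BASE + namespace + ":" + versioned.group(1)
        links.append((uri, OWL_SAME_AS, unversioned_uri))
        uri = unversioned_uri
    return uri, links
```

For example, `normalize("pmid:11992264")` and `normalize("pubmed:11992264")` both yield `http://bio2rdf.org/pubmed:11992264`, so statements about the same article from different sources end up attached to one node.
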
Url Rewrite Filter
                             http://tuckey.org/urlrewrite/
    <!-- Keyword search in PubMed, served by the Entrez rdfizer -->
    <rule>
          <from>^/search:(.*?)@pubmed</from>
          <to>/rdfizer/ncbi-entrez2rdf.jsp?db=pubmed&amp;query=$1</to>
    </rule>
    <!-- A single PubMed record -->
    <rule>
          <from>^/pubmed:(.*)</from>
          <to>/rdfizer/ncbi-pubmed2rdf.jsp?id=$1</to>
    </rule>
    <!-- pmid is a synonym of pubmed: answer with a sameAs document -->
    <rule>
          <from>^/pmid:(.*)</from>
          <to>/rdfizer/lsid-sameas2rdf.jsp?from=pmid:$1&amp;to=pubmed:$1</to>
    </rule>

    <!-- Any other namespace:id is redirected to bio2rdf.org -->
    <rule>
          <from>^/(.*):(.*)</from>
          <to type="redirect">http://bio2rdf.org/$1:$2</to>
    </rule>

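To see what these rules do to an incoming path, the following Python sketch replays them with the standard `re` module. It is a simplified model, not UrlRewriteFilter itself: it assumes rules are tried in order and that the first match wins, which may differ from the filter's actual rule-chaining behaviour, and the function name `rewrite` is hypothetical. The `&amp;` of the XML file appears here as a plain `&`.

```python
import re

# (pattern, target) pairs copied from the rules above; "$1" is written "\1"
# because that is the backreference syntax of Python's re templates.
RULES = [
    (r"^/search:(.*?)@pubmed", r"/rdfizer/ncbi-entrez2rdf.jsp?db=pubmed&query=\1"),
    (r"^/pubmed:(.*)", r"/rdfizer/ncbi-pubmed2rdf.jsp?id=\1"),
    (r"^/pmid:(.*)", r"/rdfizer/lsid-sameas2rdf.jsp?from=pmid:\1&to=pubmed:\1"),
    (r"^/(.*):(.*)", r"http://bio2rdf.org/\1:\2"),
]


def rewrite(path):
    """Return the target of the first matching rule, or the path unchanged."""
    for pattern, target in RULES:
        match = re.match(pattern, path)
        if match:
            # expand() substitutes the captured groups into the target template.
            return match.expand(target)
    return path
```

The order matters in this model: the generic `namespace:id` rule is last, so it only catches paths that none of the specific rules handled.
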
URL vs LSID
                        http://bio2rdf.org/uniprot:p26838
                                          owl:sameAs
                    urn:lsid:uniprot.org:uniprot:p26838

http://bio2rdf.org/uniprot:p26838

                                    http://bio2rdf.org/urn:lsid:uniprot.org:uniprot:p26838

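The URL/LSID equivalence can be expressed as a small mapping plus one owl:sameAs statement. The Python sketch below assumes a lookup table from namespace to LSID authority; only the `uniprot` to `uniprot.org` entry comes from the slide, and the table and the function names `to_lsid` and `same_as_triple` are hypothetical.

```python
BASE = "http://bio2rdf.org/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Assumed namespace-to-authority table; only this entry appears in the slide.
LSID_AUTHORITY = {"uniprot": "uniprot.org"}


def to_lsid(url):
    """Map http://bio2rdf.org/namespace:id to urn:lsid:authority:namespace:id."""
    if not url.startswith(BASE):
        raise ValueError("not a bio2rdf.org URI: %r" % url)
    namespace, sep, local_id = url[len(BASE):].partition(":")
    if not sep or namespace not in LSID_AUTHORITY:
        raise ValueError("no LSID authority known for %r" % url)
    return "urn:lsid:%s:%s:%s" % (LSID_AUTHORITY[namespace], namespace, local_id)


def same_as_triple(url):
    """Return the N-Triples statement linking the URL to its LSID."""
    return "<%s> <%s> <%s> ." % (url, OWL_SAME_AS, to_lsid(url))
```
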
Our method to answer questions

         To answer a very specialized
          question, we build a specific
         knowledge base (the mashup
          stored in an RDF triplestore)
         and then query it with SeRQL.

Parkinson examples
      1. What is the semantic network of
         OMIM records describing Parkinson’s
         disease?
      2. Which MeSH terms are most cited
         in Parkinson’s disease publications?
      3. What genes related to Parkinson’s
         disease are involved in pathways
         according to Kegg?

Time for a demo!

The big everything about Parkinson
http://localhost:8080/bio2rdf/search:parkinson@omim
http://localhost:8080/bio2rdf/search:parkinson@geneid
http://localhost:8080/bio2rdf/search:parkinson@uniprot
http://localhost:8080/bio2rdf/search:parkinson@kegg
http://localhost:8080/bio2rdf/load:pubmed
http://localhost:8080/bio2rdf/sameas:hsa-geneid
http://localhost:8080/bio2rdf/learn:geneid
http://localhost:8080/bio2rdf/load:cpd
http://localhost:8080/bio2rdf/load:reactome
http://localhost:8080/bio2rdf/load:biopax-xref
http://localhost:8080/bio2rdf/load:chebi
http://localhost:8080/bio2rdf/load:obo-xref
http://localhost:8080/bio2rdf/sameas:keggcompound-cpd

      1 700 K triples
97 Mbytes in Turtle format
      in 90 minutes

Third example: SeRQL query
  What genes related to Parkinson’s disease are involved in
               pathways according to Kegg?
SELECT
     GeneticDisorder-label, Gene-label, pathway-label
FROM
  {GeneticDisorder} rdf:type {<http://bio2rdf.org/omim#GeneticDisorder>},
  {GeneticDisorder} rdfs:label {GeneticDisorder-label},
  {GeneticDisorder} <http://www.w3.org/2002/07/owl#sameAs> {sameAs},
  {Gene} <http://bio2rdf.org/bio2rdf#xRef> {sameAs},
  {Gene} rdfs:label {Gene-label},
  {Gene2} <http://www.w3.org/2000/01/rdf-schema#seeAlso> {Gene},
  {xobject} <http://bio2rdf.org/kegg#xobject> {Gene2},
  {xentry1} <http://bio2rdf.org/kegg#xentry1> {xobject},
  {pathway} <http://bio2rdf.org/kegg#xrelation> {xentry1},
  {pathway} rdfs:label {pathway-label}
WHERE
     GeneticDisorder-label like "*PARKINSON*"

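The query is a chain of triple patterns joined on shared variables, followed by a filter on the disorder label. The toy Python matcher below shows that join logic on a handful of made-up triples. It is only a sketch of the idea, not how Sesame evaluates SeRQL: every identifier and label in `TRIPLES` is hypothetical, the predicates are abbreviated to prefixed names, and variables are written with a leading `?` for convenience.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; None on conflict."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def query(patterns, triples):
    """Join all patterns against the triples; return every consistent binding."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            for extended in [match_pattern(pattern, triple, binding)]
            if extended is not None
        ]
    return bindings


# Hypothetical mashup content; none of these ids or labels are real records.
TRIPLES = [
    ("omim:D1", "rdf:type", "omim:GeneticDisorder"),
    ("omim:D1", "rdfs:label", "EXAMPLE PARKINSON DISORDER"),
    ("omim:D1", "owl:sameAs", "x:1"),
    ("geneid:G1", "bio2rdf:xRef", "x:1"),
    ("geneid:G1", "rdfs:label", "example gene"),
    ("kegg:G1", "rdfs:seeAlso", "geneid:G1"),
    ("kegg:obj1", "kegg:xobject", "kegg:G1"),
    ("kegg:entry1", "kegg:xentry1", "kegg:obj1"),
    ("path:P1", "kegg:xrelation", "kegg:entry1"),
    ("path:P1", "rdfs:label", "example pathway"),
    ("omim:D2", "rdf:type", "omim:GeneticDisorder"),
    ("omim:D2", "rdfs:label", "OTHER DISORDER"),
]

# Same chain as the FROM clause of the SeRQL query.
PATTERNS = [
    ("?disorder", "rdf:type", "omim:GeneticDisorder"),
    ("?disorder", "rdfs:label", "?disorder_label"),
    ("?disorder", "owl:sameAs", "?same"),
    ("?gene", "bio2rdf:xRef", "?same"),
    ("?gene", "rdfs:label", "?gene_label"),
    ("?gene2", "rdfs:seeAlso", "?gene"),
    ("?xobject", "kegg:xobject", "?gene2"),
    ("?xentry1", "kegg:xentry1", "?xobject"),
    ("?pathway", "kegg:xrelation", "?xentry1"),
    ("?pathway", "rdfs:label", "?pathway_label"),
]

# The WHERE clause becomes a substring test on the disorder label.
rows = [
    (b["?disorder_label"], b["?gene_label"], b["?pathway_label"])
    for b in query(PATTERNS, TRIPLES)
    if "PARKINSON" in b["?disorder_label"]
]
```

The nested loops make this quadratic or worse, which is fine for a dozen triples; a real triplestore uses indexes to do the same join over millions of statements.
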
Query result




Conclusion




Before Bio2RDF integration




Our main results

   ● RDF is a framework that enables a very simple
     thing: scalability of the knowledge base complexity.

   ● The Bio2RDF project proposes to keep complexity
     in the bioinformatics knowledge space under
     control by applying this proven semantic web
     approach.

Now with Bio2RDF semantic integration




Bio2RDF’s vision of knowledge map

Bio2RDF’s map of distributed
                       bioinformatics knowledge

        http://bio2rdf.org/bio2rdf-2007-02.owl

Map of semantic resources

Montreal’s subway map

Bio2RDF’s actual knowledge map

Achievement
Public data + open source software + RDF
    technology + rdfizer + normalized URIs =
    Bio2RDF knowledge integration;
A bioinformatics integration ontology won't exist if
    it is not adopted by the community; bio2rdf.owl is
    just a proposed starting point;
46 million RDF documents are now available at
    http://bio2rdf.org.

The Bio2RDF project provides an open
      source RDFizer to the community.
      So much still needs to be rdfized; if
       you are interested in contributing,
                   join us!

         Now let's build the big knowledge
            map of bioinformatics…

Final words

     Please, tell Sir Tim Berners-Lee that he was right:
     ‘semantic web in bioinformatics’ is a killer app
     to illustrate all the potential of the semantic web.
     And also, tell Mark Wilkinson that semantic web
     in bioinformatics won’t be full of creeps if we
     organize it like we did…

Thanks
                    Jean Morissette
                    Nicole Tourigny
                    Philippe Rigault

   Bioinformatics lab’s team at CHUL Research Center

            Many open source communities
(OpenRDF, Simile’s project, Tomcat, JSTL and many more)

                 W3C Bio-RDF Group

                   Génome Québec
                   Génome Canada

Visit http://bio2rdf.org
    Download http://sourceforge.net/projects/bio2rdf/

       Discover http://bio2rdf.org/bio2rdf-2007-02.owl

                     Contact us at bio2rdf@gmail.com

Mais conteúdo relacionado

Semelhante a Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Linked Data for integrating life-science databases
Linked Data for integrating life-science databasesLinked Data for integrating life-science databases
Linked Data for integrating life-science databases
Shuichi Kawashima
 
BioPAX Models and Pathways
BioPAX Models and PathwaysBioPAX Models and Pathways
BioPAX Models and Pathways
Michel Dumontier
 
Providing named entity based search with a common biological database naming ...
Providing named entity based search with a common biological database naming ...Providing named entity based search with a common biological database naming ...
Providing named entity based search with a common biological database naming ...
nolmar01
 
Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014
Monica Munoz-Torres
 
EUGM 2013 - Bernd Rupp (FMP) Chemical Information systems: From compound coll...
EUGM 2013 - Bernd Rupp (FMP) Chemical Information systems: From compound coll...EUGM 2013 - Bernd Rupp (FMP) Chemical Information systems: From compound coll...
EUGM 2013 - Bernd Rupp (FMP) Chemical Information systems: From compound coll...
ChemAxon
 

Semelhante a Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System (20)

Linked Data for integrating life-science databases
Linked Data for integrating life-science databasesLinked Data for integrating life-science databases
Linked Data for integrating life-science databases
 
Small molecule identification and the new MassBank
Small molecule identification and the new MassBankSmall molecule identification and the new MassBank
Small molecule identification and the new MassBank
 
BioPAX Models and Pathways
BioPAX Models and PathwaysBioPAX Models and Pathways
BioPAX Models and Pathways
 
Converting GHO to RDF
Converting GHO to RDFConverting GHO to RDF
Converting GHO to RDF
 
Using Architectures for Semantic Interoperability to Create Journal Clubs for...
Using Architectures for Semantic Interoperability to Create Journal Clubs for...Using Architectures for Semantic Interoperability to Create Journal Clubs for...
Using Architectures for Semantic Interoperability to Create Journal Clubs for...
 
Providing named entity based search with a common biological database naming ...
Providing named entity based search with a common biological database naming ...Providing named entity based search with a common biological database naming ...
Providing named entity based search with a common biological database naming ...
 
W4 4 marc-alexandre-nolin-v2
W4 4 marc-alexandre-nolin-v2W4 4 marc-alexandre-nolin-v2
W4 4 marc-alexandre-nolin-v2
 
Introduction to BioHackathon 2014
Introduction to BioHackathon 2014Introduction to BioHackathon 2014
Introduction to BioHackathon 2014
 
Bio2RDF@BH2010
Bio2RDF@BH2010Bio2RDF@BH2010
Bio2RDF@BH2010
 
Producing, publishing and consuming linked data - CSHALS 2013
Producing, publishing and consuming linked data - CSHALS 2013Producing, publishing and consuming linked data - CSHALS 2013
Producing, publishing and consuming linked data - CSHALS 2013
 
Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012
 
Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014
 
Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences research
 
Linked Data for Federation of OER Data &amp; Repositories
Linked Data for Federation of OER Data &amp; RepositoriesLinked Data for Federation of OER Data &amp; Repositories
Linked Data for Federation of OER Data &amp; Repositories
 
HMW-DNA for long-read single-molecule sequencing
HMW-DNA for long-read single-molecule sequencingHMW-DNA for long-read single-molecule sequencing
HMW-DNA for long-read single-molecule sequencing
 
iMicrobe_ASLO_2015
iMicrobe_ASLO_2015iMicrobe_ASLO_2015
iMicrobe_ASLO_2015
 
Primary and secondary database
Primary and secondary databasePrimary and secondary database
Primary and secondary database
 
DisGeNET Tutorial SWAT4LS 2015-12-07
DisGeNET Tutorial SWAT4LS 2015-12-07DisGeNET Tutorial SWAT4LS 2015-12-07
DisGeNET Tutorial SWAT4LS 2015-12-07
 
EUGM 2013 - Bernd Rupp (FMP) Chemical Information systems: From compound coll...
EUGM 2013 - Bernd Rupp (FMP) Chemical Information systems: From compound coll...EUGM 2013 - Bernd Rupp (FMP) Chemical Information systems: From compound coll...
EUGM 2013 - Bernd Rupp (FMP) Chemical Information systems: From compound coll...
 
Ruby on bioinformatics
Ruby on bioinformaticsRuby on bioinformatics
Ruby on bioinformatics
 

Mais de François Belleau

Mais de François Belleau (16)

Bio2RDF @ DILS 2008
Bio2RDF @ DILS 2008Bio2RDF @ DILS 2008
Bio2RDF @ DILS 2008
 
Pitch Reactome2json_ld @ swat4hcls 2020
Pitch Reactome2json_ld @ swat4hcls 2020Pitch Reactome2json_ld @ swat4hcls 2020
Pitch Reactome2json_ld @ swat4hcls 2020
 
Show de boucane pour ELK
Show de boucane pour ELKShow de boucane pour ELK
Show de boucane pour ELK
 
Pitch Qliic coopérathon 2017
Pitch Qliic coopérathon 2017Pitch Qliic coopérathon 2017
Pitch Qliic coopérathon 2017
 
2015-11-17 Présentation SEAO et ES
2015-11-17 Présentation SEAO et ES2015-11-17 Présentation SEAO et ES
2015-11-17 Présentation SEAO et ES
 
Linuq 20160130
Linuq 20160130Linuq 20160130
Linuq 20160130
 
textOdossier
textOdossiertextOdossier
textOdossier
 
BD2K hackathon - Bio2RDF submission
BD2K hackathon - Bio2RDF submissionBD2K hackathon - Bio2RDF submission
BD2K hackathon - Bio2RDF submission
 
Découvrir le web sémantique en 15 minutes (Decideo 2014)
Découvrir le web sémantique en 15 minutes (Decideo 2014)Découvrir le web sémantique en 15 minutes (Decideo 2014)
Découvrir le web sémantique en 15 minutes (Decideo 2014)
 
Bio2RDF poster for Biocurator 2014 conference
Bio2RDF poster for Biocurator 2014 conferenceBio2RDF poster for Biocurator 2014 conference
Bio2RDF poster for Biocurator 2014 conference
 
Acfas 2013 - Comment publier sur le web sémantique : la méthode de Bio2RDF
Acfas 2013 - Comment publier sur le web sémantique : la méthode de Bio2RDFAcfas 2013 - Comment publier sur le web sémantique : la méthode de Bio2RDF
Acfas 2013 - Comment publier sur le web sémantique : la méthode de Bio2RDF
 
Producing, Publishing and Consuming Linked Data Three lessons from the Bio2RD...
Producing, Publishing and Consuming Linked Data Three lessons from the Bio2RD...Producing, Publishing and Consuming Linked Data Three lessons from the Bio2RD...
Producing, Publishing and Consuming Linked Data Three lessons from the Bio2RD...
 
Bio2RDF-ISMB2008
Bio2RDF-ISMB2008Bio2RDF-ISMB2008
Bio2RDF-ISMB2008
 
Bio2RDF : A Semantic Web Atlas of post genomic knowledge about Human and Mouse
Bio2RDF : A Semantic Web Atlas of post genomic knowledge about Human and MouseBio2RDF : A Semantic Web Atlas of post genomic knowledge about Human and Mouse
Bio2RDF : A Semantic Web Atlas of post genomic knowledge about Human and Mouse
 
Bio2RDF should we do it
Bio2RDF should we do itBio2RDF should we do it
Bio2RDF should we do it
 
Bio2RDF/Virtuoso
Bio2RDF/VirtuosoBio2RDF/Virtuoso
Bio2RDF/Virtuoso
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

  • 1. Towards A M ashup To Build Bioinformatics K nowledge System François Belleau, M arc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, Jean M orissette Département d'informatique et de génie logiciel Université Laval
  • 2. Presentation Plan K nowledge integration vision  Bio2RDF architecture  RDFization of knowledge  Normalization of U RI  Parkinson E xample Demo  Conclusion  Banff, May 8, 2007 CHUL research center - Laval University 2
  • 3. From the RDF inventor : quot;Wouldn't it be great if you were able to organize all this information based on your own terms, instead of based on the application you use to access the information ?” (1999) Ramanathan V. Guha From WikiPedia : Mashup (web application hybrid) A mashup is a website or application that combines content from more than one source into an integrated experience.(2007) Banff, May 8, 2007 CHUL research center - Laval University 3
  • 4. Sir Berners-L ee’s vision of semantic web « The Semantic Web is not a separate Web but an extension of the current one, in which information is given well- defined meaning, better enabling computers and people to work in cooperation. » Scientific Americain, 2001 Tim Berners- Lee http://www.w3.org/2006/Talks/0404-mit-tbl/ Banff, May 8, 2007 CHUL research center - Laval University 4
  • 5. Bio2RDF starting vision at ISM B 2005 Too many knowledge sources  available for life science scientists Too many formats (text, X M L ,  HTM L ) New source each day with  specialized tool or web interface Integration problem recognized by  global community T hanks to Chr istopher Baker, Eric Neum ann, Kei Cheun g and Johan ne Luciaono for their ideas. Banff, May 8, 2007 CHUL research center - Laval University 5
  • 6. The knowledge integration problem in bioinformatics From the BioPAX group(2004) From Carol Goble at ISW C 2005 Banff, May 8, 2007 CHUL research center - Laval University 6
  • 7. Integration methods in bioinformatics 1) Davidson 1995 “Transform data to the federated database on demand” 2) Köhler 2003 “In different databases the same things can be given different names” 3) Stein 2003 “link integration, view integration and data warehousing” Banff, May 8, 2007 CHUL research center - Laval University 7
  • 8. Data warehouse approaches url http://www.ncbi.nlm.nih.gov/Database/ http://www.genome.jp/dbget/dbget.links.html Banff, May 8, 2007 CHUL research center - Laval University 8
  • 9. Bio2RDF’s approach to knowledge integration: “Solve the problem of knowledge integration in biology by applying a semantic web approach.”
  • 10. Other semantic web projects
  • 11. Bio2RDF’s design rules: 2. Convert documents to RDF format; 3. Use a triplestore technology (Sesame, Virtuoso, Oracle); 4. Normalize URIs; 5. Build a mashup as needed to answer a specific question (Elmo); 6. Query the mashup with SeRQL or SPARQL.
  • 12. Bio2RDF’s architecture (diagram with components labeled #1 to #6)
  • 13. Bio2RDF’s knowledge sources
  • 14. RDF conversion statistics (columns: data source | LSID example | number of RDF documents converted | size of data)
      go | go:0000001 | 22 961 | 507 963 321
      kegg | path:aae00010 | 35 257 | 1 038 593 137
      kegg | cpd:c00001 | 14 292 | 8 902 205
      mgi | mgi:96103 | 438 724 | 210 458 897
      ncbi | omim:100050 | 17 359 | 573 639 380
      ncbi | geneid:1 | 2 744 786 | 67 225 535 082
      obo | obo's 59 namespaces | 279 720 | 216 007 267
      pdb | pdb:100d | 34 421 | 16 309 651 935
      uniprot | uniprot:A0A000 | 4 177 176 | 29 453 203 064
      uniprot | enzyme:1.-.-.- | 5 020 | 2 844 058
      uniprot | pubmed:100133 | 191 664 | 364 728 083
      uniprot | taxonomy:10 | 337 564 | 125 630 659
      uniprot | uniref:UniRef100_A0A000 | 7 990 452 | 14 865 490 144
      … | … | … | …
  • 15. OpenRDF’s software: http://www.openrdf.org/
  • 16. RDF of geneid:15275: rdf:about; rdfs:label; dc:identifier, title, created; bio2rdf:lsid; bio2rdf:url; bio2rdf:synonym; bio2rdf:xRef
  • 17. RDFizer. To rdfize: to convert an existing document into RDF format. (Diagram labels: efetch, rdfizer)
  • 18. How to rdfize: from HTML pages (prosite:ps00101); from XML documents using XSLT (path:mmu00010); from XML documents using XPath and JSTL (geneid:15275); from direct SQL access (ensembl:ensmusg00000025875); from RDF documents (uniprot:p26838); from text files (cpd:c00001).
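The text-file route listed on this slide can be illustrated with a minimal sketch. It is not the project's RDFizer: the flat-file layout of `SAMPLE`, the helper names, and the `bio2rdf#synonym` predicate URI (guessed from the `bio2rdf:synonym` property on slide 16 and the `bio2rdf#xRef` URI used in the SeRQL query later in the deck) are all assumptions made for illustration.

```python
# Hypothetical sketch: turn a small flat-text record into N-Triples lines.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DC_IDENTIFIER = "http://purl.org/dc/elements/1.1/identifier"
BIO2RDF_SYNONYM = "http://bio2rdf.org/bio2rdf#synonym"  # assumed predicate URI

# Assumed record layout: "KEY   value", with indented continuation lines.
SAMPLE = """ENTRY       C00001
NAME        H2O;
            Water
"""


def parse_record(text):
    """Collect 'KEY value' lines into a dict of lists, keeping continuations."""
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            if key is None:
                continue  # continuation with no preceding key: ignore
            value = line.strip()
        else:
            key, _, value = line.partition(" ")
            value = value.strip()
        fields.setdefault(key, []).append(value)
    return fields


def literal(value):
    """Quote a string as a plain N-Triples literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def rdfize(namespace, text):
    """Return N-Triples lines describing one record under http://bio2rdf.org/."""
    fields = parse_record(text)
    entry_id = fields["ENTRY"][0].split()[0].lower()
    curie = namespace + ":" + entry_id
    subject = "<http://bio2rdf.org/" + curie + ">"
    names = [name.rstrip(";") for name in fields.get("NAME", [])]
    triples = [f"{subject} <{DC_IDENTIFIER}> {literal(curie)} ."]
    if names:
        # First name becomes the label, the remaining names become synonyms.
        triples.append(f"{subject} <{RDFS_LABEL}> {literal(names[0])} .")
        for synonym in names[1:]:
            triples.append(f"{subject} <{BIO2RDF_SYNONYM}> {literal(synonym)} .")
    return triples


if __name__ == "__main__":
    print("\n".join(rdfize("cpd", SAMPLE)))
```

The other routes on the slide (regex over HTML, XSLT, SQL) would differ only in the parsing step; the triple-emitting part could stay the same.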
  • 19. 1) prosite:ps00101 from HTML using a regex
  • 20. 2) KEGG’s path:mmu00010 from XML using XSL
  • 21. 3) ensembl:ensmusg00000025875 from SQL
  • 22. 4) uniprot:p26838 from RDF using SeRQL
  • 23. One reality, many names. Different namespace identifier: pubmed:11992264 vs pmid:11992264. Uppercase and lowercase: uniprot:p26838 vs uniprot:P26838. Version number: genbank:ac008393 vs genbank:ac008393.7. Total id length: go:0032283 vs go:32283.
  • 24. RDFizing documents is not enough; we also need normalized URIs. Pattern: http://bio2rdf.org/namespace:id. Examples: http://bio2rdf.org/pubmed:11992264, http://bio2rdf.org/uniprot:p26838, http://bio2rdf.org/genbank:ac008393, http://bio2rdf.org/go:0032283
  • 25. URI normalization rules. Different namespace identifier: we resolve namespace synonymy with a urlrewrite rule, for example pubmed and pmid. Uppercase and lowercase: we write every URI in lowercase. Version number: an owl:sameAs predicate is used to link the different versions of a document. Total id length: a fixed length is determined for each id.
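The four rules on this slide can be sketched as a small function. This is an illustrative sketch only: the synonym table, the fixed width of 7 digits for `go`, and the choice to treat only `genbank` identifiers as versioned are assumptions inferred from the examples on slides 23 and 24, not the project's actual configuration.

```python
import re

BASE = "http://bio2rdf.org/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Assumed lookup tables, derived only from the examples on the slides.
NAMESPACE_SYNONYMS = {"pmid": "pubmed"}
FIXED_ID_WIDTH = {"go": 7}
VERSIONED_NAMESPACES = {"genbank"}


def normalize(identifier):
    """Map a 'namespace:id' string to a normalized http://bio2rdf.org/ URI."""
    namespace, sep, local_id = identifier.strip().partition(":")
    if not sep or not namespace or not local_id:
        raise ValueError(f"expected 'namespace:id', got {identifier!r}")
    # Rule: every URI is written in lowercase.
    namespace = namespace.lower()
    local_id = local_id.lower()
    # Rule: resolve namespace synonymy (pmid -> pubmed).
    namespace = NAMESPACE_SYNONYMS.get(namespace, namespace)
    # Rule: the canonical URI carries no version suffix.
    if namespace in VERSIONED_NAMESPACES:
        local_id = re.sub(r"\.\d+$", "", local_id)
    # Rule: ids have a fixed length, so pad numeric ids with zeros.
    width = FIXED_ID_WIDTH.get(namespace)
    if width and local_id.isdigit():
        local_id = local_id.zfill(width)
    return f"{BASE}{namespace}:{local_id}"


def same_as_link(identifier):
    """Return an owl:sameAs triple linking a variant URI to its normalized form."""
    namespace, _, local_id = identifier.strip().lower().partition(":")
    variant = f"{BASE}{namespace}:{local_id}"
    normalized = normalize(identifier)
    if variant == normalized:
        return None  # already canonical, nothing to link
    return (variant, OWL_SAME_AS, normalized)


if __name__ == "__main__":
    for raw in ["pmid:11992264", "uniprot:P26838", "genbank:ac008393.7", "go:32283"]:
        print(raw, "->", normalize(raw))
```

Returning an owl:sameAs link rather than silently discarding the variant keeps the versioned or synonymous identifier reachable, which matches the slide's wording for version numbers.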
  • 26. Url Rewrite Filter (http://tuckey.org/urlrewrite/)
      <rule>
        <from>^/search:(.*?)@pubmed</from>
        <to>/rdfizer/ncbi-entrez2rdf.jsp?db=pubmed&amp;query=$1</to>
      </rule>
      <rule>
        <from>^/pubmed:(.*)</from>
        <to>/rdfizer/ncbi-pubmed2rdf.jsp?id=$1</to>
      </rule>
      <rule>
        <from>^/pmid:(.*)</from>
        <to>/rdfizer/lsid-sameas2rdf.jsp?from=pmid:$1&amp;to=pubmed:$1</to>
      </rule>
      <rule>
        <from>^/(.*):(.*)</from>
        <to type="redirect">http://bio2rdf.org/$1:$2</to>
      </rule>
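To make the effect of these rules concrete, the sketch below replays them in Python under a deliberately simplified model: rules are tried in order and the first match wins. That model is an assumption for illustration and does not reproduce the real filter's rule-chaining or query-string handling; the `rewrite` function and the `"forward"`, `"redirect"` and `"pass"` labels are hypothetical names.

```python
import re

# (pattern, target template, is_redirect), copied from the four rules on the slide.
RULES = [
    (r"^/search:(.*?)@pubmed", "/rdfizer/ncbi-entrez2rdf.jsp?db=pubmed&query=$1", False),
    (r"^/pubmed:(.*)", "/rdfizer/ncbi-pubmed2rdf.jsp?id=$1", False),
    (r"^/pmid:(.*)", "/rdfizer/lsid-sameas2rdf.jsp?from=pmid:$1&to=pubmed:$1", False),
    (r"^/(.*):(.*)", "http://bio2rdf.org/$1:$2", True),
]


def rewrite(path):
    """Apply the first matching rule and return (action, target)."""
    for pattern, template, is_redirect in RULES:
        match = re.search(pattern, path)
        if match is None:
            continue
        # Substitute $1, $2, ... with the corresponding captured groups.
        target = re.sub(
            r"\$(\d)",
            lambda ref: match.group(int(ref.group(1))),
            template,
        )
        return ("redirect" if is_redirect else "forward", target)
    return ("pass", path)  # no rule matched: leave the request untouched


if __name__ == "__main__":
    for path in ["/search:parkinson@pubmed", "/pubmed:11992264",
                 "/pmid:11992264", "/uniprot:p26838"]:
        print(path, "->", rewrite(path))
```

Under this model the specific rules (search, pubmed, pmid) must come before the catch-all `namespace:id` rule, since the catch-all pattern also matches their paths.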
  • 27. URL vs LSID: http://bio2rdf.org/uniprot:p26838 owl:sameAs urn:lsid:uniprot.org:uniprot:p26838. Examples: http://bio2rdf.org/uniprot:p26838 and http://bio2rdf.org/urn:lsid:uniprot.org:uniprot:p26838
  • 28. Our method to answer a question: to answer a very specialized question, we build a specific knowledge base (the mashup, stored in an RDF triplestore) and then query it with SeRQL.
  • 29. Parkinson examples: 1. What is the semantic network of OMIM records describing Parkinson’s disease? 2. Which MeSH terms are most cited in Parkinson’s disease publications? 3. What genes related to Parkinson’s disease are involved in pathways according to KEGG?
  • 30. Time for demo!
  • 31. The big everything about Parkinson:
      http://localhost:8080/bio2rdf/search:parkinson@omim
      http://localhost:8080/bio2rdf/search:parkinson@geneid
      http://localhost:8080/bio2rdf/search:parkinson@uniprot
      http://localhost:8080/bio2rdf/search:parkinson@kegg
      http://localhost:8080/bio2rdf/load:pubmed
      http://localhost:8080/bio2rdf/sameas:hsa-geneid
      http://localhost:8080/bio2rdf/learn:geneid
      http://localhost:8080/bio2rdf/load:cpd
      http://localhost:8080/bio2rdf/load:reactome
      http://localhost:8080/bio2rdf/load:biopax-xref
      http://localhost:8080/bio2rdf/load:chebi
      http://localhost:8080/bio2rdf/load:obo-xref
      http://localhost:8080/bio2rdf/sameas:keggcompound-cpd
      Result: 1,700 K triples, 97 MB in Turtle format, in 90 minutes
  • 32. Third example SeRQL query: What genes related to Parkinson’s disease are involved in pathways according to KEGG?
      SELECT GeneticDisorder-label, Gene-label, pathway-label
      FROM {GeneticDisorder} rdf:type {<http://bio2rdf.org/omim#GeneticDisorder>},
           {GeneticDisorder} rdfs:label {GeneticDisorder-label},
           {GeneticDisorder} <http://www.w3.org/2002/07/owl#sameAs> {sameAs},
           {Gene} <http://bio2rdf.org/bio2rdf#xRef> {sameAs},
           {Gene} rdfs:label {Gene-label},
           {Gene2} <http://www.w3.org/2000/01/rdf-schema#seeAlso> {Gene},
           {xobject} <http://bio2rdf.org/kegg#xobject> {Gene2},
           {xentry1} <http://bio2rdf.org/kegg#xentry1> {xobject},
           {pathway} <http://bio2rdf.org/kegg#xrelation> {xentry1},
           {pathway} rdfs:label {pathway-label}
      WHERE GeneticDisorder-label like "*PARKINSON*"
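The query above is a chain of path expressions joined on shared variables. The sketch below shows that joining idea on a tiny in-memory triple list, using only the first five path expressions. It is not SeRQL and not the Sesame API: the matcher, the `?variable` convention, the shortened prefixed names, and every triple in `TRIPLES` are hypothetical toy data made up for illustration.

```python
def unify(term, value, binding):
    """Match one pattern term against one value, extending the binding."""
    if term.startswith("?"):
        if term in binding:
            return binding[term] == value
        binding[term] = value
        return True
    return term == value


def match_patterns(triples, patterns, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)  # fresh copy, so failed matches leave no trace
        if all(unify(term, value, candidate) for term, value in zip(first, triple)):
            yield from match_patterns(triples, rest, candidate)


# Toy data only: none of these records or labels come from a real database.
TRIPLES = [
    ("ex:disorder1", "rdf:type", "omim:GeneticDisorder"),
    ("ex:disorder1", "rdfs:label", "PARKINSON DISEASE (toy record)"),
    ("ex:disorder1", "owl:sameAs", "ex:record1"),
    ("ex:geneA", "bio2rdf:xRef", "ex:record1"),
    ("ex:geneA", "rdfs:label", "gene A"),
    ("ex:disorder2", "rdf:type", "omim:GeneticDisorder"),
    ("ex:disorder2", "rdfs:label", "OTHER DISORDER (toy record)"),
    ("ex:disorder2", "owl:sameAs", "ex:record2"),
    ("ex:geneB", "bio2rdf:xRef", "ex:record2"),
    ("ex:geneB", "rdfs:label", "gene B"),
]

# The first five path expressions of the slide's query, as triple patterns.
PATTERNS = [
    ("?disorder", "rdf:type", "omim:GeneticDisorder"),
    ("?disorder", "rdfs:label", "?disorderLabel"),
    ("?disorder", "owl:sameAs", "?same"),
    ("?gene", "bio2rdf:xRef", "?same"),
    ("?gene", "rdfs:label", "?geneLabel"),
]

# Stand-in for the WHERE ... like "*PARKINSON*" filter.
results = [
    (b["?disorderLabel"], b["?geneLabel"])
    for b in match_patterns(TRIPLES, PATTERNS)
    if "PARKINSON" in b["?disorderLabel"]
]

if __name__ == "__main__":
    for row in results:
        print(row)
```

The shared variable `?same` plays the role of `{sameAs}` in the query: it is what connects the OMIM record to the gene record, which is why normalized URIs matter for the join to succeed.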
  • 33. Query result
  • 34. Conclusion
  • 35. Before Bio2RDF integration
  • 36. Our main results: RDF is a framework that enables a very simple thing: scalability of the knowledge base complexity. The Bio2RDF project proposes to keep complexity in the bioinformatics knowledge space under control by applying this proven semantic web approach.
  • 37. Now with Bio2RDF semantic integration
  • 38. Bio2RDF’s vision of knowledge map
  • 39. Bio2RDF’s map of distributed bioinformatics knowledge: http://bio2rdf.org/bio2rdf-2007-02.owl
  • 40. Map of semantic resource
  • 41. Montreal’s subway map
  • 42. Bio2RDF’s actual knowledge map
  • 43. Achievement: Public data + open source software + RDF technology + rdfizer + normalized URIs = Bio2RDF knowledge integration. A bioinformatics integration ontology won’t exist if it is not adopted by the community; bio2rdf.owl is just a proposed starting point. 46 million RDF documents are now available at http://bio2rdf.org.
  • 44. The Bio2RDF project provides an open source RDFizer to the community. So much still needs to be rdfized; if you are interested in contributing, join us! Now let’s build the big knowledge map of bioinformatics…
  • 45. Final words: Please, tell Sir Tim Berners-Lee that he was right: ‘semantic web in bioinformatics’ is a killer app to illustrate all the potential of the semantic web. And also, tell Mark Wilkinson that semantic web in bioinformatics won’t be full of creeps if we organize it like we did…
  • 46. Thanks: Jean Morissette; Nicole Tourigny; Philippe Rigault; Bioinformatics lab’s team at CHUL Research Center; many open source communities (OpenRDF, Simile’s project, Tomcat, JSTL and many more); W3C Bio-RDF Group; Génome Québec; Génome Canada
  • 47. Visit http://bio2rdf.org ; Download http://sourceforge.net/projects/bio2rdf/ ; Discover http://bio2rdf.org/bio2rdf-2007-02.owl ; Contact us at bio2rdf@gmail.com