Crowdsourcing Biology: The Gene
Wiki, BioGPS and GeneGames.org
               Andrew Su, Ph.D.
                  @andrewsu
                asu@scripps.edu
                 http://sulab.org



            Sanger/EBI

         September 7, 2012
2
Few genes are well annotated…

[Figure: PubMed and Gene Ontology annotation counts per gene, with genes sorted by decreasing counts, across 23,278 protein-coding genes. Labeled top genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE. Plot annotations: 59% and 38%.]

Data: NCBI gene2pubmed, August 2010
3
… because the literature is sparsely curated?

[Figure: Number of PubMed-indexed articles by year, 1979 to 2009; y-axis from 0 to 1,000,000.]
4
… because the literature is sparsely curated?

[Figure: Number of articles read by a typical scientist (average human capacity) by year, 1979 to 2009; y-axis from 0 to 20.]
5




311,696 articles (1.5% of PubMed)
have been cited by GO annotations
6




    Sooner or later, the
 research community will
need to be involved in the
annotation effort to scale
   up to the rate of data
        generation.
7
The Long Tail is a prolific source of content


[Figure: content produced per contributor, with contributors sorted; a "Short Head" of prolific contributors is followed by the "Long Tail".]

                       Short Head             Long Tail
              News:    Newspapers             Blogs
             Video:    TV/Hollywood           YouTube
   Product reviews:    Consumer Reports       Amazon reviews
      Food reviews:    Food critics           Yelp
    Talent judging:    Olympics               American Idol
8
Wikipedia is reasonably accurate
9
Wikipedia has breadth and depth


[Figure: Wikipedia vs. Britannica Online, compared by articles, words (millions), and words per article.]

Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
10




  We can harness the
Long Tail of scientists
to directly participate in
  the gene annotation
        process.
11
From crowdsourcing to structured data



                   The Gene Wiki




                Biological Games
12
Filtering, extracting, and summarizing PubMed



Documents




 Concepts
13
Wiki success depends on a positive feedback loop

[Figure: feedback cycle linking Gene Wiki page utility, number of users, and number of contributors.]
14
10,000 gene “stubs” within Wikipedia

[Figure: annotated Gene Wiki stub page. Labeled elements: gene summary, protein structure, symbols and identifiers, Gene Ontology annotations, protein interactions, tissue expression pattern, linked references, links to structured databases. Inset: Utility / Users / Contributors cycle.]

Huss, PLoS Biol, 2008
15
Gene Wiki has a critical mass of readers

Total: 5.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011
16
Gene Wiki has a critical mass of editors

[Figure: editor count and edit count.]

Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles

Good, NAR, 2011
17
A review article for every gene is powerful

     Reelin: 98 editors, 703 edits since July 2002
     Heparin: 358 editors, 654 edits since June 2003
     AMPK: 109 editors, 203 edits since March 2004
     RNAi: 394 editors, 994 edits since October 2002

     Page features: hyperlinks to related concepts, references to the literature
18
Making the Gene Wiki more reliable
  Statement 1: Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …

  Statement 2: The company name is derived from old Greek, and means "destroyer of birds".
19
Making the Gene Wiki more reliable
  Statement 1 (high-trust author, 36211 total edits): Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …

  Statement 2 (low-trust author, 36 total edits): The company name is derived from old Greek, and means "destroyer of birds".

http://www.wikitrust.net/
20
Making the Gene Wiki more computable



Free text       Structured annotations
21
Filling the gaps in gene annotation

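[Diagram: a Gene Wiki page maps to an NCBI Entrez Gene record (example: NCBI Entrez Gene 334). A wikilink on that page that is an exact match to a GO term (example: GO:0006897) yields a candidate assertion.]

                 6319 novel GO annotations
                 2147 novel DO annotations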
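
The wikilink-to-annotation mapping described on this slide can be sketched in a few lines. This is a minimal illustration, not the Gene Wiki pipeline itself: the function name, the lowercase exact-match rule, and the example page links are assumptions made for the sketch. Only the Entrez Gene ID 334 and GO:0006897 (endocytosis) come from the slide.

```python
from typing import Dict, List, Set, Tuple

Assertion = Tuple[int, str]  # (Entrez Gene ID, GO ID)


def candidate_annotations(
    entrez_id: int,
    wikilinks: List[str],
    go_names: Dict[str, str],
    known: Set[Assertion],
) -> List[Assertion]:
    """Return gene-to-GO pairs for wikilinks that exactly match a GO term name.

    Pairs already present in `known` are skipped, so only novel
    candidates are reported. Matching is case-insensitive (an assumption).
    """
    found: List[Assertion] = []
    for link in wikilinks:
        go_id = go_names.get(link.strip().lower())
        if go_id is None:
            continue  # this link does not name a GO term
        pair = (entrez_id, go_id)
        if pair not in known and pair not in found:
            found.append(pair)
    return found


# Hypothetical inputs: a one-term GO lookup and an invented list of page links.
go_names = {"endocytosis": "GO:0006897"}
page_links = ["Endocytosis", "Protein", "Cell membrane", "endocytosis"]

print(candidate_annotations(334, page_links, go_names, known=set()))
```

Each candidate would still need review before being treated as an annotation; the sketch only covers the exact-match step.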
22




TOP 100
GENES
23
Gene Wiki content improves enrichment analysis
[Diagram: GO term "axon guidance" (GO:0007411) → gene list (264 genes) → PubMed abstracts (811 articles) → concept recognition → enrichment analysis.]

                                     GO:0007411
                                     Yes       No
Linked genes through PubMed   Yes     13        2
                              No     251    12033

                P = 1.55E-20
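
The 2×2 table above is the input to an enrichment test. The slide does not name the test, so the sketch below assumes a one-sided hypergeometric (Fisher's exact) test on the overlap; the function and argument names are illustrative.

```python
from math import comb


def enrichment_p_value(both: int, linked_only: int, term_only: int, neither: int) -> float:
    """One-sided hypergeometric p-value: P(overlap >= observed overlap).

    both        genes linked through the text source AND annotated with the GO term
    linked_only genes linked but not annotated with the term
    term_only   genes annotated with the term but not linked
    neither     all remaining genes
    """
    linked = both + linked_only
    in_term = both + term_only
    total = both + linked_only + term_only + neither
    max_overlap = min(linked, in_term)
    # Count the ways to draw `linked` genes with k of them inside the term,
    # for every k at least as large as the observed overlap.
    favourable = sum(
        comb(in_term, k) * comb(total - in_term, linked - k)
        for k in range(both, max_overlap + 1)
    )
    return favourable / comb(total, linked)


# Counts from the axon guidance (GO:0007411) table on the slide.
p = enrichment_p_value(both=13, linked_only=2, term_only=251, neither=12033)
print(f"{p:.2e}")
```

With the slide's counts the result comes out on the order of 1e-20, consistent with the reported P value.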
24
Gene Wiki content improves enrichment analysis
[Diagram: GO term "muscle contraction" (GO:0006936) → gene list (87 genes) → PubMed abstracts (251 articles) + Gene Wiki (87 articles) → concept recognition → enrichment analysis.]

Linked genes through PubMed, tested against GO:0006936:              P = 1.0
Linked genes through PubMed + Gene Wiki, tested against GO:0006936:  P = 1.22E-09
25
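Gene Wiki content improves enrichment analysis

[Figure: scatter plot of enrichment p-values, PubMed only (x-axis) vs. PubMed + GW (y-axis). Regions are labeled "more significant PubMed only" and "more significant PubMed + GW"; the muscle contraction term is highlighted.]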
26
Gene Wiki+: Crowdsourced semantic database
 Q: What genes are related to hemolytic anemia?
27




          The
 Long Tail of scientists
is a valuable source of
  information on gene
        function
28
From crowdsourcing to structured data



                   The Gene Wiki




                Biological Games
29
Gene databases are numerous and overlapping




                            … and hundreds
                               more …
30
Community extensibility and user customizability




                   http://biogps.org
31
Utility: A simple and universal plugin interface
         Utility




Contributors       Users
36
Utility: A simple and universal plugin interface
         Utility




Contributors         Users




                       Total of 389 gene-centric online
                   databases registered as BioGPS plugins
37
Users: BioGPS has critical mass
[Figure: daily pageviews. Inset: Utility / Users / Contributors cycle.]

   • > 4100 registered users
   • 4000 unique visitors per week
   • 40,000 page views per week

   Top 10 organizations
    1. Harvard      6. Cambridge
    2. NIH          7. U Penn
    3. UCSD         8. Stanford
    4. Scripps      9. Wash U
    5. MIT         10. UNC
38
Contributors: Explicit and implicit knowledge
         Utility




Contributors       Users




     389 plugins registered
      (65% publicly shared)

         by over 75 users

    spanning 150+ domains
39
Mining structured content from HTML
40
Defining a data extraction template
        TP53   TNF   APOE   IL6   VEGF EGFR TGFB1   …
  …
41
The BioGPS Semantic Annotator




              http://50.112.124.237
42




        The
    Long Tail of
 bioinformaticians
can collaboratively
build a gene portal.
43
From crowdsourcing to structured data



                   The Gene Wiki




                Biological Games
44



Seven million human hours




                            http://www.flickr.com/photos/archana3k1/4124330493/
45



Twenty million human hours




                             http://www.flickr.com/photos/ableman/2171326385/
46
    150 billion human hours
              per year




                              http://www.flickr.com/photos/rvp-cw/6243289302/
47
Using games to fold proteins



      Fold.it players have successfully:
      • Outperformed state-of-the-art protein
        folding algorithms (Cooper, Nature, 2010)
      • Solved a previously intractable crystal
        structure (Khatib, Nat Struct Mol Biol, 2011)
      • Designed an improved protein folding
        algorithm (Khatib, PNAS, 2011)
      • Improved enzyme activity of a de novo
        designed enzyme (Eiben, Nat Biotechnol, 2011)
48
Using games to fold RNAs




              http://eterna.cmu.edu/
49
Using games to align sequences




              http://phylo.cs.mcgill.ca
50
Using games to annotate genes?




              http://genegames.org
51
No good gene-disease annotation database
             Query: Apolipoprotein E




            Alzheimer's disease (AD)
            Lipoprotein glomerulopathy
            Sea-blue histiocyte disease
52
No good gene-disease annotation database
             Query: Apolipoprotein E




            Alzheimer's disease (AD)
            Lipoprotein glomerulopathy
            Sea-blue histiocyte disease
            Hyperlipoproteinemia, type III
            Macular degeneration, age-related
            Myocardial infarction susceptibility
53
No good gene-disease annotation database
              Query: Apolipoprotein E




           ? Alzheimer's disease (AD)
           ? Lipoprotein glomerulopathy
           ? Sea-blue histiocyte disease
             Hyperlipoproteinemia, type III
           ? Macular degeneration, age-related
           ? Myocardial infarction susceptibility
             HIV
             Psoriasis
             Vascular Diseases
54
No good gene-disease annotation database
             Query: Apolipoprotein E




            Alzheimer's disease (AD)    Memory
            Neuropsychological Tests    Coronary Artery Disease
            Cognition Disorders         Hypertension
            Dementia                    Mental Status Schedule
            Cognition                   Psychiatric Status Rating Scales
            Disease Progression         Hyperlipidemias
            Cardiovascular Diseases     Atrophy
            Coronary Disease            Dementia, Vascular
            Diabetes Mellitus, Type 2   Parkinson Disease
            Memory Disorders            Brain Injuries
                                        Myocardial Infarction
                                        …

            477 diseases!
55
Play Dizeez to annotate gene-disease links
            1. Read the clue (gene)
            2. Click the related disease (only one is “right”)
            3. If it’s ‘right’, you get points
            4. Then on to the next question…
            5. Hurry!
            6. Play to win!
56
Dizeez players seem pretty smart…

  In total (since Dec 2011):
  • 207 unique gamers
  • 1045 games played
  • 8525 guesses

# Occurrences   Gene    Disease               Pubmed   OMIM   PharmGKB   Gene Wiki

      7         GAST    gastrinoma
      7         RBP3    retinoblastoma
      7         SSX1    synovial sarcoma
      6         TG      Graves' disease
      6         CRYGC   cataract
      6         SOX8    mental retardation
      6         WRN     Werner syndrome
      6         ABL1    leukemia
      6         MLL3    leukemia
      6         SNAI2   breast carcinoma
57
Dizeez players seem pretty smart…

# Occurrences   Gene    Disease                  Pubmed   OMIM   PharmGKB   Gene Wiki

      5         MECOM   sarcoma
      4         ATF7    cancer
      3         ABCB5   acute myeloid leukemia
      3         SART1   glioblastoma
      3         NCK1    leukemia
      3         NEK1    cancer
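
A table like the ones above can be produced by tallying player guesses and checking each gene-disease pair against a reference set. The sketch below is one plausible way to do that, not the Dizeez implementation: the guess log, the reference set, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (gene symbol, disease name)


def rank_guesses(
    guesses: Iterable[Pair], known: Set[Pair], min_count: int = 2
) -> List[Tuple[Pair, int, bool]]:
    """Count how often each gene-disease pair was chosen.

    Returns (pair, occurrences, already_known) rows, most frequent first,
    keeping only pairs chosen at least `min_count` times.
    """
    counts = Counter(guesses)
    return [
        (pair, n, pair in known)
        for pair, n in counts.most_common()
        if n >= min_count
    ]


# Hypothetical guess log and reference set, for illustration only.
guess_log = (
    [("WRN", "Werner syndrome")] * 3
    + [("NEK1", "cancer")] * 2
    + [("ABL1", "leukemia")]
)
reference = {("WRN", "Werner syndrome")}

for pair, n, is_known in rank_guesses(guess_log, reference):
    print(n, pair[0], pair[1], "known" if is_known else "candidate")
```

Pairs that players agree on repeatedly but that are missing from the reference set are the interesting ones, since they are candidate new annotations.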
58
Using games to predict phenotype from genotype?




                                  The Cure




               http://genegames.org
59
Classification problems in genome biology

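[Diagram: a data matrix of 100s of samples (cancer vs. normal) by 100,000s of features. A classifier (SVM, neural networks, Naïve Bayes, KNN, …) finds patterns in the matrix and is then used to classify new samples as cancer or normal.]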
60
Random forests
[Diagram: from the data matrix (100s of samples, cancer vs. normal, by 100,000s of features), sample a subset of cases and features, then train a decision tree.]
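
The sampling step on this slide can be sketched with the standard library alone. To keep the example short, each "tree" is a one-split decision stump, which is a simplification and not a full decision tree. The toy expression values and all function names are invented for illustration.

```python
import random
from collections import Counter
from typing import List, Tuple

Stump = Tuple[int, float, str, str]  # (feature index, threshold, label if <=, label if >)


def train_stump(X: List[List[float]], y: List[str], features: List[int]) -> Stump:
    """Pick the (feature, threshold) split with the fewest training errors."""
    best, best_err = None, len(y) + 1
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            err = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if err < best_err:
                best, best_err = (f, t, l_lab, r_lab), err
    return best


def train_forest(X, y, n_trees: int, n_features: int, rng: random.Random) -> List[Stump]:
    forest = []
    for _ in range(n_trees):
        cases = [rng.randrange(len(X)) for _ in X]            # bootstrap sample of cases
        feats = rng.sample(range(len(X[0])), n_features)      # random subset of features
        forest.append(train_stump([X[i] for i in cases], [y[i] for i in cases], feats))
    return forest


def predict(forest: List[Stump], row: List[float]) -> str:
    """Majority vote over all stumps."""
    votes = Counter(lo if row[f] <= t else hi for f, t, lo, hi in forest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): 8 samples, 3 features.
X = [
    [9.1, 7.8, 8.5], [8.7, 8.2, 9.0], [9.4, 7.5, 8.8], [8.9, 8.0, 9.3],
    [1.2, 2.1, 1.5], [0.8, 1.7, 2.2], [1.5, 2.4, 1.1], [1.1, 1.9, 1.8],
]
y = ["cancer"] * 4 + ["normal"] * 4

forest = train_forest(X, y, n_trees=25, n_features=2, rng=random.Random(0))
print(predict(forest, [9.5, 8.5, 9.5]))
```

Real data would have far more features than samples, which is why each tree sees only a small random slice of the matrix.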
61
Random forests


[Diagram: data matrix of 100s of samples (cancer vs. normal) by 100,000s of features.]
62
Random forests

[Diagram: the data matrix (100s of samples, cancer vs. normal, by 100,000s of features) is used to classify new samples as cancer or normal.]

How to interject biological knowledge?
63
Network-guided forests




                         Dutkowski & Ideker (2011). PLoS Computational Biology
64
Network-guided forests
[Diagram: from the data matrix (100s of samples, cancer vs. normal, by 100,000s of features), sample features by PPI network, then train a decision tree.]
65
Human-guided forests
[Diagram: from the data matrix (100s of samples, cancer vs. normal, by 100,000s of features), sample features by human intelligence, then train a decision tree.]
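
One way to read "sample features by human intelligence" is as weighted feature sampling: features that people flag as relevant are drawn more often when building each tree. The slides do not say how player input is combined with the forest, so the weighting rule below (a baseline weight of 1 plus one per vote) and the vote counts are assumptions for illustration only.

```python
import random
from typing import Dict, List


def sample_features(weights: Dict[str, float], k: int, rng: random.Random) -> List[str]:
    """Draw up to k distinct features, each draw proportional to the remaining weights."""
    pool = dict(weights)
    chosen: List[str] = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen


# Gene symbols taken from earlier slides; the vote counts are hypothetical.
genes = ["TP53", "EGFR", "VEGFA", "TNF", "IL6", "APOE"]
player_votes = {"TP53": 12, "EGFR": 7}

# Every gene keeps a baseline weight of 1 so unvoted genes can still be drawn.
weights = {g: 1.0 + player_votes.get(g, 0) for g in genes}

print(sample_features(weights, k=3, rng=random.Random(0)))
```

Setting all weights equal recovers the uniform feature sampling of a standard random forest.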
66
67
The Cure: Genomic predictors for disease
73
Human-guided forests

[Diagram: the human-guided forest is used to classify new samples as cancer or normal.]
74
“Critical Assessment”-style challenge




      Will this work? Check our blog after October 15.
75




         The
Long Tail of gamers
 can collaboratively
  build an accurate
 disease classifier.
76
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members
Ben Good
Max Nanis
Salvatore Loguercio
Chunlei Wu
Ian Macleod

Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
@genegame

Recruiting graduate students in quantitative biology! See http://education.scripps.edu/

Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)

GeneGames.org: Crowdsourcing human gene annotation (Genome Informatics 2012)Andrew Su
 
20120220 Tri-Con Cloud Computing Symposium
20120220 Tri-Con Cloud Computing Symposium20120220 Tri-Con Cloud Computing Symposium
20120220 Tri-Con Cloud Computing SymposiumAndrew Su
 




Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

  • 1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org Andrew Su, Ph.D. @andrewsu asu@scripps.edu http://sulab.org Sanger/EBI September 7, 2012
  • 2. 2 Few genes are well annotated… [Plot: PubMed and Gene Ontology counts per gene, genes sorted by decreasing counts; 23,278 protein-coding genes, with 59% and 38% marked.] Genes labeled: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE. Data: NCBI gene2pubmed, August 2010
  • 3. 3 … because the literature is sparsely curated? [Plot: number of PubMed-indexed articles by year, 1979 to 2009; y-axis 0 to 1,000,000.]
  • 4. 4 … because the literature is sparsely curated? [Plot: number of articles read by typical scientist (average capacity of human scientist), 1979 to 2009; y-axis 0 to 20.]
  • 5. 5 311,696 articles (1.5% of PubMed) have been cited by GO annotations
  • 6. 6 Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  • 7. 7 The Long Tail is a prolific source of content [Plot: content produced vs. contributors (sorted), with a Short Head and a Long Tail.] News: Newspapers, Blogs; Video: TV/Hollywood, YouTube; Product reviews: Consumer reports, Amazon reviews; Food reviews: Food critics, Yelp; Talent judging: Olympics, American Idol
  • 9. 9 Wikipedia has breadth and depth Articles Words (millions) Words/ article Wikipedia Britannica Online http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
  • 10. 10 We can harness the Long Tail of scientists to directly participate in the gene annotation process.
  • 11. 11 From crowdsourcing to structured data The Gene Wiki Biological Games
  • 12. Filtering, extracting, and summarizing PubMed Documents Concepts
  • 13. 13 Wiki success depends on a positive feedback Gene wiki page utility 1 100 2 200 Number of contributors Number of users
  • 14. 14 10,000 gene “stubs” within Wikipedia Utility Users Contributors Protein structure, Gene summary, Symbols and identifiers, Gene Ontology annotations, Protein interactions, Tissue expression pattern, Linked references, Links to structured databases. Huss, PLoS Biol, 2008
  • 15. 15 Gene Wiki has a critical mass of readers Utility Total: 5.0 million views / month Users Contributors Huss, PLoS Biol, 2008; Good, NAR, 2011
  • 16. 16 Gene Wiki has a critical mass of editors Utility Editors Editor count Edit count Users Contributors Edits Increase of ~10,000 words / month from >1,000 edits Currently 1.42 million words Approximately equal to 230 full-length articles Good, NAR, 2011
  • 17. 17 A review article for every gene is powerful Reelin: 98 editors, 703 edits since July 2002 Hyperlinks to related concepts Heparin: 358 editors, 654 edits since June 2003 AMPK: 109 editors, 203 edits since March 2004 RNAi: 394 editors, 994 edits since October 2002 References to the literature
  • 18. 18 Making the Gene Wiki more reliable Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), … The company name is derived from old Greek, and means "destroyer of birds".
  • 19. 19 Making the Gene Wiki more reliable Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), … The company name is derived from old Greek, and means "destroyer of birds". High-trust author: 36211 total edits. Low-trust author: 36 total edits. http://www.wikitrust.net/
  • 20. 20 Making the Gene Wiki more computable Free text Structured annotations
  • 21. 21 Filling the gaps in gene annotation NCBI Entrez Gene: 334 Gene Wiki mapping Wikilink Candidate assertion GO:0006897 GO exact match 6319 novel GO annotations 2147 novel DO annotations
  • 23. 23 Gene Wiki content improves enrichment analysis GO term: axon guidance (GO:0007411). PubMed abstracts, concept recognition, gene list: 811 articles, 264 genes. Enrichment analysis, 2×2 table of GO:0007411 (Yes/No) against linked genes through PubMed (Yes/No): 13, 2, 251, 12033. P = 1.55 E-20
  • 24. 24 Gene Wiki content improves enrichment analysis GO term: muscle contraction (GO:0006936). PubMed abstracts, concept recognition, gene list: 251 articles, 87 genes; + Gene Wiki: 87 articles. Enrichment analysis of GO:0006936 against linked genes through PubMed: P = 1.0; against linked genes through PubMed + Gene Wiki: P = 1.22 E-09
  • 25. 25 Gene Wiki content improves enrichment analysis [Scatter plot: p-value (PubMed + GW) vs. p-value (PubMed only), with regions "More significant PubMed only" and "More significant PubMed + GW"; labeled point: Muscle contraction.]
  • 26. 26 Gene Wiki+: Crowdsourced semantic database Q: What genes are related to hemolytic anemia?
  • 27. 27 The Long Tail of scientists is a valuable source of information on gene function
  • 28. 28 From crowdsourcing to structured data The Gene Wiki Biological Games
  • 29. 29 Gene databases are numerous and overlapping … and hundreds more …
  • 30. 30 Community extensibility and user customizability http://biogps.org
  • 31. 31 Utility: A simple and universal plugin interface Utility Contributors Users
  • 32. 32 Utility: A simple and universal plugin interface Utility Contributors Users
  • 33. 33 Utility: A simple and universal plugin interface Utility Contributors Users
  • 34. 34 Utility: A simple and universal plugin interface Utility Contributors Users
  • 35. 35 Utility: A simple and universal plugin interface Utility Contributors Users
  • 36. 36 Utility: A simple and universal plugin interface Utility Contributors Users Total of 389 gene-centric online databases registered as BioGPS plugins
  • 37. 37 Users: BioGPS has critical mass Utility Daily pageviews Contributors Users • > 4100 registered users • 4000 unique visitors per week • 40,000 page views per week Top 10 organizations: 1. Harvard 2. NIH 3. UCSD 4. Scripps 5. MIT 6. Cambridge 7. U Penn 8. Stanford 9. Wash U 10. UNC
  • 38. 38 Contributors: Explicit and implicit knowledge Utility Contributors Users 389 plugins registered (65% publicly shared) by over 75 users spanning 150+ domains
  • 40. 40 Defining a data extraction template TP53 TNF APOE IL6 VEGF EGFR TGFB1 … …
  • 41. 41 The BioGPS Semantic Annotator http://50.112.124.237
  • 42. 42 The Long Tail of bioinformaticians can collaboratively build a gene portal.
  • 43. 43 From crowdsourcing to structured data The Gene Wiki Biological Games
  • 44. 44 Seven million human hours http://www.flickr.com/photos/archana3k1/4124330493/
  • 45. 45 Twenty million human hours http://www.flickr.com/photos/ableman/2171326385/
  • 46. 46 - 150 billion human hours per year http://www.flickr.com/photos/rvp-cw/6243289302/
  • 47. 47 Using games to fold proteins Fold.it players have successfully: • Outperformed state of the art protein folding algorithms (Cooper, Nature, 2010) • Solved a previously-intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011) • Designed an improved protein folding algorithm (Khatib, PNAS, 2011) • Improved enzyme activity of de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
  • 48. 48 Using games to fold RNAs http://eterna.cmu.edu/
  • 49. 49 Using games to align sequences http://phylo.cs.mcgill.ca
  • 50. 50 Using games to annotate genes? http://genegames.org
  • 51. 51 No good gene-disease annotation database Query: Apolipoprotein E Alzheimer's disease (AD) Lipoprotein glomerulopathy Sea-blue histiocyte disease
  • 52. 52 No good gene-disease annotation database Query: Apolipoprotein E Alzheimer's disease (AD) Lipoprotein glomerulopathy Sea-blue histiocyte disease Hyperlipoproteinemia, type III Macular degeneration, age-related Myocardial infarction susceptibility
  • 53. 53 No good gene-disease annotation database Query: Apolipoprotein E ? Alzheimer's disease (AD) ? Lipoprotein glomerulopathy ? Sea-blue histiocyte disease Hyperlipoproteinemia, type III ? Macular degeneration, age-related ? Myocardial infarction susceptibility HIV Psoriasis Vascular Diseases
  • 54. 54 No good gene-disease annotation database Query: Apolipoprotein E Alzheimer's disease (AD) Memory Coronary Artery Disease Neuropsychological Tests Hypertension Cognition Disorders Mental Status Schedule Psychiatric Status Rating Dementia Scales Cognition Hyperlipidemias Atrophy Disease Progression Dementia, Vascular Cardiovascular Diseases Parkinson Disease Brain Injuries Coronary Disease Myocardial Infarction Diabetes Mellitus, Type 2 … Memory Disorders 477 diseases!
  • 55. 55 Play Dizeez to annotate gene-disease links 1. Read the clue (gene) 2. Click the related disease (only one is “right”) 3. If it's 'right', you get points 4. Then on to the next question… 5. Hurry! 6. Play to win!
  • 56. 56 Dizeez players seem pretty smart… In total (since Dec 2011): • 207 unique gamers • 1045 games played • 8525 guesses. Table columns: # Occurrences, Gene, Disease, Pubmed, OMIM, PharmGKB, Gene Wiki. Rows: 7 GAST gastrinoma; 7 RBP3 retinoblastoma; 7 SSX1 synovial sarcoma; 6 TG Graves' disease; 6 CRYGC Cataract; 6 SOX8 mental retardation; 6 WRN Werner syndrome; 6 ABL1 leukemia; 6 MLL3 leukemia; 6 SNAI2 breast carcinoma
  • 57. 57 Dizeez players seem pretty smart… In total (since Dec 2011): • 207 unique gamers • 1045 games played • 8525 guesses. Table columns: # Occurrences, Gene, Disease, Pubmed, OMIM, PharmGKB, Gene Wiki. Rows: 5 MECOM sarcoma; 4 ATF7 cancer; 3 ABCB5 acute myeloid leukemia; 3 SART1 glioblastoma; 3 NCK1 leukemia; 3 NEK1 cancer
  • 58. 58 Using games to predict phenotype from genotype? The Cure http://genegames.org
  • 59. 59 Classification problems in genome biology Find patterns in training data (100s samples, 100,000s features; cancer vs. normal), then classify new samples as cancer or normal. Methods: SVM, Neural networks, Naïve Bayes, KNN, …
  • 60. 60 Random forests Sample subset of cases and features; train decision tree (100,000s features, 100s samples; cancer vs. normal)
  • 61. 61 Random forests (100,000s features, 100s samples; cancer vs. normal)
  • 62. 62 Random forests Classify new samples as cancer or normal (100,000s features, 100s samples). How to interject biological knowledge?
  • 63. 63 Network-guided forests Dutkowski & Ideker (2011). PLoS Computational Biology
  • 64. 64 Network-guided forests Sample features by PPI network; train decision tree (100,000s features, 100s samples; cancer vs. normal)
  • 65. 65 Human-guided forests Sample features by human intelligence; train decision tree (100,000s features, 100s samples; cancer vs. normal)
  • 66. 66
  • 67. 67 The Cure: Genomic predictors for disease
  • 68. 68 The Cure: Genomic predictors for disease
  • 69. 69 The Cure: Genomic predictors for disease
  • 70. 70 The Cure: Genomic predictors for disease
  • 71. 71 The Cure: Genomic predictors for disease
  • 72. 72 The Cure: Genomic predictors for disease
  • 73. 73 Human-guided forests Classify new samples cancer normal
  • 74. 74 “Critical Assessment”-style challenge Will this work? Check our blog after October 15.
  • 75. 75 The Long Tail of gamers can collaboratively build an accurate disease classifier.
  • 76. 76 Group members: Ben Good, Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; Many Wikipedia editors; WP:MCB Project. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su, @genegame. Recruiting graduate students in quantitative biology! See http://education.scripps.edu/ Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820)
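
Slides 60 to 65 describe forests whose trees are trained on sampled cases and sampled features, with the feature sampling steered by outside knowledge (a PPI network, or human intelligence). The sketch below is one hypothetical way to express that idea, not the method used in The Cure: each tree is a one-feature threshold rule, cases are bootstrapped, and features are drawn in proportion to a `weights` list that stands in for how strongly people favored each gene. The toy data, the weights, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter

LABELS = ("cancer", "normal")

def train_stump(rows, labels, features):
    """Return the one-feature threshold rule with the fewest training errors."""
    best_errors, best_rule = len(rows) + 1, None
    for f in sorted(features):
        for t in sorted({row[f] for row in rows}):
            for above, below in (LABELS, LABELS[::-1]):
                errors = sum((above if row[f] > t else below) != label
                             for row, label in zip(rows, labels))
                if errors < best_errors:
                    best_errors, best_rule = errors, (f, t, above, below)
    return best_rule

def predict_stump(rule, row):
    f, t, above, below = rule
    return above if row[f] > t else below

def guided_forest(rows, labels, weights, n_trees=25, n_features=2, seed=0):
    """Bootstrap the cases; draw each tree's features in proportion to `weights`."""
    rng = random.Random(seed)
    feature_ids = list(range(len(weights)))
    forest = []
    for _ in range(n_trees):
        picked = [rng.randrange(len(rows)) for _ in rows]  # sample cases with replacement
        features = set(rng.choices(feature_ids, weights=weights, k=n_features))
        forest.append(train_stump([rows[i] for i in picked],
                                  [labels[i] for i in picked], features))
    return forest

def classify(forest, row):
    """Majority vote over all trees."""
    votes = Counter(predict_stump(rule, row) for rule in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 are noise.
rows = [
    [5.1, 0.3, 1.0], [5.8, 0.9, 0.2], [6.0, 0.1, 0.7], [5.5, 0.6, 0.4],
    [1.2, 0.4, 0.9], [1.9, 0.8, 0.1], [1.0, 0.2, 0.6], [1.6, 0.7, 0.3],
]
labels = ["cancer"] * 4 + ["normal"] * 4
weights = [8, 1, 1]  # assumed human preference for feature 0

forest = guided_forest(rows, labels, weights)
print(classify(forest, [7.0, 0.5, 0.5]), classify(forest, [0.5, 0.5, 0.5]))
```

Setting all weights equal would reduce this to ordinary random feature sampling, which is the contrast slides 62 and 65 draw.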

Editor's Notes

  1. We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
  2. If you believe that greater than 1.5% of articles have relevance to gene function, then it says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  3. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  4. Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
  5. Reverted four minutes later
  6. Reverted four minutes later
  7. Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  8. Tried on 773 GO categories, significant in 356 cases (46%)
  9. We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  10. Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  11. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  12. MODs and portals
  13. Genetics resources
  14. Literature resources
  15. Protein resources
  16. Pathway and expression databases
  17. Pathway and expression databases
  18. Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  19. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  20. Empire state building
  21. Question: how to interject biological knowledge in the feature selection process?
  22. Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
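
Notes 8 and 9 refer to the enrichment analysis on slide 23, which reports the counts 13, 2, 251 and 12033 and P = 1.55 E-20. The source does not say which test produced that value; the sketch below assumes a one-sided Fisher's exact test, that is, a hypergeometric tail probability. The function name and variable names are illustrative only. The result does not depend on which table margin is treated as rows or columns.

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(overlap >= k) when n genes are drawn from N, of which K carry the term."""
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, min(K, n) + 1))
    return favorable / comb(N, n)

# 2x2 counts as listed on slide 23.
a, b, c, d = 13, 2, 251, 12033
total = a + b + c + d

# Overlap of 13 between one margin (a + c genes) and the other (a + b genes).
p_value = hypergeom_tail(a, a + c, a + b, total)
print(f"{p_value:.2e}")
```

Under this assumption the tail probability is on the order of 1e-20, in line with the value on the slide. Repeating the calculation per GO term is what would yield a tally like the one in note 8 (356 of 773 categories significant, 46%).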