1
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org

October 30, 2012
2
Few genes are well annotated…

[Figure: annotation counts per gene, with genes sorted by decreasing counts. Two series: PubMed and Gene Ontology. Labeled top genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE. Callouts: 59%, 38%, and 23,278 protein-coding genes.]

Data: NCBI gene2pubmed, August 2010
3
… because the literature is sparsely curated?

[Figure: number of PubMed-indexed articles, plotted by year from 1979 to 2009; y-axis from 0 to 1,000,000.]
4
… because the literature is sparsely curated?

[Figure: two overlaid plot titles that did not extract cleanly; the recoverable words are "number of articles read", "average capacity", and "typical human scientist". x-axis from 1979 to 2009; y-axis from 0 to 20.]
5




311,696 articles (1.5% of PubMed)
have been cited by GO annotations
6




Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
7
The Long Tail is a prolific source of content

[Figure: content produced versus contributors (sorted), with the Short Head on the left and the Long Tail on the right.]

                      Short Head           Long Tail
   News:              Newspapers           Blogs
   Video:             TV/Hollywood         YouTube
   Product reviews:   Consumer reports     Amazon reviews
   Food reviews:      Food critics         Yelp
   Talent judging:    Olympics             American Idol
8
Wikipedia is reasonably accurate
9
Wikipedia has breadth and depth

[Figure: Wikipedia versus Britannica Online, compared on articles, words (millions), and words per article.]

Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
10




We can harness the Long Tail of scientists to directly participate in the gene annotation process.
11
From crowdsourcing to structured data



                   The Gene Wiki




                Biological Games
12
Filtering, extracting, and summarizing PubMed

[Diagram labels: Documents, Concepts, Review article]
13
Filtering, extracting, and summarizing PubMed

[Diagram labels: Documents, Concepts]
14
Wiki success depends on a positive feedback loop

[Diagram: a cycle linking Gene Wiki page utility, number of contributors, and number of users. The numbers 1, 2, 100, and 200 also appear in the diagram.]
15
10,000 gene “stubs” within Wikipedia

[Figure: an annotated gene stub page. Labeled elements: gene summary, protein structure, symbols and identifiers, Gene Ontology annotations, protein interactions, tissue expression pattern, linked references, and links to structured databases. Inset: the Utility / Users / Contributors cycle.]

Huss, PLoS Biol, 2008
16
Gene Wiki has a critical mass of readers

Total: 5.0 million views / month

[Inset: the Utility / Users / Contributors cycle.]

Huss, PLoS Biol, 2008; Good, NAR, 2011
17
Gene Wiki has a critical mass of editors

[Figure: editor count (series "Editors") and edit count (series "Edits"). Inset: the Utility / Users / Contributors cycle.]

Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles

Good, NAR, 2011
18
A review article for every gene is powerful




     Reelin: 98 editors, 703 edits since July 2002
     Heparin: 358 editors, 654 edits since June 2003
     AMPK: 109 editors, 203 edits since March 2004
     RNAi: 394 editors, 994 edits since October 2002

     Callouts: Hyperlinks to related concepts; References to the literature
19
 The Gene Wiki is (reasonably) reliable

                 Per-edit probability   Average lifetime   Probability by time
   Good edits    98.9%                  115.4 d            99.968%
   Vandalism     1.1%                   3.4 d              0.032% (0.63% for WP overall)

[Figure: cumulative edits by date.]



Good, NAR, 2011
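
One plausible reading of the "probability by time" column is that each edit type's per-edit probability is weighted by how long such an edit survives on the page. The slide does not state how the column was derived, so the sketch below is an assumption; the function name `time_weighted_share` is invented for illustration. With the figures shown on the slide it gives a vandalism share of about 0.03%, in line with the reported 0.032%.

```python
def time_weighted_share(p_edit, lifetime_days):
    """Weight each edit type's probability by its average lifetime, then normalize.

    Assumed reading of the slide's "probability by time" column, not a documented method.
    """
    weights = {kind: p_edit[kind] * lifetime_days[kind] for kind in p_edit}
    total = sum(weights.values())
    return {kind: weight / total for kind, weight in weights.items()}


# Figures copied from the slide.
p_edit = {"good": 0.989, "vandalism": 0.011}
lifetime_days = {"good": 115.4, "vandalism": 3.4}

share = time_weighted_share(p_edit, lifetime_days)
print(f"vandalism share of page time: {share['vandalism']:.4%}")
```

The point of the weighting is that vandalism is reverted quickly, so a reader is far less likely to encounter it than the 1.1% per-edit rate suggests.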
20
Making the Gene Wiki more reliable
   Left text: Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …
   Right text: The company name is derived from old Greek, and means "destroyer of birds".
21
Making the Gene Wiki more reliable
   High-trust author (36211 total edits): Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …
   Low-trust author (36 total edits): The company name is derived from old Greek, and means "destroyer of birds".

   http://www.wikitrust.net/
22
Making the Gene Wiki more computable



Free text       Structured annotations
23
Filling the gaps in gene annotation

[Diagram: a Gene Wiki page is mapped to a gene (NCBI Entrez Gene: 334); a wikilink on that page is an exact match to a GO term (GO:0006897); together these give a candidate assertion.]

                 6319 novel GO annotations
                 2147 novel DO annotations
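
The mapping on this slide can be sketched as a lookup: a page maps to a gene identifier, each wikilink on the page is compared against ontology term labels, and an exact match produces a candidate gene-term assertion. This is a minimal illustration under assumptions, not the Gene Wiki pipeline itself. The function name, the page title "Example gene page", and the link list are hypothetical; the identifiers 334, GO:0006897, and GO:0007411 are taken from the slides.

```python
def candidate_assertions(page_to_gene, page_links, go_label_to_id):
    """Return (entrez_gene_id, go_id) pairs for wikilinks that exactly match a GO label."""
    results = []
    for page, links in page_links.items():
        gene_id = page_to_gene.get(page)
        if gene_id is None:
            continue  # page is not mapped to a gene, so nothing can be asserted
        for link in links:
            # Exact match after trimming whitespace and lowercasing the link text.
            go_id = go_label_to_id.get(link.strip().lower())
            if go_id is not None:
                results.append((gene_id, go_id))
    return results


# Hypothetical page title and wikilinks, for illustration only.
page_to_gene = {"Example gene page": 334}
page_links = {"Example gene page": ["Endocytosis", "Cell membrane", "Protein"]}
go_label_to_id = {"endocytosis": "GO:0006897", "axon guidance": "GO:0007411"}

candidates = candidate_assertions(page_to_gene, page_links, go_label_to_id)
print(candidates)
```

Each candidate would still need review before being counted as a novel annotation; the sketch only covers the matching step.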
24




TOP 100
GENES
25
Gene Wiki content improves enrichment analysis
[Workflow: GO term axon guidance (GO:0007411), then gene list (264 genes), then PubMed abstracts (811 articles), then concept recognition, then enrichment analysis.]

                                        GO:0007411
                                       Yes       No
   Linked genes through PubMed   Yes    13        2
                                 No    251    12033

   P = 1.55 E-20
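
The slide does not name the statistical test behind this P value. A common choice for a 2x2 enrichment table is a one-sided Fisher's exact test, and the sketch below assumes that test. It uses only the standard library, and the function name `fisher_one_sided` is invented for illustration. Run on the counts above, it yields a value on the order of 1e-20, consistent with the figure on the slide.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """Upper-tail Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) when X follows the hypergeometric distribution
    implied by the table's fixed row and column totals.
    """
    row1 = a + b              # genes linked through PubMed
    col1 = a + c              # genes annotated with the GO term
    total = a + b + c + d     # all genes in the table
    upper = min(row1, col1)   # largest overlap the margins allow
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, row1)


# Counts copied from the slide's contingency table.
p_value = fisher_one_sided(13, 2, 251, 12033)
print(f"one-sided P = {p_value:.3g}")
```

Exact integer arithmetic with `math.comb` avoids the underflow that a naive floating-point product of factorials would hit at these magnitudes.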
26
Gene Wiki content improves enrichment analysis
[Workflow: GO term muscle contraction (GO:0006936), then gene list (87 genes), then PubMed abstracts (251 articles) + Gene Wiki (87 articles), then concept recognition, then enrichment analysis.]

   Enrichment for GO:0006936
   Linked genes through PubMed:               P = 1.0
   Linked genes through PubMed + Gene Wiki:   P = 1.22 E-09
27
Gene Wiki content improves enrichment analysis



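[Figure: scatter plot of enrichment p-values, with p-value (PubMed only) on the x-axis and p-value (PubMed + GW) on the y-axis. One region is labeled "More significant PubMed only" and the other "More significant PubMed + GW". The muscle contraction term is highlighted.]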
28
Gene Wiki+: Crowdsourced semantic database
 Q: What genes are related to hemolytic anemia?
29




The Long Tail of scientists is a valuable source of information on gene function.
30
From crowdsourcing to structured data



                   The Gene Wiki




                Biological Games
31
Gene databases are numerous and overlapping




                            … and hundreds more …
32
Community extensibility and user customizability




                   http://biogps.org
33
Utility: A simple and universal plugin interface
         Utility




Contributors       Users
38
Utility: A simple and universal plugin interface
         Utility




Contributors         Users




Total of 389 gene-centric online databases registered as BioGPS plugins
39
Users: BioGPS has critical mass
[Figure: daily pageviews. Inset: the Utility / Users / Contributors cycle.]

 • > 5000 registered users
 • 13,500 unique visitors per month
 • 155,000 page views per week

 Top 10 organizations
 1.  Harvard     6.  Cambridge
 2.  NIH         7.  U Penn
 3.  UCSD        8.  Stanford
 4.  Scripps     9.  Wash U
 5.  MIT         10. UNC
40
Contributors: Explicit and implicit knowledge
         Utility




Contributors       Users




     389 plugins registered (65% publicly shared) by over 75 users spanning 150+ domains
41
Mining structured content from HTML
42
Defining a data extraction template
        TP53   TNF   APOE   IL6   VEGF EGFR TGFB1   …
  …
43
The BioGPS Semantic Annotator




              http://50.112.124.237
44




The Long Tail of bioinformaticians can collaboratively build a gene portal.
45
From crowdsourcing to structured data



                   The Gene Wiki




                Biological Games
46



Seven million human hours




                            http://www.flickr.com/photos/archana3k1/4124330493/
47



Twenty million human hours




                             http://www.flickr.com/photos/ableman/2171326385/
48
150 billion human hours per year




                              http://www.flickr.com/photos/rvp-cw/6243289302/
49
Using games to fold proteins



      Fold.it players have successfully:
      • Outperformed state of the art protein
        folding algorithms (Cooper, Nature, 2010)
      • Solved a previously-intractable crystal
        structure (Khatib, Nat Struct Mol Biol, 2011)
      • Designed an improved protein folding
        algorithm (Khatib, PNAS, 2011)
      • Improved enzyme activity of de novo
        designed enzyme (Eiben, Nat Biotechnol, 2011)
50
Using games to fold RNAs




              http://eterna.cmu.edu/
51
Using games to align sequences




              http://phylo.cs.mcgill.ca
52
Using games to annotate genes?




              http://genegames.org
53
No good gene-disease annotation database
             Query: Apolipoprotein E




            Alzheimer's disease (AD)
            Lipoprotein glomerulopathy
            Sea-blue histiocyte disease
54
No good gene-disease annotation database
             Query: Apolipoprotein E




            Alzheimer's disease (AD)
            Lipoprotein glomerulopathy
            Sea-blue histiocyte disease
            Hyperlipoproteinemia, type III
            Macular degeneration, age-related
            Myocardial infarction susceptibility
55
No good gene-disease annotation database
              Query: Apolipoprotein E




           ? Alzheimer's disease (AD)
           ? Lipoprotein glomerulopathy
           ? Sea-blue histiocyte disease
             Hyperlipoproteinemia, type III
           ? Macular degeneration, age-related
           ? Myocardial infarction susceptibility
             HIV
             Psoriasis
             Vascular Diseases
56
No good gene-disease annotation database
             Query: Apolipoprotein E




            Alzheimer's disease (AD)    Memory
            Neuropsychological Tests    Coronary Artery Disease
            Cognition Disorders         Hypertension
            Dementia                    Mental Status Schedule
            Cognition                   Psychiatric Status Rating Scales
            Disease Progression         Hyperlipidemias
            Cardiovascular Diseases     Atrophy
            Coronary Disease            Dementia, Vascular
            Diabetes Mellitus, Type 2   Parkinson Disease
            Memory Disorders            Brain Injuries
                                        Myocardial Infarction
                                        …

            477 diseases!
57
Play Dizeez to annotate gene-disease links
            1. Read the clue (gene)
            2. Click the related disease (only one is “right”)
            3. If it's “right”, you get points
            4. Then on to the next question…
            5. Hurry!
            6. Play to win!
58
Dizeez players seem pretty smart…

 In total (since Dec 2011):
 • 230 unique gamers
 • 1045 games played
 • 8525 guesses

 # Occurrences   Gene      Disease
 11              NBPF3     neuroblastoma
 11              SOX8      mental retardation
 9               ABL1      leukemia
 9               SSX1      synovial sarcoma
 8               APC       colorectal cancer
 8               FES       sarcoma
 8               RBP3      retinoblastoma
 8               GAST      gastrinoma
 8               DCC       colorectal cancer
 8               MAP3K5    cancer

 [The slide also has Gene Wiki, OMIM, PharmGKB, and PubMed columns; their cell values were not captured in the text.]
59
Using games to predict phenotype from genotype?




               http://genegames.org
60
Classification problems in genome biology

[Diagram: a training matrix of 100s of samples (cancer and normal) by 100,000s of features. A classifier (SVM, neural networks, Naïve Bayes, KNN, …) finds patterns and is then used to classify new samples as cancer or normal.]
61
Random forests
[Diagram: from a training matrix of 100s of samples (cancer and normal) by 100,000s of features, sample a subset of cases and features, then train a decision tree.]
62
Random forests


[Diagram: a training matrix of 100s of samples (cancer and normal) by 100,000s of features.]
63
Random forests

[Diagram: a forest trained on 100s of samples (cancer and normal) by 100,000s of features classifies new samples as cancer or normal.]

How to interject biological knowledge?
64
Network-guided forests




                         Dutkowski & Ideker (2011). PLoS Computational Biology
65
Network-guided forests
[Diagram: from a training matrix of 100s of samples (cancer and normal) by 100,000s of features, sample features by PPI network, then train a decision tree.]
66
Human-guided forests
[Diagram: from a training matrix of 100s of samples (cancer and normal) by 100,000s of features, sample features by human intelligence, then train a decision tree.]
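
The forest variants on these slides differ only in how features are chosen for each tree: uniformly at random, guided by a PPI network, or guided by human intelligence. The sketch below illustrates that one idea with a deliberately tiny stand-in: an ensemble of one-split trees (decision stumps), each trained on a bootstrap sample of cases and a single feature drawn in proportion to a prior weight. It is not a full random forest, and it is not the method of Dutkowski & Ideker or of The Cure. All names, weights, and data values are invented for illustration.

```python
import random


def train_stump(rows, labels, feature):
    """One-split tree on one feature: threshold at the midpoint of the class means."""
    pos = [r[feature] for r, y in zip(rows, labels) if y == 1]
    neg = [r[feature] for r, y in zip(rows, labels) if y == 0]
    if not pos or not neg:
        return None  # bootstrap sample contained a single class
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return feature, (mean_pos + mean_neg) / 2, mean_pos > mean_neg


def predict_stump(stump, row):
    feature, threshold, pos_is_high = stump
    return int((row[feature] > threshold) == pos_is_high)


def train_guided_forest(rows, labels, feature_weights, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, drawing each feature by its prior weight."""
    rng = random.Random(seed)
    features = list(range(len(feature_weights)))
    forest = []
    while len(forest) < n_trees:
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample cases with replacement
        feature = rng.choices(features, weights=feature_weights, k=1)[0]
        stump = train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        if stump is not None:
            forest.append(stump)
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps; 1 = cancer, 0 = normal."""
    votes = sum(predict_stump(stump, row) for stump in forest)
    return int(votes * 2 > len(forest))


# Toy data: feature 0 separates the classes, features 1 to 3 are noise.
rows = [
    [5.1, 0.2, 1.0, 3.0], [4.8, 0.9, 2.0, 1.0], [5.3, 0.4, 1.5, 2.0], [4.9, 0.7, 0.5, 4.0],
    [1.2, 0.3, 1.0, 3.0], [0.8, 0.8, 2.0, 1.0], [1.1, 0.5, 1.5, 2.0], [0.9, 0.6, 0.5, 4.0],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical prior: a network or a human player favors feature 0.
feature_weights = [0.85, 0.05, 0.05, 0.05]
forest = train_guided_forest(rows, labels, feature_weights)
print(predict_forest(forest, [5.0, 0.5, 1.0, 2.0]))
```

Setting all weights equal recovers uniform feature sampling, so the same code covers the plain and the guided cases. With far more features than samples, a good prior concentrates the trees on features that are more likely to matter.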
67
68
The Cure: Genomic predictors for disease
74
Human-guided forests

[Diagram: the human-guided forest classifies new samples as cancer or normal.]
75
“Critical Assessment”-style challenge
76
Preliminary results


• 214 registered players
   – 50% declared knowledge of cancer biology
   – 40% self-identified as having a Ph.D.
• Prediction results
   – 69% correct on survival concordance index
   – Best scoring model was 72%
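
For readers unfamiliar with the metric, a survival concordance index is the fraction of comparable sample pairs in which the sample predicted to be at higher risk is the one with the shorter observed survival. The slide does not say which variant was used, so the sketch below assumes the common Harrell-style definition, in which a pair is comparable only when the shorter survival time ends in an observed event. The function name and the example values are invented for illustration.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style concordance index for right-censored survival data.

    times: observed follow-up time per sample
    events: 1 if the event was observed, 0 if the sample was censored
    risk_scores: model output, where higher means shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable only if sample i had an observed event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Invented example: four samples, the third one censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risk_scores)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the 69% and 72% figures above would be read under this definition.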
77




The Long Tail of gamers can collaboratively build an accurate disease classifier.
78
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members
Ben Good
Max Nanis
Salvatore Loguercio
Chunlei Wu
Ian Macleod

Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su

Recruiting graduate students in quantitative biology! See http://education.scripps.edu/

Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)

Mais conteúdo relacionado

Semelhante a Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org

Crowdsourcing to structure biological knowledge (USC/ISI)
Crowdsourcing to structure biological knowledge (USC/ISI)Crowdsourcing to structure biological knowledge (USC/ISI)
Crowdsourcing to structure biological knowledge (USC/ISI)Andrew Su
 
Gene Wiki at Phenotype RCN annual meeting
Gene Wiki at Phenotype RCN annual meetingGene Wiki at Phenotype RCN annual meeting
Gene Wiki at Phenotype RCN annual meetingBenjamin Good
 
Wikipedia as an engine for scientific communication and collaboration at mass...
Wikipedia as an engine for scientific communication and collaboration at mass...Wikipedia as an engine for scientific communication and collaboration at mass...
Wikipedia as an engine for scientific communication and collaboration at mass...Andrew Su
 
NCBO Webinar: Translating unstructured, crowdsourced content into structured ...
NCBO Webinar: Translating unstructured, crowdsourced content into structured ...NCBO Webinar: Translating unstructured, crowdsourced content into structured ...
NCBO Webinar: Translating unstructured, crowdsourced content into structured ...Andrew Su
 
Introduction to Ontologies for Environmental Biology
Introduction to Ontologies for Environmental BiologyIntroduction to Ontologies for Environmental Biology
Introduction to Ontologies for Environmental BiologyBarry Smith
 
Developing Frameworks and Tools for Animal Trait Ontology (ATO)
Developing Frameworks and Tools for Animal Trait Ontology (ATO) Developing Frameworks and Tools for Animal Trait Ontology (ATO)
Developing Frameworks and Tools for Animal Trait Ontology (ATO) Jie Bao
 
PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011
PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011
PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011rebshoe
 
How Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open ScienceHow Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open Sciencedrnigam
 
Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Inte...
Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Inte...Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Inte...
Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Inte...Amit Sheth
 
BioCuration 2019 - Evidence and Conclusion Ontology 2019 Update
BioCuration 2019 - Evidence and Conclusion Ontology 2019 UpdateBioCuration 2019 - Evidence and Conclusion Ontology 2019 Update
BioCuration 2019 - Evidence and Conclusion Ontology 2019 Updatedolleyj
 
Lock - PomBase community curation
Lock - PomBase community curationLock - PomBase community curation
Lock - PomBase community curationPascale Gaudet
 
Bio-ontologies in bioinformatics: Growing up challenges
Bio-ontologies in bioinformatics: Growing up challengesBio-ontologies in bioinformatics: Growing up challenges
Bio-ontologies in bioinformatics: Growing up challengesJanna Hastings
 
ABIcurator.doc
ABIcurator.docABIcurator.doc
ABIcurator.docbutest
 
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.orgCrowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.orgAndrew Su
 
Introduction to EOL.org for scientists
Introduction to EOL.org for scientistsIntroduction to EOL.org for scientists
Introduction to EOL.org for scientistsCyndy Parr
 
Eol fellow-march2010
Eol fellow-march2010Eol fellow-march2010
Eol fellow-march2010tgarnett
 
Species pages and portals
Species pages and portals Species pages and portals
Species pages and portals Cyndy Parr
 
Web Apollo: Lessons learned from community-based biocuration efforts.
Web Apollo: Lessons learned from community-based biocuration efforts.Web Apollo: Lessons learned from community-based biocuration efforts.
Web Apollo: Lessons learned from community-based biocuration efforts.Monica Munoz-Torres
 

Semelhante a Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (20)

Crowdsourcing to structure biological knowledge (USC/ISI)
Crowdsourcing to structure biological knowledge (USC/ISI)Crowdsourcing to structure biological knowledge (USC/ISI)
Crowdsourcing to structure biological knowledge (USC/ISI)
 
Gene Wiki at Phenotype RCN annual meeting
Gene Wiki at Phenotype RCN annual meetingGene Wiki at Phenotype RCN annual meeting
Gene Wiki at Phenotype RCN annual meeting
 
bioinformatics enabling knowledge generation from agricultural omics data
bioinformatics enabling knowledge generation from agricultural omics databioinformatics enabling knowledge generation from agricultural omics data
bioinformatics enabling knowledge generation from agricultural omics data
 
Wikipedia as an engine for scientific communication and collaboration at mass...
Wikipedia as an engine for scientific communication and collaboration at mass...Wikipedia as an engine for scientific communication and collaboration at mass...
Wikipedia as an engine for scientific communication and collaboration at mass...
 
NCBO Webinar: Translating unstructured, crowdsourced content into structured ...
NCBO Webinar: Translating unstructured, crowdsourced content into structured ...NCBO Webinar: Translating unstructured, crowdsourced content into structured ...
NCBO Webinar: Translating unstructured, crowdsourced content into structured ...
 
Introduction to Ontologies for Environmental Biology
Introduction to Ontologies for Environmental BiologyIntroduction to Ontologies for Environmental Biology
Introduction to Ontologies for Environmental Biology
 
BioPortal: ontologies and integrated data resources at the click of a mouse
BioPortal: ontologies and integrated data resourcesat the click of a mouseBioPortal: ontologies and integrated data resourcesat the click of a mouse
BioPortal: ontologies and integrated data resources at the click of a mouse
 
Developing Frameworks and Tools for Animal Trait Ontology (ATO)
Developing Frameworks and Tools for Animal Trait Ontology (ATO) Developing Frameworks and Tools for Animal Trait Ontology (ATO)
Developing Frameworks and Tools for Animal Trait Ontology (ATO)
 
PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011
PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011
PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011
 
How Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open ScienceHow Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open Science
 
Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Inte...
Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Inte...Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Inte...
Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Inte...
 
BioCuration 2019 - Evidence and Conclusion Ontology 2019 Update
BioCuration 2019 - Evidence and Conclusion Ontology 2019 UpdateBioCuration 2019 - Evidence and Conclusion Ontology 2019 Update
BioCuration 2019 - Evidence and Conclusion Ontology 2019 Update
 
Lock - PomBase community curation
Lock - PomBase community curationLock - PomBase community curation
Lock - PomBase community curation
 
Bio-ontologies in bioinformatics: Growing up challenges
Bio-ontologies in bioinformatics: Growing up challengesBio-ontologies in bioinformatics: Growing up challenges
Bio-ontologies in bioinformatics: Growing up challenges
 
ABIcurator.doc
ABIcurator.docABIcurator.doc
ABIcurator.doc
 
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.orgCrowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
 
Introduction to EOL.org for scientists
Introduction to EOL.org for scientistsIntroduction to EOL.org for scientists
Introduction to EOL.org for scientists
 
Eol fellow-march2010
Eol fellow-march2010Eol fellow-march2010
Eol fellow-march2010
 
Species pages and portals
Species pages and portals Species pages and portals
Species pages and portals
 
Web Apollo: Lessons learned from community-based biocuration efforts.
Web Apollo: Lessons learned from community-based biocuration efforts.Web Apollo: Lessons learned from community-based biocuration efforts.
Web Apollo: Lessons learned from community-based biocuration efforts.
 

Mais de Andrew Su

Building and mining a heterogeneous biomedical knowledge graph
Building and mining a heterogeneous biomedical knowledge graphBuilding and mining a heterogeneous biomedical knowledge graph
Building and mining a heterogeneous biomedical knowledge graphAndrew Su
 
Wikidata as a FAIR knowledge graph for the life sciences

Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org

  • 1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org Andrew Su, Ph.D. @andrewsu asu@scripps.edu http://sulab.org October 30, 2012
  • 2. Few genes are well annotated… [Plot: PubMed and Gene Ontology counts per gene, genes sorted by decreasing counts; 23,278 protein-coding genes; callouts at 59% and 38%; top genes labeled TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE] Data: NCBI gene2pubmed, August 2010
  • 3. … because the literature is sparsely curated? [Plot: number of PubMed-indexed articles, 1979 to 2009; y-axis 0 to 1,000,000]
  • 4. … because the literature is sparsely curated? [Plot: number of articles read by typical scientist and average capacity of human scientist, 1979 to 2009; y-axis 0 to 20]
  • 5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
  • 6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  • 7. The Long Tail is a prolific source of content [Plot: content produced vs. contributors (sorted), with a Short Head and a Long Tail] News: newspapers vs. blogs. Video: TV/Hollywood vs. YouTube. Product reviews: Consumer Reports vs. Amazon reviews. Food reviews: food critics vs. Yelp. Talent judging: Olympics vs. American Idol.
  • 9. Wikipedia has breadth and depth [Chart: articles, words (millions), and words/article for Wikipedia and Britannica Online] http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
  • 10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
  • 11. From crowdsourcing to structured data: The Gene Wiki, Biological Games
  • 12. Filtering, extracting, and summarizing [Diagram labels: PubMed, Documents, Concepts, Review article]
  • 13. Filtering, extracting, and summarizing [Diagram labels: PubMed, Documents, Concepts]
  • 14. Wiki success depends on a positive feedback loop [Diagram: Gene Wiki page utility, number of users, number of contributors]
  • 15. 10,000 gene “stubs” within Wikipedia [Feedback-loop diagram: Utility, Users, Contributors] Stub contents: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases. Huss, PLoS Biol, 2008
  • 16. Gene Wiki has a critical mass of readers [Feedback-loop diagram: Utility, Users, Contributors] Total: 5.0 million views / month. Huss, PLoS Biol, 2008; Good, NAR, 2011
  • 17. Gene Wiki has a critical mass of editors [Feedback-loop diagram: Utility, Users, Contributors] [Plot: editor count and edit count; series labeled Editors and Edits] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. Good, NAR, 2011
  • 18. A review article for every gene is powerful. Reelin: 98 editors, 703 edits since July 2002. Heparin: 358 editors, 654 edits since June 2003. AMPK: 109 editors, 203 edits since March 2004. RNAi: 394 editors, 994 edits since October 2002. Callouts: hyperlinks to related concepts; references to the literature
  • 19. The Gene Wiki is (reasonably) reliable. Good edits: per-edit probability 98.9%, average lifetime 115.4 d, probability by time 99.968%. Vandalism: per-edit probability 1.1%, average lifetime 3.4 d, probability by time 0.032% (0.63% for WP overall). [Plot: cumulative edits by date] Good, NAR, 2011
  • 20. Making the Gene Wiki more reliable. Example passage: “Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …” followed by “The company name is derived from old Greek, and means "destroyer of birds".”
  • 21. Making the Gene Wiki more reliable. Example passage: “Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …” followed by “The company name is derived from old Greek, and means "destroyer of birds".” [Author reputation overlay: high-trust author, 36211 total edits; low-trust author, 36 total edits] http://www.wikitrust.net/
  • 22. Making the Gene Wiki more computable: free text to structured annotations
  • 23. Filling the gaps in gene annotation [Diagram labels: NCBI Entrez Gene: 334; Gene Wiki mapping; Wikilink; Candidate assertion; GO:0006897; GO exact match] 6319 novel GO annotations; 2147 novel DO annotations
  • 25. Gene Wiki content improves enrichment analysis. GO term: axon guidance (GO:0007411). [Pipeline labels: GO term, PubMed abstracts, concept recognition, gene list, enrichment analysis; 811 articles; 264 genes] Contingency table (rows: linked genes through PubMed, Yes/No; columns: GO:0007411, Yes/No): Yes row 13, 2; No row 251, 12033. P = 1.55 E-20
  • 26. Gene Wiki content improves enrichment analysis. GO term: muscle contraction (GO:0006936). [Pipeline labels: GO term, PubMed abstracts, concept recognition, gene list, enrichment analysis; 251 articles; 87 genes; + Gene Wiki: 87 articles] GO:0006936, linked genes through PubMed: P = 1.0. GO:0006936, linked genes through PubMed + Gene Wiki: P = 1.22 E-09
  • 27. Gene Wiki content improves enrichment analysis [Scatter plot: p-value (PubMed + GW) vs. p-value (PubMed only); regions labeled “more significant: PubMed only” and “more significant: PubMed + GW”; muscle contraction highlighted]
  • 28. Gene Wiki+: Crowdsourced semantic database. Q: What genes are related to hemolytic anemia?
  • 29. The Long Tail of scientists is a valuable source of information on gene function
  • 30. From crowdsourcing to structured data: The Gene Wiki, Biological Games
  • 31. Gene databases are numerous and overlapping … and hundreds more …
  • 32. Community extensibility and user customizability http://biogps.org
  • 33. Utility: A simple and universal plugin interface [Feedback-loop diagram: Utility, Contributors, Users]
  • 34. Utility: A simple and universal plugin interface [Feedback-loop diagram: Utility, Contributors, Users]
  • 35. Utility: A simple and universal plugin interface [Feedback-loop diagram: Utility, Contributors, Users]
  • 36. Utility: A simple and universal plugin interface [Feedback-loop diagram: Utility, Contributors, Users]
  • 37. Utility: A simple and universal plugin interface [Feedback-loop diagram: Utility, Contributors, Users]
  • 38. Utility: A simple and universal plugin interface [Feedback-loop diagram: Utility, Contributors, Users] Total of 389 gene-centric online databases registered as BioGPS plugins
  • 39. Users: BioGPS has critical mass [Feedback-loop diagram: Utility, Contributors, Users] [Plot: daily pageviews] > 5000 registered users; 13,500 unique visitors per month; 155,000 page views per week. Top 10 organizations: 1. Harvard, 2. NIH, 3. UCSD, 4. Scripps, 5. MIT, 6. Cambridge, 7. U Penn, 8. Stanford, 9. Wash U, 10. UNC
  • 40. Contributors: Explicit and implicit knowledge [Feedback-loop diagram: Utility, Contributors, Users] 389 plugins registered (65% publicly shared) by over 75 users spanning 150+ domains
  • 42. Defining a data extraction template: TP53, TNF, APOE, IL6, VEGF, EGFR, TGFB1, … …
  • 43. The BioGPS Semantic Annotator http://50.112.124.237
  • 44. The Long Tail of bioinformaticians can collaboratively build a gene portal.
  • 45. From crowdsourcing to structured data: The Gene Wiki, Biological Games
  • 46. Seven million human hours http://www.flickr.com/photos/archana3k1/4124330493/
  • 47. Twenty million human hours http://www.flickr.com/photos/ableman/2171326385/
  • 48. - 150 billion human hours per year http://www.flickr.com/photos/rvp-cw/6243289302/
  • 49. Using games to fold proteins. Fold.it players have successfully: outperformed state of the art protein folding algorithms (Cooper, Nature, 2010); solved a previously-intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011); designed an improved protein folding algorithm (Khatib, PNAS, 2011); improved enzyme activity of de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
  • 50. Using games to fold RNAs http://eterna.cmu.edu/
  • 51. Using games to align sequences http://phylo.cs.mcgill.ca
  • 52. Using games to annotate genes? http://genegames.org
  • 53. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease
  • 54. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility
  • 55. No good gene-disease annotation database. Query: Apolipoprotein E ? Alzheimer's disease (AD) ? Lipoprotein glomerulopathy ? Sea-blue histiocyte disease Hyperlipoproteinemia, type III ? Macular degeneration, age-related ? Myocardial infarction susceptibility HIV Psoriasis Vascular Diseases
  • 56. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Memory; Coronary Artery Disease; Neuropsychological Tests; Hypertension; Cognition Disorders; Mental Status Schedule; Psychiatric Status Rating Scales; Dementia; Cognition; Hyperlipidemias; Atrophy; Disease Progression; Dementia, Vascular; Cardiovascular Diseases; Parkinson Disease; Brain Injuries; Coronary Disease; Myocardial Infarction; Diabetes Mellitus, Type 2; Memory Disorders; … 477 diseases!
  • 57. Play Dizeez to annotate gene-disease links: 1. Read the clue (gene). 2. Click the related disease (only one is “right”). 3. If it's “right”, you get points. 4. Then on to the next question… 5. Hurry! 6. Play to win!
  • 58. Dizeez players seem pretty smart… In total (since Dec 2011): 230 unique gamers; 1045 games played; 8525 guesses. Table (# occurrences, gene, disease; the Gene Wiki, OMIM, PharmGKB, and PubMed columns have no recoverable values): 11, NBPF3, neuroblastoma; 11, SOX8, mental retardation; 9, ABL1, leukemia; 9, SSX1, synovial sarcoma; 8, APC, colorectal cancer; 8, FES, sarcoma; 8, RBP3, retinoblastoma; 8, GAST, gastrinoma; 8, DCC, colorectal cancer; 8, MAP3K5, cancer
  • 59. Using games to predict phenotype from genotype? http://genegames.org
  • 60. Classification problems in genome biology [Diagram: training data of 100s of samples by 100,000s of features, labeled cancer or normal; find patterns; classify new samples as cancer or normal] Methods: SVM, neural networks, Naïve Bayes, KNN, …
  • 61. Random forests: sample a subset of cases and features, then train a decision tree [Data: cancer and normal samples; 100,000s features; 100s samples]
  • 62. Random forests [Data: cancer and normal samples; 100,000s features; 100s samples]
  • 63. Random forests: classify new samples as cancer or normal [Data: 100,000s features; 100s samples] How to interject biological knowledge?
  • 64. Network-guided forests. Dutkowski & Ideker (2011). PLoS Computational Biology
  • 65. Network-guided forests: sample features by PPI network, then train a decision tree [Data: cancer and normal samples; 100,000s features; 100s samples]
  • 66. Human-guided forests: sample features by human intelligence, then train a decision tree [Data: cancer and normal samples; 100,000s features; 100s samples]
  • 67.
  • 68. The Cure: Genomic predictors for disease
  • 69. The Cure: Genomic predictors for disease
  • 70. The Cure: Genomic predictors for disease
  • 71. The Cure: Genomic predictors for disease
  • 72. The Cure: Genomic predictors for disease
  • 73. The Cure: Genomic predictors for disease
  • 74. Human-guided forests: classify new samples as cancer or normal
  • 76. Preliminary results. Players: 214 registered; 50% declared knowledge of cancer biology; 40% self-identified as having a Ph.D. Prediction results: 69% correct on survival concordance index; best scoring model was 72%
  • 77. The Long Tail of gamers can collaboratively build an accurate disease classifier.
  • 78. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Ben Good, Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su. Recruiting graduate students in quantitative biology! See http://education.scripps.edu/ Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820)
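
The enrichment P-values on slides 25 and 26 come from a 2x2 table of gene counts. The transcript does not say which test was used, so the following is only a minimal sketch that assumes a one-sided hypergeometric (Fisher-style) test on the slide 25 counts. The function name `enrichment_p_value` and its parameter names are hypothetical, and the row/column orientation is read from the flattened slide text; the test gives the same value if the table is transposed.

```python
from fractions import Fraction
from math import comb


def enrichment_p_value(both, list_only, term_only, neither):
    """One-sided hypergeometric p-value for a 2x2 table (assumed test).

    both      : genes in the candidate list AND annotated with the GO term
    list_only : genes in the candidate list but not annotated
    term_only : annotated genes that are not in the candidate list
    neither   : all remaining background genes
    """
    total = both + list_only + term_only + neither
    list_size = both + list_only
    term_size = both + term_only
    denominator = comb(total, list_size)
    # Sum the probability of observing `both` or more overlapping genes.
    tail = sum(
        comb(term_size, k) * comb(total - term_size, list_size - k)
        for k in range(both, min(list_size, term_size) + 1)
    )
    # Fraction keeps the huge integers exact until the final conversion.
    return float(Fraction(tail, denominator))


# Counts as read from the slide 25 table (axon guidance, GO:0007411).
p = enrichment_p_value(both=13, list_only=2, term_only=251, neither=12033)
print(f"P = {p:.3g}")
```

The printed value can be compared with the P = 1.55 E-20 reported on slide 25; a mismatch would suggest that the original analysis used a different test or background set.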

Editor's Notes

  1. We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
  2. If you believe that greater than 1.5% of articles have relevance to gene function, then it says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  3. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  4. Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  5. Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  6. Reverted four minutes later
  7. Reverted four minutes later
  8. Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  9. Tried on 773 GO categories, significant in 356 cases (46%)
  10. We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  11. Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  12. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  13. MODs and portals
  14. Genetics resources
  15. Literature resources
  16. Protein resources
  17. Pathway and expression databases
  18. Pathway and expression databases
  19. Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  20. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  21. Empire State Building
  22. Question: how to interject biological knowledge in the feature selection process?
  23. Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
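
Slides 61 to 66 and note 22 describe random forests in which the features offered to each tree are sampled uniformly, by PPI network, or by human intelligence. The sketch below illustrates only that sampling idea and is not the implementation behind The Cure or Dutkowski & Ideker. It assumes one-split decision stumps in place of full decision trees, and every name in it (`train_stump`, `train_guided_forest`, `predict`, `feature_weights`) and the toy data are hypothetical. Uniform weights give ordinary random feature sampling; weights derived from a network or from player choices give a guided forest.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Fit a one-split decision tree using only the given features."""
    majority = Counter(labels).most_common(1)[0][0]
    best_errors, best_stump = len(labels) + 1, None
    for f in feature_ids:
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between adjacent observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if errors < best_errors:
                best_errors = errors
                best_stump = (f, threshold, left_label, right_label)
    # Fall back to a constant prediction when no split is possible.
    return best_stump or (feature_ids[0], float("inf"), majority, majority)


def train_guided_forest(rows, labels, feature_weights,
                        n_trees=25, features_per_tree=2, seed=0):
    """Bagged stumps whose candidate features follow the guidance weights."""
    rng = random.Random(seed)
    n_samples, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap the cases (sampling with replacement).
        picks = [rng.randrange(n_samples) for _ in range(n_samples)]
        # Sample features in proportion to the guidance weights.
        chosen = rng.choices(range(n_features), weights=feature_weights,
                             k=features_per_tree)
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks],
                                  sorted(set(chosen))))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(left if row[f] <= t else right
                    for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Toy data: feature 0 separates the classes, the other features are noise.
rows = [
    [0.1, 0.9, 0.4, 0.7],
    [0.2, 0.1, 0.8, 0.3],
    [0.3, 0.6, 0.2, 0.9],
    [0.7, 0.8, 0.5, 0.2],
    [0.8, 0.2, 0.9, 0.6],
    [0.9, 0.5, 0.1, 0.4],
]
labels = ["normal", "normal", "normal", "cancer", "cancer", "cancer"]
# Hypothetical guidance: players only ever selected feature 0.
feature_weights = [1.0, 0.0, 0.0, 0.0]
forest = train_guided_forest(rows, labels, feature_weights)
print(predict(forest, [0.85, 0.5, 0.5, 0.5]),
      predict(forest, [0.15, 0.5, 0.5, 0.5]))
```

Keeping the guidance in a single weight vector makes it easy to compare a uniform baseline against network-derived or player-derived weights on the same data.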