This document discusses using crowdsourcing approaches to improve gene annotation. It describes how the Gene Wiki allows collaborative editing to build gene summaries. Biological games are proposed to harness human intuition to solve problems like protein folding, sequence alignment, and gene-disease annotation. The document outlines the GeneWiki+, BioGPS, and GeneGames projects which aim to build structured databases through crowdsourcing structured data from wikis and games. It argues that harnessing the "Long Tail" of scientists and gamers can help scale gene annotation efforts to keep up with data generation.
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)
1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org
Sanger/EBI
September 7, 2012
2. Few genes are well annotated…
[Figure: genes sorted by decreasing annotation counts, for PubMed references and Gene Ontology annotations, across 23,278 protein-coding genes. Labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE. Callouts: 59% and 38%.]
Data: NCBI gene2pubmed, August 2010
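The editor's notes for this slide describe the two callouts as the share of genes with 5 or fewer references and the share with one or none. A minimal sketch of how such a tally could be computed, assuming the input is a set of (gene ID, PubMed ID) pairs like those in gene2pubmed plus a separately supplied list of all genes; the function name and the toy data are hypothetical.

```python
from collections import Counter

def fraction_with_at_most(pairs, all_genes, max_refs):
    """Fraction of genes in all_genes with at most max_refs distinct references.

    pairs: iterable of (gene_id, pubmed_id) tuples (assumed input shape).
    all_genes: every gene to count, including genes with no references at all.
    """
    # Deduplicate pairs so a repeated row does not inflate a gene's count.
    refs_per_gene = Counter(gene for gene, _ in set(pairs))
    # Counter returns 0 for genes that never appear in pairs.
    sparse = sum(1 for gene in all_genes if refs_per_gene[gene] <= max_refs)
    return sparse / len(all_genes)

# Toy data, not real counts: gene "A" has 6 references, "B" has 1, "C" has none.
toy_pairs = [("A", pmid) for pmid in range(6)] + [("B", 100)]
toy_genes = ["A", "B", "C"]

print(fraction_with_at_most(toy_pairs, toy_genes, 5))  # share with 5 or fewer
print(fraction_with_at_most(toy_pairs, toy_genes, 1))  # share with one or none
```

Passing the full gene list separately matters because a gene with no references never appears among the pairs.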
3. … because the literature is sparsely curated?
[Figure: number of PubMed-indexed articles per year, 1979 to 2009; y-axis from 0 to 1,000,000.]
4. … because the literature is sparsely curated?
[Figure: number of articles read by a typical scientist (average human capacity), 1979 to 2009; y-axis from 0 to 20.]
6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
7. The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), showing a Short Head and a Long Tail.]
Category: Short Head vs. Long Tail
News: Newspapers vs. Blogs
Video: TV/Hollywood vs. YouTube
Product reviews: Consumer Reports vs. Amazon reviews
Food reviews: Food critics vs. Yelp
Talent judging: Olympics vs. American Idol
9. Wikipedia has breadth and depth
[Figure: Wikipedia vs. Britannica Online, compared on articles, words (millions), and words per article.]
Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
13. Wiki success depends on a positive feedback loop
[Figure: feedback loop linking Gene Wiki page utility, number of contributors (example values 1, 2), and number of users (example values 100, 200).]
14. 10,000 gene “stubs” within Wikipedia
[Figure: annotated Gene Wiki stub page. Page elements: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases.]
Huss, PLoS Biol, 2008
15. Gene Wiki has a critical mass of readers
Total: 5.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
16. Gene Wiki has a critical mass of editors
[Figure: editor count and edit count over time.]
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
17. A review article for every gene is powerful
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
Callouts: hyperlinks to related concepts; references to the literature
18. Making the Gene Wiki more reliable
Example text 1: Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …
Example text 2: The company name is derived from old Greek, and means "destroyer of birds".
19. Making the Gene Wiki more reliable
Example text 1 (high-trust author, 36211 total edits): Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …
Example text 2 (low-trust author, 36 total edits): The company name is derived from old Greek, and means "destroyer of birds".
http://www.wikitrust.net/
20. Making the Gene Wiki more computable
Free text Structured annotations
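21. Filling the gaps in gene annotation
[Figure: annotation-mining pipeline. Gene Wiki mapping to an NCBI Entrez Gene record (example: 334); a wikilink on the page becomes a candidate assertion; a GO exact match (example: GO:0006897) yields an annotation.]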
6319 novel GO annotations
2147 novel DO annotations
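The exact-match step on this slide can be sketched as a dictionary lookup. This is a minimal illustration, not the pipeline behind the reported counts: the function name, the `known` parameter, and the example wikilinks are hypothetical, and matching is assumed to be case-insensitive on the term label only.

```python
def candidate_go_annotations(gene_id, wikilinks, go_labels, known=frozenset()):
    """Return (gene_id, go_id, link) for wikilinks that exactly match a GO label.

    go_labels: dict mapping GO ID -> term label.
    known: set of (gene_id, go_id) pairs already annotated, excluded as not novel.
    """
    # Normalize case and surrounding whitespace only; no fuzzy matching.
    index = {label.strip().lower(): go_id for go_id, label in go_labels.items()}
    hits = []
    for link in wikilinks:
        go_id = index.get(link.strip().lower())
        if go_id is not None and (gene_id, go_id) not in known:
            hits.append((gene_id, go_id, link))
    return hits

# Hypothetical inputs for illustration.
go_labels = {"GO:0006897": "endocytosis", "GO:0007411": "axon guidance"}
wikilinks = ["Endocytosis", "Cell membrane", "Protein"]

print(candidate_go_annotations("334", wikilinks, go_labels))
```

Each hit is only a candidate assertion: a link to a concept does not by itself establish that the gene has that function.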
23. Gene Wiki content improves enrichment analysis
[Figure: workflow with the elements GO term (axon guidance, GO:0007411), gene list (264 genes), PubMed abstracts (811 articles), concept recognition, and enrichment analysis.]
                             GO:0007411: Yes   GO:0007411: No
Linked through PubMed: Yes          13                2
Linked through PubMed: No          251            12033
P = 1.55 E-20
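The slide does not name the statistical test. One common choice for a 2x2 table like this is a one-sided hypergeometric (Fisher's exact) test for over-representation, sketched below with the four counts from the slide; the function name is hypothetical and the choice of test is an assumption.

```python
from math import comb

def hypergeom_upper_tail(a, b, c, d):
    """One-sided P(X >= a) for the 2x2 table [[a, b], [c, d]] with fixed margins."""
    row1 = a + b            # genes linked through PubMed
    col1 = a + c            # genes annotated with the GO term
    total = a + b + c + d   # all genes in the table
    denom = comb(total, row1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        # k of the linked genes carry the annotation, row1 - k do not.
        p += comb(col1, k) * comb(total - col1, row1 - k) / denom
    return p

# Counts as shown on the slide: 13, 2 / 251, 12033.
p_value = hypergeom_upper_tail(13, 2, 251, 12033)
print(f"{p_value:.3g}")
```

Under this assumption the result should land on the order of 1e-20, in line with the value shown on the slide. The test is symmetric in rows and columns, so it does not depend on which margin is read as "linked" and which as "annotated".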
24. Gene Wiki content improves enrichment analysis
[Figure: workflow with the elements GO term (muscle contraction, GO:0006936), gene list (87 genes), PubMed abstracts (251 articles) plus Gene Wiki (87 articles), concept recognition, and enrichment analysis.]
GO:0006936, linked genes through PubMed: P = 1.0
GO:0006936, linked genes through PubMed + Gene Wiki: P = 1.22 E-09
25. Gene Wiki content improves enrichment analysis
[Figure: plot of enrichment p-values, PubMed + GW versus PubMed only, with regions marked "more significant, PubMed + GW" and "more significant, PubMed only"; muscle contraction is highlighted.]
36. Utility: A simple and universal plugin interface
Total of 389 gene-centric online databases registered as BioGPS plugins
37. Users: BioGPS has critical mass
[Figure: daily pageviews.]
• > 4100 registered users
• 4000 unique visitors per week
• 40,000 page views per week
Top 10 organizations: 1. Harvard, 2. NIH, 3. UCSD, 4. Scripps, 5. MIT, 6. Cambridge, 7. U Penn, 8. Stanford, 9. Wash U, 10. UNC
38. Contributors: Explicit and implicit knowledge
389 plugins registered (65% publicly shared) by over 75 users spanning 150+ domains
46. 150 billion human hours per year
Image: http://www.flickr.com/photos/rvp-cw/6243289302/
47. Using games to fold proteins
Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved the enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
51. No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
52. No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility
53. No good gene-disease annotation database
Query: Apolipoprotein E
? Alzheimer's disease (AD)
? Lipoprotein glomerulopathy
? Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
? Macular degeneration, age-related
? Myocardial infarction susceptibility
HIV
Psoriasis
Vascular Diseases
54. No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD); Memory; Coronary Artery Disease; Neuropsychological Tests; Hypertension; Cognition Disorders; Mental Status Schedule; Psychiatric Status Rating Scales; Dementia; Cognition; Hyperlipidemias; Atrophy; Disease Progression; Dementia, Vascular; Cardiovascular Diseases; Parkinson Disease; Brain Injuries; Coronary Disease; Myocardial Infarction; Diabetes Mellitus, Type 2; Memory Disorders; …
477 diseases!
55. Play Dizeez to annotate gene-disease links
  1. Read the clue (gene)
  2. Click the related disease (only one is "right")
  3. If it's "right", you get points
  4. Then on to the next question…
  5. Hurry!
  6. Play to win!
56. Dizeez players seem pretty smart…
In total (since Dec 2011):
• 207 unique gamers
• 1045 games played
• 8525 guesses
# Occurrences   Gene    Disease
7               GAST    gastrinoma
7               RBP3    retinoblastoma
7               SSX1    synovial sarcoma
6               TG      Graves' disease
6               CRYGC   Cataract
6               SOX8    mental retardation
6               WRN     Werner syndrome
6               ABL1    leukemia
6               MLL3    leukemia
6               SNAI2   breast carcinoma
[Table columns Pubmed, OMIM, PharmGKB, and Gene Wiki: cell values not preserved.]
57. Dizeez players seem pretty smart…
In total (since Dec 2011):
• 207 unique gamers
• 1045 games played
• 8525 guesses
# Occurrences   Gene    Disease
5               MECOM   sarcoma
4               ATF7    cancer
3               ABCB5   acute myeloid leukemia
3               SART1   glioblastoma
3               NCK1    leukemia
3               NEK1    cancer
[Table columns Pubmed, OMIM, PharmGKB, and Gene Wiki: cell values not preserved.]
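The "# Occurrences" column in these tables is a tally of repeated gene-disease guesses. A minimal sketch of that aggregation, assuming each guess is logged as a (gene, disease) pair; the log format, the function name, and the example log are hypothetical and do not describe how Dizeez stores its data.

```python
from collections import Counter

def rank_gene_disease_guesses(guesses, min_count=1):
    """Count how often each (gene, disease) pair was guessed, most frequent first."""
    counts = Counter(guesses)
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]

# Hypothetical guess log: one (gene, disease) tuple per player click.
guess_log = (
    [("GAST", "gastrinoma")] * 3
    + [("WRN", "Werner syndrome")] * 2
    + [("GAST", "cancer")]
)

for (gene, disease), n in rank_gene_disease_guesses(guess_log, min_count=2):
    print(n, gene, disease)
```

A `min_count` threshold is one simple way to keep only pairs that several guesses agree on before comparing them against existing databases.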
58. Using games to predict phenotype from genotype?
The Cure
http://genegames.org
59. Classification problems in genome biology
[Figure: data matrix of 100s of samples (cancer, normal) by 100,000s of features; find patterns, then classify new samples as cancer or normal.]
Methods: SVM, neural networks, Naïve Bayes, KNN, …
60. Random forests
[Figure: from a matrix of 100s of samples (cancer, normal) by 100,000s of features, sample a subset of cases and features, then train a decision tree.]
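The two steps on this slide (sample a subset of cases and features, then train a tree) can be sketched in plain Python. This is a simplified variant, not the speaker's implementation: each "tree" is a one-split decision stump rather than a full decision tree, and all names, parameters, and data values are hypothetical.

```python
import random

def train_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold, label-above) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            for above in (0, 1):  # label predicted when the value exceeds the threshold
                errors = sum(
                    (above if row[f] > threshold else 1 - above) != y
                    for row, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, above)
    _, f, threshold, above = best
    return f, threshold, above

def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]            # subset of cases
        feats = rng.sample(range(len(rows[0])), n_features)   # subset of features
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        )
    return forest

def predict(forest, row):
    """Majority vote over all stumps: 1 = cancer, 0 = normal."""
    votes = sum(above if row[f] > t else 1 - above for f, t, above in forest)
    return 1 if 2 * votes > len(forest) else 0

# Toy matrix (made-up values): rows are samples, columns are features.
# Features 0 and 1 separate the classes; feature 2 is noise.
samples = [
    [0.1, 0.2, 0.4], [0.2, 0.1, 0.6], [0.3, 0.3, 0.4], [0.2, 0.2, 0.6],  # normal
    [0.8, 0.9, 0.4], [0.9, 0.7, 0.6], [0.7, 0.8, 0.4], [0.8, 0.8, 0.6],  # cancer
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = train_forest(samples, labels)
print(predict(forest, [0.95, 0.95, 0.5]))
print(predict(forest, [0.05, 0.05, 0.5]))
```

Because every stump sees only a few features, the forest stays cheap to train even when the number of features far exceeds the number of samples, which is the regime the slide describes.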
75. The Long Tail of gamers can collaboratively build an accurate disease classifier.
76. Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members
Ben Good, Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
@genegame
Recruiting graduate students in quantitative biology! See http://education.scripps.edu/
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Editor's Notes
We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren't biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% of genes have 5 or fewer references; 38% have one or no references.
If you believe that more than 1.5% of articles are relevant to gene function, then this says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
For each resource: briefly describe the unstructured resource; describe the structuring approach.
Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
Tried on 773 GO categories, significant in 356 cases (46%)
We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores.
Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
MODs and portals
Genetics resources
Literature resources
Protein resources
Pathway and expression databases
Empire State Building
Question: how can biological knowledge be injected into the feature selection process?