Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org
October 30, 2012
2. Few genes are well annotated…
[Figure: counts of PubMed references and Gene Ontology annotations per gene, for 23,278 protein-coding genes sorted by decreasing counts. Labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE. Callouts: 59% and 38%.]
Data: NCBI gene2pubmed, August 2010
3. … because the literature is sparsely curated?
[Figure: number of PubMed-indexed articles per year, 1979 to 2009; y-axis from 0 to 1,000,000.]
4. … because the literature is sparsely curated?
[Figure: average capacity of a typical human scientist compared with the number of articles read by a scientist, 1979 to 2009; y-axis from 0 to 20.]
6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
7. The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), split into a Short Head and a Long Tail.]
Examples (Short Head vs. Long Tail):
News: Newspapers vs. Blogs
Video: TV/Hollywood vs. YouTube
Product reviews: Consumer Reports vs. Amazon reviews
Food reviews: Food critics vs. Yelp
Talent judging: Olympics vs. American Idol
9. Wikipedia has breadth and depth
[Figure: Wikipedia vs. Britannica Online, compared by number of articles, words (millions), and words per article.]
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
14. Wiki success depends on positive feedback
[Diagram: feedback loop linking Gene Wiki page utility, number of contributors, and number of users. Illustrative values: 1 and 2 contributors, 100 and 200 users.]
15. 10,000 gene “stubs” within Wikipedia
[Diagram: feedback loop of Utility, Users, Contributors]
Each stub contains:
• Protein structure
• Gene summary
• Symbols and identifiers
• Gene Ontology annotations
• Protein interactions
• Tissue expression pattern
• Linked references
• Links to structured databases
Huss, PLoS Biol, 2008
16. Gene Wiki has a critical mass of readers
[Diagram: feedback loop of Utility, Users, Contributors]
Total: 5.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
17. Gene Wiki has a critical mass of editors
[Diagram: feedback loop of Utility, Users, Contributors]
[Figure: editor count (Editors) and edit count (Edits).]
• Increase of ~10,000 words / month from >1,000 edits
• Currently 1.42 million words
• Approximately equal to 230 full-length articles
Good, NAR, 2011
18. A review article for every gene is powerful
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
Callouts: hyperlinks to related concepts; references to the literature
19. The Gene Wiki is (reasonably) reliable
              Per-edit probability   Average lifetime   Probability by time
Good edits    98.9%                  115.4 d            99.968%
Vandalism     1.1%                   3.4 d              0.032%
(Vandalism: 0.63% for WP overall)
[Figure: cumulative edits by date.]
Good, NAR, 2011
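The slide does not say how the "probability by time" column is derived. One plausible reading, assumed here, is that each class of edit is weighted by how long it stays visible (per-edit probability times average lifetime). A minimal sketch of that arithmetic, using only the numbers on the slide; the function name is made up for illustration:

```python
# Values copied from the slide's table.
P_GOOD, LIFE_GOOD_DAYS = 0.989, 115.4
P_VANDAL, LIFE_VANDAL_DAYS = 0.011, 3.4

def time_weighted_share(p_edit, lifetime_days, all_rows):
    """Share of total visible time contributed by one class of edit.

    Assumption (not stated on the slide): visibility is proportional to
    per-edit probability multiplied by average lifetime.
    """
    total = sum(p * days for p, days in all_rows)
    return p_edit * lifetime_days / total

rows = [(P_GOOD, LIFE_GOOD_DAYS), (P_VANDAL, LIFE_VANDAL_DAYS)]
share_vandal = time_weighted_share(P_VANDAL, LIFE_VANDAL_DAYS, rows)
share_good = time_weighted_share(P_GOOD, LIFE_GOOD_DAYS, rows)
print(f"vandalism visible: {share_vandal:.3%}; good text visible: {share_good:.3%}")
```

Under this assumption the vandalism share comes out at roughly 0.03%, in line with the 0.032% on the slide; the small difference is presumably rounding in the published inputs.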
20. Making the Gene Wiki more reliable
Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …
The company name is derived from old Greek, and means "destroyer of birds".
21. Making the Gene Wiki more reliable
Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …
High-trust author: 36211 total edits
The company name is derived from old Greek, and means "destroyer of birds".
Low-trust author: 36 total edits
http://www.wikitrust.net/
22. Making the Gene Wiki more computable
From free text to structured annotations
23. Filling the gaps in gene annotation
[Diagram: NCBI Entrez Gene: 334, linked by Gene Wiki mapping to its Gene Wiki article; a wikilink in the article is an exact match to a GO term (GO:0006897), which yields a candidate assertion.]
6319 novel GO annotations
2147 novel DO annotations
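The mining step on this slide can be sketched as follows: pull wikilink targets out of an article, look for exact matches against ontology term labels, and keep the matches that are not already annotated. This is only an illustration of the idea, not the pipeline used for the talk. The wikitext, the label table, and the function name are all hypothetical; only the gene ID 334 and GO:0006897 come from the slide.

```python
import re

# Matches [[Target]], [[Target|display text]] and [[Target#Section]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]")

def candidate_annotations(gene_id, wikitext, term_labels, existing):
    """Return (gene, term) pairs whose wikilink target exactly matches a label."""
    found = []
    for target in WIKILINK.findall(wikitext):
        term_id = term_labels.get(target.strip().lower())
        if term_id and (gene_id, term_id) not in existing:
            found.append((gene_id, term_id))
    return found

# Hypothetical inputs, invented for this example.
term_labels = {"endocytosis": "GO:0006897"}
wikitext = "The protein takes part in [[endocytosis]] and binds [[Heparin|heparin]]."
candidates = candidate_annotations("334", wikitext, term_labels, existing=set())
print(candidates)
```

Exact label matching is deliberately conservative: it misses synonyms, and each hit is only a candidate assertion that would still need review.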
25. Gene Wiki content improves enrichment analysis
[Workflow: GO term axon guidance (GO:0007411); gene list (264 genes); PubMed abstracts (811 articles); concept recognition; enrichment analysis.]
Contingency table:
                                      GO:0007411: Yes   GO:0007411: No
Linked genes through PubMed: Yes      13                2
Linked genes through PubMed: No       251               12033
P = 1.55 E-20
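The slide does not name the statistical test behind this P value. A common choice for a 2x2 table like this one is a one-sided hypergeometric (Fisher's exact) test, which is assumed in the sketch below. It uses only the four counts from the slide and the standard library; the function name is illustrative.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the GO term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total

# Counts from the slide's contingency table.
linked_in_term, linked_not_in_term = 13, 2
unlinked_in_term, unlinked_not_in_term = 251, 12033

N = linked_in_term + linked_not_in_term + unlinked_in_term + unlinked_not_in_term
K = linked_in_term + unlinked_in_term        # genes annotated with GO:0007411
n = linked_in_term + linked_not_in_term      # genes linked through PubMed
p_value = hypergeom_upper_tail(linked_in_term, K, n, N)
print(f"N = {N}, K = {K}, n = {n}, P = {p_value:.3g}")
```

Note that K works out to 264, matching the gene-list size on the slide. With this test the result should land on the order of 1e-20, close to the reported value; a two-sided test or a different background set would shift it.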
26. Gene Wiki content improves enrichment analysis
[Workflow: GO term muscle contraction (GO:0006936); gene list (87 genes); PubMed abstracts (251 articles) + Gene Wiki (87 articles); concept recognition; enrichment analysis.]
GO:0006936, linked genes through PubMed: P = 1.0
GO:0006936, linked genes through PubMed + Gene Wiki: P = 1.22 E-09
27. Gene Wiki content improves enrichment analysis
[Figure: p-value (PubMed + GW) plotted against p-value (PubMed only), with regions labeled "More significant: PubMed only" and "More significant: PubMed + GW". Muscle contraction is highlighted.]
38. Utility: A simple and universal plugin interface
[Diagram: feedback loop of Utility, Users, Contributors]
Total of 389 gene-centric online databases registered as BioGPS plugins
39. Users: BioGPS has critical mass
[Diagram: feedback loop of Utility, Users, Contributors]
[Figure: daily pageviews.]
• > 5000 registered users
• 13,500 unique visitors per month
• 155,000 page views per week
Top 10 organizations: 1. Harvard, 2. NIH, 3. UCSD, 4. Scripps, 5. MIT, 6. Cambridge, 7. U Penn, 8. Stanford, 9. Wash U, 10. UNC
40. Contributors: Explicit and implicit knowledge
[Diagram: feedback loop of Utility, Users, Contributors]
389 plugins registered (65% publicly shared) by over 75 users spanning 150+ domains
48. 150 billion human hours per year
http://www.flickr.com/photos/rvp-cw/6243289302/
49. Using games to fold proteins
Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
53. No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
54. No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility
55. No good gene-disease annotation database
Query: Apolipoprotein E
? Alzheimer's disease (AD)
? Lipoprotein glomerulopathy
? Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
? Macular degeneration, age-related
? Myocardial infarction susceptibility
HIV
Psoriasis
Vascular Diseases
56. No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD); Memory; Coronary Artery Disease; Neuropsychological Tests; Hypertension; Cognition Disorders; Mental Status Schedule; Psychiatric Status Rating Scales; Dementia; Cognition; Hyperlipidemias; Atrophy; Disease Progression; Dementia, Vascular; Cardiovascular Diseases; Parkinson Disease; Brain Injuries; Coronary Disease; Myocardial Infarction; Diabetes Mellitus, Type 2; Memory Disorders; …
477 diseases!
57. Play Dizeez to annotate gene-disease links
1. Read the clue (gene)
2. Click the related disease (only one is “right”)
3. If it's “right”, you get points
4. Then on to the next question…
5. Hurry!
6. Play to win!
58. Dizeez players seem pretty smart…
In total (since Dec 2011):
• 230 unique gamers
• 1045 games played
• 8525 guesses
# Occurrences   Gene     Disease
11              NBPF3    neuroblastoma
11              SOX8     mental retardation
9               ABL1     leukemia
9               SSX1     synovial sarcoma
8               APC      colorectal cancer
8               FES      sarcoma
8               RBP3     retinoblastoma
8               GAST     gastrinoma
8               DCC      colorectal cancer
8               MAP3K5   cancer
[Additional table columns: Gene Wiki, OMIM, PharmGKB, PubMed]
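The "# Occurrences" column suggests that guesses are tallied per gene-disease pair across games. The sketch below shows one simple way such a tally could be built. The guess log, its field layout, and the `min_count` threshold are hypothetical, since the slide does not describe how Dizeez stores or filters guesses.

```python
from collections import Counter

# Hypothetical guess log: (player, gene shown as clue, disease clicked).
guesses = [
    ("p1", "ABL1", "leukemia"),
    ("p2", "ABL1", "leukemia"),
    ("p3", "ABL1", "sarcoma"),
    ("p1", "APC", "colorectal cancer"),
    ("p2", "APC", "colorectal cancer"),
]

def consensus_pairs(log, min_count=2):
    """Count each (gene, disease) guess and keep pairs seen at least min_count times."""
    counts = Counter((gene, disease) for _player, gene, disease in log)
    return [(n, gene, disease)
            for (gene, disease), n in counts.most_common()
            if n >= min_count]

pairs = consensus_pairs(guesses)
for n, gene, disease in pairs:
    print(n, gene, disease)
```

Pairs that many players agree on but that are missing from existing databases are the interesting ones, since they are candidate new annotations.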
59. Using games to predict phenotype from genotype?
http://genegames.org
60. Classification problems in genome biology
[Diagram: a data matrix of 100s of samples (cancer, normal) by 100,000s of features; find patterns, then classify new samples as cancer or normal.]
Methods: SVM, neural networks, Naïve Bayes, KNN, …
61. Random forests
[Diagram: from a data matrix of 100s of samples (cancer, normal) by 100,000s of features, sample a subset of cases and features, then train a decision tree.]
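The two steps on this slide (sample a subset of cases and features, then train a tree) can be sketched with the standard library alone. To keep it short, each "tree" here is a one-split decision stump, cases are resampled with replacement, and the data are a made-up toy matrix. These are simplifying assumptions, so this is a stand-in for a random forest, not a full implementation.

```python
import random
from collections import Counter

def train_stump(X, y, feature_ids):
    """Pick the (feature, threshold, labels) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            for hi, lo in (("cancer", "normal"), ("normal", "cancer")):
                errors = sum((hi if row[f] > t else lo) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, hi, lo)
    return best[1:]

def train_forest(X, y, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(len(X)) for _ in X]          # resample cases
        feats = rng.sample(range(len(X[0])), n_features)   # random feature subset
        forest.append(train_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(hi if row[f] > t else lo for f, t, hi, lo in forest)
    return votes.most_common(1)[0][0]

# Toy data invented for illustration: 6 samples, 3 features.
X = [[7, 8, 9], [8, 9, 7], [9, 7, 8], [1, 2, 3], [2, 3, 1], [3, 1, 2]]
y = ["cancer", "cancer", "cancer", "normal", "normal", "normal"]
forest = train_forest(X, y)
print(predict(forest, [9, 9, 9]), predict(forest, [1, 1, 1]))
```

Because every tree sees a different slice of cases and features, individual trees are weak and noisy, while the vote across many of them is more stable. That property matters when features outnumber samples by orders of magnitude, as in the diagram.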
76. Preliminary results
• 214 registered players
– 50% declared knowledge of cancer biology
– 40% self-identified as having Ph.D.
• Prediction results
– 69% correct on survival concordance index
– Best scoring model was 72%
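The slide does not define how the concordance index was computed. A common definition, assumed here, is Harrell's C: among patient pairs whose survival order is known, the fraction in which the patient who died earlier was also given the higher predicted risk. A minimal sketch with made-up data:

```python
from itertools import combinations

def concordance_index(times, events, risk):
    """Fraction of comparable pairs ranked correctly (Harrell's C, simplified).

    times:  observed survival or follow-up time per patient
    events: True if death was observed, False if censored
    risk:   predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue                      # tied times are skipped in this sketch
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue                      # earlier time is censored: order unknown
        comparable += 1
        if risk[first] > risk[second]:
            concordant += 1
        elif risk[first] == risk[second]:
            concordant += 0.5             # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical example data, not from the talk.
times = [2, 4, 6, 8]
events = [True, True, False, True]
risk = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risk)
print(f"C-index = {c_index:.2f}")
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the context for reading the 69% and 72% figures.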
77. The Long Tail of gamers can collaboratively build an accurate disease classifier.
78. Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members
Ben Good, Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Recruiting graduate students in quantitative biology! See http://education.scripps.edu/
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Editor's Notes
We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% of genes have 5 or fewer references; 38% have one or no references.
If you believe that more than 1.5% of articles are relevant to gene function, then this points to a bottleneck in our curation efforts. Numbers updated 7/15/2011.
For each resource: briefly describe the unstructured resource; describe the structuring approach.
Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
Tried on 773 GO categories, significant in 356 cases (46%)
We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
MODs and portals
Genetics resources
Literature resources
Protein resources
Pathway and expression databases
Empire State Building
Question: how can biological knowledge be injected into the feature selection process?