An online game for improving human phenotype prediction
Benjamin M Good, Salvatore Loguercio, Andrew I Su
The Scripps Research Institute, La Jolla, California, USA
ABSTRACT
An important goal for biomedical research is to produce genetic and genomic predictors for human phenotypes such as disease prognosis or drug response. To this end, we can now quantify an extremely large number of potential biomarkers for any biological sample. In fact, a single sample could reasonably be described by millions of molecular variations in DNA, RNA, proteins, and metabolites. However, the actual number of samples processed typically remains small in comparison. As a result, attempts to use this data to build predictors often face problems of overfitting. (While a predictive pattern may describe training data very well, it may not reproduce well on other datasets.)

It has recently been shown that biological knowledge in the form of gene annotations and pathway databases can be used to guide the process of inferring phenotype predictors [1-3]. While promising, such methods are limited by the amount, quality and problem-specific applicability of the structured knowledge that is available.

Following in the line of games that have recently demonstrated success as a means of ‘crowdsourcing’ difficult biological problems [4,5], we are developing games with the purpose of improving human phenotype predictions. Our games work on two levels: (1) games such as Dizeez and GenESP collect novel gene annotations and (2) games like Combo engage players directly in the process of predictor inference.

Play game prototypes at: http://www.genegames.org
(Also see Poster I03)

Motivation
• Using prior biological knowledge, it is possible to identify stronger, more consistent predictive patterns.
• Prior knowledge encoded in protein-protein interaction databases [1,2] and pathway databases [3] has been used to improve phenotype prediction
[Figure: Network Guided Forest, from Dutkowski et al (2011)]
• What about knowledge that is not recorded in structured databases?

Combo: feature selection with community intelligence
• Goal: pick the best set of genes
• Best: the gene set that produces the best decision tree classifier
• Classifier: created using training data and selected genes, used to predict phenotype (e.g. breast cancer prognosis)
[Figure: a game board and a hand; the inferred decision tree separating Phenotype 1 from Phenotype 2; Score: 78 (percent correct)]
Game Score: determined by estimating performance of trees constructed using the selected features on training data.
Feature sets from many individual games used to create a Decision Tree Forest classifier. (Each tree votes once.)
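The scoring and voting ideas in the Combo panel can be illustrated with a minimal sketch. This is not the game's actual implementation. It assumes a small hand-rolled decision tree, 3-fold cross-validation as the way of "estimating performance" (the poster does not say which estimator is used), and toy expression data with hypothetical gene names (`GENE_A`, `GENE_B`, `GENE_C`) and hypothetical helper functions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, genes, depth=0, max_depth=3):
    """Grow a small binary tree that may split only on the selected genes.
    Leaves are class labels (strings); internal nodes are tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    best = None
    for g in genes:
        values = sorted({r[g] for r in rows})
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2  # candidate threshold: midpoint of neighbours
            left = [y for r, y in zip(rows, labels) if r[g] <= t]
            right = [y for r, y in zip(rows, labels) if r[g] > t]
            if not left or not right:
                continue
            cost = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or cost < best[0]:
                best = (cost, g, t)
    if best is None:
        return majority
    _, g, t = best
    lo = [(r, y) for r, y in zip(rows, labels) if r[g] <= t]
    hi = [(r, y) for r, y in zip(rows, labels) if r[g] > t]
    grow = lambda part: build_tree([r for r, _ in part], [y for _, y in part],
                                   genes, depth + 1, max_depth)
    return (g, t, grow(lo), grow(hi))

def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        g, t, left, right = tree
        tree = left if row[g] <= t else right
    return tree

def game_score(rows, labels, genes, folds=3):
    """Percent correct for one hand of genes, estimated by k-fold
    cross-validation on the training data (an assumed estimator)."""
    correct = 0
    for k in range(folds):
        train = [i for i in range(len(rows)) if i % folds != k]
        test = [i for i in range(len(rows)) if i % folds == k]
        tree = build_tree([rows[i] for i in train], [labels[i] for i in train], genes)
        correct += sum(predict(tree, rows[i]) == labels[i] for i in test)
    return round(100 * correct / len(rows))

def forest_predict(trees, row):
    """Each tree votes once; the most common label wins."""
    return Counter(predict(t, row) for t in trees).most_common(1)[0][0]

# Toy training data: GENE_A and GENE_B track the phenotype, GENE_C does not.
rows = [
    {"GENE_A": 9, "GENE_B": 2, "GENE_C": 5},
    {"GENE_A": 1, "GENE_B": 7, "GENE_C": 6},
    {"GENE_A": 8, "GENE_B": 1, "GENE_C": 4},
    {"GENE_A": 2, "GENE_B": 8, "GENE_C": 7},
    {"GENE_A": 7, "GENE_B": 3, "GENE_C": 8},
    {"GENE_A": 3, "GENE_B": 6, "GENE_C": 3},
]
labels = ["poor", "good", "poor", "good", "poor", "good"]

score_a = game_score(rows, labels, ["GENE_A"])  # informative hand
score_c = game_score(rows, labels, ["GENE_C"])  # uninformative hand

# Hands from three hypothetical games, one tree per hand.
hands = [["GENE_A", "GENE_C"], ["GENE_B"], ["GENE_C"]]
forest = [build_tree(rows, labels, hand) for hand in hands]
new_sample = {"GENE_A": 8, "GENE_B": 2, "GENE_C": 6}
print(score_a, score_c, forest_predict(forest, new_sample))
```

In this sketch a hand containing an informative gene scores higher than a hand of noise, and the forest still classifies the new sample sensibly even though one of its trees was built from an uninformative hand, because that tree contributes only one vote.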
Challenge
[Figure: "find patterns" among cancer and normal samples; "make predictions on new samples"]
• With tens of thousands of measurements but only hundreds of samples, many possible patterns are found.
• But which ones are real?

Opportunity
• Online games are successfully tapping into the knowledge and reasoning abilities of thousands of people.
[Figure: example tasks: Label all images on the Web; Devise protein folding algorithms; Design RNA molecules; Fix multiple sequence alignments]
• COMBO is designed to motivate and enable many people to help improve phenotype predictors
[Figure: select predictive gene sets]

Human Guided Forest
Ensemble classifier where components are decision trees constructed using manually selected subsets of features. Adaptation of Network Guided and Random Forests [1].

REFERENCES
1. Dutkowski and Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
2. Winter et al (2012) Google Goes Cancer: Improving Outcome Prediction for Cancer Patients by Network-Based Ranking of Marker Genes. PLoS Computational Biology
3. Liu et al (2012) Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics
4. Good and Su (2011) Games with a Scientific Purpose. Genome Biology
5. Kawrykow et al (2012) Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment. PLoS One

CONTACT
Benjamin Good: bgood@scripps.edu
Salvatore Loguercio: loguerci@scripps.edu
Andrew Su: asu@scripps.edu

FUNDING
We acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924) and the NIH through the FaceBase Consortium for a particular emphasis on craniofacial genes (DE-20057).