The Cure: A Game with the Purpose of Gene Selection for Breast Cancer Survival Prediction
1. THE CURE: A GAME WITH THE PURPOSE OF
GENE SELECTION FOR BREAST CANCER
SURVIVAL PREDICTION
Benjamin Good*, Salvatore Loguercio, Max Nanis, Andrew Su
The Scripps Research Institute
http://genegames.org/cure/
Rocky 2013
2. A QUESTION
How would you get 150 PhD level scientists
to work together on the same problem?
Without any money?
4. WHY GAMES?
It is estimated that 9 billion
hours are spent playing
Solitaire every year
Luis von Ahn. Google Tech Talk: Human Computation, 2006.
(Shortly after receiving $500,000 “Genius Grant” for this work)
5. Seven million hours of human labor
ONE YEAR SOLITAIRE =
1,285 EMPIRE STATE
BUILDINGS
Empire State Building
6. 150 billion hours gaming each year
What if we could use a tiny fraction of that
human effort to achieve another purpose?
[Figure: hours compared. Empire State Building: 7M; one year of solitaire: 9B; one year of games: 150B]
McGonigal J. Reality is broken : why games make us better and how they can
change the world. New York: Penguin Press; 2011.
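The comparison on slides 5 and 6 is plain division. A minimal sketch using only the figures quoted on the slides (the variable names are illustrative, not from the talk):

```python
# Figures quoted on the slides, in hours.
EMPIRE_STATE_BUILDING_HOURS = 7_000_000        # 7M hours of human labor
SOLITAIRE_HOURS_PER_YEAR = 9_000_000_000       # 9B hours per year
ALL_GAMES_HOURS_PER_YEAR = 150_000_000_000     # 150B hours per year

# How many Empire State Buildings' worth of labor each figure represents.
solitaire_in_buildings = SOLITAIRE_HOURS_PER_YEAR / EMPIRE_STATE_BUILDING_HOURS
games_in_buildings = ALL_GAMES_HOURS_PER_YEAR / EMPIRE_STATE_BUILDING_HOURS

print(f"One year of solitaire: {solitaire_in_buildings:,.1f} buildings")
print(f"One year of all games: {games_in_buildings:,.1f} buildings")
```

The solitaire ratio comes out just under 1,286, which the slide reports truncated as 1,285.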
7. PURPOSES
Computer
science
Find objects
inside
images
Tag songs
Label all images
on the Web
Rate image
quality
Biology
Figure out how
proteins fold
Teach computers
English
Design RNA
molecules
Build ontologies
Map connections
between neurons
Link genes with
diseases
Assemble
genomes
Align DNA and
protein sequences
Tag Malaria parasites
in blood smears
Develop better
treatments for
breast cancer
10. INFERRING SURVIVAL PREDICTORS
[Diagram: find patterns in samples labeled “10 year survival? No / Yes”, then make predictions on new samples (“10 year survival? No / Yes”)]
van't Veer, Laura J., et al. "Gene expression profiling predicts clinical outcome of breast cancer." Nature 415.6871 (2002): 530-536.
11. INFERRING SURVIVAL PREDICTORS
find patterns
make predictions
No
10 year survival?
Yes
1) select genes
Out of the 25,000+ genes, which
small set works together the best?
2) infer predictor from data (e.g. decision tree, SVM, etc.)
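The two steps above can be sketched in a few lines. This is a toy illustration only: the expression values, gene indices, and labels are made up, and a nearest-centroid rule stands in for the decision tree or SVM named on the slide so that the example needs nothing beyond the standard library.

```python
from statistics import mean

def select_columns(samples, gene_idx):
    """Step 1: keep only the expression values of the selected genes."""
    return [[row[i] for i in gene_idx] for row in samples]

def fit_centroids(samples, labels):
    """Step 2: infer a predictor (here, one mean profile per class)."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, sample):
    """Assign the class whose centroid is closest in squared distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(sample, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hypothetical data: 4 tumour samples x 4 genes, with 10 year survival labels.
expression = [
    [5.1, 0.2, 3.3, 9.0],
    [4.8, 0.1, 3.1, 1.0],
    [1.0, 0.3, 7.9, 8.5],
    [1.2, 0.2, 8.1, 0.5],
]
survived_10y = ["yes", "yes", "no", "no"]
selected_genes = [0, 2]  # assumed output of the gene selection step

model = fit_centroids(select_columns(expression, selected_genes), survived_10y)
new_sample = [4.9, 0.0, 3.0, 5.0]
prediction = predict(model, [new_sample[i] for i in selected_genes])
print(prediction)
```

Swapping `selected_genes` for a different subset changes the predictor while everything else stays fixed, which is the lever the rest of the talk is about.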
12. PROBLEM: GENE SELECTION INSTABILITY
instability: different methods, different datasets
produce different gene sets for the same phenotype [1]
[1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
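One simple way to put a number on this instability is the Jaccard overlap between the gene sets that different methods or datasets produce. The sketch below is an assumption about how one might measure it, not the analysis from [1]; the gene identifiers are placeholders.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(gene_sets):
    """Average overlap across all pairs; values near 0 mean unstable selection."""
    pairs = list(combinations(gene_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Placeholder signatures for the same phenotype from three hypothetical methods.
signatures = [
    {"G1", "G2", "G3"},
    {"G2", "G3", "G4"},
    {"G5", "G6", "G1"},
]
stability = mean_pairwise_jaccard(signatures)
print(f"mean pairwise Jaccard: {stability:.3f}")
```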
13. PROBLEM: THE VALIDATION GAP
training
data, test
data
validation
validation: predictive signatures often perform
worse on independent data created for validation.
Photograph by Richard Hallman, National Geographic Adventure Blog
14. ADDING PRIOR KNOWLEDGE TO THE DISCOVERY
ALGORITHM
make predictions
find patterns
<10 yr
survival
>10 yr
survival
15. EX.) NETWORK GUIDED FORESTS
Use network to find
good gene
combinations
Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
16. BUT MOST KNOWLEDGE IS NOT STRUCTURED
[Figure: number of articles added to PubMed; y-axis from 500,000 to 1,000,000]
112 publications/hour
(37 more by the end of this talk)
>160,000 publications linked to “breast cancer” since 2000
http://tinyurl.com/brsince2000
17. HOW CAN WE USE UNSTRUCTURED
KNOWLEDGE FOR GENE SELECTION?
Need an intelligent system that is good at reading and hypothesizing
Like you
28. COMMUNITY BOARD VIEW,
CHOOSE OPEN BOARD
You beat this one
The community
finished this board
(e.g. 11 different
players completed it)
This board is still open
29. BOARDS
• 25 genes each
• randomly selected from 1,250 genes that passed an
unsupervised filter for minimum expression level and variance
for a particular dataset [1],[2]
• 4 different 100 board rounds completed, each with some overlap
• 3731 distinct genes used in the game
[1] Curtis, Christina, et al. "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature (2012)
[2] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine (2013)
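A board-building round as described above might look like the following sketch. The slide does not say whether boards within a round are disjoint; here each board is drawn independently, so a gene can recur across boards. The gene names, function name, and seed are hypothetical.

```python
import random

def make_round(gene_pool, n_boards=100, genes_per_board=25, seed=0):
    """Draw each board as a random sample of distinct genes from the pool."""
    rng = random.Random(seed)
    return [rng.sample(gene_pool, genes_per_board) for _ in range(n_boards)]

# Stand-in for the 1,250 genes that passed the unsupervised filter.
pool = [f"gene_{i}" for i in range(1250)]
boards = make_round(pool)

# Genes that ended up on at least one board in this round.
distinct = {gene for board in boards for gene in board}
print(len(boards), "boards,", len(distinct), "distinct genes")
```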
30. 1,077 Players registered (one year)
http://io9.com/these-cool-games-let-you-do-real-life-science-486173006
PLAYERS
[Figure: new player registrations per month (0 to 250), broken down by self-reported degree (PhD, MD, MSc, BA, none, Did not state, Other), with %PhD on a second axis (0 to 0.4). Month labels are truncated in the source. Annotation: Sage DREAM7 challenge, game announcement.]
32. GAMES PLAYED
• 9,904 games (non training)
[Figure, two panels. “Total games played per player”: total games played (log scale, 1 to 10,000) by player (0 to 800). “games played, top 20 players”: games played (0 to 800) per player, with bars labeled by degree (PhD, MD, MS).]
33. GENE RANKINGS FROM GAMES
make predictions
find patterns
<10 yr
survival
>10 yr
survival
34. GENE RANKINGS FROM GAMES
For each gene:
1. O = number of times it appeared in a game (some genes occur on multiple boards, all boards are played multiple times, all occurrences are counted)
2. S = number of times it was selected by a player
3. F = S/O
• Games can be filtered based on player data
• We can estimate an empirical P value for each value of O, S
• P reflects the chances of getting S or more by chance given O
Examples (all games):
• B-cell lymphoma 2 gene: O = 13, S = 10, F = 10/13 = 0.77, P < 0.0001
• Alanine and arginine rich domain containing protein: O = 33, S = 3, F = 3/33 = 0.09, P = 0.91
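The ranking statistics can be sketched as follows, with F computed as S divided by O and P estimated by simulating random play. The null model is an assumption: the slides do not state the chance that a random player selects a given gene when it appears, so `p_select` is a hypothetical parameter and the simulated values are not expected to reproduce the P values quoted above.

```python
import random

def selection_frequency(occurrences, selections):
    """F = S / O: fraction of appearances in which the gene was picked."""
    return selections / occurrences

def empirical_p(occurrences, selections, p_select, n_sim=10_000, seed=0):
    """One-tailed empirical P: share of simulated random-play series in which
    the gene is picked at least `selections` times in `occurrences` appearances."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        picks = sum(rng.random() < p_select for _ in range(occurrences))
        if picks >= selections:
            hits += 1
    return hits / n_sim

# O and S taken from the two examples on the slide; p_select = 0.2 is assumed.
for name, o, s in [("BCL2", 13, 10), ("AARD", 33, 3)]:
    f = selection_frequency(o, s)
    p = empirical_p(o, s, p_select=0.2)
    print(f"{name}: F = {f:.2f}, empirical P = {p:.4f}")
```

With a finite number of simulations the smallest nonzero P that can be reported is 1/`n_sim`, so a result of 0 should be read as "below 1/`n_sim`".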
35. GENES SELECTED BY ALL PLAYERS
9904 GAMES
P<0.001, 60 GENES
Top 10 enriched disease annotations
n genes
adj. P < 2.43e-06
background = 3731 genes
used in any game
Top 10 genes
Wang, Jing, et al. "WEB-based GEne SeT
AnaLysis Toolkit (WebGestalt): update 2013."
Nucleic acids research (2013).
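Over-representation of an annotation in a gene set is commonly assessed with a hypergeometric test against a background. The sketch below shows that general technique only; it is not WebGestalt's code, it omits the multiple-testing adjustment behind the "adj. P" values, and the annotation counts in the example are hypothetical.

```python
from math import comb

def hypergeom_sf(k, background, annotated, drawn):
    """P(X >= k): chance that at least k of `drawn` genes carry an annotation
    held by `annotated` of the `background` genes, under random selection."""
    upper = min(annotated, drawn)
    total = comb(background, drawn)
    favourable = sum(
        comb(annotated, i) * comb(background - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / total

# Background (3731) and gene set size (60) are from the slide; the annotation
# counts (100 annotated in the background, 10 in the set) are made up.
p_value = hypergeom_sf(k=10, background=3731, annotated=100, drawn=60)
print(f"unadjusted enrichment P = {p_value:.3g}")
```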
36. GENES SELECTED BY PEOPLE:
WITH PHDS
WITH KNOWLEDGE OF CANCER,
2373 GAMES
P<0.001, 82 GENES
Top 10 enriched disease annotations
“Expert Gene Set”
n genes
adj. P < 5.76e-08
Top 10 genes
37. GENES SELECTED BY PEOPLE:
WITHOUT PHDS,
WITH NO KNOWLEDGE OF CANCER,
THAT ARE NOT BIOLOGISTS
3607 GAMES
P<0.001 , 10 GENES
Top 10 genes
• Gene set not
significantly enriched
with any disease
annotations
39. EVEN WITHOUT FILTERING, THE DATA CONTAINS
THE KNOWLEDGE
•
“All Players” still contained significant cancer signal.
40. PROBLEM: GENE SELECTION INSTABILITY
instability: different methods, different datasets
produce different gene sets for the same phenotype [1]
[1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
41. GENE SET OVERLAPS, SOME BUT NOT MUCH
“Expert Gene Set”
http://bioinformatics.psb.ugent.be/webtools/Venn/
42. PROBLEM: THE VALIDATION GAP
training
data, test
data
validation
validation: predictive signatures often perform
worse on independent data created for validation.
Photograph by Richard Hallman, National Geographic Adventure Blog
43. CLASSIFIER PERFORMANCE WITH DIFFERENT
GENE GROUPS, DIFFERENT DATASETS
[Figure: scatter plot of classifier test-set performance for 10 year survival (Yes / No). X-axis: test set performance, Griffith 2013 data. Y-axis: test set performance, Metabric training, Oslo test. Labeled point: “Expert Gene Set”.]
The only difference between points is the set of genes used to build the SVM classifier.
44. SUMMARY
Plusses
• 1 year
• 1,000 players, 150 PhDs
• 10,000 games
• “expert knowledge” captured through an open game
• New gene ranking method with results competitive with established approaches
• Game is now in use in an undergraduate class
Minuses
• Did not make a significantly better breast cancer survival predictor
• Game could have been better in many ways
  • no beginning, middle or end
  • random guessing can win
  • easy to cheat
46. THE END
Thanks to:
Players!!!!
Andrew Su
Salvatore Loguercio
Max Nanis
Karthik Gangavarapu
Funding
More information at:
http://genegames.org/cure/
bgood@scripps.edu
@bgood
We are hiring! Looking for
postdocs, programmers
interested in crowdsourcing
and bioinformatics.
Contact: asu@scripps.edu
47. GAMES WITH A PURPOSE
of collecting expert level knowledge
Khatib, Firas, et al. "Algorithm discovery by
protein folding game players." Proceedings of
the National Academy of Sciences (2011)
Loguercio, Salvatore, et al.
"Dizeez: an online game for
human gene-disease
annotation." PloS One (2013)
MOLT
The Cure
48. HUMAN GUIDED FOREST (HGF)
Let CURE players build
decision modules
http://i9606.blogspot.com/2012/04/human-guided-forests-hgf.html
49. WHY DID YOU SIGN UP? (83 RESPONSES)
Why did you sign up for The Cure? (select all that apply)
[Bar chart: share of respondents (axis 0% to 90%) selecting each option]
• To help breast cancer research
• To learn something
• To have fun playing a game
50. WAS THE GAME FUN?
[Bar chart: share of respondents (axis labeled “percent”, 0 to 0.8) choosing each answer]
• Yes, it was very fun
• A little bit entertaining
• No, not at all
51. DO YOU KNOW ANYONE THAT HAS OR HAD
BREAST CANCER?
Have you known or do you currently know anyone that has or has had breast cancer?
Yes
No
52. DID YOU LEARN ANYTHING FROM PLAYING?
[Bar chart: responses per answer (axis 0 to 60)]
• Yes, I felt like I learned a lot
• Yes, I learned a little bit
• No, I did not learn anything
53. MY KNOWLEDGE OF BREAST CANCER IS:
[Bar chart: share of respondents (axis 0 to 0.6) choosing each answer]
• I am an expert in breast cancer
• I have helped conduct cancer research as part of my job
• I know some biology and have some understanding of what cancer is
• I know a little biology, but nothing specific to cancer
• Nothing, I do not know a thing about it
54. AGE?
Which category below includes your age?
17 or younger
18-20
21-29
30-39
40-49
50-59
60 and above
60. OVERLAP OF SIGNIFICANT GENE SETS FROM
DIFFERENT CURE GAME FILTERS
PhD or MD (3,070 games)
Cancer Knowledge (4,660 games)
Biologist (4,913 games)
PhD & Cancer Knowledge (2,373 games)
No Expertise (3,607 games)
61. MOST RANDOM GENE EXPRESSION SIGNATURES ARE
SIGNIFICANTLY ASSOCIATED WITH BREAST CANCER
OUTCOME
Still need to pick gene sets
Feature selection challenge still relevant
Very useful grain of salt in interpreting these results.
Venet et al. (2011). PLoS Comp. Bio.
Editor's Notes
What if we could harness just a tiny fraction of that human effort???
All of this work still requires human effort
a, Two-dimensional presentation of transcript ratios for 98 breast tumours. There were 4,968 significant genes across the group. Each row represents a tumour and each column a single gene. As shown in the colour bar, red indicates upregulation, green downregulation, black no change, and grey no data available. The yellow line marks the subdivision into two dominant tumour clusters. b, Selected clinical data for the 98 patients in a: BRCA1 germline mutation carrier (or sporadic patient), ER expression, tumour grade 3 (versus grade 1 and 2), lymphocytic infiltrate, angioinvasion, and metastasis status. White indicates positive, black negative and grey denotes tumours derived from BRCA1 germline carriers who were excluded from the metastasis evaluation. The cluster below the yellow line consists of 36 tumours, of which 34 are ER negative (total 39 ER-negative) and 16 are carriers of the BRCA1 mutation (total 18). c, Enlarged portion from a containing a group of genes that co-regulate with the ER- gene (ESR1). Each gene is labelled by its gene name or accession number from GenBank. Contig ESTs ending with RC are reverse-complementary of the named contig EST. d, Enlarged portion from a containing a group of co-regulated genes that are the molecular reflection of extensive lymphocytic infiltrate, and comprise a set of genes expressed in T and B cells. (Gene annotation as in c.)
The main statistical reason is inadequate sample size and correlated data structure (Xu 2010). This makes it difficult to trust the predictors when different genes appear every time.
Progress is being made on this issue, though: e.g. Margolin showed very good agreement between cross-validation, test set, and validation performance for models submitted to the Sage challenge.
Walk you through it as a player and then I’ll explain what is going on.
Playing 10,000 75-game series, we would only expect 27 or more occurrences of a particular gene in 1 of the 10,000 series.
BCL2: B-cell lymphoma 2, regulator of apoptosis.
AARD: alanine and arginine rich domain containing protein, no information.
Single tailed: the chances of getting S by chance given O.
Disease terms came from PharmGKB associations to genes made using NCBI gene and Pubmed.