THE CURE: A GAME WITH THE PURPOSE OF
GENE SELECTION FOR BREAST CANCER
SURVIVAL PREDICTION
Benjamin Good*, Salvatore Loguercio, Max Nanis, Andrew Su
The Scripps Research Institute
http://genegames.org/cure/
Rocky 2013
A QUESTION

How would you get 150 PhD level scientists
to work together on the same problem?

Without any money?
TRAIL MAP

Games
Survival Prediction
The Cure
WHY GAMES?

It is estimated that 9 billion
hours are spent playing
Solitaire every year

Luis von Ahn. Google Tech Talk: Human Computation, 2006.
(Shortly after receiving $500,000 "Genius Grant" for this work)
Seven million hours of human labor

ONE YEAR SOLITAIRE =
1,285 EMPIRE STATE
BUILDINGS

Empire State Building
150 billion hours gaming each year

What if we could use a tiny fraction of that
human effort to achieve another purpose?
[Chart: hours of human effort. Empire State Building: 7M; one year of Solitaire: 9B; one year of games: 150B]

McGonigal J. Reality is broken : why games make us better and how they can
change the world. New York: Penguin Press; 2011.
PURPOSES
Computer
science
Find objects
inside
images
Tag songs

Label all images
on the Web

Rate image
quality

Biology
Figure out how
proteins fold

Teach computers
English

Design RNA
molecules

Build ontologies
Map connections
between neurons

Link genes with
diseases

Assemble
genomes

Align DNA and
protein sequences

Tag Malaria parasites
in blood smears

Develop better
treatments for
breast cancer
GAMES WITH A PURPOSE

MOLT
The Cure
TRAIL MAP

Games
Survival Prediction
The Cure
INFERRING SURVIVAL PREDICTORS
[Figure: find patterns in samples with known outcome (10 year survival? Yes / No), then make predictions on new samples (10 year survival? Yes / No)]
van't Veer, Laura J., et al. "Gene expression profiling predicts clinical outcome of breast cancer." Nature 415.6871 (2002): 530-536.
INFERRING SURVIVAL PREDICTORS
find patterns

make predictions
No

10 year survival?
Yes

1) select genes

Out of the 25,000+ genes, which
small set works together the best?

2) infer predictor from data (e.g. decision tree, SVM, etc.)
PROBLEM: GENE SELECTION INSTABILITY

instability: different methods, different datasets
produce different gene sets for the same phenotype [1]

[1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
PROBLEM: THE VALIDATION GAP

training
data, test
data
validation
validation: predictive signatures often perform
worse on independent data created for validation.

Photograph by Richard Hallman, National Geographic Adventure Blog
ADDING PRIOR KNOWLEDGE TO THE DISCOVERY
ALGORITHM
make predictions
find patterns

<10 yr
survival
>10 yr
survival
EX.) NETWORK GUIDED FORESTS

Use network to find
good gene
combinations

Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
BUT MOST KNOWLEDGE IS NOT STRUCTURED
[Chart: number of articles added to PubMed; axis from 500,000 to 1,000,000]

112 publications/hour
(37 more by the end of this talk)
>160,000 publications linked to “breast cancer” since 2000
http://tinyurl.com/brsince2000
HOW CAN WE USE UNSTRUCTURED
KNOWLEDGE FOR GENE SELECTION?

Need an intelligent system that is good at reading and hypothesizing

Like you
TRAIL MAP

Games
Survival Prediction
The Cure
THE CURE

HTTP://GENEGAMES.ORG/CURE/
education level?
cancer knowledge?

biologist?
PLAY = GENE SELECTION
Opponent's hand

Alternate turns
picking a gene from
a “board” of 25

Your
hand
SCORING
Score reflects accuracy of
decision tree created with
just the selected genes
on real training data

Cure Server
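
As a rough illustration of the scoring idea, the sketch below fits a one-split decision tree (a decision stump) on a single selected feature and reports its training accuracy. This is a simplification swapped in for illustration: the slides say only that the Cure Server builds a decision tree from the selected genes. The function name `stump_accuracy` and the toy expression values are hypothetical.

```python
def stump_accuracy(values, labels):
    """Best training accuracy of a single-threshold split on one feature.

    values: numeric feature values (e.g. expression of one selected gene)
    labels: matching booleans (e.g. True = survived 10 years)
    """
    if len(values) != len(labels) or not values:
        raise ValueError("values and labels must be non-empty and the same length")
    n = len(values)
    # Baseline: no split, always predict the majority class.
    positives = sum(labels)
    best = max(positives, n - positives)
    # Candidate thresholds sit between consecutive distinct sorted values.
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        # Count samples classified correctly if "above threshold" means True...
        correct = sum((v > threshold) == y for v, y in zip(values, labels))
        # ...or if the split is oriented the other way round.
        best = max(best, correct, n - correct)
    return best / n


# Hypothetical toy data: expression of one gene in six tumour samples.
expression = [0.2, 0.4, 0.5, 1.1, 1.3, 1.6]
survived = [False, False, False, True, True, False]
print(f"training accuracy: {stump_accuracy(expression, survived):.2f}")
```

A player's hand holds several genes, so a fuller version would grow a multi-level tree over all of them; the single split is only meant to show what "accuracy on training data" measures.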
PLAY WITH KNOWLEDGE: GENE ONTOLOGY
PLAY WITH KNOWLEDGE: GENE RIFS
YOU WIN!
COMMUNITY BOARD VIEW,
CHOOSE OPEN BOARD
You beat this one

The community
finished this board
(e.g. 11 different
players completed it)

This board is still open
BOARDS
• 25 genes each
• randomly selected from 1,250 genes that passed an unsupervised filter for minimum expression level and variance for a particular dataset [1],[2]
• 4 different 100-board rounds completed, each with some overlap
• 3731 distinct genes used in the game

[1] Curtis, Christina, et al. "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature (2012)
[2] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine (2013)
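
A minimal sketch of how boards like these could be dealt, assuming each board is an independent random sample of 25 genes from the filtered pool (the slides do not say how the overlap between boards arose). The function name `deal_boards`, the placeholder gene names, and the fixed seed are all hypothetical.

```python
import random


def deal_boards(gene_pool, n_boards, board_size=25, seed=0):
    """Draw n_boards boards, each a random sample of board_size distinct genes.

    Genes are distinct within a board but may repeat across boards.
    """
    if board_size > len(gene_pool):
        raise ValueError("board_size exceeds the number of genes in the pool")
    rng = random.Random(seed)  # fixed seed so the deal is reproducible
    return [rng.sample(gene_pool, board_size) for _ in range(n_boards)]


# Placeholder names standing in for the 1,250 genes that passed the filter.
pool = [f"gene_{i}" for i in range(1250)]
boards = deal_boards(pool, n_boards=100)
distinct = {gene for board in boards for gene in board}
print(len(boards), "boards,", len(distinct), "distinct genes")
```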
1,077 Players registered (one year)
http://io9.com/these-cool-games-let-you-do-real-life-science-486173006

PLAYERS
[Chart: new player registrations by month (axis 0 to 250), broken down by degree (PhD, MD, MSc, BA, none, did not state, other), with %PhD on a second axis (0 to 0.4). Annotation: Sage DREAM7 challenge, game announcement]
PLAYER DEMOGRAPHICS
[Charts: most recent degree (graduate_degree, undergraduate, none); cancer knowledge? (no, ns, yes); are you a biologist? (no, ns, yes)]
GAMES PLAYED

• 9,904 games (non training)

[Charts: total games played per player (log scale, 1 to 10,000); games played, top 20 players, labeled by degree (PhD, MD, MS)]
GENE RANKINGS FROM GAMES
make predictions
find patterns

<10 yr
survival
>10 yr
survival
GENE RANKINGS FROM GAMES
For each gene:
1. O = number of times it appeared in a game (some genes occur on multiple boards, all boards are played multiple times, all occurrences are counted)
2. S = number of times it was selected by a player
3. F = S/O

• Games can be filtered based on player data
• We can estimate an empirical P value for each value of O, S
• P reflects the probability of seeing S or more selections by chance given O

Examples (all games):
• B-cell lymphoma 2 gene: O = 13, S = 10, F = 10/13 = 0.77, P < 0.0001
• Alanine and arginine rich domain containing protein: O = 33, S = 3, F = 3/33 = 0.09, P = 0.91
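
The ranking described on this slide can be sketched in a few lines. The selection frequency is F = S / O, and the empirical P value here is a Monte Carlo estimate of the chance of seeing S or more selections in O appearances. The null model is an assumption, not something the slides specify: each appearance is treated as an independent pick with a fixed probability `p_chance`, and its value of 0.2 is purely illustrative, so the output is not expected to reproduce the P values quoted in the examples. The function names are hypothetical.

```python
import random


def selection_frequency(occurrences, selections):
    """F = S / O: fraction of a gene's appearances in which it was picked."""
    if occurrences <= 0:
        raise ValueError("occurrences must be positive")
    return selections / occurrences


def empirical_p(occurrences, selections, p_chance=0.2, trials=10_000, seed=0):
    """Estimate P(S or more selections in O appearances) under random play.

    Assumed null model: each appearance is an independent pick with
    probability p_chance. The +1 terms keep the estimate above zero.
    """
    rng = random.Random(seed)
    as_extreme = 0
    for _ in range(trials):
        simulated = sum(rng.random() < p_chance for _ in range(occurrences))
        if simulated >= selections:
            as_extreme += 1
    return (as_extreme + 1) / (trials + 1)


# (O, S) pairs taken from the two examples on the slide.
for name, o, s in [("B-cell lymphoma 2", 13, 10),
                   ("Alanine and arginine rich domain containing protein", 33, 3)]:
    f = selection_frequency(o, s)
    p = empirical_p(o, s)
    print(f"{name}: F = {f:.2f}, empirical P ~ {p:.4f}")
```

Filtering by player data would amount to recomputing O and S over only the games played by the chosen group before calling these functions.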
GENES SELECTED BY ALL PLAYERS
9904 GAMES
P<0.001, 60 GENES
Top 10 enriched disease annotations

n genes

adj. P < 2.43e-06
background = 3731 genes
used in any game

Top 10 genes

Wang, Jing, et al. "WEB-based GEne SeT
AnaLysis Toolkit (WebGestalt): update 2013."
Nucleic acids research (2013).
GENES SELECTED BY PEOPLE:
WITH PHDS
WITH KNOWLEDGE OF CANCER,
2373 GAMES
P<0.001, 82 GENES
Top 10 enriched disease annotations

“Expert Gene Set”
n genes

adj. P < 5.76e-08
Top 10 genes
GENES SELECTED BY PEOPLE:
WITHOUT PHDS,
WITH NO KNOWLEDGE OF CANCER,
THAT ARE NOT BIOLOGISTS
3607 GAMES
P<0.001 , 10 GENES
Top 10 genes

• Gene set not
significantly enriched
with any disease
annotations
SELF REPORTING SEEMED TO WORK...
EVEN WITHOUT FILTERING, THE DATA CONTAINS
THE KNOWLEDGE
•

“All Players” still contained significant cancer signal.
PROBLEM: GENE SELECTION INSTABILITY

instability: different methods, different datasets
produce different gene sets for the same phenotype [1]

[1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
GENE SET OVERLAPS, SOME BUT NOT MUCH
“Expert Gene Set”

http://bioinformatics.psb.ugent.be/webtools/Venn/
PROBLEM: THE VALIDATION GAP

training
data, test
data
validation
validation: predictive signatures often perform
worse on independent data created for validation.

Photograph by Richard Hallman, National Geographic Adventure Blog
CLASSIFIER PERFORMANCE WITH DIFFERENT
GENE GROUPS, DIFFERENT DATASETS
10 year survival
Yes
No

X-axis Test Set performance
Griffith 2013 data
“Expert Gene Set”

Y-axis Test Set performance
Metabric training Oslo Test

Only difference between
points, are the genes used to
build SVM classifier
SUMMARY
Plusses
• 1 year
• 1,000 players, 150 PhDs
• 10,000 games
• “expert knowledge” captured through an open game
• New gene ranking method with results competitive with established approaches
• Game is now in use in an undergraduate class

Minuses
• Did not make a significantly better breast cancer survival predictor
• Game could have been better in many ways
  • no beginning, middle or end
  • random guessing can win
  • easy to cheat
NEXT STEPS
• More fun
• More learning for novices
• More control for experts
• More data
THE END
Thanks to:
Players!!!!
Andrew Su
Salvatore Loguercio
Max Nanis
Karthik Gangavarapu
Funding

More information at:
http://genegames.org/cure/
bgood@scripps.edu
@bgood
We are hiring! Looking for
postdocs, programmers
interested in crowdsourcing
and bioinformatics.
Contact: asu@scripps.edu
GAMES WITH A PURPOSE

of collecting expert level knowledge

Khatib, Firas, et al. "Algorithm discovery by
protein folding game players." Proceedings of
the National Academy of Sciences (2011)

Loguercio, Salvatore, et al.
"Dizeez: an online game for
human gene-disease
annotation." PloS One (2013)

MOLT
The Cure
HUMAN GUIDED FOREST (HGF)

Let CURE players build
decision modules

http://i9606.blogspot.com/2012/04/human-guided-forests-hgf.html
WHY DID YOU SIGN UP? (83 RESPONSES)
Why did you sign up for The Cure? (select all that apply)
[Bar chart, percent of respondents: To help breast cancer research; To learn something; To have fun playing a game]
WAS THE GAME FUN?
[Bar chart, share of respondents: Yes, it was very fun; A little bit entertaining; No, not at all]
DO YOU KNOW ANYONE THAT HAS OR HAD
BREAST CANCER?
Have you known or do you currently know anyone that has or has had breast cancer?

Yes
No
DID YOU LEARN ANYTHING FROM PLAYING?
[Bar chart: Yes, I felt like I learned a lot; Yes, I learned a little bit; No, I did not learn anything]
MY KNOWLEDGE OF BREAST CANCER IS:
[Bar chart, share of respondents:]
• I am an expert in breast cancer
• I have helped conduct cancer research as part of my job
• I know some biology and have some understanding of what cancer is
• I know a little biology, but nothing specific to cancer
• Nothing, I do not know a thing about it
AGE?
Which category below includes your age?

17 or younger
18-20
21-29
30-39
40-49
50-59
60 and above
GENDER?
What is your gender?

Female
Male
TRAINING LEVELS
the decision tree created using the
feature “makes milk” is 100%
correct on training data, you win!
TRAINING INTERFACE

Choose the feature that best
distinguishes mammals from other
creatures
TRAINING INTERFACE

the decision tree created using the
feature “has hair” is 94% correct
on training data, you win!
OVERLAP OF SIGNIFICANT GENE SETS FROM
DIFFERENT CURE GAME FILTERS
PhD or MD (3,070 games)
Cancer Knowledge (4,660 games)
Biologist (4,913 games)

PhD & Cancer Knowledge (2,373 games)

No Expertise (3,607 games)
MOST RANDOM GENE EXPRESSION SIGNATURES ARE
SIGNIFICANTLY ASSOCIATED WITH BREAST CANCER
OUTCOME

Still need to pick gene sets
Feature selection challenge still relevant
Very useful grain of salt in interpreting these results.

Venet et al. (2011). PLoS Comp. Bio.

The Cure: A Game with the Purpose of Gene Selection for Breast Cancer Survival Prediction

  • 1. THE CURE: A GAME WITH THE PURPOSE OF GENE SELECTION FOR BREAST CANCER SURVIVAL PREDICTION Benjamin Good*, Salvatore Loguercio, Max Nanis, Andrew Su The Scripps Research Institute http://genegames.org/cure/ Rocky 2013
  • 2. A QUESTION How would you get 150 PhD level scientists to work together on the same problem? Without any money?
  • 4. WHY GAMES? It is estimated that 9 billion hours are spent playing Solitaire every year Luis Von Ahn. : Google Tech Talk: Human Computation 2006. (Shortly after receiving $500,000 „Genius Grant‟ for this work)
  • 5. Seven million hours of human labor ONE YEAR SOLITAIRE = 1,285 EMPIRE STATE BUILDINGS Empire State Building
  • 6. 150 billion hours gaming each year What if we could use a tiny fraction of that human effort to achieve another purpose? empire state building 7M one year of solitaire one year of games 9B 150B McGonigal J. Reality is broken : why games make us better and how they can change the world. New York: Penguin Press; 2011.
  • 7. PURPOSES Computer science Find objects inside images Tag songs Label all images on the Web Rate image quality Biology Figure out how proteins fold Teach computers English Design RNA molecules Build ontologies Map connections between neurons Link genes with diseases Assemble genomes Align DNA and protein sequences Tag Malaria parasites in blood smears Develop better treatments for breast cancer
  • 8. GAMES WITH A PURPOSE MOLT The Cure
  • 10. INFERRING SURVIVAL PREDICTORS 10 year Nosurvival? Yes make predictions on new samples find patterns 10 year survival? No Yes van't Veer, Laura J., et al. "Gene expression profiling predicts clinical outcome of breast cancer.” Nature 415.6871 (2002): 530-536.
  • 11. INFERRING SURVIVAL PREDICTORS find patterns make predictions No 10 year survival? Yes 1) select genes Out of the 25,000+ genes, which small set works together the best? 2) infer predictor from data (e.g. decision tree, SVM, etc.)
  • 12. PROBLEM: GENE SELECTION INSTABILITY instability: different methods, different datasets produce different gene sets for the same phenotype [1] [1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
  • 13. PROBLEM: THE VALIDATION GAP training data, test data validation validation: predictive signatures often perform worse on independent data created for validation. Photograph by Richard Hallman, National Geographic Adventure Blog
  • 14. ADDING PRIOR KNOWLEDGE TO THE DISCOVERY ALGORITHM make predictions find patterns <10 yr survival >10 yr survival
  • 15. EX.) NETWORK GUIDED FORESTS Use network to find good gene combinations. Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
  • 16. BUT MOST KNOWLEDGE IS NOT STRUCTURED [Chart: number of articles added to PubMed, axis from 500,000 to 1,000,000] 112 publications/hour (37 more by the end of this talk). >160,000 publications linked to "breast cancer" since 2000 http://tinyurl.com/brsince2000
  • 17. HOW CAN WE USE UNSTRUCTURED KNOWLEDGE FOR GENE SELECTION? Need an intelligent system that is good at reading and hypothesizing Like you
  • 21.
  • 22. PLAY = GENE SELECTION Opponent's hand. Alternate turns picking a gene from a "board" of 25. Your hand
  • 23. SCORING Score reflects accuracy of decision tree created with just the selected genes on real training data Cure Server
  • 24. PLAY WITH KNOWLEDGE: GENE ONTOLOGY
  • 25. PLAY WITH KNOWLEDGE: GENE RIFS
  • 27.
  • 28. COMMUNITY BOARD VIEW, CHOOSE OPEN BOARD You beat this one The community finished this board (e.g. 11 different players completed it) This board is still open
  • 29. BOARDS • 25 genes each • randomly selected from 1,250 genes that passed an unsupervised filter for minimum expression level and variance for a particular dataset [1],[2] • 4 different 100-board rounds completed, each with some overlap • 3,731 distinct genes used in the game [1] Curtis, Christina, et al. "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature (2012) [2] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine (2013)
  • 30. PLAYERS 1,077 players registered (one year) [Chart: new player registrations per month by self-reported degree (PhD, MD, MSc, BA, Other, none, Did not state), with %PhD overlaid; annotations: "Sage DREAM7 challenge, game announcement" and http://io9.com/these-cool-games-let-you-do-real-life-science-486173006]
  • 32. GAMES PLAYED • 9,904 games (non-training) [Charts: total games played per player (log scale); games played by the top 20 players, labeled by degree (PhD, MD, MS)]
  • 33. GENE RANKINGS FROM GAMES make predictions find patterns <10 yr survival >10 yr survival
  • 34. GENE RANKINGS FROM GAMES • For each gene: 1. O = number of times it appeared in a game (some genes occur on multiple boards, all boards are played multiple times, all occurrences are counted) 2. S = number of times it was selected by a player 3. F = S/O • Games can be filtered based on player data • We can estimate an empirical P value for each pair of values O, S • P is the probability of observing S or more selections by chance, given O. Examples (all games): • B-cell lymphoma 2 gene: O = 13, S = 10, F = 10/13 = 0.77, P < 0.0001 • Alanine and arginine rich domain containing protein: O = 33, S = 3, F = 3/33 = 0.09, P = 0.91
  • 35. GENES SELECTED BY ALL PLAYERS 9904 GAMES P<0.001, 60 GENES Top 10 enriched disease annotations n genes adj. P < 2.43e-06 background = 3731 genes used in any game Top 10 genes Wang, Jing, et al. "WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013." Nucleic acids research (2013).
  • 36. GENES SELECTED BY PEOPLE: WITH PHDS WITH KNOWLEDGE OF CANCER, 2373 GAMES P<0.001, 82 GENES Top 10 enriched disease annotations “Expert Gene Set” n genes adj. P < 5.76e-08 Top 10 genes
  • 37. GENES SELECTED BY PEOPLE: WITHOUT PHDS, WITH NO KNOWLEDGE OF CANCER, THAT ARE NOT BIOLOGISTS 3607 GAMES P<0.001 , 10 GENES Top 10 genes • Gene set not significantly enriched with any disease annotations
  • 38. SELF REPORTING SEEMED TO WORK...
  • 39. EVEN WITHOUT FILTERING, THE DATA CONTAINS THE KNOWLEDGE • “All Players” still contained significant cancer signal.
  • 40. PROBLEM: GENE SELECTION INSTABILITY instability: different methods, different datasets produce different gene sets for the same phenotype [1] [1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
  • 41. GENE SET OVERLAPS, SOME BUT NOT MUCH “Expert Gene Set” http://bioinformatics.psb.ugent.be/webtools/Venn/
  • 42. PROBLEM: THE VALIDATION GAP training data, test data validation validation: predictive signatures often perform worse on independent data created for validation. Photograph by Richard Hallman, National Geographic Adventure Blog
  • 43. CLASSIFIER PERFORMANCE WITH DIFFERENT GENE GROUPS, DIFFERENT DATASETS [Scatter plot for 10 year survival (Yes/No) classifiers. X-axis: test set performance, Griffith 2013 data. Y-axis: test set performance, Metabric training, Oslo test. "Expert Gene Set" highlighted.] The only difference between points is the genes used to build the SVM classifier
  • 44. SUMMARY Plusses • 1 year • 1,000 players, 150 PhDs • 10,000 games • "expert knowledge" captured through an open game • New gene ranking method with results competitive with established approaches • Game is now in use in an undergraduate class Minuses • Did not make a significantly better breast cancer survival predictor • Game could have been better in many ways • no beginning, middle or end • random guessing can win • easy to cheat
  • 45. NEXT STEPS • More fun • More learning for novices • More control for experts • More data
  • 46. THE END Thanks to: Players!!!! Andrew Su Salvatore Loguercio Max Nanis Karthik Gangavarapu Funding More information at: http://genegames.org/cure/ bgood@scripps.edu @bgood We are hiring! Looking for postdocs, programmers interested in crowdsourcing and bioinformatics. Contact: asu@scripps.edu
  • 47. GAMES WITH A PURPOSE of collecting expert level knowledge Khatib, Firas, et al. "Algorithm discovery by protein folding game players." Proceedings of the National Academy of Sciences (2011) Loguercio, Salvatore, et al. "Dizeez: an online game for human gene-disease annotation." PloS One (2013) MOLT The Cure
  • 48. HUMAN GUIDED FOREST (HGF) Let CURE players build decision modules http://i9606.blogspot.com/2012/04/human-guided-forests-hgf.html
  • 49. WHY DID YOU SIGN UP? (83 RESPONSES) Why did you sign up for The Cure? (select all that apply) [Bar chart, percent of respondents: To help breast cancer research; To learn something; To have fun playing a game]
  • 50. WAS THE GAME FUN? [Bar chart, share of respondents: Yes, it was very fun; A little bit entertaining; No, not at all]
  • 51. DO YOU KNOW ANYONE THAT HAS OR HAD BREAST CANCER? Have you known or do you currently know anyone that has or has had breast cancer? Yes No
  • 52. DID YOU LEARN ANYTHING FROM PLAYING? [Bar chart of responses: Yes, I felt like I learned a lot; Yes, I learned a little bit; No, I did not learn anything]
  • 53. MY KNOWLEDGE OF BREAST CANCER IS: [Bar chart, share of respondents: I am an expert in breast cancer; I have helped conduct cancer research as part of my job; I know some biology and have some understanding of what cancer is; I know a little biology, but nothing specific to cancer; Nothing, I do not know a thing about it]
  • 54. AGE? Which category below includes your age? 17 or younger 18-20 21-29 30-39 40-49 50-59 60 and above
  • 55. GENDER? What is your gender? Female Male
  • 57. the decision tree created using the feature “makes milk” is 100% correct on training data, you win!
  • 58. TRAINING INTERFACE Choose the feature that best distinguishes mammals from other creatures
  • 59. TRAINING INTERFACE the decision tree created using the feature “has hair” is 94% correct on training data, you win!
  • 60. OVERLAP OF SIGNIFICANT GENE SETS FROM DIFFERENT CURE GAME FILTERS PhD or MD (3,070 games) Cancer Knowledge (4,660 games) Biologist (4,913 games) PhD & Cancer Knowledge (2,373 games) No Expertise (3,607 games)
  • 61. MOST RANDOM GENE EXPRESSION SIGNATURES ARE SIGNIFICANTLY ASSOCIATED WITH BREAST CANCER OUTCOME Still need to pick gene sets. Feature selection challenge still relevant. Very useful grain of salt in interpreting these results. Venet et al. (2011). PLoS Comp. Bio.
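
The scoring shown on the training interface slides (57 to 59) can be illustrated with a minimal sketch. It assumes the simplest possible case: a one-split decision tree built on a single yes/no feature, scored by its accuracy on the training data. The creature table, the feature names and the `stump_accuracy` helper are all made up for illustration, so the percentages below do not reproduce the ones on the slides, and this is not the Cure server's actual code.

```python
from collections import Counter

# Hypothetical toy training data (invented for this sketch).
CREATURES = [
    {"name": "dog",      "makes_milk": True,  "has_hair": True,  "mammal": True},
    {"name": "cat",      "makes_milk": True,  "has_hair": True,  "mammal": True},
    {"name": "bat",      "makes_milk": True,  "has_hair": True,  "mammal": True},
    {"name": "dolphin",  "makes_milk": True,  "has_hair": False, "mammal": True},
    {"name": "honeybee", "makes_milk": False, "has_hair": True,  "mammal": False},
    {"name": "frog",     "makes_milk": False, "has_hair": False, "mammal": False},
    {"name": "eagle",    "makes_milk": False, "has_hair": False, "mammal": False},
    {"name": "trout",    "makes_milk": False, "has_hair": False, "mammal": False},
    {"name": "lizard",   "makes_milk": False, "has_hair": False, "mammal": False},
    {"name": "crab",     "makes_milk": False, "has_hair": False, "mammal": False},
]


def stump_accuracy(rows, feature, label="mammal"):
    """Training accuracy of a one-split tree on a single boolean feature.

    Each of the two leaves (feature True / feature False) predicts the
    majority label among the training rows that fall into it.
    """
    correct = 0
    for value in (True, False):
        leaf_labels = [row[label] for row in rows if row[feature] == value]
        if leaf_labels:
            # Rows matching the leaf's majority label are classified correctly.
            correct += Counter(leaf_labels).most_common(1)[0][1]
    return correct / len(rows)


if __name__ == "__main__":
    for feature in ("makes_milk", "has_hair"):
        print(f"{feature}: {stump_accuracy(CREATURES, feature):.0%} correct on training data")
```

In the real game a hand holds several genes with continuous expression values, so the tree has more than one split; the sketch only shows why one feature can beat another on the same training set.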

Editor's Notes

  1. What if we could harness just a tiny fraction of that human effort???
  2. All of this work still requires human effort
  3. a, Two-dimensional presentation of transcript ratios for 98 breast tumours. There were 4,968 significant genes across the group. Each row represents a tumour and each column a single gene. As shown in the colour bar, red indicates upregulation, green downregulation, black no change, and grey no data available. The yellow line marks the subdivision into two dominant tumour clusters. b, Selected clinical data for the 98 patients in a: BRCA1 germline mutation carrier (or sporadic patient), ER expression, tumour grade 3 (versus grade 1 and 2), lymphocytic infiltrate, angioinvasion, and metastasis status. White indicates positive, black negative and grey denotes tumours derived from BRCA1 germline carriers who were excluded from the metastasis evaluation. The cluster below the yellow line consists of 36 tumours, of which 34 are ER negative (total 39 ER-negative) and 16 are carriers of the BRCA1 mutation (total 18). c, Enlarged portion from a containing a group of genes that co-regulate with the ER- gene (ESR1). Each gene is labelled by its gene name or accession number from GenBank. Contig ESTs ending with RC are reverse-complementary of the named contig EST. d, Enlarged portion from a containing a group of co-regulated genes that are the molecular reflection of extensive lymphocytic infiltrate, and comprise a set of genes expressed in T and B cells. (Gene annotation as in c.)
  5. Statistically, the main reason is inadequate sample size and correlated data structure (Xu 2010). This makes it difficult to trust the predictors when different genes appear every time.
  6. though progress is being made on this issue, e.g. Margolin showed very good agreement between cross-validation, test set, and validation performance for models submitted to Sage challenge.
  8. Walk you through it as a player and then I’ll explain what is going on.
  9. Playing 10,000 75-game series, we would only expect 27 or more occurrences of a particular gene in 1 of the 10,000 series. BCL2: B-cell lymphoma 2, regulator of apoptosis. AARD: alanine and arginine rich domain containing protein, no information.
  10. Playing 10,000 75-game series, we would only expect 27 or more occurrences of a particular gene in 1 of the 10,000 series. BCL2: B-cell lymphoma 2, regulator of apoptosis. AARD: alanine and arginine rich domain containing protein, no information. Single tailed: the chances of getting S by chance given O.
  11. Disease terms came from PharmGKB associations to genes made using NCBI Gene and PubMed.
  12. Disease terms came from PharmGKB associations to genes made using NCBI Gene and PubMed.
  13. Statistically, the main reason is inadequate sample size and correlated data structure (Xu 2010).
  14. though progress is being made on this issue, e.g. Margolin showed very good agreement between cross-validation, test set, and validation performance for models submitted to Sage challenge.
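
The empirical P value described on slide 34 and in notes 9 and 10 can be sketched as a small simulation. This is a hedged illustration, not the authors' code: the talk does not state its null model, so the sketch assumes that a randomly playing participant selects each gene independently with probability `pick_prob` (0.2 is a placeholder), and it simulates 10,000 random series per gene, as in the notes. The function names are hypothetical, and the P values it prints need not match the figures on the slide.

```python
import random


def selection_frequency(selected, occurrences):
    """F = S / O: fraction of a gene's appearances in which it was selected."""
    return selected / occurrences


def empirical_p(occurrences, selected, pick_prob=0.2, n_series=10_000, seed=0):
    """One-tailed empirical P: share of random series with S or more selections.

    Assumption (not from the talk): under random play the gene is picked
    independently with probability `pick_prob` each time it appears.
    """
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_series):
        # One simulated series: the gene appears `occurrences` times.
        picks = sum(rng.random() < pick_prob for _ in range(occurrences))
        if picks >= selected:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_series


if __name__ == "__main__":
    # O and S taken from the two examples on slide 34.
    for gene, o, s in (("BCL2", 13, 10), ("AARD", 33, 3)):
        f = selection_frequency(s, o)
        p = empirical_p(o, s)
        print(f"{gene}: O={o} S={s} F={f:.2f} empirical P={p:.4f}")
```

Simulating the null instead of using a closed-form tail keeps the sketch easy to adapt if the real chance of picking a gene depends on the board, the turn order or the hand size.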