1. Open Science: When
Theory Meets Practice
Philip E. Bourne PhD, FACMI
Stephenson Chair of Data Science
Director, Data Science Institute
Professor of Biomedical Engineering
peb6a@virginia.edu
https://www.slideshare.net/pebourne
Albert Dorfman Lecture 6/8/17
2. My Bias in Addressing this
Question
• Research in computational biology and big data
• Open science zealot
• AVC for Innovation UCSD
• Maintained biological data resources for 15 years
(PDB, IEDB)
• Chief Data Officer of the NIH for 3 years (federal
view)
• DSI Director for 1 month (state view)
3. Open Science: One Definition
• Making as much of the basic and clinical research
life cycle as open as possible without compromising
the wishes of all stakeholders with the view of
accelerating and improving the quality of the
research process.
• Having as many people as possible contribute to
research outcomes.
5. Diffuse Intrinsic Pontine Gliomas (DIPG)
• Occur in 1 in 100,000
individuals
• Peak incidence 6-8 years
of age
• Median survival 9-12
months
• Surgery is not an option
• Chemotherapy is ineffective
and radiotherapy offers only
transient benefit
From Adam Resnick
6. Timeline of genomic studies in DIPG
• Landmark studies identify
histone mutations as
recurrent driver mutations in
DIPG ~2012
• Almost 3 years later, in
largely the same datasets,
but partially expanded, the
same two groups and 2
others identify ACVR1
mutations as a secondary, co-
occurring mutation
From Adam Resnick
7. What do we need to do differently to
reveal ACVR1?
• ACVR1 is a targetable kinase
• Inhibition of ACVR1 inhibited tumor
progression in vitro
• ~300 DIPG patients a year
• ~60 are predicted to have ACVR1
• If only large-scale data sets had been
integrated with TCGA and/or rare
disease data in 2012, ACVR1 mutations
would have been identified
• 60 patients/year × 3 years = 180
children (who likely succumbed to
the disease during that time) whose lives
could have been impacted if only the data were FAIR
From Adam Resnick
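The estimate on this slide can be reproduced as a back-of-the-envelope calculation. The sketch below is illustrative only: the function and variable names are hypothetical, and the inputs are the slide's approximate figures (~300 DIPG patients a year, ~60 of them predicted to have ACVR1 mutations, a delay of roughly 3 years).

```python
# Back-of-the-envelope estimate using the slide's approximate figures.
# All names here are illustrative and do not come from the original talk.

def patients_potentially_impacted(acvr1_patients_per_year: int, delay_years: int) -> int:
    """Patients with ACVR1 mutations who presented during the discovery delay."""
    return acvr1_patients_per_year * delay_years


dipg_patients_per_year = 300   # ~300 DIPG patients a year (slide figure)
acvr1_patients_per_year = 60   # ~60 predicted to have ACVR1 mutations (slide figure)
delay_years = 3                # ~2012 to almost 3 years later (slide figure)

acvr1_fraction = acvr1_patients_per_year / dipg_patients_per_year
impacted = patients_potentially_impacted(acvr1_patients_per_year, delay_years)

print(f"ACVR1 fraction of DIPG patients: {acvr1_fraction:.0%}")
print(f"Patients potentially impacted over {delay_years} years: {impacted}")
```

The result is only as good as the rounded inputs; it is an order-of-magnitude illustration, not an epidemiological estimate.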
8. Use case: Having as many people as
possible contribute to research
outcomes…
The Story of Meridith
9. A broader example of what comes
out of open science…
10. Driving sharing and innovation: Open
Science Prize
NIH, Wellcome Trust, HHMI
https://www.openscienceprize.org
Accepted PLOS Biology
• An international scientific challenge competition to encourage and
support the prototyping and development of services, tools, or
platforms that enable utilization of open content
• 96 submissions received
• Solvers from 45 countries,
spanning 5 continents
• Timeline
• May 2016: Phase 1 winners announced at Health DataPalooza
• Dec 1, 2016: Presentations and public voting
• Feb 2017: Overall winner announced
11. Consider some of the history of open
science from the NIH perspective …
Some slides courtesy of Francis Collins
17. A Culture of Sharing
• 1999: Research Tools Policy
• 2003: NIH Data Sharing Policy
• 2004: Model Organism Policy
• 2007: Genome-wide Association (GWAS) Policy
• 2008: NIH Public Access Policy (Publications)
• 2012: Big Data to Knowledge (BD2K) Initiative
• 2013: White House Initiative (“Holdren Memo”)
• 2014: Genomic Data Sharing (GDS) Policy
• 2014: Modernization of NIH Clinical Trials
18. Guiding Principle of NIH GWAS
Policy
The greatest public benefit will be realized if
data from GWAS are made available, under
terms and conditions consistent with the
informed consent provided by individual
participants, in a timely manner to the largest
possible number of investigators.
NIH expectation that data would be shared in the NIH
database of Genotype and Phenotype (dbGaP)
19. Data Access Requests Per Year
2007–September 2015
[Bar chart: data access requests per year, 2007 through September 2015; series: Total and Approved; labeled values: 32,962 and 21,973]
20. A Culture of Sharing
21. NIH Public Access Policy for Publications
• Ensures public access to published results of all
research funded by NIH since 2008
• Recipients of NIH funds required to submit final peer-
reviewed journal manuscripts to PubMed Central (PMC)
upon acceptance for publication
• Papers must be accessible to the public on PMC no later
than 12 months after publication
22. A Culture of Sharing
23. Harnessing Data to Improve Health:
BD2K (Big Data to Knowledge)
NIH’s 6-year initiative to use data science to foster an
open digital ecosystem that will accelerate efficient,
cost-effective biomedical research to enhance health,
lengthen life, and reduce illness and disability
Programs and activities:
• Advance discovery for biomedical research
• Facilitate use and re-use of biomedical data
• Develop analytical methods and software
• Enhance biomedical data science training
24. Will we do this research in a different
way?
Will it become more like Airbnb?
6/6/17 UNC 24
Bonazzi & Bourne 2017 PLOS Biology 15(4) e2001818
25. I am not crazy, hear me out
• Airbnb is a platform that supports a trusted relationship
between consumer (renter) and supplier (host)
• The platform focuses on maximizing the exchange of
services between supplier and consumer and maximizing
the amount of trust associated with a given stakeholder
• It seems to be working:
• 60 million users searching 2 million listings in 192 countries
• Average of 500,000 stays per night.
• Valuation of US $25bn
Bonazzi & Bourne 2017 PLOS Biology 15(4) e2001818
27. Platforms – The situation today
Supplier / Consumer:
• Paper Author / Paper Reader
• Data Provider / Data Consumer
• Employer / Employee
• Reagent Provider / Reagent Consumer
• Software Provider / Software Consumer
• Grant Writer / Grant Reviewer
• Educator / Student
Platform:
• MS Project
• Google Drive
• Coursera
• Researchgate
• Academia.edu
• Open Science Framework
• Synapse
• F1000
• Rio
28. Commons Compliance
• Treat products of research – data,
methods, papers etc. as digital objects
• These digital objects exist in a shared
virtual space
• Digital object compliance through FAIR
principles:
• Findable
• Accessible (and usable)
• Interoperable
• Reusable
https://commonfund.nih.gov/bd2k/commons
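As a rough illustration of treating research products as digital objects and checking them against the FAIR principles, here is a minimal sketch. It assumes a very crude notion of compliance (a metadata field is simply present or absent); real FAIR assessment is considerably richer. The class, field, and function names are all hypothetical and are not part of the Commons.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DigitalObject:
    """A research product (data set, method, paper) treated as a digital object.

    Every field name below is a hypothetical stand-in for real metadata.
    """
    identifier: Optional[str] = None    # persistent ID, e.g. a DOI (Findable)
    access_url: Optional[str] = None    # where the object can be retrieved (Accessible)
    data_format: Optional[str] = None   # open, documented format (Interoperable)
    license: Optional[str] = None       # terms of reuse (Reusable)
    keywords: List[str] = field(default_factory=list)  # descriptive metadata (Findable)


def fair_gaps(obj: DigitalObject) -> Dict[str, List[str]]:
    """Return, per FAIR principle, the metadata fields that are still missing."""
    checks = {
        "Findable": {"identifier": obj.identifier, "keywords": obj.keywords},
        "Accessible": {"access_url": obj.access_url},
        "Interoperable": {"data_format": obj.data_format},
        "Reusable": {"license": obj.license},
    }
    gaps: Dict[str, List[str]] = {}
    for principle, fields in checks.items():
        # A field counts as missing when it is None or empty.
        missing = [name for name, value in fields.items() if not value]
        if missing:
            gaps[principle] = missing
    return gaps


example = DigitalObject(
    identifier="doi:10.0000/example",            # placeholder, not a real DOI
    access_url="https://example.org/dataset/1",  # placeholder URL
    keywords=["DIPG", "genomics"],
)
print(fair_gaps(example))
```

An object with no gaps returns an empty dictionary; the example above is reported as lacking a data format and a license.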
29. We Need Data and Knowledge About That Data to Interoperate
1. User clicks on content
2. Metadata and webservices to data provide an interactive view that can be annotated
3. Selecting features provides a data/knowledge mashup
4. Analysis leads to new content I can share
The Knowledge and Data Cycle
0. Full text of PLoS papers stored in a database
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB
PLoS Comp. Biol. 2005 1(3) e34
30. Incentives
• Airbnb
• Monetize unutilized
space
• Ease of use
• New vacation
experience
• Commons
• Need to improve rigor
and reproducibility
• Productivity
• Sustainability
• Education and training
• Opportunity to
undertake elastic
compute on large
complex data
https://commonfund.nih.gov/bd2k/commons
31. A Culture of Sharing
32. NIH Genomic Data Sharing (GDS)
Policy
• Purpose
• Sets forth expectations, responsibilities that ensure broad,
responsible sharing of genomic research data in a timely
manner
• Scope
• All NIH-funded research generating large-scale human or non-
human genomic data – and their use for subsequent research
• Data to be submitted to NIH-designated data repositories (e.g.,
dbGaP, GEO, GenBank, WormBase, FlyBase, Rat Genome
Database)
• Applies to all funding mechanisms (grants, contracts,
intramural support) with no minimum threshold for cost
• Released August 2014; effective January 25, 2015
gds.nih.gov
33. Data Sharing Goes Global: GA4GH
Global Alliance for Genomics and Health
• Accelerating the potential of genomic medicine to
advance human health, by:
• Establishing common framework of approaches to enable
effective, responsible sharing of genomic and clinical data
• Catalyzing data sharing projects that drive and
demonstrate value of data sharing
• Alliance*: >350 leading institutions (healthcare,
research, advocacy, life science, IT) representing 35
countries
• Working groups (Clinical, Data, Security, Regulatory &
Ethics) assess, prioritize needs
• Form task teams to produce tools, solutions,
demonstration projects
*Statistics as of October 5, 2015
34. A Culture of Sharing
35. Modernizing NIH Clinical Trials
Activities: The Need
• Publication of NIH-funded trials tracked over the 100 months after
completion
• Fewer than 50% were published within 30 months of completion
BMJ 2012;344:d7292
39. Acknowledgements
The BD2K Team at NIH
My New Colleagues at UVA
The 150 folks who have passed through my laboratory
https://docs.google.com/spreadsheets/d/1QZ48UaKcwDl_iFCvBmJsT03FK-bMchdfuIHe9Oxc-rw/edit#gid=0
Editor's Notes
One example: BrainBox, an Open Neuroimaging Laboratory that will enable collaboration around annotation, discovery and analysis of publicly available brain imaging data.
OpenAQ: A Global Community Building the First Open, Real-Time Air Quality Data Hub for the World
Real-Time Evolutionary Tracking for Pathogen Surveillance and Epidemiological Investigation
Open Neuroimaging Laboratory
OpenTrialsFDA
Fruit Fly Brain Observatory
MyGene2: Accelerating Gene Discovery with Radically Open Data Sharing
“As biology’s first large-scale project, the HGP paved the way for numerous consortium-based research ventures. The NHGRI alone has been involved in launching more than 25 such projects since 2000. These have presented new challenges to biomedical research — demanding, for instance, that diverse groups from different countries and disciplines come together to share and analyse vast data sets.”
“The HGP changed the norms around data sharing in biomedical research.”
Added (10/1/15): TCGA, dbGaP, GTR
Fig. a. Polymorphic variants within sampled populations. The area of each pie is proportional to the number of polymorphisms within a population. Pies are divided into four slices, representing variants private to a population (darker colour unique to population), private to a continental area (lighter colour shared across continental group), shared across continental areas (light grey), and shared across all continents (dark grey). Dashed lines indicate populations sampled outside of their ancestral continental region.
2013 White House Initiative: “Increasing Access to the Results of Federally Funded Scientific Research”
Updated to include numbers through September 2015.
From Dina Paltoo [10/6/15]: “The data in the first slide is for all of dbGaP 2007-2014. The information came from a version of what is on the GDS website (https://gds.nih.gov/19dataaccesscommitteereview_dbGaP.html) and in a Nature Genetics paper (http://www.nature.com/ng/journal/v46/n9/full/ng.3062.html), but results from information that we receive from NCBI.”
The NIH Public Access Policy implements Division F Section 217 of PL 111-8 (Omnibus Appropriations Act, 2009).
http://publicaccess.nih.gov/policy.htm
OSP’s summary:
The NIH Public Access Policy for publications has been a requirement for all recipients of NIH funds since 2008. It implements Division G, Title II, Section 218 of PL 110-161 (Consolidated Appropriations Act, 2008). The NIH Public Access Policy ensures that the public has access to the published results of NIH-funded research. It requires scientists to submit final peer-reviewed journal manuscripts that arise from NIH funds to the digital archive PubMed Central (PMC) upon acceptance for publication. Scientists can also deposit papers through partnerships NIH has established with publishers. To help advance science and improve human health, the Policy requires that NIH-supported papers be accessible to the public on PMC no later than 12 months after publication.
Updated by ADDS group 8/25/15
Figure 2. Cumulative percentage of studies published in a peer reviewed biomedical journal indexed by Medline during 100 months after trial completion among all NIH funded clinical trials registered within ClinicalTrials.gov
Public benefits to clinical trials data-sharing (OSP):
Inform future research and research funding decisions
Mitigate bias (e.g., non publication of results, especially negative results)
Prevent duplication of unsafe trials
Meet ethical obligation to human subjects (i.e., that results inform science)
Increase access to data about marketed products
All contribute to public trust in clinical research
Source: Ross JS, Tse T, Zarin DA, Xu H, Zhou L, Krumholz HM. Publication of NIH funded trials registered in ClinicalTrials.gov: cross-sectional analysis. BMJ 2012;344:d7292.