This deck gives an overview of the Data Biosphere Project: its goals, its architecture, and the three core projects that form its foundation. It also discusses data commons.
1. A Data Biosphere for Biomedical Research
Robert L. Grossman
University of Chicago &
Open Commons Consortium
AIRI IT Summit
Grand Rapids, Michigan
May 1, 2018
3. The challenge of big data in biomedical and behavioral research…
• The commoditization of sensors is creating explosive growth of data.
• It can take weeks to download large datasets, it is difficult to set up compliant computing infrastructure, and it can take months to integrate & format the data for analysis.
• There is not enough funding for every researcher to house all the data they need.
4. More challenges…
• Data produced by different groups using different methods is hard to integrate and compare.
• There are few good software platforms that researchers can use to share their large datasets.
• Most researchers don’t have the bioinformatics support to process all the data that could help their research.
5. IT infrastructure challenges
• Data size
• Security & compliance
• Limited funding
• Growing importance of open data, open reproducible science & data ecosystems
6. IT infrastructure challenges
Data commons co-locate data with cloud computing infrastructure and commonly used software services, tools & apps for managing, analyzing and sharing data to create an interoperable resource for the research community.*
*Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson and Walt Wells, A Case for Data Commons: Toward Data Science as a Service, Computing in Science & Engineering, 2016. Source of image: the CDIS, GDC, & OCC data commons infrastructure at a University of Chicago data center.
7. Data commons balance protecting human subject data with open research that benefits patients:
• Protect human subject data: “Research ethics committees (RECs) review the ethical acceptability of research involving human participants. Historically, the principal emphases of RECs have been to protect participants from physical harms and to provide assurance as to participants’ interests and welfare.”*
• The right of human subjects to benefit from research: “[The Framework] is guided by Article 27 of the 1948 Universal Declaration of Human Rights. Article 27 guarantees the rights of every individual in the world ‘to share in scientific advancement and its benefits’ (including to freely engage in responsible scientific inquiry)…”*
*GA4GH Framework for Responsible Sharing of Genomic and Health-Related Data, see goo.gl/CTavQR
Data sharing with protections provides the evidence so that patients can benefit from advances in research.
9. NCI Genomic Data Commons*
• The GDC makes over 2.5 PB of data available for access via an API, analysis by cloud resources on public clouds, and downloading.
• In a typical month, the GDC is used by over 20,000 unique users, and over 2 PB of data are accessed or downloaded.
• The GDC is based upon an open source software stack that can be used to build other data commons.
*See: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112.
The GDC consists of 1) a data exploration & visualization portal (DAVE), 2) a data submission portal, 3) a data analysis and harmonization system, and 4) an API so that third parties can build applications.
11. Systems 1 & 2: Data Portals to Explore and Submit Data
12. System 3: Data Harmonization System to Analyze All of the Submitted Data with Common Pipelines
• MuSE (MD Anderson)
• VarScan2 (Washington Univ.)
• SomaticSniper (Washington Univ.)
• MuTect2 (Broad Institute)
Source: Zhenyu Zhang, et al. and the GDC Project Team, Uniform Genomic Data Analysis in the NCI Genomic Data Commons, to appear.
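The four callers listed are the somatic variant pipelines the GDC runs during harmonization. One common downstream pattern (not described on the slide; shown here only as a hypothetical sketch with invented variant data) is to take an ensemble of the callers' outputs and keep variants reported by at least two pipelines:

```python
# Hypothetical sketch: combine somatic variant calls from several callers
# (e.g. MuSE, VarScan2, SomaticSniper, MuTect2) by majority consensus.
# Variants are keyed by (chromosome, position, ref, alt); a variant is kept
# if at least `min_callers` pipelines reported it. The calls below are
# illustrative, not real GDC data.
from collections import Counter

def consensus_calls(calls_by_caller, min_callers=2):
    """calls_by_caller: dict mapping caller name -> set of variant keys."""
    counts = Counter()
    for variants in calls_by_caller.values():
        counts.update(variants)
    return {v for v, n in counts.items() if n >= min_callers}

calls = {
    "MuSE":          {("chr17", 7577121, "G", "A"), ("chr12", 25398284, "C", "T")},
    "VarScan2":      {("chr17", 7577121, "G", "A")},
    "SomaticSniper": {("chr12", 25398284, "C", "T")},
    "MuTect2":       {("chr17", 7577121, "G", "A"), ("chr12", 25398284, "C", "T")},
}
consensus = consensus_calls(calls, min_callers=2)
print(sorted(consensus))
```

Running all submissions through the same set of pipelines is what makes results comparable across projects, which is the point of the harmonization system.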
13. System 4: An API to Support User-Defined Applications and Notebooks to Create a Data Ecosystem
https://api.gdc.cancer.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state
• The GDC has a REST API so that researchers can develop their own
applications.
• There are third party applications that use the REST API for Python, R,
Jupyter notebooks and Shiny.
• The REST API drives the GDC data portal, data submission system, etc.
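The example URL above can be reproduced with a few lines of standard-library Python. The sketch below builds the query string and parses a canned response; the response shape (a record nested under a `data` key) reflects the GDC API's envelope as best understood here, so treat the exact fields as an assumption rather than a guarantee:

```python
# Sketch of building the GDC REST API files query shown on the slide.
# The URL is constructed with the standard library; the response parsing
# runs on a canned example so the snippet is self-contained (no network).
import json
from urllib.parse import urlencode

GDC_API = "https://api.gdc.cancer.gov"

def files_url(file_uuid, fields):
    # Builds e.g. /files/<uuid>?fields=state
    return f"{GDC_API}/files/{file_uuid}?" + urlencode({"fields": ",".join(fields)})

url = files_url("5003adf1-1cfd-467d-8234-0d396422a4ee", ["state"])
print(url)

# Assumed envelope: the GDC wraps the record under a "data" key.
sample_response = json.loads('{"data": {"state": "released"}}')
print(sample_response["data"]["state"])
```

Because the portal and submission system are themselves built on this API, any third-party application has access to the same capabilities as the GDC's own tools.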
14. Benefits of Data Commons and Data Sharing (1 of 2)
1. The data is available to other researchers for discovery,
which moves the research field faster.
2. Data commons support repeatable, reproducible and open
research.
3. Research on some diseases depends on having a critical mass of data to provide the required statistical power for the scientific evidence (e.g. to study combinations of rare mutations in cancer).
4. With more data, smaller effects can be studied (e.g. to
understand the effect of environmental factors on disease).
Source: Robert L. Grossman, Supporting Open Data and Open Science With Data Commons: Some Suggested Guidelines for Funding Organizations, 2017, https://www.healthra.org/download-resource/?resource-url=/wp-content/uploads/2017/08/Data-Commons-Guidelines_Grossman_8_2017.pdf
15. Benefits of Data Commons and Data Sharing (2 of 2)
5. Data commons enable researchers to work with large
datasets at much lower cost to the funder than if each
researcher set up their own local environment.
6. Data commons generally provide higher security and greater
compliance than most local computing environments.
7. Data commons support large scale computation so that the
latest bioinformatics pipelines can be run.
8. Data commons can interoperate with each other so that over time data sharing can benefit from a “network effect.”
17. Authors:
- Josh Denny
- David Glazer
- Robert L Grossman
- Ben Paten
- Anthony Philippakis
Source: Josh Denny, David Glazer, Robert L. Grossman, Benedict Paten, Anthony Philippakis, A Data Biosphere for Biomedical
Research, Medium, Oct 16, 2017, medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-d212bbfae95d
18. Concepts:
- Datasets
- Software Components
- Data Environments
19. Principles:
- Modular
- Open
- Community-driven
- Standards-based
Driver Projects:
- All of Us
- Human Cell Atlas
- NCI Cloud Resources
23. [Architecture diagram: three data environments — the HCA, the GDC/CRDC, and AoU — each with its own Ingest, Store, and Explore components and differing implementations (e.g. Index-d datastore, CDR), connected through GA4GH standardized APIs (DOS, WES, TES) to analysis engines (Firecloud, AoU, NIH DC), workflow engines (Toil, Cromwell), methods repositories (Agora, Dockstore), and workspaces, with Data Biosphere best practices for IDs, metadata, AuthN & AuthZ.]
Adapted from slide by Anthony Philippakis
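As a rough illustration of why standardized APIs matter in this architecture: with a shared endpoint shape, one client works against any implementation. The sketch below resolves a data object against two invented base URLs using the GA4GH DOS v1 path convention; the path is an assumption based on the DOS schema, not something shown on the slide:

```python
# Hedged sketch: the value of a GA4GH standardized API such as DOS is that
# the same client code resolves data objects from different data
# environments (HCA, GDC, AoU), because each exposes the same endpoint
# shape. Base URLs here are hypothetical; the path follows the DOS v1
# convention as an assumption.
def data_object_url(base, object_id):
    return f"{base.rstrip('/')}/ga4gh/dos/v1/dataobjects/{object_id}"

for base in ("https://dos.example-hca.org", "https://dos.example-gdc.org"):
    print(data_object_url(base, "abc123"))
```

The differing back ends (Toil vs. Cromwell, Agora vs. Dockstore) stay interchangeable precisely because the API layer above them is shared.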
24. The Origins of the Data Biosphere
• Anthony, Ben & Bob met at a GA4GH meeting in Hinxton in May 2017
• Realized that this was a chance to drive interoperability.
• Goals of our collaboration are:
o Architect a federated data commons based on best practices, GA4GH
standards, and emerging standards, and see it reduced to practice.
o Nucleate an ecosystem of activity that goes beyond just our own
groups (“We are building a neighborhood, not a house.”)
o Bring interoperability among flagship NIH projects
Adapted from slide by Anthony Philippakis
25. 4. Getting Involved with the Data Biosphere Project*
*This section represents my personal views, and not necessarily the views of the Data Biosphere Project.
26. Activity 1: Contribute Applications and Tools to
Current & Emerging Data Biosphere Ecosystem(s)
Diagram: Robert L. Grossman, Progress Towards Cancer Data Ecosystems, The Cancer Journal: The Journal of Principles &
Practice of Oncology, 2018, to appear.
27. Activity 2: Participate in the GitHub Open Source
Software Community Building Data Biosphere Platforms
and Applications
29. Activity 4: Participate in the Open Commons Consortium
& Build Your Own Data Commons
www.occ-data.org
30. 5. Recommendations for Research Institutes*
*This section represents my personal views, and not necessarily the views of the Data Biosphere Project.
31. Rec. 1: Put a senior leader in charge of data and data strategy
for your institute (a chief data officer, chief analytics officer,
etc.) and develop and implement a data strategy.
Strategic planning is the continuous process of making
present entrepreneurial (risk-taking) decisions
systematically and with the greatest knowledge of their
futurity; organizing systematically the efforts needed to
carry out these decisions; and measuring the results of
these decisions against the expectations through
organized, systematic feedback.
Peter Drucker, Management Tasks and Responsibilities, Harper and Row, 1974
32. Rec 2: Establish internal best practices for data.
Examples
• Support data tiers
o Data catalog
o Data lake
o Data commons
• Practice data portability
• etc.
[Diagram: data tiers — data catalog, data lake, data commons — arranged by number vs. size & complexity.]
35. Summary
1. Data commons co-locate data with cloud computing infrastructure and
commonly used software services, tools & apps for managing,
analyzing and sharing data to create an interoperable resource for the
research community.
2. The Data Biosphere Project is developing open, modular, community-driven and standards-based data environments.
3. The Data Biosphere Project is working to develop open common APIs across the NCI GDC / Cancer Research Data Commons, the NIH All of Us Project, and the CZI Human Cell Atlas Project.
4. Contact us if you are interested in getting involved in the Data
Biosphere Project.
38. For more information:
• To learn more about the Data Biosphere: Josh Denny, David Glazer, Robert L. Grossman, Benedict Paten, Anthony Philippakis, A Data Biosphere for Biomedical Research, https://medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-d212bbfae95d
• To learn more about data commons: Robert L. Grossman, et al., A Case for Data Commons: Toward Data Science as a Service, Computing in Science & Engineering 18.5 (2016): 10-20. Also https://arxiv.org/abs/1604.02608
• To learn more about large scale, secure, compliant cloud-based computing environments for biomedical data, see: Heath, Allison P., et al. "Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets." Journal of the American Medical Informatics Association 21.6 (2014): 969-975. This article describes Bionimbus Gen1.
• To learn more about the NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112. The GDC was developed using Bionimbus Gen2.
• To learn about the GDC / Gen3 API: Shane Wilson, Michael Fitzsimons, Martin Ferguson, Allison Heath, Mark Jensen, Josh Miller, Mark W. Murphy, James Porter, Himanso Sahni, Louis Staudt, Yajing Tang, Zhining Wang, Christine Yu, Junjun Zhang, Vincent Ferretti and Robert L. Grossman, Developing Cancer Informatics Applications and Tools Using the NCI Genomic Data Commons API, Cancer Research, volume 77, number 21, 2017, pages e15-e18.
39. Abstract
A Data Biosphere for Biomedical Research
As datasets grow in scale, the practice of downloading data is becoming
impractical in terms of cost (storing multiple copies of large datasets is
wasteful), accessibility (few researchers have the necessary
computational infrastructure) and security (many research laboratories
lack state-of-the-art security and access control). We propose the idea
of creating a vibrant ecosystem, which we call the “Data Biosphere.”
Building a Data Biosphere to propel progress in biomedicine will require
a community working together, including laboratory groups generating
data, software developers creating Biosphere Components, and
technical teams assembling and operating Data Environments.