A Gen3 Perspective of Disparate Data

This is a talk that I gave at BioIT World West on March 12, 2019. The talk was called: A Gen3 Perspective of Disparate Data: From Pipelines in Data Commons to AI in Data Ecosystems.

  1. A Gen3 Perspective of Disparate Data: From Pipelines in Data Commons to AI in Data Ecosystems. Robert L. Grossman, Center for Translational Data Science, University of Chicago. March 12, 2019, Molecular Tri-Conference, San Francisco.
  2. 1. Disparate Data in a World of Data Commons and Data Ecosystems
  3. Data Clouds (2010-2025) serve projects, Data Commons (2015-2030) serve communities, and Data Ecosystems (2018-2030) serve multiple communities. Data clouds: • data objects in clouds • execute bioinformatics pipelines using workflow languages and Docker repositories. Data commons: • expose APIs for access to object and structured data • expose data models • harmonize data within a commons. Data ecosystems: • build an ecosystem of apps across commons & resources • harmonize data across commons • support ML/AI across commons & resources. Today, I'll talk about the transition from data commons to data ecosystems.
  4. [Architecture diagram] Genomic data, imaging data, proteomics data, etc. are stored as data objects in cloud storage (AWS, GCP), and clinical data are stored as structured data in databases (a clinical research data warehouse, including private data). Genomics clouds run pipelines (Dockstore) for genomic analysis, and the platform supports data curation & management, data exploration, and data analysis, leading to research discoveries. Today, I'll talk about supporting both data objects & structured data in data commons & ecosystems.
  5. 2. Building Gen3 Data Commons over the Data Commons Framework Services
  6. Narrow Middle Design (aka the End-to-End Design Principle). [Diagram: bioinformaticians curating and submitting data on one side, researchers analyzing data and making discoveries on the other, with data clouds, the data commons, container-based workspaces, ML/AI apps, and notebooks in between.] Compare: Saltzer, J.H., Reed, D.P. and Clark, D.D., 1984. End-to-end arguments in system design. ACM Transactions on Computer Systems, 2(4), pp. 277-288.
  7. [Architecture diagram, as on slide 4] The same architecture framed as a data commons / data ecosystem built over DCFS standards: genomic, imaging, and other data as data objects in cloud storage (AWS, GCP), clinical data as structured data in databases (a clinical research data warehouse, including private data), genomics clouds running pipelines (Dockstore) for genomic analysis, and data curation & management, data exploration, and data analysis leading to research discoveries.
  8. We have updated Gen3.org
  9. Video 1
  10. A Gen3 Data Commons Platform in Six Steps: 1. Define a data model. 2. Use the Gen3 software to auto-generate the data commons and associated API. 3. Import data into the commons using the Gen3 import application. 4. Use Gen3 to explore your data and create synthetic cohorts. 5. Use platforms such as Terra, Seven Bridges, Galaxy, etc. to analyze the synthetic cohorts. 6. Develop your own container-based workflows, applications and Jupyter notebooks.
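As a rough illustration of steps 2-4 above, here is a minimal sketch using the Gen3 Python SDK (pip install gen3) against a hypothetical commons. The commons URL, program/project names, and the record fields are invented for illustration, and the SDK calls shown (Gen3Auth, Gen3Submission.submit_record, Gen3Submission.query) are assumptions that should be checked against the SDK documentation for your version.

```python
# Sketch: steps 2-4 with the Gen3 Python SDK (endpoint and names are
# hypothetical; check the SDK docs for the exact signatures you have).
from gen3.auth import Gen3Auth
from gen3.submission import Gen3Submission

COMMONS = "https://my-commons.example.org"   # hypothetical commons URL

# Authenticate with an API key ("credentials.json") downloaded from the
# commons portal, then talk to the auto-generated submission API.
auth = Gen3Auth(COMMONS, refresh_file="credentials.json")
sub = Gen3Submission(COMMONS, auth)

# Step 3: import a record that conforms to a node in the data model.
case_record = {
    "type": "case",                  # node type defined in the data model
    "submitter_id": "case-0001",
    "projects": {"code": "P0"},      # link to the parent project node
}
sub.submit_record("myprogram", "P0", case_record)

# Step 4: explore the structured data through the auto-generated GraphQL
# API to start assembling a synthetic cohort.
cohort = sub.query('{ case(project_id: "myprogram-P0") { submitter_id } }')
print(cohort)
```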
  11. Will be released in 2Q19 (Selected)
  12. [Diagram: many commons, each with its own data model (Data Model 1 through Data Model 9)] 1. What are the minimum data access services for object and structured data? 2. What are the minimum data model services? 3. What are the minimum services for identity and access management to support a passport-type system? Will be released in 2Q19.
  13. 3. Setting Up and Operating a Data Commons or Data Ecosystem
  14. 1. The Data Commons Framework Services (DCFS) is a set of software services for setting up and operating a data commons and cloud-based resources. 2. The DCF is designed to support multiple data commons, knowledge bases, and applications as part of a data ecosystem. 3. It is used to help operate the NCI Cancer Research Data Commons (CRDC), NHLBI DataSTAGE, NHGRI AnVIL, and the NIAID Data Hub pilot. 4. The implementation is based on the open source Gen3 software platform. [Diagram: bioinformaticians curating and submitting data, researchers analyzing data and making discoveries, with data clouds, the data commons, container-based workspaces, ML/AI apps, and notebooks in between.]
  15. [Diagram: many commons, each with its own data model (Data Model 1 through Data Model 9)] 1. Data commons and resources expose APIs for access to data and resources. 2. Data commons expose their data models through APIs. 3. Data models include references to third-party ontologies and other authorities. 4. Authentication and authorization systems can interoperate. 5. Structured data can be serialized, versioned, exported, processed and imported (will be released in 2Q19).
  16. Video 2
  17. A Gen3 Data Commons Platform in Six Steps: 1. Define a data model. 2. Use the Gen3 software to auto-generate the data commons and associated API. 3. Import data into the commons using the Gen3 import application. 4. Use Gen3 to explore your data and create synthetic cohorts. 5. Use platforms such as Terra, Seven Bridges, Galaxy, etc. to analyze the synthetic cohorts. 6. Develop your own container-based workflows, applications and Jupyter notebooks. In addition: 1. Build your data commons over the hosted Data Commons Framework Services. 2. Interoperate your data commons with other DCFS-compliant data commons.
  18. Data Commons Framework Services (DCFS) Roadmap. 2019: • DCFS services hosted by the University of Chicago using a Common Services Operations Center (CSOC) • You can build your own data commons over the hosted DCFS • Six production data commons will be working with GA4GH to standardize DCFS. 2020: • Third parties can build data commons by standing up an entire stack, including their own DCFS • You can build your own data commons using DCFS hosted by the UChicago CSOC • We expect a third party to host DCFS and support data commons over it • CSOCs can interoperate • First draft of the GA4GH standard. Gen3.org, dcf.gen3.org
  19. 4. Managing Structured Data in Data Commons and Data Ecosystems
  20. Linking Structured Clinical Data with Genomic Data. Object data: CRAM/BAM genomic data files, DICOM image files, anything stored in cloud object storage systems (AWS S3, GCP GCS); data objects are stored with GUIDs in one or more clouds. Clinical data / graph data / core data / structured data: data that are harmonized to a data model and searchable using the data model and related APIs; Gen3 uses a graph data model as the logical model and PostgreSQL as the database; clinical data and other structured data are stored in a database. Data objects and clinical data are linked in the data model.
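A minimal sketch of the linkage described on the slide above: a structured clinical record carries the GUID of a data object, and that GUID resolves to one or more cloud storage URLs through the commons' index service. The commons URL and GUID are hypothetical, and the /index/{GUID} path and "urls" response field are assumptions about how an IndexD-style service is exposed; a given commons may instead (or also) expose a GA4GH DRS endpoint for the same purpose.

```python
# Sketch: resolving a GUID carried by a clinical record to cloud URLs.
# The endpoint path and response fields are assumptions about an
# IndexD-style index service; verify against your commons.
import requests

COMMONS = "https://my-commons.example.org"   # hypothetical commons URL

# A structured record (stored in PostgreSQL behind the graph data model)
# references the genomic file it is linked to by GUID rather than by a
# cloud-specific path.
clinical_record = {
    "type": "submitted_aligned_reads",
    "submitter_id": "case-0001-wgs",
    "object_id": "dg.XXXX/0000-0000-0000-0000",   # hypothetical GUID
}

# Resolving the GUID returns object metadata, including one or more
# storage URLs (e.g. s3:// or gs://), so the record stays valid no
# matter which cloud actually holds the bytes.
resp = requests.get(f"{COMMONS}/index/{clinical_record['object_id']}")
resp.raise_for_status()
print(resp.json().get("urls", []))
```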
  21. …, but what do we do for structured data? • Within a data commons, we can use ETL tools, databases, NoSQL databases, data warehouses, etc. • But what if we have 25 data commons that want to interoperate? [Diagram: many commons, each with its own data model (Data Model 1 through Data Model 9); will be released in 2Q19.]
  22. Requirement | Approach | Gen3 Services: 1. Make the data FAIR | Data objects are assigned GUIDs & metadata and placed in multiple clouds | IndexD, Fence, and metadata services via Sheepdog and Peregrine (also part of DCF services). 2. Express the pipelines in a workflow language and make them FAIR | We support the Common Workflow Language | We support Dockstore, CWL & cwltool, use object services to manage CWL files, and soon Cromwell. 3. Encapsulate the code and tools | We encapsulate code in virtual machines & containers | We use Kubernetes, Docker, Dockstore and WES. 4. Link data and code | Use notebooks | We support Jupyter notebooks and JupyterHub. 5. Make structured data portable | ??? | ???
  23. 5. Portable Formats for Biomedical Data
  24. Life Cycle of Clinical Data (Structured Data): initial data model → harmonized data model (wrt ontology, NCIt, etc.); initial upload, small changes to the schema; new data requiring an updated data model; data used by another project, requiring a new data model; a subset of data extracted from the main system as a synthetic cohort and imported into an analysis system; 2nd, 3rd, etc. data releases, with continuous creation of synthetic cohorts; 4th, 5th data releases, with a new data model; platform refreshed & data and metadata migrated. (Blue: schema change. Green: data change. Red: platform change.)
  25. What is the Portable Format for Biomedical Data (PFB)? ● PFB is an Avro-based serialization format with a specific schema to import, export and evolve biomedical data. ● PFB specifies metadata and data in one file; the metadata includes the data dictionary, ontology references & relations between nodes. ● PFB is: ○ Portable: supporting import & export. ○ Extensible: data model changes, versioning, backward and forward compatibility. ○ Efficient: a binary format.
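The sketch below is a generic illustration of the Avro property that PFB builds on, not the actual PFB schema: the schema travels in the same file as the data, so a receiving system can read both without prior knowledge of the data model. It uses the fastavro library (pip install fastavro), and the node name and fields are hypothetical.

```python
# Generic Avro illustration (not the real PFB schema): one file carries
# both the schema and the records, in a compact binary encoding.
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "name": "case",
    "type": "record",
    "fields": [
        {"name": "submitter_id", "type": "string"},
        {"name": "primary_site", "type": ["null", "string"], "default": None},
    ],
})

records = [
    {"submitter_id": "case-0001", "primary_site": "Lung"},
    {"submitter_id": "case-0002", "primary_site": None},
]

# Export: schema + data serialized into a single compressed binary file.
with open("cases.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

# Import: the reader recovers the schema from the file itself.
with open("cases.avro", "rb") as inp:
    avro_reader = reader(inp)
    print(avro_reader.writer_schema["name"])   # the embedded schema
    for rec in avro_reader:
        print(rec)
```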
  26. Why Avro? (Avro vs. Protobuf) Self-describing: Avro ✓, Protobuf ✗. Schema evolution: Avro ✓, Protobuf ✓. Dynamic schema: Avro ✓, Protobuf partially (needs recompilation). No need to compile: Avro ✓, Protobuf ✗. Hadoop support: Avro ✓ (built-in), Protobuf ✓ (third-party libraries). JSON schema: Avro ✓, Protobuf ✗ (special IDL for schema).
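To illustrate the schema-evolution row of the comparison above, this sketch reads the cases.avro file written in the previous sketch with a newer reader schema that adds a field with a default, which is how Avro (and hence a format like PFB) can keep older data loadable as the data model changes. The added field is hypothetical.

```python
# Sketch of Avro schema evolution: records written under an older schema
# are resolved against a newer reader schema that adds a defaulted field.
from fastavro import parse_schema, reader

new_schema = parse_schema({
    "name": "case",
    "type": "record",
    "fields": [
        {"name": "submitter_id", "type": "string"},
        {"name": "primary_site", "type": ["null", "string"], "default": None},
        # Field added in a later version of the data model (hypothetical).
        {"name": "consent_code", "type": ["null", "string"], "default": None},
    ],
})

with open("cases.avro", "rb") as inp:
    for rec in reader(inp, reader_schema=new_schema):
        # Old records load cleanly; the new field falls back to its default.
        print(rec["submitter_id"], rec["consent_code"])
```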
  27. PFB Performance (preliminary results). KidsFirst dictionary (JSON): 0.21 MB; PostgreSQL database: 277 MB. Export of structured data: PostgreSQL → PFB takes 25 seconds; schema-only PFB: 0.08 MB; schema + data PFB: 38 MB; with compression: 9.7 MB (29 times smaller in size). Import of structured data: JSON load time: 10 minutes; PFB → PostgreSQL load time: 1 minute.
  28. PFB simplifies the management of structured data in data ecosystems. • PFB is much smaller and much faster for bulk import and export. • PFB files contain data models and pointers to third-party ontologies and authorities. • PFB files can be versioned, managed as data objects in clouds, and accessed via FAIR services. Timings by number of nodes in the data model (Sheepdog / PFB import / PFB export, in seconds): 10 nodes: 14.75 / 3.25 / 3.25; 100 nodes: 121.25 / 3.5 / 5.5; 1,000 nodes: 1209.75 / 13 / 11; 10,000 nodes: 13349.25 / 92 / 69.75.
  29. Portable Format for Biomedical Data (PFB). • PFB is an application-independent and system-independent serialization format for importing and exporting: 1) schema and other metadata, 2) pointers to third-party ontologies and authorities, and 3) data. • PFB services can export to JSON. [Diagram: Application or commons 1 → PFB file → Application or commons 2 (which can be the same app or commons 1); applications or services can process the PFB file; the PFB file can be managed as a data object with FAIR services.]
  30. Requirement | Approach | Gen3 Services: 1. Make the data FAIR | Data objects are assigned GUIDs & metadata and placed in multiple clouds | IndexD, Fence, and metadata services via Sheepdog and Peregrine (also part of DCF services). 2. Express the pipelines in a workflow language and make them FAIR | We support the Common Workflow Language | We support Dockstore, CWL & cwltool, use object services to manage CWL files, and soon Cromwell. 3. Encapsulate the code and tools | We encapsulate code in virtual machines & containers | We use Kubernetes, Docker, Dockstore and WES. 4. Link data and code | Use notebooks | We support Jupyter notebooks and JupyterHub. 5. Make structured data portable | Make the data self-describing | Import & export PFB.
  31. For more information: • Review: Robert L. Grossman, Data Lakes, Clouds and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data, Trends in Genetics 35 (2019), pp. 223-234, https://doi.org/10.1016/j.tig.2018.12.006. See also https://arxiv.org/abs/1809.01699 • To learn about data ecosystems: Robert L. Grossman, Progress Towards Cancer Data Ecosystems, The Cancer Journal: The Journal of Principles and Practice of Oncology, May/June 2018, Volume 24, Number 3, pages 122-126, doi: 10.1097/PPO.0000000000000318. • To learn more about data commons: Robert L. Grossman, et al., A Case for Data Commons: Toward Data Science as a Service, Computing in Science & Engineering 18.5 (2016): 10-20. Also https://arxiv.org/abs/1604.02608 • To learn more about the NCI Genomic Data Commons: Grossman, Robert L., et al., "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112. The GDC was developed using Bionimbus Gen2. • To learn more about BloodPAC: Grossman, R. L., et al., "Collaborating to compete: Blood Profiling Atlas in Cancer (BloodPAC) Consortium." Clinical Pharmacology & Therapeutics (2017). BloodPAC was developed using the GDC Community Edition (CE), aka Bionimbus Gen3. • To learn more about large-scale, secure, compliant, cloud-based computing environments for biomedical data, see: Heath, Allison P., et al., "Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets." Journal of the American Medical Informatics Association 21.6 (2014): 969-975. This article describes Bionimbus Gen1.
  32. @BobGrossman
