Managing the genomics data deluge
at the DOE Joint Genome Institute
Kjiersten Fagnan
CIO, JGI
The DOE Joint Genome Institute at a glance
JGI MISSION:
To provide the global research community
with free access to the most advanced
integrative genome science capabilities in
support of the DOE energy &
environmental research mission
Integrative Genomics Building
(IGB)
U.S. Department of Energy Office of Science User Facility
● JGI established in 1997, User facility from 2004
● Located at Lawrence Berkeley National Laboratory
● ~285 staff; ~$80M annual funding
● 2,038 Global Primary Users in FY20; >10,000 Data
Users
JGI History
Environmental genomics will enable the Bioeconomy
[Figure: Genetic “Circuit” linking DNA, Gene, Enzyme, and Microbial Factory; chemical labels for ammonium (NH4) and carbonate (CO3 2-)]
FY 2020 Users: 2,038 Worldwide
Users on the Map: 2,038
Sector                      Users   Share
Academic                    1,504   74%
Government                    183    9%
DOE (national labs only)      161    8%
Industry                       29    1%
Other                         161    8%
Projects Completed/Scientific Publications
[Charts: Cumulative Number of Projects Completed; Cumulative Number of Scientific Publications]
Sequence Output
[Charts: Massively Parallel Short Read Sequencing, Basepairs (GB); Single Molecule Long Read Sequencing, Basepairs (GB)]
DOE Office of Science Public Reusable
Research Data (PuRe Data)
https://science.osti.gov/Initiatives/PuRe-Data/Resources-at-a-Glance
Deluge of Large, Complex Data Sets
JGI manages a 10+ PB data repository
Mega – Giga – Tera – Peta – Exa – Zetta – Yotta
5/19/2021 https://www.theatlantic.com/technology/archive/2011/05/infographic-how-big-is-a-yottabyte/239034/
The cost to store 1 Yottabyte of data - $100 trillion*
This is just genomics data… we also want
metabolomes, transcriptomes, proteomes, image
data
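The prefix ladder and the yottabyte cost above can be turned into a quick back-of-the-envelope calculation. The sketch below assumes decimal (SI) prefixes and assumes the quoted $100 trillion scales linearly with capacity; neither assumption comes from the slides, and the function names are illustrative only.

```python
# SI decimal prefixes from the slide, expressed as powers of ten (bytes).
PREFIX_EXPONENT = {
    "mega": 6, "giga": 9, "tera": 12, "peta": 15,
    "exa": 18, "zetta": 21, "yotta": 24,
}

# Slide figure: $100 trillion to store one yottabyte.
YOTTABYTE_COST_USD = 100 * 10**12


def to_bytes(value: int, prefix: str) -> int:
    """Convert a value with an SI prefix into bytes."""
    return value * 10 ** PREFIX_EXPONENT[prefix]


def implied_cost_usd(value: int, prefix: str) -> float:
    """Scale the per-yottabyte cost linearly (an assumption, not a quote)."""
    return YOTTABYTE_COST_USD * to_bytes(value, prefix) / to_bytes(1, "yotta")


if __name__ == "__main__":
    print(f"1 TB at that rate:  ${implied_cost_usd(1, 'tera'):,.0f}")
    print(f"10 PB at that rate: ${implied_cost_usd(10, 'peta'):,.0f}")
```

Under these assumptions the quoted figure works out to about $100 per terabyte, so a 10 PB repository would sit at roughly $1 million of raw storage.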
The Immense Scale of Omics Data
Advances in sequencing and omics technologies have far outpaced data infrastructure
How do we remove the barriers to
data access and analysis at scale?
Data Management is Critical
[Diagram labels: PMO; SDM; QAQC / RQC; GAAG, Plant, MEP, RnD, Fungal; Genome Portal; IMG, MGM; External Collaborators; Web Services (Mycocosm, Phytozome, IMGM/ER)]
In 2013, JGI deployed a hierarchical data management
system to deal with the exponential growth in sequence
data and analysis products
JGI Archive and Metadata Organizer (JAMO)
[Diagram labels: GAAG, Plant, MEP, RnD, Fungal, IMG, MGM; SDM; QAQC / RQC; Web Services (Mycocosm, Phytozome, IMGM/ER); Genome Portal; External Collaborators; PMO]
JAMO’s Back-end Infrastructure
JAMO Enabled Increased Automation Between Groups
• JGI’s core pipelines connect with JAMO and provide metadata through
templates
• Once data is available for processing, the workflows are triggered
automatically
• Data that fails QC is flagged for review
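The three bullets above describe a register-then-trigger pattern. The following is a minimal sketch of that pattern only; it is not JAMO's API. The template fields, the `Registry` class, and its method names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical template: the metadata keys (and types) a pipeline must supply.
RAW_READS_TEMPLATE = {"sample_id": str, "platform": str, "passed_qc": bool}


@dataclass
class Registry:
    """In-memory stand-in for a metadata store (illustration only)."""
    records: list = field(default_factory=list)
    triggered: list = field(default_factory=list)     # files handed to workflows
    review_queue: list = field(default_factory=list)  # files flagged for review

    def register(self, path: str, metadata: dict, template=RAW_READS_TEMPLATE):
        # Reject submissions that do not match the template.
        missing = [k for k in template if k not in metadata]
        wrong = [k for k, t in template.items()
                 if k in metadata and not isinstance(metadata[k], t)]
        if missing or wrong:
            raise ValueError(f"missing={missing} wrong_type={wrong}")

        record = {"path": path, **metadata}
        self.records.append(record)

        # Data that passes QC triggers processing; failures go to review.
        if metadata["passed_qc"]:
            self.triggered.append(path)
        else:
            self.review_queue.append(path)
        return record


if __name__ == "__main__":
    reg = Registry()
    reg.register("run1.fastq", {"sample_id": "S1", "platform": "short-read", "passed_qc": True})
    reg.register("run2.fastq", {"sample_id": "S2", "platform": "long-read", "passed_qc": False})
    print(reg.triggered, reg.review_queue)
```

Validating against a template at registration time is what lets downstream automation trust the metadata without group-specific checks.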
JAMO is the Backbone of JGI’s Data Portal
All the metadata used to populate the Data Portal
comes from JAMO’s MongoDB database
Code for America Summit Talk on JGI’s New Data Portal
Aligning Data Across Siloed Departments
Many government sectors have been collecting data digitally for decades, often
in uncoordinated ways. In this talk we’ll explore how Truss and Joint Genome
Institute partnered to break down data silos and start conversations across
departments to align metadata across the organization. From establishing
baseline agreements, to finding common outcomes everyone could agree upon,
to bringing old data sets into the present, this talk will provide useful tools for
practitioners facing challenges of data misalignment across multiple
departments.
The talk is on Thursday, 2:00-3:00 pm PST
https://summit.codeforamerica.org/agenda/
Improving Search Across JGI
Metadata in one place makes search across all JGI programs possible
[Diagram labels: User Query; JGI-KBase RESTful Service; JGI Data and Metadata system including LIMS, GOLD, sequence, assemblies, annotations; Metadata and file types; Response; Data sets]
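To illustrate why keeping metadata in one place makes cross-program search straightforward, here is a small in-memory sketch. It is not the JGI-KBase service; the record fields, dataset names, and function names are made up for the example.

```python
# Made-up records standing in for a central metadata store.
RECORDS = [
    {"program": "Fungal", "dataset": "dataset-001", "file_type": "fasta"},
    {"program": "Plant", "dataset": "dataset-002", "file_type": "fastq"},
    {"program": "Fungal", "dataset": "dataset-003", "file_type": "gff"},
]


def search(records, **criteria):
    """Return records whose fields equal every given criterion."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]


def summarize(hits):
    """Shape a response: matching data sets plus the file types available."""
    return {
        "datasets": [h["dataset"] for h in hits],
        "file_types": sorted({h["file_type"] for h in hits}),
    }


if __name__ == "__main__":
    print(summarize(search(RECORDS, program="Fungal")))
```

Because every program's records share one schema and one store, a single query function serves all of them.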
Most of JGI’s Infrastructure is @NERSC
Berkeley Lab is on a Major Fault Line
NERSC is
here!
Most samples used to generate data at JGI
are unique and irreplaceable
Backing up Irreplaceable Data
• Moved 1 PB of data to ORNL for safe-keeping
• Data migration completed in 5 days using Globus
• Enables access to the data – but only useful with the right metadata
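For a sense of scale, the figures above (1 PB in 5 days) imply a sustained transfer rate that can be worked out directly. This sketch assumes a decimal petabyte (10^15 bytes) and continuous transfer over the full five days; the slides state neither.

```python
def sustained_rate_bytes_per_s(bytes_moved: int, days: float) -> float:
    """Average rate needed to move `bytes_moved` in `days` of continuous transfer."""
    seconds = days * 24 * 60 * 60
    return bytes_moved / seconds


if __name__ == "__main__":
    rate = sustained_rate_bytes_per_s(10**15, 5)  # 1 PB (decimal) in 5 days
    print(f"{rate / 1e9:.2f} GB/s, {rate * 8 / 1e9:.1f} Gbit/s")
    # prints "2.31 GB/s, 18.5 Gbit/s"
```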
[Diagram labels: Main JGI Data Repository; API; HPSS Archive; JAMO light; DTN; DTN; SUMMIT; API]
What can you do with all that data and a supercomputer?
A Gordon Bell Prize (Supercomputing) winner in 2018 used all the well-
characterized publicly available data to look at genetic underpinnings of
opioid addiction.
Wayne Joubert, et al. 2018. Attacking the opioid epidemic: determining the epistatic and pleiotropic genetic architectures
for chronic pain and opioid addiction. In Proceedings of the International Conference for High Performance Computing,
Networking, Storage, and Analysis (SC ’18). IEEE Press, Article 57, 1–14.
Access to large amounts of ‘omics data
enables scientists to explore a broad range of
hypotheses!
CA has Earthquakes and Fires!
We need to distribute Data and Analysis to
maintain scientific productivity
JGI’s Centralized Workflow System
● JGI Analysis Workflow Service (JAWS)
● Need to be able to compute at multiple centers: NERSC, LBL IT, others
● Need to have more readily reusable and modifiable bioinformatics
pipelines
● Need workflows to support FAIR* guidelines
● Objective: Portable, Reusable, Traceable workflows on a Robust platform
*Findable, Accessible, Interoperable, Reusable
Distributed Computing is Hard
• Managing multiple user accounts
• Different facilities have different policies
– Batch schedulers
– File system availability and data retention
• Different architectures
– CPU vs GPU
– Local disk vs parallel file systems
– Memory size and footprint
• Portability is a lot of work
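One way to make the differences listed above explicit is to describe each facility as data and match tasks against it. The sketch below is illustrative only: the site names, numbers, and field names are hypothetical and do not describe any real facility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    scheduler: str               # batch scheduler used at the site
    has_gpu: bool
    node_memory_gb: int
    scratch_retention_days: int  # how long files survive on scratch storage


@dataclass(frozen=True)
class Task:
    name: str
    needs_gpu: bool
    memory_gb: int
    days_on_disk: int


def eligible_sites(task: Task, sites: list) -> list:
    """Names of sites whose hardware and retention policy fit the task."""
    return [
        s.name for s in sites
        if (s.has_gpu or not task.needs_gpu)
        and s.node_memory_gb >= task.memory_gb
        and s.scratch_retention_days >= task.days_on_disk
    ]


# Hypothetical site profiles.
SITES = [
    Site("site_a", "slurm", has_gpu=False, node_memory_gb=128, scratch_retention_days=56),
    Site("site_b", "lsf", has_gpu=True, node_memory_gb=512, scratch_retention_days=90),
]

if __name__ == "__main__":
    print(eligible_sites(Task("assembly", needs_gpu=False, memory_gb=256, days_on_disk=30), SITES))
```

Keeping policies in a profile like this means a workflow service can pick a site without hard-coding facility details into each pipeline.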
JGI is Running Analyses Across the West Coast
[Diagram labels: JGI Centralized Workflow System; Workflow Description Language; Cromwell Workflow Manager; Common interface to access resources; initial testing; future: Additional resources (cloud, ORNL, ANL, etc)]
JGI is Running Analyses Across the West Coast
[Diagram labels: JGI Centralized Workflow System; Workflow Description Language]
1. Find the data for analysis in the data management system
2. Authenticate with Globus and transfer the data to the remote computing resource
3. Work is executed, results are generated
4. Transfer data back to the home repository with Globus
5. Register the data and metadata with JAMO
Application tokens are accepted by the
facilities we are using, making it possible to
transfer data on behalf of the user
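The five steps can be sketched as a single orchestration function. This is an illustration of the sequence only: it does not use the Globus SDK, JAWS, or JAMO, and every name below (`Services`, `run_remote_analysis`, the stub functions, the token string) is hypothetical. Real transfers and registration would replace the stubs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Services:
    """Injected callables standing in for the real systems."""
    find_data: Callable[[str], list]                # step 1: data management lookup
    transfer: Callable[[list, str, str, str], list]  # steps 2 and 4: move files
    execute: Callable[[list], list]                 # step 3: run the analysis
    register: Callable[[list, dict], None]          # step 5: record outputs


def run_remote_analysis(svc: Services, query: str, home: str, remote: str, token: str) -> list:
    inputs = svc.find_data(query)                           # 1. find the data
    staged = svc.transfer(inputs, home, remote, token)      # 2. send to remote resource
    results = svc.execute(staged)                           # 3. execute the work
    returned = svc.transfer(results, remote, home, token)   # 4. bring results home
    svc.register(returned, {"query": query, "site": remote})  # 5. register data + metadata
    return returned


def demo():
    """Run the flow with stubs that only log what would happen."""
    log = []

    def find_data(query):
        log.append("find")
        return [f"{query}.fastq"]

    def transfer(paths, src, dst, token):
        log.append(f"transfer {src}->{dst}")
        return [f"{dst}:{p.split(':')[-1]}" for p in paths]

    def execute(paths):
        log.append("execute")
        return [p + ".out" for p in paths]

    def register(paths, metadata):
        log.append("register")

    svc = Services(find_data, transfer, execute, register)
    returned = run_remote_analysis(svc, "sample42", "home", "remote", token="app-token")
    return log, returned


if __name__ == "__main__":
    print(demo())
```

Passing the token through every transfer call mirrors the point above: the application, not the user, drives data movement once the facilities accept its token.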
Data Movement Between Resources – Globus!
• JGI has been using Globus since ~2012 to move data around
–One time we broke the service by trying to move millions of tiny files that
were all in the same directory :D
• Globus enables JGI collaborators to download large amounts of data
–Biggest customers are the Bioenergy Research Centers – DOE funded
facilities investigating biofuels
–Some JGI Users are still willing to wait 9+ days for a
download to complete via the browser – education opportunity!
• Globus is an integral part of JAWS
–Enables the application to move data between computing
resources on behalf of the user
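The "millions of tiny files in one directory" anecdote points at a common mitigation: bundle small files into a handful of archives before transferring them. The slides do not say this is what JGI did; the sketch below is one generic approach using Python's standard `tarfile` module, with a hypothetical function name and batch size.

```python
import tarfile
from pathlib import Path


def bundle_small_files(src_dir, out_dir, files_per_archive=1000):
    """Pack the files in src_dir into tar archives of at most N files each."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Sort for a deterministic split across archives; skip subdirectories.
    files = sorted(p for p in src.iterdir() if p.is_file())

    archives = []
    for start in range(0, len(files), files_per_archive):
        archive = out / f"bundle_{start // files_per_archive:05d}.tar"
        with tarfile.open(archive, "w") as tar:
            for path in files[start:start + files_per_archive]:
                tar.add(path, arcname=path.name)
        archives.append(archive)
    return archives


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "src"
        src.mkdir()
        for i in range(25):
            (src / f"tiny_{i:03d}.txt").write_text("x")
        print(len(bundle_small_files(src, Path(tmp) / "out", files_per_archive=10)))
```

A transfer of a few large archives avoids per-file overhead and keeps any single directory listing small.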
Summary
• JGI is a DOE User Facility that produces a lot of complex, unique data
for the scientific community
• As instruments improve, the data is higher quality, but metadata can still
be problematic
• We’d be lost without a good data management system
• JGI is turning to distributed computing for processing and large-scale
analyses
• Data movement is made much easier and faster with Globus
Upcoming Virtual Annual Meeting/Resource Calls
● Aug 30 – Sept 1: 3 x 6-hour days, 2 sessions/day
– Exploring the Universe of Specialized Metabolites
– From Microbial Sequence to Environmental Function
– The Many Facets of Plant-Microbial Interactions
– Machine Learning and Artificial Intelligence for Biology
– Integrative Omics-Inspired Plant and Microbe Engineering
– Technology Innovations
● Community Science Program (CSP) Functional Genomics
proposal deadline: July 31
– Genes/Pathway synthesis
– Strain engineering
– Data mining
– Metabolomics
– RNA-seq
● New Investigator Call proposal deadline: Sept 15
– Bacterial and archaeal isolates and single cell draft genomes
– Metagenomes/metatranscriptomes
– DNA synthesis- and Metabolomics-based functional analysis
bit.ly/JGI-User-Programs
bit.ly/JGI-Meeting2021
jgi-comms@lbl.gov

Mais conteúdo relacionado

Mais procurados

Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Dan Taylor
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Robert Grossman
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data CenterGilles Fedak
 
Study on potential capabilities of a nodb system
Study on potential capabilities of a nodb systemStudy on potential capabilities of a nodb system
Study on potential capabilities of a nodb systemijitjournal
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011Ian Foster
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer OverlordsIan Foster
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?Robert Grossman
 
Networking Materials Data
Networking Materials DataNetworking Materials Data
Networking Materials DataIan Foster
 
BioCASE web services for germplasm data sets, at FAO, Rome (2006)
BioCASE web services for germplasm data sets, at FAO, Rome (2006)BioCASE web services for germplasm data sets, at FAO, Rome (2006)
BioCASE web services for germplasm data sets, at FAO, Rome (2006)Dag Endresen
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Robert Grossman
 
TDWG and GBIF, at European genbank network meeting (Bonn, April 2004)
TDWG and GBIF, at European genbank network meeting (Bonn, April 2004)TDWG and GBIF, at European genbank network meeting (Bonn, April 2004)
TDWG and GBIF, at European genbank network meeting (Bonn, April 2004)Dag Endresen
 
User Engagement in Research Data Curation
User Engagement in Research Data CurationUser Engagement in Research Data Curation
User Engagement in Research Data CurationUniversity of Edinburgh
 
Collaboration to Curation: The High Rise Project meets Edinburgh DataShare
Collaboration to Curation: The High Rise Project meets Edinburgh DataShare Collaboration to Curation: The High Rise Project meets Edinburgh DataShare
Collaboration to Curation: The High Rise Project meets Edinburgh DataShare University of Edinburgh
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Robert Grossman
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchRobert Grossman
 

Mais procurados (20)

Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2
 
Glasgow University Geo Metadata Workshop
Glasgow University Geo Metadata WorkshopGlasgow University Geo Metadata Workshop
Glasgow University Geo Metadata Workshop
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)
 
Geospatial Metadata Workshop
Geospatial Metadata WorkshopGeospatial Metadata Workshop
Geospatial Metadata Workshop
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data Center
 
Study on potential capabilities of a nodb system
Study on potential capabilities of a nodb systemStudy on potential capabilities of a nodb system
Study on potential capabilities of a nodb system
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?
 
Networking Materials Data
Networking Materials DataNetworking Materials Data
Networking Materials Data
 
BioCASE web services for germplasm data sets, at FAO, Rome (2006)
BioCASE web services for germplasm data sets, at FAO, Rome (2006)BioCASE web services for germplasm data sets, at FAO, Rome (2006)
BioCASE web services for germplasm data sets, at FAO, Rome (2006)
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
TDWG and GBIF, at European genbank network meeting (Bonn, April 2004)
TDWG and GBIF, at European genbank network meeting (Bonn, April 2004)TDWG and GBIF, at European genbank network meeting (Bonn, April 2004)
TDWG and GBIF, at European genbank network meeting (Bonn, April 2004)
 
User Engagement in Research Data Curation
User Engagement in Research Data CurationUser Engagement in Research Data Curation
User Engagement in Research Data Curation
 
Collaboration to Curation: The High Rise Project meets Edinburgh DataShare
Collaboration to Curation: The High Rise Project meets Edinburgh DataShare Collaboration to Curation: The High Rise Project meets Edinburgh DataShare
Collaboration to Curation: The High Rise Project meets Edinburgh DataShare
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
Geoservices Activities at EDINA
Geoservices Activities at EDINAGeoservices Activities at EDINA
Geoservices Activities at EDINA
 
Participatory Web
Participatory WebParticipatory Web
Participatory Web
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 

Semelhante a GlobusWorld 2021: Managing Genomics Data at the DOE Joint Genomics Institute

Data accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphereData accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphereAlex Hardisty
 
GBIF: An infrastructure for infrastructures
GBIF: An infrastructure for infrastructures GBIF: An infrastructure for infrastructures
GBIF: An infrastructure for infrastructures Francisco Pando
 
Big Data in Bioinformatics & the Era of Cloud Computing
Big Data in Bioinformatics & the Era of Cloud ComputingBig Data in Bioinformatics & the Era of Cloud Computing
Big Data in Bioinformatics & the Era of Cloud ComputingIOSR Journals
 
2016 09 cxo forum
2016 09 cxo forum2016 09 cxo forum
2016 09 cxo forumChris Dwan
 
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU Love Arora
 
GBIF BIFA mentoring, Day 5a Data management, July 2016
GBIF BIFA mentoring, Day 5a Data management, July 2016GBIF BIFA mentoring, Day 5a Data management, July 2016
GBIF BIFA mentoring, Day 5a Data management, July 2016Dag Endresen
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionIan Foster
 
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...CINECAProject
 
Dealing with Semantic Heterogeneity in Real-Time Information
Dealing with Semantic Heterogeneity in Real-Time InformationDealing with Semantic Heterogeneity in Real-Time Information
Dealing with Semantic Heterogeneity in Real-Time InformationEdward Curry
 
Global Network Advancement Group - Next Generation Network-Integrated Systems
Global Network Advancement Group - Next Generation Network-Integrated SystemsGlobal Network Advancement Group - Next Generation Network-Integrated Systems
Global Network Advancement Group - Next Generation Network-Integrated SystemsLarry Smarr
 
Global Network Advancement Group Next Generation Network-Integrated Sys...
      Global Network Advancement GroupNext Generation Network-Integrated Sys...      Global Network Advancement GroupNext Generation Network-Integrated Sys...
Global Network Advancement Group Next Generation Network-Integrated Sys...Larry Smarr
 
Keynote, Oman Geospatial Expo, Dec 2013
Keynote, Oman Geospatial Expo, Dec 2013Keynote, Oman Geospatial Expo, Dec 2013
Keynote, Oman Geospatial Expo, Dec 2013Steven Ramage
 
Data as a research output and a research asset: the case for Open Science/Sim...
Data as a research output and a research asset: the case for Open Science/Sim...Data as a research output and a research asset: the case for Open Science/Sim...
Data as a research output and a research asset: the case for Open Science/Sim...African Open Science Platform
 
GBIF (Global Biodiversity Information Facility) Position Paper: Data Hosting ...
GBIF (Global Biodiversity Information Facility) Position Paper: Data Hosting ...GBIF (Global Biodiversity Information Facility) Position Paper: Data Hosting ...
GBIF (Global Biodiversity Information Facility) Position Paper: Data Hosting ...Phil Cryer
 
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...datacite
 
Data bio d6.2-data-management-plan_v1.0_2017-06-30_crea
Data bio d6.2-data-management-plan_v1.0_2017-06-30_creaData bio d6.2-data-management-plan_v1.0_2017-06-30_crea
Data bio d6.2-data-management-plan_v1.0_2017-06-30_creaWirelessInfo
 
Ontology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and SecurityOntology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and SecurityBarry Smith
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchRobert Grossman
 

Semelhante a GlobusWorld 2021: Managing Genomics Data at the DOE Joint Genomics Institute (20)

Data accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphereData accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphere
 
Open Access as a Means to Produce High Quality Data
Open Access as a Means to Produce High Quality DataOpen Access as a Means to Produce High Quality Data
Open Access as a Means to Produce High Quality Data
 
GBIF: An infrastructure for infrastructures
GBIF: An infrastructure for infrastructures GBIF: An infrastructure for infrastructures
GBIF: An infrastructure for infrastructures
 
Big Data in Bioinformatics & the Era of Cloud Computing
Big Data in Bioinformatics & the Era of Cloud ComputingBig Data in Bioinformatics & the Era of Cloud Computing
Big Data in Bioinformatics & the Era of Cloud Computing
 
2016 09 cxo forum
2016 09 cxo forum2016 09 cxo forum
2016 09 cxo forum
 
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
Big Data Handling Technologies ICCCS 2014_Love Arora _GNDU
 
GBIF BIFA mentoring, Day 5a Data management, July 2016
GBIF BIFA mentoring, Day 5a Data management, July 2016GBIF BIFA mentoring, Day 5a Data management, July 2016
GBIF BIFA mentoring, Day 5a Data management, July 2016
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, Evolution
 
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
 
Dealing with Semantic Heterogeneity in Real-Time Information
Dealing with Semantic Heterogeneity in Real-Time InformationDealing with Semantic Heterogeneity in Real-Time Information
Dealing with Semantic Heterogeneity in Real-Time Information
 
Global Network Advancement Group - Next Generation Network-Integrated Systems
Global Network Advancement Group - Next Generation Network-Integrated SystemsGlobal Network Advancement Group - Next Generation Network-Integrated Systems
Global Network Advancement Group - Next Generation Network-Integrated Systems
 
Global Network Advancement Group Next Generation Network-Integrated Sys...
      Global Network Advancement GroupNext Generation Network-Integrated Sys...      Global Network Advancement GroupNext Generation Network-Integrated Sys...
Global Network Advancement Group Next Generation Network-Integrated Sys...
 
Keynote, Oman Geospatial Expo, Dec 2013
Keynote, Oman Geospatial Expo, Dec 2013Keynote, Oman Geospatial Expo, Dec 2013
Keynote, Oman Geospatial Expo, Dec 2013
 
Data as a research output and a research asset: the case for Open Science/Sim...
Data as a research output and a research asset: the case for Open Science/Sim...Data as a research output and a research asset: the case for Open Science/Sim...
Data as a research output and a research asset: the case for Open Science/Sim...
 
Shifting the goal post – from high impact journals to high impact data
 Shifting the goal post – from high impact journals to high impact data Shifting the goal post – from high impact journals to high impact data
Shifting the goal post – from high impact journals to high impact data
 
GBIF (Global Biodiversity Information Facility) Position Paper: Data Hosting ...
GBIF (Global Biodiversity Information Facility) Position Paper: Data Hosting ...GBIF (Global Biodiversity Information Facility) Position Paper: Data Hosting ...
GBIF (Global Biodiversity Information Facility) Position Paper: Data Hosting ...
 
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
 
Data bio d6.2-data-management-plan_v1.0_2017-06-30_crea
Data bio d6.2-data-management-plan_v1.0_2017-06-30_creaData bio d6.2-data-management-plan_v1.0_2017-06-30_crea
Data bio d6.2-data-management-plan_v1.0_2017-06-30_crea
 
Ontology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and SecurityOntology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and Security
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical Research
 

Mais de Globus

Advanced Globus System Administration Topics
Advanced Globus System Administration TopicsAdvanced Globus System Administration Topics
Advanced Globus System Administration TopicsGlobus
 
Instrument Data Automation: The Life of a Flow
Instrument Data Automation: The Life of a FlowInstrument Data Automation: The Life of a Flow
Instrument Data Automation: The Life of a FlowGlobus
 
Building Research Applications with Globus PaaS
Building Research Applications with Globus PaaSBuilding Research Applications with Globus PaaS
Building Research Applications with Globus PaaSGlobus
 
Reliable, Remote Computation at All Scales
Reliable, Remote Computation at All ScalesReliable, Remote Computation at All Scales
Reliable, Remote Computation at All ScalesGlobus
 
Best Practices for Data Sharing Using Globus
Best Practices for Data Sharing Using GlobusBest Practices for Data Sharing Using Globus
Best Practices for Data Sharing Using GlobusGlobus
 
An Introduction to Globus for Researchers
An Introduction to Globus for ResearchersAn Introduction to Globus for Researchers
An Introduction to Globus for ResearchersGlobus
 
Introduction to Research Automation with Globus
Introduction to Research Automation with GlobusIntroduction to Research Automation with Globus
Introduction to Research Automation with GlobusGlobus
 
Globus for System Administrators
Globus for System AdministratorsGlobus for System Administrators
Globus for System AdministratorsGlobus
 
Introduction to Globus for System Administrators
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System AdministratorsGlobus
 
Introduction to Data Transfer and Sharing for Researchers
Introduction to Data Transfer and Sharing for ResearchersIntroduction to Data Transfer and Sharing for Researchers
Introduction to Data Transfer and Sharing for ResearchersGlobus
 
Introduction to the Globus Platform for Developers
Introduction to the Globus Platform for DevelopersIntroduction to the Globus Platform for Developers
Introduction to the Globus Platform for DevelopersGlobus
 
Introduction to the Command Line Interface (CLI)
Introduction to the Command Line Interface (CLI)Introduction to the Command Line Interface (CLI)
Introduction to the Command Line Interface (CLI)Globus
 
Automating Research Data with Globus Flows and Compute
Automating Research Data with Globus Flows and ComputeAutomating Research Data with Globus Flows and Compute
Automating Research Data with Globus Flows and ComputeGlobus
 
Automating Research Data Flows and Introduction to the Globus Platform
Automating Research Data Flows and Introduction to the Globus PlatformAutomating Research Data Flows and Introduction to the Globus Platform
Automating Research Data Flows and Introduction to the Globus PlatformGlobus
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System AdministrationGlobus
 
Introduction to Globus for System Administrators
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System AdministratorsGlobus
 
Introduction to Globus for New Users
Introduction to Globus for New UsersIntroduction to Globus for New Users
Introduction to Globus for New UsersGlobus
 
Working with Globus Platform Services and Portals
Working with Globus Platform Services and PortalsWorking with Globus Platform Services and Portals
Working with Globus Platform Services and PortalsGlobus
 
Globus Automation
Globus AutomationGlobus Automation
Globus AutomationGlobus
 
Advanced Globus System Administration
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System AdministrationGlobus
 

Mais de Globus (20)

Advanced Globus System Administration Topics
Advanced Globus System Administration TopicsAdvanced Globus System Administration Topics
Advanced Globus System Administration Topics
 
Instrument Data Automation: The Life of a Flow
Instrument Data Automation: The Life of a FlowInstrument Data Automation: The Life of a Flow
Instrument Data Automation: The Life of a Flow
 
Building Research Applications with Globus PaaS
Building Research Applications with Globus PaaSBuilding Research Applications with Globus PaaS
Building Research Applications with Globus PaaS
 
Reliable, Remote Computation at All Scales
Reliable, Remote Computation at All ScalesReliable, Remote Computation at All Scales
Reliable, Remote Computation at All Scales
 
Best Practices for Data Sharing Using Globus
Best Practices for Data Sharing Using GlobusBest Practices for Data Sharing Using Globus
Best Practices for Data Sharing Using Globus
 
An Introduction to Globus for Researchers
An Introduction to Globus for ResearchersAn Introduction to Globus for Researchers
An Introduction to Globus for Researchers
 
Introduction to Research Automation with Globus
GlobusWorld 2021: Managing Genomics Data at the DOE Joint Genomics Institute

  • 1. Managing the genomics data deluge at the DOE Joint Genome Institute. Kjiersten Fagnan, CIO, JGI
  • 2. The DOE Joint Genome Institute at a glance. JGI MISSION: To provide the global research community with free access to the most advanced integrative genome science capabilities in support of the DOE energy & environmental research mission. Integrative Genomics Building (IGB). U.S. Department of Energy Office of Science User Facility ● JGI established in 1997, User facility from 2004 ● Located at Lawrence Berkeley National Laboratory ● ~285 staff; ~$80M annual funding ● 2,038 Global Primary Users in FY20; >10,000 Data Users
  • 5. Environmental genomics will enable the Bioeconomy. Figure labels: Genetic “Circuit”; Gene; Enzyme; Microbial Factory; DNA; 2 NH4 2+; CO3 2-
  • 6. FY 2020 Users: 2,038 Worldwide. Users on the Map: 2,038. Academic: 1,504 (74%); Government: 183 (9%); DOE (national labs only): 161 (8%); Industry: 29 (1%); Other: 161 (8%)
  • 7. Projects Completed/Scientific Publications. Figure panels: Cumulative Number of Projects Completed; Cumulative Number of Scientific Publications
  • 8. Sequence Output. Figure panels: Massively Parallel Short Read Sequencing, Basepairs (GB); Single Molecule Long Read Sequencing, Basepairs (GB)
  • 9. DOE Office of Science Public Reusable Research Data (PuRe Data): https://science.osti.gov/Initiatives/PuRe-Data/Resources-at-a-Glance
  • 10. Deluge of Large, Complex Data Sets. JGI manages a 10+ PB data repository
  • 11. Mega – Giga – Tera – Peta – Exa – Zetta – Yotta. The cost to store 1 Yottabyte of data: $100 trillion* (*https://www.theatlantic.com/technology/archive/2011/05/infographic-how-big-is-a-yottabyte/239034/). This is just genomics data… we also want metabolomes, transcriptomes, proteomes, image data. 5/19/2021
  • 12. The Immense Scale of Omics Data. Advances in sequencing and omics technologies have far outpaced data infrastructure. How do we remove the barriers to data access and analysis at scale?
  • 13. Data Management is Critical. Diagram labels: PMO; SDM; QAQC/RQC; GAAG; Plant; MEP; RnD; Fungal; Genome Portal; IMG; MGM; External Collaborators; Web Services (Mycocosm, Phytozome, IMGM/ER). In 2013, JGI deployed a hierarchical data management system to deal with the exponential growth in sequence data and analysis products
  • 14. JGI Archive and Metadata Organizer (JAMO). Diagram labels: GAAG; Plant; MEP; RnD; Fungal; IMG; MGM; SDM; QAQC/RQC; Web Services (Mycocosm, Phytozome, IMGM/ER); Genome Portal; External Collaborators; PMO
  • 16. JAMO Enabled Increased Automation Between Groups • JGI’s core pipelines connect with JAMO and provide metadata through templates • Once data is available for processing, the workflows are triggered automatically • Data that fails QC is flagged for review
  • 17. JAMO is the Backbone of JGI’s Data Portal. All the metadata used to populate the Data Portal comes from JAMO’s MongoDB
  • 18. Code for America Summit Talk on JGI’s New Data Portal: Aligning Data Across Siloed Departments. Many government sectors have been collecting data digitally for decades, often in uncoordinated ways. In this talk we’ll explore how Truss and Joint Genome Institute partnered to break down data silos and start conversations across departments to align metadata across the organization. From establishing baseline agreements, to finding common outcomes everyone could agree upon, to bringing old data sets into the present, this talk will provide useful tools for practitioners facing challenges of data misalignment across multiple departments. It's Thursday, later in the day, 2:00-3:00 pm PST. https://summit.codeforamerica.org/agenda/
  • 19. Improving Search Across JGI. Metadata in one place makes search across all JGI programs possible. Diagram labels: JGI-KBase RESTful Service; JGI Data and Metadata system including LIMS, GOLD, sequence, assemblies, annotations; Metadata and file types; User; Query; Response; Data sets
  • 20. Most of JGI’s Infrastructure is @NERSC
  • 21. Berkeley Lab is on a Major Fault Line. NERSC is here! Most samples used to generate data at JGI are unique and irreplaceable
  • 22. Backing up Irreplaceable Data • Moved 1 PB of data to ORNL for safe-keeping • Data migration completed in 5 days using Globus • Enables access to the data – but only useful with the right metadata. Diagram labels: Main JGI Data Repository; API; HPSS Archive; JAMO light; DTN; DTN; SUMMIT; API
  • 23. What can you do with all that data and a supercomputer? A Gordon Bell Prize (Supercomputing) winner in 2018 used all the well-characterized publicly available data to look at genetic underpinnings of opioid addiction. Wayne Joubert, et al. 2018. Attacking the opioid epidemic: determining the epistatic and pleiotropic genetic architectures for chronic pain and opioid addiction. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’18). IEEE Press, Article 57, 1–14. Access to large amounts of ‘omics data enables scientists to explore a broad range of hypotheses!
  • 24. CA has Earthquakes and Fires! We need to distribute Data and Analysis to maintain scientific productivity
  • 25. JGI’s Centralized Workflow System ● JGI Analysis Workflow Service (JAWS) ● Need to be able to compute at multiple centers: NERSC, LBL IT, others ● Need to have more readily reusable and modifiable bioinformatics pipelines ● Need workflows to support FAIR* guidelines ● Objective: Portable, Reusable, Traceable workflows on a Robust platform. *Findable, Accessible, Interoperable, Reusable
  • 26. Distributed Computing is Hard • Managing multiple user accounts • Different facilities have different policies – Batch schedulers – File system availability and data retention • Different architectures – CPU vs GPU – Local disk vs parallel file systems – Memory size and footprint • Portability is a lot of work
  • 27. JGI is Running Analyses Across the West Coast. Diagram labels: JGI Centralized Workflow System; Workflow Description Language; Cromwell Workflow Manager; Common interface to access resources; Additional resources (cloud, ORNL, ANL, etc); initial testing; future
  • 28. JGI is Running Analyses Across the West Coast. JGI Centralized Workflow System; Workflow Description Language. 1. Find the data for analysis in the data management system 2. Authenticate with Globus and transfer the data to the remote computing resource 3. Work is executed, results are generated 4. Transfer data back to the home repository with Globus 5. Register the data and metadata with JAMO. Application tokens are accepted by the facilities we are using, making it possible to transfer data on behalf of the user
  • 29. Data Movement Between Resources – Globus! • JGI has been using Globus since ~2012 to move data around – One time we broke the service by trying to move millions of tiny files that were all in the same directory :D • Globus enables JGI collaborators to download large amounts of data – Biggest customers are the Bioenergy Research Centers, DOE-funded facilities investigating biofuels – Some JGI Users are still willing to wait 9+ days for a download to complete via the browser: education opportunity! • Globus is an integral part of JAWS – Enables the application to move data between computing resources on behalf of the user
  • 30. Summary • JGI is a DOE User Facility that produces a lot of complex, unique data for the scientific community • As instruments improve, the data is higher quality – *metadata can still be problematic • We’d be lost without a good data management system • JGI is turning to distributed computing for processing and large-scale analyses • Data movement made much easier and faster with Globus
  • 31. Upcoming Virtual Annual Meeting/Resource Calls ● Aug 30 – Sept 1: 3 x 6-hour days, 2 sessions/day – Exploring the Universe of Specialized Metabolites – From Microbial Sequence to Environmental Function – The Many Facets of Plant-Microbial Interactions – Machine Learning and Artificial Intelligence for Biology – Integrative Omics-Inspired Plant and Microbe Engineering – Technology Innovations ● Community Science Program (CSP) Functional Genomics proposal deadline: July 31 – Genes/Pathway synthesis – Strain engineering – Data mining – Metabolomics – RNA-seq ● New Investigator Call proposal deadline: Sept 15 – Bacterial and archaeal isolates and single cell draft genomes – Metagenomes/metatranscriptomes – DNA synthesis- and Metabolomics-based functional analysis. bit.ly/JGI-User-Programs; bit.ly/JGI-Meeting2021; jgi-comms@lbl.gov
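
Slide 22 reports that 1 PB was moved to ORNL in 5 days with Globus. As a rough back-of-the-envelope check of the average rate this implies, the sketch below assumes a decimal petabyte (10^15 bytes) and a transfer that ran continuously for the full five days. Neither assumption is stated in the slides, and the function name is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: decimal petabyte; the slide only says "1 PB".
PETABYTE = 10**15


def sustained_rate_bytes_per_s(total_bytes: float, days: float) -> float:
    """Average rate (bytes/second) needed to move total_bytes in `days` days."""
    return total_bytes / (days * SECONDS_PER_DAY)


rate = sustained_rate_bytes_per_s(PETABYTE, 5)
print(f"{rate / 1e9:.2f} GB/s")        # average gigabytes per second
print(f"{rate * 8 / 1e9:.1f} Gbit/s")  # the same rate in gigabits per second
```

Under these assumptions the average works out to about 2.31 GB/s, or roughly 18.5 Gbit/s, sustained around the clock. Any idle time during the five days would mean the peak rate was higher.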