CCCB RNA-Seq DGE
Analysis Made Easy
Center for Cancer Computational Biology (SM822)
Bioinformatics Team
Homepage: https://cccb.dfci.harvard.edu/
Twitter: @CCCBseq
So why are we here...
You have RNA-Seq data generated but...
○ Uploading to the Galaxy public server for analysis takes forever
○ My bioinformaticians cannot process it today
○ Sequence alignment is taking forever
○ You want to make additional differential expression contrasts
○ Formatting DGE results for GSEA analysis somehow doesn’t work
○ I am the bioinformatician and don’t have the time to process all this data (for others and for free)
○ Bioinformatics core services can be expensive and take time
○ The Cancer Genomics Cloud, while powerful, requires a good understanding of Amazon or Google cloud systems to manage projects and payment for the computing cost
CCCB Cloud System can help
Fast
○ Scalable infrastructure with virtually no computing resource limitation
○ Minimal queue time to get data analyzed
Secure
○ Google Cloud Platform (GCP) is covered by a Google-DFCI BAA to ensure HIPAA-compliant security
Convenient
○ Simplified upload and download of large data through parallelized, direct cloud-to-cloud transfer between Dropbox and GCP, reducing data transfer time from hours to minutes
○ As on other cloud platforms, users are set up to pay for overhead and computing time, but without a steep learning curve to request a project or manage payment
RNA-Seq DGE analysis should not be difficult
Most RNA-Seq data can be aligned and quantified using the same settings for
initial DGE analysis
The technical bottleneck is often gathering enough computing power and setting up a proper analysis environment… after the data transfer problem is solved
Workflow: Fastq files → Align → Quantify → DGE → Clustering → Func. Enrichment
Please use your Gmail account to log in to https://cccb-analysis.tm4.org/ and upload the FASTQ data files
CCCB Cloud System- authentication
1. Use an Incognito/Private browser session
2. Sign in to https://cccb-analysis.tm4.org with the provided Google account
- @gmail.com address
- DFCI G Suite email (first_last@mail.dfci.harvard.edu)
CCCB Cloud System- analysis setup
3. Click on ‘Upload files’ on analysis homepage
- All analysis projects associated with your email
- Projects created on your behalf by CCCB
- Status messages, Click on next steps
CCCB Cloud System- analysis setup
4. Choose your reference genome
5. Edit the project name to something meaningful
CCCB Cloud System- file uploads
6. Upload Files
a. Dropbox
- Preferred method
- Log in to Dropbox again
- Select files and upload
b. From local computer
- File chooser
- Drag/drop interface
- Slow transfer through HTTPS
File naming instructions
- Email notification when transfer is complete.
CCCB Cloud System- file uploads
7. After receiving the email (if using Dropbox), refresh the page.
Uploaded files will be visible
CCCB Cloud System- Assign Sample Name
8. Set Sample Names
Sample names are inferred from sequencing file names. You can create new samples or remove existing ones.
- Drag/drop files to the proper sample
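As a rough illustration of how sample names could be inferred from sequencing file names, the sketch below strips a read/lane suffix and groups files by what is left. The regular expression (an Illumina-style `_S<n>_L<lane>_R1_001` suffix) and the example file names are assumptions; the slides do not spell out the platform's actual naming rules.

```python
import re
from collections import defaultdict

# Assumed suffix pattern: optional _S<n>, optional _L<lane>, _R1/_R2,
# optional _001, then .fastq or .fq with an optional .gz
SUFFIX = re.compile(r"(_S\d+)?(_L\d{3})?_R[12](_\d{3})?\.(fastq|fq)(\.gz)?$")


def infer_sample_name(filename):
    """Return the part of the file name before the assumed suffix, or None."""
    match = SUFFIX.search(filename)
    if match is None:
        return None
    return filename[:match.start()]


def group_by_sample(filenames):
    """Group file names by inferred sample; unmatched files go to 'unassigned'."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        sample = infer_sample_name(name)
        key = sample if sample is not None else "unassigned"
        groups[key].append(name)
    return dict(groups)


# Hypothetical file names for illustration only
files = [
    "KO_1_S3_L001_R1_001.fastq.gz",
    "KO_1_S3_L001_R2_001.fastq.gz",
    "WT_1_R1.fastq.gz",
    "notes.txt",
]
groups = group_by_sample(files)
```

Files that land in "unassigned" are the ones a user would drag to the proper sample by hand.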
CCCB Cloud System- Align and Quantify
RNA-Seq DGE Analysis Under the Hood
- Parallelized:
- Alignment (STAR aligner) → BAM files
- Sort, primary-alignment filtering, duplicate evaluation (Samtools, Picard)
- Quantification (featureCounts)
- Merging:
- Overall “raw” (not normalized) count matrix
- Differential expression testing with DESeq2
- Plots/figures
(Diagram: Master → Sample 1, Sample 2, … Sample N)
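The merging step can be pictured with a minimal sketch that combines per-sample gene counts into one raw count matrix. It assumes each sample's counts are already loaded as a gene-to-count dictionary and that a gene missing from a sample counts as 0; both are assumptions for illustration, and the function names are hypothetical rather than part of the CCCB pipeline.

```python
def merge_counts(per_sample_counts):
    """Merge {sample: {gene: count}} into (samples, genes, rows).

    rows[i][j] is the raw count of genes[i] in samples[j].
    """
    samples = sorted(per_sample_counts)
    genes = sorted({g for counts in per_sample_counts.values() for g in counts})
    # Assumption: a gene absent from a sample's table is treated as zero
    rows = [[per_sample_counts[s].get(g, 0) for s in samples] for g in genes]
    return samples, genes, rows


def to_tsv(samples, genes, rows):
    """Render the matrix as tab-separated text with a header row."""
    lines = ["\t".join(["gene"] + samples)]
    for gene, row in zip(genes, rows):
        lines.append("\t".join([gene] + [str(v) for v in row]))
    return "\n".join(lines) + "\n"


# Hypothetical per-sample counts
per_sample = {
    "sample_1": {"GENE_A": 10, "GENE_B": 0},
    "sample_2": {"GENE_A": 7, "GENE_C": 3},
}
samples, genes, rows = merge_counts(per_sample)
matrix_tsv = to_tsv(samples, genes, rows)
```

The matrix stays unnormalized here because normalization is left to the differential expression step.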
Alignment is a Computationally Intensive Process
Running on Local Computing
● Requires knowledge of Unix and high-performance computing
● Requires powerful computing infrastructure (e.g. a 64-bit machine with 30+ GB RAM)
● Requires the ability to write scripts and programs
● Requires an understanding of the process to run the alignment program
Running on Public Web Servers
● Wait time for most public web servers such as Galaxy (https://galaxyproject.org/) and Genboree (http://genboree.org/) increases with the number of users
● Most of them use the HTTPS protocol and allow only one FASTQ file upload at a time.
● The Cancer Genomics Cloud (http://www.cancergenomicscloud.org/) requires a good understanding of Amazon or Google cloud systems to set up projects and payment
Typical RNASeq DGE Experimental Design
Difficult to estimate the minimum number of biological replicates required, but
typical rule of thumb:
● 3+ for cell lines
● 5+ for inbred lines of model organisms
● As many samples as possible for human studies
A single RNA-Seq experiment usually has between 6 and 20+ samples. On a public web server, the wait time for upload, run time, and download increases linearly with the number of samples, with a risk of broken connections.
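To make the linear-scaling point concrete, here is a back-of-the-envelope sketch comparing one-at-a-time processing with processing spread over several workers. The per-sample time and worker counts are made-up numbers, and the model ignores queueing and transfer overhead.

```python
import math


def sequential_hours(n_samples, hours_per_sample):
    """Total time when samples are handled one after another."""
    return n_samples * hours_per_sample


def parallel_hours(n_samples, hours_per_sample, n_workers):
    """Total time when up to n_workers samples run at the same time."""
    batches = math.ceil(n_samples / n_workers)
    return batches * hours_per_sample


# Hypothetical experiment: 12 samples, 1.5 hours each
n, per_sample = 12, 1.5
seq = sequential_hours(n, per_sample)
par_full = parallel_hours(n, per_sample, n_workers=12)
par_partial = parallel_hours(n, per_sample, n_workers=5)
```

Under these assumptions the sequential run takes 18 hours, while one worker per sample brings it down to the time of a single sample.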
CCCB Cloud Infrastructure
(Diagram labels: Users, CCCB Bioinformatics, CCCB Sequencing)
Data Upload
Application: “Download 50 fastq files!”
Pulls raw data from Dropbox and pushes it into Google Storage buckets
Scaling
Application: “Align N samples”
Independent nodes/images
- Each node needs a large amount of data (e.g. index files for aligners)
- Pre-built images minimize data transfer
- Communication about status
Pulls raw data from, and pushes processed data to, Google Storage buckets
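The fan-out described on this slide (one request, N independent jobs, status reported back) can be sketched as follows. Local threads stand in for independent cloud nodes, and `process_sample` is a hypothetical placeholder for "pull raw data, align, push results"; none of this is the platform's actual code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_sample(sample):
    """Hypothetical stand-in for the work done by one node."""
    if not sample:
        raise ValueError("empty sample name")
    return f"{sample}.bam"


def run_all(samples, max_workers=4):
    """Dispatch every sample independently and collect a status per sample."""
    status = {s: "queued" for s in samples}
    outputs = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_sample, s): s for s in samples}
        for future in as_completed(futures):
            sample = futures[future]
            try:
                outputs[sample] = future.result()
                status[sample] = "done"
            except Exception as exc:
                # One failed sample does not stop the others
                status[sample] = f"failed: {exc}"
    return status, outputs


status, outputs = run_all(["sample_1", "sample_2", ""])
```

Keeping a per-sample status means a single bad input shows up as one failed entry instead of aborting the whole run.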
Task management for data download
Application: “Transfer these 50 FASTQ files (>2 GB each) to my Partners Dropbox!”
Fast download for output files using
Dropbox
Save output by direct download or
Dropbox transfer:
- Authenticated: only those logged in as your Google user can access files
- Direct transfer to Dropbox
storage for fast data transfer
and backup
- Email notification after transfer
is complete
- A master directory called
“cccb_transfers/” will be
created in Dropbox and
organized by projects
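A small sketch of how destination paths under the master directory might be built. Only the directory name "cccb_transfers/" and the per-project organization come from the slide; the exact layout below the project level and the example names are assumptions.

```python
from pathlib import PurePosixPath

# Directory name taken from the slide
ROOT = PurePosixPath("cccb_transfers")


def dropbox_destination(project, filename):
    """Return an assumed destination path: cccb_transfers/<project>/<file name>."""
    # Keep only the base name so source subdirectories are not recreated
    return ROOT / project / PurePosixPath(filename).name


# Hypothetical project and output file
dest = dropbox_destination("my_rnaseq_project", "results/deseq2_results.tsv")
```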
Straightforward differential analysis
Available
processed samples
Human-readable
contrast name
Thresholds used for
creating heatmaps and
volcano plots
Drag/drop samples
into contrast groups
Can rename groups
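The contrast form can be thought of as a small data structure plus a filter. The sketch below is an assumption-laden illustration: the class, the default thresholds (adjusted p-value 0.05, absolute log2 fold change 1.0), and the "gene" key are hypothetical, while `log2FoldChange` and `padj` follow DESeq2's result column names.

```python
from dataclasses import dataclass


@dataclass
class Contrast:
    name: str                    # human-readable contrast name
    group_a: list                # sample names in the first group
    group_b: list                # sample names in the second group
    padj_max: float = 0.05       # assumed default threshold
    abs_log2fc_min: float = 1.0  # assumed default threshold

    def validate(self):
        """Reject empty groups and samples assigned to both groups."""
        if not self.group_a or not self.group_b:
            raise ValueError("both groups need at least one sample")
        overlap = set(self.group_a) & set(self.group_b)
        if overlap:
            raise ValueError(f"samples in both groups: {sorted(overlap)}")


def significant_genes(results, contrast):
    """Return genes passing both thresholds; rows with missing padj are skipped."""
    hits = []
    for row in results:
        padj = row["padj"]
        if padj is None:
            continue
        if padj <= contrast.padj_max and abs(row["log2FoldChange"]) >= contrast.abs_log2fc_min:
            hits.append(row["gene"])
    return hits


# Hypothetical contrast and results table
contrast = Contrast(
    name="treated_vs_control",
    group_a=["treated_1", "treated_2"],
    group_b=["control_1", "control_2"],
)
contrast.validate()
results = [
    {"gene": "GENE_A", "log2FoldChange": 2.3, "padj": 0.001},
    {"gene": "GENE_B", "log2FoldChange": 0.2, "padj": 0.0001},
    {"gene": "GENE_C", "log2FoldChange": -1.8, "padj": 0.04},
    {"gene": "GENE_D", "log2FoldChange": 3.0, "padj": None},
]
hits = significant_genes(results, contrast)
```

The same two thresholds would decide which genes are highlighted in a volcano plot or included in a heatmap.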
Standard RNA-Seq DGE Output
Custom report
Basic figures
Output files
Raw counts, normalized counts,
Differential expression results
Files for GSEA analysis
Gene Set Enrichment Analysis
Broad Institute GSEA (http://software.broadinstitute.org/gsea/)
The CCCB Cloud Platform DGE analysis results include support files, the normalized count matrix and groups.cls, that can be imported directly into Broad Gene Set Enrichment Analysis (GSEA) and run against MSigDB gene sets
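For readers who need to rebuild a phenotype file for a new contrast, here is a minimal sketch of the categorical CLS layout as GSEA documents it: a first line with sample count, class count, and the constant 1; a second line starting with "#" that lists class names; and a third line with one label per sample in matrix column order. The function name and group labels are hypothetical, and the exact contents of the platform's own groups.cls are not shown in the slides.

```python
def build_cls(sample_groups):
    """Build categorical CLS text from one group label per sample.

    Labels must contain no spaces and must follow the column order
    of the expression matrix.
    """
    classes = []
    for label in sample_groups:
        if " " in label:
            raise ValueError(f"label contains a space: {label!r}")
        if label not in classes:
            classes.append(label)  # class order = order of first appearance
    header = f"{len(sample_groups)} {len(classes)} 1"
    class_line = "# " + " ".join(classes)
    label_line = " ".join(sample_groups)
    return "\n".join([header, class_line, label_line]) + "\n"


# Hypothetical design: two controls followed by two treated samples
cls_text = build_cls(["control", "control", "treated", "treated"])
```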
RNASeq Data Visualization
MultiExperiment Viewer (WebMEV): http://mev.tm4.org
Import the raw count matrix from the CCCB Cloud Platform directly to run more advanced analyses, including:
- Clustering (hierarchical, k-means, PCA, etc.)
- GO enrichment, pathway enrichment analyses
Backup Slides
For more information on Pipeline Services
Pricing Structure for RNASeq DGE
DFCI/BWH: $18 per sample
External Academia: $24 per sample
Industry: Inquire
CCCB Cloud Platform Road Map
GATK v3 (Live)/ v4 (May)
- Germline Mutation Calling for DNA-Seq
Mutect2 (April):
- Somatic Mutation Calling for tumor/normal paired DNA-Seq
Small non-coding RNA (April):
- Mapping and quantification of small non-coding RNA classes (miRNA, piRNA,
tRNA, snoRNA)
Transcript Isoform (May):
- Novel transcript isoform identification and quantification
Important accounts and where to get them
DFCI G Suite Account (or just Google Account)
Google accounts linked with organization emails are preferred, even though any Google account can be used. Members of the DFCI community should request a DFCI Google account (user@mail.dfci.harvard.edu) through the Research Computing website: http://rc.dfci.harvard.edu/contact-research-computing
Partners Dropbox
Any Dropbox account will work with our systems. Partners Health provides virtually unlimited encrypted storage on Dropbox Business, free of charge, for all Partners community members (anyone with a partners.org email). Information is available here: https://rc.partners.org/kb/collaboration/dropbox?article=2062
Agilent CrossLab (a.k.a. iLab Solutions)
Like most cores and centers around DFCI, we use iLab to track all of our projects. A free account can be requested at https://dfci.ilab.agilent.com/account/login
Request Project through iLab
For more info: http://cccb.dfci.harvard.edu/project-request
Request iLab Account and Project
For more info: http://cccb.dfci.harvard.edu/project-request
(Screenshot labels: CCCB; Analysis Pipeline)
Moving Beyond Excel: Data Wrangling with R
This introductory course is designed for investigators looking to improve their data analysis skills and move beyond
Excel. Participants will be introduced to the R language and its basic capabilities for data processing, motivated by
practical examples with high-throughput sequencing data such as differential expression or variant analyses.
No prior experience with R (or programming in general) is necessary.
Topics include:
● Introduction to R and the command line
● The power and ease of programming for consistent, reproducible research
● Reading and writing formatted datasets
● Filtering
● Data “cleaning”
● Data merging
● (If time permits) Basic plotting

