This is a talk I (Brian O'Connor) gave at Genome Informatics 2012 describing how SeqWare was used to build the Next-Generation Sequencing (NGS) software infrastructure needed for multiple genome centers (OICR & UNC) and how that was leveraged on Amazon's cloud. If you are interested in using SeqWare for your NGS analysis needs, see our open source site at http://seqware.github.com or, for a commercially supported version, see http://nimbusinformatics.com.
SeqWare on the Cloud: Porting a Genome Center's Infrastructure to Amazon Web Services
1. SeqWare on the Cloud: Porting a Genome Center's Infrastructure to Amazon Web Services
Brian O'Connor
SeqWare Software Architect & Manager for Software Engineering
The Ontario Institute for Cancer Research
2. Effective Scaling
[Diagram: an effective system scales along three dimensions: Integration & Sharing, Expertise, and Compute & Storage]
SeqWare was designed to scale in these ways
3. Effective Scaling
[Diagram: the same three scaling dimensions, with a callout noting that the Query Engine (see poster) addresses Integration & Sharing]
SeqWare was designed to scale in these ways
4. The Open Source SeqWare Project
[Diagram: five components: SeqWare Web Service, SeqWare Query Engine, SeqWare Portal, SeqWare Pipeline, and SeqWare MetaDB, deployed on a local cluster or the cloud and handling both big and small data]
5. Distinguishing Features of SeqWare
Compared with open-source/community tools (Firehose, Taverna) and commercial offerings, SeqWare provides:
● Infrastructure Toolkit
● Developer Framework
● Automation
● Environment-Agnostic
● Tailored for Big Projects
● User-Created Workflows
● Packaging Format
● Provenance Tracking
● Fault Tolerant
● Tools-Agnostic
● Open Source
6. Projects Using SeqWare
UNC Lineberger Cancer Center: Whole Genome, Targeted Re-Sequencing, RNASeq, + local projects (1.5 TBase, 927 samples, 982 “lanes”)
Ontario Institute for Cancer Research: Exome, Targeted Resequencing, Whole Genome, RNASeq, + local projects (38 TBase, 1,522 samples, 2,297 “lanes”)
Others: Iceman genome, plant genome assembly, HuRef 300x, clinical sequencing (9 genomes, hundreds of patient samples, JBrowse on a 300x genome on an iPad)
7. Scaling Expertise:
Analyzing Illumina Data @ OICR
● September 2011: rolled out SeqWare at OICR
● Goal: deploy SeqWare and streamline production analysis through automation
● 4 groups working together
● SeqWare Workflows for
– Large projects and common tasks
– Projects with “public uploads”
8. SeqWare at OICR
[Diagram: the SeqWare architecture (Web Service, Query Engine, Portal, Pipeline, MetaDB) as deployed at OICR on the local cluster and the cloud]
9. SeqWare at OICR
[Diagram: the SeqWare architecture, highlighting the components owned by the Software Engineering group]
10. SeqWare at OICR
[Diagram: the SeqWare architecture, highlighting the Pipeline & Tool Evaluation group's role]
11. SeqWare at OICR
[Diagram: the SeqWare architecture, with the sequencing facility's LIMS feeding the SeqWare MetaDB]
12. SeqWare at OICR
[Diagram: the SeqWare architecture, highlighting Production Informatics: user + data = “deciders” that drive SeqWare Pipeline from the MetaDB]
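The “deciders” mentioned on this slide are small automation jobs that match newly arrived data against workflows that have not yet run on it. A minimal sketch of that idea follows; the names here (`Sample`, `decide`, `run_decider`) are illustrative, not the real SeqWare API, and a real decider would query the SeqWare MetaDB rather than an in-memory list.

```python
# Hypothetical sketch of a SeqWare-style "decider": a scheduled job that
# scans metadata for samples with no completed run of a given workflow
# and launches that workflow for each of them.
from dataclasses import dataclass, field

@dataclass
class Sample:
    name: str
    completed_workflows: set = field(default_factory=set)

def decide(samples, workflow):
    """Return the samples that still need `workflow` run on them."""
    return [s for s in samples if workflow not in s.completed_workflows]

def run_decider(samples, workflow, launch):
    for s in decide(samples, workflow):
        launch(workflow, s)                   # schedule on the cluster/cloud
        s.completed_workflows.add(workflow)   # a real decider records this in the MetaDB

# Example: two samples, one already aligned
samples = [Sample("PCSI_0001", {"bwa-align"}), Sample("PCSI_0002")]
launched = []
run_decider(samples, "bwa-align", lambda wf, s: launched.append(s.name))
# only the second sample is launched
```

The key design point from the talk is that deciders make production analysis hands-off: data arriving from the LIMS automatically triggers the right workflows.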
13. OICR Production Workflows
Multiple groups contributed, including the new Pipeline and Evaluation Team
In production; staging, testing, & development workflows
14. OICR SeqWare Results
[Chart: samples and bases aligned over time (~2 years), reaching 38 trillion bases]
● Automated key components in production
● What about sharing our infrastructure?
● To the Cloud!
15. “The Cloud”
Want to share infrastructure without sharing infrastructure
Core Services: Elastic Cloud Compute, Simple Storage Service, Elastic Map/Reduce
Transfer & Web Services: Import/Export, DirectConnect, Elastic Beanstalk
Other Nifty Services: Glacier, DynamoDB, HBase, Linux tools for disk and file encryption
Hardware through API calls
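The “Linux tools for disk and file encryption” mentioned above can be illustrated with standard OpenSSL symmetric encryption before staging data to S3. This is a generic sketch, not the actual SeqWare tooling; the bucket name is hypothetical and the upload line is commented out because it needs AWS credentials.

```shell
set -e
# Stand-in for a real BAM file
echo "pretend-this-is-a-bam" > sample.bam

# Encrypt with AES-256 before it leaves the local cluster
openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in sample.bam -out sample.bam.enc -pass pass:s3cret

# aws s3 cp sample.bam.enc s3://my-bucket/incoming/   # hypothetical bucket

# Verify the round trip: decrypt and compare to the original
openssl enc -d -aes-256-cbc -pbkdf2 \
    -in sample.bam.enc -out roundtrip.bam -pass pass:s3cret
cmp sample.bam roundtrip.bam && echo "round-trip OK"
```

In practice the passphrase would come from a key file with restricted permissions rather than the command line, which is visible in the process table.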
16. Scaling Computation:
Analyzing SOLiD Data on Amazon
● The Life Collaborations Division had 9 human genomes, including the Iceman's genome and HuRef, resequenced at high depth
● Goal: deploy the SeqWare infrastructure on AWS and analyze the data in a scalable way
● Without building infrastructure
● Using open source tools
17. SeqWare Infrastructure on EC2
[Diagram: user-facing pieces (workflow bundle, command-line tools, launcher config; result files: BAM, VCF, reports) connect to the SeqWare MetaDB, Web Service, Pipeline, and Portal running on an Amazon EC2 instance or cluster, with data imported through Amazon S3]
18. Workflow Outputs
Results via project website on EC2 (http://icemangenome.net); variants loaded in a JBrowse genome browser on Elastic Beanstalk
Variants in a database and files in S3: variant database, BAM, VCF, annotated VCF
19. Results
● Cloud delivered fantastic computational and storage scalability
● Analyzed 9 human genomes, one at 300x!
● Costs
– 8-node HPC cluster, about 4 days
– 30x coverage genome was ~$1000 (<$15/GBase)
– ~$150 per exome ($10/GBase)
– ~$50/month/genome for storage, website, & browser
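The per-GBase figure above can be sanity-checked with back-of-the-envelope arithmetic, assuming a ~3.2 Gbase haploid human genome (the quoted $1000 is from the slide, not a new measurement):

```python
# Cost check for the ~$1000, 30x whole-genome figure quoted on the slide.
GENOME_GBASES = 3.2                      # haploid human genome, ~3.2 Gbase
coverage = 30
total_gbase = GENOME_GBASES * coverage   # ~96 GBase sequenced at 30x
cost_per_gbase = 1000 / total_gbase      # dollars per GBase
print(f"{total_gbase:.0f} GBase, ${cost_per_gbase:.2f}/GBase")
# about $10/GBase, consistent with the <$15/GBase quoted
```

The same arithmetic explains why a 300x genome was the expensive outlier: ten times the bases means roughly ten times the compute and storage cost.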
20. The Future of SeqWare
● Scalability
– Cloud-based cluster launching (StarCluster/CloudMan)
– Release encryption and distributed filesystem tools
– Better documentation and easier setup
● Expertise
– Simplify pipeline language(s) and development process
– Release OICR public workflows
● Integration
– Expand NoSQL variant/annotation database
– Support for other tools like Galaxy
21. Availability
● SeqWare available at: http://seqware.github.com, @SeqWare
● Distributed as a VirtualBox VM & AMI
● Brian O'Connor, boconnor@oicr.on.ca
22. Acknowledgements
● SeqWare @ OICR: Morgan Taschuk, Denis Yuen, Yong Liang
● OICR SeqProdBio: Tim Beck, Zheng Zha, Tony DeBat
● OICR Bioinformatics Core: Francis Ouellette, Zhibin Lu
● SeqWare @ UNC: Neil Hayes, Sara Grimm, Stuart Jefferys, Matt Solloway, and the Lineberger group
● Tim Harkins, Barry Merriman, Jason Warner, Kevin McKernan, Vincent Ferretti, Lincoln Stein