This is a talk I (Brian O'Connor) gave at Genome Informatics 2012 describing how SeqWare was used to build the Next-Generation Sequencing (NGS) software infrastructure needed for multiple genome centers (OICR & UNC) and how that was leveraged on Amazon's cloud. If you are interested in using SeqWare for your NGS analysis needs, see our open source site at http://seqware.github.com or, for a commercially supported version, see http://nimbusinformatics.com.
SeqWare on the Cloud: Porting a Genome Center's Infrastructure to Amazon Web Services
1. SeqWare on the Cloud: Porting a Genome Center's Infrastructure to Amazon Web Services
Brian O'Connor
SeqWare Software Architect & Manager for Software Engineering
The Ontario Institute for Cancer Research
2. Effective Scaling
[Diagram: an effective system scales along three dimensions: Integration & Sharing, Expertise, and Compute & Storage]
SeqWare was designed to scale in these ways
3. Effective Scaling
[Diagram: the same three scaling dimensions, with a callout noting that the Query Engine (see poster) addresses Integration & Sharing]
SeqWare was designed to scale in these ways
4. The Open Source SeqWare Project
[Diagram: five components: SeqWare Web Service, SeqWare Query Engine, SeqWare Portal, SeqWare Pipeline, and SeqWare MetaDB, deployed on a local cluster or the cloud and handling both big and small data]
5. Distinguishing Features of SeqWare
Compared with open-source/community tools (Firehose, Taverna) and commercial offerings, SeqWare provides:
● Infrastructure Toolkit
● Developer Framework
● Automation
● Environment-Agnostic
● Tailored for Big Projects
● User-Created Workflows
● Packaging Format
● Provenance Tracking
● Fault Tolerant
● Tools-Agnostic
● Open Source
6. Projects Using SeqWare
UNC Lineberger Cancer Center: Whole Genome, Targeted Re-Sequencing, RNASeq, + local projects (1.5 TBase, 927 samples, 982 “lanes”)
Ontario Institute for Cancer Research: Exome, Targeted Resequencing, Whole Genome, RNASeq, + local projects (38 TBase, 1,522 samples, 2,297 “lanes”)
Others: Iceman genome, plant genome assembly, HuRef 300x, clinical sequencing (9 genomes, hundreds of patient samples, JBrowse on a 300x genome on an iPad)
7. Scaling Expertise:
Analyzing Illumina Data @ OICR
● September 2011: rolled out SeqWare at OICR
● Goal: deploy SeqWare and streamline production analysis through automation
● 4 groups working together
● SeqWare Workflows for
– Large projects and common tasks
– Projects with “public uploads”
8. SeqWare at OICR
[Diagram: the SeqWare architecture (Web Service, Query Engine, Portal, Pipeline, MetaDB) as deployed at OICR on the local cluster and the cloud]
9. SeqWare at OICR
[Diagram: the SeqWare architecture, highlighting the components owned by the Software Engineering group]
10. SeqWare at OICR
[Diagram: the SeqWare architecture, highlighting the Pipeline & Tool Evaluation group's role]
11. SeqWare at OICR
[Diagram: the SeqWare architecture, with the sequencing facility's LIMS feeding the SeqWare MetaDB]
12. SeqWare at OICR
[Diagram: the SeqWare architecture, highlighting Production Informatics: user + data = “deciders” that drive SeqWare Pipeline from the MetaDB]
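The “deciders” mentioned on this slide are small automation jobs that match newly arrived data against workflows that have not yet run on it. A minimal sketch of that idea follows; the names here (`Sample`, `decide`, `run_decider`) are illustrative, not the real SeqWare API, and a real decider would query the SeqWare MetaDB rather than an in-memory list.

```python
# Hypothetical sketch of a SeqWare-style "decider": a scheduled job that
# scans metadata for samples with no completed run of a given workflow
# and launches that workflow for each of them.
from dataclasses import dataclass, field

@dataclass
class Sample:
    name: str
    completed_workflows: set = field(default_factory=set)

def decide(samples, workflow):
    """Return the samples that still need `workflow` run on them."""
    return [s for s in samples if workflow not in s.completed_workflows]

def run_decider(samples, workflow, launch):
    for s in decide(samples, workflow):
        launch(workflow, s)                   # schedule on the cluster/cloud
        s.completed_workflows.add(workflow)   # a real decider records this in the MetaDB

# Example: two samples, one already aligned
samples = [Sample("PCSI_0001", {"bwa-align"}), Sample("PCSI_0002")]
launched = []
run_decider(samples, "bwa-align", lambda wf, s: launched.append(s.name))
# only the second sample is launched
```

The key design point from the talk is that deciders make production analysis hands-off: data arriving from the LIMS automatically triggers the right workflows.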
13. OICR Production Workflows
Multiple groups contributed, including the new Pipeline and Evaluation Team
In production; staging, testing, & development workflows
14. OICR SeqWare Results
[Chart: samples and bases aligned over time (~2 years), reaching 38 trillion bases]
● Automated key components in production
● What about sharing our infrastructure?
● To the Cloud!
15. “The Cloud”
Want to share infrastructure without sharing infrastructure
Core Services: Elastic Cloud Compute, Simple Storage Service, Elastic Map/Reduce
Transfer & Web Services: Import/Export, DirectConnect, Elastic Beanstalk
Other Nifty Services: Glacier, DynamoDB, HBase, Linux tools for disk and file encryption
Hardware through API calls
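The “Linux tools for disk and file encryption” mentioned above can be illustrated with standard OpenSSL symmetric encryption before staging data to S3. This is a generic sketch, not the actual SeqWare tooling; the bucket name is hypothetical and the upload line is commented out because it needs AWS credentials.

```shell
set -e
# Stand-in for a real BAM file
echo "pretend-this-is-a-bam" > sample.bam

# Encrypt with AES-256 before it leaves the local cluster
openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in sample.bam -out sample.bam.enc -pass pass:s3cret

# aws s3 cp sample.bam.enc s3://my-bucket/incoming/   # hypothetical bucket

# Verify the round trip: decrypt and compare to the original
openssl enc -d -aes-256-cbc -pbkdf2 \
    -in sample.bam.enc -out roundtrip.bam -pass pass:s3cret
cmp sample.bam roundtrip.bam && echo "round-trip OK"
```

In practice the passphrase would come from a key file with restricted permissions rather than the command line, which is visible in the process table.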
16. Scaling Computation:
Analyzing SOLiD Data on Amazon
● The Life Collaborations Division had 9 human genomes, including the Iceman's genome and HuRef, resequenced at high depth
● Goal: deploy the SeqWare infrastructure on AWS and analyze the data in a scalable way
● Without building infrastructure
● Using open source tools
17. SeqWare Infrastructure on EC2
[Diagram: user-facing pieces (workflow bundle, command-line tools, launcher config; result files: BAM, VCF, reports) connect to the SeqWare MetaDB, Web Service, Pipeline, and Portal running on an Amazon EC2 instance or cluster, with data imported through Amazon S3]
18. Workflow Outputs
Results via project website on EC2 (http://icemangenome.net); variants loaded in a JBrowse genome browser on Elastic Beanstalk
Variants in a database and files in S3: variant database, BAM, VCF, annotated VCF
19. Results
● Cloud delivered fantastic computational and storage scalability
● Analyzed 9 human genomes, one at 300x!
● Costs
– 8-node HPC cluster, about 4 days
– 30x coverage genome was ~$1000 (<$15/GBase)
– ~$150 per exome ($10/GBase)
– ~$50/month/genome for storage, website, & browser
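The per-GBase figure above can be sanity-checked with back-of-the-envelope arithmetic, assuming a ~3.2 Gbase haploid human genome (the quoted $1000 is from the slide, not a new measurement):

```python
# Cost check for the ~$1000, 30x whole-genome figure quoted on the slide.
GENOME_GBASES = 3.2                      # haploid human genome, ~3.2 Gbase
coverage = 30
total_gbase = GENOME_GBASES * coverage   # ~96 GBase sequenced at 30x
cost_per_gbase = 1000 / total_gbase      # dollars per GBase
print(f"{total_gbase:.0f} GBase, ${cost_per_gbase:.2f}/GBase")
# about $10/GBase, consistent with the <$15/GBase quoted
```

The same arithmetic explains why a 300x genome was the expensive outlier: ten times the bases means roughly ten times the compute and storage cost.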
20. The Future of SeqWare
● Scalability
– Cloud-based cluster launching (StarCluster/CloudMan)
– Release encryption and distributed filesystem tools
– Better documentation and easier setup
● Expertise
– Simplify pipeline language(s) and development process
– Release OICR public workflows
● Integration
– Expand NoSQL variant/annotation database
– Support for other tools like Galaxy
21. Availability
● SeqWare available at: http://seqware.github.com, @SeqWare
● Distributed as a VirtualBox VM & AMI
● Brian O'Connor, boconnor@oicr.on.ca
22. Acknowledgements
● SeqWare @ OICR: Morgan Taschuk, Denis Yuen, Yong Liang
● OICR SeqProdBio: Tim Beck, Zheng Zha, Tony DeBat
● OICR Bioinformatics Core: Francis Ouellette, Zhibin Lu
● SeqWare @ UNC: Neil Hayes, Sara Grimm, Stuart Jefferys, Matt Solloway, and the Lineberger group
● Tim Harkins, Barry Merriman, Jason Warner, Kevin McKernan, Vincent Ferretti, Lincoln Stein