Researchers at Clemson University assigned a student summer intern to explore bioinformatics cloud solutions that leverage MPI, the OrangeFS parallel file system, AWS CloudFormation templates, and a Cluster Scheduler. The result was an AWS cluster that runs bioinformatics code optimized using MPI-IO. We give an overview of the process and show how easy it is to create clusters in AWS.
4. The one that meets the needs of your unique
application!
Some things to consider:
• Total amount of storage required?
• Resilience required?
• Expected number of clients?
• Locality of servers and clients?
• Average file sizes? (KB, MB, GB, TB)
• Block sizes used by applications?
• IO profile? Read/Write%?
• Typical IO use case?
6. Building Blocks
• Amazon Elastic Compute Cloud (Amazon EC2)
– 1 ECU to 88 ECU of compute power
– 613 MB to 240 GB of memory
– Shared network, EBS optimized, dedicated 10 Gb
• Amazon Simple Storage Service (Amazon S3)
– Unlimited capacity
– Web-scale
– Lifecycle management
7. Building Blocks
• Local storage (ephemeral)
– 150 GB to 3360 GB per instance
– HDD and SSD
– FREE! (part of instance cost)
• Amazon Elastic Block Store (Amazon EBS)
– 1 GB to 1000 GB per volume
– Standard and Provisioned IOPS
– Multiple volumes per instance
– Supports snapshot to Amazon S3
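Ephemeral disks typically arrive as raw, unformatted block devices, so they must be formatted and mounted before use. A minimal sketch follows; the device name `/dev/xvdb` and the `/ephemeral` mount point are assumptions that vary by instance type, so check `lsblk` on your instance first.

```shell
# Sketch: prepare an ephemeral (instance store) disk for scratch use.
# /dev/xvdb and /ephemeral are assumptions -- verify with lsblk first.
lsblk                              # list attached block devices
sudo mkfs -t ext4 /dev/xvdb        # create a filesystem on the ephemeral disk
sudo mkdir -p /ephemeral           # mount point used in later examples
sudo mount /dev/xvdb /ephemeral
df -h /ephemeral                   # confirm the mount and its capacity
```

Remember that ephemeral storage does not survive instance stop/termination, so anything worth keeping should be copied back to Amazon S3 or Amazon EBS.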
8. Storage-optimized EC2 instances
http://aws.amazon.com/ec2/instance-types/
"This family includes the HI1 and HS1 instance types, and
provides you with Intel Xeon processors and direct-attached
storage options optimized for applications with
specific disk I/O and storage capacity requirements."
• HI1 instances feature SSD storage
• HS1 instances feature direct-attached HDD
9. Amazon EBS optimized instances
http://aws.amazon.com/ebs/
"To enable your Amazon EC2 instances to fully utilize the IOPS provisioned on an EBS volume, you can launch selected Amazon EC2 instance types as “EBS-Optimized” instances."
10. What Are Your Needs?
• Temporary or long-term storage?
• Shared or per instance?
• How much?
• How fast?
11. Long-term storage
• Use Amazon S3
• Pull datasets when needed
• Easy to access using AWS CLI or API
$ aws s3 cp s3://mybucket/dataset/input /ephemeral/input
• Lifecycle to Amazon Glacier
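Lifecycling cold data to Amazon Glacier can be automated with a bucket lifecycle rule. A sketch using the AWS CLI follows; the bucket name, prefix, and 30-day threshold are assumptions for illustration.

```shell
# Sketch: automatically transition aged objects from S3 to Glacier.
# Bucket name, prefix, and the 30-day threshold are assumptions.
aws s3api put-bucket-lifecycle-configuration \
  --bucket mybucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-datasets",
      "Status": "Enabled",
      "Filter": {"Prefix": "dataset/"},
      "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}]
    }]
  }'
```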
12. Temporary Storage
• Local ephemeral for scratch
• Distributed filesystem for high-performance
scratch
– OrangeFS
– Lustre
– Ceph
• Pull data from Amazon S3
13. How much?
• With Amazon S3, you pay for what you use
• With Amazon EBS, you pay for what you
provision
• Keeping data in Amazon S3 and pulling only
what is needed helps manage cost
14. How fast?
• Ephemeral storage can deliver up to 2.2 GB/sec
– more instances == more throughput
• Amazon EBS volumes support up to 4000 IOPS
– more volumes == more IOPS
• Amazon S3 scales horizontally
– more clients == more throughput
– more connections == more throughput
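Since Amazon S3 throughput scales with the number of clients and connections, a simple way to exploit this is to fan a dataset's parts out across parallel downloads. A minimal sketch, where the bucket and key names are assumptions and the copy command is a parameter so the pattern can be dry-run:

```shell
# Sketch: fan S3 downloads out across parallel connections.
# parallel_fetch launches one copy per key in the background, then waits.
# Bucket and key names are assumptions for illustration.
parallel_fetch() {
  cmd=$1; shift                 # copy command, e.g. "aws s3 cp"
  for key in "$@"; do
    $cmd "s3://mybucket/dataset/$key" "/ephemeral/$key" &
  done
  wait                          # block until every background copy finishes
}

# Dry run with echo; in practice: parallel_fetch "aws s3 cp" part-1 part-2 ...
parallel_fetch echo part-1 part-2
```

Each backgrounded copy opens its own connection, which is exactly the "more connections == more throughput" behavior described above.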
15. Making filesystems persist
• Use Amazon EBS for block storage
• Use Amazon EBS snapshots for recovery
• Use a replicated distributed filesystem
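Snapshot-based recovery can be scripted with the AWS CLI. A sketch of the two halves of the cycle follows; the volume ID, snapshot ID, and availability zone are assumptions.

```shell
# Sketch: snapshot an EBS data volume, then restore from it later.
# Volume ID, snapshot ID, and AZ below are assumptions.
aws ec2 create-snapshot \
  --volume-id vol-0abc123 \
  --description "OrangeFS data volume backup"

# Later, create a fresh volume from the snapshot in the desired AZ:
aws ec2 create-volume \
  --snapshot-id snap-0def456 \
  --availability-zone us-east-1a
```

Because snapshots land in Amazon S3, they also give the filesystem a durability story independent of any single instance or volume.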
19. RNA-Seq Differential Gene Expression Workflow
Clemson University professor Dr. Alex Feltus had been discussing optimization of the gene expression workflow with Eddie Duffy and Dr. Barr Von Oehsen.
As a result, a summer project with Brandon Posey was started to pursue this optimization in the AWS cloud.
The longest processing steps were the FastQ steps, which is where the optimization started.
*Workflow chart provided with permission from Allele Systems (www.allelesystems.com)
20. OrangeFS – Scalable Parallel File System on AWS
Unified high-performance file system
[Diagram: OrangeFS instances backed by Amazon EBS volumes, with Amazon DynamoDB]
Available on the AWS Marketplace and brought to you by Omnibond
21. Cloud Cluster Built Using AWS, Torque/Maui, OrangeFS
Optimization Areas
• Data uploaded and retrieved via the OrangeFS WebDAV interface
• MPI jobs are submitted via the Torque & Maui scheduler
• All built with an AWS CloudFormation template
[Diagram: MPI-IO clients, Torque/Maui, OrangeFS WebDAV, OrangeFS servers, Amazon DynamoDB]
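Submitting an MPI job through Torque/Maui on a cluster like this amounts to a short batch script. The sketch below is illustrative only: the node counts, walltime, binary name, and OrangeFS mount path are assumptions, not the project's actual job script.

```shell
#!/bin/sh
# Sketch of a Torque submission script for an MPI job on the cluster.
# Node counts, walltime, paths, and the binary name are assumptions.
#PBS -N fastq-split
#PBS -l nodes=4:ppn=8
#PBS -l walltime=02:00:00

cd $PBS_O_WORKDIR
# Run across all allocated cores; input/output live on the shared OrangeFS mount
mpiexec -n 32 ./fastq_splitter /orangefs/input.fastq /orangefs/split/
```

The script would be submitted with `qsub`, and Maui handles scheduling it onto the MPI-IO client instances.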
28. RNA-Seq Differential Gene Expression Workflow
Optimization Areas
• Fast-Splitter rewritten in MPI-IO to leverage OrangeFS in AWS
• Merge-FastQ also rewritten in MPI-IO to leverage OrangeFS in AWS
*Workflow chart provided with permission from Allele Systems (www.allelesystems.com)