The document discusses next-generation DNA and RNA sequencing applications and how IBM Storwize V7000 Unified and SONAS Gateway storage solutions can enable them. It provides an overview of DNA, RNA, sequencing technologies, and analysis tools. It then describes the Storwize V7000 Unified system which combines block and file storage, and the SONAS Gateway which is based on IBM's GPFS and provides petabyte scalability. Architectural assumptions and performance tests show these solutions offer good performance for sequencing applications.
Table of Contents
Abstract
Introduction: DNA and RNA sequencing applications
  DNA, RNA, and next-generation sequencing (NGS) technologies
  Analysis tools
Introduction: IBM Storwize V7000 Unified and SONAS systems
  IBM Storwize V7000 Unified system overview
  IBM SONAS Gateway system overview
  Differences: IBM Storwize V7000 Unified and SONAS Gateway as NAS systems
Architectural assumptions
IBM Storwize V7000 Unified: Configurations, tests, and results
IBM SONAS Gateway: NAS configurations, tests, and results
File systems layout: Best practice recommendations
Solution benefits: IBM Storwize V7000 Unified and SONAS Gateway system
Summary
Acknowledgments
Appendices
  Appendix A: Typical server and storage configuration sizing recommendations
  Appendix B: Resources
About the authors
Trademarks and special notices
Enabling next-generation sequencing applications
Abstract
Next-generation genomic sequencing technologies have been instrumental in significantly
accelerating biological research and discovery of genomes for humans, mice, snakes, plants,
bacteria, viruses, cancer cells, and so on. Now, researchers process immense data sets, build
analytical deoxyribonucleic acid (DNA) models for large genomes, use reference-based analytic
methods, and further their understanding of genomic models for drug discovery, personalized
medicine, toxicology, forensics, agriculture, nanotechnology, and other emerging use cases.
IBM has now partnered with CLC bio Inc. to bring validated and integrated smart computing
solutions that combine intelligent Assembly Cell software and optimized open systems software
from public domains, together with IBM Smarter Storage. This joint solution incorporates IBM
industry expertise, best practices, and IBM technologies to help research institutions and
pharmaceutical companies manage, query, analyze, and better understand integrated
genotypic and phenotypic data for medical research and patient treatment.
This paper validates that IBM Storwize V7000 Unified and Scale Out Network Attached Storage
(SONAS) Gateway based Smarter Storage solutions offer good application performance and
availability for de novo assembly and reference-based mapping algorithms, under the following
circumstances:
• Access to genomic data from DNA and ribonucleic acid (RNA) sequences is configured on
  IBM Storwize V7000 Unified or SONAS Gateway solutions.
• The CLC Assembly Cell or open systems software applications are configured on Red Hat
  Enterprise Linux (RHEL) servers.
• The Network File System (NFS) v3 services are configured and delivered over the
  Internet Protocol (IP) network.
This paper offers recommendations and guidance to simplify configuration and installation of
the solution and to help ensure good performance.
Introduction: DNA and RNA sequencing applications
Genetic concepts and interesting facts
All humans, animals, plants, and other living organisms are composed of cells. In humans, animals, and
plants, a typical cell contains a nucleus. The nucleus is a self-contained unit that acts as a central entity,
managing the functions and activity of the cell. The nucleus contains most of the cell's genetic information,
organized as multiple long linear DNA molecules that combine with a large variety of proteins to
form chromosomes. The genes within these chromosomes make up the cell's genome. The function of the
nucleus is to maintain the integrity of these genes and control the cell activities. The nucleus is, therefore,
the control center of the cell.
Genes, DNA, and RNA
Genes are made up of various lengths of DNA, which contains four chemical bases: adenine (A), guanine (G),
cytosine (C), and thymine (T). These bases line up like beads on a necklace to form strands of
code. They also pair up with each other to form the double strands that are characteristic of DNA. The
sequence of a nucleic acid is the order of these bases along a strand; it summarizes the atoms that make up
the nucleic acid and the chemical bonds that join those atoms.
DNA is a nucleic acid containing the genetic instructions used in the development and functioning of all
known living organisms (with the exception of RNA viruses). The DNA segments carrying this genetic
information are called genes. Other DNA sequences have structural purposes, or are involved in
regulating the use of this genetic information. Along with RNA and proteins, DNA is one of the major
macromolecules that are essential for all known forms of life.
RNA is also a nucleic acid. Nucleic acids, lipids, carbohydrates, and proteins are the four major classes
of macromolecules essential for all known forms of life. Similar to DNA, RNA is made up of a long chain of
components called nucleotides. Each nucleotide consists of a nucleobase, a ribose sugar, and a
phosphate group. The sequence of nucleotides allows RNA to encode genetic information. In addition,
many viruses use RNA instead of DNA as their genetic material.
The chemical structure of RNA is very similar to that of DNA, with two differences: (a) RNA contains the
sugar ribose, while DNA contains the slightly different sugar deoxyribose (a type of ribose that lacks one
oxygen atom), and (b) RNA has the nucleobase uracil while DNA contains thymine. The other three
nucleobases namely, adenine (A), guanine (G), and cytosine (C) are the same in both DNA and RNA.
Unlike DNA, most RNA molecules are single-stranded and can adopt very complex three-dimensional
structures.
Human genome
The human genome is the complete set of human genetic information, stored as DNA
sequences in the 23 chromosome pairs of the cell nucleus, plus a small amount of DNA in the mitochondria,
which supply the chemical energy required for the cell to survive. The human genome is
estimated to be about 3.2 billion base pairs long and it contains about 20,000 to 25,000 distinct genes.
There are 23 chromosome pairs in each cell. The twenty-third pair is the sex-determining
pair. As in many animal species, a pair of X chromosomes determines a female, and
a combination of X and Y chromosomes determines a male. The X chromosome has more than 153
million base pairs and carries about 2000 of the 20,000 to 25,000 genes in the human genome (or
about 10% of the total gene population). The Y chromosome has about 58 million base pairs and
carries about 200 to 500 of the 20,000 to 25,000 genes in the human genome. The largest human
chromosome is chromosome 1, at approximately 220 million base pairs. The smallest
is the mitochondrial DNA, at approximately 16,000 base pairs.
DNA, RNA, and next-generation sequencing (NGS) technologies
In genetics, the sequencing processes determine the primary structure of an unbranched biopolymer. The
sequencing process results in a symbolic linear depiction (known as a sequence), which clearly
summarizes much of the atomic-level structure of the sequenced molecule.
DNA sequencing is the process of reading the nucleotide bases in a DNA molecule. It includes any
method or technology that is used to determine the order of the four bases—adenine, guanine, cytosine,
and thymine, (A,G,C,T)—in a strand of DNA.
RNA sequencing is the process of reading the nucleotide bases in an RNA molecule. It includes any
method or technology that is used to determine the order of the four bases—adenine, guanine, cytosine,
and uracil, (A,G,C,U)—in a strand of RNA.
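The two alphabets described above can be illustrated with a minimal sketch. The function names `is_valid` and `dna_to_rna` and the sample read are hypothetical and chosen only for illustration; the code simply checks a string against the four-letter DNA or RNA alphabet and rewrites thymine (T) as uracil (U).

```python
DNA_BASES = set("ACGT")  # adenine, cytosine, guanine, thymine
RNA_BASES = set("ACGU")  # adenine, cytosine, guanine, uracil


def is_valid(seq: str, alphabet: set) -> bool:
    """Return True if every symbol in seq belongs to the given alphabet."""
    return set(seq.upper()) <= alphabet


def dna_to_rna(seq: str) -> str:
    """Rewrite a DNA string in the RNA alphabet by replacing T with U."""
    if not is_valid(seq, DNA_BASES):
        raise ValueError("input is not a DNA sequence")
    return seq.upper().replace("T", "U")


read = "GATTACA"  # hypothetical sample read
print(is_valid(read, DNA_BASES))
print(dna_to_rna(read))
```

This is a notation exercise only; it does not model how a sequencing instrument reads bases.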
Next-generation sequencing technologies parallelize the sequencing process, producing thousands or
millions of sequences at a time. These technologies are intended to lower the cost of sequencing beyond
what is possible with standard dye-terminator methods. High-throughput sequencing technologies
generate millions of short reads from a library of nucleotide sequences; whether they come from DNA,
RNA, or a mixture, the sequencing mechanism of each platform does not vary.
Analysis tools
Next-generation sequencing technologies read the biological specimen or tissue sample and
produce hundreds of thousands (or even millions) of base pairs for analysis. As of 2012, a typical
sequencing run takes from a single day (24 hours) to a single week (168 hours) and generates
between 100 MB and 3 GB of data. Over the next few years, sequencing is expected to deliver
significantly more precise results in less time than current processes, methods, and
technologies allow.
There are several open source, high-performance next-generation sequencing tools, such as Burrows-Wheeler Aligner (BWA) and Trinity, that can analyze the genomic DNA and RNA data from the
sequencers. Under a commercial license, CLC bio offers the most comprehensive, high-performance
computing solution for the Life Sciences industry. This section describes all three applications.
CLC Assembly Cell
CLC Assembly Cell is available on a commercial license. It is a high-performance computing solution for
read mapping and de novo assembling of next-generation sequencing data. It includes native color-space
support.
The command-line interface (CLI) of CLC Assembly Cell enables the functionalities to be easily included in
scripts and other next-generation sequencing workflows.
CLC Assembly Cell uses single-instruction, multiple-data (SIMD) compute instructions to parallelize and
accelerate the assembly algorithms, making the program the fastest next-generation sequencing
assembler in the market. For reference, visit the following URL:
http://www.clcbio.com/wp-content/uploads/2012/09/CLCAssemblyCell12.pdf
Burrows-Wheeler Aligner
Burrows-Wheeler Aligner (BWA) is an open-source, high-performance tool, and is available freely, with no
software licensing restrictions. It is an efficient program that aligns relatively short nucleotide sequences
against a long reference sequence, such as the human genome. It implements two algorithms, BWA-SHORT and BWA-SW. The former works for query sequences shorter than 200 base pairs and the latter
for longer sequences up to around 100,000 base pairs. Both algorithms do gapped alignment. They are
usually more accurate and faster on queries with low error rates.
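The 200 base-pair rule above can be expressed as a small selection helper. This is a sketch under stated assumptions: the helper names are hypothetical, and the mapping of BWA-SHORT to the `aln` subcommand and BWA-SW to the `bwasw` subcommand, together with the `-t` thread option, reflects the BWA command line as commonly documented rather than anything stated in this paper. The command is only assembled, not executed.

```python
def choose_bwa_algorithm(read_length_bp: int) -> str:
    """Pick the algorithm name using the 200 bp threshold from the text."""
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    # The text describes BWA-SW as suited to reads up to around 100,000 bp.
    return "BWA-SHORT" if read_length_bp < 200 else "BWA-SW"


# Assumed mapping from algorithm name to BWA subcommand (not from the paper).
SUBCOMMAND = {"BWA-SHORT": "aln", "BWA-SW": "bwasw"}


def build_command(read_length_bp: int, threads: int,
                  reference: str, reads: str) -> list:
    """Assemble an argument list; 'reference' is an already indexed prefix."""
    algorithm = choose_bwa_algorithm(read_length_bp)
    return ["bwa", SUBCOMMAND[algorithm], "-t", str(threads), reference, reads]


# File names taken from the test tables in this paper; read length is hypothetical.
print(build_command(100, 8, "humangenome.fa", "reads_100m.fq"))
```

Returning an argument list rather than running the tool keeps the sketch usable for review before it is handed to a job scheduler or `subprocess.run`.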
Trinity
Trinity, developed at the Broad Institute (a collaboration of MIT and Harvard Universities), is also a widely
used open-source, high-performance tool. It represents a novel method for the efficient and robust
de novo reconstruction of transcriptomes from RNA-Seq data. Trinity combines three independent
software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of
RNA-Seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each
representing the transcriptional complexity at a given gene or locus, and then processes each graph
independently to extract full-length splicing isoforms and to tease apart transcripts derived from
paralogous genes.
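To make the de Bruijn graph idea concrete, the following is a minimal sketch of how such a graph can be built from short reads. It is not Trinity's implementation: the function name, the choice of k, and the two sample reads are hypothetical, and the Inchworm, Chrysalis, and Butterfly modules do considerably more than this.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and every k-mer
    found in a read adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    # Convert to plain, sorted lists so the result is easy to inspect.
    return {node: sorted(successors) for node, successors in graph.items()}


reads = ["ACGTC", "CGTCA"]  # two overlapping hypothetical reads
graph = de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", ", ".join(graph[node]))
```

Overlapping reads contribute the same edges, so the graph stays compact even when many reads cover the same region.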
NGS solution benefits at a glance:
The NGS tools are enabled, tested, validated, and certified, and are then included in optimized solutions
by IBM®. IBM has combined technology, industry expertise, best practices, and leading analytical partner
applications into a tightly integrated solution. With this solution, research institutions and pharmaceutical
companies can easily manage, query, analyze, and better understand integrated genotypic and
phenotypic data for medical research and patient treatment. They can:
• Organize, integrate, and manage different kinds of data to enable focused clinical
  research, including: diagnostic, clinical, demographic, genomic, phenotypic, imaging,
  environmental, and more.
• Enable secure, cross-department collection and sharing of clinical and research data.
• Ensure flexibility and growth with open and industry-standards based architecture.
Introduction: IBM Storwize V7000 Unified and SONAS systems
This section provides introductory details and highlights of IBM Storwize® V7000 Unified and IBM SONAS
Gateway systems.
IBM Storwize V7000 Unified system overview
Many users have deployed storage area network (SAN) attached storage for their applications requiring
the highest levels of performance while separately deploying network-attached storage (NAS) for its ease
of use and lower cost networking. This divided approach adds complexity by introducing multiple
management points and also creates islands of storage that reduce efficiency.
The Storwize V7000 Unified system provides the ability to combine both block and file storage into a single
system. By consolidating storage systems, multiple management points can be eliminated and storage
capacity can be shared across both types of access, helping to improve overall storage utilization. The
Storwize V7000 Unified system also presents a single, easy-to-use management interface that supports
both block and file storage, helping to simplify administration further.
The Storwize V7000 Unified system builds on the functions and high-performance design of the Storwize
V7000 system and integrates proven IBM software capabilities to deliver new levels of efficiency.
The Storwize V7000 Unified system provides the same software capabilities as the IBM SONAS system, as
follows:
• Massive scalability:
  − Supports billions of files (up to 21 petabytes of storage) in a single file system
  − Supports up to 256 file systems per single SONAS platform
• Flexibility:
  − Allows access to data in a single global namespace, allowing all users a single,
    logical view of files through a single drive letter such as a Z drive
  − Provides efficient distribution of files, images, and application updates and fixes
    to multiple locations quickly and cost effectively
  − Provides multiple storage tiers for flexible, efficient management of petabytes of
    files
  − Supports industry-standard protocols: Common Internet File System (CIFS),
    Network File System (NFS), File Transfer Protocol (FTP), Hypertext Transfer
    Protocol Secure (HTTPS), and Secure Copy Protocol (SCP)
• Performance: Leverages two dual-port (all ports active) 10 GbE interface cards offering
  high bandwidth and additional connectivity in each SONAS interface node to manage
  multiple data streams and functions (for example, backup, replication, antivirus).
• Data protection: File system and fileset-level snapshots (up to 256 per file system)
  provide a way to partition the namespace into smaller, more manageable units.
• Management: CLI and browser-based, simple, intuitive, and state-of-the-art administrative
  GUI provide icon-based navigation, informative graphics, and SONAS visualizations that
  streamline storage tasks and display real-time capacity, performance, and system health.
• Antivirus: Integrates with McAfee and Symantec Antivirus, enabling users to secure data
  from malware and use the most commonly deployed ISV antivirus applications.
  (Clarification, for purposes of this particular paper: In Life Sciences, the term virus has a
  separate meaning: an ultramicroscopic (20 to 200 nm in diameter) infectious agent
  that replicates within host cells and is composed of a DNA or RNA core and a protein coat.
  The authors do not refer to the Life Sciences definition in this paper.)
• Cloud features: Self-managing, autonomic system enables capacity, provisioning, and
  other IT service management decisions to be made dynamically, without human
  intervention or increased administrative costs. IBM Active Cloud Engine™ enables
  ubiquitous access to files from across the globe quickly and cost effectively.
• Operational savings and total cost of ownership (TCO):
  − Consolidates multiple individual filers and their management, thereby avoiding
    problems associated with administering an array of disparate NAS systems
  − Automates file placement by transparently moving files to another internal or
    external storage pool, optimizes your storage resources, and offers tremendous
    time and cost savings in administering petabytes of files
  − Helps conserve floor space (up to a petabyte of data in less than a square
    meter), is highly scalable and can help reduce capital expenditure and enhance
    operational efficiency; its advanced architecture virtualizes and consolidates file
    space into a single, enterprise-wide file system, which can translate into reduced
    TCO
IBM SONAS Gateway system overview
The IBM SONAS Gateway system is designed to manage vast repositories of information in enterprise
environments requiring very large capacities, high levels of performance, and high availability.
SONAS Gateway uses a mature technology from the IBM high-performance computing (HPC) experience.
It is based upon the IBM General Parallel File System (IBM GPFS™), a highly scalable clustered file
system. SONAS Gateway is an easy-to-install, turnkey, modular, scale out NAS solution. It provides the
performance, clustered scalability, high availability, and functionality that are essential for meeting
strategic multi-petabyte age and cloud storage requirements.
SONAS Gateway currently offers the following features and capabilities:
• Massive scalability:
  − Supports billions of files (up to 21 petabytes of storage) in a single file system
  − Supports up to 256 file systems per single SONAS platform
• Flexibility:
  − Allows access to data in a single global namespace, allowing all users a single,
    logical view of files through a single drive letter such as a Z drive
  − Provides efficient distribution of files, images, and application updates and fixes
    to multiple locations quickly and cost effectively
  − Provides multiple storage tiers for flexible, efficient management of petabytes of
    files
  − Supports industry-standard protocols: CIFS, NFS, FTP, HTTPS, and SCP
• Performance: Leverages two dual-port (all ports active) 10 GbE interface cards offering
  high bandwidth and additional connectivity in each SONAS interface node to manage
  multiple data streams and functions (for example, backup, replication, antivirus).
• Data protection: File system and fileset-level snapshots (up to 256 per file system)
  provide a way to partition the namespace into smaller, more manageable units.
• Management: CLI and browser-based, simple, intuitive, and state-of-the-art administrative
  GUI provide icon-based navigation, informative graphics, and SONAS visualizations that
  streamline storage tasks and display real-time capacity, performance, and system health.
• Antivirus: Integrates with McAfee and Symantec Antivirus, enabling users to secure data
  from malware and use the most commonly deployed ISV antivirus applications.
  (Clarification, for purposes of this particular paper: In Life Sciences, the term virus has a
  separate meaning: an ultramicroscopic (20 to 200 nm in diameter) infectious agent
  that replicates within host cells and is composed of a DNA or RNA core and a protein coat.
  The authors do not refer to the Life Sciences definition in this paper.)
• Cloud features: Self-managing, autonomic system enables capacity, provisioning, and
  other IT service management decisions to be made dynamically, without human
  intervention or increased administrative costs. IBM Active Cloud Engine enables
  ubiquitous access to files from across the globe quickly and cost effectively.
• Operational savings and TCO:
  − Consolidates multiple individual filers and their management, thereby avoiding
    problems associated with administering an array of disparate NAS systems
  − Automates file placement by transparently moving files to another internal or
    external storage pool, optimizes your storage resources, and offers tremendous
    time and cost savings in administering petabytes of files
  − Helps conserve floor space (up to a petabyte of data in less than a square
    meter), is highly scalable and can help reduce capital expenditure and enhance
    operational efficiency; its advanced architecture virtualizes and consolidates file
    space into a single, enterprise-wide file system, which can translate into reduced
    TCO
Differences: IBM Storwize V7000 Unified and SONAS Gateway as NAS systems
The difference between the IBM Storwize V7000 Unified and SONAS Gateway systems lies in the
workloads that each system can support. The Storwize V7000 Unified system can support smaller and
medium-size workloads, while the SONAS Gateway system has the scalability to deliver high performance
for extremely large application workloads and capacities, typically for the entire enterprise.
Table 1 offers the comparative product positioning between the Storwize V7000 Unified and SONAS
systems:
| No. | Attribute | Storwize V7000 Unified | SONAS |
|-----|-----------|------------------------|-------|
| 1 | Maximum number of interface nodes | 2 | 30 |
| 2 | Maximum number of storage nodes | N/A | 60 |
| 3 | Maximum raw capacity of file storage | 360 TB (3 TB drives x 12 drives per expansion unit x 10 expansion units) | 21.6 PB (3 TB drives x 240 drives x 30 controllers) |
| 4 | Maximum size of single shared file system (GPFS) | 8 PB | 8 PB |
| 5 | Maximum number of file systems within a single system | 64 | 256 |
| 6 | Maximum size of a single file | 8 PB | 8 PB |
| 7 | Maximum number of files per storage system | 4 billion | 4 billion |
| 8 | Maximum number of dependent file sets per file system | 256 | 3000 |
| 9 | Maximum number of independent file sets | 256 | 1000 |
| 10 | Maximum number of independent file sets | 256 | 1000 |
Table 1: Comparative product positioning of Storwize V7000 Unified and SONAS Gateway systems
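The raw-capacity figures in row 3 of Table 1 follow directly from the multiplication given in the table. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the conversion assumes decimal units (1 PB = 1000 TB), which is what the table's own figures imply.

```python
def raw_capacity_tb(drive_tb, drives_per_unit, units):
    """Raw capacity in TB: drive size x drives per unit x number of units."""
    return drive_tb * drives_per_unit * units


# Storwize V7000 Unified: 3 TB drives x 12 drives per expansion unit x 10 units
storwize_tb = raw_capacity_tb(3, 12, 10)

# SONAS: 3 TB drives x 240 drives x 30 controllers
sonas_tb = raw_capacity_tb(3, 240, 30)

print(storwize_tb, "TB")
print(sonas_tb / 1000, "PB")  # assumes 1 PB = 1000 TB
```

These are raw figures only; usable capacity after RAID and file system overhead is not addressed by the table and is not estimated here.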
Architectural assumptions
Make a note of the following architectural assumptions and caveats in regard to the technical content of
this paper.
This paper does:
• Offer information and recommendations for tuning adjustments to achieve good
performance in normal NAS production environments.
• Allow a non-technical customer or user to quickly tune their NAS environment by using
recommendations, observations, tips, and best practices, as documented.
• Provide information from a non-technical user point of view for fast implementations.
This paper does not:
• Explain the various technologies and solutions to establish or publish any benchmarks.
• Guarantee a specific performance of any technical element.
• Provide or offer any information to overcome previously established benchmarks.
• Explain or explore newer technologies, standards, and concepts such as 40 GbE
connections, NFS V4, cloud multi-tenancy and so on.
• Offer any guidance on how to determine hardware sizing or capacity planning for your
installation.
Caveats:
• Use your own judgment in making decisions.
• Do not take any published numbers literally.
• For this paper, the tests were run on different IBM equipment located at different IBM
  data centers. Performance results might vary, depending on unique server and
  client conditions, architectural configurations, network behaviors, application
  dependencies, and operational environments. Your performance might vary
  from the test results.
• Recommended best practices sometimes differ from the test configurations. The test
  configurations were set up to observe certain behavior in specific test situations. The best
  practices are recommended to run operations in production environments.
IBM Storwize V7000 Unified: Configurations, tests, and results
Configuration and tests
An IBM Storwize V7000 Unified system was tested with the three NGS applications: CLC Assembly Cell,
BWA, and Trinity. The connection between the Storwize V7000 Unified system and the single application
server was set up as a NAS-attached configuration. This configuration represents a typical use case for a
small research facility with minimal compute resources, as shown in Figure 1.
Figure 1: NAS-attached Storwize V7000 Unified configuration with NGS applications for a small research facility
Test results with CLC Assembly Cell
The following tables summarize the results of successful testing of de novo assembly and reference
assembly with the CLC Assembly Cell software, BWA application software, and Trinity application
software using identical server and storage configurations, as demonstrated in Figure 1.
Test results with the BWA application
When the BWA application was run on the same server with the same Storwize V7000 Unified system as
the storage back-end, the following test results were obtained, as shown in Table 5.
| Input | Threads | Storwize V7000 Unified (no cache) [1] | Storwize V7000 Unified (with cache) [2] | Local |
|-------|---------|----------------------------------------|------------------------------------------|-------|
| BWA: reads_100m.fq compared with humangenome.fa | 8 | 44 min 46 s | 44 min 57.483 s | 44 min 59 s |
| | 16 | 26 min 29 s | 25 min 38.118 s | 26 min 38 s |
| | 24 | 20 min 50 s | 20 min 9.843 s | 21 min 34 s |
| | 32 | 18 min 40 s | 19 min 0.600 s | 18 min 40 s |
| | 64 | 24 min 58 s | 26 min 47.676 s | 26 min 20 s |
Table 5: BWA performance results with various file system options
[1] Mount options: rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
[2] Mount options: rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103
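To compare the Table 5 timings more easily, the durations can be converted to seconds and expressed as a speedup relative to the 8-thread run. The sketch below does this for the no-cache column only; the helper name `to_seconds` and the dictionary layout are hypothetical, while the timing strings are copied from Table 5.

```python
import re


def to_seconds(text):
    """Convert a duration such as '44 min 57.483 s' to seconds."""
    match = re.fullmatch(r"(\d+) min (\d+(?:\.\d+)?) s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Storwize V7000 Unified, no-cache column of Table 5, keyed by thread count.
no_cache = {
    8: "44 min 46 s",
    16: "26 min 29 s",
    24: "20 min 50 s",
    32: "18 min 40 s",
    64: "24 min 58 s",
}

baseline = to_seconds(no_cache[8])
for threads, duration in no_cache.items():
    speedup = baseline / to_seconds(duration)
    print(f"{threads:>2} threads: {speedup:.2f}x relative to 8 threads")

# Thread count with the shortest run time in this column.
best = min(no_cache, key=lambda n: to_seconds(no_cache[n]))
print("fastest:", best, "threads")
```

In this column the run time improves up to 32 threads and is longer again at 64 threads.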
Test results with the Trinity application
When the Trinity application was run on the same server with the same Storwize V7000 Unified system as
the storage back-end, the following test results were obtained, as shown in Table 6.
| Mount options | Duration |
|---------------|----------|
| fm1p1:/ibm/gpfs_15k/ngsfs on /gpfs0 type nfs (rw,addr=9.11.82.103) | 869 min 4.618 s |
| fm1p1:/ibm/gpfs_15k/ngsfs on /gpfs0 type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103) | 866 min 47.335 s |
| rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0 | More than 4 days |
| /dev/sdb on /xfs type xfs (rw,nobarrier) | 787 min 54.544 s |
| 9.11.83.71:/ibm/gpfs1/Life_sciences_bak on /NGS type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.83.71) | 875 min 36.453 s |
| rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0 | More than 4 days |
Table 6: Trinity application performance results with various file system mount point options
Note: Run times are long because the Trinity application creates millions of files, ranging from 0 MB to 10 MB in
size. This is typical behavior for the Trinity application.
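Because the two runs that took more than four days both used the `noac` and `actimeo=0` options, it can help to check a mount option string before launching a long job. The following is a minimal sketch: the function names are hypothetical, the option strings are taken from Table 6 with the trailing `0 0` of the original listing left out (assumed here to be separate fstab fields), and the check relies on the standard NFS client meaning of `noac` and `actimeo=0`, which turn off attribute caching.

```python
def parse_mount_options(option_string):
    """Split a comma-separated mount option string into a dictionary.
    Flags such as 'rw' map to True; 'key=value' items map to the value."""
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


def disables_attribute_cache(options):
    """True if the options turn off NFS attribute caching."""
    return "noac" in options or options.get("actimeo") == "0"


# Option sets from Table 6 (trailing fstab fields omitted).
slow = "rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0"
fast = "rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600"

for name, text in (("slow", slow), ("fast", fast)):
    flagged = disables_attribute_cache(parse_mount_options(text))
    print(name, "disables attribute cache:", flagged)
```

A check like this only flags the option; whether uncached attributes are acceptable still depends on the application, as the BWA results in Table 5 show little difference between the two option sets.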
IBM SONAS Gateway: NAS configurations, tests, and results
Configurations and tests
An IBM SONAS Gateway system was tested with the three NGS applications: CLC Assembly Cell, BWA,
and Trinity. The connectivity between the SONAS Gateway system and 14 IBM BladeCenter® blade
servers was configured as a NAS-attached configuration. The blade servers represented application
services. This configuration was a typical use case for a medium- to large-research facility, with adequate
compute and performance resources, as shown in Figure 2.
Figure 2: NAS-attached SONAS Gateway configuration with NGS applications for a medium- to large-research facility
Test results with CLC Assembly Cell
The following tables summarize the results of successful testing of de novo assembly and reference
assembly with the CLC Assembly Cell software, BWA application software, and Trinity application
software using identical server and storage configurations, as demonstrated in Figure 2.
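| Input | Cores (threads) | SONAS Gateway (minutes)* |
|-------|-----------------|--------------------------|
| gz-fastq | 16 (16) | 573 |
| gz-fastq | 16 (32) | 439 |
| fasta | 16 (16) | 547 |
| fasta | 16 (32) | 406 |
Table 7: CLC Assembly Cell performance results with de novo assembly using the non paired-end option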
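| Input | Cores (threads) | SONAS Gateway (minutes)* |
|-------|-----------------|--------------------------|
| gz-fastq | 16 (16) | 591 |
| gz-fastq | 16 (32) | 449 |
| fasta | 16 (16) | 588 |
| fasta | 16 (32) | 437 |
Table 8: CLC Assembly Cell performance results with de novo assembly using paired-end information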
| CLC Assembly Cell | Cores (threads) | SONAS Gateway (minutes)* |
|-------------------|-----------------|--------------------------|
| Version 4 | 16 (32) | 148 |
Table 9: CLC Assembly Cell performance results with paired-end reference mapping information
* Mount options: rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
Test results with BWA application
Table 10 summarizes the results of successful testing with the BWA application software.
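| Input | Threads | SONAS Gateway Hx9203** (no cache) | SONAS Gateway Hx9201** (no cache) | SONAS Gateway Hx9201*** (cache) |
|-------|---------|------------------------------------|------------------------------------|----------------------------------|
| BWA: reads_100m.fq vs humangenome.fa | 8 | 50 min 31 s | 43 min 58.889 s | 44 min 6.455 s |
| | 16 | 30 min 7 s | 25 min 30.709 s | 26 min 30.983 s |
| | 24 | 26 min 45 s | 22 min 49.733 s | 22 min 17.789 s |
| | 32 | 25 min 46 s | 22 min 38.095 s | 23 min 14.785 s |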
| Server | Duration |
|--------|----------|
| Hx9202 | 26 min 34 s |
| Hx9205 | 30 min 14 s |
| Hx9206 | 30 min 50 s |
| Hx9207 | 30 min 1 s |
| Hx9208 | 31 min 12 s |
| Hx9210 | 30 min 32 s |
| Hx9211 | 32 min 13 s |
| Hx9212 | 30 min 54 s |
Table 10: Results of successful testing of BWA applications on 14 servers attached to the SONAS Gateway system
The following mount options were documented, with the results as listed in Table 10.
** rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
*** rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103
Test results with Trinity application
When the Trinity application was run in the same configuration, with the SONAS Gateway system as
the storage back-end, the following test results were obtained, as shown in Table 11:
| Mount options | Duration |
|---------------|----------|
| 172.26.39.180:/ibm/gpfs136gb_15k on /lifesci type nfs (rw,addr=172.26.39.180) | 804 min 2.373 s |
| 172.26.39.180:/ibm/gpfs136gb_15k on /lifesci type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=172.26.39.180) | 812 min 7.432 s |
| 172.26.39.180:/ibm/gpfs136gb_15k on /lifesci type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=172.26.39.180) | 760 min 50.722 s; 774 min 20.714 s |
Table 11: Results of successful testing of Trinity Application on 14 servers attached to SONAS Gateway system
File systems layout: Best practice recommendations
To ensure good application performance with optimal runtimes when servers are connected over a 10 GbE
network to IBM Storwize V7000 Unified or SONAS Gateway systems, note the following considerations:
• Proper sizing and stability of application nodes and servers is extremely important to drive
the required workload levels for different types of algorithms, such as de novo assembly
or reference-based mapping.
• Proper selection of valid mount options affects the performance and runtime
characteristics of NGS applications. Incorrect mount options result in long-running jobs,
because these jobs create millions of files ranging from 0 MB to 10 MB in size.
• None of these applications was observed to saturate the network, the IBM Storwize
V7000 Unified system, or the IBM SONAS Gateway system.
For improved performance in a typical production environment, lay out the file systems for
NGS applications according to the following guidelines and best practice recommendations:
• Different NGS applications require different mount options for increased
performance and optimal response time.
• Create the GPFS file system on the SONAS Gateway or Storwize V7000 Unified system by using the
cluster method of creating the block allocation maps to achieve uniform disk
performance across all storage capacities.
• Create the GPFS file system on the SONAS Gateway or Storwize V7000 Unified system by using
logfileplacement value = striped to stripe the log file of the file system across all
metadata disks.
• Use a block size of 256 K for both short-term and long-term storage.
• As a best practice, run all RHEL 6.2 servers with dual 10 GbE bonded network channel
connections, with MTU=9000.
• To support various NGS application workloads, two interface nodes are recommended on
the Storwize V7000 Unified system for increased availability.
• To support various NGS application workloads, at least two interface nodes are
recommended on the SONAS Gateway system for increased availability.
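The recommendations above can be captured as a simple checklist. The following Python sketch is a hypothetical illustration: the dictionary keys are invented names for this example and do not correspond to actual GPFS, SONAS, or Storwize V7000 Unified command options. It compares a planned configuration against the recommended values and reports any deviations.

```python
# Recommended values taken from the guidelines above.
# The key names are hypothetical and are not real CLI option names.
RECOMMENDED = {
    "block_size_kib": 256,            # block size of 256 K
    "block_allocation": "cluster",    # cluster method for block allocation maps
    "log_file_placement": "striped",  # log file striped across metadata disks
    "mtu": 9000,                      # jumbo frames on the 10 GbE links
    "bonded_10gbe_links": 2,          # dual bonded 10 GbE connections
}
MIN_INTERFACE_NODES = 2


def check_layout(config):
    """Return a list of deviations from the recommended settings."""
    issues = []
    for key, expected in RECOMMENDED.items():
        actual = config.get(key)
        if actual != expected:
            issues.append(f"{key}: expected {expected!r}, found {actual!r}")
    nodes = config.get("interface_nodes", 0)
    if nodes < MIN_INTERFACE_NODES:
        issues.append(
            f"interface_nodes: expected at least {MIN_INTERFACE_NODES}, found {nodes}"
        )
    return issues


if __name__ == "__main__":
    planned = dict(RECOMMENDED, block_size_kib=1024, interface_nodes=1)
    for issue in check_layout(planned):
        print(issue)
```

A check like this is only a planning aid. The actual values must be set with the file system creation and network configuration tools of the respective system.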
Solution benefits: IBM Storwize V7000 Unified and SONAS Gateway system
Both SONAS and Storwize V7000 Unified systems offer the following significant benefits for clients
running NGS applications for efficient analysis of genomic data from DNA and RNA sequences:
• Easily examine a large group of potential gene candidates by using typical applications,
such as BLAST, linkage analysis, and Mascot, that can quickly search and rapidly screen
targets in genomic databases, genomes, and assays.
• Efficiently create targeted drug treatments. Easily enable the scale-up development of
new drug molecules developed through Drug Discovery (Research, Synthesis),
Pre-Clinical Development (Preparation, Formulation, Pre-dosage design), and Pre-FDA (new drug
formulation, standards).
• Deliver tight integration between ERP and pharma supply chains. SONAS easily
supports pharmaceutical processes that scale up APIs (active pharmaceutical
ingredients) from milligram to kilogram quantities for commercial manufacturing and
distribution of drugs, with improved visibility into process optimization and consistent yield
variability across batches.
• Lower TCO by efficiently reducing drug discovery costs through use and reuse of databases,
common analytical data, processes, and standards throughout the pharmaceutical
operational chain.
• Deliver on-demand cloud computing models to rapidly address changing levels of
analytical computational capacities and facilitate self-service of analytical tools, pooling of
analytic, research development, and pharmaceutical manufacturing resources, and common
and scalable transactional processes and standards.
Summary
This paper validates that IBM Storwize V7000 Unified and SONAS Gateway based solutions offer good
application performance with excellent virtualization and availability under the following circumstances:
• Access to genomic data from DNA and RNA sequences is configured on the IBM Storwize
V7000 Unified or SONAS Gateway system.
• The CLC Assembly Cell or open systems software applications are configured on RHEL
servers.
• The NFS v3 services are configured and delivered over the IP network.
This paper offers recommendations and guidance to simplify configuration and installation of the
solution and to ensure good performance.
Acknowledgments
Special thanks to the teams from CLC bio in Denmark for loaning the software licenses of the CLC
Assembly Cell software, which enabled the IBM test team to create a representative operational test
environment in IBM data centers and run tests to document real-life results.
Many thanks to the IBM client executives, IBM Systems and Technology Group members, and other team
members who contributed their recommendations during the test run and review process, and
enabled successful completion and validation so that CLC bio software applications can run successfully
over various environments facilitated by IBM Storwize V7000 Unified and SONAS Gateway systems.
The IBM team also extends special thanks to Connie Borton, Michael Nelson, Cathy Drews,
Daniel Drinnon, and Larry Garibay for their invaluable help and assistance, without which the software
validation of three different software applications would not have been successful.
Appendices
Appendix A: Typical server and storage configuration sizing recommendations
This section provides typical server and storage configuration sizing recommendations
for small, medium, and large research facilities. While this information is typical, the authors understand
that organizations differ in terms of the following criteria:
• Number and types of sequencers in the facility
• Types of genomes being worked on in the laboratory or organization
• Processes being pursued within the organization, such as reference mapping,
assembly, transcription, or downstream analytics
• The amount of data that is required to be kept active
• The amount of data that is required to be kept archived
• The response time required in terms of the number of genomes per day, per week, or per
month
• Many other factors
Tier 1: 1 to 2 human-size genomes per week, for both de novo and reference-based mapping
Single server and internal storage configuration
• IBM System x3750 with 2.4 GHz E5-4640, ½ TB RAM, 16 TB internal disks
• 4 sockets, 32 cores, 2.4 GHz Intel® Xeon® processor E5-4640
• 32 x 16 GB 1600 MHz DDR3 DIMMs
• 16 x 2.5-inch 1 TB SAS drives
Tier 2: 3 to 10 human-size genomes per week, or need for more than 15 TB online, for both de novo and
reference-based mapping
Multiple server and external storage configuration
• IBM BladeCenter HS23 frame with 14 blade servers. Each blade server is
configured with a 2.6 GHz Intel Xeon processor E5-2670, 128 GB RAM, and dual 10 GbE
connection ports
• 2 sockets, 16 cores, 2.6 GHz Intel Xeon processor E5-2670
• 16 x 8 GB 1333 MHz DDR3 DIMMs
• 2 x 2.5-inch 300 GB SAS drives
• 96 x 2.5-inch 600 GB 10 K rpm drives in four enclosures of IBM Storwize V7000 Unified.
The IBM Storwize V7000 Unified system can host up to 10 enclosures. Therefore, if
more capacity is needed in the future, more disks can be added in the remaining six
enclosures.
• 1 external switch (or customer-supplied switch) to support 10 GbE connectivity
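The component counts for Tier 1 and Tier 2 imply aggregate figures that are easy to verify. The following Python sketch simply multiplies out the quantities listed above. The function names are hypothetical, the storage figure is raw capacity before any RAID, spare, or file system overhead, and the expansion case assumes that added enclosures hold the same 24 x 600 GB drives as the first four, which the paper does not state.

```python
def tier1_totals():
    """Aggregate resources of the single-server Tier 1 configuration."""
    return {
        "cores": 32,                 # 4 sockets, 32 cores in total
        "ram_gb": 32 * 16,           # 32 x 16 GB DIMMs
        "internal_disk_tb": 16 * 1,  # 16 x 1 TB SAS drives
    }


def tier2_totals(blades=14, enclosures=4):
    """Aggregate resources of the Tier 2 blade and external storage configuration."""
    drives_per_enclosure = 96 // 4   # 96 drives spread over four enclosures
    drives = drives_per_enclosure * enclosures
    return {
        "cores": blades * 16,                   # 16 cores per blade
        "ram_gb": blades * 16 * 8,              # 16 x 8 GB DIMMs per blade
        "drives": drives,
        "raw_storage_tb": drives * 600 / 1000,  # 600 GB drives, decimal TB
    }


if __name__ == "__main__":
    print("Tier 1:", tier1_totals())
    print("Tier 2 as configured:", tier2_totals())
    print("Tier 2 fully expanded:", tier2_totals(enclosures=10))
```

The Tier 1 figures match the ½ TB RAM and 16 TB internal disk stated for the server. For Tier 2, the 14 blades provide 224 cores in total, and the four enclosures hold 57.6 TB of raw disk capacity.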
Tier 3: More than 10 human-size genomes per week, or need for more than 100 TB online, or need for
downstream analysis
This is a custom configuration. Contact IBM.
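The tier boundaries described in this appendix can be expressed as a small decision rule. The Python sketch below is one possible reading of those boundaries, with a hypothetical function name and parameters. It is a first-pass guide only and does not replace a sizing exercise that accounts for the other criteria listed above.

```python
def recommend_tier(genomes_per_week, online_tb, downstream_analysis=False):
    """Return the sizing tier (1, 2, or 3) suggested by the appendix thresholds."""
    # Tier 3: more than 10 genomes per week, more than 100 TB online,
    # or a need for downstream analysis.
    if genomes_per_week > 10 or online_tb > 100 or downstream_analysis:
        return 3
    # Tier 2: 3 to 10 genomes per week, or more than 15 TB online.
    if genomes_per_week >= 3 or online_tb > 15:
        return 2
    # Tier 1: 1 to 2 genomes per week on a single server with internal storage.
    return 1


if __name__ == "__main__":
    for workload in [(2, 10), (5, 10), (2, 20), (12, 10)]:
        print(workload, "-> Tier", recommend_tier(*workload))
```

For example, a laboratory that sequences two genomes per week but needs 20 TB online falls into Tier 2 under this rule, because the online capacity exceeds what the single-server configuration provides.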
Appendix B: Resources
The following websites provide useful references to supplement the information contained in this paper:
• Introduction to Genetics
en.wikipedia.org/wiki/Introduction_to_genetics
• DNA
en.wikipedia.org/wiki/DNA
• DNA Sequencing
en.wikipedia.org/wiki/DNA_sequencing
• Cell Nucleus
en.wikipedia.org/wiki/Cell_nucleus
• Human Genome
en.wikipedia.org/wiki/Human_genome_map
• RNA
en.wikipedia.org/wiki/RNA
• RNA-Seq
en.wikipedia.org/wiki/RNA-Seq
• X Chromosome
en.wikipedia.org/wiki/X_chromosome
• Y Chromosome
en.wikipedia.org/wiki/Y_chromosome
• Trinity
www.broadinstitute.org/scientific-community/software/trinity
• Burrows-Wheeler Aligner
bio-bwa.sourceforge.net/
• CLC bio Applications
www.clcbio.com/
• IBM Redbooks®
ibm.com/redbooks
• IBM Publications Center
www.elink.ibmlink.ibm.com/public/applications/publications/cgibin/pbi.cgi?CTY=US
• IBM Scale Out Network Attached Storage Architecture, Planning and Implementation
Basics [SG24-7875-00]
ibm.com/redbooks/abstracts/sg247875.html?Open
• IBM Scale Out Network Attached Storage Concepts [SG24-7874-00]
ibm.com/redbooks/abstracts/sg247874.html?Open
• IBM Storwize V7000 Introduction and Implementation Guide [SG247938]
ibm.com/redbooks/redpieces/abstracts/sg247938.html?Open
About the authors
Dr. Tzy-Hwa K. (Kathy) Tzeng is a Senior Technical Staff Member (STSM) for IBM Systems and
Technology Group ISV Strategy and Enablement Organization. She received her Ph.D. in Genetics and
Plant Pathology from Iowa State University. Prior to IBM, she led drug discovery projects in bioinformatics,
proteomics, and genomics. At IBM, she is responsible for the strategy and content of IBM Life Sciences
application plans, portfolio, and product positioning. You can reach Dr. Kathy Tzeng at tzy@us.ibm.com.
Dr. Ruzhu Chen is an IBM Certified Expert IT Specialist for IBM Systems and Technology Group,
focusing on computational chemistry and NGS applications. Over the last ten years, he has successfully
tuned, benchmarked, and optimized solutions for IBM worldwide partners and customers. Ruzhu earned
his master's degree in Biochemistry from the University of Science and Technology of China, a second
master's degree in Computer Science, and a Ph.D. in Molecular Biology, both from the University of
Oklahoma. You can reach Dr. Ruzhu Chen at ruzhuchen@us.ibm.com.
Justin Morosi is a Consulting IT Specialist working for IBM Systems and Technology Group as a
Worldwide Technical Architect focusing on HPC solutions. He has worked for IBM for over 14 years and
has more than 20 years of consulting and solution design experience. He holds numerous industry-recognized
certifications from Cisco, Microsoft®, VMware, Red Hat, and IBM. His areas of expertise
include high-performance computing/storage, high availability, disaster recovery, and virtualization. You
can reach Justin Morosi at jmorosi@us.ibm.com.
Prashant Avashia is a software engineer in IBM Systems and Technology Group ISV Strategy and
Enablement Organization. With more than 15 years of experience, he has successfully architected,
engineered, and implemented enterprise infrastructure solutions for key global clients in healthcare,
financial, and software industries. He earned his master's degree in Industrial Engineering from Kansas
State University, and a bachelor's degree in Mechanical Engineering from Osmania University, India. You
can reach Prashant Avashia at pavashia@us.ibm.com.
presented here to communicate IBM's current investment and development activities as a good faith effort
to help with our customers' future planning.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled
environment. The actual throughput or performance that any user will experience will vary depending upon
considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the
storage configuration, and the workload processed. Therefore, no assurance can be given that an
individual user will achieve throughput or performance improvements equivalent to the ratios stated here.
Photographs shown are of engineering prototypes. Changes may be incorporated in production models.
Any references in this information to non-IBM websites are provided for convenience only and do not in
any manner serve as an endorsement of those websites. The materials at those websites are not part of
the materials for this IBM product and use of those websites is at your own risk.