1. Taming Big Science Data Growth with
Converged Infrastructure
©2014 BioTeam, Inc. All Rights Reserved.
Real-world strategies and implementation details for building converged storage
infrastructure to support the performance, scalability and collaborative
requirements of today's NGS workflows.
Aaron D. Gardner
Senior Scientific Consultant, BioTeam, Inc.
aaron@bioteam.net
BIOTEAM
Enabling Science
2. 1. Introduction
2. The State of NGS Data Analysis
3. Converged Infrastructure
4. Solutions to Support NGS Data Analysis
3. About Myself
Who am I?
A computer engineer who spent the last 14 years with biologists in situ
Exposed to NGS in 2005
Have worked (for better or worse) with most NGS platforms and data types
Along the way learned bioinformatics, data management, HPC, storage, and general research cyberinfrastructure
Desire to help the broader life sciences community led me to BioTeam
14 Years Later…
4. About BioTeam
Who are we?
Independent consulting shop
Staffed by scientists forced to learn IT, SW & HPC to get our own research done
12+ years bridging the “gap” between science, IT & high performance computing
BioTeam @ Bio-IT World ’14
Did you just come from Chris’s talk? Make sure to check out his slides…
We have lots going on at the conference this year
Come visit us at booth #324
5. About This Talk
What are we going to talk about?
A quick look at NGS analysis trends
Challenges in performance, scalability, and collaboration
Strategies that address these challenges
The benefits of pairing converged infrastructure with NGS
Example topologies and implementations
Approach
Topics discussed the same way they would be over coffee (or tea)
I talk about vendors and technologies I have experience with, which is why DDN invited me to speak (thanks)
Feel free to reach out to me during the conference if any of this interests you
6. A Note About the Big Picture
At BioTeam our mission is to enable science (see above)
i.e. Great people, enabled by great technology, actively engaging in broader scientific communities
Technology alone doesn’t cover this mandate…
• Instruments never installed, unopened server boxes, idle accelerator racks
Gathering minds without the right resources and tools…
• They flee for the cloud, desk clusters, or other companies or institutions
Locking away resources and data stifles collaboration…
• Focus on services that empower instead of barriers that contain scientists
7. 1. Introduction
2. The State of NGS Data Analysis
3. Converged Infrastructure
4. Solutions to Support NGS Data Analysis
8. NGS Analysis Challenges
Performance
Compute is easy, just not necessarily efficient
Analysis pipelines are getting longer and more complex
Serial steps usually still lurk in them
“New programs” are often wrapper scripts with a twist
They don’t address the performance of the fundamental algorithms underneath
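The cost of serial steps left inside a pipeline can be estimated with Amdahl's law. The sketch below is illustrative only: the function name and the 10% serial fraction are assumptions made for this example, not measurements from the talk.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the runtime cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


if __name__ == "__main__":
    # Hypothetical pipeline in which 10% of the runtime is a serial step.
    for workers in (10, 100, 1000):
        print(workers, round(amdahl_speedup(0.10, workers), 2))
```

With a 10% serial portion the speedup can never exceed 10x, however many cores are added, which is why wrapper scripts that leave the underlying algorithm untouched rarely change the picture.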
Scalability
Each year we see only a few analysis algorithms scaled to 1-100K cores; the same goes for accelerators
The vast majority are still lucky to reach 10-100 cores
Life sciences is still mostly an HTC problem
Checkpointing is becoming increasingly important
Varying and mixed IO patterns make HTC problematic on shared storage
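As a minimal sketch of what step-level checkpointing can look like in a long pipeline: completed steps are recorded in a small JSON file so that a restart skips them. The function name, the checkpoint format, and the step names in the usage note are all hypothetical; real workflow managers offer richer versions of the same idea.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Tuple


def run_with_checkpoints(
    steps: List[Tuple[str, Callable[[], None]]],
    checkpoint_path: str,
) -> List[str]:
    """Run named steps in order, skipping any already recorded in the checkpoint file.

    Returns the names of the steps that were actually executed in this call.
    """
    done: Dict[str, bool] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            done = json.load(handle)

    executed: List[str] = []
    for name, func in steps:
        if done.get(name):
            continue  # finished in an earlier run
        func()
        done[name] = True
        executed.append(name)
        # Write to a temporary file, then rename, so a crash cannot leave a
        # half-written checkpoint behind.
        directory = os.path.dirname(os.path.abspath(checkpoint_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as handle:
            json.dump(done, handle)
        os.replace(tmp_path, checkpoint_path)
    return executed
```

If a run with steps such as `align` and `call_variants` dies in the second step, calling the function again with the same checkpoint path reruns only that step.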
9. NGS Analysis Challenges
Collaboration
Community movement to more efficient sequence data structures is very encouraging (e.g. SAM/BAM/CRAM/VCF/HDF5)
Sharing of datasets is still incredibly problematic
Large sequencing centers, institutes, and commercial interests are embracing the Science DMZ and data transfer node concepts (w/ Globus, Aspera, etc.)
Without data lifecycle management, this newfound scientific data mobility will only amplify storage issues (enter iRODS, etc.)
Last mile problem for collaborators with poor network connectivity
Need real Big Data collaboration solutions
Find a way to bring computation to the data so the last mile disappears
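The last mile problem is easy to quantify with back-of-the-envelope arithmetic. The helper below is a rough sketch under stated assumptions: decimal units, a single steady flow, and an `efficiency` factor that is a made-up knob for how much of the nominal link rate is actually achieved.

```python
def transfer_time_hours(dataset_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move `dataset_gb` gigabytes over a link of `link_mbps` megabits per second.

    Uses decimal units (1 GB = 8000 megabits). `efficiency` is the assumed
    fraction of the nominal link rate that the transfer really achieves.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency must be in (0, 1]")
    megabits = dataset_gb * 8000
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    for mbps in (100, 1000, 10000):
        print(mbps, "Mbps:", round(transfer_time_hours(1000, mbps), 2), "hours for 1 TB")
```

Under these idealized assumptions a 1 TB dataset needs about 22 hours at 100 Mbps and under 15 minutes at 10 Gbps, which is the gap that moving computation to the data is meant to close.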
10. Observation: NGS Analysis Inversion I
[Figure: two spending charts, each split into "Infrastructure Itself (the 80%)" and "SW and HW Integrations (the 20%)". First chart: Infrastructure Spending (in theory). Second chart: NGS Analysis Facility Infrastructure Spending (in practice).]
Minimizing integration overhead is one of the principal challenges right now when designing scientific computing environments.
This holds for NGS as well as other scientific domains
Analysis environments are assembled from pieces that have never previously been tried together
When they are synthesized based on what’s best for business instead of on technical merit, efficiency is wasted, and for small and midsize infrastructures integration overhead balloons
11. Observation: NGS Analysis Inversion II
What the industry is shooting for (in theory): The 20% The 80%
Where we seem to be (in practice): The 80% The 20%
12. 1. Introduction
2. The State of NGS Data Analysis
3. Converged Infrastructure
4. Solutions to Support NGS Data Analysis
13. Traditional Infrastructure
Traditional computational infrastructures are composed of separate hardware (storage, networking, computation) and software (provisioning, monitoring, management, etc.) components
Pieced into one-off solutions
Integrated and tuned on-site (can take months for large systems)
As these infrastructures scale, they become snowflakes
14. Converged Infrastructure
With converged infrastructure, multiple hardware and software components are developed, selected, integrated, and tuned together, producing a pre-optimized solution
Infrastructure building block approach
Some vendors (e.g. DDN) offer mature converged storage products like the SFA embedded platform
Facebook’s Open Compute Project lends itself to building converged infrastructures w/ OCP compliant components
Analysis appliances (e.g. SlipStream) also use the converged infrastructure model
Converged infrastructure shifts the focus from integrating hardware components to building software services, which is where organizations can better distinguish and define themselves
15. Converged Infrastructure: Example Solution
[Diagram: Traditional iRODS. Components: iRODS clients; remote iRODS server; iRODS/iCAT server; cluster; network switch; three file servers; SAN switch; three RAID controllers; three disk arrays. Connection types: iRODS data and control access; NAS file access; block storage access.]
16. Converged Infrastructure: Example Solution
[Diagram: Converged iRODS. Components: iRODS clients; remote iRODS server; iCAT & iRODS servers; cluster; network switch. Connection types: iRODS data and control access; high performance NAS file access.]
Integrated Appliance Reduces Complexity and Integration Time
17. Is Converged Infrastructure in Your Critical Path?
Converged infrastructure is not as necessary when:
1. Hiring lots of smart people and committing their time to infrastructure
2. Attacking a single or small set of large problems
3. Rarely revalidating or reintegrating your HW stack after deployment
• This is because tying your platform closely to a mixed and disparate hardware stack costs staff time to explore reintegration and revalidation issues and to rewrite code for new architectures. That can work for hyper giants and single service efforts, but it is much harder with legacy and vendor-controlled codes, flexible infrastructure, or infrastructure for yet unknown or unsolved problems. Converged infrastructure buys these costs down…
18. Tiered Service Models and Changing Staff Roles
DevOps and the cloud have changed the relationship between the researcher and the IT practitioner permanently
Research computing staff should be developing best practices, not acting as a human ‘sudo’ for informaticists
Solution: Move to a tiered service and support model
• Users instantiate resources on demand which they have privileged access to, but no support is offered beyond clearing hang-ups
• Services requiring a higher degree of reliability and/or security are built and managed by IT staff, with unprivileged access provided to users
• Core computational services are still supported end-to-end by IT staff, and are consumed by resources in the previous two levels
Challenges With This Model:
• Need for single instance resources capable of dealing with big data
• Now need multitenancy capabilities even as a single organization
• Must minimize latency to better utilize limited resources; public cloud’s massive scalability approach might not be suitable for a small or midsize research environment with legacy codes, inexperienced users, etc.
19. 1. Introduction
2. The State of NGS Data Analysis
3. Converged Infrastructure
4. Solutions to Support NGS Data Analysis
20. Tiered Data Storage (e.g. GPFS w/ HSM)
GPFS is a fast parallel file system written by IBM
Distributed metadata and locking
Good performance with small files
Tunable for large numbers of small files
Native Linux and Windows clients
CIFS and NFSv3 (v4 works, unsupported)
Raw NGS data is big
NGS analysis datasets are getting bigger
They can require lots of IOPS during analysis
Lots of space required to store what comes after
Can’t satisfy all of these considerations with a single storage tier without tremendous cost
Solution: Hierarchical Storage Management (HSM)
Create different pools of storage; policies govern data movement
SSD for metadata, small files, VMs, etc. and SATA for capacity and sequential access
Can also use tape, object storage, and others as cold archive or warm near-line tiers
NOTE: Lustre now has some HSM capabilities too as of version 2.5
[Figure: example GPFS-based GRIDScaler system from DDN]
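To make the idea of policy-governed data movement between storage pools concrete, here is a toy placement policy. It is not GPFS policy syntax and does not describe any DDN product; the pool names, the 1 MiB small-file cutoff, and the 180-day cold threshold are hypothetical values picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    days_since_access: int


# Hypothetical thresholds chosen for illustration only.
SMALL_FILE_BYTES = 1 * 1024 * 1024
COLD_AFTER_DAYS = 180


def choose_pool(info: FileInfo) -> str:
    """Return the storage pool a file should live in under this toy policy."""
    if info.days_since_access >= COLD_AFTER_DAYS:
        return "archive"  # cold tier, e.g. tape or object storage
    if info.size_bytes <= SMALL_FILE_BYTES:
        return "ssd"      # small, IOPS-heavy files
    return "sata"         # capacity and sequential access


if __name__ == "__main__":
    for f in (
        FileInfo("run42/sample.bam", 50 * 1024**3, 3),
        FileInfo("run42/sample.bam.bai", 4096, 3),
        FileInfo("run01/raw.fastq.gz", 10**9, 400),
    ):
        print(f.path, "->", choose_pool(f))
```

A real HSM deployment evaluates rules like these periodically or on capacity thresholds and migrates data in the background; the point of the sketch is only that placement is a function of file attributes rather than a manual decision.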
21. Science DMZ (e.g. ESnet Model)
Core Drivers
• Enterprise networking architecture is optimized for many small data flows (Web 2.0, mobile, Internet of Things)
• Not optimized for fewer large data flows
• Deep packet inspection & stateful firewalls can’t handle large flows, so performance tanks
3 Components of a Science DMZ
1. Fast network paths with streamlined security specific to large scientific data flows
2. Data Transfer Node(s) specifically tuned and dedicated to moving large data flows
3. Network monitoring and measurement node(s)
• Government & academic sites have done similar things for years without the name
• BioTeam strongly believes in the Science DMZ concept
• At this point anybody moving large scientific data should be evaluating it
• We are already helping deploy them
ESnet has a great web resource available: http://fasterdata.es.net/
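One reason a Data Transfer Node has to be tuned specifically for large flows is the bandwidth-delay product: a single TCP stream can only fill a long, fast path if its buffers can hold that many bytes in flight. The helper below is a small worked calculation; the function name and the example link speed and round-trip time are assumptions, not figures from the talk.

```python
def bandwidth_delay_product_bytes(link_gbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep a path of the given speed and round-trip time full."""
    if link_gbps <= 0 or rtt_ms <= 0:
        raise ValueError("link_gbps and rtt_ms must be positive")
    bits_in_flight = link_gbps * 1e9 * (rtt_ms / 1000.0)
    return bits_in_flight / 8


if __name__ == "__main__":
    # Hypothetical cross-country path: 10 Gbps with a 50 ms round trip.
    bdp = bandwidth_delay_product_bytes(10, 50)
    print(round(bdp / 1e6, 1), "MB of buffer needed per stream")
```

For the example path this works out to roughly 62.5 MB per stream, far more than typical default socket buffers, which is why untuned general-purpose hosts tend to underperform on such paths.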
22. Implementation: Science DMZ
Design Source: “The Science DMZ: Introduction & Architecture” – ESnet
23. Information Lifecycle Management (ILM) (e.g. iRODS)
iRODS, the Integrated Rule-Oriented Data System, is a project for building next-generation data management cyberinfrastructure. One of the main ideas behind iRODS is to provide a system that enables a flexible, adaptive, customizable data management architecture suitable for preserving data over its lifecycle.
At the iRODS core, a Rule Engine interprets the Rules to decide how the system is to respond to various requests and conditions.
Interfaces: GUI, Web, WebDAV, CLI
Operations: Search, Access and View, Add/Extract Metadata, Annotate, Analyze & Process, Manage, Replicate, Copy, Share, Repurpose, Track access, Subscribe & more…
iRODS Server software and Rule Engine run on each data server. The iRODS iCAT Metadata Catalog uses a database to track metadata describing the data and everything that happens to it.
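The rule-engine idea (an event arrives, matching rules fire, actions follow) can be sketched in a few lines. This is a toy dispatcher written for illustration; it is not the iRODS rule language or API, and the class name, event names, and actions are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]
Rule = Tuple[str, Callable[[Event], bool], Callable[[Event], str]]


class ToyRuleEngine:
    """Minimal event -> condition -> action dispatcher (hypothetical, not the iRODS API)."""

    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def add_rule(
        self,
        event_type: str,
        condition: Callable[[Event], bool],
        action: Callable[[Event], str],
    ) -> None:
        self.rules.append((event_type, condition, action))

    def handle(self, event_type: str, event: Event) -> List[str]:
        """Run every matching rule in order and return a log of the actions taken."""
        return [
            action(event)
            for rule_event, condition, action in self.rules
            if rule_event == event_type and condition(event)
        ]


engine = ToyRuleEngine()
# On upload, replicate alignment files to an archive resource.
engine.add_rule(
    "put",
    lambda e: str(e["path"]).endswith(".bam"),
    lambda e: f"replicate {e['path']} to archive",
)
# On every upload, record a checksum in the catalog.
engine.add_rule("put", lambda e: True, lambda e: f"record checksum for {e['path']}")

if __name__ == "__main__":
    print(engine.handle("put", {"path": "/lab/run1/sample.bam"}))
```

The design point this mirrors is that data management policy lives in rules evaluated by the servers rather than in ad hoc scripts run by individual users.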
24. Implementation: Science DMZ + ILM
25. NGS Data Analysis (on a Hybrid HPC Cloud)
General Concept
• On-site local resources are a “cache” that exists…
• To be used constantly
• For best data locality
• For specialized resources
• For security
• Elastic resources from public or parent organization’s private cloud
• The middleware offers cloud-style IaaS and/or PaaS
• Multi-tenant: users/virtual communities can spin up their own resources, clusters, etc.
• These on-demand systems accommodate unique software configurations and services (suited to varying NGS workflows, etc.)
Sounds great, but…
• It will be a while before you can pull a solution like this off the shelf
• Would be a good candidate for a converged infrastructure offering
Goal: HPC-like performance and latency, cloud-like elasticity and provisioning
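The "local resources as a cache, cloud for elasticity" concept implies a placement decision for each job. The sketch below shows one possible decision rule under stated assumptions; the `Job` fields, the three outcomes, and the rule itself are hypothetical and stand in for whatever middleware a real hybrid environment would use.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    sensitive_data: bool = False
    needs_special_hardware: bool = False


def place_job(job: Job, free_local_cores: int) -> str:
    """Decide where a job runs: 'local', 'cloud', or 'queue' (wait for local capacity)."""
    must_stay_local = job.sensitive_data or job.needs_special_hardware
    if job.cores <= free_local_cores:
        return "local"  # keep the on-site "cache" busy first
    if must_stay_local:
        return "queue"  # cannot leave the site, so wait for capacity
    return "cloud"      # burst to elastic resources


if __name__ == "__main__":
    print(place_job(Job("align", cores=64), free_local_cores=128))
    print(place_job(Job("joint-calling", cores=512), free_local_cores=128))
    print(place_job(Job("clinical", cores=512, sensitive_data=True), free_local_cores=128))
```

A production scheduler would also weigh data locality and transfer cost before bursting; those factors are left out here to keep the rule readable.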
26. Implementation: Science DMZ + ILM + HPC Cloud
27. Parting Thoughts & Lessons Learned
1. Confirmation Bias
• Just because it wasn’t viable before, doesn’t mean it won’t ever be
2. Depth Perception
• Bleeding Edge? Leading Edge? State of the Art? Legacy? Ready to Sunset?
3. Outliers
• The existence of edge or corner cases does not necessarily invalidate a solution, but it does mean you had better understand the scope the solution covers
4. The Power of &&
• Multipart solutions are seen as complex and abandoned in search of a silver bullet
• Combining ideas is more collaborative and doesn’t force an ultimatum
5. Game Theory
• Bringing a chess set to a checkers tournament…
6. Relationship over Technology
• Work with vendors and collaborators that are interested in making a long term investment in what you do
28. Thank You
Questions and Discussion Welcome
Aaron D. Gardner
Senior Scientific Consultant, BioTeam, Inc.
aaron@bioteam.net