Slides from my workshop presentation at the GTC Bioinformatics & Data Management Strategies Workshop:
http://www.gtcbio.com/component/conference/bioinformatics-and-data-management-strategies-workshop-agenda
Drop me a line (chris@bioteam.net) if you want the full PDF download.
GTC 2013: Practical NGS Data Management
1. Practical NGS Data Management
2013 GTC Bioinformatics & Data Management Strategies - San Francisco
Wednesday, June 19, 13
2. I’m Chris.
I’m an infrastructure geek.
I work for the BioTeam.
www.bioteam.net - Twitter: @chris_dag
3. Who, What, Why ...
BioTeam
‣ Independent consulting shop
‣ Staffed by scientists forced
to learn IT, SW & HPC to get
our own research done
‣ 10+ years bridging the “gap”
between science, IT & high
performance computing
‣ We get to see how many
groups of smart people
tackle similar problems
4. Listen to me at your own risk
Standard Dag Disclaimer
‣ I’m not an expert, pundit,
visionary or “thought leader”
‣ Any career success entirely due
to shamelessly copying what
actual smart people do
‣ I’m biased, burnt-out & cynical
‣ Filter my words accordingly
6. It’s a risky time to be doing Bio-IT
7. Science progressing way faster than IT can refresh/change
The Central Problem Is ...
‣ Instrumentation & protocols are changing FAR
FASTER than we can refresh our Research-IT &
Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every
2-7 years
‣ We have to design systems TODAY that can
support unknown research requirements &
workflows over many years (gulp ...)
8. The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago we could toss
inexpensive storage and
servers at the problem;
even in a nearby closet or
under a lab bench if
necessary
‣ That does not work any
more; real solutions
required
10. We are here today because ...
‣ It has never been easier to
acquire vast amounts of data
cheaply and easily
‣ Growth rate of data creation/ingest exceeds the rate at
which the storage industry is improving disk capacity
‣ Not just a storage lifecycle
problem. This data *moves*
and often needs to be shared
among multiple entities and
providers
• ... ideally without punching holes in
your firewall or consuming all
available internet bandwidth
11. If you get it wrong ...
‣ Lost opportunity
‣ Missing capability
‣ Frustrated & very vocal scientific staff
‣ Problems in recruiting, retention,
publication & product development
13. Get Comfortable With Insane Rates Of Change
Meta Issue #1
‣ Genome Sequencing
innovation rate
is simply ludicrous.
‣ Similar rapid change in
tools & lab protocols
‣ We MUST acknowledge
and plan for disruptive
science affecting IT
systems and methods
14. Get Comfortable With Insane Rates Of Change
Meta Issue #1, continued
‣ Multiple ways to approach this, often affected
by how your funding cycle works
1. Over-design & Over-provision
2. Spend upfront on agility & scaling at the IT core
3. Incremental refresh of smaller, “right-sized” systems
15. Know Your Platforms
Meta Issue #2
‣ NGS Platform has huge
impact on IT footprint
‣ Which category do you fall in?
• Single NGS instrument? Multiple
NGS instruments?
• One NGS vendor or many?
• Outsourced sequencing?
• Outsourced sequencing +
analysis?
• Hybrid mix of onsite & outsourced?
16. Understand The Duty Cycle
Meta Issue #3
‣ NGS “duty cycle” also has huge influence on size
and shape of the IT footprint
‣ What type of lab are you running?
• 24x7 NGS Industrial Production?
• Central NGS Core Facility?
• Single PI, Department or Workgroup?
‣ Operational & Analysis Workflow?
• Sequence .... Analyze .... Sequence?
• Sequence ... Sequence ... Sequence ... Analysis
17. Know The Tool Landscape
Meta Issue #4
‣ Analysis & data tools are changing almost as
rapidly as NGS chemistry & platforms
• Open source, commercial, on-premise and cloud-hosted
are all in the mix for 2013
‣ Understand that your software landscape and
toolchain may change significantly (multiple
times) over the lifespan of your NGS efforts
‣ IT people also need solid understanding of
WHO makes algorithm & toolchain decisions
18. Diagram Your Pipelines & Data Flows
Meta Issue #5
‣ Different groups with the same NGS platform
will do wildly different things
• Low end example: push FASTQ into CLCBio, export
VCF files and “call it a day”
• High end example: custom reference alignment or de novo
assembly followed by intense human-driven bioinformatics
• Complex example: genomic medicine & NGS data
being used to drive clinical decisions
19. Can we do an NGS talk without using the ‘C’ word?
20. Whether you like it or not ...
NGS & “The Cloud”
‣ You can’t ignore
or avoid the cloud
‣ Period.
21. Whether you like it or not ...
NGS & “The Cloud”
‣ Why you can’t ignore the cloud in 2013
1. NGS data flows small enough to allow “write to cloud”
2. NGS vendors are forcing the issue
3. Our local storage is increasingly becoming “cloud aware”
4. Your users may prefer a cloud-hosted solution
5. Sequencing partners can deliver directly to the cloud
6. Easier to partner on NGS analysis & data distribution
7. Cloud economic (particularly storage) trends are clear
22. Whether you like it or not ...
NGS & “The Cloud”
‣ Why you need to start work NOW
‣ Blunt Truth: 90% of cloud technical bits are easy to
understand and fast to implement.
‣ Almost frictionless to access the cloud and NGS
vendors have a vested interest in making it faster and
easier; cloud may be in use today without your
knowledge
‣ Risk of scientists bypassing/leapfrogging internal IT
23. Whether you like it or not ...
NGS & “The Cloud”
‣ You need to start NOW because of the 10% of
“cloud stuff” that is neither fast nor easy ...
• Internal policies/procedures & risk assessment
• Adding additional internet capacity takes time
• Safe networking, firewall, VPN, VPC and Identity
Management implementations require experts to design
and potentially lengthy implementation periods
‣ “Accessing” the cloud is easy. Using it properly,
safely & persistently is neither easy nor trivial.
24. Subnets & VPC can be more complex than the compute & storage
26. Storage & Information Management
‣ Compute power in 2013 is a cheap commodity
‣ Storage? Not so much.
• Still many ways to spectacularly waste
money
• Incredible diversity of vendors, products &
capability
‣ A significant percentage of your
budget and pre-purchase design
efforts should center around
storage, data movement & data
lifecycle management
27. Storage & Information Management
‣ Only time for 3 bits of
advice:
1. The need for a default
storage ‘stance’
2. WHAT you store is as
important as HOW you
store it
3. Importance of non-crap
metrics
28. Storage ‘Stance’
‣ Storage landscape is
immense & diverse
• 100TB storage can be bought
for $12,000 - $400,000
‣ You need a ‘default stance’
‣ Good news is you have
many options
‣ ... your ‘stance’ is often
defined by how your budget
and funding cycles work
29. Monolithic “all-in-one”
Storage ‘Stance’ - Option 1
‣ Sized for the future on Day #1
• Purchased upfront with future need in mind; looks
overbuilt and over-provisioned early on
‣ Good for:
• Groups with “one-shot” funding & little refresh chances
over the lifespan of the platform
‣ Not great if:
• Business or science changes unexpectedly
• You did your sizing/scaling/growth calculations wrong
30. Tier 1 Storage - Easy to Grow & Manage
Storage ‘Stance’ - Option 2
‣ Invest upfront in peta-capable single-namespace & low operational
burden
• Enterprise-grade storage that is very easy to manage, maintain and grow over time
‣ Good for:
• Organizations where getting new headcount is harder than spending CapEx;
Intentional spend on hardware that does not require additional humans to run &
maintain it.
• Organizations with budgeting that allows for incremental refresh cycles
• Organizations without onsite gurus & dedicated storage admins
‣ Downside:
• Upfront & ongoing investment can be large; possibly affecting compute, tools or
software budget
• Expensive relative to alternatives. You are paying for “future-proof” scalability &
systems engineered for the lowest possible operational burden
31. Getting clever & straddling Tier 1 and Tier 2 Storage
Storage ‘Stance’ - Option 3
‣ Strive for peta-capable single-namespace and easy operation
• ... but be willing to make modest trade-offs in exchange for lower cost
• Look at both Tier 1 and Tier 2 storage vendors (and people like Cambridge
Computer)
‣ Good for:
• Organizations willing to take a more active role in vendor selection, design,
deployment & operation
• Organizations motivated by ROI and willing to make modest trade-offs in
capability, performance or operational burden in exchange for lower CapEx cost
‣ Downside:
• More risk in this area - easy to make a misguided decision. Requires brains &
active interest in pre-sale design and vendor selection process. May require
more storage admin effort day-to-day. Some trade-offs are better than others.
32. Clever but not dumb
Storage ‘Stance’ - Option 4
‣ Midrange “Cheap & Clever”
• There are tons of very interesting Tier2 and Tier3 storage options available.
The hard part is separating the good stuff from the crap stuff.
• Check out: RAID Inc, NexentaStor, NexSan, etc. etc.
‣ Good for:
• Budget constrained groups with motivated IT people
‣ Downside
• Might have to throw away stuff as you outgrow it (“forklift upgrade”)
• Careful pre-purchase work required to properly config/size it
• Storage design may force changes on scientific workflows
• Higher administrative burden
33. DIY & Super Cheap
Storage ‘Stance’ - Option 5
‣ DIY & Disruptive
• Incredibly disruptive stuff is out there for motivated DIY’ers and people who
can’t afford Tier 1 and Tier 2 platforms; This is where you can spend
$12,000 on a 100TB storage node.
• Driven largely by high-density x86_64 server chassis and many people
writing clever software (both free and commercial). There are NAS, SAN,
Parallel and Distributed filesystem options all in this realm
‣ Good for:
• Smart people & organizations with guru storage & sysadmin resources.
• People with no money or people who spent all their money on NGS
instrument & reagents and “forgot about all that IT stuff ...”
‣ Downside
• Non-trivial risk to science. Catastrophic data loss and science-disrupting
downtime can all easily occur down at this level; Mess up badly and you will
LOSE YOUR JOB
34. WHAT you store is as important as HOW you store it
35. Information Management
‣ Often Overlooked
• Hopefully previous speakers convinced you of the value
gained from “information lifecycle management”
‣ The Core Problem
• POSIX filesystem semantics are insufficient for storing all
of the attributes and information we want to tag our data
with
• ... “something else” is required
36. Something else is required ...
Information Management
‣ The large NGS heavy hitters are all looking at
“metadata aware” storage as the ultimate solution
‣ Small & midrange NGS shops usually
leverage LIMS with a bit of storage
reporting/analytics
‣ LIMS warning:
• NGS vendors tend to assume you will only use NGS
instruments that they make! Their software may not
handle a future “multi-platform” NGS environment
• Beware of the time/effort/cost required to modify
many LIMS systems that are on the market today
• BioTeam consulting has resulted in some products
being made in this space
- MiniLIMS, Slipstream NGS & Galaxy Editions
38. Storage Metrics
‣ It is VERY important that you understand what you are
storing and what the short, medium and long-term trend
lines look like
‣ Very few people actually bother to do this
‣ ... and many that do end up producing pretty graphs
that look good on dashboards but don’t actually
help drive scaling, refresh or upgrade decisions.
‣ You need metrics that can drive actionable
decisions related to storage management and
growth
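To make "metrics that drive actionable decisions" concrete: below is a minimal sketch, not anything from the talk, that walks a storage tree and totals bytes by file age. The bucket boundaries are illustrative assumptions; numbers like these feed tiering, archive and refresh decisions far better than a pretty utilization gauge.

```python
import os
import time
from collections import defaultdict

# Age buckets are illustrative assumptions -- tune them to your refresh cycle.
AGE_BUCKETS = [("<30d", 30), ("30-180d", 180), ("180d-1y", 365), (">1y", None)]

def bytes_by_age(root):
    """Walk `root` and total file sizes into modification-age buckets."""
    now = time.time()
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or unreadable; skip it
            age_days = (now - st.st_mtime) / 86400
            for label, limit in AGE_BUCKETS:
                if limit is None or age_days < limit:
                    totals[label] += st.st_size
                    break
    return dict(totals)
```

Run periodically and logged, the same totals become the trend lines the next slide argues for.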
40. It’s 2013 ... we know what questions to ask about our storage
41. A 6 month rolling window provides real/actionable info ...
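One operational reading of the rolling window: average the trailing six months of growth and divide remaining capacity by it. A hedged sketch; the function names and sample figures below are invented for illustration:

```python
def rolling_growth(monthly_tb, window=6):
    """Average TB/month added over the trailing `window` months of snapshots."""
    deltas = [b - a for a, b in zip(monthly_tb, monthly_tb[1:])]
    if len(deltas) < window:
        window = len(deltas)  # fall back to whatever history exists
    return sum(deltas[-window:]) / window

def months_until_full(monthly_tb, capacity_tb, window=6):
    """Naive runway estimate assuming the recent growth rate holds."""
    rate = rolling_growth(monthly_tb, window)
    if rate <= 0:
        return float("inf")  # flat or shrinking usage: no deadline
    return (capacity_tb - monthly_tb[-1]) / rate
```

For example, monthly usage of [40, 44, 47, 52, 58, 63, 70] TB against a 100 TB system yields a 5 TB/month trend and roughly six months of runway -- an actionable number in a way that a point-in-time "70% full" dashboard is not.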
42. Critical to have a handle on “raw” vs “derived” data also
• PacBio: Raw 70% / Derived 30% (1.55 TB vs 0.569 TB)
• Roche 454: Raw 86% / Derived 14% (4.55 TB vs 0.757 TB)
• Illumina: Raw 85% / Derived 15% (10.171 TB vs 1.86 TB)
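A raw-vs-derived split can be approximated by classifying files by extension. A sketch under stated assumptions -- the extension lists here are guesses and would need tuning per platform and pipeline:

```python
# Illustrative extension lists; real pipelines will need their own mappings.
RAW_EXT = (".bcl", ".fastq", ".fastq.gz", ".sff", ".h5")
DERIVED_EXT = (".bam", ".sam", ".vcf", ".bed", ".csv")

def raw_vs_derived(paths_and_sizes):
    """Given (path, bytes) pairs, total raw vs derived bytes by extension."""
    raw = derived = 0
    for path, size in paths_and_sizes:
        lower = path.lower()
        if lower.endswith(RAW_EXT):
            raw += size
        elif lower.endswith(DERIVED_EXT):
            derived += size
    return raw, derived
```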
44. You need a plan for both network and physical ingest
NGS Data Ingest
‣ Whatever your ‘stance’ is today regarding
ingest of external NGS data it will almost
certainly change over time
• ... interesting public domain data sets
• Data from collaborators & partners
• Moving data among your own organization
‣ Plan for both ‘network’ and ‘physical’ methods
45. You need a plan for both network and physical ingest
NGS Data Ingest
‣ Ingest is hard. It may seem easy but it’s not,
especially if you care about data integrity.
• Are you validating MD5 checksums on every file each
time it moves from location A to location B?
‣ ... it can also sap a lot of time and effort from
your staff if done ad-hoc or in a disorganized
way
‣ Both physical and network-based ingest require
non-trivial amounts of upfront thought. Some
infrastructure & software may also be required
50. You need a plan for both network and physical ingest
Physical NGS Data Ingest
‣ Physical ingest is best done with dedicated hardware
and (ideally) a dedicated workstation
‣ Things to think about
• How are you labeling/storing/tracking physical media? Who does
the work? Expensive PhD? IT staff? Is there a written SOP guiding
the process?
• How does physical media end up at your loading dock? Where
does it go after that?
• Is your ingest workstation fast enough to handle MD5
checksumming on the fly? Enough RAM for lots of TCP sessions?
• Is your ingest station physically located in an optimal network
location to facilitate the data movement to core storage?
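The "fast enough for MD5 on the fly" question can be answered empirically rather than guessed. A rough benchmark sketch (buffer sizes are arbitrary); compare the result against the disk or network rate the workstation must sustain:

```python
import hashlib
import os
import time

def md5_throughput_mib_s(total_mib=256, chunk_mib=4):
    """Hash `total_mib` MiB of random data and report MiB/s achieved."""
    chunk = os.urandom(chunk_mib * 1024 * 1024)
    h = hashlib.md5()
    start = time.perf_counter()
    for _ in range(total_mib // chunk_mib):
        h.update(chunk)  # reuse the same buffer; we only care about hash speed
    elapsed = time.perf_counter() - start
    return total_mib / elapsed
```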
51. You need a plan for both network and physical ingest
Network NGS Data Ingest
‣ Network ingest (at high speed) requires advance
planning and potential infrastructure
‣ Things to think about
• Commercial via Aspera? OpenSource via GridFTP? Something
else?
• How exactly will you safely get data inside your organization via the
internet? How do you move from DMZ through firewall and onto
your internal scientific IT systems?
• Can you move data at speed without taking down VOIP and
Teleconferencing systems or making network admins cry?
• Will the IDS or Firewall doing deep packet inspection or protocol
reassembly melt under the load?
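The bandwidth questions above reduce to arithmetic worth running before any transfer is promised to a collaborator. A sketch; the 60% efficiency default is an assumption, not a measured figure:

```python
def transfer_hours(data_tb, link_gbps, efficiency=0.6):
    """Hours to move `data_tb` (decimal TB) over a `link_gbps` link at the
    assumed protocol efficiency -- plain TCP over distance rarely fills the pipe."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600.0
```

At these assumptions, 10 TB over a fully dedicated 1 Gbps link takes roughly 22 hours; at 60% efficiency, about 37 -- which is why tools like Aspera and GridFTP exist, and why "just email them the data" plans fail.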
53. Ending Advice: 1 of 6
‣ Understand the ‘interesting time’ we are in
• Science is changing faster than we can refresh IT
• Disruptive innovation in the NGS space itself
‣ Advice:
• Spend as much time thinking about future flexibility as
you spend on actual current needs & requirements
54. Ending Advice: 2 of 6
‣ NGS Assumptions don’t last very long
• Will you change NGS vendor, platform or method?
• Will the tools in use today still be in use tomorrow?
• How will the “local vs. outsourced vs. cloud” landscape
change for you over the next few years?
‣ Advice:
• Avoid things that lock you into a vendor or platform
• Look long and hard at your default assumptions
55. Ending Advice: 3 of 6
‣ You need Physical & Network Ingest Planning
• You may have standardized on one method or practice
but there will always be outliers and unexpected
situations; Data always seems to be on the move!
- NGS data volumes mean that outliers are non-trivial to handle
‣ Advice:
• Just think about how you would handle the edge cases
and unexpected; don’t go crazy with upfront investment.
56. Ending Advice: 4 of 6
‣ You need a cloud strategy. Today.
- Your users or vendors may force the issue
- The economic trend lines make cloud inescapable
- 90% of cloud is “easy”. Remaining 10% takes time & effort
‣ Advice:
• 100% Cloud is not unreasonable in 2013*
• Do the boring/long work now (policies, procedure, etc.)
• Consider laying the tech groundwork (Bandwidth, VPN, VPC
& Identity Management) now so you can easily and simply
make use of the cloud when needed
57. Ending Advice: 5 of 6
‣ Compute & Analysis
‣ Advice:
• Compute power is essentially a commodity in 2013
- Both local and “on the cloud”
• Easy and relatively inexpensive to acquire and deploy
• There are some potential ‘gotcha’ and tuning areas that
deserve advance thought and attention
- ... but relative to storage & data it’s an “easy” problem area
58. Ending Advice: 6 of 6
‣ Storage & Data Management
‣ Advice:
• Bulk of your attention & budget goes here
• Huge diversity in product and feature offerings mean
more risk & more chances of mistakes
- Outside expertise & NGS-aware vendors like Accunet &
Cambridge Computer really can act as “value added resellers”
• Pick one of the “default stances” that best match your
organization funding & staffing model and build around
that
72. Yep. This counts.
16 monster compute nodes + 22 GPU nodes
Cost? 30 bucks an hour via AWS Spot Market
Real world screenshot from mid-2012