Slides from my workshop presentation at the GTC Bioinformatics & Data Management Strategies Workshop:
http://www.gtcbio.com/component/conference/bioinformatics-and-data-management-strategies-workshop-agenda
Drop me a line (chris@bioteam.net) if you want the full PDF download.
GTC 2013: Practical NGS Data Management
1. Practical NGS Data Management
2013 GTC Bioinformatics & Data Management Strategies - San Francisco
Wednesday, June 19, 13
2. I’m Chris.
I’m an infrastructure geek.
I work for the BioTeam.
www.bioteam.net - Twitter: @chris_dag
3. Who, What, Why ...
BioTeam
‣ Independent consulting shop
‣ Staffed by scientists forced
to learn IT, SW & HPC to get
our own research done
‣ 10+ years bridging the “gap”
between science, IT & high
performance computing
‣ We get to see how many
groups of smart people
tackle similar problems
4. Listen to me at your own risk
Standard Dag Disclaimer
‣ I’m not an expert, pundit,
visionary or “thought leader”
‣ Any career success entirely due
to shamelessly copying what
actual smart people do
‣ I’m biased, burnt-out & cynical
‣ Filter my words accordingly
6. It’s a risky time to be doing Bio-IT
7. Science progressing way faster than IT can refresh/change
The Central Problem Is ...
‣ Instrumentation & protocols are changing FAR
FASTER than we can refresh our Research-IT &
Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every
2-7 years
‣ We have to design systems TODAY that can
support unknown research requirements &
workflows over many years (gulp ...)
8. The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago we could toss
inexpensive storage and
servers at the problem;
even in a nearby closet or
under a lab bench if
necessary
‣ That does not work any
more; real solutions
required
10. We are here today because ...
‣ It has never been easier to
acquire vast amounts of data
cheaply and easily
‣ Growth rate of data creation/ingest exceeds the rate at
which the storage industry is improving disk capacity
‣ Not just a storage lifecycle
problem. This data *moves*
and often needs to be shared
among multiple entities and
providers
• ... ideally without punching holes in
your firewall or consuming all
available internet bandwidth
11. If you get it wrong ...
‣ Lost opportunity
‣ Missing capability
‣ Frustrated & very vocal scientific staff
‣ Problems in recruiting, retention,
publication & product development
13. Get Comfortable With Insane Rates Of Change
Meta Issue #1
‣ Genome Sequencing
innovation rate
is simply ludicrous.
‣ Similar rapid change in
tools & lab protocols
‣ We MUST acknowledge
and plan for disruptive
science affecting IT
systems and methods
14. Get Comfortable With Insane Rates Of Change
Meta Issue #1, continued
‣ Multiple ways to approach this, often affected
by how your funding cycle works
1. Over-design & Over-provision
2. Spend upfront on agility & scaling at the IT core
3. Incremental refresh of smaller, “right-sized” systems
15. Know Your Platforms
Meta Issue #2
‣ NGS Platform has huge
impact on IT footprint
‣ Which category do you fall in?
• Single NGS instrument? Multiple
NGS instruments?
• One NGS vendor or many?
• Outsourced sequencing?
• Outsourced sequencing +
analysis?
• Hybrid mix of onsite & outsourced?
16. Understand The Duty Cycle
Meta Issue #3
‣ NGS “duty cycle” also has huge influence on size
and shape of the IT footprint
‣ What type of lab are you running?
• 24x7 NGS Industrial Production?
• Central NGS Core Facility?
• Single PI, Department or Workgroup?
‣ Operational & Analysis Workflow?
• Sequence .... Analyze .... Sequence?
• Sequence ... Sequence ... Sequence ... Analysis
17. Know The Tool Landscape
Meta Issue #4
‣ Analysis & data tools are changing almost as
rapidly as NGS chemistry & platforms
• Open source, commercial, on-premise and cloud-hosted
are all in the mix for 2013
‣ Understand that your software landscape and
toolchain may change significantly (multiple
times) over the lifespan of your NGS efforts
‣ IT people also need solid understanding of
WHO makes algorithm & toolchain decisions
18. Diagram Your Pipelines & Data Flows
Meta Issue #5
‣ Different groups with the same NGS platform
will do wildly different things
• Low end example: push FASTQ into CLCBio, export
VCF files and “call it a day”
• High end example: custom reference alignment or de novo
assembly followed by intense human-driven bioinformatics
• Complex example: genomic medicine & NGS data
being used to drive clinical decisions
19. Can we do an NGS talk without using the ‘C’ word?
20. Whether you like it or not ...
NGS & “The Cloud”
‣ You can’t ignore
or avoid the cloud
‣ Period.
21. Whether you like it or not ...
NGS & “The Cloud”
‣ Why you can’t ignore the cloud in 2013
1. NGS data flows small enough to allow “write to cloud”
2. NGS vendors are forcing the issue
3. Our local storage is increasingly becoming “cloud aware”
4. Your users may prefer a cloud-hosted solution
5. Sequencing partners can deliver directly to the cloud
6. Easier to partner on NGS analysis & data distribution
7. Cloud economic (particularly storage) trends are clear
22. Whether you like it or not ...
NGS & “The Cloud”
‣ Why you need to start work NOW
‣ Blunt Truth: 90% of cloud technical bits are easy to
understand and fast to implement.
‣ Almost frictionless to access the cloud and NGS
vendors have a vested interest in making it faster and
easier; cloud may be in use today without your
knowledge
‣ Risk of scientists bypassing/leapfrogging internal IT
23. Whether you like it or not ...
NGS & “The Cloud”
‣ You need to start NOW because of the 10% of
“cloud stuff” that is neither fast nor easy ...
• Internal policies/procedures & risk assessment
• Adding additional internet capacity takes time
• Safe networking, firewall, VPN, VPC and Identity
Management implementations require experts to design
and potentially lengthy implementation periods
‣ “Accessing” the cloud is easy. Using it properly,
safely & persistently is neither easy nor trivial.
24. Subnets & VPC can be more complex than the compute & storage
26. Storage & Information Management
‣ Compute power in 2013 is a cheap commodity
‣ Storage? Not so much.
• Still many ways to spectacularly waste
money
• Incredible diversity of vendors, products &
capability
‣ A significant percentage of your
budget and pre-purchase design
efforts should center around
storage, data movement & data
lifecycle management
27. Storage & Information Management
‣ Only time for 3 bits of
advice:
1. The need for a default
storage ‘stance’
2. WHAT you store is as
important as HOW you
store it
3. Importance of non-crap
metrics
28. Storage ‘Stance’
‣ Storage landscape is
immense & diverse
• 100TB storage can be bought
for $12,000 - $400,000
‣ You need a ‘default stance’
‣ Good news is you have
many options
‣ ... your ‘stance’ is often
defined by how your budget
and funding cycles work
29. Monolithic “all-in-one”
Storage ‘Stance’ - Option 1
‣ Sized for the future on Day #1
• Purchased upfront with future need in mind; looks
overbuilt and over-provisioned early on
‣ Good for:
• Groups with “one-shot” funding & little refresh chances
over the lifespan of the platform
‣ Not great if:
• Business or science changes unexpectedly
• You did your sizing/scaling/growth calculations wrong
30. Tier 1 Storage - Easy to Grow & Manage
Storage ‘Stance’ - Option 2
‣ Invest upfront in peta-capable single-namespace & low operational
burden
• Enterprise-grade storage that is very easy to manage, maintain and grow over time
‣ Good for:
• Organizations where getting new headcount is harder than spending CapEx;
Intentional spend on hardware that does not require additional humans to run &
maintain it.
• Organizations with budgeting that allows for incremental refresh cycles
• Organizations without onsite gurus & dedicated storage admins
‣ Downside:
• Upfront & ongoing investment can be large; possibly affecting compute, tools or
software budget
• Expensive relative to alternatives. You are paying for “future-proof” scalability &
systems engineered for the lowest possible operational burden
31. Getting clever & straddling Tier 1 and Tier 2 Storage
Storage ‘Stance’ - Option 3
‣ Strive for peta-capable single-namespace and easy operation
• ... but be willing to make modest trade-offs in exchange for lower cost
• Look at both Tier 1 and Tier 2 storage vendors (and people like Cambridge
Computer)
‣ Good for:
• Organizations willing to take a more active role in vendor selection, design,
deployment & operation
• Organizations motivated by ROI and willing to make modest trade-offs in
capability, performance or operational burden in exchange for lower CapEx cost
‣ Downside:
• More risk in this area - easy to make a misguided decision. Requires brains &
active interest in pre-sale design and vendor selection process. May require
more storage admin effort day-to-day. Some trade-offs are better than others.
32. Clever but not dumb
Storage ‘Stance’ - Option 4
‣ Midrange “Cheap & Clever”
• There are tons of very interesting Tier2 and Tier3 storage options available.
The hard part is separating the good stuff from the crap stuff.
• Check out: RAID Inc, NexentaStor, NexSan, etc. etc.
‣ Good for:
• Budget constrained groups with motivated IT people
‣ Downside
• Might have to throw away stuff as you outgrow it (“forklift upgrade”)
• Careful pre-purchase work required to properly config/size it
• Storage design may force changes on scientific workflows
• Higher administrative burden
33. DIY & Super Cheap
Storage ‘Stance’ - Option 5
‣ DIY & Disruptive
• Incredibly disruptive stuff is out there for motivated DIY’ers and people who
can’t afford Tier 1 and Tier 2 platforms; This is where you can spend
$12,000 on a 100TB storage node.
• Driven largely by high-density x86_64 server chassis and many people
writing clever software (both free and commercial). There are NAS, SAN,
Parallel and Distributed filesystem options all in this realm
‣ Good for:
• Smart people & organizations with guru storage & sysadmin resources.
• People with no money or people who spent all their money on NGS
instrument & reagents and “forgot about all that IT stuff ...”
‣ Downside
• Non-trivial risk to science. Catastrophic data loss and science-disrupting
downtime can all easily occur down at this level; Mess up badly and you will
LOSE YOUR JOB
34. WHAT you store is as important as HOW you store it
35. Information Management
‣ Often Overlooked
• Hopefully previous speakers convinced you of the value
gained from “information lifecycle management”
‣ The Core Problem
• POSIX filesystem semantics are insufficient for storing all
of the attributes and information we want to tag our data
with
• ... “something else” is required
36. Something else is required ...
Information Management
‣ The large NGS heavy hitters are all looking at
“metadata aware” storage as the ultimate solution
‣ Small & midrange NGS shops usually
leverage LIMS with a bit of storage
reporting/analytics
‣ LIMS warning:
• NGS vendors tend to assume you will only use NGS
instruments that they make! Their software may not
handle a future “multi-platform” NGS environment
• Beware of the time/effort/cost required to modify
many LIMS systems that are on the market today
• BioTeam consulting has resulted in some products
being made in this space
- MiniLIMS, Slipstream NGS & Galaxy Editions
38. Storage Metrics
‣ It is VERY important that you understand what you are
storing and what the short, medium and long-term trend
lines look like
‣ Very few people actually bother to do this
‣ ... and many that do end up producing pretty graphs
that look good on dashboards but don’t actually
help drive scaling, refresh or upgrade decisions.
‣ You need metrics that can drive actionable
decisions related to storage management and
growth
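To make "metrics that drive actionable decisions" concrete: below is a minimal sketch, not anything from the talk, that walks a storage tree and totals bytes by file age. The bucket boundaries are illustrative assumptions; numbers like these feed tiering, archive and refresh decisions far better than a pretty utilization gauge.

```python
import os
import time
from collections import defaultdict

# Age buckets are illustrative assumptions -- tune them to your refresh cycle.
AGE_BUCKETS = [("<30d", 30), ("30-180d", 180), ("180d-1y", 365), (">1y", None)]

def bytes_by_age(root):
    """Walk `root` and total file sizes into modification-age buckets."""
    now = time.time()
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or unreadable; skip it
            age_days = (now - st.st_mtime) / 86400
            for label, limit in AGE_BUCKETS:
                if limit is None or age_days < limit:
                    totals[label] += st.st_size
                    break
    return dict(totals)
```

Run periodically and logged, the same totals become the trend lines the next slide argues for.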
40. It’s 2013 ... we know what questions to ask about our storage
41. A 6 month rolling window provides real/actionable info ...
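One operational reading of the rolling window: average the trailing six months of growth and divide remaining capacity by it. A hedged sketch; the function names and sample figures below are invented for illustration:

```python
def rolling_growth(monthly_tb, window=6):
    """Average TB/month added over the trailing `window` months of snapshots."""
    deltas = [b - a for a, b in zip(monthly_tb, monthly_tb[1:])]
    if len(deltas) < window:
        window = len(deltas)  # fall back to whatever history exists
    return sum(deltas[-window:]) / window

def months_until_full(monthly_tb, capacity_tb, window=6):
    """Naive runway estimate assuming the recent growth rate holds."""
    rate = rolling_growth(monthly_tb, window)
    if rate <= 0:
        return float("inf")  # flat or shrinking usage: no deadline
    return (capacity_tb - monthly_tb[-1]) / rate
```

For example, monthly usage of [40, 44, 47, 52, 58, 63, 70] TB against a 100 TB system yields a 5 TB/month trend and roughly six months of runway -- an actionable number in a way that a point-in-time "70% full" dashboard is not.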
42. Critical to have a handle on “raw” vs “derived” data also
• PacBio: Raw 70% / Derived 30% (1.55 TB vs 0.569 TB)
• Roche 454: Raw 86% / Derived 14% (4.55 TB vs 0.757 TB)
• Illumina: Raw 85% / Derived 15% (10.171 TB vs 1.86 TB)
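A raw-vs-derived split can be approximated by classifying files by extension. A sketch under stated assumptions -- the extension lists here are guesses and would need tuning per platform and pipeline:

```python
# Illustrative extension lists; real pipelines will need their own mappings.
RAW_EXT = (".bcl", ".fastq", ".fastq.gz", ".sff", ".h5")
DERIVED_EXT = (".bam", ".sam", ".vcf", ".bed", ".csv")

def raw_vs_derived(paths_and_sizes):
    """Given (path, bytes) pairs, total raw vs derived bytes by extension."""
    raw = derived = 0
    for path, size in paths_and_sizes:
        lower = path.lower()
        if lower.endswith(RAW_EXT):
            raw += size
        elif lower.endswith(DERIVED_EXT):
            derived += size
    return raw, derived
```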
44. You need a plan for both network and physical ingest
NGS Data Ingest
‣ Whatever your ‘stance’ is today regarding
ingest of external NGS data it will almost
certainly change over time
• ... interesting public domain data sets
• Data from collaborators & partners
• Moving data among your own organization
‣ Plan for both ‘network’ and ‘physical’ methods
45. You need a plan for both network and physical ingest
NGS Data Ingest
‣ Ingest is hard. It may seem easy but it’s not,
especially if you care about data integrity.
• Are you validating MD5 checksums on every file each
time it moves from location A to location B?
‣ ... it can also sap a lot of time and effort from
your staff if done ad-hoc or in a disorganized
way
‣ Both physical and network-based ingest require
non-trivial amounts of upfront thought. Some
infrastructure & software may also be required
50. You need a plan for both network and physical ingest
Physical NGS Data Ingest
‣ Physical ingest is best done with dedicated hardware
and (ideally) a dedicated workstation
‣ Things to think about
• How are you labeling/storing/tracking physical media? Who does
the work? Expensive PhD? IT staff? Is there a written SOP guiding
the process?
• How does physical media end up at your loading dock? Where
does it go after that?
• Is your ingest workstation fast enough to handle MD5
checksumming on the fly? Enough RAM for lots of TCP sessions?
• Is your ingest station physically located in an optimal network
location to facilitate the data movement to core storage?
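The "fast enough for MD5 on the fly" question can be answered empirically rather than guessed. A rough benchmark sketch (buffer sizes are arbitrary); compare the result against the disk or network rate the workstation must sustain:

```python
import hashlib
import os
import time

def md5_throughput_mib_s(total_mib=256, chunk_mib=4):
    """Hash `total_mib` MiB of random data and report MiB/s achieved."""
    chunk = os.urandom(chunk_mib * 1024 * 1024)
    h = hashlib.md5()
    start = time.perf_counter()
    for _ in range(total_mib // chunk_mib):
        h.update(chunk)  # reuse the same buffer; we only care about hash speed
    elapsed = time.perf_counter() - start
    return total_mib / elapsed
```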
51. You need a plan for both network and physical ingest
Network NGS Data Ingest
‣ Network ingest (at high speed) requires advance
planning and potential infrastructure
‣ Things to think about
• Commercial via Aspera? OpenSource via GridFTP? Something
else?
• How exactly will you safely get data inside your organization via the
internet? How do you move from DMZ through firewall and onto
your internal scientific IT systems?
• Can you move data at speed without taking down VOIP and
Teleconferencing systems or making network admins cry?
• Will the IDS or Firewall doing deep packet inspection or protocol
reassembly melt under the load?
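The bandwidth questions above reduce to arithmetic worth running before any transfer is promised to a collaborator. A sketch; the 60% efficiency default is an assumption, not a measured figure:

```python
def transfer_hours(data_tb, link_gbps, efficiency=0.6):
    """Hours to move `data_tb` (decimal TB) over a `link_gbps` link at the
    assumed protocol efficiency -- plain TCP over distance rarely fills the pipe."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600.0
```

At these assumptions, 10 TB over a fully dedicated 1 Gbps link takes roughly 22 hours; at 60% efficiency, about 37 -- which is why tools like Aspera and GridFTP exist, and why "just email them the data" plans fail.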
53. Ending Advice: 1 of 6
‣ Understand the ‘interesting time’ we are in
• Science is changing faster than we can refresh IT
• Disruptive innovation in the NGS space itself
‣ Advice:
• Spend as much time thinking about future flexibility as
you spend on actual current needs & requirements
54. Ending Advice: 2 of 6
‣ NGS Assumptions don’t last very long
• Will you change NGS vendor, platform or method?
• Will the tools in use today still be in use tomorrow?
• How will the “local vs. outsourced vs. cloud” landscape
change for you over the next few years?
‣ Advice:
• Avoid things that lock you into a vendor or platform
• Look long and hard at your default assumptions
55. Ending Advice: 3 of 6
‣ You need Physical & Network Ingest Planning
• You may have standardized on one method or practice
but there will always be outliers and unexpected
situations; Data always seems to be on the move!
- NGS data volumes mean that outliers are non-trivial to handle
‣ Advice:
• Just think about how you would handle the edge cases
and unexpected; don’t go crazy with upfront investment.
56. Ending Advice: 4 of 6
‣ You need a cloud strategy. Today.
- Your users or vendors may force the issue
- The economic trend lines make cloud inescapable
- 90% of cloud is “easy”. Remaining 10% takes time & effort
‣ Advice:
• 100% Cloud is not unreasonable in 2013*
• Do the boring/long work now (policies, procedure, etc.)
• Consider laying the tech groundwork (Bandwidth, VPN, VPC
& Identity Management) now so you can easily and simply
make use of the cloud when needed
57. Ending Advice: 5 of 6
‣ Compute & Analysis
‣ Advice:
• Compute power is essentially a commodity in 2013
- Both local and “on the cloud”
• Easy and relatively inexpensive to acquire and deploy
• There are some potential ‘gotcha’ and tuning areas that
deserve advance thought and attention
- ... but relative to storage & data it’s an “easy” problem area
58. Ending Advice: 6 of 6
‣ Storage & Data Management
‣ Advice:
• Bulk of your attention & budget goes here
• Huge diversity in product and feature offerings mean
more risk & more chances of mistakes
- Outside expertise & NGS-aware vendors like Accunet &
Cambridge Computer really can act as “value added resellers”
• Pick one of the “default stances” that best match your
organization funding & staffing model and build around
that
72. Yep. This counts.
16 monster compute nodes + 22 GPU nodes
Cost? 30 bucks an hour via AWS Spot Market
Real world screenshot from mid-2012