Practical NGS Data Management
2013 GTC Bioinformatics & Data Management Strategies - San Francisco
Wednesday, June 19, 13
I’m Chris.
I’m an infrastructure geek.
I work for the BioTeam.
www.bioteam.net - Twitter: @chris_dag
Who, What, Why ...
BioTeam
‣ Independent consulting shop
‣ Staffed by scientists forced
to learn IT, SW & HPC to get
our own research done
‣ 10+ years bridging the “gap”
between science, IT & high
performance computing
‣ We get to see how many
groups of smart people
tackle similar problems
Listen to me at your own risk
Standard Dag Disclaimer
‣ I’m not an expert, pundit,
visionary or “thought leader”
‣ Any career success entirely due
to shamelessly copying what
actual smart people do
‣ I’m biased, burnt-out & cynical
‣ Filter my words accordingly
So why are you here?
It’s a risky time to be doing Bio-IT
Science progressing way faster than IT can refresh/change
The Central Problem Is ...
‣ Instrumentation & protocols are changing FAR
FASTER than we can refresh our Research-IT &
Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every
2-7 years
‣ We have to design systems TODAY that can
support unknown research requirements &
workflows over many years (gulp ...)
The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago we could toss
inexpensive storage and
servers at the problem;
even in a nearby closet or
under a lab bench if
necessary
‣ That does not work any
more; real solutions
required
The new normal.
We are here today because ...
‣ It has never been easier or
cheaper to acquire vast
amounts of data
‣ Growth rate of data creation/
ingest exceeds rate at which
the storage industry is
improving disk capacity
‣ Not just a storage lifecycle
problem. This data *moves*
and often needs to be shared
among multiple entities and
providers
• ... ideally without punching holes in
your firewall or consuming all
available internet bandwidth
If you get it wrong ...
‣ Lost opportunity
‣ Missing capability
‣ Frustrated & very vocal scientific staff
‣ Problems in recruiting, retention,
publication & product development
Topic #1: “Meta Issues”
Get Comfortable With Insane Rates Of Change
Meta Issue #1
‣ Genome Sequencing
innovation rate
is simply ludicrous.
‣ Similar rapid change in
tools & lab protocols
‣ We MUST acknowledge
and plan for disruptive
science affecting IT
systems and methods
Get Comfortable With Insane Rates Of Change
Meta Issue #1, continued
‣ Multiple ways to approach this, often affected
by how your funding cycle works
1. Over-design & Over-provision
2. Spend upfront on agility & scaling at the IT core
3. Incremental refresh of smaller, “right-sized” systems
Know Your Platforms
Meta Issue #2
‣ NGS Platform has huge
impact on IT footprint
‣ Which category do you fall in?
• Single NGS instrument? Multiple
NGS instruments?
• One NGS vendor or many?
• Outsourced sequencing?
• Outsourced sequencing +
analysis?
• Hybrid mix of onsite & outsourced?
Understand The Duty Cycle
Meta Issue #3
‣ NGS “duty cycle” also has huge influence on size
and shape of the IT footprint
‣ What type of lab are you running?
• 24x7 NGS Industrial Production?
• Central NGS Core Facility?
• Single PI, Department or Workgroup?
‣ Operational & Analysis Workflow?
• Sequence .... Analyze .... Sequence?
• Sequence ... Sequence ... Sequence ... Analysis
Know The Tool Landscape
Meta Issue #4
‣ Analysis & data tools are changing almost as
rapidly as NGS chemistry & platforms
• Open source, commercial, on-premise and cloud-hosted
are all in the mix for 2013
‣ Understand that your software landscape and
toolchain may change significantly (multiple
times) over the lifespan of your NGS efforts
‣ IT people also need solid understanding of
WHO makes algorithm & toolchain decisions
Diagram Your Pipelines & Data Flows
Meta Issue #5
‣ Different groups with the same NGS platform
will do wildly different things
• Low end example: push FASTQ into CLCBio, export
VCF files and “call it a day”
• High end example: custom reference alignment or
de-novo assembly followed by intense human-driven
bioinformatics
• Complex example: genomic medicine & NGS data
being used to drive clinical decisions
Can we do an NGS talk without using the ‘C’ word?
Whether you like it or not ...
NGS & “The Cloud”
‣ You can’t ignore
or avoid the cloud
‣ Period.
Whether you like it or not ...
NGS & “The Cloud”
‣ Why you can’t ignore the cloud in 2013
1. NGS data flows small enough to allow “write to cloud”
2. NGS vendors are forcing the issue
3. Our local storage is increasingly becoming “cloud aware”
4. Your users may prefer a cloud-hosted solution
5. Sequencing partners can deliver directly to the cloud
6. Easier to partner on NGS analysis & data distribution
7. Cloud economic (particularly storage) trends are clear
Whether you like it or not ...
NGS & “The Cloud”
‣ Why you need to start work NOW
‣ Blunt Truth: 90% of cloud technical bits are easy to
understand and fast to implement.
‣ Almost frictionless to access the cloud and NGS
vendors have a vested interest in making it faster and
easier; cloud may be in use today without your
knowledge
‣ Risk of scientists bypassing/leapfrogging internal IT
Whether you like it or not ...
NGS & “The Cloud”
‣ You need to start NOW because of the 10% of
“cloud stuff” that is neither fast nor easy ...
• Internal policies/procedures & risk assessment
• Adding additional internet capacity takes time
• Safe networking, firewall, VPN, VPC and Identity
Management implementations require experts to design
and potentially lengthy implementation periods
‣ “Accessing” cloud is easy. Using it properly,
safely & persistently is not easy and not trivial.
Subnets & VPC can be more complex than the compute & storage
Storage.
(the hard bit ...)
Storage & Information Management
‣ Compute power in 2013 is a cheap
commodity
‣ Storage? Not so much.
• Still many ways to spectacularly waste
money
• Incredible diversity of vendors, products &
capability
‣ A significant percentage of your
budget and pre-purchase design
efforts should center around
storage, data movement & data
lifecycle management
Storage & Information Management
‣ Only time for 3 bits of
advice:
1. The need for a default
storage ‘stance’
2. WHAT you store is as
important as HOW you
store it
3. Importance of non-crap
metrics
Storage ‘Stance’
‣ Storage landscape is
immense & diverse
• 100TB storage can be bought
for $12,000 - $400,000
‣ You need a ‘default stance’
‣ Good news is you have
many options
‣ ... your ‘stance’ is often
defined by how your budget
and funding cycles work
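To make the spread concrete, the price range quoted above for 100TB can be turned into a cost per terabyte. This is a minimal arithmetic sketch; the function name `cost_per_tb` is only illustrative and the figures are the two endpoints from the slide.

```python
def cost_per_tb(total_cost_usd: float, capacity_tb: float) -> float:
    """Return the purchase cost per terabyte of capacity."""
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return total_cost_usd / capacity_tb


# The two endpoints quoted for a 100TB purchase.
low = cost_per_tb(12_000, 100)
high = cost_per_tb(400_000, 100)
print(f"${low:,.0f}/TB to ${high:,.0f}/TB, a {high / low:.0f}x spread")
```

Purchase price is only one input; operational burden and staffing, which the options that follow weigh, are not captured by this number.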
Monolithic “all-in-one”
Storage ‘Stance’ - Option 1
‣ Sized for the future on Day #1
• Purchased upfront with future need in mind; looks
overbuilt and over-provisioned early on
‣ Good for:
• Groups with “one-shot” funding & little refresh chances
over the lifespan of the platform
‣ Not great if:
• Business or science changes unexpectedly
• You did your sizing/scaling/growth calculations wrong
Tier 1 Storage - Easy to Grow & Manage
Storage ‘Stance’ - Option 2
‣ Invest upfront in peta-capable single-namespace & low operational
burden
• Enterprise-grade storage that is very easy to manage, maintain and grow over time
‣ Good for:
• Organizations where getting new headcount is harder than spending CapEx;
Intentional spend on hardware that does not require additional humans to run &
maintain it.
• Organizations with budgeting that allows for incremental refresh cycles
• Organizations without onsite gurus & dedicated storage admins
‣ Downside:
• Upfront & ongoing investment can be large; possibly affecting compute, tools or
software budget
• Expensive relative to alternatives. You are paying for “future-proof” scalability &
systems engineered for the lowest possible operational burden
Getting clever & straddling Tier 1 and Tier 2 Storage
Storage ‘Stance’ - Option 3
‣ Strive for peta-capable single-namespace and easy operation
• ... but be willing to make modest trade-offs in exchange for lower cost
• Look at both Tier 1 and Tier 2 storage vendors (and people like Cambridge
Computer)
‣ Good for:
• Organizations willing to take a more active role in vendor selection, design,
deployment & operation
• Organizations motivated by ROI and willing to make modest trade-offs in
capability, performance or operational burden in exchange for lower CapEx cost
‣ Downside:
• More risk in this area - easy to make a misguided decision. Requires brains &
active interest in pre-sale design and vendor selection process. May require
more storage admin effort day-to-day. Some trade-offs are better than others.
Clever but not dumb
Storage ‘Stance’ - Option 4
‣ Midrange “Cheap & Clever”
• There are tons of very interesting Tier2 and Tier3 storage options available.
The hard part is separating the good stuff from the crap stuff.
• Check out: RAID Inc, NexentaStor, NexSan, etc. etc.
‣ Good for:
• Budget constrained groups with motivated IT people
‣ Downside
• Might have to throw away stuff as you outgrow it (“forklift upgrade”)
• Careful pre-purchase work required to properly config/size it
• Storage design may force changes on scientific workflows
• Higher administrative burden
DIY & Super Cheap
Storage ‘Stance’ - Option 5
‣ DIY & Disruptive
• Incredibly disruptive stuff is out there for motivated DIY’ers and people who
can’t afford Tier 1 and Tier 2 platforms; This is where you can spend
$12,000 on a 100TB storage node.
• Driven largely by high-density x86_64 server chassis and many people
writing clever software (both free and commercial). There are NAS, SAN,
Parallel and Distributed filesystem options all in this realm
‣ Good for:
• Smart people & organizations with guru storage & sysadmin resources.
• People with no money or people who spent all their money on NGS
instrument & reagents and “forgot about all that IT stuff ...”
‣ Downside
• Non-trivial risk to science. Catastrophic data loss and science-disrupting
downtime can all easily occur down at this level; Mess up badly and you will
LOSE YOUR JOB
WHAT you store is as important as HOW you store it
Information Management
‣ Often Overlooked
• Hopefully previous speakers convinced you of the value
gained from “information lifecycle management”
‣ The Core Problem
• POSIX filesystem semantics are insufficient for storing all
of the attributes and information we want to tag our data
with
• ... “something else” is required
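One very simple form that “something else” could take is a sidecar file that carries the attributes a POSIX filesystem has no place for. The sketch below is an assumption for illustration only, not something the talk prescribes: the `.meta.json` suffix, the function names and the example attribute keys are all hypothetical.

```python
import json
from pathlib import Path


def sidecar_path(data_path) -> Path:
    """Return the hypothetical sidecar location for a data file."""
    return Path(str(data_path) + ".meta.json")


def write_sidecar(data_path, attributes: dict) -> Path:
    """Store descriptive attributes next to the data file as JSON."""
    target = sidecar_path(data_path)
    target.write_text(json.dumps(attributes, indent=2, sort_keys=True))
    return target


def read_sidecar(data_path) -> dict:
    """Load the attributes previously stored for a data file."""
    return json.loads(sidecar_path(data_path).read_text())
```

A call such as `write_sidecar("run42.fastq", {"platform": "Illumina", "data_class": "raw"})` would record the tags beside the file. Sidecars are easy to start with but can drift out of sync when files move, which is one reason larger sites look at dedicated metadata systems.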
Something else is required ...
Information Management
‣ The large NGS heavy hitters are all looking at
“metadata aware” storage as the ultimate
solution
‣ Small & midrange NGS shops usually
leverage LIMS with a bit of storage
reporting/analytics
‣ LIMS warning:
• NGS vendors tend to assume you will only use NGS
instruments that they make! Their software may not
handle a future “multi-platform” NGS environment
• Beware of the time/effort/cost required to modify
many LIMS systems that are on the market today
• BioTeam consulting has resulted in some products
being made in this space
- MiniLIMS, Slipstream NGS & Galaxy Editions
Importance of ‘non-crap’ metrics
Storage Metrics
‣ It is VERY important that you understand what you
are storing and what the short, medium and long-
term trend lines look like
‣ Very few people actually bother to do this
‣ ... and many that do end up producing pretty graphs
that look good on dashboards but don’t actually
help drive scaling, refresh or upgrade decisions.
‣ You need metrics that can drive actionable
decisions related to storage management and
growth
Some (Biased) Examples ...
It’s 2013 ... we know what questions to ask about our storage
A 6 month rolling window provides real/actionable info ...
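A rolling window becomes actionable once it is turned into a projection. The sketch below is one hedged way to do that: it averages monthly growth over the last six months and estimates how long the remaining capacity will last. The function name, the sample figures and the 100 TB capacity are all made up for illustration; they are not data from the talk.

```python
from datetime import date


def months_until_full(samples, capacity_tb, window=6):
    """Estimate months of headroom from monthly usage samples.

    samples: list of (date, used_tb) tuples, one per month, oldest first.
    Growth is averaged over the last `window` months.
    Returns None when usage is flat or shrinking over the window.
    """
    if len(samples) < window + 1:
        raise ValueError("need at least window + 1 monthly samples")
    recent = samples[-(window + 1):]
    growth_per_month = (recent[-1][1] - recent[0][1]) / window
    if growth_per_month <= 0:
        return None
    free_tb = capacity_tb - recent[-1][1]
    return free_tb / growth_per_month


# Hypothetical monthly samples of TB used.
usage = [
    (date(2012, 12, 1), 40.0),
    (date(2013, 1, 1), 44.0),
    (date(2013, 2, 1), 49.0),
    (date(2013, 3, 1), 53.0),
    (date(2013, 4, 1), 58.0),
    (date(2013, 5, 1), 64.0),
    (date(2013, 6, 1), 70.0),
]
print(months_until_full(usage, capacity_tb=100.0))
```

A linear projection is a deliberate simplification; a new instrument or protocol can change the slope overnight, so the number is a prompt for a planning conversation rather than a forecast.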
Critical to have a handle on “raw” vs “derived” data also
‣ PacBio: Raw 70%, Derived 30% (1.55 TB vs .569 TB)
‣ Roche454: Raw 86%, Derived 14% (4.55 TB vs .757 TB)
‣ Illumina: Raw 85%, Derived 15% (10.171 TB vs 1.86 TB)
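The raw versus derived split can be recomputed directly from the per-platform sizes. This is a small sketch using the terabyte figures quoted above; the function name is illustrative. Note that recomputed percentages need not match the chart labels exactly: the PacBio sizes as listed work out closer to 73% raw than the 70% shown.

```python
def raw_derived_split(raw_tb: float, derived_tb: float):
    """Return (raw_percent, derived_percent) of the combined footprint."""
    total = raw_tb + derived_tb
    if total <= 0:
        raise ValueError("combined size must be positive")
    return 100 * raw_tb / total, 100 * derived_tb / total


# Sizes in TB as quoted for each platform: (raw, derived).
platforms = {
    "PacBio": (1.55, 0.569),
    "Roche454": (4.55, 0.757),
    "Illumina": (10.171, 1.86),
}

for name, (raw, derived) in platforms.items():
    raw_pct, derived_pct = raw_derived_split(raw, derived)
    print(f"{name}: raw {raw_pct:.0f}%, derived {derived_pct:.0f}%")
```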
Physical & Network NGS Ingest
You need a plan for both network and physical ingest
NGS Data Ingest
‣ Whatever your ‘stance’ is today regarding
ingest of external NGS data it will almost
certainly change over time
• ... interesting public domain data sets
• Data from collaborators & partners
• Moving data within your own organization
‣ Plan for both ‘network’ and ‘physical’ methods
You need a plan for both network and physical ingest
NGS Data Ingest
‣ Ingest is hard. It may seem easy but it’s not,
especially if you care about data integrity.
• Are you validating MD5 checksums on every file each
time it moves from location A to location B?
‣ ... it can also sap a lot of time and effort from
your staff if done ad-hoc or in a disorganized
way
‣ Both physical and network-based ingest require
non-trivial amounts of upfront thought. Some
infrastructure & software may also be required
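The checksum question above can be answered with very little software. A minimal sketch, assuming plain files on locally mounted storage; the function names are illustrative, and `hashlib` is the Python standard library module:

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source_path, dest_path):
    """Return True when source and destination have identical MD5 digests."""
    return md5sum(source_path) == md5sum(dest_path)
```

Reading in chunks keeps memory use flat even for very large files. MD5 is used here because it is the checksum named above; it detects accidental corruption but is not suitable as a defense against deliberate tampering.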
“Naked” Data Movement
Physical data movement station; Unit = Naked Disk
“Naked” Data Archive
Cloud/Network-based Data Movement
High speed 7+ hour sustained transfer from US East to West Coast
Sufficient for an NGS core facility ...
You need a plan for both network and physical ingest
Physical NGS Data Ingest
‣ Physical ingest is best done with dedicated hardware
and (ideally) a dedicated workstation
‣ Things to think about
• How are you labeling/storing/tracking physical media? Who does
the work? Expensive PhD? IT staff? Is there a written SOP guiding
the process?
• How does physical media end up at your loading dock? Where
does it go after that?
• Is your ingest workstation fast enough to handle MD5
checksumming on the fly? Enough RAM for lots of TCP sessions?
• Is your ingest station physically located in an optimal network
location to facilitate the data movement to core storage?
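Checksumming “on the fly” means hashing the bytes as they stream through the ingest station rather than reading each file a second time. A rough sketch of that idea, with an illustrative function name and an assumed 4 MiB chunk size:

```python
import hashlib


def copy_with_md5(source_path, dest_path, chunk_size=4 * 1024 * 1024):
    """Copy a file and return the MD5 of the bytes read, hashed during the copy."""
    digest = hashlib.md5()
    with open(source_path, "rb") as src, open(dest_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)  # hash the chunk while it is in memory
            dst.write(chunk)
    return digest.hexdigest()
```

The returned digest describes what was read from the source media. Confirming what actually landed on core storage still requires re-reading the destination, so the on-the-fly hash saves one pass rather than replacing verification.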
You need a plan for both network and physical ingest
Network NGS Data Ingest
‣ Network ingest (at high speed) requires advance
planning and potential infrastructure
‣ Things to think about
• Commercial via Aspera? OpenSource via GridFTP? Something
else?
• How exactly will you safely get data inside your organization via the
internet? How do you move from DMZ through firewall and onto
your internal scientific IT systems?
• Can you move data at speed without taking down VOIP and
Teleconferencing systems or making network admins cry?
• Will the IDS or Firewall doing deep packet inspection or protocol
reassembly melt under the load?
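A back-of-the-envelope estimate helps frame the bandwidth questions above before any tooling is chosen. The sketch below assumes decimal units (1 TB = 8e12 bits, 1 Gbps = 1e9 bits per second) and an `efficiency` factor of 0.8, which is a placeholder assumption for protocol overhead and contention, not a measured value.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate hours needed to move data_tb terabytes over a link_gbps link.

    efficiency is the assumed fraction of line rate actually achieved.
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_tb * 8e12
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


# Example: 10 TB over a 1 Gbps link at the assumed 80% efficiency.
print(f"{transfer_hours(10, 1):.1f} hours")
```

At these assumed numbers a 10 TB delivery occupies a 1 Gbps link for more than a day, which is the kind of load that collides with VOIP and teleconferencing traffic unless it is scheduled or shaped.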
Wrap-up: Summary Tips
Ending Advice: 1 of 6
‣ Understand the ‘interesting time’ we are in
• Science is changing faster than we can refresh IT
• Disruptive innovation in the NGS space itself
‣ Advice:
• Spend as much time thinking about future flexibility as
you spend on actual current needs & requirements
Ending Advice: 2 of 6
‣ NGS Assumptions don’t last very long
• Will you change NGS vendor, platform or method?
• Will the tools in use today still be in use tomorrow?
• How will the “local vs. outsourced vs. cloud” landscape
change for you over the next few years?
‣ Advice:
• Avoid things that lock you into a vendor or platform
• Look long and hard at your default assumptions
Ending Advice: 3 of 6
‣ You need Physical & Network Ingest Planning
• You may have standardized on one method or practice
but there will always be outliers and unexpected
situations; Data always seems to be on the move!
- NGS data volumes mean outliers are non-trivial to handle
‣ Advice:
• Just think about how you would handle the edge cases
and unexpected; don’t go crazy with upfront investment.
Ending Advice: 4 of 6
‣ You need a cloud strategy. Today.
- Your users or vendors may force the issue
- The economic trend lines make cloud inescapable
- 90% of cloud is “easy”. Remaining 10% takes time & effort
‣ Advice:
• 100% Cloud is not unreasonable in 2013*
• Do the boring/long work now (policies, procedure, etc.)
• Consider laying the tech groundwork (Bandwidth, VPN, VPC
& Identity Management) now so you can easily and simply
make use of the cloud when needed
Ending Advice: 5 of 6
‣ Compute & Analysis
‣ Advice:
• Compute power is essentially a commodity in 2013
- Both local and “on the cloud”
• Easy and relatively inexpensive to acquire and deploy
• There are some potential ‘gotcha’ and tuning areas that
deserve advance thought and attention
- ... but relative to storage & data it’s an “easy” problem area
Ending Advice: 6 of 6
‣ Storage & Data Management
‣ Advice:
• Bulk of your attention & budget goes here
• Huge diversity in product and feature offerings means
more risk & more chances of mistakes
- Outside expertise & NGS-aware vendors like Accunet &
Cambridge Computer really can act as “value added resellers”
• Pick the “default stance” that best matches your
organization’s funding & staffing model and build around
that
end; Thanks!
Slides: http://slideshare.net/chrisdag/
Infrastructure Tour
What does this stuff look like?
The cliche image
Lab-local HPC & storage
More lab-local kit
Small core w/ multiple NGS instrument support
Small cluster; large storage
Mid-sized core facility
Large Core Facility
Large Core Facility: Just Storage
Regional Scientific Computing “Hub”
Petabyte-scale Storage
Yep. This counts.
16 monster compute nodes + 22 GPU nodes
Cost? 30 bucks an hour via AWS Spot Market
Real world screenshot from mid-2012

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

GTC 2013: Practical NGS Data Management

  • 1. 1 Practical NGS Data Management 2013 GTC Bioinformatics & Data Management Strategies - San Francisco Wednesday, June 19, 13
  • 2. I’m Chris. I’m an infrastructure geek. I work for the BioTeam. www.bioteam.net - Twitter: @chris_dag
  • 3. Who, What, Why ... BioTeam ‣ Independent consulting shop ‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done ‣ 10+ years bridging the “gap” between science, IT & high performance computing ‣ We get to see how many groups of smart people tackle similar problems
  • 4. Standard Dag Disclaimer: Listen to me at your own risk ‣ I’m not an expert, pundit, visionary or “thought leader” ‣ Any career success entirely due to shamelessly copying what actual smart people do ‣ I’m biased, burnt-out & cynical ‣ Filter my words accordingly
  • 5. So why are you here?
  • 6. It’s a risky time to be doing Bio-IT
  • 7. The Central Problem Is ... Science progressing way faster than IT can refresh/change ‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure • Bench science is changing month-to-month ... • ... while our IT infrastructure only gets refreshed every 2-7 years ‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
  • 8. The Central Problem Is ... ‣ The easy period is over ‣ 5 years ago we could toss inexpensive storage and servers at the problem; even in a nearby closet or under a lab bench if necessary ‣ That does not work any more; real solutions required
  • 10. We are here today because ... ‣ It has never been easier to acquire vast amounts of data cheaply and easily ‣ Growth rate of data creation/ingest exceeds rate at which the storage industry is improving disk capacity ‣ Not just a storage lifecycle problem. This data *moves* and often needs to be shared among multiple entities and providers • ... ideally without punching holes in your firewall or consuming all available internet bandwidth
  • 11. If you get it wrong ... ‣ Lost opportunity ‣ Missing capability ‣ Frustrated & very vocal scientific staff ‣ Problems in recruiting, retention, publication & product development
  • 12. Topic #1: “Meta Issues”
  • 13. Meta Issue #1: Get Comfortable With Insane Rates Of Change ‣ Genome Sequencing innovation rate is simply ludicrous. ‣ Similar rapid change in tools & lab protocols ‣ We MUST acknowledge and plan for disruptive science affecting IT systems and methods
  • 14. Meta Issue #1, continued: Get Comfortable With Insane Rates Of Change ‣ Multiple ways to approach this, often affected by how your funding cycle works 1. Over-design & Over-provision 2. Spend upfront on agility & scaling at the IT core 3. Incremental refresh of smaller, “right-sized” systems
  • 15. Meta Issue #2: Know Your Platforms ‣ NGS Platform has huge impact on IT footprint ‣ Which category do you fall in? • Single NGS instrument? Multiple NGS instruments? • One NGS vendor or many? • Outsourced sequencing? • Outsourced sequencing + analysis? • Hybrid mix of onsite & outsourced?
  • 16. Meta Issue #3: Understand The Duty Cycle ‣ NGS “duty cycle” also has huge influence on size and shape of the IT footprint ‣ What type of lab are you running? • 24x7 NGS Industrial Production? • Central NGS Core Facility? • Single PI, Department or Workgroup? ‣ Operational & Analysis Workflow? • Sequence ... Analyze ... Sequence? • Sequence ... Sequence ... Sequence ... Analysis?
  • 17. Meta Issue #4: Know The Tool Landscape ‣ Analysis & data tools are changing almost as rapidly as NGS chemistry & platforms • Open source, commercial, on-premise and cloud-hosted are all in the mix for 2013 ‣ Understand that your software landscape and toolchain may change significantly (multiple times) over the lifespan of your NGS efforts ‣ IT people also need solid understanding of WHO makes algorithm & toolchain decisions
  • 18. Meta Issue #5: Diagram Your Pipelines & Data Flows ‣ Different groups with the same NGS platform will do wildly different things • Low end example: push FASTQ into CLCBio, export VCF files and “call it a day” • High end example: custom reference alignment or de novo assembly followed by intense human-driven bioinformatics • Complex example: genomic medicine & NGS data being used to drive clinical decisions
  • 19. Can we do an NGS talk without using the ‘C’ word?
  • 20. NGS & “The Cloud”: Whether you like it or not ... ‣ You can’t ignore or avoid the cloud ‣ Period.
  • 21. NGS & “The Cloud”: Whether you like it or not ... ‣ Why you can’t ignore the cloud in 2013 1. NGS data flows small enough to allow “write to cloud” 2. NGS vendors are forcing the issue 3. Our local storage is increasingly becoming “cloud aware” 4. Your users may prefer a cloud-hosted solution 5. Sequencing partners can deliver directly to the cloud 6. Easier to partner on NGS analysis & data distribution 7. Cloud economic (particularly storage) trends are clear
  • 22. NGS & “The Cloud”: Whether you like it or not ... ‣ Why you need to start work NOW ‣ Blunt Truth: 90% of cloud technical bits are easy to understand and fast to implement. ‣ Almost frictionless to access the cloud and NGS vendors have a vested interest in making it faster and easier; cloud may be in use today without your knowledge ‣ Risk of scientists bypassing/leapfrogging internal IT
  • 23. NGS & “The Cloud”: Whether you like it or not ... ‣ You need to start NOW because of the 10% of “cloud stuff” that is neither fast nor easy ... • Internal policies/procedures & risk assessment • Adding additional internet capacity takes time • Safe networking, firewall, VPN, VPC and Identity Management implementations require experts to design and potentially lengthy implementation periods ‣ “Accessing” cloud is easy. Using it properly, safely & persistently is not easy and not trivial.
  • 24. Subnets & VPC can be more complex than the compute & storage
  • 25. Storage. (the hard bit ...)
  • 26. Storage & Information Management ‣ Compute power in 2013 is a cheap commodity ‣ Storage? Not so much. • Still many ways to spectacularly waste money • Incredible diversity of vendors, products & capability ‣ A significant percentage of your budget and pre-purchase design efforts should center around storage, data movement & data lifecycle management
  • 27. Storage & Information Management ‣ Only time for 3 bits of advice: 1. The need for a default storage ‘stance’ 2. WHAT you store is as important as HOW you store it 3. Importance of non-crap metrics
  • 28. Storage ‘Stance’ ‣ Storage landscape is immense & diverse • 100TB storage can be bought for $12,000 - $400,000 ‣ You need a ‘default stance’ ‣ Good news is you have many options ‣ ... your ‘stance’ is often defined by how your budget and funding cycles work
  • 29. Storage ‘Stance’ - Option 1: Monolithic “all-in-one” ‣ Sized for the future on Day #1 • Purchased upfront with future need in mind; looks overbuilt and over-provisioned early on ‣ Good for: • Groups with “one-shot” funding & little chance of a refresh over the lifespan of the platform ‣ Not great if: • Business or science changes unexpectedly • You did your sizing/scaling/growth calculations wrong
  • 30. Storage ‘Stance’ - Option 2: Tier 1 Storage - Easy to Grow & Manage ‣ Invest upfront in peta-capable single-namespace & low operational burden • Enterprise-grade storage that is very easy to manage, maintain and grow over time ‣ Good for: • Organizations where getting new headcount is harder than spending CapEx; intentional spend on hardware that does not require additional humans to run & maintain it. • Organizations with budgeting that allows for incremental refresh cycles • Organizations without onsite gurus & dedicated storage admins ‣ Downside: • Upfront & ongoing investment can be large; possibly affecting compute, tools or software budget • Expensive relative to alternatives. You are paying for “future-proof” scalability & systems engineered for the lowest possible operational burden
  • 31. Storage ‘Stance’ - Option 3: Getting clever & straddling Tier 1 and Tier 2 Storage ‣ Strive for peta-capable single-namespace and easy operation • ... but be willing to make modest trade-offs in exchange for lower cost • Look at both Tier 1 and Tier 2 storage vendors (and people like Cambridge Computer) ‣ Good for: • Organizations willing to take a more active role in vendor selection, design, deployment & operation • Organizations motivated by ROI and willing to make modest trade-offs in capability, performance or operational burden in exchange for lower CapEx cost ‣ Downside: • More risk in this area - easy to make a misguided decision. Requires brains & active interest in pre-sale design and vendor selection process. May require more storage admin effort day-to-day. Some trade-offs are better than others.
  • 32. Storage ‘Stance’ - Option 4: Clever but not dumb ‣ Midrange “Cheap & Clever” • There are tons of very interesting Tier 2 and Tier 3 storage options available. The hard part is separating the good stuff from the crap stuff. • Check out: RAID Inc, NexentaStor, NexSan, etc. ‣ Good for: • Budget constrained groups with motivated IT people ‣ Downside: • Might have to throw away stuff as you outgrow it (“forklift upgrade”) • Careful pre-purchase work required to properly config/size it • Storage design may force changes on scientific workflows • Higher administrative burden
  • 33. Storage ‘Stance’ - Option 5: DIY & Super Cheap ‣ DIY & Disruptive • Incredibly disruptive stuff is out there for motivated DIY’ers and people who can’t afford Tier 1 and Tier 2 platforms. This is where you can spend $12,000 on a 100TB storage node. • Driven largely by high-density x86_64 server chassis and many people writing clever software (both free and commercial). There are NAS, SAN, Parallel and Distributed filesystem options all in this realm ‣ Good for: • Smart people & organizations with guru storage & sysadmin resources. • People with no money or people who spent all their money on NGS instrument & reagents and “forgot about all that IT stuff ...” ‣ Downside: • Non-trivial risk to science. Catastrophic data loss and science-disrupting downtime can all easily occur down at this level. Mess up badly and you will LOSE YOUR JOB
  • 34. WHAT you store is as important as HOW you store it
  • 35. Information Management ‣ Often Overlooked • Hopefully previous speakers convinced you of the value gained from “information lifecycle management” ‣ The Core Problem • POSIX filesystem semantics are insufficient for storing all of the attributes and information we want to tag our data with • ... “something else” is required
  • 36. Information Management: Something else is required ... ‣ The large NGS heavy hitters are all looking at “metadata aware” storage as the ultimate solution ‣ Small & midrange NGS shops usually leverage LIMS with a bit of storage reporting/analytics ‣ LIMS warning: • NGS vendors tend to assume you will only use NGS instruments that they make! Their software may not handle a future “multi-platform” NGS environment • Beware of the time/effort/cost required to modify many LIMS systems that are on the market today • BioTeam consulting has resulted in some products being made in this space - MiniLIMS, Slipstream NGS & Galaxy Editions
  • 37. Importance of ‘non-crap’ metrics
  • 38. Storage Metrics ‣ It is VERY important that you understand what you are storing and what the short, medium and long-term trend lines look like ‣ Very few people actually bother to do this ‣ ... and many that do end up producing pretty graphs that look good on dashboards but don’t actually help drive scaling, refresh or upgrade decisions. ‣ You need metrics that can drive actionable decisions related to storage management and growth
  • 39. Some (Biased) Examples ...
  • 40. It’s 2013 ... we know what questions to ask about our storage
  • 41. A 6 month rolling window provides real/actionable info ...
  • 42. Critical to have a handle on “raw” vs “derived” data also ‣ PacBio: Raw 70%, Derived 30% (1.55 TB vs 0.569 TB) ‣ Roche454: Raw 86%, Derived 14% (4.55 TB vs 0.757 TB) ‣ Illumina: Raw 85%, Derived 15% (10.171 TB vs 1.86 TB)
  • 43. Physical & Network NGS Ingest
  • 44. NGS Data Ingest: You need a plan for both network and physical ingest ‣ Whatever your ‘stance’ is today regarding ingest of external NGS data it will almost certainly change over time • ... interesting public domain data sets • Data from collaborators & partners • Moving data among your own organization ‣ Plan for both ‘network’ and ‘physical’ methods
  • 45. NGS Data Ingest: You need a plan for both network and physical ingest ‣ Ingest is hard. It may seem easy but it’s not, especially if you care about data integrity. • Are you validating MD5 checksums on every file each time it moves from location A to location B? ‣ ... it can also sap a lot of time and effort from your staff if done ad-hoc or in a disorganized way ‣ Both physical and network-based ingest require non-trivial amounts of upfront thought. Some infrastructure & software may also be required
  • 47. Physical data movement station; Unit = Naked Disk
  • 49. Cloud/Network-based Data Movement: high-speed, 7+ hour sustained transfer from US East to West Coast. Sufficient for an NGS core facility ...
  • 50. Physical NGS Data Ingest: You need a plan for both network and physical ingest ‣ Physical ingest is best done with dedicated hardware and (ideally) a dedicated workstation ‣ Things to think about • How are you labeling/storing/tracking physical media? Who does the work? Expensive PhD? IT staff? Is there a written SOP guiding the process? • How does physical media end up at your loading dock? Where does it go after that? • Is your ingest workstation fast enough to handle MD5 checksumming on the fly? Enough RAM for lots of TCP sessions? • Is your ingest station physically located in an optimal network location to facilitate the data movement to core storage?
  • 51. Network NGS Data Ingest: You need a plan for both network and physical ingest ‣ Network ingest (at high speed) requires advance planning and potential infrastructure ‣ Things to think about • Commercial via Aspera? Open source via GridFTP? Something else? • How exactly will you safely get data inside your organization via the internet? How do you move from DMZ through firewall and onto your internal scientific IT systems? • Can you move data at speed without taking down VoIP and teleconferencing systems or making network admins cry? • Will the IDS or firewall doing deep packet inspection or protocol reassembly melt under the load?
  • 53. Ending Advice: 1 of 6 ‣ Understand the ‘interesting time’ we are in • Science is changing faster than we can refresh IT • Disruptive innovation in the NGS space itself ‣ Advice: • Spend as much time thinking about future flexibility as you spend on actual current needs & requirements
  • 54. Ending Advice: 2 of 6 ‣ NGS assumptions don’t last very long • Will you change NGS vendor, platform or method? • Will the tools in use today still be in use tomorrow? • How will the “local vs. outsourced vs. cloud” landscape change for you over the next few years? ‣ Advice: • Avoid things that lock you into a vendor or platform • Look long and hard at your default assumptions
  • 55. Ending Advice: 3 of 6 ‣ You need Physical & Network Ingest Planning • You may have standardized on one method or practice but there will always be outliers and unexpected situations; data always seems to be on the move! - NGS data volumes mean outliers are non-trivial to handle ‣ Advice: • Just think about how you would handle the edge cases and the unexpected; don’t go crazy with upfront investment.
  • 56. Ending Advice: 4 of 6 ‣ You need a cloud strategy. Today. - Your users or vendors may force the issue - The economic trend lines make cloud inescapable - 90% of cloud is “easy”. Remaining 10% takes time & effort ‣ Advice: • 100% Cloud is not unreasonable in 2013* • Do the boring/long work now (policies, procedure, etc.) • Consider laying the tech groundwork (Bandwidth, VPN, VPC & Identity Management) now so you can easily and simply make use of the cloud when needed
  • 57. Ending Advice: 5 of 6 ‣ Compute & Analysis ‣ Advice: • Compute power is essentially a commodity in 2013 - Both local and “on the cloud” • Easy and relatively inexpensive to acquire and deploy • There are some potential ‘gotcha’ and tuning areas that deserve advance thought and attention - ... but relative to storage & data it’s an “easy” problem area
  • 58. Ending Advice: 6 of 6 ‣ Storage & Data Management ‣ Advice: • Bulk of your attention & budget goes here • Huge diversity in product and feature offerings means more risk & more chances of mistakes - Outside expertise & NGS-aware vendors like Accunet & Cambridge Computer really can act as “value added resellers” • Pick the “default stance” that best matches your organization’s funding & staffing model and build around that
  • 61. Infrastructure Tour: What does this stuff look like?
  • 63. Lab-local HPC & storage
  • 65. Small core w/ multiple NGS instrument support
  • 66. Small cluster; large storage
  • 69. Large Core Facility: Just Storage
  • 70. Regional Scientific Computing “Hub”
  • 72. Yep. This counts. 16 monster compute nodes + 22 GPU nodes. Cost? 30 bucks an hour via AWS Spot Market. Real world screenshot from mid-2012
  • 73. Physical data movement station
  • 74. Physical data movement station; Unit = Naked Disk
  • 75. Cloud/Network-based Data Movement: high-speed, 7+ hour sustained transfer from US East to West Coast. Sufficient for an NGS core facility ...
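
Slide 45 asks whether MD5 checksums are validated on every file each time it moves from location A to location B. The slides do not show how this was done, so what follows is only a minimal Python sketch of the idea using the standard library. The function names `md5sum` and `copy_verified` are hypothetical and chosen for illustration; a real ingest station would more likely wrap an existing transfer tool than copy files this way.

```python
import hashlib
import shutil
from pathlib import Path


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large sequencing files never have to fit in RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(src, dst):
    """Copy src to dst, then re-read dst and compare MD5 digests.

    Hypothetical helper: raises IOError on a mismatch and returns
    the digest on success so it can be logged or stored.
    """
    src, dst = Path(src), Path(dst)
    expected = md5sum(src)                      # digest at location A
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    actual = md5sum(dst)                        # digest at location B
    if actual != expected:
        raise IOError(f"checksum mismatch for {dst}: {expected} != {actual}")
    return actual
```

Recording the returned digest alongside the file (in a LIMS or a sidecar manifest, for example) is what allows the same check to be repeated on the next move. Note that MD5 here guards against accidental corruption in transit, not deliberate tampering.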