BioTeam Trends from the Trenches - NIH, April 2014
1. Life Science HPC & Informatics: Trends from the Trenches
April 2014
2. Who, What, Why ...
BioTeam
‣ Independent consulting shop
‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done
‣ 12+ years bridging the “gap” between science, IT & high performance computing
‣ Our wide-ranging work is what gets us invited to speak at events like this ...
3. BioTeam & NIH
Active at NIH since 2008
‣ Our primary goal: make science easier for researchers at NIH via scientific computing
‣ Recently involved in many projects:
• NIH-Wide HPC Assessment
• NIAID HPC Assessment
• NIMH Bioinformatics Assessment
• NCATS IT/Informatics Assessment
• NIH Network Modernization Project
4. Topic 1: Scariest thing first ...
The biggest meta-issue facing life science informatics
5. It’s a risky time to be doing Bio-IT
6. Big Picture / Meta Issue
‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed
• Example: CCD sensor upgrade on that confocal microscopy rig just doubled storage requirements
• Example: The 2D ultrasound imager is now a 3D imager
• Example: Illumina HiSeq upgrade just doubled the rate at which you can acquire genomes. Massive downstream increase in storage, compute & data movement needs
‣ For the above examples, do you think IT was informed in advance?
7. The Central Problem Is ...
Science is progressing way faster than IT can refresh/change
‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7 years
‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
8. The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago we could toss inexpensive storage and servers at the problem, even in a nearby closet or under a lab bench if necessary
‣ That does not work any more; real solutions are required
10. And a related problem ...
‣ It has never been cheaper or easier to acquire vast amounts of data
‣ The growth rate of data creation/ingest exceeds the rate at which the storage industry is improving disk capacity
‣ Not just a storage lifecycle problem: this data *moves* and often needs to be shared among multiple entities and providers
• ... ideally without punching holes in your firewall or consuming all available internet bandwidth
11. If we get it wrong ...
‣ Lost opportunity
‣ Missing capability
‣ Frustrated & very vocal scientific staff
‣ Slowed pace of scientific discovery
‣ Problems in recruiting, retention, publication & product development
12. Basic Bio/IT Landscape
13. Core Compute
Compute-related design patterns are largely static
‣ Linux compute clusters are still the baseline compute platform
‣ Even our lab instruments know how to submit jobs to common HPC cluster schedulers (a submission sketch follows below)
‣ Compute is not hard. It’s a commodity that is easy to acquire & deploy in 2014
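To make the instrument-to-scheduler point concrete, here is a minimal sketch of an instrument-side submission hook. This is our illustration rather than anything shown in the talk; it assumes a Slurm scheduler, and the partition name and pipeline script are hypothetical.

```python
# Minimal sketch: a lab instrument (or its acquisition workstation)
# handing analysis work to a shared HPC cluster running Slurm.
import subprocess

def submit_analysis(run_dir: str) -> str:
    """Submit a batch job for a finished instrument run; return the job ID."""
    result = subprocess.run(
        ["sbatch", "--parsable",                  # print only the job ID
         "--partition=batch",                     # hypothetical partition name
         "--wrap", f"analyze_run.sh {run_dir}"],  # hypothetical pipeline script
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print("Submitted job", submit_analysis("/data/runs/2014-04-09"))
```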
14. File & Data Types
We have them all
‣ Massive text files
‣ Massive binary files
‣ Flatfile ‘databases’
‣ Spreadsheets everywhere
‣ Directories w/ 6 million files (a traversal sketch follows below)
‣ Large files: 600GB+
‣ Small files: 30KB or smaller
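Directories with millions of entries break naive tooling. A hedged sketch of how one might summarize such a tree without materializing giant listings, using Python’s streaming os.scandir(); the root path is hypothetical.

```python
# Sketch: count files and bytes under a huge directory tree.
# os.scandir() streams entries and reuses cached stat data, which
# matters when a single directory holds millions of files.
import os

def summarize(root: str):
    files = total_bytes = 0
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    files += 1
                    total_bytes += entry.stat(follow_symlinks=False).st_size
    return files, total_bytes

print(summarize("/data/instrument_dumps"))  # hypothetical path
```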
15. Application Characteristics
‣ Mostly SMP/threaded apps, performance-bound by I/O and/or RAM
‣ Hundreds of apps, codes & toolkits
‣ 1TB-2TB “high memory” nodes becoming essential
‣ Lots of Perl/Python/R
‣ MPI is rare (a serial/batch sketch follows below)
• Well-written MPI is even rarer
‣ Few MPI apps actually benefit from expensive low-latency interconnects*
• *Chemistry, modeling and structure work is the exception
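A toy sketch (ours, not the speaker’s) of the workload shape described above: many independent serial tasks fanned out over cores, rather than tightly coupled MPI ranks. The per-sample function is a stand-in.

```python
# Embarrassingly parallel batch work: no inter-task communication,
# so a process pool (or a scheduler array job) is all that is needed.
from multiprocessing import Pool

def analyze_sample(sample_id: int) -> str:
    # Stand-in for a serial, I/O- or RAM-bound analysis step.
    return f"sample-{sample_id}: done"

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        for line in pool.map(analyze_sample, range(100)):
            print(line)
```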
16. Storage & Data Management
‣ LifeSci core requirement:
• Shared, simultaneous read/write access across many instruments, desktops & HPC silos
‣ NAS = “easiest” option
• Scale-out NAS products are the mainstream standard
‣ Parallel & distributed storage for edge cases and large organizations with known performance needs
• Becoming much more common: GPFS has taken hold in LifeSci
17. Storage & Data Management
‣ Storage & data mgmt. is the #1 infrastructure headache in life science environments
‣ Most labs need “peta-capable” storage due to an unpredictable future
• Only a small % will actually hit 1PB
• Often forced to trade away performance in order to obtain capacity
‣ Object stores, ZFS and commodity “Nexentastor-style” methods are making significant inroads
18. Data Movement & Data Sharing
‣ Peta-scale data movement needs
• Within an organization
• To/from collaborators
• To/from suppliers
• To/from public data repos
‣ Peta-scale data sharing needs
• Collaborators and partners may be all over the world
19. Networking
‣ Major 2014 focus
‣ May surpass storage as our #1 infrastructure headache
‣ Why?
• Petascale storage is meaningless if you can’t access/move it
• 10-Gig, 40-Gig and 100-Gig networking will force significant changes elsewhere in the ‘bio-IT’ infrastructure
20. We Have Both Ingest Problems
Physical & network
‣ Significant physical ingest occurring in life science
• Standard media: naked SATA drives shipped via FedEx
‣ Cliche example:
• 30 genomes outsourced means 30 drives will soon be sitting in your mail pile
‣ Organizations often use similar methods to freight data between buildings and among geographic sites
21. Physical Ingest Is Just Plain Nasty
‣ Most common high-speed network: FedEx
‣ Easy to talk about in theory
‣ Seems “easy” to scientists and even IT at first glance
‣ Really, really nasty in practice
• Incredibly time-consuming
• Significant operational burden
• Easy to do badly / lose data (a checksum-manifest sketch follows below)
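One concrete way to avoid the “lose data” failure mode: write a checksum manifest before a drive ships and verify it on arrival. A hedged sketch, ours rather than a tool from the talk; the paths are hypothetical.

```python
# Build a SHA-256 manifest for every file on an outbound drive.
# Re-running this on the receiving end and diffing the two manifests
# catches silent corruption and missing files.
import hashlib
import os

def sha256_file(path: str, bufsize: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(root: str, manifest: str) -> None:
    with open(manifest, "w") as out:
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                out.write(f"{sha256_file(p)}  {os.path.relpath(p, root)}\n")

write_manifest("/mnt/shipping_drive", "MANIFEST.sha256")  # hypothetical paths
```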
22. Huge Need For Network Ingest
... and a huge need for fast(er) research networks!
1. Public data repositories have petabytes of useful data
2. Collaborators still need to swap data in serious ways
3. Amazon is becoming an important repo of public and private sources
4. Many vendors now “deliver” to the cloud
24. Life Science In One Slide:
‣ Huge compute needs, but not intractable and generally solved via Linux HPC farms. Most of our workloads are serial/batch in nature
‣ A ludicrous rate of innovation in the lab drives a similar rate of change for our software and tool environment
‣ With science changing faster than IT, the emphasis is on agility and flexibility - we’ll trade performance for some measure of future-proofing
‣ Buried in data. Getting worse. Individual scientists can generate petascale data streams.
‣ We have all of the Information Lifecycle problems: Storing, Curating, Managing, Sharing, Ingesting and Moving
28. DevOps & Scriptable Everything
‣ On (real) clouds, EVERYTHING has an API
‣ If it’s got an API you can automate and orchestrate it
‣ “Scriptable infrastructure” is now a reality (see the sketch below)
‣ Driving capabilities that we will need in 2014 and beyond
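A hedged illustration of “everything has an API”: launching a compute node programmatically on AWS EC2. The sketch uses boto3, the current AWS SDK for Python (in 2014 the equivalent library was boto); the AMI ID is hypothetical.

```python
# "Scriptable infrastructure": a server becomes one API call.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical machine image
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
)
print("Launched:", resp["Instances"][0]["InstanceId"])
```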
29. DevOps & Scriptable Everything
‣ Incredible innovation in the past few years
‣ Driven mainly by companies with massive internet ‘fleets’ to manage
‣ ... but the benefits trickle down to us little people
30. DevOps will enable hybrid HPC
... and conquer the enterprise
‣ Cloud automation/orchestration methods have been trickling down into our local infrastructures
‣ Driving significant impact on careers, job descriptions and org charts
‣ These methods are necessary for emerging hybrid cloud models for HPC/sharing
31. Scientist/SysAdmin/Programmer
2014: Continue to blur the lines between all these roles
‣ IT jobs, roles and responsibilities are going to change significantly
‣ SysAdmins must learn to program in order to harness automation tools
‣ Programmers & scientists can now self-provision and control sophisticated IT resources
32. Scientist/SysAdmin/Programmer
2014: Continue to blur the lines between all these roles
‣ My take on the future ...
• SysAdmins (Windows & Linux) who can’t code will have career issues
• Far more control is going into the hands of the research end user
• IT support roles will radically change -- no longer owners or gatekeepers
‣ IT will “own” policies, procedures, reference patterns, identity mgmt, security & best practices
‣ Research will control the “what”, “when” and “how big”
33. IT Orgs are Changing as well ...
Research needs more and more compute
‣ 25% of researchers will need HPC this year
‣ 75% will need high-volume storage
‣ IT evolved from administrative need
• Science started grabbing resources
• IT either adapted or was replaced
34. IT Orgs are Changing as well ...
Research needs more and more compute
‣ Three types of adaptations:
• IT evolved to include research IT support
• IT split into research IT and corporate IT
• IT became a primarily research org, run by a CSIO
‣ Orgs with scientific missions need adaptive IT with a stake in research projects; restrictions kill science
36. Compute:
‣ Kind of boring. A solved problem in 2014
‣ Compute power is a commodity
• Inexpensive relative to other costs
• Far less vendor differentiation than storage
• Easy to acquire; easy to deploy
37. Compute: Local Disk is Back
A defensive hedge against Big Data / HDFS
‣ We’ve started to see organizations move away from blade servers and 1U pizza-box enclosures for HPC
‣ The “new normal” may be 4U enclosures with massive local disk spindles - not occupied, just available
‣ Why? Hadoop & Big Data
‣ This is a defensive hedge against future HDFS or similar requirements
• Remember the ‘meta’ problem - science is changing far faster than we can refresh IT. This is a defensive future-proofing play.
‣ Hardcore Hadoop rigs sometimes operate at a 1:1 ratio between core count and disk count
38. Compute: Huge trend in ‘diversity’
New and refreshed HPC systems run many node types
‣ An accelerating trend since at least 2012 ...
• HPC compute resources are no longer homogeneous; many types and flavors are now deployed in single HPC stacks
‣ Newer clusters mix-and-match to match the known use cases:
• GPU nodes for compute
• GPU nodes for visualization
• Large-memory nodes (512GB+)
• Very large-memory nodes (1TB+)
• ‘Fat’ nodes with many CPU cores
• ‘Thin’ nodes with super-fast CPUs
• Analytic nodes with SSD, FusionIO, flash or large local disk for ‘big data’ tasks
39. Compute: Hardware Acceleration
GPUs, coprocessors & FPGAs
‣ Specialized hardware acceleration has its place but will not take over the world
• “... the activation energy required for a scientist to use this stuff is generally quite high ...”
‣ GPU, Phi and FPGA are best used in large-scale pipelines or as a specific solution to a singular pain point
40. Emerging Trend: Hybrid HPC
Also known as hybrid clouds
‣ Relatively new idea
• Small local footprint
• Large, dynamic, scalable, orchestrated public cloud component
‣ DevOps is key to making this work
‣ A high-speed network to the public cloud is required
‣ A software interface layer acts as the mediator between local and public resources
‣ Good for tight budgets, but it has to be done right to work
‣ Not many working examples yet
42. Network: Speed @ Core and Edge
‣ Huge potential pain point
‣ May surpass storage as our #1 infrastructure headache
‣ Petascale data is useless if you can’t move it or access it fast enough
‣ Don’t be smug about 10 Gigabit - folks need to start thinking *now* about 40 and even 100 Gigabit Ethernet
43. Network: Speed @ Core and Edge
‣ Remember 2004, when research storage requirements started to dwarf what the enterprise was using?
‣ The same thing is happening now for networking
‣ Research core, edge and top-of-rack networking speeds may exceed what the rest of the organization has standardized on
44. NIH Tackling This Now!
Massive data movement needs are driving innovation
‣ Currently installing a 100Gb research network
‣ Will tackle petascale data movement head-on
• NIH is gaining ground on 1PB/month (see the back-of-the-envelope calculation below)
• Collaboration, core compute, data commons, external data sources
• Science DMZ!
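How much sustained bandwidth does 1PB/month actually imply? A quick back-of-the-envelope check (our arithmetic, not a figure from the talk):

```python
# 1 PB/month expressed as a sustained network rate.
PB = 1e15                      # bytes (decimal petabyte)
seconds = 30 * 24 * 3600       # one 30-day month
gbps = PB * 8 / seconds / 1e9  # bytes -> bits -> gigabits per second
print(f"1 PB/month ~= {gbps:.1f} Gbps sustained")  # ~3.1 Gbps
```

Roughly 3 Gbps of continuous, end-to-end throughput for a single petabyte per month, before any protocol overhead or bursting, which is why 100Gb links and a Science DMZ are on the table.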
45. Network: ‘Science DMZ’
‣ The “Science DMZ” concept is real and necessary
‣ BioTeam will be building them in 2014 and beyond
‣ Central premise:
• Legacy firewall, network and security methods were architected for “many small data flows” use cases
• They were not built to handle smaller numbers of massive data flows
• It is also very hard to deploy ‘traditional’ security gear on 10 Gigabit and faster networks
‣ More details, background & documents at http://fasterdata.es.net/science-dmz/
[Diagram: DTN traffic with wire-speed bursts contrasted against background traffic and competing bursts, over 10GE links]
46. Network: ‘Science DMZ’
‣ Start thinking/discussing this sooner rather than later
‣ Just like “the cloud”, this may fundamentally change internal operations and technology
‣ Will also require conscious buy-in and support from senior network, security and risk management professionals
• ... these talks take time. Best to plan ahead
47. Network: ‘Science DMZ’
‣ A Science DMZ has 3 required components:
1. Very fast, “low-friction” network links and paths, with security policy and enforcement specific to scientific workflows
2. Dedicated, high-performance data transfer nodes (“DTNs”) highly optimized for high-speed data transfer
3. Dedicated network performance/measurement nodes (a measurement sketch follows below)
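For component 3, a hedged sketch of the kind of active throughput test a dedicated measurement node runs; production Science DMZs typically use perfSONAR for this. The target hostname is hypothetical, and the test drives the standard iperf3 tool.

```python
# Run an iperf3 throughput test against a remote endpoint.
import subprocess

result = subprocess.run(
    ["iperf3",
     "-c", "dtn.example.org",  # hypothetical measurement endpoint
     "-P", "4",                # four parallel TCP streams
     "-t", "30"],              # run for 30 seconds
    capture_output=True, text=True, check=True)
print(result.stdout)
```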
48. Simple Science DMZ:
Image source: “The Science DMZ: Introduction & Architecture” -- ESnet
49. Network: SDN Hype vs. Reality
More hype than useful reality at the moment
‣ Software-Defined Networking (“SDN”) is the new buzzword
‣ It will become pervasive and will change how we build and architect things
‣ But ...
‣ Not hugely practical at the moment for most environments
• We need far more than APIs that control port-forwarding behavior on switches
• More time is needed for all of the related technologies and methods to coalesce into something broadly useful and usable
51. Storage
‣ Still the biggest expense, biggest headache and scariest systems to design in modern life science informatics environments
‣ Many of the pain points we’ve talked about for years are still in place:
• Explosive growth forcing capacity to be traded off against performance
• Lots of monolithic single tiers of storage
• A critical need to actively manage data through its full life cycle (just storing data is not enough ...)
• A need for post-POSIX solutions such as iRODS and other metadata-aware data repositories
52. Storage Trends
‣ The large but monolithic storage platforms we’ve built up over the years are no longer sufficient
• Do you know how many people are running a single large scale-out NAS or parallel filesystem? Most of us!
‣ Tiered storage is now an absolute requirement
• At a minimum we need an active storage tier plus something far cheaper/deeper for cold files
‣ Expect the tiers to involve multiple vendors, products and technologies
• The Tier 1 storage vendors tend to have unacceptable pricing for their “all-in-one” tiered data management solutions
53. Storage: Disruptive stuff ahead
The Tier 1 storage vendors may be too expensive ...
‣ BioTeam has built 1-petabyte ZFS-based storage pools from commodity whitebox hardware for about $100,000
‣ Infinidat’s “IZbox” provides 1 petabyte of usable NAS as a turnkey appliance for roughly $375,000
• Either of these would be a nice, cost-effective archive or “cold” tier for less-active file and data storage
• Solutions like these cost far, far less than what Tier 1 storage vendors would charge for a petabyte of usable storage (see the per-TB arithmetic below)
• ... of course they come with their own risks and operational burden. This is an area where proper research and due diligence are essential
‣ Companies like Avere Systems are producing boxes that unify disparate storage tiers and link them to cloud and object stores
• This is a route to unifying “tier 1” storage with the “cheap & deep” storage
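The per-terabyte arithmetic behind those two price points (our calculation, decimal units, 1PB = 1,000TB):

```python
# Cost per usable terabyte for the two petabyte options quoted above.
options = {
    "Whitebox ZFS (BioTeam build)": 100_000,   # dollars per usable PB
    "Infinidat IZbox (turnkey)":    375_000,
}
for name, dollars_per_pb in options.items():
    print(f"{name}: ${dollars_per_pb / 1000:,.0f} per usable TB")
# -> $100/TB and $375/TB before operational costs; the point of the
#    slide is that Tier 1 quotes for a usable petabyte run far higher.
```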
55. Future Trends and Patterns
Some final thoughts
‣ Data generation is out-pacing technology
‣ Cheap/easy laboratory assays are taking over
• Researchers largely don’t know what to do with all the resulting data
• They are holding on to the data until someone figures it out
• This will cause some interesting headaches for IT
• Huge need for real “Big Data” applications to be developed
56. Future Trends and Patterns
Some final thoughts
‣ Unless there’s an investment in ultra-high-speed networking, we need to change how we think about analysis
‣ Data commons are setting a precedent
• Need to minimize the movement of data
• Include compute power and an analysis platform with the data commons
‣ Move the analysis to the data, don’t move the data
• Requires sharing / large core institutional resources
57. Future Trends & Patterns
Some final thoughts
‣ Compute continues to become easier
‣ Data movement (physical & network) gets harder
‣ The cost of storage will be dwarfed by the “cost of managing stored data”
‣ We can see the end-of-life for our current IT architecture and design patterns; new patterns will start to appear over the next 2-5 years
‣ We’ve got a new headache to worry about ...
58. Future Trends & Patterns
A new challenge ...
‣ Responsible sharing of clinical and genomic data will be the grand challenge of the post-Human Genome Project era
‣ We HAVE to get it right
‣ The ‘Global Alliance’ whitepaper cosigned by 70+ organizations is a must-read:
• Short link to whitepaper: http://biote.am/9j
• Long link: https://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf
• NIH will be critical in making this work for the world
59. end; Thanks!