2016 05 sanger
1. Genomes, Clouds, and Organization
eMedLab Workshop, London
May, 2016
Chris Dwan
Director, Research Computing
cdwan@broadinstitute.org
@fdmts
2. Conclusions
• In order to take full advantage of cloud technologies, we
need to change not just what we do, but also how we do
it.
• Organizations need to fundamentally rethink how they
engage with technology and technologists in order to
remain relevant.
• The groups who get good at collaboration in this new
world will lead the next decade of biomedical science.
3. • The Broad Institute is a non-profit biomedical
research institute founded in 2004
• Fifty core faculty members, from MIT and Harvard,
plus hundreds of associate members.
• ~1000 directly affiliated personnel
• ~2,400+ associated researchers
Programs and Initiatives
focused on specific disease or biology areas
Cancer
Genome Biology
Cell Circuits
Psychiatric Disease
Metabolism
Medical and Population Genetics
Infectious Disease
Epigenomics
Platforms
focused technological innovation and application
Genomics
Data Sciences
Therapeutics
Imaging
Metabolite Profiling
Proteomics
Genetic Perturbation
The Broad Institute
4. • The Broad Institute is a non-profit biomedical
research institute founded in 2004
• Fifty core faculty members and hundreds of associate
members from MIT and Harvard
• ~1000 research and administrative personnel, plus
~2,400+ associated researchers
• ~1.4 × 10^6 genotyped samples
“This generation has a historic opportunity and responsibility
to transform medicine by using systematic approaches in the
biological sciences to dramatically accelerate the
understanding and cure of disease”
7. WGS / day: ~120 140 .. (plus other products)
Data generation: ~ 0.5PB/mo (200 MB/s)
Network: ~1.6Gb/sec
This is not going to slow down any time soon.
8. WGS / day: ~120 140 …
Data storage: ~200 MB/s (0.5PB/mo)
Network: ~1.6Gb/sec
This is not going to slow down any time soon.
Colocated File Storage: ~30 PB
Colocated HPC: ~14k cores
Colocated Object Storage Capacity: ~5 PB
Public cloud data: ~7 PB
Public cloud cores: ~15k cores steady state
Internal network: 10Gb/sec
External network: 100Gb/sec
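The data-generation and network figures above describe the same flow in different units. A minimal sketch of the conversion, assuming decimal units (1 PB = 10^15 bytes) and a 30-day month, neither of which is stated on the slides:

```python
# Convert the monthly data-generation figure into sustained rates.
# Assumptions (not from the slides): decimal units and a 30-day month.

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds


def sustained_rate(petabytes_per_month: float) -> tuple[float, float]:
    """Return (megabytes per second, gigabits per second)."""
    bytes_per_second = petabytes_per_month * 10**15 / SECONDS_PER_MONTH
    mb_per_s = bytes_per_second / 10**6
    gbit_per_s = bytes_per_second * 8 / 10**9
    return mb_per_s, gbit_per_s


mb_per_s, gbit_per_s = sustained_rate(0.5)
print(f"{mb_per_s:.0f} MB/s, {gbit_per_s:.2f} Gb/s")
```

Under those assumptions 0.5 PB/month works out to roughly 193 MB/s, or about 1.5 Gb/s, which is in line with the rounded ~200 MB/s and ~1.6 Gb/sec quoted above.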
10. The future is already here – it’s just not very well
distributed
William Gibson
11. A lot of technology has happened since
we were all worried about “data
tsunamis” in 2007.
12. Amazon’s innovation
2002:
All sharing of data, provisioning of services, configuration
of infrastructure – everything is via programmatic call (API)
APIs must be written to be called by external customers.
Anyone who does not do this will be fired, have a nice day.
2006:
Amazon launches a product with which I can provision
servers and storage as easily as I buy books.
16. Avere (June 2015): A cloud gateway for files.
• Data uploaded: 4 PB and counting
• Compression and client side encryption in-line (push-button)
• Simple enough that we’re out in front of the computational capabilities ($$)
[Diagram labels]
Broad Data Center: Physical Compute Hosts, Physical Avere Cluster, Physical Data Store
Google Cloud Services: Virtual Compute Hosts, Virtual Avere Cluster, Cloud Bucket
Annotations: "Free", "Expensive"
17. Liberation from the location of metal
The billing API is the best way to get usage
information out of Google's cloud offerings.
Eight Exabytes Free
18. File-based storage: The Information Limits
• Single namespace filers hit real-world limits at:
– ~5PB (restriping times, operational hotspots, MTBF headaches)
– ~10^9 files: Directories must either be wider or deeper than human
brains can handle.
• Filesystem paths are presumed to persist forever
– Forests of symbolic links
– “Charlotte’s web”
• Access semantics are fundamentally inadequate.
– We need complex, dynamic, context sensitive semantics including
consent for research use.
– File hierarchies will never scale to a federated world.
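The "wider or deeper" trade-off for ~10^9 files can be made concrete with a little arithmetic. The sketch below assumes an idealized, perfectly balanced tree in which every directory holds at most a fixed number of entries and files live only at the bottom level; real filesystems are messier, and the function name is illustrative only.

```python
def levels_needed(total_files: int, entries_per_directory: int) -> int:
    """Smallest depth d such that entries_per_directory ** d >= total_files."""
    if entries_per_directory < 2:
        raise ValueError("need at least 2 entries per directory")
    depth, capacity = 0, 1
    # Integer arithmetic avoids floating-point surprises from log().
    while capacity < total_files:
        capacity *= entries_per_directory
        depth += 1
    return depth


for width in (100, 1_000, 100_000):
    print(f"{width:>7} entries per directory -> {levels_needed(10**9, width)} levels")
```

With a browsable 100 entries per directory the tree is 5 levels deep; keeping it to 2 levels requires directories of 100,000 entries. Neither end is comfortable for a person navigating by hand.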
19. 3rd Party Companies Fill Cloud Feature Gaps
CloudHealth dashboard atop the billing API
Storage $$
Network $$
21. Genomes on the Cloud (April 2016)
Testing the genome analysis pipeline
“Go-live”
22. “To be without method is deplorable, but to depend
entirely on method is worse.”
The Mustard Seed Garden Manual of Painting, 1679
28. Most laboratory and clinical work
Manager of compute infrastructure
for use by others.
Consumer of analysis
User of GUI and visual tools
Author of scripts and workflows for
personal use
Author of scripts and command line
tools for use by others
Manager of compute infrastructure
for personal use
A Technology Engagement Spectrum
“Users”
“Shadow IT”
Well served by
traditional “research
computing”
To The Cloud!
To The Other Cloud!
Already happily
off-prem, PaaS,
etc.
29. Most laboratory and clinical work
Manager of compute infrastructure
for use by others.
Consumer of analysis
User of GUI and visual tools
Author of scripts and workflows for
personal use
Author of scripts and command line
tools for use by others
Manager of compute infrastructure
for personal use
Tool Building
Training/Access
Shifting how we
engage with
technology
A Technology Engagement Spectrum
“Users”
“Shadow IT”
Well served by
traditional “research
computing”
30. What does “cloud” mean to me?
• Engineering and Design Approach:
– All infrastructure and technology choices are
seamlessly available, as necessary, to every project
and product.
• Integrative Organizing Principle*
– Technologists directly engaged and accessible
– Shared accountability for business / project goals.
Organizations who fail to integrate in this way will be
routed around.
*DevOps
32. Product
(increased connection
with architectural and
infrastructure design)
User Services
(workstations,
laptops, printers)
Run the Business
(HR, Finance, …)
Infrastructure
Business Priorities
Internal Service Catalog
DevOps
(direct engagement w/
teams through entire
product lifecycle)
The beginnings of a DevOps transition, characterized by teams
named “DevOps,” that serve particular projects
33. Business units
dive into
infrastructure as
they need,
partnering with
technologists to
achieve business
goals
A mature DevOps IT organization composed of the same staff,
working in a fundamentally different way.
Business Priorities
37. Clouds open new possibilities for IT Services
Traditional IT:
• Globally shared services
• NFS, AD / LDAP, DNS, …
• Many services provided using
public clouds
Responsibility: CIO
Cancer Genome Analysis Connectivity Map
Billing Support:
• IT provides coordination between internal cost
objects and cloud vendor “projects” or “roles”
• No shared services
Responsibility: User
Cloud / Hybrid Model
• Granular shared services
• VPN used to expose selected
services to particular projects
Responsibility: Project / Service Lead
BITS DevOps DSDE Dev Cloud Pilot
API API API
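The "Billing Support" role, coordinating internal cost objects with cloud vendor projects, amounts to a roll-up over billing line items. The following is only a sketch: the record shape, the project IDs, and the cost-object codes are hypothetical and do not reflect any vendor's billing schema.

```python
from collections import defaultdict

# Hypothetical mapping from cloud vendor project IDs to internal cost objects.
PROJECT_TO_COST_OBJECT = {
    "cancer-genome-analysis": "CO-1001",
    "connectivity-map": "CO-2002",
}


def charges_by_cost_object(line_items, project_to_cost_object):
    """Sum billing line items per internal cost object.

    Each line item is a dict with "project" and "cost" keys (an assumed
    shape). Projects with no mapping are collected under "UNMAPPED" so
    they can be followed up rather than silently dropped.
    """
    totals = defaultdict(float)
    for item in line_items:
        cost_object = project_to_cost_object.get(item["project"], "UNMAPPED")
        totals[cost_object] += item["cost"]
    return dict(totals)


line_items = [
    {"project": "cancer-genome-analysis", "service": "storage", "cost": 120.0},
    {"project": "cancer-genome-analysis", "service": "network", "cost": 30.0},
    {"project": "connectivity-map", "service": "storage", "cost": 45.5},
    {"project": "someones-side-project", "service": "compute", "cost": 8.0},
]
print(charges_by_cost_object(line_items, PROJECT_TO_COST_OBJECT))
```

Surfacing unmapped projects explicitly is a design choice: under a "Responsibility: User" model, spend that belongs to no cost object is exactly what IT needs to see.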
39. The Cloud Future (where we are going)
• We are not so special:
• Dozens to hundreds of businesses have multiple exabytes of data.
• Health care / life sciences is playing catch-up.
• Objects, not files:
• Engineer like an MMORPG* designer.
• Do not copy files. Access APIs.
• Avere gets around this by turning objects back into files.
• Cloud aware access patterns:
• Data egress is expensive.
• Do computing adjacent to the data.
• Figure out a cost model to support this world.
• Not everybody will use the same cloud vendor:
• If we want to collaborate at scale, we need to stop thinking in terms of single,
monolithic solutions.
*Massively Multiplayer Online Role Playing Game
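One way to start on "a cost model to support this world" is to compare computing next to the data with copying the data out first. The sketch below is a toy model; every price in it is a hypothetical placeholder rather than any vendor's published rate, and it ignores storage, staff time, and transfer duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Unit prices in an arbitrary currency (hypothetical placeholders)."""
    egress_per_gb: float    # moving 1 GB out of the cloud
    cloud_core_hour: float  # one core-hour next to the data
    local_core_hour: float  # one core-hour on local hardware


def cost_compute_near_data(core_hours: float, prices: Prices) -> float:
    """Run the analysis in the cloud; no bulk data leaves."""
    return core_hours * prices.cloud_core_hour


def cost_copy_then_compute(data_gb: float, core_hours: float, prices: Prices) -> float:
    """Download the data first, then run the analysis locally."""
    return data_gb * prices.egress_per_gb + core_hours * prices.local_core_hour


prices = Prices(egress_per_gb=0.10, cloud_core_hour=0.05, local_core_hour=0.02)
data_gb, core_hours = 100_000, 10_000  # 100 TB of input, 10k core-hours

near = cost_compute_near_data(core_hours, prices)
copied = cost_copy_then_compute(data_gb, core_hours, prices)
print(f"compute near data: {near:,.0f}")
print(f"copy then compute: {copied:,.0f}")
```

With these placeholder numbers the egress charge dwarfs the compute bill even though local core-hours are cheaper. Whether that holds in practice depends entirely on the real prices and on the ratio of data volume to compute.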
40. Funding for
specific analysis
Funding allocated by
headcount, team, or
department
Unfunded
Cost / scale of analysis
Large
Trivial
Moderate
Ongoing unfunded support burden
Fixed capacity on
shared use systems.
Hard choices,
limitations
Ad-hoc /
opportunistic use
Elastic capacity on
shared use
systems
Moonshots
Lost opportunity
Distinct funding models
42. The Big Data Healthcare Feeding Frenzy
• “If we sequence X new patients with condition Y every year,
the sequencing data alone will take up ALL THE
EXABYTES”*
• The data storage and analysis needs of precision /
personalized / genomic medicine are not unreasonable by
comparison with major, data driven industries (100s of
Exabytes over the next decade).
• We can compensate by being thoughtful about what data we
store, how we store it, and how we share it.
* If you multiply a number by a sufficiently large number the product is a large number.
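The footnote's point can be checked against the deck's own production figures (~0.5 PB/month at ~140 whole genomes per day). The sketch below treats all of that output as genome data, which overstates the per-genome footprint because the figure includes other products; the patient counts are purely illustrative, and a 30-day month and decimal units are assumed.

```python
BYTES_PER_PB = 10**15
BYTES_PER_EB = 10**18


def bytes_per_genome(pb_per_month: float, genomes_per_day: float,
                     days_per_month: int = 30) -> float:
    """Upper-bound footprint per genome implied by the production rates."""
    return pb_per_month * BYTES_PER_PB / (genomes_per_day * days_per_month)


def exabytes_needed(patients_per_year: int, years: int,
                    per_genome_bytes: float) -> float:
    """Storage for one genome per patient, with nothing ever deleted."""
    return patients_per_year * years * per_genome_bytes / BYTES_PER_EB


per_genome = bytes_per_genome(0.5, 140)
print(f"{per_genome / 10**9:.0f} GB per genome (upper bound)")
print(f"{exabytes_needed(1_000_000, 10, per_genome):.2f} EB for 1M patients/year over 10 years")
```

At roughly 119 GB per genome, a hypothetical one million patients a year for a decade comes to a little over one exabyte: large, but small next to the hundreds of exabytes attributed above to major data-driven industries.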
43. … people who had
nothing to do with
the design and
execution of the study …
... use another group’s data for their own ends …
… even use the data to try to disprove what the
original investigators had posited…
… some researchers have characterized as “research
parasites”
Fear, Uncertainty, and Doubt
44. What we need
• Incentive structures that reward making data accessible
and useful
– All indicators except the benefit of the patient lead to suboptimal behavior
– This will require courage.
• National / global scale data repositories, standards,
and toolkits
– Death to walled gardens, monolithic systems, and GUIs.
– Life to APIs built for a global community (c.f. Amazon, 2002)
• Open, fearless conversation about data protection vs.
appropriate use
– Genomic data is inherently personally identifiable and should be treated as such
– “Appropriate usage” goes well beyond legal conformity
45. Standards are needed for genomic data
“The mission of the Global Alliance for Genomics
and Health is to accelerate progress in human
health by helping to establish a common framework
of harmonized approaches to enable effective and
responsible sharing of genomic and clinical data,
and by catalyzing data sharing projects that drive
and demonstrate the value of data sharing.”
Regulatory Issues
Ethical Issues
Technical Issues
46. This stuff is important
We have an opportunity to change lives and health
outcomes, and to realize the gains of genomic medicine, not
in an indefinite future, but this year.
We also have an opportunity to waste vast amounts of
money (very rapidly) and still not really help anybody.
I would like to work together with you to build a better future.
cdwan@broadinstitute.org
47. Conclusions
• In order to take full advantage of cloud technologies, we
need to change not just what we do, but also how we do
it.
• Organizations need to fundamentally rethink how they
engage with technology and technologists in order to
remain relevant.
• The groups who get good at collaboration in this new
world will lead the next decade of biomedical science.