Genomes, Clouds, and Organization
eMedLab Workshop, London
May, 2016
Chris Dwan
Director, Research Computing
cdwan@broadinstitute.org
@fdmts
Conclusions
• In order to take full advantage of cloud technologies, we
need to change not just what we do, but also how we do
it.
• Organizations need to fundamentally rethink how they
engage with technology and technologists in order to
remain relevant.
• The groups who get good at collaboration in this new
world will lead the next decade of biomedical science.
• The Broad Institute is a non-profit biomedical
research institute founded in 2004
• Fifty core faculty members and hundreds of associate
members from MIT and Harvard
• ~1000 research and administrative personnel, plus
~2,400+ associated researchers
• ~1.4 x 10^6 genotyped samples
Programs and Initiatives
focused on specific disease or biology areas
Cancer
Genome Biology
Cell Circuits
Psychiatric Disease
Metabolism
Medical and Population Genetics
Infectious Disease
Epigenomics
Platforms
focused technological innovation and application
Genomics
Data Sciences
Therapeutics
Imaging
Metabolite Profiling
Proteomics
Genetic Perturbation
The Broad Institute
“This generation has a historic opportunity and responsibility
to transform medicine by using systematic approaches in the
biological sciences to dramatically accelerate the
understanding and cure of disease”
People @ Broad
WGS / day: ~120 140 .. (plus other products)
Data generation: ~ 0.5PB/mo (200 MB/s)
Network: ~1.6Gb/sec
This is not going to slow down any time soon.
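The three figures above are consistent with one another, which a quick unit conversion shows. The sketch below assumes decimal (SI) units and a 30-day month, neither of which the slide states, and the function names are illustrative only.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month


def pb_per_month_to_mb_per_s(pb_per_month: float) -> float:
    """Convert petabytes per month to megabytes per second (SI units)."""
    bytes_per_month = pb_per_month * 1e15
    return bytes_per_month / SECONDS_PER_MONTH / 1e6


def mb_per_s_to_gbit_per_s(mb_per_s: float) -> float:
    """Convert megabytes per second to gigabits per second (8 bits per byte)."""
    return mb_per_s * 8 / 1000


rate_mb_s = pb_per_month_to_mb_per_s(0.5)
link_gbit_s = mb_per_s_to_gbit_per_s(200)
print(f"0.5 PB/month is about {rate_mb_s:.0f} MB/s")
print(f"200 MB/s is {link_gbit_s:.1f} Gb/s")
```

Under those assumptions 0.5 PB/month comes to roughly 193 MB/s, and 200 MB/s is 1.6 Gb/s, which matches the rounded numbers on the slide.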
Colocated File Storage: ~30 PB
Colocated HPC: ~14k cores
Colocated Object Storage Capacity: ~5 PB
Public cloud data: ~7 PB
Public cloud cores: ~15k cores steady state
Internal network: 10Gb/sec
External network: 100Gb/sec
Base pairs vs. Samples
The future is already here – it’s just not very well distributed
William Gibson
A lot of technology has happened since
we were all worried about “data
tsunamis” in 2007.
Amazon’s innovation
2002:
All sharing of data, provisioning of services, configuration
of infrastructure – everything is via programmatic call (API)
APIs must be written to be called by external customers.
Anyone who does not do this will be fired, have a nice day.
2004:
Amazon launches a product with which I can provision
servers and storage as easily as I buy books.
Cloudbursting (Aug, 2015)
50,000+ cores used for ~2 hours
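To put the burst in perspective, it can be compared with the ~14k-core colocated HPC figure quoted earlier. This is a rough sketch that assumes the work parallelizes perfectly and that the whole on-premises cluster could be dedicated to it. Neither assumption comes from the slides, and the helper name is illustrative.

```python
def core_hours(cores: int, hours: float) -> float:
    """Total core-hours consumed by `cores` running for `hours`."""
    return cores * hours


BURST_CORES = 50_000    # from the slide: "50,000+ cores"
BURST_HOURS = 2         # from the slide: "~2 hours"
ONPREM_CORES = 14_000   # from the slide: "Colocated HPC: ~14k cores"

burst = core_hours(BURST_CORES, BURST_HOURS)
# Wall-clock time for the same core-hours on the on-premises cluster,
# assuming ideal scaling and an otherwise idle cluster.
onprem_hours = burst / ONPREM_CORES

print(f"burst: {burst:,.0f} core-hours")
print(f"same work on {ONPREM_CORES:,} cores: about {onprem_hours:.1f} hours")
```

With these inputs the burst is about 100,000 core-hours, or a little over seven hours of the entire colocated cluster.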
Data Storage (May 2016)
Avere (June 2015): A cloud gateway for files.
• Data uploaded 4 PB and counting
• Compression and client side encryption in-line (push-button)
• Simple enough that we’re out in front of the computational capabilities ($$)
[Diagram: Broad Data Center and Google Cloud Services]
Broad Data Center: Physical Data Store, Physical Avere Cluster, Physical Compute Hosts
Google Cloud Services: Cloud Bucket, Virtual Avere Cluster, Virtual Compute Hosts
Link labels: Free, Expensive
Liberation from the location of metal
The billing API is the best way to get usage
information out of Google’s cloud offerings.
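As a rough illustration of why billing data is useful for usage reporting, the sketch below totals exported line items by project and by service. The records, the field names (`project`, `service`, `cost`), and the helper `total_by` are all hypothetical. They stand in for whatever a billing export actually returns and do not reflect any real billing API schema.

```python
from collections import defaultdict

# Hypothetical line items; values and field names are placeholders
# for this sketch, not real billing data or a real export schema.
line_items = [
    {"project": "pilot-a", "service": "storage", "cost": 120.0},
    {"project": "pilot-a", "service": "network-egress", "cost": 30.5},
    {"project": "pilot-b", "service": "storage", "cost": 75.25},
]


def total_by(items, key):
    """Sum the `cost` field of each item, grouped by the given key."""
    totals = defaultdict(float)
    for item in items:
        totals[item[key]] += item["cost"]
    return dict(totals)


by_project = total_by(line_items, "project")
by_service = total_by(line_items, "service")
print(by_project)
print(by_service)
```

Grouping the same records two ways gives both a per-project view (who is spending) and a per-service view (storage versus network), which is the kind of breakdown a dashboard built on billing data would show.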
Eight Exabytes Free
File based storage: The Information Limits
• Single namespace filers hit real-world limits at:
– ~5PB (restriping times, operational hotspots, MTBF headaches)
– ~10^9 files: Directories must either be wider or deeper than human brains can handle.
• Filesystem paths are presumed to persist forever
– Forests of symbolic links
– “Charlotte’s web”
• Access semantics are fundamentally inadequate.
– We need complex, dynamic, context sensitive semantics including
consent for research use.
– File hierarchies will never scale to a federated world.
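The "wider or deeper" trade-off at ~10^9 files can be made concrete with a small calculation. The sketch assumes an idealized, perfectly balanced tree in which every directory holds at most `fanout` entries. Real filesystems are never this tidy, and `min_depth` is a hypothetical helper, not part of any tool mentioned in the talk.

```python
def min_depth(total_files: int, fanout: int) -> int:
    """Smallest number of directory levels d such that fanout**d >= total_files."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    depth = 0
    capacity = 1
    while capacity < total_files:
        capacity *= fanout
        depth += 1
    return depth


TOTAL_FILES = 10**9
for fanout in (100, 1_000, 100_000):
    print(f"{fanout:>7} entries per directory -> {min_depth(TOTAL_FILES, fanout)} levels")
```

Even in the ideal case, a billion files means either five levels of 100-entry directories or two levels with 100,000 entries each, and neither is easy for a person to navigate.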
3rd Party Companies Fill Cloud Feature Gaps
Cloudhealth dashboard atop the billing API
Storage $$
Network $$
Direct storage cost
Two kinds of network egress
Data’s trip to the cloud should be one-way.
Genomes on the Cloud (April 2016)
Testing the genome analysis pipeline
“Go-live”
“To be without method is deplorable, but to depend
entirely on method is worse.”
The Mustard Seed Garden Manual of Painting, 1679
A Technology Engagement Spectrum
Roles:
• Most laboratory and clinical work
• Consumer of analysis
• User of GUI and visual tools
• Author of scripts and workflows for personal use
• Author of scripts and command line tools for use by others
• Manager of compute infrastructure for personal use
• Manager of compute infrastructure for use by others.
Annotations:
• “Users”
• “Shadow IT”
• Well served by traditional “research computing”
• To The Cloud!
• To The Other Cloud!
• Already happily off-prem, PaaS, etc.
• Tool Building
• Training / Access
• Shifting how we engage with technology
What does “cloud” mean to me?
• Engineering and Design Approach:
– All infrastructure and technology choices are
seamlessly available, as necessary, to every project
and product.
• Integrative Organizing Principle*
– Technologists directly engaged and accessible
– Shared accountability for business / project goals.
Organizations who fail to integrate in this way will be
routed around.
*DevOps
Product (revenue generation)
User Services (workstations, laptops, printers)
Run the Business (HR, Finance, …)
IT / Infrastructure
Internal Service Catalog
Business Priorities
A traditional IT organization, splitting infrastructure and technical architecture away from business priorities
Product (increased connection with architectural and infrastructure design)
User Services (workstations, laptops, printers)
Run the Business (HR, Finance, …)
Infrastructure
Business Priorities
Internal Service Catalog
DevOps (direct engagement w/ teams through entire product lifecycle)
The beginnings of a DevOps transition, characterized by teams named “DevOps,” that serve particular projects
Business units dive into infrastructure as they need, partnering with technologists to achieve business goals
A mature DevOps IT organization composed of the same staff, working in a fundamentally different way.
Business Priorities
Clouds open new possibilities for IT Services
Traditional IT:
• Globally shared services
• NFS, AD / LDAP, DNS, …
• Many services provided using
public clouds
Responsibility: CIO
Cancer Genome Analysis Connectivity Map
Billing Support:
• IT provides coordination between internal cost
objects and cloud vendor “projects” or “roles”
• No shared services
Responsibility: User
Governance remains critical
$$ !!
Cloud / Hybrid Model
• Granular shared services
• VPN used to expose selected
services to particular projects
Responsibility: Project / Service Lead
BITS DevOps DSDE Dev Cloud Pilot
API API API
The Cloud Future (where we are going)
• We are not so special:
• Dozens to hundreds of businesses have multiple exabytes of data.
• Health care / life sciences is playing catch-up.
• Objects, not files:
• Engineer like an MMORPG* designer.
• Do not copy files. Access APIs.
• Avere gets around this by turning objects back into files.
• Cloud aware access patterns:
• Data egress is expensive.
• Do computing adjacent to the data.
• Figure out a cost model to support this world.
• Everybody will not use the same cloud vendor:
• If we want to collaborate at scale, we need to stop thinking in terms of single,
monolithic solutions.
*Massively Multiplayer Online Role Playing Game
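One way to start on the cost model mentioned above is to compare moving a dataset out of the cloud with computing next to it and exporting only the results. This is a minimal sketch. The `Prices` class, both functions, the workload sizes, and especially the unit prices are placeholders invented for illustration, not any vendor's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Placeholder unit prices; NOT any vendor's real rates."""
    egress_per_gb: float          # charged when data leaves the cloud
    compute_per_core_hour: float  # charged for cloud compute


def cost_move_data_out(dataset_gb: float, prices: Prices) -> float:
    """Egress cost of copying the whole dataset out to compute elsewhere.

    The cost of the compute at the destination is deliberately ignored.
    """
    return dataset_gb * prices.egress_per_gb


def cost_compute_adjacent(core_hours: float, result_gb: float, prices: Prices) -> float:
    """Cloud compute next to the data, then egress of only the results."""
    return core_hours * prices.compute_per_core_hour + result_gb * prices.egress_per_gb


# Hypothetical workload and prices, chosen only to show the shape of the trade-off.
prices = Prices(egress_per_gb=0.10, compute_per_core_hour=0.05)
move_out = cost_move_data_out(dataset_gb=100_000, prices=prices)
adjacent = cost_compute_adjacent(core_hours=10_000, result_gb=10, prices=prices)
print(f"move data out: {move_out:,.2f}   compute adjacent: {adjacent:,.2f}")
```

Whatever the real prices are, the structure is the same: egress cost scales with the size of what leaves the cloud, so exporting small results rather than raw data keeps that term small.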
Distinct funding models
[Chart: funding source vs. cost / scale of analysis]
Funding source: Funding for specific analysis; Funding allocated by headcount, team, or department; Unfunded
Cost / scale of analysis: Large; Moderate; Trivial
Chart labels: Ongoing unfunded support burden; Fixed capacity on shared use systems; Hard choices, limitations; Ad-hoc / opportunistic use; Elastic capacity on shared use systems; Moonshots; Lost opportunity
You move towards and become like that which you
think about.
The Big Data Healthcare Feeding Frenzy
• “If we sequence X new patients with condition Y every year,
the sequencing data alone will take up ALL THE
EXABYTES”*
• The data storage and analysis needs of precision /
personalized / genomic medicine are not unreasonable by
comparison with major, data driven industries (100s of
Exabytes over the next decade).
• We can compensate by being thoughtful about what data we
store, how we store it, and how we share it.
* If you multiply a number by a sufficiently large number the product is a large number.
… people who had nothing to do with the design and execution of the study …
… use another group’s data for their own ends …
… even use the data to try to disprove what the original investigators had posited …
… some researchers have characterized as “research parasites”
Fear, Uncertainty, and Doubt
What we need
• Incentive structures that reward making data accessible
and useful
– All indicators except the benefit of the patient lead to suboptimal behavior
– This will require courage.
• National / global scale data repositories, standards,
and toolkits
– Death to walled gardens, monolithic systems, and GUIs.
– Life to APIs built for a global community (c.f. Amazon, 2002)
• Open, fearless conversation about data protection vs.
appropriate use
– Genomic data is inherently personally identifiable and should be treated as such
– “Appropriate usage” goes well beyond legal conformity
Standards are needed for genomic data
“The mission of the Global Alliance for Genomics
and Health is to accelerate progress in human
health by helping to establish a common framework
of harmonized approaches to enable effective and
responsible sharing of genomic and clinical data,
and by catalyzing data sharing projects that drive
and demonstrate the value of data sharing.”
Regulatory Issues
Ethical Issues
Technical Issues
This stuff is important
We have an opportunity to change lives and health
outcomes, and to realize the gains of genomic medicine, not
in an indefinite future, but this year.
We also have an opportunity to waste vast amounts of
money (very rapidly) and still not really help anybody.
I would like to work together with you to build a better future.
cdwan@broadinstitute.org
Conclusions
• In order to take full advantage of cloud technologies, we
need to change not just what we do, but also how we do
it.
• Organizations need to fundamentally rethink how they
engage with technology and technologists in order to
remain relevant.
• The groups who get good at collaboration in this new
world will lead the next decade of biomedical science.
Thank You

Mais conteúdo relacionado

Mais procurados

So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
Ian Foster
 
Advanced Research Computing at York
Advanced Research Computing at YorkAdvanced Research Computing at York
Advanced Research Computing at York
Ming Li
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
Ian Foster
 
Everything has changed except us
Everything has changed except usEverything has changed except us
Everything has changed except us
mark madsen
 
Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)
mark madsen
 
The Evolution of Big Data Frameworks
The Evolution of Big Data FrameworksThe Evolution of Big Data Frameworks
The Evolution of Big Data Frameworks
eXascale Infolab
 
Thesis blending big data and cloud -epilepsy global data research and inform...
Thesis  blending big data and cloud -epilepsy global data research and inform...Thesis  blending big data and cloud -epilepsy global data research and inform...
Thesis blending big data and cloud -epilepsy global data research and inform...
Anup Singh
 

Mais procurados (20)

So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
Advanced Research Computing at York
Advanced Research Computing at YorkAdvanced Research Computing at York
Advanced Research Computing at York
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 
Everything has changed except us
Everything has changed except usEverything has changed except us
Everything has changed except us
 
Taming Big Science Data Growth with Converged Infrastructure
Taming Big Science Data Growth with Converged InfrastructureTaming Big Science Data Growth with Converged Infrastructure
Taming Big Science Data Growth with Converged Infrastructure
 
Big Data and Bad Analogies
Big Data and Bad AnalogiesBig Data and Bad Analogies
Big Data and Bad Analogies
 
Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)
 
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership GrantPOWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
 
The Evolution of Big Data Frameworks
The Evolution of Big Data FrameworksThe Evolution of Big Data Frameworks
The Evolution of Big Data Frameworks
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
 
Mapping Life Science Informatics to the Cloud
Mapping Life Science Informatics to the CloudMapping Life Science Informatics to the Cloud
Mapping Life Science Informatics to the Cloud
 
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
Cloud-native Enterprise Data Science Teams
Cloud-native Enterprise Data Science TeamsCloud-native Enterprise Data Science Teams
Cloud-native Enterprise Data Science Teams
 
2019 BioIt World - Post cloud legacy edition
2019 BioIt World - Post cloud legacy edition2019 BioIt World - Post cloud legacy edition
2019 BioIt World - Post cloud legacy edition
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Thesis blending big data and cloud -epilepsy global data research and inform...
Thesis  blending big data and cloud -epilepsy global data research and inform...Thesis  blending big data and cloud -epilepsy global data research and inform...
Thesis blending big data and cloud -epilepsy global data research and inform...
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 

Semelhante a 2016 05 sanger

FAIR Computational Workflows
FAIR Computational WorkflowsFAIR Computational Workflows
FAIR Computational Workflows
Carole Goble
 
Cyverse: Extensible Cyberinfrastructure for Life Science
Cyverse: Extensible Cyberinfrastructure for Life ScienceCyverse: Extensible Cyberinfrastructure for Life Science
Cyverse: Extensible Cyberinfrastructure for Life Science
EMBL Australia Bioinformatics Resource
 
Lean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big DataLean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big Data
Stylight
 
Nectar cloud workshop ndj 20110331.2
Nectar cloud workshop ndj 20110331.2Nectar cloud workshop ndj 20110331.2
Nectar cloud workshop ndj 20110331.2
Nick Jones
 

Semelhante a 2016 05 sanger (20)

iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
The pulse of cloud computing with bioinformatics as an example
The pulse of cloud computing with bioinformatics as an exampleThe pulse of cloud computing with bioinformatics as an example
The pulse of cloud computing with bioinformatics as an example
 
Six Principles of Software Design to Empower Scientists
Six Principles of Software Design to Empower ScientistsSix Principles of Software Design to Empower Scientists
Six Principles of Software Design to Empower Scientists
 
Cytoscape: Now and Future
Cytoscape: Now and FutureCytoscape: Now and Future
Cytoscape: Now and Future
 
CLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB LaunchCLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB Launch
 
Adoption of Cloud Computing in Scientific Research
Adoption of Cloud Computing in Scientific ResearchAdoption of Cloud Computing in Scientific Research
Adoption of Cloud Computing in Scientific Research
 
FAIR Computational Workflows
FAIR Computational WorkflowsFAIR Computational Workflows
FAIR Computational Workflows
 
Cyverse: Extensible Cyberinfrastructure for Life Science
Cyverse: Extensible Cyberinfrastructure for Life ScienceCyverse: Extensible Cyberinfrastructure for Life Science
Cyverse: Extensible Cyberinfrastructure for Life Science
 
Intro to the CNCF Research User Group
Intro to the CNCF Research User GroupIntro to the CNCF Research User Group
Intro to the CNCF Research User Group
 
第1回バイオインフォマティクスデータ可視化セミナー@Riken
第1回バイオインフォマティクスデータ可視化セミナー@Riken第1回バイオインフォマティクスデータ可視化セミナー@Riken
第1回バイオインフォマティクスデータ可視化セミナー@Riken
 
Open Source Cloud
Open Source CloudOpen Source Cloud
Open Source Cloud
 
Data-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudData-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and Cloud
 
Lean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big DataLean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big Data
 
Nectar cloud workshop ndj 20110331.2
Nectar cloud workshop ndj 20110331.2Nectar cloud workshop ndj 20110331.2
Nectar cloud workshop ndj 20110331.2
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
Analyzing Big Data in Medicine with Virtual Research Environments and Microse...
Analyzing Big Data in Medicine with Virtual Research Environments and Microse...Analyzing Big Data in Medicine with Virtual Research Environments and Microse...
Analyzing Big Data in Medicine with Virtual Research Environments and Microse...
 
Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 
Cloud computing and Hadoop introduction
Cloud computing and Hadoop introductionCloud computing and Hadoop introduction
Cloud computing and Hadoop introduction
 
Accelerating Analytics for the Future of Genomics
Accelerating Analytics for the Future of GenomicsAccelerating Analytics for the Future of Genomics
Accelerating Analytics for the Future of Genomics
 

Mais de Chris Dwan

Mais de Chris Dwan (20)

Somerville Police Staffing Final Report.pdf
Somerville Police Staffing Final Report.pdfSomerville Police Staffing Final Report.pdf
Somerville Police Staffing Final Report.pdf
 
2023 Ward 2 community meeting.pdf
2023 Ward 2 community meeting.pdf2023 Ward 2 community meeting.pdf
2023 Ward 2 community meeting.pdf
 
One Size Does Not Fit All
One Size Does Not Fit AllOne Size Does Not Fit All
One Size Does Not Fit All
 
Somerville FY23 Proposed Budget
Somerville FY23 Proposed BudgetSomerville FY23 Proposed Budget
Somerville FY23 Proposed Budget
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on Production
 
#Defund thepolice
#Defund thepolice#Defund thepolice
#Defund thepolice
 
2009 cluster user training
2009 cluster user training2009 cluster user training
2009 cluster user training
 
No Free Lunch: Metadata in the life sciences
No Free Lunch:  Metadata in the life sciencesNo Free Lunch:  Metadata in the life sciences
No Free Lunch: Metadata in the life sciences
 
Somerville ufc memo tree hearing
Somerville ufc memo   tree hearingSomerville ufc memo   tree hearing
Somerville ufc memo tree hearing
 
2011 career-fair
2011 career-fair2011 career-fair
2011 career-fair
 
Advocacy in the Enterprise (what works, what doesn't)
Advocacy in the Enterprise (what works, what doesn't)Advocacy in the Enterprise (what works, what doesn't)
Advocacy in the Enterprise (what works, what doesn't)
 
"The Cutting Edge Can Hurt You"
"The Cutting Edge Can Hurt You""The Cutting Edge Can Hurt You"
"The Cutting Edge Can Hurt You"
 
Introduction to HPC
Introduction to HPCIntroduction to HPC
Introduction to HPC
 
Intro bioinformatics
Intro bioinformaticsIntro bioinformatics
Intro bioinformatics
 
Proposed tree protection ordinance
Proposed tree protection ordinanceProposed tree protection ordinance
Proposed tree protection ordinance
 
Tree Ordinance Change Matrix
Tree Ordinance Change MatrixTree Ordinance Change Matrix
Tree Ordinance Change Matrix
 
Tree protection overhaul
Tree protection overhaulTree protection overhaul
Tree protection overhaul
 
Response from newport
Response from newportResponse from newport
Response from newport
 
Sacramento underpass bid_docs
Sacramento underpass bid_docsSacramento underpass bid_docs
Sacramento underpass bid_docs
 
Somerville tree stat 2019 02 12
Somerville tree stat 2019 02 12Somerville tree stat 2019 02 12
Somerville tree stat 2019 02 12
 

Último

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

2016 05 sanger

  • 1. Genomes, Clouds, and Organization eMedLab Workshop, London May, 2016 Chris Dwan Director, Research Computing cdwan@broadinstitute.org @fdmts
  • 2. Conclusions • In order to take full advantage of cloud technologies, we need to change not just what we do, but also how we do it. • Organizations need to fundamentally rethink how they engage with technology and technologists in order to remain relevant. • The groups who get good at collaboration in this new world will lead the next decade of biomedical science.
  • 3. • The Broad Institute is a non-profit biomedical research institute founded in 2004 • Fifty core faculty members, from MIT and Harvard, plus hundreds of associate members. • ~1000 directly affiliated personnel • ~2,400+ associated researchers Programs and Initiatives focused on specific disease or biology areas Cancer Genome Biology Cell Circuits Psychiatric Disease Metabolism Medical and Population Genetics Infectious Disease Epigenomics Platforms focused technological innovation and application Genomics Data Sciences Therapeutics Imaging Metabolite Profiling Proteomics Genetic Perturbation The Broad Institute
  • 4. • The Broad Institute is a non-profit biomedical research institute founded in 2004 • Fifty core faculty members and hundreds of associate members from MIT and Harvard • ~1000 research and administrative personnel, plus ~2,400+ associated researchers • ~1.4 x 10^6 genotyped samples Programs and Initiatives focused on specific disease or biology areas Cancer Genome Biology Cell Circuits Psychiatric Disease Metabolism Medical and Population Genetics Infectious Disease Epigenomics Platforms focused technological innovation and application Genomics Data Sciences Therapeutics Imaging Metabolite Profiling Proteomics Genetic Perturbation The Broad Institute “This generation has a historic opportunity and responsibility to transform medicine by using systematic approaches in the biological sciences to dramatically accelerate the understanding and cure of disease”
  • 6.
  • 7. WGS / day: ~120 140 .. (plus other products) Data generation: ~ 0.5PB/mo (200 MB/s) Network: ~1.6Gb/sec This is not going to slow down any time soon.
  • 8. WGS / day: ~120 140 … Data storage: ~200 MB/s (0.5PB/mo) Network: ~1.6Gb/sec This is not going to slow down any time soon. Colocated File Storage: ~30P Colocated HPC: ~14k cores Colocated Object Storage Capacity: ~5P Public cloud data: ~7P Public cloud cores: ~15k cores steady state Internal network: 10Gb/sec External network: 100Gb/sec
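The three rates quoted on slides 7 and 8 (200 MB/s, 0.5 PB/month, 1.6 Gb/s) describe the same data stream in different units. The following is a minimal sketch of that conversion, assuming decimal units (1 PB = 10^15 bytes) and a 30-day month; the function names are illustrative and do not come from the talk.

```python
# Sanity-check the data-rate figures quoted on the slides.
# Assumptions (not from the talk): decimal units and a 30-day month.

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds


def mb_per_s_to_pb_per_month(mb_per_s: float) -> float:
    """Convert a sustained rate in MB/s to PB per 30-day month."""
    bytes_per_month = mb_per_s * 1e6 * SECONDS_PER_MONTH
    return bytes_per_month / 1e15


def mb_per_s_to_gbit_per_s(mb_per_s: float) -> float:
    """Convert megabytes per second to gigabits per second."""
    return mb_per_s * 8 / 1000


if __name__ == "__main__":
    rate = 200  # MB/s, as quoted on the slide
    print(f"{mb_per_s_to_pb_per_month(rate):.2f} PB/month")
    print(f"{mb_per_s_to_gbit_per_s(rate):.1f} Gb/s")
```

Under these assumptions, 200 MB/s sustained works out to roughly half a petabyte per month and 1.6 Gb/s on the wire, matching the slide.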
  • 9. Base pairs vs. Samples
  • 10. The future is already here – it’s just not very well distributed William Gibson
  • 11. A lot of technology has happened since we were all worried about “data tsunamis” in 2007.
  • 12. Amazon’s innovation 2002: All sharing of data, provisioning of services, configuration of infrastructure – everything is via programmatic call (API) APIs must be written to be called by external customers. Anyone who does not do this will be fired, have a nice day. 2004: Amazon launches a product with which I can provision servers and storage as easily as I buy books.
  • 13. Cloudbursting (Aug, 2015) 50,000+ cores used for ~2 hours
  • 15.
  • 16. Avere (June 2015): A cloud gateway for files. • Data uploaded: 4 PB and counting • Compression and client side encryption in-line (push-button) • Simple enough that we’re out in front of the computational capabilities ($$) Broad Data Center Google Cloud Services Cloud Bucket Physical Avere Cluster Virtual Avere Cluster Physical Compute Hosts Virtual Compute Hosts Physical Data Store Free Expensive
  • 17. Liberation from the location of metal The billing API is the best way to get usage information out of Google’s cloud offerings. Eight Exabytes Free
  • 18. File based storage: The Information Limits • Single namespace filers hit real-world limits at: – ~5PB (restriping times, operational hotspots, MTBF headaches) – ~10^9 files: Directories must either be wider or deeper than human brains can handle. • Filesystem paths are presumed to persist forever – Forests of symbolic links – “Charlotte’s web” • Access semantics are fundamentally inadequate. – We need complex, dynamic, context sensitive semantics including consent for research use. – File hierarchies will never scale to a federated world.
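The "wider or deeper" trade-off on slide 18 can be made concrete with a little arithmetic. This is a rough sketch, assuming a perfectly balanced tree in which no directory holds more than `fanout` entries; the fanout values are illustrative choices, not figures from the talk.

```python
# How many directory levels are needed to hold a given number of files
# if no directory may contain more than `fanout` entries?
# The fanout values below are illustrative assumptions, not from the talk.


def min_depth(total_files: int, fanout: int) -> int:
    """Smallest number of levels d such that fanout**d >= total_files.

    Levels are counted down to and including the leaf directories
    that hold the files themselves.
    """
    if total_files < 1 or fanout < 2:
        raise ValueError("need total_files >= 1 and fanout >= 2")
    depth, capacity = 0, 1
    while capacity < total_files:
        capacity *= fanout
        depth += 1
    return depth


if __name__ == "__main__":
    for fanout in (100, 1_000, 100_000):
        levels = min_depth(10**9, fanout)
        print(f"fanout {fanout:>7,}: {levels} levels")
```

With a billion files, keeping directories small enough to browse forces a deep tree, while keeping the tree shallow forces directories with on the order of a hundred thousand entries each.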
  • 19. 3rd Party Companies Fill Cloud Feature Gaps Cloudhealth dashboard atop the billing API Storage $$ Network $$
  • 20. Direct storage cost Two kinds of network egress Data’s trip to the cloud should be one-way.
  • 21. Genomes on the Cloud (April 2016) Testing the genome analysis pipeline “Go-live”
  • 22. “To be without method is deplorable, but to depend entirely on method is worse.” The Mustard Seed Garden Manual of Painting, 1679
  • 23. Most laboratory and clinical work Consumer of analysis User of GUI and visual tools A Technology Engagement Spectrum “Users”
  • 24. Most laboratory and clinical work Consumer of analysis User of GUI and visual tools Author of scripts and workflows for personal use Author of scripts and command line tools for use by others A Technology Engagement Spectrum “Users” Well served by traditional “research computing”
  • 25. Most laboratory and clinical work Manager of compute infrastructure for use by others. Consumer of analysis User of GUI and visual tools Author of scripts and workflows for personal use Author of scripts and command line tools for use by others Manager of compute infrastructure for personal use A Technology Engagement Spectrum “Users” “Shadow IT” Well served by traditional “research computing”
  • 26. Most laboratory and clinical work Manager of compute infrastructure for use by others. Consumer of analysis User of GUI and visual tools Author of scripts and workflows for personal use Author of scripts and command line tools for use by others Manager of compute infrastructure for personal use A Technology Engagement Spectrum “Users” “Shadow IT” Well served by traditional “research computing” To The Cloud!
  • 27. Most laboratory and clinical work Manager of compute infrastructure for use by others. Consumer of analysis User of GUI and visual tools Author of scripts and workflows for personal use Author of scripts and command line tools for use by others Manager of compute infrastructure for personal use A Technology Engagement Spectrum “Users” “Shadow IT” Well served by traditional “research computing” To The Cloud! To The Other Cloud!
  • 28. Most laboratory and clinical work Manager of compute infrastructure for use by others. Consumer of analysis User of GUI and visual tools Author of scripts and workflows for personal use Author of scripts and command line tools for use by others Manager of compute infrastructure for personal use A Technology Engagement Spectrum “Users” “Shadow IT” Well served by traditional “research computing” To The Cloud! To The Other Cloud! Already happily off-prem, PaaS, etc.
  • 29. Most laboratory and clinical work Manager of compute infrastructure for use by others. Consumer of analysis User of GUI and visual tools Author of scripts and workflows for personal use Author of scripts and command line tools for use by others Manager of compute infrastructure for personal use Tool Building Training/Access Shifting how we engage with technology A Technology Engagement Spectrum “Users” “Shadow IT” Well served by traditional “research computing”
  • 30. What does “cloud” mean to me? • Engineering and Design Approach: – All infrastructure and technology choices are seamlessly available, as necessary, to every project and product. • Integrative Organizing Principle* – Technologists directly engaged and accessible – Shared accountability for business / project goals. Organizations who fail to integrate in this way will be routed around. *DevOps
  • 31. Product (revenue generation) User Services (workstations, laptops, printers) Run the Business (HR, Finance, …) IT / Infrastructure Internal Service Catalog Business Priorities A traditional IT organization, splitting infrastructure and technical architecture away from business priorities
  • 32. Product (increased connection with architectural and infrastructure design) User Services (workstations, laptops, printers) Run the Business (HR, Finance, …) Infrastructure Business Priorities Internal Service Catalog DevOps (direct engagement w/ teams through entire product lifecycle) The beginnings of a DevOps transition, characterized by teams named “DevOps,” that serve particular projects
  • 33. Business units dive into infrastructure as they need, partnering with technologists to achieve business goals A mature DevOps IT organization composed of the same staff, working in a fundamentally different way. Business Priorities
  • 34. Clouds open new possibilities for IT Services Traditional IT: • Globally shared services • NFS, AD / LDAP, DNS, … • Many services provided using public clouds Responsibility: CIO
  • 35. Clouds open new possibilities for IT Services Traditional IT: • Globally shared services • NFS, AD / LDAP, DNS, … • Many services provided using public clouds Responsibility: CIO Cancer Genome Analysis Connectivity Map Billing Support: • IT provides coordination between internal cost objects and cloud vendor “projects” or “roles” • No shared services Responsibility: User
  • 37. Clouds open new possibilities for IT Services Traditional IT: • Globally shared services • NFS, AD / LDAP, DNS, … • Many services provided using public clouds Responsibility: CIO Cancer Genome Analysis Connectivity Map Billing Support: • IT provides coordination between internal cost objects and cloud vendor “projects” or “roles” • No shared services Responsibility: User Cloud / Hybrid Model • Granular shared services • VPN used to expose selected services to particular projects Responsibility: Project / Service Lead BITS DevOps DSDE Dev Cloud Pilot API API API
  • 38.
  • 39. The Cloud Future (where we are going) • We are not so special: • Dozens to hundreds of businesses have multiple exabytes of data. • Health care / life sciences is playing catch-up. • Objects, not files: • Engineer like an MMORPG* designer. • Do not copy files. Access APIs. • Avere gets around this by turning objects back into files. • Cloud aware access patterns: • Data egress is expensive. • Do computing adjacent to the data. • Figure out a cost model to support this world. • Everybody will not use the same cloud vendor: • If we want to collaborate at scale, we need to stop thinking in terms of single, monolithic solutions. *Massively Multiplayer Online Role Playing Game
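The "cloud aware access patterns" point on slide 39 can be illustrated with a toy cost comparison: pulling a dataset out of the cloud versus running the analysis next to it. This is only a sketch under stated assumptions. Every price below is a hypothetical placeholder rather than any vendor's actual rate, and the class and function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Hypothetical unit prices in USD; substitute real vendor rates."""
    egress_per_gb: float  # cost per GB leaving the cloud
    core_hour: float      # cost per core-hour of cloud compute


def egress_cost(dataset_gb: float, prices: Prices) -> float:
    """Cost of copying the whole dataset out of the cloud once."""
    return dataset_gb * prices.egress_per_gb


def in_cloud_compute_cost(cores: int, hours: float, prices: Prices) -> float:
    """Cost of running the analysis adjacent to the data."""
    return cores * hours * prices.core_hour


if __name__ == "__main__":
    prices = Prices(egress_per_gb=0.10, core_hour=0.05)  # placeholders
    dataset_gb = 1_000_000  # 1 PB in decimal units, an arbitrary example
    move_out = egress_cost(dataset_gb, prices)
    stay_put = in_cloud_compute_cost(cores=50_000, hours=2, prices=prices)
    print(f"egress the data:      ${move_out:,.0f}")
    print(f"compute in the cloud: ${stay_put:,.0f}")
```

The absolute numbers mean nothing without real prices. The structure is the point: egress scales with bytes moved, while compute scales with core-hours used, so a model that tracks both is needed before deciding where an analysis should run.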
  • 40. Funding for specific analysis Funding allocated by headcount, team, or department Unfunded Cost / scale of analysis Large Trivial Moderate Ongoing unfunded support burden Fixed capacity on shared use systems. Hard choices, limitations Ad-hoc / opportunistic use Elastic capacity on shared use systems Moonshots Lost opportunity Distinct funding models
  • 41. You move towards and become like that which you think about.
  • 42. The Big Data Healthcare Feeding Frenzy • “If we sequence X new patients with condition Y every year, the sequencing data alone will take up ALL THE EXABYTES”* • The data storage and analysis needs of precision / personalized / genomic medicine are not unreasonable by comparison with major, data driven industries (100s of Exabytes over the next decade). • We can compensate by being thoughtful about what data we store, how we store it, and how we share it. * If you multiply a number by a sufficiently large number the product is a large number.
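The footnote on slide 42 is easy to demonstrate: a storage projection is driven almost entirely by the multiplier you choose. Below is a minimal sketch, assuming a hypothetical 100 GB of stored data per sample (a placeholder, not a figure from the talk) and arbitrary cohort sizes.

```python
# Projected annual storage as a function of how many samples you assume.
# GB_PER_SAMPLE is a hypothetical placeholder, not a figure from the talk.

GB_PER_SAMPLE = 100
GB_PER_PB = 1_000_000  # decimal units


def annual_storage_pb(samples_per_year: int,
                      gb_per_sample: float = GB_PER_SAMPLE) -> float:
    """Petabytes generated per year for a given sequencing volume."""
    return samples_per_year * gb_per_sample / GB_PER_PB


if __name__ == "__main__":
    for samples in (10**4, 10**6, 10**9):
        pb = annual_storage_pb(samples)
        print(f"{samples:>13,} samples/year: {pb:>11,.0f} PB/year")
```

Each thousand-fold increase in the assumed cohort produces a thousand-fold increase in the headline number, which is why the slide argues for being thoughtful about what is stored rather than alarmed by the product.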
  • 43. … people who had nothing to do with the design and execution of the study … ... use another group’s data for their own ends … … even use the data to try to disprove what the original investigators had posited… … some researchers have characterized as “research parasites” Fear, Uncertainty, and Doubt
  • 44. What we need • Incentive structures that reward making data accessible and useful – All indicators except the benefit of the patient lead to suboptimal behavior – This will require courage. • National / global scale data repositories, standards, and toolkits – Death to walled gardens, monolithic systems, and GUIs. – Life to APIs built for a global community (c.f. Amazon, 2002) • Open, fearless conversation about data protection vs. appropriate use – Genomic data is inherently personally identifiable and should be treated as such – “Appropriate usage” goes well beyond legal conformity
  • 45. Standards are needed for genomic data “The mission of the Global Alliance for Genomics and Health is to accelerate progress in human health by helping to establish a common framework of harmonized approaches to enable effective and responsible sharing of genomic and clinical data, and by catalyzing data sharing projects that drive and demonstrate the value of data sharing.” Regulatory Issues Ethical Issues Technical Issues
  • 46. This stuff is important We have an opportunity to change lives and health outcomes, and to realize the gains of genomic medicine, not in an indefinite future, but this year. We also have an opportunity to waste vast amounts of money (very rapidly) and still not really help anybody. I would like to work together with you to build a better future. cdwan@broadinstitute.org
  • 47. Conclusions • In order to take full advantage of cloud technologies, we need to change not just what we do, but also how we do it. • Organizations need to fundamentally rethink how they engage with technology and technologists in order to remain relevant. • The groups who get good at collaboration in this new world will lead the next decade of biomedical science.