Pushing Discovery with
Internet2
Cloud to Supercomputing
in Life Sciences
DAN TAYLOR
Director, Business Development, Internet2
BIO-IT WORLD 2016
BOSTON
APRIL, 2016
Internet2 Overview
• An advanced networking consortium
– Academia
– Corporations
– Government
• Operates a best-in-class national optical network
– 15,000 miles of dedicated fiber
– 100G routers and optical transport systems
– 8.8 Tbps capacity
• For over 20 years, our mission has been to
– Provide cost-effective broadband and collaboration technologies to facilitate
frictionless research in Big Science: broad collaboration, extremely large data
sets
– Create tomorrow’s networks & a platform for networking research
– Engage stakeholders in
• Bridging the IT/Researcher gap
• Developing new technologies critical to their missions
[ 3 ]
The 4th Gen Internet2 Network
Internet2 Network
by the numbers
17 Juniper MX960 nodes
31 Brocade and Juniper switches
49 custom colocation facilities
250+ amplification racks
15,717 miles of newly acquired dark fiber
2,400 miles of partnered capacity with Zayo Communications
8.8 Tbps of optical capacity
100 Gbps of hybrid Layer 2 and Layer 3 capacity
300+ Ciena ActiveFlex 6500 network elements
Technology
• A research-grade high-speed network optimized for “Elephant flows”
• Layer 1 – secure point-to-point wavelength networking
• Advanced Layer 2 Services – Open virtual network for Life Sciences with connectivity speeds up to 100 Gbps
• SDN Network Virtualization customer trials now
• Advanced Layer 3 Services – High-speed IP connectivity to the world
• Superior economics
• Secure sharing of online research resources
– federated identity management system
[ 5 ]
Internet2 Members and Partners
255 Higher Education members
67 Affiliate members
41 R&E Network members
82 Industry members
65+ Int’l partners reaching over
100 Nations
93,000+ Community anchor institutions
Focused on member technology needs
since 1996
"The idea of being
able to collaborate
with anybody,
anywhere, without
constraint…"
—Jim Bottum, CIO,
Clemson University
Community
Strong international partnerships
• Agreements with international networking partners offer interoperability and access
• Enable collaboration between U.S. researchers and overseas counterparts in over 100 international R&E networks
Community
Some of our Affiliate Members
[ 8 ]
Routers: Stanford
Computer workstations: Berkeley, Stanford
Security systems: Univ of Michigan
Security systems: Georgia Tech
Social media: Harvard
Network caching: MIT
Search: Stanford
[ 9 ]
The Route
to Innovation
August 30, 2016 © 2016 Internet2
Abundant Bandwidth
• Raw capacity now available on
Internet2 Network a key imagination enabler
• Incent disruptive use of new, advanced
capabilities
Software Defined Networking
• Open up network layer itself to innovation
• Let innovators communicate with and program
the network itself
• Allow developers to optimize the network for
specific applications
Science DMZ
• Architect a special solution to allow
higher-performance data flows
• Include end-to-end performance monitoring
server and software
• Include SDN server to support programmability
Life Sciences Research Today
• Sharing Big Data sets (genomic, environmental, imagery) key to basic and applied research
• Reproducibility - need to capture methods as well as raw data
– High variability in analytic processes and instruments
– Inconsistent formats and standards
• Lack of metadata & standards
• Biological systems are immensely complicated and dynamic (S. Goff, CyVERSE/iPlant)
• 21k human genes can make >100k proteins
• >50% of genes are controlled by day-night cycles
• Proteins have an average half-life of 30 hours
• Several thousand metabolites are rapidly changing
• Traits are environmentally and genetically controlled
• Information Technology - High Performance Computing and Networking - now can explore
these systems through simulation
• Collaboration
– Cross Domain, Cross Discipline
– Distribution of systems and talent is global
– Resources are public, private and academic
BIO-IT Trends in the Trenches 2015
with Chris Dagdigian
Take Aways
- Science is changing faster than IT funding
cycle for data intensive computing
environments
- Forward-looking 100G multi-site, multi-party
collaborations required
- Cloud adoption driven by capability vs cost
- Centralized data center dead; future is
distributed computing/data stores
- Big pharma security challenge has
been met
- SDN is real and happening now; part of
infrastructure automation wave
- Blast radius more important than ever:
DOE’s Science DMZ architecture is a
solution
https://youtu.be/U6i0THTxe4o
http://www.slideshare.net/chrisdag/2015-bioit-trends-from-the-frenches
2015 Bio-IT World Conference & Expo
• Change
• Networking
• Cloud
• Decentralized Collaboration
• Security
• Mission Networks
Change
[ 12 ]
Data Tsunami
Physics
Large Hadron Collider
Life Sciences
Next Generation Sequencers
CERN Illumina
Networking
[ 14 ]
2012: US – China 10 Gbps Link
Fed Ex: 2 days
Internet+ FTP: 26 hours
China ‐ US 10G Link: 30 secs
Dr. Lin Fang Dr. Dawei Lin
Sample.fa
(24GB)
NCBI/UC-Davis/BGI : First ultra high speed transfer of
genomic data between China & US, June 2012
“The 10 Gigabit network connection is even
faster than transferring data to most local hard
drives,” said Dr. Lin [of UC, Davis]. “The use of
a 10 Gigabit network connection will be
groundbreaking, very much like email
replacing hand delivered mail for
communication. It will enable scientists in the
genomics-related fields to communicate and
transfer data more rapidly and conveniently,
and bring the best minds together to better
explore the mysteries of life science.” (BGI
press release)
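The three delivery times on this slide can be turned into average throughput figures. The sketch below is only back-of-the-envelope arithmetic: it assumes the 24 GB file size is in decimal gigabytes and that the quoted times cover the whole transfer, neither of which the slide states. The function and variable names are illustrative.

```python
def effective_gbps(size_gb: float, seconds: float) -> float:
    """Average throughput in gigabits per second for one transfer.

    Assumes decimal gigabytes (1 GB = 8 gigabits); the slide does not
    say whether "24GB" is decimal or binary.
    """
    return size_gb * 8 / seconds


SAMPLE_GB = 24  # size of Sample.fa as given on the slide

internet_ftp = effective_gbps(SAMPLE_GB, 26 * 3600)  # Internet + FTP: 26 hours
ten_gig_link = effective_gbps(SAMPLE_GB, 30)         # China-US 10G link: 30 secs

print(f"Internet + FTP: {internet_ftp * 1000:.2f} Mbps")
print(f"10G link:       {ten_gig_link:.1f} Gbps")
print(f"Speedup:        {26 * 3600 / 30:.0f}x")
```

Under these assumptions the 30-second transfer averages about 6.4 Gbps, roughly two thirds of the link's nominal 10 Gbps, while the 26-hour FTP transfer averages about 2 Mbps.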
Life Sciences Engagement
Community
Forward Looking 100G Networks & Multi Site Multi
Party Collaboration
Accelerating Discovery:
USDA ARS Science
Network
[ 18 ]
USDA Agricultural Research Service Science Network
• USDA scope is far beyond human
USDA Agricultural Research Service
Use Cases
• Drought (Soil Moisture) Project – Challenging Volumes of Data
– NASA satellite data storage: 7 TB/mo., 36-month mission
– ARS Hydrology and Remote Sensing Lab analysis: 108 TB
– Data completely reprocessed 3 to 5 times
• Microbial Genomics Project – Computational Bottlenecks
– Individual strains of bacteria and microorganism communities related to
Food Safety
Animal Health
Feed Efficiency
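For a sense of scale, the drought project figures can be multiplied out. This is a rough sketch only: it assumes each reprocessing pass touches the full 108 TB analysis data set once, which the slide does not spell out, and the variable names are illustrative.

```python
MONTHLY_TB = 7           # NASA satellite data stored per month (from the slide)
MISSION_MONTHS = 36      # mission length in months (from the slide)
ANALYSIS_TB = 108        # ARS lab analysis data set (from the slide)
REPROCESS_RUNS = (3, 5)  # data is completely reprocessed 3 to 5 times

# Raw satellite data accumulated over the whole mission.
raw_total_tb = MONTHLY_TB * MISSION_MONTHS

# Assumption: every reprocessing pass reads the full analysis data set once.
reprocessed_tb = tuple(ANALYSIS_TB * runs for runs in REPROCESS_RUNS)

print(f"Raw storage over the mission: {raw_total_tb} TB")
print(f"Data handled by reprocessing: {reprocessed_tb[0]} to {reprocessed_tb[1]} TB")
```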
[ 20 ]
ARS Big Data Initiative
Big Data Workshop Recommendations,
(February 2013)
Three Pillars of the ARS Big Data Implementation
Plan – Network, HPC, Virtual Research Support
(April, 2014)
• Develop a Science DMZ
• Enable high-speed, low-latency transfer of
research data to HPC and storage from ARS
locations
• Virtual Researcher Support
Implementation Complete (Nov. 2015):
Clay Center, NE; Albany, CA; Beltsville Labs/Nat’l Ag. Library, Beltsville, MD; Stoneville, MS; Ft. Collins, CO; Ames/NADC, IA
• ARS Scientific Computing
Assessment
• Final Report March 2014
SCInet Locations and Gateways
USDA AGRICULTURAL RESEARCH
SERVICE
Locations: Albany, CA; Ft. Collins, CO; Clay Center, NE; Ames, IA; Stoneville, MS; Beltsville, MD
Link speeds shown on the map: 100 Gb (three links), 10 Gb (three links)
Cloud & Distributed Research Computing
@Scale
[ 22 ] Community
Internet2 Approach :
Agile scaling of resources and capacity
Access to multi-domain, multi-discipline expertise in one dynamic global community
Offer a bottomless toolbox for Innovation for the researcher
[ 23 ]
New High Speed Cloud Collaborations
10, x10G, x100G
Syngenta Science Network
Bringing Plant Potential to Life through enhanced
computing capacity
Syngenta Science Network
• Syngenta is a leading agriculture
company helping to improve global
food security by enabling millions of
farmers to make better use of
available resources.
• Key research challenge:
How to grow plants
more efficiently?
• Internet2 members, especially land
grant universities, are important
research partners.
The Challenge
– Increasing size of scientific data sets
– Growing number of useful external resources
and partners
– Complexity of genomic analyses is
increasing
– Need for big data collaborations across the
globe
– Must Innovate
– Higher data throughput
– High speed connectivity to AWS Direct Connect
Surge HPC
Collaborations with academic community
– High speed connections to best-in-class supercomputing resources
NCSA – University of Illinois
Leverage NCSA expertise in building custom R&D workflows
Leverage NCSA Industry Partnership Program
A*Star Supercomputing Center in Singapore
Supports a global, distributed, scientific computing capability
– Global scale : creating a global fabric for computing and collaboration
“I want to be 15 minutes behind NCSA and 6
months ahead of my competition”
- Keith Gray, BP
[ 28 ]
National Center for Supercomputing
Applications
[ 29 ]
*Better Designed* *More Durable* *Available Sooner*
Theoretical &
Basic Research
Prototyping &
Development
Optimization &
Robustification
Commercialization
[ 30 ]
NCSA Mayo Clinic @Scale Genome-Wide Association Study for Alzheimer’s disease
• NCSA Private Sector Program
– UIUC HPCBio
– Mayo Clinic
• The Blue Waters team and the Swiss Institute of Bioinformatics worked together to identify which genetic variants interact to influence gene expression patterns that may be associated with Alzheimer’s disease
[ 31 ]
Big Data and Big Compute Problem
• 50,011,495,056 pairs of variants
• Each variant pair is tested against 181 subjects and 24,544 genic regions
• Computationally large problem:
- PLINK: ~2 years at Mayo
- FastEpistasis: ~6 hours on Blue Waters
• Can be a big data problem:
- 500 PB if all results are kept
- 4 TB when using a conservative cutoff
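The numbers on this slide can be combined to show why the problem is both compute-bound and data-bound. The sketch below is illustrative arithmetic only. It assumes one statistical test per (variant pair, genic region) combination, 365-day years, and decimal petabytes; none of these is stated on the slide, and the run times are the slide's approximate figures.

```python
VARIANT_PAIRS = 50_011_495_056  # from the slide
GENIC_REGIONS = 24_544          # from the slide
SUBJECTS = 181                  # from the slide

# Assumption: one test per (variant pair, genic region), each over all subjects.
total_tests = VARIANT_PAIRS * GENIC_REGIONS

# Approximate run times from the slide, assuming 365-day years.
plink_hours = 2 * 365 * 24
fastepistasis_hours = 6
speedup = plink_hours / fastepistasis_hours

# 500 PB if every result is kept (decimal petabytes assumed).
ALL_RESULTS_BYTES = 500 * 10**15
bytes_per_test = ALL_RESULTS_BYTES / total_tests

print(f"Tests over {SUBJECTS} subjects: {total_tests:.3e}")
print(f"Approximate speedup: {speedup:.0f}x")
print(f"Implied storage per test: {bytes_per_test:.0f} bytes")
```

Under these assumptions the run time drops by a factor of roughly 2,900, and keeping every result would cost on the order of a few hundred bytes per test.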
San Diego Supercomputing Center
[ 32 ]
UCSC Cancer Genomics Hub: Large Data Flows to End Users
[Chart: Cumulative TBs of CGH Files Downloaded; annotations: 1G, 8G, 15G, 30 PB]
Data Source: David Haussler, Brad Smith, UCSC; Larry Smarr, CalIT2
http://blogs.nature.com/news/2012/05/us-cancer-genome-repository-hopes-to-speed-research.html
[ 34 ]
SDSC Protein Data Bank Archive
• Repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to the others in the molecule. Information is annotated and publicly released into the archive by the wwPDB.
SDSC
• Expertise
– Bioinformatics programming
and applications support.
– Computational chemistry
methods.
– Compliance requirements,
e.g., for dbGaP, FISMA and
HIPAA.
– Data mining techniques,
machine learning and
predictive analytics
– HPC and storage system
architecture and design.
– Scientific workflow systems
and informatics pipelines.
• Education and Training
– Intensive boot camps for working professionals: Data Mining, Graph Analytics, and Bioinformatics and Scientific Workflows.
– Customized, on-site training
sessions/programs.
– Data Science Certificate
program.
– “Hackathon” events in data
science and other topics.
Sherlock Cloud: A HIPAA-Compliant
Cloud
Healthcare IT Managed Services - SDSC Center of Excellence
• Expertise in Systems, Cyber Security, Data Management,
Analytics, Application Development, Advanced User Support and
Project Management
• Operating the first & largest FISMA Data Warehouse platform for
Medicaid fraud, waste and abuse analysis
• Leveraged FISMA experience to offer HIPAA-compliant
managed hosting for UC and academia
• Supporting HHS CMS, NIH, UCOP and other UC Campuses
• Sherlock services : Data Lab, Analytics, Case Management
and Compliant Cloud
Lawrence Livermore National Lab
[ 37 ]
Lawrence Livermore NL HPC Innovation Center
Cardioid
Electrophysiology human heart
simulations allowing exploration of
causes of
• Arrhythmia
• Sudden cardiac arrest
• Predictive drug interactions.
Depicts activation of each heart
muscle cell and the cell-to-cell
transfer of the voltage of up to 3
billion cells - in near-real time.
Metagenomic analysis with Catalyst:
• Comparing short genetic fragments in a query dataset against a large searchable index of genomes (14 million genomes, 3x larger than those currently in use) to determine the threat an organism poses to human health
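The idea of matching short fragments against a searchable index of genomes can be illustrated with a toy k-mer lookup. This is a generic sketch, not the method used at Lawrence Livermore: the function names, the choice of k, and the sequences are all made up for illustration, and a real index of millions of genomes would need far more compact data structures.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_index(genomes, k):
    """Map each k-mer to the set of genome names that contain it."""
    index = defaultdict(set)
    for name, sequence in genomes.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(name)
    return index


def classify(fragment, index, k):
    """Count, per genome, how many of the fragment's k-mers hit the index."""
    hits = defaultdict(int)
    for kmer in kmers(fragment, k):
        for name in index.get(kmer, ()):
            hits[name] += 1
    return dict(hits)


# Made-up toy sequences, not real genomes.
reference = {
    "organism_a": "ACGTACGTGGA",
    "organism_b": "TTGACCATGCA",
}
index = build_index(reference, k=4)
print(classify("CGTACG", index, k=4))
```

The genome with the most k-mer hits is the likeliest source of the fragment; here all three 4-mers of the query occur only in `organism_a`.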
Community Data Science Resources
RENCI RADII and GWU HIVE
Driving Infrastructure Virtualization
Enabling Reproducibility For FDA Submissions
[ 39 ]
RADII
Resource Aware Datacentric collaboratIve Infrastructure
Goal
Make data-driven collaborations a ‘turn-key’ experience for domain
researchers and a ‘commodity’ for the science community
Approach
A new cyber-infrastructure to manage data-centric collaborations based
upon natural models of collaborations that occur among scientists.
RENCI: Claris Castillo, Fan Jiang, Charles Schmidt, Paul Ruth, Anirban Mandal, Shu Huang, Yufeng Xin, Ilya Baldin, Arcot Rajasekar
SDSC: Amit Majumdar
DUKE: Erich Huang
Workflows - especially data-driven workflows and workflow
ensembles - are becoming a centerpiece of modern computational
science.
RADII Rationale
• Multi-institutional research teams grapple with a multitude of resources
– Policy-restricted large data sets
– Campus compute resources
– National compute resources
– Instruments that produce data
• Interconnected by networks
– Campus, regional, national providers
• Many options, much complexity
• Data and infrastructure are treated separately
RADII Creates
A cyberinfrastructure that integrates data and resource
management from the ground up to support data-centric research.
RADII allows scientists to easily map collaborative data-driven
activities onto a dynamically configurable cloud infrastructure.
Infrastructure management solutions have no visibility into data resources
Data management solutions have no visibility into the infrastructure
RADII: Foundational technologies
Data grids present distributed data under a single abstraction and authorization layer
Networked Infrastructure as a Service (NIaaS)
for rapid deployment of programmable
network virtual infrastructure (clouds).
Disjoint solutions
Incompatible resource abstractions
Gap
to reduce the data-infrastructure management gap
RADII System – Virtualizing Data, Compute and Network
for Collaboration
Novel mechanisms to represent data-centric collaborations using DFD formalism
Data-centric resource management mechanisms for provisioning and de-provisioning resources dynamically throughout the lifecycle of collaborations
Novel mechanisms to map data processes, computations, storage and organization entities onto infrastructure
FDA and George Washington University
Big Data Decisions:
Linking Regulatory and Industry
Organizations with
HIVE Bio-Compute Objects
[ 44 ]
Presented by: Dan Taylor, Internet2 | Bio IT | Boston | 2016
From Jan 2016: Vahan Simonyan, Raja Mazumder
lecture NIH: Frontiers in Data Science Series
https://videocast.nih.gov/summary.asp?Live=18299&bhcp=1
High-performance Integrated Virtual Environment
A regulatory NGS data analysis platform
BIG DATA – From a range of samples and instruments to approval for use
Diagram labels: analysis and review; sample; archival; sequencing run; file transfer; regulation; computation pipelines
Challenges noted on the diagram:
• produced files are massive in size
• transfer is slow
• too large to keep forever; not standardized
• difficult to validate
• difficult to visualize and interpret
• how do we avoid mistakes?
NGS lifecycle: from a biological sample to biomedical research and regulation
• Data Size: petabyte scale, soon exabytes
• Data Transfer: too slow over existing networks
• Data Archival: retaining consistent datasets across many years of mandated evidence maintenance is difficult
• Data Standards: floating standards, multiplicity of formats, inadequate communication protocols
• Data Complexity: sophisticated IT framework needed for complex dataflow
• Data Privacy: constrictive legal framework and ownership issues across the board, from the patient bedside to FDA regulation
• Data Security: the large number of complicated security rules and data protection requirements taxes IT subsystems and cripples performance
• Computation Size: distributed computing, inefficiently parallelized, requires large investment in hardware, software and human-ware
• Computation Standards: non-canonical computation protocols make computations difficult to compare, reproduce and rely on
• Computation Complexity: significant investment of time and effort to learn appropriate skills and avoid pitfalls in complex computational pipelines
• Interpretation: large outputs from enormous computations are difficult to visualize and summarize
• Publication: peer review and audit require communicating massive amounts of information
... and how do we avoid mistakes ?
software challenges and needs
HIVE is an End to End Solution
• Data retrieval from anywhere in the world
• Storage of extra large scale data
• Security approved by OIM
• Integrator platform to bring different data and analytics together
• Tailor made analytics designed around needs
• Visualization made to help in interpretation of data
• Support of the entire hardware, software and knowledge infrastructure
• Expertise accumulated in the agency
• Bio-Compute objects repository to provide reproducibility and
interoperability and long term referable storage of computations and results
HIVE is not
• an application to perform a few tasks
• yet another database
• a computer cluster or a cloud or a data center
• an IT subsystem
More:
http://www.fda.gov/ScienceResearch/SpecialTopics/RegulatoryScience/ucm491893.htm
Instantiation (diagram labels)
• DataTypeDefinitions: definitions of metadata types
• Data Typing Engine
• Definitions of computations metadata
• Data
• Bio-compute
• Definitions of algorithms and pipeline descriptions
• Computational protocols
• Verifiable results within acceptable uncertainty/error
• Scientifically reliable interpretation
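One way to picture the elements named on this slide is as a record that bundles type definitions, computation metadata, a pipeline description, and a reference result with an acceptable error. The sketch below is purely hypothetical: the class, its field names and the example values are invented for illustration and are not the HIVE or Bio-Compute Object schema.

```python
from dataclasses import dataclass


@dataclass
class BioComputeSketch:
    """Hypothetical record loosely mirroring the labels on the slide."""
    data_types: dict            # definitions of metadata types
    computation_metadata: dict  # definitions of computations metadata
    pipeline: list              # algorithms and pipeline steps, in order
    expected_result: float      # reference value a rerun should reproduce
    tolerance: float            # acceptable uncertainty/error

    def verify(self, observed: float) -> bool:
        """True when a rerun lands within the acceptable error."""
        return abs(observed - self.expected_result) <= self.tolerance


# Made-up example values, for illustration only.
record = BioComputeSketch(
    data_types={"reads": "FASTQ", "variants": "VCF"},
    computation_metadata={"reference": "example-reference-build"},
    pipeline=["align", "call-variants", "annotate"],
    expected_result=0.130,
    tolerance=0.005,
)

print(record.verify(0.129))  # rerun close to the reference value
print(record.verify(0.200))  # rerun far from the reference value
```

Storing the tolerance alongside the protocol is what would make a result "verifiable" in this sketch: a reviewer can rerun the pipeline and check the outcome mechanically.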
HIVE data universe
Regulatory iterations between industry and FDA regulatory analysis
1. data-forming
2. compute
3. submit
4. SOPP/protocols
5. regulatory decision
6. issues, resubmits
7. yes / no
consumer
$ millions of dollars
~$800 Million R&D dollars for a single drug
~$2.6 Billion total cost
bio-compute as a way to link regulatory and industry organizations
Platforms shown: HIVE, public-HIVE, Galaxy, CLC, DNA-nexus
Diagram labels: 1. data-forming; 2. compute; 2. HIVE SOPP/protocols; 3. compute; 3. submit; 4. submit bio-compute; 4. SOPP/protocols; 5. bio-compute; 6. issues, resubmits; 7. yes / 7. no; consumer; integration; Facilitate integration; $ millions of dollars
Federated Identity
[ 52 ]
[ 53 ]
Community-developed framework of
trust enables:
• Secure, streamlined sharing of
protected resources
• Consolidated management of user
identities and access
• Delivery of an integrated portfolio of
community-developed solutions
[ 53 ]
Trusted Identity in Research
The standard for over
600 higher education
institutions—and
counting!
[ 54 ]
Academic Participants: 15, 425+
Sponsored Partners: 2, 160+
Registered Service Providers: 0, 2000+
Individuals served by federated IdM: 7.8 million
Foundation for Trust & Identity
• Eric Boyd, Internet2
• Stephen Wolff, Internet2
• Stephen Goff, PhD, CyVERSE/iPlant, University of Arizona
• Chris Dagdigian, BioTeam
• Dawei Lin, PhD, NIAID, NIH
• Paul Gibson, USDA ARS
• Paul Travis, Syngenta
• Evan Burness, NCSA
• Sandeep Chandra, SDSC
• Jonathan Allen, PhD, Lawrence Livermore National Lab
• Claris Castillo, PhD, RENCI
• Vahan Simonyan, PhD, FDA
• Raja Mazumder, PhD, George Washington University
• Eli Dart, ESNET, US Department of Energy
• BGI
• Nature
[ 55 ]
Acknowledgements
Thank you!
Daniel Taylor, Director, Business Development
Internet2
dbt3@internet2.edu
703-517-2566
Back up slides
Science DMZ
[ 57 ]
[ 58 ]
Rising expectations
Network throughput required to move y bytes in x time.
(US Dept of Energy - http://fasterdata.es.net).
[Chart annotations: "should be easy", "This year"]
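The relationship behind this chart, throughput needed to move y bytes in x time, is simply bytes times eight divided by seconds. The sketch below computes it for a few illustrative cases; the data sizes and time windows are examples chosen here, not values read from the chart, and decimal units are assumed.

```python
def required_gbps(num_bytes: float, seconds: float) -> float:
    """Sustained throughput, in gigabits per second, needed to move
    num_bytes within the given number of seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes * 8 / seconds / 1e9


TB = 10**12  # decimal terabyte; an assumption about the chart's units

# Illustrative time windows, not values taken from the chart.
WINDOWS = {"1 hour": 3600, "8 hours": 8 * 3600, "24 hours": 24 * 3600}

for size_tb in (1, 10, 100):
    for label, seconds in WINDOWS.items():
        gbps = required_gbps(size_tb * TB, seconds)
        print(f"{size_tb:>3} TB in {label:>8}: {gbps:8.2f} Gbps")
```

These figures are a floor: protocol overhead, storage speed and packet loss all push the real requirement higher.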
Science DMZ* and perfSONAR
Design pattern to address the most common bottlenecks to moving data
* fasterdata.es.net

 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGGeoffrey Fox
 
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway SystemThe Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway SystemLarry Smarr
 
High Performance Cyberinfrastructure for Data-Intensive Research
High Performance Cyberinfrastructure for Data-Intensive ResearchHigh Performance Cyberinfrastructure for Data-Intensive Research
High Performance Cyberinfrastructure for Data-Intensive ResearchLarry Smarr
 
SKA NZ R&D BeSTGRID Infrastructure
SKA NZ R&D BeSTGRID InfrastructureSKA NZ R&D BeSTGRID Infrastructure
SKA NZ R&D BeSTGRID InfrastructureNick Jones
 
The BlueBRIDGE approach to collaborative research
The BlueBRIDGE approach to collaborative researchThe BlueBRIDGE approach to collaborative research
The BlueBRIDGE approach to collaborative researchBlue BRIDGE
 
Accelerating Science, Technology and Innovation Through Open Data and Open Sc...
Accelerating Science, Technology and Innovation Through Open Data and Open Sc...Accelerating Science, Technology and Innovation Through Open Data and Open Sc...
Accelerating Science, Technology and Innovation Through Open Data and Open Sc...African Open Science Platform
 
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway SystemThe Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway SystemLarry Smarr
 
Democratizing Science through Cyberinfrastructure - Manish Parashar
Democratizing Science through Cyberinfrastructure - Manish ParasharDemocratizing Science through Cyberinfrastructure - Manish Parashar
Democratizing Science through Cyberinfrastructure - Manish ParasharLarry Smarr
 
Shared services - the future of HPC and big data facilities for UK research
Shared services - the future of HPC and big data facilities for UK researchShared services - the future of HPC and big data facilities for UK research
Shared services - the future of HPC and big data facilities for UK researchMartin Hamilton
 
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...Larry Smarr
 
Ticer summer school_24_aug06
Ticer summer school_24_aug06Ticer summer school_24_aug06
Ticer summer school_24_aug06SayDotCom.com
 
ACC-2012, Bangalore, India, 28 July, 2012
ACC-2012, Bangalore, India, 28 July, 2012ACC-2012, Bangalore, India, 28 July, 2012
ACC-2012, Bangalore, India, 28 July, 2012Charith Perera
 
Towards a High-Performance National Research Platform Enabling Digital Research
Towards a High-Performance National Research Platform Enabling Digital ResearchTowards a High-Performance National Research Platform Enabling Digital Research
Towards a High-Performance National Research Platform Enabling Digital ResearchLarry Smarr
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Blue BRIDGE
 

Semelhante a Internet2 Bio IT 2016 v2 (20)

An Integrated West Coast Science DMZ for Data-Intensive Research
An Integrated West Coast Science DMZ for Data-Intensive ResearchAn Integrated West Coast Science DMZ for Data-Intensive Research
An Integrated West Coast Science DMZ for Data-Intensive Research
 
Building a Regional 100G Collaboration Infrastructure
Building a Regional 100G Collaboration InfrastructureBuilding a Regional 100G Collaboration Infrastructure
Building a Regional 100G Collaboration Infrastructure
 
100503 bioinfo instsymp
100503 bioinfo instsymp100503 bioinfo instsymp
100503 bioinfo instsymp
 
Application of Assent in the safe - Networkshop44
Application of Assent in the safe -  Networkshop44Application of Assent in the safe -  Networkshop44
Application of Assent in the safe - Networkshop44
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWG
 
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway SystemThe Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
 
High Performance Cyberinfrastructure for Data-Intensive Research
High Performance Cyberinfrastructure for Data-Intensive ResearchHigh Performance Cyberinfrastructure for Data-Intensive Research
High Performance Cyberinfrastructure for Data-Intensive Research
 
SKA NZ R&D BeSTGRID Infrastructure
SKA NZ R&D BeSTGRID InfrastructureSKA NZ R&D BeSTGRID Infrastructure
SKA NZ R&D BeSTGRID Infrastructure
 
The BlueBRIDGE approach to collaborative research
The BlueBRIDGE approach to collaborative researchThe BlueBRIDGE approach to collaborative research
The BlueBRIDGE approach to collaborative research
 
Accelerating Science, Technology and Innovation Through Open Data and Open Sc...
Accelerating Science, Technology and Innovation Through Open Data and Open Sc...Accelerating Science, Technology and Innovation Through Open Data and Open Sc...
Accelerating Science, Technology and Innovation Through Open Data and Open Sc...
 
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway SystemThe Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
 
Big Data
Big Data Big Data
Big Data
 
Democratizing Science through Cyberinfrastructure - Manish Parashar
Democratizing Science through Cyberinfrastructure - Manish ParasharDemocratizing Science through Cyberinfrastructure - Manish Parashar
Democratizing Science through Cyberinfrastructure - Manish Parashar
 
SomeSlides
SomeSlidesSomeSlides
SomeSlides
 
Shared services - the future of HPC and big data facilities for UK research
Shared services - the future of HPC and big data facilities for UK researchShared services - the future of HPC and big data facilities for UK research
Shared services - the future of HPC and big data facilities for UK research
 
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
 
Ticer summer school_24_aug06
Ticer summer school_24_aug06Ticer summer school_24_aug06
Ticer summer school_24_aug06
 
ACC-2012, Bangalore, India, 28 July, 2012
ACC-2012, Bangalore, India, 28 July, 2012ACC-2012, Bangalore, India, 28 July, 2012
ACC-2012, Bangalore, India, 28 July, 2012
 
Towards a High-Performance National Research Platform Enabling Digital Research
Towards a High-Performance National Research Platform Enabling Digital ResearchTowards a High-Performance National Research Platform Enabling Digital Research
Towards a High-Performance National Research Platform Enabling Digital Research
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
 

Internet2 Bio IT 2016 v2

  • 1. Pushing Discovery with Internet2 Cloud to Supercomputing in Life Sciences DAN TAYLOR Director, Business Development, Internet2 BIO-IT WORLD 2016 BOSTON APRIL, 2016
  • 2. Internet2 Overview • An advanced networking consortium – Academia – Corporations – Government • Operates a best-in-class national optical network – 15,000 miles of dedicated fiber – 100G routers and optical transport systems – 8.8 Tbps capacity • For over 20 years, our mission has been to – Provide cost effective broadband and collaboration technologies to facilitate frictionless research in Big Science – broad collaboration, extremely large data sets – Create tomorrow’s networks & a platform for networking research – Engage stakeholders in • Bridging the IT/Researcher gap • Developing new technologies critical to their missions
  • 3. [ 3 ] The 4th Gen Internet2 Network Internet2 Network by the numbers 17 Juniper MX960 nodes 31 Brocade and Juniper switches 49 custom colocation facilities 250+ amplification racks 15,717 miles of newly acquired dark fiber 2,400 miles of partnered capacity with Zayo Communications 8.8 Tbps of optical capacity 100 Gbps of hybrid Layer 2 and Layer 3 capacity 300+ Ciena ActiveFlex 6500 network elements
  • 4. Technology • A Research Grade high speed network – optimized for “Elephant flows” • Layer 1 – secure point to point wavelength networking • Advanced Layer 2 Services – Open virtual network for Life Sciences with connectivity speeds up to 100 Gbps • SDN Network Virtualization customer trials now • Advanced Layer 3 Services – High speed IP connectivity to the world • Superior economics • Secure sharing of online research resources – federated identity management system
  • 5. [ 5 ] Internet2 Members and Partners 255 Higher Education members 67 Affiliate members 41 R&E Network members 82 Industry members 65+ Int’l partners reaching over 100 Nations 93,000+ Community anchor institutions Focused on member technology needs since 1996 "The idea of being able to collaborate with anybody, anywhere, without constraint…" —Jim Bottum, CIO, Clemson University Community
  • 6. Strong international partnerships • Agreements with international networking partners offer interoperability and access • Enable collaboration between U.S. researchers and overseas counterparts in over 100 international R & E networks Community
  • 7. Some of our Affiliate Members 7
  • 8. [ 8 ] *Routers Stanford Computer Workstations Berkeley, Stanford Security Systems Univ of Michigan Security Systems Georgia Tech Social Media Harvard Network Caching MIT Search Stanford
  • 9. [ 9 ] The Route to Innovation August 30, 2016 © 2016 Internet2 Abundant Bandwidth • Raw capacity now available on Internet2 Network a key imagination enabler • Incent disruptive use of new, advanced capabilities Software Defined Networking • Open up network layer itself to innovation • Let innovators communicate with and program the network itself • Allow developers to optimize the network for specific applications Science DMZ • Architect a special solution to allow higher-performance data flows • Include end-to-end performance monitoring server and software • Include SDN server to support programmability
  • 10. Life Sciences Research Today • Sharing Big Data sets (genomic, environmental, imagery) key to basic and applied research • Reproducibility - need to capture methods as well as raw data – High variability in analytic processes and instruments – Inconsistent formats and standards • Lack of metadata & standards • Biological systems are immensely complicated and dynamic (S. Goff, CyVERSE/iPlant) • 21k human genes can make >100k proteins • >50% of genes are controlled by day-night cycles • Proteins have an average half-life of 30 hours • Several thousand metabolites are rapidly changing • Traits are environmentally and genetically controlled • Information Technology - High Performance Computing and Networking - now can explore these systems through simulation • Collaboration – Cross Domain, Cross Discipline – Distribution of systems and talent is global – Resources are public, private and academic
  • 11. BIO-IT Trends in the Trenches 2015 with Chris Dagdigian Take Aways - Science is changing faster than IT funding cycle for data intensive computing environments - Forward looking 100G multi site, multi party collaborations required - Cloud adoption driven by capability vs cost - Centralized data center dead; future is distributed computing/data stores - Big pharma security challenge has been met - SDN is real and happening now; part of infrastructure automation wave - Blast radius more important than ever: DOE’s Science DMZ architecture is a solution https://youtu.be/U6i0THTxe4o http://www.slideshare.net/chrisdag/2015-bioit-trends-from-the-frenches 2015 Bio-IT World Conference & Expo • Change • Networking • Cloud • Decentralized Collaboration • Security • Mission Networks
  • 13. Data Tsunami Physics Large Hadron Collider Life Sciences Next Generation Sequencers CERN Illumina
  • 15. 2012: US – China 10 Gbps Link. FedEx: 2 days; Internet + FTP: 26 hours; China – US 10G Link: 30 secs. Dr. Lin Fang, Dr. Dawei Lin. Sample.fa (24 GB)
  • 16. NCBI/UC-Davis/BGI : First ultra high speed transfer of genomic data between China & US, June 2012 “The 10 Gigabit network connection is even faster than transferring data to most local hard drives,” said Dr. Lin [of UC, Davis]. “The use of a 10 Gigabit network connection will be groundbreaking, very much like email replacing hand delivered mail for communication. It will enable scientists in the genomics-related fields to communicate and transfer data more rapidly and conveniently, and bring the best minds together to better explore the mysteries of life science.” (BGI press release) Life Sciences Engagement 16 Community
  • 17. Forward Looking 100G Networks & Multi Site Multi Party Collaboration Accelerating Discovery: USDA ARS Science Network
  • 18. [ 18 ] USDA Agriculture Research Services Science Network • USDA scope is far beyond human
  • 19. USDA Agricultural Research Services Use Cases • Drought (Soil Moisture) Project – Challenging Volumes of Data – NASA satellite data storage - 7 TB/mo., 36 mo. mission – ARS Hydrology and Remote Sensing Lab analysis - 108 TB – Data completely reprocessed 3 to 5 times • Microbial Genomics Project – Computational Bottlenecks – Individual strains of bacteria and microorganism communities related to Food Safety, Animal Health, Feed Efficiency
  • 20. [ 20 ] ARS Big Data Initiative Big Data Workshop Recommendations, (February 2013) Three Pillars of the ARS Big Data Implementation Plan – Network, HPC, Virtual Research Support (April, 2014) • Develop a Science DMZ • Enable high-speed, low-latency transfer of research data to HPC and storage from ARS locations • Virtual Researcher Support Implementation Complete (Nov. 2015) Clay Center, NE; Albany, CA; Beltsville Labs/Nat’l Ag. Library, Beltsville, MD Stoneville, MS; Ft. Collins, CO Ames/NADC, IA • ARS Scientific Computing Assessment • Final Report March 2014
  • 21. SCInet Locations and Gateways USDA AGRICULTURAL RESEARCH SERVICE Albany, CA Ft. Collins, CO Clay Center, NE Ames, IA Stoneville, MS Beltsville, MD 100 Gb 100 Gb 100 Gb 10 Gb 10 Gb 10 Gb
  • 22. Cloud & Distributed Research Computing @Scale [ 22 ] Community Internet2 Approach : Agile scaling of resources and capacity Access to multi-domain, multi-discipline expertise in one dynamic global community Offer a bottomless toolbox for Innovation for the researcher
  • 23. New High Speed Cloud Collaborations 10, x10G, x100G
  • 24. Syngenta Science Network Bringing Plant Potential to Life through enhanced computing capacity
  • 25. Syngenta Science Network • Syngenta is a leading agriculture company helping to improve global food security by enabling millions of farmers to make better use of available resources. • Key research challenge: How to grow plants more efficiently? • Internet2 members, especially land grant universities, are important research partners.
  • 26. The Challenge – Increasing size of scientific data sets – Growing number of useful external resources and partners – Complexity of genomic analyses is increasing – Need for big data collaborations across the globe – Must Innovate
  • 27. – Higher data throughput – High speed connectivity to AWS Direct Connect Surge HPC Collaborations with academic community – High speed connections to best-in-class supercomputing resources NCSA – University of Illinois Leverage NCSA expertise in building custom R&D workflows Leverage NCSA Industry Partnership Program A*Star Supercomputing Center in Singapore Supports a global, distributed, scientific computing capability – Global scale : creating a global fabric for computing and collaboration
  • 28. “I want to be 15 minutes behind NCSA and 6 months ahead of my competition” - Keith Gray, BP [ 28 ] National Center for Supercomputing Applications
  • 29. [ 29 ] *Better Designed* *More Durable* *Available Sooner* Theoretical & Basic Research Prototyping & Development Optimization & Robustification Commercialization
  • 30. NCSA Mayo Clinic @Scale Genome-Wide Association Study for Alzheimer’s disease • NCSA Private Sector Program – UIUC HPCBio – Mayo Clinic • Blue Waters team and Swiss Institute of Bioinformatics worked together to identify which genetic variants interact to influence gene expression patterns that may associate with Alzheimer’s disease
  • 31. [ 31 ] Big Data and Big Compute Problem • 50,011,495,056 pairs of variants • Each variant pair is tested against 181 subjects and 24,544 genic regions • Computationally large problem, PLINK: ~ 2 years at Mayo FastEpistasis: ~ 6 hours on BlueWaters • Can be a big data problem: - 500 PB if keep all results - 4 TB when using a conservative cutoff
  • 33. UCSC Cancer Genomics Hub: Large Data Flows to End Users 1G 8G 15G Cumulative TBs of CGH Files Downloaded Data Source: David Haussler, Brad Smith, UCSC; Larry Smarr, CalIT2 30 PB http://blogs.nature.com/news/2012/05/us-cancer-genome-repository-hopes-to-speed-research.html
  • 34. [ 34 ] SDSC Protein Data Base Archive • Repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. Information is annotated and publicly released into the archive by the wwPDB.
  • 35. SDSC • Expertise – Bioinformatics programming and applications support. – Computational chemistry methods. – Compliance requirements, e.g., for dbGaP, FISMA and HIPAA. – Data mining techniques, machine learning and predictive analytics. – HPC and storage system architecture and design. – Scientific workflow systems and informatics pipelines. • Education and Training – Intensive boot camps for working professionals - Data Mining, Graph Analytics, and Bioinformatics and Scientific Workflows. – Customized, on-site training sessions/programs. – Data Science Certificate program. – “Hackathon” events in data science and other topics.
  • 36. Sherlock Cloud: A HIPAA-Compliant Cloud Healthcare IT Managed Services - SDSC Center of Excellence • Expertise in Systems, Cyber Security, Data Management, Analytics, Application Development, Advanced User Support and Project Management • Operating the first & largest FISMA Data Warehouse platform for Medicaid fraud, waste and abuse analysis • Leveraged FISMA experience to offer HIPAA-Compliant managed hosting for UC and academia • Supporting HHS CMS, NIH, UCOP and other UC Campuses • Sherlock services: Data Lab, Analytics, Case Management and Compliant Cloud
  • 38. Lawrence Livermore NL HPC Innovation Center Cardioid Electrophysiology human heart simulations allowing exploration of causes of • Arrhythmia • Sudden cardiac arrest • Predictive drug interactions. Depicts activation of each heart muscle cell and the cell-to-cell transfer of the voltage of up to 3 billion cells - in near-real time. Metagenomic analysis with Catalyst: • Comparing short genetic fragments in a query dataset against a large searchable index (14 million genomes - 3x larger than those currently in use) of genomes to determine the threat an organism poses to human health
  • 39. Community Data Science Resources renci RADII and GWU HIVE Driving Infrastructure Virtualization Enabling Reproducibility For FDA Submissions [ 39 ]
  • 40. RADII Resource Aware Datacentric collaboratIve Infrastructure Goal Make data-driven collaborations a ‘turn-key’ experience for domain researchers and a ‘commodity’ for the science community Approach A new cyber-infrastructure to manage data-centric collaborations based upon natural models of collaborations that occur among scientists. RENCI: Claris Castillo, Fan Jiang, Charles Schmidt, Paul Ruth, Anirban Mandal ,Shu Huang, Yufeng Xin, Ilya Baldin, Arcot Rajasekar SDSC: Amit Majumdar DUKE: Erich Huang Workflows - especially data-driven workflows and workflow ensembles - are becoming a centerpiece of modern computational science.
  • 41. RADII Rationale • Multi-institutional research teams grapple with multitude of resources – Policy-restricted large data sets – Campus compute resources – National compute resources – Instruments that produce data • Interconnected by networks – Campus, regional, national providers • Many options, much complexity • Data and infrastructure are treated separately RADII Creates A cyberinfrastructure that integrates data and resource management from the ground up to support data-centric research. RADII allows scientists to easily map collaborative data-driven activities onto a dynamically configurable cloud infrastructure.
  • 42. Infrastructure management has no visibility into data resources. Data management solutions have no visibility into the infrastructure. RADII: Foundational technologies. Data-grids present distributed data under a single abstraction and authorization layer. Networked Infrastructure as a Service (NIaaS) for rapid deployment of programmable network virtual infrastructure (clouds). Disjoint solutions. Incompatible resource abstractions. Goal: to reduce the data-infrastructure management gap
  • 43. RADII System – Virtualizing Data, Compute and Network for Collaboration Novel mechanisms to represent data-centric collaborations using DFD formalism Data-centric resource management mechanisms for provisioning and de-provisioning resources dynamically throughout the lifecycle of collaborations Novel mechanisms to map data processes, computations, storage and organization entities onto infrastructure
  • 44. FDA and George Washington University Big Data Decisions: Linking Regulatory and Industry Organizations with HIVE Bio-Compute Objects [ 44 ] Presented by: Dan Taylor, Internet 2 | Bio IT | Boston | 2016
  • 45. HIVE From Jan 2016: Vahan Simonyan, Raja Mazumder lecture NIH: Frontiers in Data Science Series https://videocast.nih.gov/summary.asp?Live=18299&bhcp=1 High-performance Integrated Virtual Environment A regulatory NGS data analysis platform
  • 46. BIG DATA – From a range of samples and instruments to approval for use analysis and review sample archival sequencing run file transfer regulation computation pipelines produced files are massive in size transfer is slow too large to keep forever; not standardized difficult to validate difficult to visualize and interpret how do we avoid mistakes? NGS lifecycle: from a biological sample to biomedical research and regulation
  • 47. • Data Size: petabyte scale, soon exa-bytes • Data Transfer: too slow over existing networks • Data Archival: retaining consistent datasets across many years of mandated evidence maintenance is difficult • Data Standards: floating standards, multiplicity of formats, inadequate communication protocols • Data Complexity: sophisticated IT framework needed for complex dataflow • Data Privacy: constrictive legal framework and ownership issues across the board from the patient bedside to the FDA regulation • Data Security: large number of complicated security rules and data protection tax IT subsystems and cripple performance • Computation Size: distributed computing, inefficiently parallelized, requires large investment of hardware, software and human-ware • Computation Standards: non canonical computation protocols, difficult to compare, reproduce, rely on computations • Computation Complexity: significant investment of time and efforts to learn appropriate skills and avoid pitfalls in complex computational pipelines • Interpretation: large outputs from enormous computations are difficult to visualize and summarize • Publication: peer review and audit requires communication by massive amount of information ... and how do we avoid mistakes ? software challenges and needs
  • 48. HIVE is an End to End Solution • Data retrieval from anywhere in the world • Storage of extra large scale data • Security approved by OIM • Integrator platform to bring different data and analytics together • Tailor made analytics designed around needs • Visualization made to help in interpretation of data • Support of the entire hardware, software and knowledge infrastructure • Expertise accumulated in the agency • Bio-Compute objects repository to provide reproducibility and interoperability and long term referable storage of computations and results HIVE is not • an application to perform few tasks • yet another database • a computer cluster or a cloud or a data center • an IT subsystem More: http://www.fda.gov/ScienceResearch/SpecialTopics/RegulatoryScience/ucm491893.htm
  • 49. Instantiation DataTypeDefinitions Definitions of metadata types Data Typing Engine Definitions of computations metadata Data Bio-compute Definitions of algorithms and pipeline descriptions Computational protocols Verifiable results within acceptable uncertainty/error Scientifically reliable interpretation HIVE data universe
  • 50. industry FDA regulatory analysis 2. compute 3. submit 1. data-forming 6. issues resubmits 5. regulatory decision 4. SOPP/protocols consumer $ millions of dollars 7. yes 7. no regulatory iterations ~$800 Million R&D dollars for a single drug ~$2.6 Billion total cost
  • 51. industry FDA HIVE public-HIVE Galaxy CLC DNA-nexus 2. compute 3. submit 1. data-forming 6. issues resubmits 5. bio-compute 2. HIVE SOPP/protocols 4. SOPP/protocols consumer 7. yes 7. no 4. submit bio-compute integration 3. compute Facilitate integration $ millions of dollars bio-compute as a way to link regulatory and industry organizations
  • 53. [ 53 ] Community-developed framework of trust enables: • Secure, streamlined sharing of protected resources • Consolidated management of user identities and access • Delivery of an integrated portfolio of community-developed solutions [ 53 ] Trusted Identity in Research The standard for over 600 higher education institutions—and counting!
  • 54. [ 54 ] 15 425+ 2 160+ 0 2000+ 7.8 million Academic Participants Sponsored Partners Registered Service Providers Individuals served by federated IdM Foundation for Trust & Identity 54 ®
  • 55. • Eric Boyd, Internet2 • Stephen Wolff, Internet2 • Stephen Goff, PhD, CyVERSE/iPlant, University of Arizona • Chris Dagdigian, BioTeam • Dawei Lin, PhD, NIAID, NIH • Paul Gibson, USDA ARS • Paul Travis, Syngenta • Evan Burness, NCSA • Sandeep Chandra, SDSC • Jonathan Allen, PhD, Lawrence Livermore National Lab • Claris Castillo, PhD, RENCI • Vahan Simonyan, PhD, FDA • Raja Mazumder, PhD, George Washington University • Eli Dart, ESnet, US Department of Energy • BGI • Nature [ 55 ] Acknowledgements
  • 56. Thank you! Daniel Taylor, Director, Business Development Internet2 dbt3@internet2.edu 703-517-2566
  • 58. [ 58 ] Rising expectations Network throughput required to move y bytes in x time. (US Dept of Energy - http://fasterdata.es.net). should be easy This year
  • 59. 3/30/16, © 2016 Internet2 Science DMZ* and perfSONAR Design pattern to address the most common bottlenecks to moving data * fasterdata.es.net 59
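
The arithmetic behind slide 58 (network throughput required to move y bytes in x time) can be sketched in a few lines of Python. This is only an illustration and is not taken from the deck: the function names `required_gbps` and `transfer_seconds` are hypothetical, units are assumed to be decimal (1 GB = 10^9 bytes, 1 Gbps = 10^9 bits per second), and protocol overhead is ignored. The worked numbers reuse the 24 GB sample file and the 30-second and 26-hour transfer times quoted on slide 15.

```python
"""Back-of-the-envelope throughput arithmetic (illustrative sketch only)."""

GB = 1e9  # assumption: decimal gigabyte, 10**9 bytes


def required_gbps(num_bytes: float, seconds: float) -> float:
    """Average rate, in gigabits per second, needed to move num_bytes in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes * 8 / seconds / 1e9


def transfer_seconds(num_bytes: float, gbps: float) -> float:
    """Ideal time, in seconds, to move num_bytes at a sustained rate of gbps."""
    if gbps <= 0:
        raise ValueError("gbps must be positive")
    return num_bytes * 8 / (gbps * 1e9)


sample = 24 * GB  # Sample.fa size quoted on slide 15

# Best case on a fully utilised 10 Gbps link (no protocol overhead).
ideal_10g_seconds = transfer_seconds(sample, 10)

# Average rates implied by the transfer times quoted on slide 15.
link_gbps = required_gbps(sample, 30)        # 30-second transfer
ftp_gbps = required_gbps(sample, 26 * 3600)  # 26-hour Internet + FTP transfer

if __name__ == "__main__":
    print(f"Ideal time at 10 Gbps: {ideal_10g_seconds:.1f} s")
    print(f"Implied rate, 30 s transfer: {link_gbps:.2f} Gbps")
    print(f"Implied rate, 26 h transfer: {ftp_gbps * 1000:.2f} Mbps")
```

Under these assumptions the 24 GB file needs about 19 seconds at a full 10 Gbps, so the quoted 30 seconds corresponds to an average of roughly 6.4 Gbps, while the 26-hour transfer works out to about 2 Mbps.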

Editor's Notes

  1. Greetings, I’m Dan Taylor from Internet2. Thanks for joining us. I’m going to talk a bit about Internet2 and the work we’re doing with clouds and other compute resources in our community. There are a lot of slides and I’ll move quickly, so please stop by our booth or download the slides if you have questions.
  2. Internet2 is the research and education network for the US. We’re a membership consortium of academia, government and corporations. Internet2 is an advanced networking consortium comprised of 221 U.S. universities, in cooperation with 45 leading corporations, 66 government agencies, laboratories and other institutions of higher learning, 35 regional and state research and education networks and more than 100 national research and education networking organizations representing over 50 countries. Internet2 actively engages our stakeholders in the development of important new technologies including middleware, security, network research and performance measurement capabilities which are critical to the achievement of the mission goals of our members. Throughout our first 15 years, Internet2 has served a unique role among networking organizations, pioneering the use of advanced network applications and technologies and facilitating their development in support of the work of the research community. Internet2 operates an advanced national optical network based on 17,500 miles of dedicated fiber and utilizes the latest 100G routers and optical transport systems with 8.8 Tbps of system capacity.
  3. Goal: deepen, extend, advance, and sustain the digital resources ecosystem. Value: a growing portfolio of resources and services, including advanced computing, high-end visualization, data analysis, and other resources and services. Interoperability with other infrastructures.
  4. Membership numbers as of 2014-03-27. Campus Champions: 200 at 175 institutions. 14,000 participants in training workshops (online and in person).
  5. Absolutely key to our success are the global partnerships we have formed. Internet2 partners with over 50 national research and education networks, including our friends in Canada, to enable connectivity to more than 100 international networks. These partnerships provide the basis for understanding how to facilitate collaborations between the US Internet2 community and counterparts in other countries. Our global partnerships have yielded important developments in new technologies. For example, the DICE collaborative is a partnership between GEANT, Internet2, CANARIE, and ESnet that provides a joint forum for North American and European investment in advanced networking leadership. Our collaboration has led to the development of world-leading tools like perfSONAR and dynamic circuit networking, which I will touch on later. Our focus in 2010 is to deliver direct services to our members as a result of our development investments.
  6. Our community has a track record of IT successes. We haven’t looked at life sciences yet, but I’m pretty sure the Internet2 community’s impact is even greater there.
  7. R&E must keep constructing the conditions that spur innovation. Give innovators an environment where they’re free to try new, untested, unpopular, ridiculously challenging things. Innovation requires a big playground. An innovation platform must encourage utilization, not limit it.
  8. Life sciences research shares many of the trends we see elsewhere in big science: data set sizes growing rapidly and an increased need for collaboration. But we also see a new ecosystem fueling research. At the same time, however, diminishing R&D dollars are pressuring industry and government.
  9. Chris Dagdigian does a great job detailing how IT deals with the changes in life sciences research. I have a couple of takeaways from his talks, and it’s useful to see how Internet2 addresses what’s going on. Scientific instrument technology, which generates scientific data, is changing faster than the IT refresh cycle. Organizations see the big data wave coming and are now implementing 100G networks to get ahead of the rising tide. Organizations are going to the cloud to do things they can’t do on their own, not just to save money. Centralization will not eliminate the need to move data. Security concerns with high-speed transfers and collaboration can be addressed. Virtualized infrastructure is moving to the wide area. Big science flows are more disruptive than ever to enterprise networks, and there’s a trend toward separating business and research networks.
  10. One of the things we’re used to in the R&E community is change in scientific data growth.
  11. The Internet2 community has dealt with the data tsunami for many years now. The LHC shut down for 2 years to upgrade its power, and annual output has jumped from 13 to 30 petabytes. This data is distributed throughout the world by the R&E networks. In life sciences the driver is NGS, which is falling in price rapidly, along with a proliferation of devices generating data all over the world. Source: http://www.nature.com/news/large-hadron-collider-the-big-reboot-1.16095
  12. Our network has responded.
  13. Back in 2012 we showed how a 10G link from Beijing to UC Davis could change the game. A 24 GB file that would take 26 hours to traverse the internet was transferred in 30 seconds.
  14. Researchers likened the difference in collaboration to going from letters to email.
  15. So we’re seeing organizations get ahead of the tsunami by getting bigger networks. I recently helped the Department of Agriculture’s Agricultural Research Service do just this.
  16. I like to show this slide to illustrate how much life there is beyond humans, how it impacts our world, and how much of it USDA ARS has to deal with. It shows the genome sizes of various species, with the x axis on a log scale. Humans are there at the top, one of a number of mammals the USDA is interested in. But they are also interested in birds, crustaceans, fish, fungi, algae, bacteria, and protozoans, and of course plants. And some are extremely complex: you can see the wheat genome is orders of magnitude larger than the human genome.
  17. Beyond genomics, these kinds of projects create huge volumes of data as well as computational bottlenecks.
  18. To attack this problem they gathered requirements in 2013 and hired BioTeam to do an assessment, and we actually completed a 6-node science network of 10G and 100G links by the end of 2015. That was fast!
  19. R&E collaborations are handled at the 100G links on the coasts, and another 100G link feeds the new HPC in Ames, Iowa.
  20. You can view Internet2 as the medium for all the data and computing resources, forming a problem-solving community around these high-speed connections.
  21. Syngenta, a life sciences company, is a great example of an organization making the most of these connections.
  22. They are an agribusiness with a mission to improve plant productivity. They stay on the leading edge of science through their internal research and their collaboration with the academic community.
  23. Syngenta was challenged by many of the issues USDA saw, but on a global scale and with even more pressure to innovate.
  24. We installed a 10G Layer 2 service that provided high-speed Direct Connect access to AWS, where they could do surge HPC and retrieve sequencing data outsourced to the academic community. They could also connect to NCSA to build and run custom pipelines. They can also use the connection to work with the A*STAR supercomputer center in Singapore, where they intend to build an Asian genomic center. Finally, we expect to bring up locations in Switzerland and Great Britain, completing a global research network.
  25. I just mentioned NCSA, and this resource deserves a few seconds. NCSA does a lot of work with industry, and a comment from a VP at BP says it all.
  26. Leveraging its talent and one of the fastest computers in the world, NCSA provides companies with a full range of services to help them innovate.
  27. They do a lot of work in the life sciences. The one I’ll note here is an Alzheimer’s GWAS study with Mayo Clinic.
  28. In this one they handled an enormous amount of data and kind of strong-armed the computational challenge: what would’ve taken 2 years at Mayo was done in 6 hours on Blue Waters.
  29. Another incredible resource in the community is SDSC.
  30. You may know them as the home of CGHub, which holds The Cancer Genome Atlas. Note the bits-per-second growth from 1G to 15G from 2012 to 2015. CGHub is a large-scale data repository and portal for the National Cancer Institute’s Cancer Genome Research Programs. Current capacity is 5 petabytes, scalable to 20 petabytes. The Cancer Genome Atlas, one data collection of many in the CGH, could by itself produce 10 PB in the next four years. As an illustration of how Internet2 is making network resources accessible, consider the UCSC Cancer Genomics Hub, operated by the University of California at Santa Cruz and located at the San Diego Supercomputer Center co-location facility. Without the “big pipes” provided between SDSC and Internet2, the CGH would not be able to keep pace with demand for its data. As both users and data in the repository grew over a three-year period, the bandwidth needed to support the activity grew by 15x.
  31. SDSC also has other important data sources, like the Protein Data Bank archive.
  32. They also have consulting services focused on supporting life sciences research.
  33. I’d also note the cloud environment they built for HHS CMS – FISMA compliant and HIPAA ready.
  34. The National Labs are also a huge part of the community
  35. Whenever I run into a metagenomics problem, I reference Jonathan Allen’s huge microbiome work with metagenomics.
  36. We also have a number of interesting efforts to facilitate collaboration and reproducibility.
  37. RADII is an exciting project that virtualizes clouds, leveraging iRODS and virtual networks. The idea is to allow researchers, not IT, to spin up and monitor local and cloud resources, compute, and network infrastructure on demand. This helps, for example, when I need to complete a collaborative workflow and move data and compute across a number of compute resources.
  38. RADII allows you to represent data-centric collaborations using standard modeling mechanisms; map data processes, computations, storage, and organizational entities onto the physical infrastructure with the click of a button; and provision and de-provision infrastructure dynamically throughout the lifecycle of the collaboration.
  39. RADII builds on the data management of iRODS and the infrastructure virtualization of ORCA and ExoGENI to give researchers control over the infrastructure that’s necessary for collaboration.
  40. Here’s an example of this virtualization, with researchers at Duke, UNC, and Scripps sharing data and workflows on SDSC compute resources. The goals are ease of use and improved end-to-end performance as perceived by the scientists. To enable this vision we need two technologies with a high level of programmability and automation.
  41. A collaboration between the FDA and GW is looking to improve reproducibility by using biocompute objects. This should accelerate regulatory approvals and reduce costs.
  42. This represents the process for FDA submissions supported by NGS. There are many opportunities for making mistakes along the way. These mistakes result in delays and costly resubmissions.
  43. HIVE addresses many of the challenges in gaining agreement at the end of this process, and its potential to impact reproducibility is the most exciting.
  44. The HIVE platform is a big data analysis solution used by the FDA and available to industry. The biocompute objects repository is key to reproducibility.
  45. To get to better reproducibility, HIVE relies on a data typing engine to define metadata for the data, the computations, and the algorithms and pipelines, creating a biocompute object related to the submission that’s reusable by the FDA. Data typing engine: a facility for registering the structure, syntax, and ontologies of the information fields of objects. Metadata type: descriptive information on the structure of data files or electronic records. Computation metadata: a description of arguments and parameters (not values) for computational analysis. Definitions of algorithms and pipeline descriptions: descriptions of the characteristics of executable applications. Data: a collection of actual values observed and accumulated during experimentation by a device or an observer. Computational protocol: a well-parameterized computational pipeline designed to produce scientifically meritorious outcomes with appropriate data. Bio-compute: an instance of an actual execution of the computational protocols on a given set of data, with actual parameter values, generating identifiable outcomes or results.
  46. HIVE would help by recording the parameters of the analysis as biocompute objects (or using existing ones in the public repository) and sharing them with the FDA so they can verify that analysis. Data forming is done using a public HIVE and integrated with your usual analytic tools. The resulting biocompute objects are submitted to the FDA, where they are used in the FDA HIVE to validate the results of the submission.
  47. Finally, I’ll say a few words about federated identity.
  48. Over 10 years ago the R&E community recognized the importance of trust in collaborations and created the InCommon federated identity management solution.
  49. We now have a leading solution with around 8 million users. Please stop by the booth for more information.