1. Pushing Discovery with
Internet2
Cloud to Supercomputing
in Life Sciences
DAN TAYLOR
Director, Business Development, Internet2
BIO-IT WORLD 2016
BOSTON
APRIL, 2016
2. Internet2 Overview
• An advanced networking consortium
– Academia
– Corporations
– Government
• Operates a best-in-class national optical network
– 15,000 miles of dedicated fiber
– 100G routers and optical transport systems
– 8.8 Tbps capacity
• For over 20 years, our mission has been to
– Provide cost effective broadband and collaboration technologies to facilitate
frictionless research in Big Science – broad collaboration, extremely large data
sets
– Create tomorrow’s networks & a platform for networking research
– Engage stakeholders in
• Bridging the IT/Researcher gap
• Developing new technologies critical to their missions
3. The 4th Gen Internet2 Network
Internet2 Network
by the numbers
17 Juniper MX960 nodes
31 Brocade and Juniper
switches
49 custom colocation facilities
250+ amplification racks
15,717 miles of newly
acquired dark fiber
2,400 miles of partnered
capacity with Zayo
Communications
8.8 Tbps of optical capacity
100 Gbps of hybrid Layer 2
and Layer 3 capacity
300+ Ciena ActiveFlex 6500
network elements
4. Technology
• A Research Grade high speed network –
optimized for “Elephant flows”
• Layer 1 – secure point-to-point wavelength networking
• Advanced Layer 2 Services – open virtual network for Life
Sciences with connectivity speeds up to 100 Gbps
• SDN Network Virtualization customer trials now
• Advanced Layer 3 Services – High speed IP connectivity to
the world
• Superior economics
• Secure sharing of online research resources via a
federated identity management system
5. Internet2 Members and Partners
255 Higher Education members
67 Affiliate members
41 R&E Network members
82 Industry members
65+ Int’l partners reaching over
100 Nations
93,000+ Community anchor institutions
Focused on member technology needs
since 1996
"The idea of being
able to collaborate
with anybody,
anywhere, without
constraint…"
—Jim Bottum, CIO,
Clemson University
Community
6. Strong international partnerships
• Agreements with
international networking
partners offer
interoperability and
access
• Enable collaboration
between U.S. researchers
and overseas counterparts
in over 100 international R&E networks
Community
10. Life Sciences Research Today
• Sharing Big Data sets (genomic, environmental, imagery) key to basic and applied research
• Reproducibility - need to capture methods as well as raw data
– High variability in analytic processes and instruments
– Inconsistent formats and standards
• Lack of metadata & standards
• Biological systems are immensely complicated and dynamic (S. Goff, CyVERSE/iPlant)
• 21k human genes can make >100k proteins
• >50% of genes are controlled by day-night cycles
• Proteins have an average half-life of 30 hours
• Several thousand metabolites are rapidly changing
• Traits are environmentally and genetically controlled
• Information Technology - High Performance Computing and Networking - now can explore
these systems through simulation
• Collaboration
– Cross Domain, Cross Discipline
– Distribution of systems and talent is global
– Resources are public, private and academic
11. BIO-IT Trends in the Trenches 2015
with Chris Dagdigian
Takeaways
- Science is changing faster than IT funding
cycle for data intensive computing
environments
- Forward-looking 100G multi-site, multi-party
collaborations required
- Cloud adoption driven by capability vs cost
- Centralized data center dead; future is
distributed computing/data stores
- Big pharma security challenge has
been met
- SDN is real and happening now; part of
infrastructure automation wave
- Blast radius more important than ever:
DOE’s Science DMZ architecture is a
solution
https://youtu.be/U6i0THTxe4o
http://www.slideshare.net/chrisdag/2015-bioit-trends-from-the-frenches
2015 Bio-IT World Conference & Expo
• Change
• Networking
• Cloud
• Decentralized Collaboration
• Security
• Mission Networks
15. 2012: US – China 10 Gbps Link
Sample.fa (24 GB)
FedEx: 2 days
Internet + FTP: 26 hours
China – US 10G link: 30 secs
Dr. Lin Fang, Dr. Dawei Lin
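The comparison comes down to simple throughput arithmetic. A sketch of the math behind the slide's numbers (the efficiency factor and the ~2 Mbps effective commodity rate are my assumptions, chosen to illustrate how a 26-hour transfer becomes seconds):

```python
def transfer_time_seconds(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Seconds needed to move size_gb gigabytes over a link_gbps link.

    `efficiency` (an assumption, not from the slide) models protocol
    overhead and congestion on real paths.
    """
    size_bits = size_gb * 8e9                  # gigabytes -> bits
    usable_bps = link_gbps * 1e9 * efficiency  # usable bits per second
    return size_bits / usable_bps

# The 24 GB Sample.fa from the demo on a clean 10 Gbps path:
print(f"{transfer_time_seconds(24, 10):.0f} s")
# A congested commodity path at ~2 Mbps effective, roughly the slide's 26 hours:
print(f"{transfer_time_seconds(24, 0.002) / 3600:.0f} h")
```

The dedicated 10G link wins not because FTP is slow in principle, but because the effective rate on shared commodity paths is orders of magnitude below the nominal link speed.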
16. NCBI/UC-Davis/BGI: First ultra high speed transfer of
genomic data between China & US, June 2012
“The 10 Gigabit network connection is even
faster than transferring data to most local hard
drives,” said Dr. Lin [of UC, Davis]. “The use of
a 10 Gigabit network connection will be
groundbreaking, very much like email
replacing hand delivered mail for
communication. It will enable scientists in the
genomics-related fields to communicate and
transfer data more rapidly and conveniently,
and bring the best minds together to better
explore the mysteries of life science.” (BGI
press release)
Life Sciences Engagement
Community
18. USDA Agricultural Research Service Science Network
• USDA's scope extends far beyond the human genome
19. USDA Agricultural Research Service
Use Cases
• Drought (Soil Moisture) Project – Challenging Volumes
of Data
– NASA satellite data storage - 7 TB/mo., 36mo mission
– ARS Hydrology and Remote Sensing Lab analysis - 108 TB
– Data completely re-processed 3 to 5 times
• Microbial Genomics Project – Computational
Bottlenecks
– Individual Strains of bacteria and microorganism communities
related to
Food Safety
Animal Health
Feed Efficiency
20. ARS Big Data Initiative
Big Data Workshop Recommendations,
(February 2013)
Three Pillars of the ARS Big Data Implementation
Plan – Network, HPC, Virtual Research Support
(April, 2014)
• Develop a Science DMZ
• Enable high-speed, low-latency transfer of
research data to HPC and storage from ARS
locations
• Virtual Researcher Support
Implementation Complete (Nov. 2015)
Clay Center, NE; Albany, CA; Beltsville
Labs/Nat’l Ag. Library, Beltsville, MD
Stoneville, MS; Ft. Collins, CO
Ames/NADC, IA
• ARS Scientific Computing
Assessment
• Final Report March 2014
21. SCInet Locations and Gateways
USDA Agricultural Research Service
Sites: Albany, CA; Ft. Collins, CO; Clay Center, NE; Ames, IA; Stoneville, MS; Beltsville, MD
Links: three at 100 Gb, three at 10 Gb
22. Cloud & Distributed Research Computing
@Scale
Community
Internet2 Approach:
Agile scaling of resources and capacity
Access to multi-domain, multi-discipline expertise in one dynamic global community
Offering the researcher a bottomless toolbox for innovation
23. New High Speed Cloud Collaborations
25. Syngenta Science Network
• Syngenta is a leading agriculture
company helping to improve global
food security by enabling millions of
farmers to make better use of
available resources.
• Key research challenge:
How to grow plants
more efficiently?
• Internet2 members, especially land
grant universities, are important
research partners.
26. The Challenge
– Increasing size of scientific data sets
– Growing number of useful external resources
and partners
– Complexity of genomic analyses is
increasing
– Need for big data collaborations across the
globe
– Must Innovate
27. – Higher data throughput
– High speed connectivity to AWS Direct Connect
Surge HPC
Collaborations with academic community
– High speed connections to best-in-class supercomputing resources
NCSA – University of Illinois
Leverage NCSA expertise in building custom R&D workflows
Leverage NCSA Industry Partnership Program
A*Star Supercomputing Center in Singapore
Supports a global, distributed, scientific computing capability
– Global scale : creating a global fabric for computing and collaboration
28. “I want to be 15 minutes behind NCSA and 6
months ahead of my competition”
- Keith Gray, BP
National Center for Supercomputing Applications
30. NCSA Mayo Clinic @Scale Genome-Wide
Association Study for Alzheimer's disease
• NCSA Private Sector Program
– UIUC HPCBio
– Mayo Clinic
• Blue Waters team and Swiss Institute of Bioinformatics
worked together to identify which genetic variants
interact to influence gene expression patterns that
may associate with Alzheimer’s disease
31. Big Data and Big Compute Problem
• 50,011,495,056 pairs of variants
• Each variant pair is tested against
181 subjects and 24,544 genic regions
• Computationally large problem:
– PLINK: ~2 years at Mayo
– FastEpistasis: ~6 hours on Blue Waters
• Can be a big data problem:
- 500 PB if keep all results
- 4 TB when using a conservative cutoff
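These figures follow from simple combinatorics. A back-of-the-envelope sketch (the ~400 bytes per result record is my assumption, chosen only to show how "keep everything" reaches the ~500 PB scale quoted on the slide):

```python
pairs = 50_011_495_056   # variant pairs, from the slide
regions = 24_544         # genic regions each pair is tested against
tests = pairs * regions  # individual epistasis tests

bytes_per_record = 400   # assumed size of one stored result record
total_bytes = tests * bytes_per_record
print(f"{tests:.2e} tests -> ~{total_bytes / 1e15:.0f} PB if every result is kept")
# A conservative significance cutoff keeps only about 1 result in 10^5,
# which is how the retained output shrinks from petabytes to terabytes.
```

The point of the slide is exactly this multiplication: the test count, not the raw genotype data, is what makes the problem both big-compute and potentially big-data.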
33. UCSC Cancer Genomics Hub: Large Data Flows to End Users
[Chart: cumulative TBs of CGHub files downloaded, with link bandwidth growing from 1G to 8G to 15G]
Data source: David Haussler, Brad Smith, UCSC; Larry Smarr, CalIT2
30 PB
http://blogs.nature.com/news/2012/05/us-cancer-genome-repository-hopes-to-speed-research.html
34. SDSC Protein Data Bank Archive
• Repository of atomic coordinates and other information describing proteins and other
important biological macromolecules. Structural biologists use methods such as X-ray
crystallography, NMR spectroscopy, and cryo-electron microscopy to determine
the location of each atom relative to each other in the molecule. Information is
annotated and publicly released into the archive by the wwPDB.
35. SDSC
• Expertise
– Bioinformatics programming
and applications support.
– Computational chemistry
methods.
– Compliance requirements,
e.g., for dbGaP, FISMA and
HIPAA.
– Data mining techniques,
machine learning and
predictive analytics
– HPC and storage system
architecture and design.
– Scientific workflow systems
and informatics pipelines.
• Education and Training
– Intensive Boot camps for
working professionals - Data
Mining, Graph Analytics, and
Bioinformatics and Scientific
Workflows.
– Customized, on-site training
sessions/programs.
– Data Science Certificate
program.
– “Hackathon” events in data
science and other topics.
36. Sherlock Cloud: A HIPAA-Compliant Cloud
Healthcare IT Managed Services - SDSC Center of Excellence
• Expertise in Systems, Cyber Security, Data Management,
Analytics, Application Development, Advanced User Support and
Project Management
• Operating the first & largest FISMA Data Warehouse platform for
Medicaid fraud, waste and abuse analysis
• Leveraged FISMA experience to offer HIPAA-Compliant
managed hosting for UC and academia
• Supporting HHS CMS, NIH, UCOP and other UC Campuses
• Sherlock services: Data Lab, Analytics, Case Management
and Compliant Cloud
39. Community Data Science Resources
RENCI RADII and GWU HIVE
Driving Infrastructure Virtualization
Enabling Reproducibility for FDA Submissions
40. RADII
Resource Aware Datacentric collaboratIve Infrastructure
Goal
Make data-driven collaborations a ‘turn-key’ experience for domain
researchers and a ‘commodity’ for the science community
Approach
A new cyber-infrastructure to manage data-centric collaborations based
upon natural models of collaborations that occur among scientists.
RENCI: Claris Castillo, Fan Jiang, Charles Schmidt, Paul Ruth, Anirban Mandal, Shu Huang, Yufeng Xin, Ilya Baldin, Arcot Rajasekar
SDSC: Amit Majumdar
DUKE: Erich Huang
Workflows - especially data-driven workflows and workflow
ensembles - are becoming a centerpiece of modern computational
science.
41. RADII Rationale
• Multi-institutional research teams grapple with a multitude of resources
– Policy-restricted large data sets
– Campus compute resources
– National compute resources
– Instruments that produce data
• Interconnected by networks
– Campus, regional, national providers
• Many options, much complexity
• Data and infrastructure are treated separately
RADII Creates
A cyberinfrastructure that integrates data and resource
management from the ground up to support data-centric research.
RADII allows scientists to easily map collaborative data-driven
activities onto a dynamically configurable cloud infrastructure.
42. RADII: Foundational technologies
The gap – disjoint solutions, incompatible resource abstractions:
– Infrastructure management has no visibility into data resources
– Data management solutions have no visibility into the infrastructure
To reduce the data-infrastructure management gap:
– Data grids present distributed data under a single abstraction and authorization layer
– Networked Infrastructure as a Service (NIaaS) for rapid deployment of programmable virtual network infrastructure (clouds)
43. RADII System – Virtualizing Data, Compute and Network for Collaboration
• Novel mechanisms to represent data-centric collaborations using DFD formalism
• Data-centric resource management mechanisms for provisioning and de-provisioning resources dynamically throughout the lifecycle of collaborations
• Novel mechanisms to map data processes, computations, storage and organization entities onto infrastructure
44. FDA and George Washington University
Big Data Decisions:
Linking Regulatory and Industry
Organizations with
HIVE Bio-Compute Objects
Presented by: Dan Taylor, Internet2 | Bio-IT | Boston | 2016
45. HIVE
From a Jan 2016 lecture by Vahan Simonyan and Raja Mazumder,
NIH Frontiers in Data Science Series
https://videocast.nih.gov/summary.asp?Live=18299&bhcp=1
High-performance Integrated Virtual Environment
A regulatory NGS data analysis platform
46. BIG DATA – From a range of samples and instruments to approval for use
NGS lifecycle: from a biological sample to biomedical research and regulation
• sequencing run: produced files are massive in size
• file transfer: transfer is slow
• archival: too large to keep forever; not standardized
• computation pipelines: difficult to validate
• analysis and review: difficult to visualize and interpret
• regulation: how do we avoid mistakes?
47. Software challenges and needs
• Data Size: petabyte scale, soon exabytes
• Data Transfer: too slow over existing networks
• Data Archival: retaining consistent datasets across many years of mandated evidence maintenance is difficult
• Data Standards: floating standards, multiplicity of formats, inadequate communication protocols
• Data Complexity: sophisticated IT framework needed for complex dataflow
• Data Privacy: restrictive legal framework and ownership issues across the board, from the patient bedside to FDA regulation
• Data Security: a large number of complicated security rules and data protections tax IT subsystems and cripple performance
• Computation Size: distributed computing, inefficiently parallelized, requires large investment in hardware, software and human-ware
• Computation Standards: non-canonical computation protocols; difficult to compare, reproduce and rely on computations
• Computation Complexity: significant investment of time and effort to learn appropriate skills and avoid pitfalls in complex computational pipelines
• Interpretation: large outputs from enormous computations are difficult to visualize and summarize
• Publication: peer review and audit require communication of massive amounts of information
... and how do we avoid mistakes?
48. HIVE is an End to End Solution
• Data retrieval from anywhere in the world
• Storage of extra large scale data
• Security approved by OIM
• Integrator platform to bring different data and analytics together
• Tailor made analytics designed around needs
• Visualization made to help in interpretation of data
• Support of the entire hardware, software and knowledge infrastructure
• Expertise accumulated in the agency
• Bio-Compute objects repository to provide reproducibility and
interoperability and long term referable storage of computations and results
HIVE is not
• an application to perform a few tasks
• yet another database
• a computer cluster or a cloud or a data center
• an IT subsystem
More:
http://www.fda.gov/ScienceResearch/SpecialTopics/RegulatoryScience/ucm491893.htm
49. HIVE data universe
• Data Typing Engine – instantiates objects from DataType definitions
• Definitions of metadata types → Data
• Definitions of computation metadata, algorithms and pipeline descriptions → Computational protocols
• Data + computational protocols → Bio-compute
• Bio-compute → verifiable results within acceptable uncertainty/error → scientifically reliable interpretation
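As a rough illustration of the idea (the field names below are hypothetical, not the official bio-compute schema), a bio-compute object can be thought of as a checksummed, referable record of exactly what was run on exactly which inputs:

```python
import hashlib
import json

def make_biocompute_object(name, pipeline, inputs):
    """Bundle a pipeline description and input checksums into one object.

    `pipeline` is a list of {tool, version, params} dicts; `inputs` maps
    file names to raw bytes. All field names are illustrative only.
    """
    obj = {
        "name": name,
        "pipeline": pipeline,
        "inputs": [
            {"file": fname, "sha256": hashlib.sha256(data).hexdigest()}
            for fname, data in sorted(inputs.items())
        ],
    }
    # Hash the canonical JSON so the object itself gets a stable identifier
    # that a reviewer can cite when re-running the same analysis.
    canonical = json.dumps(obj, sort_keys=True).encode()
    obj["object_id"] = hashlib.sha256(canonical).hexdigest()
    return obj

bco = make_biocompute_object(
    "toy-variant-analysis",
    [{"tool": "aligner", "version": "1.0", "params": {"seed": 11}}],
    {"sample.fa": b">r1\nACGTACGT\n"},
)
print(bco["object_id"][:16])  # identical reruns yield the same identifier
```

The design point is determinism: because the identifier is derived from the canonical serialization, any change to a tool version, parameter, or input file produces a different object, which is what makes such objects usable for regulatory verification.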
50. Regulatory iterations
[Flow: industry → FDA → consumer; 1. data-forming, 2. compute, 3. submit, 4. SOPP/protocols, 5. regulatory decision, 6. issues → resubmits ($ millions of dollars), 7. yes/no]
~$800 million R&D dollars for a single drug
~$2.6 billion total cost
51. Bio-compute as a way to link regulatory and industry organizations
[Flow: industry side (public HIVE, Galaxy, CLC, DNA-nexus) – 1. data-forming, 2. compute, 3. submit bio-compute; FDA side (HIVE) – 2. HIVE SOPP/protocols, 4. SOPP/protocols with bio-compute integration, 5. bio-compute; 6. issues → resubmits ($ millions of dollars), 7. yes/no; facilitates integration]
53. Trusted Identity in Research
Community-developed framework of trust enables:
• Secure, streamlined sharing of protected resources
• Consolidated management of user identities and access
• Delivery of an integrated portfolio of community-developed solutions
The standard for over 600 higher education institutions, and counting!
54. Foundation for Trust & Identity
425+ Academic Participants
160+ Sponsored Partners
2,000+ Registered Service Providers
7.8 million Individuals served by federated IdM
55. • Eric Boyd, Internet2
• Stephen Wolff, Internet2
• Stephen Goff, PhD, CyVERSE/iPlant, University of Arizona
• Chris Dagdigian, BioTeam
• Dawei Lin, PhD, NIAID, NIH
• Paul Gibson, USDA ARS
• Paul Travis, Syngenta
• Evan Burness, NCSA
• Sandeep Chandra, SDSC
• Jonathan Allen, PhD, Lawrence Livermore National Lab
• Claris Castillo, PhD, RENCI
• Vahan Simonyan, PhD, FDA
• Raja Mazumder, PhD, George Washington University
• Eli Dart, ESNET, US Department of Energy
• BGI
• Nature
Acknowledgements
58. Rising expectations
Network throughput required to move y bytes in x time
(US Dept of Energy - http://fasterdata.es.net)
[Chart annotations: "should be easy", "this year"]
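The chart's underlying rule of thumb (required throughput is just bytes times 8 over seconds) can be sketched as:

```python
def required_gbps(terabytes: float, hours: float) -> float:
    """Sustained throughput (Gbps) needed to move `terabytes` in `hours`."""
    bits = terabytes * 8e12
    return bits / (hours * 3600) / 1e9

# A few illustrative cells of such a table: moving a dataset in 24 hours.
for tb in (1, 10, 100, 1000):
    print(f"{tb:>5} TB/day -> {required_gbps(tb, 24):6.2f} Gbps sustained")
```

Moving 100 TB in a day already demands nearly 10 Gbps of sustained throughput, which is why instruments producing terabytes per week push organizations toward 100G-class research networks.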
Greetings, I'm Dan Taylor from Internet2 – thanks for joining us. I'm going to talk a bit about Internet2 and the work we're doing with clouds and other compute resources in our community. There are a lot of slides and I'll move quickly, so please stop by our booth or download the slides if you have questions.
Internet2 is the research and education network for the US. We're a membership consortium of academia, government and corporations.
Internet2 is an advanced networking consortium comprised of 221 U.S. universities, in cooperation with 45 leading corporations, 66 government agencies, laboratories and other institutions of higher learning, 35 regional and state research and education networks and more than 100 national research and education networking organizations representing over 50 countries
Internet2 actively engages our stakeholders in the development of important new technologies including middleware, security, network research and performance measurement capabilities which are critical to the achievement of the mission goals of our members.
Throughout our first 15 years, Internet2 has served a unique role among networking organizations, pioneering the use of advanced network applications and technologies, and facilitating their development—to facilitate the work of the research community.
Internet2 operates an advanced national optical network based on 17,500 miles of dedicated fiber and utilizes the latest 100G routers and optical transport systems with 8.8Tbps of system capacity
Goal: deepen, extend, advance and sustain the digital resources ecosystem. Value: a growing portfolio of resources and services – advanced computing, high-end visualization, data analysis, and other resources and services – plus interoperability with other infrastructures.
Membership numbers as of 2014-03-27.
Campus Champions (200 at 175 institutions); 14,000 participants in training workshops (online and in person).
Absolutely key to our success is the global partnerships we have formed.
[>>] Internet2 partners with over 50 national research and education networks including our friends in Canada to enable connectivity to more than 100 international networks.
These partnerships provide the basis for understanding how to facilitate collaborations between the US Internet2 community and counterparts in other countries
Our global partnerships have yielded important developments in new technologies. For example - the DICE collaborative is a partnership between GEANT, Internet2, CANARIE and ESnet which provides a joint forum for North American and European investment in advanced networking leadership
Our collaboration has led to the development of world-leading tools like PerfSONAR and dynamic circuit networking – which I will touch on later. Our focus in 2010 is to deliver direct services to our members as a result of our development investments
Our community has a track record of IT successes; we haven't looked at life sciences yet but I'm pretty sure the Internet2 community's impact is even greater there.
R&E must keep constructing the conditions that spur innovation
Give innovators an environment where they’re free to try new, untested, unpopular, ridiculously challenging things
Innovation requires a big playground
An innovation platform must encourage utilization, not limit it
Life sciences research shares many of the trends we see elsewhere in big science – data set sizes growing rapidly, increased need for collaboration – but we also see a new ecosystem fueling research. At the same time, however, diminishing R&D dollars are pressuring industry and government.
Chris Dagdigian does a great job detailing how IT deals with the changes in life sciences research. I have a couple of takeaways from his talks; it's useful to see how
Internet2 addresses what's going on.
Scientific instrument technology – which generates scientific data – is changing faster than the IT refresh cycle.
Organizations see the big data wave coming and are now implementing 100 G networks to get ahead of the rising tide
Organizations are going to the cloud to be able to do things they can't do on their own, not just to save money.
Centralization will not eliminate the need to move data.
Security concerns with high speed transfers and collaboration can be addressed
Virtualized infrastructure is moving to the wide area
Big science flows are more disruptive than ever to enterprise networks – there's a trend toward separating business and research networks.
One of the things we're used to in the R&E community is change in scientific data growth.
The Internet2 community has dealt with the data tsunami for many years now. The LHC shut down for 2 years to upgrade its power – annual output has jumped from 13 to 30 petabytes a year. This data is distributed throughout the world by the R&E networks. In life sciences the driver is NGS, falling in price rapidly, plus a proliferation of devices generating data all over the world.
http://www.nature.com/news/large-hadron-collider-the-big-reboot-1.16095
Our network has responded
Back in 2012 we showed how a 10G link from Beijing to UC Davis could change the game. A 24 GB file that would take 26 hours to traverse the internet was transferred in 30 seconds.
Researchers likened the difference in collaboration to going from letters to email.
So we’re seeing organizations get ahead of the tsunami by getting bigger networks. I recently helped the Department of Agriculture’s Agriculture Research Service do just this.
I like to show this slide to illustrate how much life there is beyond humans – and USDA ARS has to deal with many of these species and how they impact our world. It shows the size of genomes of various species, with the x axis being a log scale. Humans are there at the top, one of a number of mammals the USDA is interested in.
But they are also interested in birds, crustaceans, fish, fungi, algae, bacteria and protozoans – and of course plants. And some are extremely complex – you see the size of the wheat genome is orders of magnitude larger than the human genome.
Beyond genomics , these kinds of projects create huge volumes of data as well as computational bottlenecks
To attack this problem they gathered requirements in 2013, hired BioTeam to do an assessment, and we completed a 6-node science network of 10 and 100G links by the end of 2015. That was fast!
R&E collaborations are handled at the 100G links on the coasts and another 100G feeds the new HPC in Ames, Iowa.
You can view Internet2 as the medium for all the data and computing resources, forming a problem solving community around these high speed connections
Syngenta, a life sciences company, is a great example of an organization making the most of these connections.
They are an agribusiness with a mission to improve plant productivity; they stay on the leading edge of science through their internal research and their collaboration with the academic community.
Syngenta was challenged by many of the issues USDA saw, but on a global scale and even more pressure to innovate.
We installed a 10G Layer 2 service that provided high speed Direct Connect access to AWS, where they could do surge HPC and retrieve sequencing data outsourced to the academic community. They also could connect to NCSA to build and run custom pipelines, and they can use the connection to work with the A*Star supercomputing centre in Singapore, where they intend to build an Asian genomics center. Finally, we expect to bring up locations in Switzerland and Great Britain, completing a global research network.
I just mentioned NCSA, and this resource deserves a few seconds. NCSA does a lot of work with industry, and a comment from a VP at BP says it all.
Leveraging its talent and one of the fastest computers in the world, NCSA provides companies with a full range of services to help them innovate.
They do a lot of work in the life sciences; the one I'll note here is an Alzheimer's GWAS study with the Mayo Clinic.
In this one they handled an enormous amount of data and strong-armed the computational challenge – what would've taken 2 years at Mayo was done in 6 hours on Blue Waters.
Another incredible resource in the community is SDSC
You may know them as the home of CGHub, which holds the Cancer Genome Atlas. Note the bits-per-second growth from 1G to 15G from 2012 to 2015.
CGHub is a large-scale data repository and portal for the National Cancer Institute’s Cancer Genome Research Programs
Current capacity is 5 Petabytes, scalable to 20 Petabytes.
The Cancer Genome Atlas, one data collection of many in the CGH, by itself could produce 10 PB in the next four years
As an illustration of how Internet2 is making network resources accessible, consider the UCSC Cancer Genomics Hub, operated by the University of California at Santa Cruz and located at the San Diego Supercomputer Center co-location facility. Without the "big pipes" provided between SDSC and Internet2, the CGH would not be able to keep pace with demand for its data.
As both users and data in the repository grew over a three year period, the bandwidth needed to support the activity grew by 15x.
SDSC also has other important data sources like the Protein Data Bank archive.
They also have consulting services very much focused to support life sciences research
I’d also note the cloud environment they built for HHS CMS – FISMA compliant and HIPAA ready.
The National Labs are also a huge part of the community
Whenever I run into a Metagenomics problem I reference Jonathan Allen’s huge microbiome work with metagenomics
We also have a number of interesting efforts to facilitate collaboration and reproducibility.
RADII is an exciting project that virtualizes clouds leveraging iRODS and virtual networks. The idea is to allow researchers, not IT, to spin up and monitor local and cloud resources, compute and network infrastructure on demand – for example, when I need to complete a collaborative workflow and move data and compute across a number of compute resources.
Radii allows you to
represent data-centric collaborations using standard modeling mechanisms;
map data processes, computations, storage, and organizational entities onto the physical infrastructure with the click of a button
provision and de-provision infrastructure dynamically throughout the lifecycle of the collaboration.
RADII builds on the data management of iRODS and the infrastructure virtualization of ORCA and ExoGENI to give researchers control over the infrastructure that's necessary for collaboration.
Here's an example of this virtualization, with researchers at Duke, UNC and Scripps sharing data and workflows on SDSC compute resources.
Ease of use; improved end-to-end performance as perceived by the scientists.
To enable this vision we need two technologies with a high level of programmability and automation.
A collaboration between the FDA and GW is looking to improve reproducibility by using bio-compute objects. This should accelerate regulatory approvals and reduce costs.
This represents the process for FDA submissions supported by NGS. There are a lot of opportunities for making mistakes along the way. These mistakes result in delays and costly resubmissions.
Of the challenges in gaining agreement at the end of this process, many are addressed by HIVE; its potential to impact reproducibility is the most exciting.
The HIVE platform is a big data analysis solution used by the FDA and available to industry. The bio-compute objects repository is key to reproducibility.
To get to better reproducibility, HIVE relies on a data typing engine to define metadata for the data, the computations, and both algorithms and pipelines, creating a bio-compute object related to the submission that's reusable by the FDA.
Data typing engine – a facility for registering the structure, syntax and ontologies of the information fields of objects.
Metadata type- descriptive information on the structure of data files or electronic records.
Computation metadata- Description of arguments and parameters (not values) for computational analysis.
Definitions of algorithms and pipeline descriptions- descriptions of the characteristics for executable applications.
Data- collection of actual values observed and accumulated during experimentation by a device or an observer.
Computational protocol- well parameterized computational pipeline designed to produce scientifically meritable outcomes with appropriate data.
Bio-compute- instance of an actual execution of the computational protocols on a given set of data with actual values of parameters generating identifiable outcomes/results.
HIVE would help by recording the parameters of the analysis as biocompute objects (or use existing ones in the public repository) and share them with FDA so they can verify that analysis.
Data forming is done using a public HIVE integrated with your usual analytic tools. The resulting bio-compute objects are submitted to the FDA; these bio-compute objects are used in the FDA HIVE to validate the results of the submission.
Finally I ‘ll say a few words about federated identity.
Over 10 years ago the R&E community recognized the importance of trust in collaborations and created the InCommon federated identity management solution.
We now have a leading solution with around 8 million users. Please stop by the booth for more information.