1. Pushing Discovery with
Internet2
Cloud to Supercomputing
in Life Sciences
DAN TAYLOR
Director, Business Development, Internet2
BIO-IT WORLD 2016
BOSTON
APRIL, 2016
2. Internet2 Overview
• An advanced networking consortium
– Academia
– Corporations
– Government
• Operates a best-in-class national optical network
– 15,000 miles of dedicated fiber
– 100G routers and optical transport systems
– 8.8 Tbps capacity
• For over 20 years, our mission has been to
– Provide cost effective broadband and collaboration technologies to facilitate
frictionless research in Big Science – broad collaboration, extremely large data
sets
– Create tomorrow’s networks & a platform for networking research
– Engage stakeholders in
• Bridging the IT/Researcher gap
• Developing new technologies critical to their missions
3. The 4th Gen Internet2 Network
Internet2 Network
by the numbers
17 Juniper MX960 nodes
31 Brocade and Juniper
switches
49 custom colocation facilities
250+ amplification racks
15,717 miles of newly
acquired dark fiber
2,400 miles of partnered
capacity with Zayo
Communications
8.8 Tbps of optical capacity
100 Gbps of hybrid Layer 2
and Layer 3 capacity
300+ Ciena ActiveFlex 6500
network elements
4. Technology
• A Research Grade high speed network –
optimized for “Elephant flows”
• Layer 1 – secure point-to-point wavelength networking
• Advanced Layer 2 Services – open virtual network for Life
Sciences with connectivity speeds up to 100 Gbps
• SDN Network Virtualization customer trials now
• Advanced Layer 3 Services – High speed IP connectivity to
the world
• Superior economics
• Secure sharing of online research resources via a
federated identity management system
5. Internet2 Members and Partners
255 Higher Education members
67 Affiliate members
41 R&E Network members
82 Industry members
65+ Int’l partners reaching over
100 Nations
93,000+ Community anchor institutions
Focused on member technology needs
since 1996
"The idea of being
able to collaborate
with anybody,
anywhere, without
constraint…"
—Jim Bottum, CIO,
Clemson University
Community
6. Strong international partnerships
• Agreements with
international networking
partners offer
interoperability and
access
• Enable collaboration
between U.S. researchers
and overseas counterparts
in over 100 international R&E networks
Community
10. Life Sciences Research Today
• Sharing Big Data sets (genomic, environmental, imagery) key to basic and applied research
• Reproducibility - need to capture methods as well as raw data
– High variability in analytic processes and instruments
– Inconsistent formats and standards
• Lack of metadata & standards
• Biological systems are immensely complicated and dynamic (S. Goff, CyVERSE/iPlant)
• 21k human genes can make >100k proteins
• >50% of genes are controlled by day-night cycles
• Proteins have an average half-life of 30 hours
• Several thousand metabolites are rapidly changing
• Traits are environmentally and genetically controlled
• Information Technology - High Performance Computing and Networking - now can explore
these systems through simulation
• Collaboration
– Cross Domain, Cross Discipline
– Distribution of systems and talent is global
– Resources are public, private and academic
11. BIO-IT Trends in the Trenches 2015
with Chris Dagdigian
Takeaways
- Science is changing faster than IT funding
cycle for data intensive computing
environments
- Forward-looking 100G multi-site, multi-party
collaborations required
- Cloud adoption driven by capability vs cost
- Centralized data center dead; future is
distributed computing/data stores
- Big pharma security challenge has
been met
- SDN is real and happening now; part of
infrastructure automation wave
- Blast radius more important than ever:
DOE’s Science DMZ architecture is a
solution
https://youtu.be/U6i0THTxe4o
http://www.slideshare.net/chrisdag/2015-bioit-trends-from-the-frenches
2015 Bio-IT World Conference & Expo
• Change
• Networking
• Cloud
• Decentralized Collaboration
• Security
• Mission Networks
15. 2012: US – China 10 Gbps Link
Sample.fa (24 GB)
FedEx: 2 days
Internet + FTP: 26 hours
China – US 10G link: 30 secs
Dr. Lin Fang, Dr. Dawei Lin
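The comparison comes down to simple throughput arithmetic. A sketch of the math behind the slide's numbers (the efficiency factor and the ~2 Mbps effective commodity rate are my assumptions, chosen to illustrate how a 26-hour transfer becomes seconds):

```python
def transfer_time_seconds(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Seconds needed to move size_gb gigabytes over a link_gbps link.

    `efficiency` (an assumption, not from the slide) models protocol
    overhead and congestion on real paths.
    """
    size_bits = size_gb * 8e9                  # gigabytes -> bits
    usable_bps = link_gbps * 1e9 * efficiency  # usable bits per second
    return size_bits / usable_bps

# The 24 GB Sample.fa from the demo on a clean 10 Gbps path:
print(f"{transfer_time_seconds(24, 10):.0f} s")
# A congested commodity path at ~2 Mbps effective, roughly the slide's 26 hours:
print(f"{transfer_time_seconds(24, 0.002) / 3600:.0f} h")
```

The dedicated 10G link wins not because FTP is slow in principle, but because the effective rate on shared commodity paths is orders of magnitude below the nominal link speed.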
16. NCBI/UC-Davis/BGI: First ultra high speed transfer of
genomic data between China & US, June 2012
“The 10 Gigabit network connection is even
faster than transferring data to most local hard
drives,” said Dr. Lin [of UC, Davis]. “The use of
a 10 Gigabit network connection will be
groundbreaking, very much like email
replacing hand delivered mail for
communication. It will enable scientists in the
genomics-related fields to communicate and
transfer data more rapidly and conveniently,
and bring the best minds together to better
explore the mysteries of life science.” (BGI
press release)
Life Sciences Engagement
Community
18. USDA Agricultural Research Service Science Network
• USDA's scope extends far beyond the human genome
19. USDA Agricultural Research Service
Use Cases
• Drought (Soil Moisture) Project – Challenging Volumes
of Data
– NASA satellite data storage - 7 TB/mo., 36mo mission
– ARS Hydrology and Remote Sensing Lab analysis - 108 TB
– Data completely re-processed 3 to 5 times
• Microbial Genomics Project – Computational
Bottlenecks
– Individual Strains of bacteria and microorganism communities
related to
Food Safety
Animal Health
Feed Efficiency
20. ARS Big Data Initiative
Big Data Workshop Recommendations,
(February 2013)
Three Pillars of the ARS Big Data Implementation
Plan – Network, HPC, Virtual Research Support
(April, 2014)
• Develop a Science DMZ
• Enable high-speed, low-latency transfer of
research data to HPC and storage from ARS
locations
• Virtual Researcher Support
Implementation Complete (Nov. 2015)
Clay Center, NE; Albany, CA; Beltsville
Labs/Nat’l Ag. Library, Beltsville, MD
Stoneville, MS; Ft. Collins, CO
Ames/NADC, IA
• ARS Scientific Computing
Assessment
• Final Report March 2014
21. SCInet Locations and Gateways
USDA Agricultural Research Service
Sites: Albany, CA; Ft. Collins, CO; Clay Center, NE; Ames, IA; Stoneville, MS; Beltsville, MD
Links: three at 100 Gb, three at 10 Gb
22. Cloud & Distributed Research Computing
@Scale
Community
Internet2 Approach:
Agile scaling of resources and capacity
Access to multi-domain, multi-discipline expertise in one dynamic global community
Offering the researcher a bottomless toolbox for innovation
23. New High Speed Cloud Collaborations
25. Syngenta Science Network
• Syngenta is a leading agriculture
company helping to improve global
food security by enabling millions of
farmers to make better use of
available resources.
• Key research challenge:
How to grow plants
more efficiently?
• Internet2 members, especially land
grant universities, are important
research partners.
26. The Challenge
– Increasing size of scientific data sets
– Growing number of useful external resources
and partners
– Complexity of genomic analyses is
increasing
– Need for big data collaborations across the
globe
– Must Innovate
27. – Higher data throughput
– High speed connectivity to AWS Direct Connect
Surge HPC
Collaborations with academic community
– High speed connections to best-in-class supercomputing resources
NCSA – University of Illinois
Leverage NCSA expertise in building custom R&D workflows
Leverage NCSA Industry Partnership Program
A*Star Supercomputing Center in Singapore
Supports a global, distributed, scientific computing capability
– Global scale : creating a global fabric for computing and collaboration
28. “I want to be 15 minutes behind NCSA and 6
months ahead of my competition”
- Keith Gray, BP
National Center for Supercomputing Applications
30. NCSA Mayo Clinic @Scale Genome-Wide
Association Study for Alzheimer's disease
• NCSA Private Sector Program
– UIUC HPCBio
– Mayo Clinic
• Blue Waters team and Swiss Institute of Bioinformatics
worked together to identify which genetic variants
interact to influence gene expression patterns that
may associate with Alzheimer’s disease
31. Big Data and Big Compute Problem
• 50,011,495,056 pairs of variants
• Each variant pair is tested against
181 subjects and 24,544 genic regions
• Computationally large problem:
– PLINK: ~2 years at Mayo
– FastEpistasis: ~6 hours on Blue Waters
• Can be a big data problem:
- 500 PB if keep all results
- 4 TB when using a conservative cutoff
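These figures follow from simple combinatorics. A back-of-the-envelope sketch (the ~400 bytes per result record is my assumption, chosen only to show how "keep everything" reaches the ~500 PB scale quoted on the slide):

```python
pairs = 50_011_495_056   # variant pairs, from the slide
regions = 24_544         # genic regions each pair is tested against
tests = pairs * regions  # individual epistasis tests

bytes_per_record = 400   # assumed size of one stored result record
total_bytes = tests * bytes_per_record
print(f"{tests:.2e} tests -> ~{total_bytes / 1e15:.0f} PB if every result is kept")
# A conservative significance cutoff keeps only about 1 result in 10^5,
# which is how the retained output shrinks from petabytes to terabytes.
```

The point of the slide is exactly this multiplication: the test count, not the raw genotype data, is what makes the problem both big-compute and potentially big-data.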
33. UCSC Cancer Genomics Hub: Large Data Flows to End Users
[Chart: cumulative TBs of CGHub files downloaded, with link bandwidth growing from 1G to 8G to 15G]
Data source: David Haussler, Brad Smith, UCSC; Larry Smarr, CalIT2
30 PB
http://blogs.nature.com/news/2012/05/us-cancer-genome-repository-hopes-to-speed-research.html
34. SDSC Protein Data Bank Archive
• Repository of atomic coordinates and other information describing proteins and other
important biological macromolecules. Structural biologists use methods such as X-ray
crystallography, NMR spectroscopy, and cryo-electron microscopy to determine
the location of each atom relative to each other in the molecule. Information is
annotated and publicly released into the archive by the wwPDB.
35. SDSC
• Expertise
– Bioinformatics programming
and applications support.
– Computational chemistry
methods.
– Compliance requirements,
e.g., for dbGaP, FISMA and
HIPAA.
– Data mining techniques,
machine learning and
predictive analytics
– HPC and storage system
architecture and design.
– Scientific workflow systems
and informatics pipelines.
• Education and Training
– Intensive Boot camps for
working professionals - Data
Mining, Graph Analytics, and
Bioinformatics and Scientific
Workflows.
– Customized, on-site training
sessions/programs.
– Data Science Certificate
program.
– “Hackathon” events in data
science and other topics.
36. Sherlock Cloud: A HIPAA-Compliant Cloud
Healthcare IT Managed Services - SDSC Center of Excellence
• Expertise in Systems, Cyber Security, Data Management,
Analytics, Application Development, Advanced User Support and
Project Management
• Operating the first & largest FISMA Data Warehouse platform for
Medicaid fraud, waste and abuse analysis
• Leveraged FISMA experience to offer HIPAA-Compliant
managed hosting for UC and academia
• Supporting HHS CMS, NIH, UCOP and other UC Campuses
• Sherlock services: Data Lab, Analytics, Case Management
and Compliant Cloud
39. Community Data Science Resources
RENCI RADII and GWU HIVE
Driving Infrastructure Virtualization
Enabling Reproducibility for FDA Submissions
40. RADII
Resource Aware Datacentric collaboratIve Infrastructure
Goal
Make data-driven collaborations a ‘turn-key’ experience for domain
researchers and a ‘commodity’ for the science community
Approach
A new cyber-infrastructure to manage data-centric collaborations based
upon natural models of collaborations that occur among scientists.
RENCI: Claris Castillo, Fan Jiang, Charles Schmidt, Paul Ruth, Anirban Mandal, Shu Huang, Yufeng Xin, Ilya Baldin, Arcot Rajasekar
SDSC: Amit Majumdar
DUKE: Erich Huang
Workflows - especially data-driven workflows and workflow
ensembles - are becoming a centerpiece of modern computational
science.
41. RADII Rationale
• Multi-institutional research teams grapple with a multitude of resources
– Policy-restricted large data sets
– Campus compute resources
– National compute resources
– Instruments that produce data
• Interconnected by networks
– Campus, regional, national providers
• Many options, much complexity
• Data and infrastructure are treated separately
RADII Creates
A cyberinfrastructure that integrates data and resource
management from the ground up to support data-centric research.
RADII allows scientists to easily map collaborative data-driven
activities onto a dynamically configurable cloud infrastructure.
42. RADII: Foundational technologies
The gap – disjoint solutions, incompatible resource abstractions:
– Infrastructure management has no visibility into data resources
– Data management solutions have no visibility into the infrastructure
To reduce the data-infrastructure management gap:
– Data grids present distributed data under a single abstraction and authorization layer
– Networked Infrastructure as a Service (NIaaS) for rapid deployment of programmable virtual network infrastructure (clouds)
43. RADII System – Virtualizing Data, Compute and Network for Collaboration
• Novel mechanisms to represent data-centric collaborations using DFD formalism
• Data-centric resource management mechanisms for provisioning and de-provisioning resources dynamically throughout the lifecycle of collaborations
• Novel mechanisms to map data processes, computations, storage and organization entities onto infrastructure
44. FDA and George Washington University
Big Data Decisions:
Linking Regulatory and Industry
Organizations with
HIVE Bio-Compute Objects
Presented by: Dan Taylor, Internet2 | Bio-IT | Boston | 2016
45. HIVE
From a Jan 2016 lecture by Vahan Simonyan and Raja Mazumder,
NIH Frontiers in Data Science Series
https://videocast.nih.gov/summary.asp?Live=18299&bhcp=1
High-performance Integrated Virtual Environment
A regulatory NGS data analysis platform
46. BIG DATA – From a range of samples and instruments to approval for use
NGS lifecycle: from a biological sample to biomedical research and regulation
• sequencing run: produced files are massive in size
• file transfer: transfer is slow
• archival: too large to keep forever; not standardized
• computation pipelines: difficult to validate
• analysis and review: difficult to visualize and interpret
• regulation: how do we avoid mistakes?
47. Software challenges and needs
• Data Size: petabyte scale, soon exabytes
• Data Transfer: too slow over existing networks
• Data Archival: retaining consistent datasets across many years of mandated evidence maintenance is difficult
• Data Standards: floating standards, multiplicity of formats, inadequate communication protocols
• Data Complexity: sophisticated IT framework needed for complex dataflow
• Data Privacy: restrictive legal framework and ownership issues across the board, from the patient bedside to FDA regulation
• Data Security: a large number of complicated security rules and data protections tax IT subsystems and cripple performance
• Computation Size: distributed computing, inefficiently parallelized, requires large investment in hardware, software and human-ware
• Computation Standards: non-canonical computation protocols; difficult to compare, reproduce and rely on computations
• Computation Complexity: significant investment of time and effort to learn appropriate skills and avoid pitfalls in complex computational pipelines
• Interpretation: large outputs from enormous computations are difficult to visualize and summarize
• Publication: peer review and audit require communication of massive amounts of information
... and how do we avoid mistakes?
48. HIVE is an End to End Solution
• Data retrieval from anywhere in the world
• Storage of extra large scale data
• Security approved by OIM
• Integrator platform to bring different data and analytics together
• Tailor made analytics designed around needs
• Visualization made to help in interpretation of data
• Support of the entire hardware, software and knowledge infrastructure
• Expertise accumulated in the agency
• Bio-Compute objects repository to provide reproducibility and
interoperability and long term referable storage of computations and results
HIVE is not
• an application to perform a few tasks
• yet another database
• a computer cluster or a cloud or a data center
• an IT subsystem
More:
http://www.fda.gov/ScienceResearch/SpecialTopics/RegulatoryScience/ucm491893.htm
49. HIVE data universe
• Data Typing Engine – instantiates objects from DataType definitions
• Definitions of metadata types → Data
• Definitions of computation metadata, algorithms and pipeline descriptions → Computational protocols
• Data + computational protocols → Bio-compute
• Bio-compute → verifiable results within acceptable uncertainty/error → scientifically reliable interpretation
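As a rough illustration of the idea (the field names below are hypothetical, not the official bio-compute schema), a bio-compute object can be thought of as a checksummed, referable record of exactly what was run on exactly which inputs:

```python
import hashlib
import json

def make_biocompute_object(name, pipeline, inputs):
    """Bundle a pipeline description and input checksums into one object.

    `pipeline` is a list of {tool, version, params} dicts; `inputs` maps
    file names to raw bytes. All field names are illustrative only.
    """
    obj = {
        "name": name,
        "pipeline": pipeline,
        "inputs": [
            {"file": fname, "sha256": hashlib.sha256(data).hexdigest()}
            for fname, data in sorted(inputs.items())
        ],
    }
    # Hash the canonical JSON so the object itself gets a stable identifier
    # that a reviewer can cite when re-running the same analysis.
    canonical = json.dumps(obj, sort_keys=True).encode()
    obj["object_id"] = hashlib.sha256(canonical).hexdigest()
    return obj

bco = make_biocompute_object(
    "toy-variant-analysis",
    [{"tool": "aligner", "version": "1.0", "params": {"seed": 11}}],
    {"sample.fa": b">r1\nACGTACGT\n"},
)
print(bco["object_id"][:16])  # identical reruns yield the same identifier
```

The design point is determinism: because the identifier is derived from the canonical serialization, any change to a tool version, parameter, or input file produces a different object, which is what makes such objects usable for regulatory verification.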
50. Regulatory iterations
[Flow: industry → FDA → consumer; 1. data-forming, 2. compute, 3. submit, 4. SOPP/protocols, 5. regulatory decision, 6. issues → resubmits ($ millions of dollars), 7. yes/no]
~$800 million R&D dollars for a single drug
~$2.6 billion total cost
51. Bio-compute as a way to link regulatory and industry organizations
[Flow: industry side (public HIVE, Galaxy, CLC, DNA-nexus) – 1. data-forming, 2. compute, 3. submit bio-compute; FDA side (HIVE) – 2. HIVE SOPP/protocols, 4. SOPP/protocols with bio-compute integration, 5. bio-compute; 6. issues → resubmits ($ millions of dollars), 7. yes/no; facilitates integration]
53. Trusted Identity in Research
Community-developed framework of trust enables:
• Secure, streamlined sharing of protected resources
• Consolidated management of user identities and access
• Delivery of an integrated portfolio of community-developed solutions
The standard for over 600 higher education institutions, and counting!
54. Foundation for Trust & Identity
425+ Academic Participants
160+ Sponsored Partners
2,000+ Registered Service Providers
7.8 million Individuals served by federated IdM
55. • Eric Boyd, Internet2
• Stephen Wolff, Internet2
• Stephen Goff, PhD, CyVERSE/iPlant, University of Arizona
• Chris Dagdigian, BioTeam
• Dawei Lin, PhD, NIAID, NIH
• Paul Gibson, USDA ARS
• Paul Travis, Syngenta
• Evan Burness, NCSA
• Sandeep Chandra, SDSC
• Jonathan Allen, PhD, Lawrence Livermore National Lab
• Claris Castillo, PhD, RENCI
• Vahan Simonyan, PhD, FDA
• Raja Mazumder, PhD, George Washington University
• Eli Dart, ESNET, US Department of Energy
• BGI
• Nature
Acknowledgements
58. Rising expectations
Network throughput required to move y bytes in x time
(US Dept of Energy - http://fasterdata.es.net)
[Chart annotations: "should be easy", "this year"]
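The chart's underlying rule of thumb (required throughput is just bytes times 8 over seconds) can be sketched as:

```python
def required_gbps(terabytes: float, hours: float) -> float:
    """Sustained throughput (Gbps) needed to move `terabytes` in `hours`."""
    bits = terabytes * 8e12
    return bits / (hours * 3600) / 1e9

# A few illustrative cells of such a table: moving a dataset in 24 hours.
for tb in (1, 10, 100, 1000):
    print(f"{tb:>5} TB/day -> {required_gbps(tb, 24):6.2f} Gbps sustained")
```

Moving 100 TB in a day already demands nearly 10 Gbps of sustained throughput, which is why instruments producing terabytes per week push organizations toward 100G-class research networks.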
Greetings, I'm Dan Taylor from Internet2 – thanks for joining us. I'm going to talk a bit about Internet2 and the work we're doing with clouds and other compute resources in our community. There are a lot of slides and I'll move quickly, so please stop by our booth or download the slides if you have questions.
Internet2 is the research and education network for the US. We're a membership consortium of academia, government and corporations.
Internet2 is an advanced networking consortium comprised of 221 U.S. universities, in cooperation with 45 leading corporations, 66 government agencies, laboratories and other institutions of higher learning, 35 regional and state research and education networks and more than 100 national research and education networking organizations representing over 50 countries
Internet2 actively engages our stakeholders in the development of important new technologies including middleware, security, network research and performance measurement capabilities which are critical to the achievement of the mission goals of our members.
Throughout our first 15 years, Internet2 has served a unique role among networking organizations, pioneering the use of advanced network applications and technologies, and facilitating their development—to facilitate the work of the research community.
Internet2 operates an advanced national optical network based on 17,500 miles of dedicated fiber and utilizes the latest 100G routers and optical transport systems with 8.8Tbps of system capacity
Goal: deepen, extend, advance and sustain the digital resources ecosystem. Value: a growing portfolio of resources and services – advanced computing, high-end visualization, data analysis, and other resources and services – plus interoperability with other infrastructures.
Membership numbers as of 2014-03-27.
Campus Champions (200 at 175 institutions); 14,000 participants in training workshops (online and in person).
Absolutely key to our success is the global partnerships we have formed.
[>>] Internet2 partners with over 50 national research and education networks including our friends in Canada to enable connectivity to more than 100 international networks.
These partnerships provide the basis for understanding how to facilitate collaborations between the US Internet2 community and counterparts in other countries
Our global partnerships have yielded important developments in new technologies. For example - the DICE collaborative is a partnership between GEANT, Internet2, CANARIE and ESnet which provides a joint forum for North American and European investment in advanced networking leadership
Our collaboration has led to the development of world-leading tools like PerfSONAR and dynamic circuit networking – which I will touch on later. Our focus in 2010 is to deliver direct services to our members as a result of our development investments
Our community has a track record of IT successes; we haven't looked at life sciences yet but I'm pretty sure the Internet2 community's impact is even greater there.
R&E must keep constructing the conditions that spur innovation
Give innovators an environment where they’re free to try new, untested, unpopular, ridiculously challenging things
Innovation requires a big playground
An innovation platform must encourage utilization, not limit it
Life sciences research shares many of the trends we see elsewhere in big science – data set sizes growing rapidly, increased need for collaboration – but we also see a new ecosystem fueling research. At the same time, however, diminishing R&D dollars are pressuring industry and government.
Chris Dagdigian does a great job detailing how IT deals with the changes in life sciences research. I have a couple of takeaways from his talks; it's useful to see how
Internet2 addresses what's going on.
Scientific instrument technology – which generates scientific data – is changing faster than the IT refresh cycle.
Organizations see the big data wave coming and are now implementing 100 G networks to get ahead of the rising tide
Organizations are going to the cloud to be able to do things they can't do on their own, not just to save money.
Centralization will not eliminate the need to move data.
Security concerns with high speed transfers and collaboration can be addressed
Virtualized infrastructure is moving to the wide area
Big science flows are more disruptive than ever to enterprise networks – there's a trend toward separating business and research networks.
One of the things we're used to in the R&E community is change in scientific data growth.
The Internet2 community has dealt with the data tsunami for many years now. The LHC shut down for 2 years to upgrade its power – annual output has jumped from 13 to 30 petabytes a year. This data is distributed throughout the world by the R&E networks. In life sciences the driver is NGS, falling in price rapidly, plus a proliferation of devices generating data all over the world.
http://www.nature.com/news/large-hadron-collider-the-big-reboot-1.16095
Our network has responded
Back in 2012 we showed how a 10G link from Beijing to UC Davis could change the game. A 24 GB file that would take 26 hours to traverse the internet was transferred in 30 seconds.
Researchers likened the difference in collaboration to going from letters to email.
So we’re seeing organizations get ahead of the tsunami by getting bigger networks. I recently helped the Department of Agriculture’s Agriculture Research Service do just this.
I like to show this slide to illustrate how much life there is beyond humans – and USDA ARS has to deal with many of these species and how they impact our world. It shows the size of genomes of various species, with the x axis being a log scale. Humans are there at the top, one of a number of mammals the USDA is interested in.
But they are also interested in birds, crustaceans, fish, fungi, algae, bacteria and protozoans – and of course plants. And some are extremely complex – you see the size of the wheat genome is orders of magnitude larger than the human genome.
Beyond genomics , these kinds of projects create huge volumes of data as well as computational bottlenecks
To attack this problem they gathered requirements in 2013, hired BioTeam to do an assessment, and we completed a 6-node science network of 10 and 100G links by the end of 2015. That was fast!
R&E collaborations are handled at the 100G links on the coasts and another 100G feeds the new HPC in Ames, Iowa.
You can view Internet2 as the medium for all the data and computing resources, forming a problem solving community around these high speed connections
Syngenta, a life sciences company, is a great example of an organization making the most of these connections.
They are an agribusiness with a mission to improve plant productivity; they stay on the leading edge of science through their internal research and their collaboration with the academic community.
Syngenta was challenged by many of the issues USDA saw, but on a global scale and even more pressure to innovate.
We installed a 10G Layer 2 service that provided high speed Direct Connect access to AWS, where they could do surge HPC and retrieve sequencing data outsourced to the academic community. They also could connect to NCSA to build and run custom pipelines, and they can use the connection to work with the A*Star supercomputing centre in Singapore, where they intend to build an Asian genomics center. Finally, we expect to bring up locations in Switzerland and Great Britain, completing a global research network.
I just mentioned NCSA, and this resource deserves a few seconds. NCSA does a lot of work with industry, and a comment from a VP at BP says it all.
Leveraging its talent and one of the fastest computers in the world, NCSA provides companies with a full range of services to help them innovate.
They do a lot of work in the life sciences; the one I'll note here is an Alzheimer's GWAS study with the Mayo Clinic.
In this one they handled an enormous amount of data and strong-armed the computational challenge – what would've taken 2 years at Mayo was done in 6 hours on Blue Waters.
Another incredible resource in the community is SDSC
You may know them as the home of CGHub, which holds the Cancer Genome Atlas. Note the bits-per-second growth from 1G to 15G from 2012 to 2015.
CGHub is a large-scale data repository and portal for the National Cancer Institute’s Cancer Genome Research Programs
Current capacity is 5 Petabytes, scalable to 20 Petabytes.
The Cancer Genome Atlas, one data collection of many in the CGH, by itself could produce 10 PB in the next four years
As an illustration of how Internet2 is making network resources accessible, consider the UCSC Cancer Genomics Hub, operated by the University of California at Santa Cruz and located at the San Diego Supercomputer Center co-location facility. Without the "big pipes" provided between SDSC and Internet2, the CGH would not be able to keep pace with demand for its data.
As both users and data in the repository grew over a three year period, the bandwidth needed to support the activity grew by 15x.
SDSC also has other important data sources like the Protein Data Bank archive.
They also have consulting services very much focused to support life sciences research
I’d also note the cloud environment they built for HHS CMS – FISMA compliant and HIPAA ready.
The National Labs are also a huge part of the community
Whenever I run into a Metagenomics problem I reference Jonathan Allen’s huge microbiome work with metagenomics
We also have a number of interesting efforts to facilitate collaboration and reproducibility.
RADII is an exciting project that virtualizes clouds leveraging iRODS and virtual networks. The idea is to allow researchers, not IT, to spin up and monitor local and cloud resources, compute and network infrastructure on demand – for example, when I need to complete a collaborative workflow and move data and compute across a number of compute resources.
Radii allows you to
represent data-centric collaborations using standard modeling mechanisms;
map data processes, computations, storage, and organizational entities onto the physical infrastructure with the click of a button
provision and de-provision infrastructure dynamically throughout the lifecycle of the collaboration.
RADII builds on the data management of iRODS and the infrastructure virtualization of ORCA and ExoGENI to give researchers control over the infrastructure that's necessary for collaboration.
Here's an example of this virtualization, with researchers at Duke, UNC and Scripps sharing data and workflows on SDSC compute resources.
Ease of use; improved end-to-end performance as perceived by the scientists.
To enable this vision we need two technologies with a high level of programmability and automation.
A collaboration between the FDA and GW is looking to improve reproducibility by using bio-compute objects. This should accelerate regulatory approvals and reduce costs.
This represents the process for FDA submissions supported by NGS. There are a lot of opportunities for making mistakes along the way. These mistakes result in delays and costly resubmissions.
Of the challenges in gaining agreement at the end of this process, many are addressed by HIVE; its potential to impact reproducibility is the most exciting.
The HIVE platform is a big data analysis solution used by the FDA and available to industry. The bio-compute objects repository is key to reproducibility.
To get to better reproducibility, HIVE relies on a data typing engine to define metadata for the data, the computations, and both algorithms and pipelines, creating a bio-compute object related to the submission that's reusable by the FDA.
Data typing engine – a facility for registering the structure, syntax and ontologies of the information fields of objects.
Metadata type- descriptive information on the structure of data files or electronic records.
Computation metadata- Description of arguments and parameters (not values) for computational analysis.
Definitions of algorithms and pipeline descriptions- descriptions of the characteristics for executable applications.
Data- collection of actual values observed and accumulated during experimentation by a device or an observer.
Computational protocol- well parameterized computational pipeline designed to produce scientifically meritable outcomes with appropriate data.
Bio-compute- instance of an actual execution of the computational protocols on a given set of data with actual values of parameters generating identifiable outcomes/results.
HIVE would help by recording the parameters of the analysis as biocompute objects (or use existing ones in the public repository) and share them with FDA so they can verify that analysis.
Data forming is done using a public HIVE integrated with your usual analytic tools. The resulting bio-compute objects are submitted to the FDA; these bio-compute objects are used in the FDA HIVE to validate the results of the submission.
Finally I ‘ll say a few words about federated identity.
Over 10 years ago the R&E community recognized the importance of trust in collaborations and created the InCommon federated identity management solution.
We now have a leading solution with around 8 million users. Please stop by the booth for more information.