Advances in genomics and data analytics create new opportunities for cancer research and personalized medical treatment through large-scale federation of genomic, clinical, imaging, and other data from many thousands of patients across institutions around the world. Despite these opportunities and promising early results, cancer research is often stymied by information technology barriers. One major barrier is a lack of tools for the reliable, secure, rapid, and easy transfer, sharing, and management of large collections of human data. In the absence of such tools, security and performance concerns often prevent sharing altogether or force researchers to resort to slow and error-prone shipping of physical media. Even when data are received, timely analysis is further impeded by the difficulty of verifying data integrity and of managing who can access the data and for what purpose. I will discuss how the mature Globus data management platform addresses these obstacles to discovery and explain how its intuitive, web-based interfaces enable use by researchers without specialized IT knowledge. I will also describe how Globus technologies can be extended to meet the security requirements of human data so as to enable their use in data-intensive cancer research.
6. Cloud: Outsourcing and automation
Software as a service: SaaS
Infrastructure as a service: IaaS
Platform as a service: PaaS
(web & mobile apps)
7. Cloud: Outsourcing and automation
Software as a service: SaaS
Infrastructure as a service: IaaS
Platform as a service: PaaS
(web & mobile apps)
SaaS for science
9. Inherited hematological malignancies
Impact:
• Familial blood cancer syndromes are being included in the 2016 revision of the World Health Organization Classification of Hematological Malignancies; NCCN guidelines; European LeukemiaNet
• Identification of germline mutations is important for prevention/intervention and early diagnosis, and may change treatment (e.g., stem cell transplant from a related donor without the mutation or from a matched unrelated donor)
Background:
• Familial predisposition to blood cancers has not been as widely appreciated as it has for some solid cancers
• Identifying the genes involved informs understanding of the biology and may impact patient care (prevention, diagnosis, and treatment)
Jane Churpek, MD; Lucy Godley, MD, PhD
Research highlights:
• With samples from >500 families, the team has identified novel germline mutations that predispose to familial myelodysplastic syndromes and leukemia
• These mutations are much more common than previously known
• Specific genes with identified mutations include RUNX1, ETV6, DDX41, ANKRD26
11. Notable areas of friction
• Moving data rapidly, securely, and reliably from lab to lab
• Accessing data at other labs
• Controlling who can access data
• Tracking what data is where
• Discovering available data within a rapidly growing haystack
• Computing at scale
• Complying with rules on personal health information
• Archive and backup
22. Data flow between sequencing center, compute facility, personal computer, and publication repository
1. Researcher initiates transfer request; or it is requested automatically by a script or science gateway
2. Transfer: Globus transfers files reliably and securely
3. Share: Researcher selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Collaborator logs in to access shared files; no local account needed; download via Globus
6. Publish: Researcher assembles data set; attaches metadata (Dublin Core, domain-specific)
7. Curator reviews and approves; data set published on campus or other system
8. Discover: Peers, collaborators search and discover datasets; transfer and share using Globus
• Only Web browser required
• Use any storage system
• Access using any credential
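The sharing behavior described on this slide (the researcher picks files, a user or group, and permissions, and access is then enforced on the existing storage) can be pictured as path-scoped permission rules. The following is a minimal sketch under stated assumptions, not the Globus implementation: the class names `AccessRule` and `SharedStorage`, the "r"/"rw" permission strings, and the user and group names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AccessRule:
    path_prefix: str   # folder the rule covers, always stored with a trailing "/"
    principal: str     # a user name or a group name (hypothetical identifiers)
    permissions: str   # "r" for read-only, "rw" for read-write


@dataclass
class SharedStorage:
    groups: dict = field(default_factory=dict)  # group name -> set of user names
    rules: list = field(default_factory=list)

    def share(self, path_prefix, principal, permissions):
        """Grant a user or group access to everything under path_prefix."""
        if permissions not in ("r", "rw"):
            raise ValueError("permissions must be 'r' or 'rw'")
        # Normalize so that "/shared/run42" does not also match "/shared/run421".
        normalized = path_prefix.rstrip("/") + "/"
        self.rules.append(AccessRule(normalized, principal, permissions))

    def _principals_for(self, user):
        """The user plus every group the user belongs to."""
        names = {user}
        for group, members in self.groups.items():
            if user in members:
                names.add(group)
        return names

    def is_allowed(self, user, path, want):
        """Check whether user may read ("r") or write ("w") the given path."""
        principals = self._principals_for(user)
        return any(
            rule.principal in principals
            and path.startswith(rule.path_prefix)
            and want in rule.permissions
            for rule in self.rules
        )
```

In this sketch the files never move: a rule only records who may reach which folder, which mirrors the slide's point that sharing happens on existing storage rather than by copying data to a cloud store.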
23. How Globus adds value…
• Ease of use, consistent user interface across systems
• “Fire-and-forget” reliable file transfer
• Low-overhead external collaboration
• Secure access, multi-tier security model
• Maximized wide area network throughput
• Rapid deployment via standard packages
• Highly automatable: CLI, RESTful API
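The idea behind "fire-and-forget" reliable transfer can be illustrated with a checksum-verified copy that retries on failure. This is only a local-file sketch under stated assumptions: it is not how Globus or GridFTP is implemented, and the function names `sha256_of` and `reliable_copy`, the retry count, and the backoff parameter are hypothetical.

```python
import hashlib
import shutil
import time


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reliable_copy(source, destination, max_attempts=3,
                  backoff_seconds=0.0, copy=shutil.copyfile):
    """Copy source to destination, retrying until the checksums match.

    Returns the number of attempts used; raises RuntimeError if every
    attempt fails. The `copy` parameter exists so that a different (or
    deliberately flaky) copy function can be substituted.
    """
    expected = sha256_of(source)
    for attempt in range(1, max_attempts + 1):
        try:
            copy(source, destination)
            if sha256_of(destination) == expected:
                return attempt
        except OSError:
            pass  # treat I/O errors like a failed integrity check and retry
        if attempt < max_attempts:
            time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"could not copy {source} after {max_attempts} attempts")
```

The caller submits the request once and does not need to watch it: integrity checking and retries happen inside the function, which is the property the slide refers to.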
37. Globus is widely used
• 4 major services
• 13 national labs
• 190 PB transferred
• 10,000 active endpoints
• 20 billion files processed
• 10,000 active users
• 50,000 registered users
• 99.9% uptime
• 35+ institutional subscribers
• 1 PB largest single transfer to date
• 3 months longest continuously managed transfer
• 130 federated campus identities
42. Cloud: Outsourcing and automation
Software as a service: SaaS
Infrastructure as a service: IaaS
Platform as a service: PaaS
(web & mobile apps)
PaaS for science
48. Prototypical research data portal
• Move portal storage into Science DMZ, with Globus endpoint
• Leave portal web server behind firewall
• Globus handles security and data heavy lifting
Diagram labels: Desktop, Browser, User’s Endpoint (optional), Globus Cloud, Globus Transfer Service, Globus Auth, Globus Web Widgets, Other Services, Firewall, Portal Web Server (Client), Science DMZ, Portal Endpoint, Other Endpoints; connections: HTTPS, GridFTP, REST
52. Globus Genomics: Genomics analysis as a service
Ravi Madduri et al., University of Chicago
• Workflows can be easily defined and automated with integrated Galaxy Platform capabilities
• Data movement is streamlined with integrated Globus transfer
• Resources can be provisioned on demand with Amazon Web Services cloud-based infrastructure
53. Globus Genomics use cases
• A profile of inherited predisposition to breast cancer among Nigerian women
Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade
• A case study for high throughput analysis of NGS data for translational research using Globus Genomics
D. Sulakhe, A. Rodriguez, K. Bhuvaneshwar, Y. Gusev, R. Madduri, L. Lacinski, U. Dave, I. Foster, S. Madhavan
54. Globus Genomics at a glance
• 30 institutions, groups
• 10s of millions of core hours
• 2 PB raw sequence analyzed
• 1,500 analysis tools
• 10,000 genomes processed
• 50 workflows
• 99% uptime over the past two years
• 1 PB data generated
• 43 steps in longest pipeline
• 100s of species
• 75 in largest user group
• 5 days longest running workflow
55. Cost-aware provisioning on cloud resources
1. Filter instance types with profiles
2. Determine price for each instance type across all availability zones
3. Rank potential requests
4. Make requests and monitor
5. Cancel or repurpose excess active requests once one is fulfilled
Can reduce costs by 95% or more!
R. Chard et al. Cost-aware cloud provisioning, 11th IEEE International Conference on e-Science (e-Science), 2015.
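The five provisioning steps listed on this slide can be sketched as follows. This is an illustrative sketch under stated assumptions, not the algorithm from the cited paper: the types `InstanceType` and `JobProfile`, the price table, ranking by hourly price alone, and the `try_request` callback (which stands in for submitting and monitoring a real cloud request) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: float


@dataclass(frozen=True)
class JobProfile:
    min_vcpus: int
    min_memory_gib: float


def rank_requests(instance_types, profile, prices):
    """Steps 1-3: filter by profile, price per zone, rank cheapest first.

    `prices` maps (instance name, availability zone) to an hourly price.
    Returns a list of (price, instance name, zone) tuples.
    """
    suitable = {
        t.name for t in instance_types
        if t.vcpus >= profile.min_vcpus and t.memory_gib >= profile.min_memory_gib
    }
    candidates = [
        (price, name, zone)
        for (name, zone), price in prices.items()
        if name in suitable
    ]
    return sorted(candidates)


def provision(ranked, try_request, max_parallel=2):
    """Steps 4-5: activate the top candidates, keep the first one fulfilled.

    Monitoring is simulated by polling each active request once, in rank
    order. Every other active request is reported as cancelled.
    """
    fulfilled = None
    cancelled = []
    for candidate in ranked[:max_parallel]:
        if fulfilled is None and try_request(candidate):
            fulfilled = candidate
        else:
            cancelled.append(candidate)
    return fulfilled, cancelled
```

A caller would pass the output of `rank_requests` to `provision`. Making several requests at once and cancelling the losers trades a little bookkeeping for a better chance that some low-priced request is fulfilled quickly.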
58. What’s coming soon: Richer endpoints
HTTPS access to endpoints
• Enhanced use of research storage:
  • Asynchronous, bulk transfer: GridFTP
  • Synchronous remote access: HTTPS
• Enhanced Globus web app
  • Browser-based upload/download
  • Inline file viewer
• Integration with clients, web apps
Collections
• Groupings of files that are to be treated as logical units
• Can be named and described
Data search
• Automated metadata harvesting
  • From Globus endpoints
  • Submitted via REST API
• Rich search capabilities
  • Free text, faceted, boosted
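The three search styles named on this slide (free text, faceted, boosted) can be shown with a toy in-memory search over metadata records. This is a sketch under stated assumptions and says nothing about how the planned Globus search service works: the function names `search` and `facet_counts`, the record fields, and the scoring rule are hypothetical.

```python
from collections import Counter


def search(records, text="", facets=None, boosts=None):
    """Return matching records, best match first.

    records: list of dicts holding metadata fields.
    text:    free-text query; every term must occur in at least one field.
    facets:  {field: required value}, exact-match filters.
    boosts:  {field: weight}; a term found in that field adds the weight
             to the score instead of the default 1.0.
    """
    facets = facets or {}
    boosts = boosts or {}
    terms = text.lower().split()
    scored = []
    for record in records:
        # Faceted filtering: skip records that miss any required value.
        if any(record.get(name) != value for name, value in facets.items()):
            continue
        score = 0.0
        for term in terms:
            fields_hit = [
                name for name, value in record.items()
                if term in str(value).lower()
            ]
            if not fields_hit:
                break  # a query term is missing, so the record does not match
            score += sum(boosts.get(name, 1.0) for name in fields_hit)
        else:
            scored.append((score, record))
    # Stable sort keeps the input order among equally scored records.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored]


def facet_counts(hits, field):
    """Count how many hits carry each value of a field, for facet display."""
    return Counter(hit.get(field) for hit in hits)
```

Boosting the title field, for example, ranks a data set whose title mentions a gene above one that mentions it only in its description, while facet counts let an interface show how many hits fall under each value of a field.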
59. Thank you to our sponsors
U.S. Department of Energy
Thanks to: Rachana Ananthakrishnan, Kyle Chard, Ravi Madduri, Brigitte Raumann, Steve Tuecke, Vas Vasiliadis, and others in the Globus team at the University of Chicago
60. Globus provides a new global-scale data fabric that can accelerate discovery by streamlining scientific data sharing and analysis
• Globus-enabled storage systems enable robust, secure access
• Globus cloud services implement transfer, sharing, publication, discovery, and other capabilities
This fabric is:
• Being applied in cancer research
• Spreading rapidly by word of mouth (scientists like it!)
• Widely deployed across universities and labs (thanks, NSF & DOE)
• On a path to sustainability based on subscriptions
• Being integrated into research infrastructures and applications
61. To accelerate impact in biomedicine:
• Integrate biomedical research facilities into the fabric
• Encourage subscriptions to address sustainability
• Provide HIPAA compliance for applications involving PHI
• Cultivate an ecosystem of data portals and applications that leverage the platform
• Continue to add capabilities
www.globus.org foster@uchicago.edu
Editor's Notes
We built this pipeline to create high-quality variants using multiple genotyping algorithms.