AstraZeneca is a global, science-led biopharmaceutical company developing innovative medicines used by millions of patients worldwide. With AWS, AstraZeneca processed more exomes in 20 days than during the previous 3 years, enabling scientists to receive results more quickly, develop medicines faster, and treat more patients sooner. AstraZeneca identified ~20% more patients with actionable variants of major cancer types by using VarDict, an internally developed variant caller contributed to open source, at scale in the cloud. Learn how AstraZeneca used AWS Cloud services to rapidly develop and scale an asynchronous architecture to meet an urgent business opportunity, accelerating the speed of scientific discovery in a cost-effective manner.
AWS re:Invent 2016: 20k in 20 Days - Agile Genomic Analysis (ENT320)
1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Chris Johnson, Solutions Architect
Ronen Artzi, R&D Architect and Cloud Evangelist
Michael S. Heimlich, Ph.D., Solution Delivery Manager
December 1, 2016
20k in 20 Days - Agile Genomic Analysis
Decoupled Architectures
ENT320
2. What to Expect from the Session
• Learn about AWS Well-Architected
• Review decoupled architectures in AWS
• Consider how event-based architectures improve this model
• Learn how AstraZeneca used Agile methods to analyze 20,000 exomes in 20 days
3. What is the Well-Architected Program?
Reliability
Ensuring that a given system is
architected to meet operational
thresholds during a specific
period of time.
Performance
Ensuring a system delivers maximal
performance for a set of resources
(instances, storage, database and
space/time).
Cost Optimization
Cost Optimization helps achieve the lowest
price for a workload or set of workloads
while taking into account fluctuating needs.
Security
Complementing the AWS Security
Best Practices whitepaper, our
Security pillar reviews definitions
and compliance best practices.
5 Architectural Pillars
Operational Excellence
Focusing around operational practices and
procedures used to manage production
workloads.
4. Reliability Pillar Dimensions
Pillar Area of Focus
• Ensuring a given system is architected to meet operational thresholds during a specific period of time
• Meet increased workload demands
• Recover from failures with minimal or no disruption
Well-Architecting
For "Reliability"
Service Limits
Multi-AZ/Region
Scalability
Health Checking & Monitoring
Networking
Self-Healing
Automation/DR/HA/Backup
5. Security Pillar Dimensions
Pillar Area of Focus
• Adhering to and complementing the AWS Security Best Practices whitepaper
• Review of definitions and compliance best practices and methodologies
• Review of enforcement and governance best practices and methodologies
Well-Architecting
For "Security"
Identity/Key Management
Encryption
Security Monitoring & Logging
Dedicated Instances
Compliance
Governance
6. Operational Excellence Pillar Dimensions
Pillar Area of Focus
• Achieve highly automated and resilient deployment pipelines for code changes
• Orchestrate operational requirements once the workloads are deployed into production
Well-Architecting
For "Operational Excellence"
Amazon CloudWatch
Change Auditing
CI/CD Pipeline
Configuration Management
Service Catalog
AWS SDKs
7. Cost Optimization Pillar Dimensions
Pillar Area of Focus
• Achieve the lowest price for a system/workload or set of systems/workloads
• Optimize cost while taking into account fluctuating needs
Well-Architecting
For "Cost Optimization"
Spot/RI
Environment/Volume Tuning
Service Selection
Account Management
Consolidated Billing
Decommission Resources
8. Performance Pillar Dimensions
Pillar Area of Focus
• Ensuring a system/workload delivers maximum performance for the set of AWS resources utilized (instances, storage, database, and locality)
• Provide optimal performance-efficiency best practices and guidelines
Well-Architecting
For "Performance"
Right AWS Services
Resource Utilization
Storage Architecture
Caching
Latency Requirements
Planning & Benchmarking
9. Let's consider Performance Optimization ...
In the beginning there was Amazon SQS:
• First service offered by AWS
• Cornerstone for service-oriented architectures
• Enables decoupling of application components
• Allows for asynchronous processing
11. Classic Decoupled Architecture
"Pull"
• Uses a queue (like SQS) for passing information between systems
• Messages are pulled off a queue
• Requires a process that periodically polls the queue for messages
• Widely used pattern in service-oriented architecture
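The "pull" pattern above can be sketched with boto3. This is a minimal, illustrative worker loop, not code from the talk; the queue URL and the job-handling logic are assumptions.

```python
import json

def handle_message(body: str) -> dict:
    """Placeholder work step: parse the job described in the message."""
    job = json.loads(body)
    job["status"] = "processed"
    return job

def poll(queue_url: str) -> None:
    """Worker loop: long-poll the queue, process, then delete."""
    import boto3  # assumes AWS credentials are configured
    sqs = boto3.client("sqs")
    while True:
        # WaitTimeSeconds enables long polling, avoiding tight
        # empty-receive loops (and the cost of the extra requests)
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            handle_message(msg["Body"])
            # Delete only after successful processing; an undeleted
            # message becomes visible again after the visibility timeout
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```

Deleting the message only after the work succeeds is what makes the pattern fault-tolerant: a worker that dies mid-job simply lets the message reappear for another worker.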
12. The Classic Decoupled AWS Architecture
Advantages
⢠Decoupled
⢠Clients donât know about workers and vice versa
⢠Scalable
⢠Easy to add workers and queues
⢠Asynchronous
⢠Long running jobs are processed in the background
⢠Highly Available
⢠Workers and queues run across Availability Zones in an AWS Region
13. Classic Decoupled AWS Architecture
Aspects to think about ...
• AMI Baking
  • Use AMIs so that you're using a preconfigured image for your worker tier
• Auto Scaling Groups
  • Use Auto Scaling Groups to minimize idle Amazon EC2 instances, and consider the right metric for triggers
• Undifferentiated Heavy Lifting
  • Use message validity and timeout configurations to manage your queues and workers effectively
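One concrete way to pick "the right metric for triggers" is to scale on queue backlog per worker rather than CPU. The helper below is an illustrative sketch; the backlog figure would come from the queue's ApproximateNumberOfMessagesVisible CloudWatch metric, and the names and bounds are assumptions, not AstraZeneca's actual policy.

```python
import math

def desired_workers(backlog: int, msgs_per_worker: int,
                    min_size: int = 1, max_size: int = 100) -> int:
    """Target fleet size so each worker owns ~msgs_per_worker messages.

    Bounds mirror an Auto Scaling group's min/max size so the fleet
    never scales to zero or beyond its configured ceiling.
    """
    want = math.ceil(backlog / msgs_per_worker) if backlog > 0 else min_size
    return max(min_size, min(max_size, want))
```

Scaling on backlog tracks actual demand: an empty queue shrinks the fleet to the minimum, while a burst of messages grows it proportionally instead of waiting for CPU to saturate.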
14. Event-Driven Architecture
What is it?
• A way of responding to certain actions that occur in an AWS service
• Provides hooks for an application to execute code in response to an event
• Eliminates the "undifferentiated heavy lifting" of managing and scaling compute resources and queues to execute event-handling code
16. Event-Driven Architecture - S3 Notifications
S3 will send notifications when certain events happen in a bucket.
Notifications can be published to different targets:
• Amazon SNS
  • Useful when the occurrence of an event must be broadcast to a large number of clients
• Amazon SQS
  • Useful when a worker process needs to respond to the S3 event asynchronously
• AWS Lambda
  • Automatically executes code when an S3 event occurs
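Wiring a bucket to one of these targets is a single configuration call. The sketch below routes object-creation events to a Lambda function; the function ARN and the `.bam` suffix filter are illustrative assumptions for this genomics context, and the function must already grant S3 permission to invoke it.

```python
def lambda_notification(function_arn: str, suffix: str = ".bam") -> dict:
    """Notification config routing ObjectCreated events to a Lambda function."""
    return {
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": function_arn,
            "Events": ["s3:ObjectCreated:*"],
            # Only fire for keys ending in the given suffix
            "Filter": {"Key": {"FilterRules": [
                {"Name": "suffix", "Value": suffix},
            ]}},
        }]
    }

def attach(bucket: str, function_arn: str) -> None:
    """Apply the notification configuration to the bucket."""
    import boto3  # assumes AWS credentials are configured
    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=lambda_notification(function_arn))
```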
18. Event-Driven Architecture & Lambda
"Push"
• An S3 notification pushes an event to Lambda
• The Lambda service executes code in response to the event
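A minimal handler on the receiving end of that push might look like the following sketch; the real work (e.g. enqueuing an analysis request for the new object) is left as a placeholder comment.

```python
def handler(event, context=None):
    """S3-triggered Lambda: pull bucket/key out of each event record."""
    arrived = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real work would go here, e.g. send an analysis-request
        # message to a queue for the newly arrived object
        arrived.append((bucket, key))
    return arrived
```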
19. Event-Driven Architecture
Advantages
• Reduces operational complexity
  • No need to manage fleets of EC2 instances that process messages off a queue
• Cost-Effective
  • Pay only for the number of executions of event-handling code
  • No need to fine-tune auto scaling rules to limit idle CPU cycles
20. Event-Driven Architecture with Lambda
Aspects to think about ...
• Virtual wiring!
  • Wire up Lambda using SNS; keep inputs/outputs simple and within the 5-minute duration limit
• Scratch space
  • Each Lambda function receives 500 MB of nonpersistent disk space in its own /tmp directory. If you need more, consider other persistent stores
• Remove kernel/OS dependencies
  • Embrace the simplification of no longer worrying about the underlying OS
21. Pricing Comparison
           Decoupled                         Event Driven
Compute    Pay for each hour of EC2          Pay for each execution of a
           compute time.                     Lambda function.
Scaling    Pay for more EC2 instances        No cost to scale up - handled
           or for larger instances.          by the Lambda service.
Storage    Pay for Amazon EBS or S3 usage.   Pay for S3 usage.
Network    Normal data transfer charges.     Normal data transfer charges.
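A back-of-envelope calculation makes the compute row concrete. The rates below are illustrative placeholders, not quoted from the talk; check the current AWS pricing pages before drawing real conclusions.

```python
def ec2_cost(hours: float, hourly_rate: float) -> float:
    """Decoupled model: pay per instance-hour while workers run."""
    return hours * hourly_rate

def lambda_cost(invocations: int, avg_ms: float, memory_mb: int,
                per_million_requests: float = 0.20,
                per_gb_second: float = 0.0000166667) -> float:
    """Event-driven model: pay per execution and per GB-second consumed.

    The default rates are placeholder figures for illustration only.
    """
    gb_seconds = invocations * (avg_ms / 1000.0) * (memory_mb / 1024.0)
    return (invocations / 1_000_000) * per_million_requests \
        + gb_seconds * per_gb_second
```

The crossover depends entirely on utilization: a queue of workers that sits mostly idle pays for every idle hour, while the Lambda bill is zero when no events arrive.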
23. We are a global, science-led
biopharmaceutical business
pushing the boundaries of science
to deliver life-changing medicines.
$24.7bn
2015 Revenue*
100+
Countries
25. Genomic Sequencing - Digitizing the Human
Fragment the DNA to create a library
Amplify and read the fragments with a sequencer
Process and align to a reference
26. Whole Exome vs. Whole Genome
Whole Exome Advantages                 | Whole Genome Advantages
Protein-coding regions, ~2% of genome  | Contains everything
~85% of disease-causing mutations      | More reliable sequence coverage
Lower sequencing cost at a high depth  | Better coverage uniformity
Reduces storage and analysis costs     | Minimal amplification needed
Increased number of samples            | Universal for all species
28. 20 Days ...
"Give me six hours to chop down a tree and I will spend the first four sharpening the axe."
- Abraham Lincoln
"Champions do not become champions when they win the event, but in the hours, weeks, months and years they spend preparing for it. The victorious performance itself is merely the demonstration of their championship character."
- Alan Armstrong
29. The Journey Toward 20 Days ...
Stakeholders: Program Management, Global Privacy Office, Technologies, IT Quality, IT Security, Business Science Units, Legal
Urgency ... the pragmatic way
Do the right things ... adjusting to new concepts
Do things right ... the Cloud mindset
30. Stacked Security and Compliance Model
Stacked responsibility (diagram):
• Amazon Web Services: Regions, Availability Zones, Edge Locations; Networking, Compute, Storage, Management Services; APIs, Service Endpoint
• AZ R&D Cloud: Client-Side Encryption, Server-Side Encryption, Network Communication Protection; Operating System, Network and Instance Firewall, Platform Logging/Monitoring; Identity and Access Management
• Project: Researcher, Collaborator, Application End User / App, Data Owner
31. Genomics has challenged our existing approaches to data privacy:
Genomes are Sensitive Personal Data
De-identification and patient consent
Standards & processes
• Your genome, your fingerprint
• But ... not only yours ...
32. Applied Stacked Security and Compliance Model
Compliance drivers:
• dbGaP / Genomic Data Provider: protecting against the risk associated with releasing individuals' genomes
• AstraZeneca (QA/Privacy/Security): protecting AZ patient information security and privacy against the risk of exposure
• HIPAA: protecting against risks in releasing Protected Health Information (PHI)
Stacked responsibility (diagram):
• AWS: Regions, Availability Zones, Edge Locations; Networking, Compute, Storage, Management Services; APIs, Service Endpoint
• AZ Science Cloud Platform: Client/Client-Side Encryption, Network Communication Protection; Operating System, Network and Instance Firewall, Platform Logging/Monitoring; Identity and Access Management
• Project Team: Genomic Project
33. Redefining the Landscape for Genetic Drivers in Cancer
Motivation for Re-Analysis of TCGA Exome Data
Improvements in tools, references, and resources allow us to better define the causes of cancer
• hg38 Reference Genome
  • Updated from hg19 - more accurate mutation detection
• VarDict Variant Caller
  • 20% better sensitivity finding mutations compared to the best current algorithms
  • Developed by AstraZeneca, now open source
• Better Computational Resources
  • Quickly re-analyze the data at scale
  • Bring computational resources to the data, enabling us to succeed
34. A Project is Born
Project: re-analyze 20K TCGA exomes in 20 days
• Use the new hg38 reference genome
• Utilize VarDict to improve the variant calls
• Complete in ~20 days (between Thanksgiving and Christmas)
Challenges: 20K exomes = 270 TB of raw genomic files
• Current storage
  • 500 TB, mostly filled - not enough space!
• HPC utilization
  • Used for ongoing projects - we would have to stop everything else for ~1 year!
• Internet connection
  • 1 Gbps Fastline - ~25 days just to download!
• Experience to date
  • On-premises: <3,000 exomes in 3 years; here we would process ~7x as many in <1 month!
35. The Informatics Workflow - Paradigm Shift
The Old Way
• All at once
• Coupled / Synchronous
• Project: Download -> Analyze -> Post-Process, each stage finishing before the next begins
The New Way
• Process when available
• Decoupled / Asynchronous
• Project: Download, Analyze, and Post-Process run concurrently as data arrives
36. Simplify: Turn It into a Loosely Coupled At-Scale Solution
Flow (diagram): PARALLEL INGESTION AT SCALE -> ASYNCHRONOUS PARALLEL ANALYSIS AT SCALE (Alignment; Slice, Index and Stats; Variant Call) -> UPLOAD TO LOCAL HPC FOR DOWNSTREAM ANALYSIS
37. Simplify: Turn It into a Loosely Coupled At-Scale Solution
Components (diagram):
• Sources and sinks: science-requested target samples, CGHub, DevOps Bucket, Local Science HPC
• Storage: INGESTION S3 Bucket; PIPELINE S3 Bucket (BAM, BAI, VCF); RESULTS S3 Bucket
• Queues: INGESTION Request Queue, ANALYSIS Request Queue, PIPELINE Results Queue, PIPELINE AUTOMATION Results Queue
• Processes: Ingestion Request Worker, Analysis Request Worker, Automation Worker (VLAD), Automation Step Workers (VLAD), Loader Worker
Flow: PARALLEL INGESTION AT SCALE -> ASYNCHRONOUS PARALLEL ANALYSIS AT SCALE (Alignment; Slice, Index and Stats; Variant Call) -> UPLOAD TO LOCAL HPC FOR DOWNSTREAM ANALYSIS
ASYNCHRONOUS
PARALLEL ANALYSIS
AT SCALE
Alignment
Slice Index and Stats
Variant Call
UPLOAD TO
LOCAL HPC FOR
DOWNSTREAM
ANALYSIS
38. Intelligent Scale
Ingestion Process
Analysis:
• 20K exomes -> ~270 TB (BAMs, BAIs)
• Distribution: most files <10 GB, tens of files around 50 GB, a couple >200 GB
• Theoretical best rate to download 270 TB: ~25 days
Parallelize/scale opportunity:
• CGHub's GeneTorrent client uses multiple threads as it brings down a single file; we can run several clients in parallel
• S3 allows significant parallel throughput (use the TCGA UUID as a randomizer for the object name)
• AWS network bandwidth is superb
• Spot Instances provide ingestion at scale with reduced cost ($0.2 / $0.08)
• Serverless via Lambda is perfect for triggering post-load analysis (generate the Analysis Request message)
Results: we got 1.5-2 TB/hour with ~300 workers/instances!
Diagram: ingestion workers (N workers per instance; launch config groups with 80 GB / 160 GB / 320 GB volumes) feed the Analysis Request Queue and the ReadyToAnalyze Queue.
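At the time (2016), S3 throughput was partitioned by key prefix, which is why the slide suggests using the TCGA UUID as a randomizer for the object name. A sketch of that idea follows; the explicit hash prefix is an illustrative embellishment, since the UUID alone is already effectively random.

```python
import hashlib

def ingest_key(sample_uuid: str, filename: str) -> str:
    """Build an S3 object key whose leading characters vary per sample.

    Leading with high-entropy characters spread objects across S3's
    internal index partitions, which mattered for parallel throughput
    in 2016. The short hash prefix here is an assumption for
    illustration, not AstraZeneca's actual key scheme.
    """
    prefix = hashlib.md5(sample_uuid.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}/{sample_uuid}/{filename}"
```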
39. The Data Ingestion Story
Chart values: ~52 TB / 6 days; ~85 TB / 2 days; ~115 TB; ~63 TB / 1.5 days
Phases: Cautiously test the water -> Bullish!!! -> Play to win
40. Simplify: Turn It into a Loosely Coupled At-Scale Solution
(The same architecture diagram as slide 37, revisited with the PARALLEL INGESTION AT SCALE and UPLOAD TO LOCAL HPC FOR DOWNSTREAM ANALYSIS stages highlighted.)
41. Pipeline Analysis
Design principles:
• Use all available resources (local/cloud)
• Optimize the environment to the task
• Align the environment to target goals (time, cost)
Parallelize/scale opportunity:
• S3 allows significant parallel throughput
• AWS network bandwidth is superb
• Single orchestration control plane (queue-based)
Use all available resources:
• Two local pipeline analysis engines
• Three cloud-based pipeline engines
Optimized task-tuned clusters:
• AWS r3.8xlarge-based RAVE platform
• AWS c3.8xlarge-based RAVE platform
Results: an orchestrated, hybrid, on-demand, auto-scaled pipeline.
Diagram: Univa-based Grid Engine on premises with GPFS and NextGen bcbio; S3; Bina RAVE template/automation.
42. Post-Alignment Slicing
Post-realignment steps:
• Slice the BAM and extract the gene-of-interest list
• Generate stats
• Index the BAM file
Parallelize/scale opportunity:
• The operation can be done on every realigned BAM file while it's being used for further pipeline activities
• Slice, stats, and index outputs can be uploaded to the local HPC regardless of variant-call results, as long as they are distributed to the right canonically named location (based on the TCGA UUID)
• An S3 bucket can hold an "infinite" number of objects without a "folder" structure
• Spot Instances are perfect for cost reduction
Results:
• Running at 700 workers (100 nodes), we did ~6,000 slice, stats, and index operations in less than 3 hours
Diagram: post-align workers (N workers per instance; launch config group with 80 GB / 160 GB / 320 GB volumes) pull from the Pipeline Analysis Queue and feed the ReadyToUpload Queue and S3.
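The "N workers per instance" pattern from the diagram can be sketched with the standard library: one process per worker, each independently consuming tasks. The task content and worker count are illustrative assumptions; the real slice/stats/index work is a placeholder.

```python
import multiprocessing as mp

def run_task(task: dict) -> dict:
    """Placeholder for the slice/stats/index steps on one BAM file."""
    return {"uuid": task["uuid"], "steps": ["slice", "stats", "index"]}

def worker(tasks: "mp.Queue", results: "mp.Queue") -> None:
    """One worker process: consume tasks until the poison pill arrives."""
    while True:
        task = tasks.get()
        if task is None:  # poison pill shuts the worker down
            break
        results.put(run_task(task))

def run_instance(task_list, n_workers: int = 7):
    """Run all tasks on one instance with n_workers parallel processes."""
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for t in task_list:
        tasks.put(t)
    for _ in procs:  # one poison pill per worker
        tasks.put(None)
    for p in procs:
        p.join()
    return [results.get() for _ in task_list]
```

In the deck's setup the task queue would be SQS rather than a local `multiprocessing.Queue`, so workers on 100 instances could share one backlog; the per-instance fan-out is the same idea.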
43. Simplify: Turn It into a Loosely Coupled At-Scale Solution
(The same architecture diagram as slide 37.)
Cleanup and bring in what matters:
• Loader developed
• +270 TB of raw data -> ~9 TB of meaningful analysis
44. Time Shrinking Machine - Getting Results Faster
Old Way (finishes 2/8/2017):
• Genomic file ingestion: 1/1/2016 - 1/25/2016
• Data analysis, exclusive use: 1/26/2016 - 1/25/2017
• Post-processing: 1/26/2017 - 2/1/2017
• File cleanup: 2/2/2017 - 2/8/2017
New Way (finishes 1/21/2016):
• Genomic file ingestion (270 TB @ 5 Gbps): 5 days, 1/1/2016 - 1/5/2016
• Data analysis (7,680 CPU cores): 1/1/2016 - 1/21/2016
• Post-processing: 1/1/2016 - 1/21/2016
• File cleanup: 1/1/2016 - 1/21/2016
46. So ... How Did We Do?
• 19,690 TCGA exomes re-aligned to hg38
• >270 TB of raw data uploaded at 1.5-2 TB/hr with 300 loaders
• Data reduction: 270 TB -> 9 TB
• 7,680 CPUs for alignment and variant calling
• >2,000/hr for slice, stats, and index with 700 workers
• $4 per exome incl. compute, storage, network
47. Lessons Learned from the 20k in 20 Days Project
• Plan ahead as much as you can
• Work first on the foundation and enabling factors
  • Security and privacy foundation
  • DevOps and automation
• Understand the details of:
  • Process
  • Computation and data scale
• Leverage the AWS XaaS offerings
• Measure everything and associate it with business outcomes
48. Acknowledgements
AstraZeneca
• Vlad Saveliev
• Ronen Artzi
• Michael Heimlich
• Tristan Lubinski
• Jonathan Dry @DrySci
• Danielle Greenawalt @BostonBioinfX
• Justin Johnson @BioInfo
• Zhongwu Lai @ZhongwuL
• Stefanie Rintoul
• Vitaly Rozenman
• Carl Barrett
• Brian Dougherty
• Bryan Takasaki
AstraZeneca Security/Privacy
• Richard Paul
• Gayle Pearce
• Lee Ann Heckathorn
• Stephen Weil
• Sheri Arnell
• Tommy Farrell
• Victoria Southern
Bina/Roche
• Andi Broka
• Engineering Team
• Product Management Team
• Science Team