2. Talk Outline
Brief Overview: Cloud Computing on AWS
Why use AWS for HPC?
HPC Building Blocks in AWS
✓ Compute
✓ Networking
✓ Storage
✓ Deployment Tools
Customer Success Stories and Example Use Cases
4. Global Infrastructure
69 Availability Zones within 22 geographic Regions around the world
We add the equivalent of an entire Fortune 500 company’s compute capacity every day
5. AWS Availability Zones, Data Centers, Servers
[Diagram: Availability Zones (AZ) connected through Transit centers]
over 50,000 servers & often over 80,000
7. Amazing Amount of Compute Capacity – Economy of Scale
A virtually unlimited number of architecture options
Instance Types, OS, Traditional Cluster, Auto Scaling Clusters, Serverless, GPUs
Extensive deployment options – “Infrastructure as Code”
Console, Configuration Control, Automated, SDK, Bash/CLI, AWS CloudFormation
Lots of useful services
Amazon DynamoDB, Amazon CloudWatch, Amazon Glacier, and much more!
[Diagram: instance with Amazon CloudWatch, Auto Scaling template, Amazon DynamoDB, AWS Lambda]
8. Great Features for HPC Workloads
Experimentation without Fear!
Activate Multiple Compute Clusters Simultaneously!
A Supercomputer at the Fingertips of EACH Scientist!
Start and stop instances or entire clusters!
Take Advantage of Spot Pricing
Receive Continuous Updates
Compute, Network, Storage
All Services
Immediate Access to latest Technology
9. The Life of an Average HPC Code on a Supercomputer:
The average number of cores: 14
The average wall-clock time: 1.69 hrs
The average queue wait time: 4.4 days!
Cloud Improves “Workload Throughput”
Think: “Workload Throughput”
https://xdmod.ccr.buffalo.edu/#main_tab_panel:tg_summary
• The job queue becomes the capacity buffer
• Job completion times are hard to predict
• Users are frustrated and run fewer jobs
• Innovation is throttled by fixed IT resources
Run many Jobs in Parallel, Use it when you need it
Pay only for what you use
Right-size clusters and resources
Optimize each workload for performance
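The averages quoted on this slide imply that almost all of a job's time-to-result is spent waiting in the queue. A minimal sketch of that arithmetic follows; the function and variable names are illustrative and not part of the original talk.

```python
# Averages quoted on the slide.
AVG_RUNTIME_HOURS = 1.69
AVG_QUEUE_WAIT_DAYS = 4.4


def time_to_result_hours(runtime_hours: float, queue_wait_hours: float) -> float:
    """Wall-clock time from job submission to job completion."""
    return queue_wait_hours + runtime_hours


queue_wait_hours = AVG_QUEUE_WAIT_DAYS * 24
total_hours = time_to_result_hours(AVG_RUNTIME_HOURS, queue_wait_hours)
wait_share = queue_wait_hours / total_hours

print(f"Queue wait:     {queue_wait_hours:.1f} h")
print(f"Time to result: {total_hours:.2f} h")
print(f"Share of time spent waiting: {wait_share:.1%}")
```

Under these numbers the run itself is under 2% of the elapsed time, which is why the slide asks the audience to think in terms of workload throughput rather than per-job speed.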
10. Time-to-results Efficiency
[Figure: two charts of numbered jobs against a Cores axis, one marked with a fixed data center capacity limit]
Finite capacity, usually with long queues to wait in
Massive capacity when needed to speed up time to results, and agile environment when additional hardware and software experimentation is needed
“For every $1 spent on HPC, businesses see $463 in incremental revenues and $44 in incremental profit.”
11. What this can look like…
Scale using Elastic Capacity
[Chart: # of Cores on an axis from 0 to 75,000]
Time: +00h, <1,000 cores
12. What this can look like…
Scale using Elastic Capacity
[Chart: # of Cores on an axis from 0 to 75,000]
Time: +24h, >75,000 memory optimized cores
13. What this can look like…
Scale using Elastic Capacity
[Chart: # of Cores on an axis from 0 to 75,000]
Time: +72h, <1,000 cores
14. What this can look like…
Scale using Elastic Capacity
[Chart: # of Cores on an axis from 0 to 75,000]
Time: +120h, >30,000 GPU optimized cores
17. “What did you say???” – The AWS Lingo
Term            Meaning
AWS             Amazon Web Services
EC2             Elastic Compute Cloud; an AWS service providing virtual machines
AMI             Amazon Machine Image, a virtual image for a virtual machine
Instance        One launched virtual server
EBS             Elastic Block Store, data storage attached to an EC2 instance
VPC             Virtual Private Cloud, your private piece of the cloud
S3              Simple Storage Service, amazing object storage service
Security Group  The instance firewall
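To show how the terms in this table fit together, here is a hedged sketch that assembles the parameters for launching one EC2 instance. The helper name, the placeholder IDs, the `/dev/xvda` device name, and the `gp2` volume type are assumptions for illustration; the dictionary keys follow the EC2 `RunInstances` request shape as used by boto3's `run_instances`.

```python
from typing import Any, Dict


def build_run_instances_request(
    ami_id: str,
    instance_type: str,
    security_group_id: str,
    subnet_id: str,
    ebs_size_gib: int,
) -> Dict[str, Any]:
    """Assemble keyword arguments for an EC2 RunInstances call (hypothetical helper)."""
    return {
        "ImageId": ami_id,                        # AMI: the image the instance boots from
        "InstanceType": instance_type,            # size of the launched virtual server
        "MinCount": 1,
        "MaxCount": 1,
        "SecurityGroupIds": [security_group_id],  # Security Group: the instance firewall
        "SubnetId": subnet_id,                    # a subnet inside your VPC
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/xvda",        # assumed root device name
                "Ebs": {"VolumeSize": ebs_size_gib, "VolumeType": "gp2"},  # EBS volume
            }
        ],
    }


# Placeholder identifiers; replace with real ones from your own account.
request = build_run_instances_request(
    "ami-EXAMPLE", "c5n.large", "sg-EXAMPLE", "subnet-EXAMPLE", 100
)
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("ec2").run_instances(**request)
```

Building the request as plain data keeps the example runnable without an AWS account; only the commented-out call actually launches anything.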
22. High network bandwidth compute instances: C5n, P3dn, I3en
C5n
• First “network optimized” instances on AWS
• Deliver up to 100 Gbps network throughput
• Based on C5/P3/I3 instances:
– Intel Skylake/Broadwell CPUs
– Nitro System (hypervisor and ENA)
• Intended for network-intensive applications including HPC
23. High bandwidth compute instances: C5n
[Diagram: HPC stack on AWS: 3D graphics virtual workstation; license managers and cluster head nodes with job schedulers; cloud-based, auto-scaling HPC clusters; shared file storage; storage cache]
Massively scalable performance
• C5n instances offer up to 100 Gbps of network bandwidth
• Significant improvements in maximum bandwidth, packets per second, and packet processing
• Custom designed Nitro network cards
• Purpose-built to run network bound workloads including distributed cluster and database workloads, HPC, real-time communications and video streaming
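To put 100 Gbps in perspective, the sketch below estimates the ideal time to move a dataset between two instances. The 1 TB dataset and the 25 Gbps comparison point are assumptions chosen for illustration, and the calculation ignores protocol overhead and assumes the link stays saturated.

```python
def transfer_seconds(data_gigabytes: float, link_gbps: float) -> float:
    """Ideal transfer time in seconds for a fully saturated link, no overhead."""
    data_gigabits = data_gigabytes * 8  # 1 byte = 8 bits
    return data_gigabits / link_gbps


DATASET_GB = 1000  # hypothetical 1 TB dataset (decimal units)

for gbps in (25, 100):
    print(f"{gbps:>3} Gbps: {transfer_seconds(DATASET_GB, gbps):.0f} s")
```

Real transfers will be slower than this bound, but the ratio between the two link speeds carries over to network-bound phases of a workload.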
24. High bandwidth compute instances: P3dn
Optimized for distributed ML training
• One of the most powerful GPU instances available in the cloud
• Distributed machine learning training across multiple GPU instances
• 100 Gbps of networking throughput
• Based on NVIDIA’s latest GPU, the Tesla V100, with 32 GB of memory each
• The largest Amazon Elastic Compute Cloud (Amazon EC2) P3 instance size available
25. High clock speed compute instances: Z1d
Up to 4 GHz sustained, all-turbo performance
• Z1d instances are optimized for memory-intensive, compute-intensive applications
• Custom Intel Xeon Scalable processor
• Up to 4 GHz sustained, all-turbo performance
• Up to 384 GiB DDR4 memory
• Enhanced networking, up to 25 Gbps throughput
27. AWS is Committed to Networking at Scale
• The AWS Network is Custom Built
– Full bi-section bandwidth in placement groups
– Designed such that all ports can run flat out
– No blocking, no oversubscription
– Continuously improving
– Commodity parts on a Moore’s Law Pace
• Enhanced Networking
– Reduced instance-to-instance latency
– Reduced jitter
• Amazon Elastic Network Adapter
– New PCI network device developed for EC2
– Available on newer instances, including C5, M5, R4, R5, Z1d
– Ability to scale across a variety of bandwidths
• 10 and 20 Gbps today
“We love where we are right now,” AND “It will only get better!”
28. ELASTIC NETWORK ADAPTER
• Latest generation of Enhanced Networking
• Hardware Checksums
• Multi-Queue Support
• Receive Side Steering
• 25 Gbps in a Placement Group
• Open Source Amazon Network Driver
29. Remember this?
[Diagram: Availability Zones (AZ) connected through Transit centers]
Our Network Needs to Scale!
22 Regions – 69 Availability Zones – 87 Edge Locations
30. Elastic Fabric Adapter (EFA)
Elastic Fabric Adapter (EFA): best for large HPC workloads
Scale tightly-coupled HPC applications on AWS
Instances shown: C5n, P3dn, I3en
31. AWS Elastic Fabric Adapter – EFA
• Proprietary, AWS-designed fabric network
• Built on top of network optimized instances on AWS
• Delivering up to 100 Gbps network throughput
• Delivering below 15 µs latencies for HPC applications
• Optimized for Open MPI and other MPI libraries
• Supported on C5n, R5n, M5n, and P3dn instances
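As a hedged illustration of how an EFA-enabled node group might be requested, the sketch below builds EC2 `RunInstances` parameters with an `efa` network interface and a placement group. The helper name, the placeholder IDs, and the default `c5n.18xlarge` size are assumptions; the keys follow the request shape accepted by boto3's `run_instances`.

```python
from typing import Any, Dict


def build_efa_launch_request(
    ami_id: str,
    subnet_id: str,
    security_group_id: str,
    placement_group: str,
    node_count: int,
    instance_type: str = "c5n.18xlarge",  # assumed EFA-capable size
) -> Dict[str, Any]:
    """Assemble RunInstances arguments for EFA-enabled nodes (hypothetical helper)."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": node_count,   # all-or-nothing launch for a tightly coupled job
        "MaxCount": node_count,
        "Placement": {"GroupName": placement_group},
        "NetworkInterfaces": [
            {
                "DeviceIndex": 0,
                "InterfaceType": "efa",         # request an Elastic Fabric Adapter
                "SubnetId": subnet_id,
                "Groups": [security_group_id],  # security groups go on the interface
            }
        ],
    }


# Placeholder identifiers; replace with real ones from your own account.
efa_request = build_efa_launch_request(
    "ami-EXAMPLE", "subnet-EXAMPLE", "sg-EXAMPLE", "my-cluster-pg", node_count=4
)
print(efa_request["NetworkInterfaces"][0])
```

Setting `MinCount` equal to `MaxCount` is a design choice for MPI-style jobs: a partial allocation is usually useless, so the launch should either deliver every node or fail.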
36. Comprehensive portfolio of storage options for HPC
Block storage: Amazon EBS. Elastic, high performance block storage at any scale
File storage: Amazon EFS. Petabyte-scale, elastic file storage sharable across applications, instances and servers
Object storage: Amazon S3. Low cost, highly scalable cloud storage with 99.999999999% durability
37. Amazon FSx for Lustre:
Fully managed high performance parallel shared file system
High performing
Parallel distributed file system
Tune complex performance parameters
Massively scalable performance
100+ GiB/s throughput
Millions of IOPS
Consistent low latencies
38. High and scalable performance
Each terabyte (TB) of storage provides 200 MB/second of file system throughput and ~5,000 IOPS
Parallel file system
100+ GiB/s throughput
Millions of IOPS
Consistent sub-millisecond latencies
Supports concurrent access from hundreds of thousands of cores
SSD-based
39. File system throughput & IOPS scale linearly with storage capacity
Each TB of storage provides 200 MB/s of baseline throughput,
and up to 12x burst throughput
File systems can scale to hundreds of GB/s and millions of IOPS
Capacity   Baseline throughput   Burst throughput
1 TB       200 MB/s              up to 2.4 GB/s
10 TB      2 GB/s                up to 24 GB/s
50 TB      10 GB/s               up to 120 GB/s
100 TB     20 GB/s               up to 240 GB/s
1 PB       200 GB/s              at least 240 GB/s
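The scaling rule on this slide (200 MB/s of baseline throughput per TB, with up to 12x burst) can be sketched as a small calculation. The function names are illustrative. The rule reproduces the table rows from 1 TB to 100 TB; the 1 PB row only states "at least 240 GB/s" for burst, so the 12x multiplier should not be extrapolated to that size.

```python
BASELINE_MB_S_PER_TB = 200   # from the slide
MAX_BURST_MULTIPLIER = 12    # from the slide


def baseline_throughput_mb_s(capacity_tb: float) -> float:
    """Baseline file system throughput in MB/s for a given capacity in TB."""
    return capacity_tb * BASELINE_MB_S_PER_TB


def max_burst_throughput_mb_s(capacity_tb: float) -> float:
    """Upper bound on burst throughput in MB/s under the 12x rule."""
    return baseline_throughput_mb_s(capacity_tb) * MAX_BURST_MULTIPLIER


for capacity_tb in (1, 10, 50, 100):
    baseline = baseline_throughput_mb_s(capacity_tb)
    burst = max_burst_throughput_mb_s(capacity_tb)
    print(f"{capacity_tb:>4} TB: baseline {baseline / 1000:g} GB/s, "
          f"burst up to {burst / 1000:g} GB/s")
```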
41. Easy cluster management: AWS ParallelCluster
Simplifies deployment of HPC in the cloud, including integrating with popular HPC schedulers
Integrated with AWS Batch, Amazon FSx for Lustre and Elastic Fabric Adapter
42. AWS ParallelCluster
• Simplifies deployment of HPC clusters in the cloud
• Integrates with popular HPC schedulers such as:
– SLURM, Grid Engine, Torque
• Built on AWS CloudFormation
• Easy to modify to meet specific application or project requirements
• Latest Features:
– Multiple EBS volumes
– Custom AMI support
• Bring your own custom AMI, not just build from our default AMI
– Easy use of Amazon EFS and Amazon FSx for Lustre
– Launch Templates support (for EFA)
– Support for C5n and other new instance types
– AWS Batch Integration
• Open source, available on GitHub
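A cluster that combines several of the features listed above (a Slurm scheduler, C5n compute nodes, EFA, and an FSx for Lustre mount) might be described with a configuration file along the following lines. This is a sketch only: the section and key names are recalled from the INI format used by ParallelCluster 2.x and should be checked against the documentation for the installed version, and every value, including the region, is a placeholder.

```python
import configparser
import io


def build_cluster_config(key_name: str, vpc_id: str, subnet_id: str) -> str:
    """Render an INI-style cluster configuration as text (hypothetical helper)."""
    config = configparser.ConfigParser()
    config["aws"] = {"aws_region_name": "us-east-1"}  # placeholder region
    config["global"] = {"cluster_template": "default"}
    config["cluster default"] = {
        "key_name": key_name,
        "scheduler": "slurm",
        "master_instance_type": "c5n.large",
        "compute_instance_type": "c5n.18xlarge",
        "initial_queue_size": "0",    # start with no compute nodes
        "max_queue_size": "16",       # scale out to at most 16 nodes
        "enable_efa": "compute",      # assumed key for EFA on compute nodes
        "placement_group": "DYNAMIC",
        "vpc_settings": "public",
        "fsx_settings": "scratch",
    }
    config["vpc public"] = {"vpc_id": vpc_id, "master_subnet_id": subnet_id}
    config["fsx scratch"] = {"shared_dir": "/fsx", "storage_capacity": "3600"}

    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


config_text = build_cluster_config("my-key", "vpc-EXAMPLE", "subnet-EXAMPLE")
print(config_text)
```

Generating the file from code rather than editing it by hand fits the "Infrastructure as Code" theme of the talk: the same function can stamp out a right-sized cluster per project.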
43. AWS Batch
• Dynamically provisions resources
• Plans, schedules, and executes
• No additional components to install
[Diagram labels: Event (changes in data state, requests to endpoints, scheduled triggers); Services (anything); Job queue; Auto Scaling; Compute; Execution; Your code]
44. Efficient job scheduling: Multi-node parallel job support on AWS Batch
Simplify your compute clusters and scale jobs across multiple instances with AWS Batch support for Multi-node Parallel (MNP) jobs
[Diagram: “My job” runs in Container 1 through Container 4, spread across Instance 1 through Instance 4]
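A four-node job like the one in the diagram could be described to AWS Batch with a multi-node parallel job definition. The sketch below only builds the request data; the helper name, image URI, command, and resource sizes are placeholders, while the keys follow the shape of the Batch `RegisterJobDefinition` request as exposed by boto3's `register_job_definition`.

```python
from typing import Any, Dict, List


def build_mnp_job_definition(
    name: str,
    image: str,
    num_nodes: int,
    vcpus: int,
    memory_mib: int,
    command: List[str],
) -> Dict[str, Any]:
    """Assemble a multi-node parallel job definition (hypothetical helper)."""
    return {
        "jobDefinitionName": name,
        "type": "multinode",
        "nodeProperties": {
            "numNodes": num_nodes,
            "mainNode": 0,                # index of the node that coordinates the job
            "nodeRangeProperties": [
                {
                    "targetNodes": "0:",  # apply this container spec to every node
                    "container": {
                        "image": image,
                        "vcpus": vcpus,
                        "memory": memory_mib,
                        "command": command,
                    },
                }
            ],
        },
    }


# Placeholder values for illustration only.
job_definition = build_mnp_job_definition(
    name="my-mpi-job",
    image="example.com/my-mpi-image:latest",
    num_nodes=4,
    vcpus=36,
    memory_mib=65536,
    command=["mpirun", "./my_solver"],
)
print(job_definition["nodeProperties"]["numNodes"])

# With boto3 installed and credentials configured, it could be registered as:
#   import boto3
#   boto3.client("batch").register_job_definition(**job_definition)
```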
45. Orchestration tools include support for capacity and cost optimization
Use Reserved Instances for known/steady-state workloads
Scale using Spot, On-Demand, or both
Evaluate the trade-off of time to solution vs. cost for scaling
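The trade-off between time to solution and cost can be explored with a very simple model. The sketch below assumes perfect scaling (doubling the instances halves the run time) and ignores Spot interruptions; the prices, instance counts, and workload size are hypothetical values, not AWS pricing.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ScalingOption:
    name: str
    instances: int
    hourly_price: float  # hypothetical price per instance-hour, in dollars


def evaluate(total_instance_hours: float, option: ScalingOption) -> Tuple[float, float]:
    """Return (hours to solution, total cost) under a perfect-scaling assumption."""
    hours = total_instance_hours / option.instances
    cost = hours * option.instances * option.hourly_price
    return hours, cost


WORKLOAD_INSTANCE_HOURS = 1000  # hypothetical workload size

options = [
    ScalingOption("On-Demand, 10 instances", instances=10, hourly_price=3.00),
    ScalingOption("Spot, 100 instances", instances=100, hourly_price=1.00),
]

for option in options:
    hours, cost = evaluate(WORKLOAD_INSTANCE_HOURS, option)
    print(f"{option.name}: {hours:.0f} h to solution, ${cost:,.0f}")
```

Under perfect scaling the cost depends only on the price per instance-hour, so adding instances shortens time to solution for free; real workloads scale less than perfectly, which is the trade-off the slide asks you to evaluate.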
47. Running HPC applications at extreme scale
“Storage technology is amazingly complex and we’re constantly pushing the limits of physics and engineering to deliver next-generation capacities and technical innovation. This successful collaboration with AWS shows the extreme scale, power and agility of cloud-based HPC to help us run complex simulations for future storage architecture analysis and materials science explorations. Using AWS to easily shrink simulation time from 20 days to 8 hours allows Western Digital R&D teams to explore new designs and innovations at a pace un-imaginable just a short time ago.”
Steve Phillpott, CIO, Western Digital
single HPC cluster of 1 million vCPUs
Accelerating time to innovation
20 days → 8 hours
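The headline numbers translate into a speedup factor with one line of arithmetic; the variable names below are illustrative.

```python
HOURS_PER_DAY = 24

before_hours = 20 * HOURS_PER_DAY  # 20 days of simulation time
after_hours = 8                    # 8 hours on the cloud-based cluster

speedup = before_hours / after_hours
print(f"Speedup: {speedup:.0f}x")
```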
48. Descartes Labs makes the Top500 List running on AWS
https://medium.com/descarteslabs-team/thunder-from-the-cloud-40-000-cores-running-in-concert-on-aws-bf1610679978
50. 3 million core-hours of Amazon EC2 Spot capacity
https://www.nature.com/articles/s41588-018-0153-5
Complete sequencing of 3.24 billion base pairs
51. Manage 50X the number of securities
4,000 times faster
In hours, instead of months
Run risk models
Helping financial institutions
model investment risks
53. Flexible configuration and virtually unlimited scalability to grow and shrink your infrastructure as your HPC workloads dictate, not the other way around
HPC on AWS