High Performance Computing in AWS, Immersion Day Huntsville 2019

Brad Dispensa
Pr. Security and Compliance SA
WW Public Sector
HPC on AWS
Wednesday, October 9, 2019
© 2019 Amazon Web Services, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services, Inc.

Talk Outline
Brief Overview: Cloud Computing on AWS
Why use AWS for HPC?
HPC Building Blocks in AWS
✓ Compute
✓ Networking
✓ Storage
✓ Deployment Tools
Customer Success Stories and Example Use Cases

What is Cloud Computing on AWS?

Global Infrastructure
69 Availability Zones within 22 geographic Regions around the world
[Map legend: Coming soon]
We add the equivalent of an entire Fortune 500 company’s compute capacity every day

AWS Availability Zones, Data Centers, Servers
[Diagram: Availability Zones (AZ) linked through Transit centers]
over 50,000 servers & often over 80,000

Amazing Amount of Compute Capacity – Economy of Scale
A virtually unlimited number of architecture options
Instance Types, OS, Traditional Cluster, Auto Scaling Clusters, Serverless, GPUs
Extensive deployment options – “Infrastructure as Code”
Console, Configuration Control, Automated, SDK, Bash/CLI, AWS CloudFormation
Lots of useful services
Amazon DynamoDB, Amazon CloudWatch, Amazon Glacier, and much more!
[Icons: instance with Amazon CloudWatch, Auto Scaling, template, Amazon DynamoDB, AWS Lambda]

Why use AWS for HPC?

Great Features for HPC Workloads
Experimentation without Fear!
Activate Multiple Compute Clusters Simultaneously!
A Supercomputer at the Fingertips of EACH Scientist!
Start and stop instances or entire clusters!
Take Advantage of Spot Pricing
Receive Continuous Updates
Compute, Network, Storage
All Services
Immediate Access to latest Technology

The Life of an Average HPC Code on a Supercomputer:
The average number of cores: 14
The average wall-clock time: 1.69 hrs
The average queue wait time: 4.4 days!
Cloud Improves “Workload Throughput”
Think: “Workload Throughput”
https://xdmod.ccr.buffalo.edu/#main_tab_panel:tg_summary
• The job queue becomes the capacity buffer
• Job completion times are hard to predict
• Users are frustrated and run fewer jobs
• Innovation is throttled by fixed IT resources
Run many Jobs in Parallel, Use it when you need it
Pay only for what you use
Right-size clusters and resources
Optimize each workload for performance
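
As a rough illustration of the "workload throughput" point, the two averages quoted above can be combined into a turnaround time. This is a minimal sketch: the function name is illustrative, and the only inputs are the average queue wait and wall-clock time from the slide.

```python
def turnaround_hours(queue_wait_days: float, wall_clock_hours: float) -> float:
    """Time from job submission to results: queue wait plus run time."""
    return queue_wait_days * 24.0 + wall_clock_hours


# Averages quoted on the slide above.
QUEUE_WAIT_DAYS = 4.4
WALL_CLOCK_HOURS = 1.69

total = turnaround_hours(QUEUE_WAIT_DAYS, WALL_CLOCK_HOURS)
waiting_fraction = (QUEUE_WAIT_DAYS * 24.0) / total

print(f"turnaround: {total:.2f} h, share spent waiting: {waiting_fraction:.1%}")
```

With these averages almost all of the time to results is queue wait rather than computation, which is the case for thinking in terms of throughput rather than raw run time.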

Time-to-results Efficiency
[Figure: numbered jobs packed against a Cores axis, shown with and without a fixed data center capacity limit]
Fixed data center capacity limit: finite capacity, usually with long queues to wait in
Massive capacity when needed to speed up time to results, and an agile environment when additional hardware and software experimentation is needed
“For every $1 spent on HPC, businesses see $463 in incremental revenues and $44 in incremental profit.”

What this can look like…
Scale using Elastic Capacity
[Chart: # of Cores (axis 0 to 75,000) at successive times]
Time: +00h     <1,000 cores
Time: +24h     >75,000 memory optimized cores
Time: +72h     <1,000 cores
Time: +120h    >30,000 GPU optimized cores

AWS HPC Building Blocks Outline
Compute
Storage/Data Management
Networking
Deployment Tools

“What did you say???” – The AWS Lingo
Term              Meaning
AWS               Amazon Web Services
EC2               Elastic Compute Cloud; an AWS service providing virtual machines
AMI               Amazon Machine Image, a virtual image for a virtual machine
Instance          One launched virtual server
EBS               Elastic Block Store, data storage attached to an EC2 Instance
VPC               Virtual Private Cloud, your private piece of the cloud
S3                Simple Storage Service, amazing object storage service
Security Group    The instance firewall
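
To show how these terms fit together, here is a minimal sketch that assembles the parameters for launching one Instance. The dictionary keys follow the EC2 RunInstances request as exposed by boto3's `run_instances`; the helper name, all IDs, the device name and the volume type are placeholder assumptions, and nothing is sent to AWS.

```python
def build_launch_request(ami_id, instance_type, subnet_id, security_group_id, ebs_size_gib):
    """Assemble RunInstances parameters; this only builds a dict, it calls no API."""
    return {
        "ImageId": ami_id,                        # AMI: the image the Instance boots from
        "InstanceType": instance_type,            # EC2 instance type, e.g. "c4.large"
        "MinCount": 1,                            # launch exactly one Instance
        "MaxCount": 1,
        "SubnetId": subnet_id,                    # a subnet inside your VPC
        "SecurityGroupIds": [security_group_id],  # Security Group: the instance firewall
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/xvda",        # assumed root device name
                "Ebs": {"VolumeSize": ebs_size_gib, "VolumeType": "gp2"},  # EBS volume
            }
        ],
    }


# All identifiers below are hypothetical placeholders.
request = build_launch_request(
    "ami-PLACEHOLDER", "c4.large", "subnet-PLACEHOLDER", "sg-PLACEHOLDER", 100
)
print(request["InstanceType"], request["BlockDeviceMappings"][0]["Ebs"]["VolumeSize"])
```

With boto3 installed and credentials configured, a dict like this could be passed as `boto3.client("ec2").run_instances(**request)`.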

CHOICE OF AWS INSTANCES FOR HPC
General purpose: M4, M5
Compute optimized, core count: C4, C5(n)
Storage and IO optimized: I3, D2
GPU, FPGA accelerated: P2, P3dn, F1
Memory optimized: R4, R5, X1, z1d

Amazon EC2 Instances
Example: c4.large
Instance Family: c
Instance Generation: 4
Instance Size: large
Vertical Scaling: c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge
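
The naming scheme can be sketched as a small parser. This is an illustrative helper rather than an AWS API: the regular expression is an assumption that covers the instance names used in this deck (family letter, generation digit, optional attribute letters such as "n" or "d", then the size) and may not cover every EC2 name.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str       # e.g. "c" for compute optimized
    generation: int   # e.g. 4
    attributes: str   # optional suffix letters, e.g. "n" in c5n
    size: str         # e.g. "large", "18xlarge"


# Assumed pattern: letters, digits, optional letters, a dot, then the size.
_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$")


def parse_instance_type(name: str) -> InstanceType:
    """Split a name such as 'c4.large' into family, generation, attributes, size."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type name: {name!r}")
    family, generation, attributes, size = match.groups()
    return InstanceType(family, int(generation), attributes, size)


for example in ("c4.large", "c5n.18xlarge", "z1d.12xlarge"):
    print(example, parse_instance_type(example))
```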

Selecting an instance type for HPC
Instance Type | vCPU | Memory (GiB) | Storage (GB) | Networking Performance | Physical Processor | Vector Engine | Clock Speed (GHz) | Hypervisor
c4.8xlarge | 36 | 60 | EBS Only | 10 Gigabit | Intel Xeon V3 | AVX2 | 2.9 | Xen based
c5(n).18xlarge | 72 | 144 | EBS Only | 25/100 Gigabit | Intel Xeon Platinum | AVX512 | 3.5 | Nitro
m5(d).24xlarge | 96 | 384 | EBS/SSD | 25 Gigabit | Intel Xeon Platinum | AVX512 | 2.5 | Nitro
r4.16xlarge | 64 | 488 | EBS Only | 25 Gigabit | Intel Xeon V4 | AVX2 | 2.3 | Xen based
r5(d).24xlarge | 96 | 768 | EBS/SSD | 25 Gigabit | Intel Xeon Platinum | AVX512 | 3.1 | Nitro
x1.32xlarge | 128 | 1,952 | SSD | 25 Gigabit | Intel Xeon V3 | AVX2 | 2.3 | Xen based
z1d.12xlarge | 48 | 384 | SSD | 25 Gigabit | Intel Xeon Platinum | AVX512 | 4.0 | Nitro
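
One way to use a table like this is to filter on the constraints a code actually has and then rank what is left. The sketch below is illustrative only: it copies the memory, vector engine and clock speed columns from the table above, and the selection rule (minimum memory, required vector engine, highest clock first) is an assumed example rather than AWS guidance.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    name: str
    memory_gib: int
    vector_engine: str
    clock_ghz: float


# Memory, vector engine and clock speed as listed in the table above.
TABLE = [
    Row("c4.8xlarge", 60, "AVX2", 2.9),
    Row("c5(n).18xlarge", 144, "AVX512", 3.5),
    Row("m5(d).24xlarge", 384, "AVX512", 2.5),
    Row("r4.16xlarge", 488, "AVX2", 2.3),
    Row("r5(d).24xlarge", 768, "AVX512", 3.1),
    Row("x1.32xlarge", 1952, "AVX2", 2.3),
    Row("z1d.12xlarge", 384, "AVX512", 4.0),
]


def candidates(min_memory_gib: int, vector_engine: str) -> List[Row]:
    """Rows meeting the memory and vector-engine needs, fastest clock first."""
    rows = [
        row for row in TABLE
        if row.memory_gib >= min_memory_gib and row.vector_engine == vector_engine
    ]
    return sorted(rows, key=lambda row: row.clock_ghz, reverse=True)


for row in candidates(min_memory_gib=300, vector_engine="AVX512"):
    print(row.name, row.memory_gib, row.clock_ghz)
```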

High network bandwidth compute instances: C5n, P3dn, I3en
• First “network optimized” instances on AWS
• Deliver up to 100 Gbps network throughput
• Based on C5/P3/I3 instances:
– Intel Skylake/Broadwell CPUs
– Nitro System (hypervisor and ENA)
• Intended for network-intensive applications including HPC

High bandwidth compute instances: C5n
HPC stack on AWS: 3D graphics virtual workstation; license managers and cluster head nodes with job schedulers; cloud-based, auto-scaling HPC clusters; shared file storage; storage cache
Massively scalable performance
• C5n instances offer up to 100 Gbps of network bandwidth
• Significant improvements in maximum bandwidth, packets per second, and packet processing
• Custom designed Nitro network cards
• Purpose-built to run network bound workloads including distributed cluster and database workloads, HPC, real-time communications and video streaming

High bandwidth compute instances: P3dn
Optimized for distributed ML training
• One of the most powerful GPU instances available in the cloud
• Distributed machine learning training across multiple GPU instances
• 100 Gbps of networking throughput
• Based on NVIDIA’s latest GPU, the Tesla V100, with 32 GB of memory each
• The largest Amazon Elastic Compute Cloud (Amazon EC2) P3 instance size available

High clock speed compute instances: Z1d
Up to 4 GHz sustained, all-turbo performance
• Z1d instances are optimized for memory-intensive, compute-intensive applications
• Custom Intel Xeon Scalable processor
• Up to 4 GHz sustained, all-turbo performance
• Up to 384 GiB DDR4 memory
• Enhanced networking, up to 25 Gbps throughput

Network
“It’s as fast as its SLOWEST Component”

AWS is Committed to Networking at Scale
• The AWS Network is Custom Built
– Full bi-section bandwidth in placement groups
– Designed such that all ports can run flat out
– No blocking, no oversubscription
– Continuously improving
– Commodity parts on a Moore’s Law Pace
• Enhanced Networking
– Reduced instance-to-instance latency
– Reduced jitter
• Amazon Elastic Network Adapter
– New PCI network device developed for EC2
– Available on newer instances, including C5, M5, R4, R5, Z1d
– Ability to scale across a variety of bandwidths
• 10 and 20 Gbps today
“We love where we are right now,” AND “It will only get better!”

ELASTIC NETWORK ADAPTER
• Latest generation of Enhanced Networking
• Hardware Checksums
• Multi-Queue Support
• Receive Side Steering
• 25 Gbps in a Placement Group
• Open Source Amazon Network Driver

Remember this?
[Diagram: Availability Zones (AZ) linked through Transit centers]
Our Network Needs to Scale!
22 Regions – 69 Availability Zones – 87 Edge Locations

Elastic Fabric Adapter (EFA)
Elastic Fabric Adapter, best for large HPC workloads
Scale tightly-coupled HPC applications on AWS
[Diagram: EFA shown with C5n, P3dn, and i3en instances]

AWS Elastic Fabric Adapter – EFA
• Proprietary, AWS-designed fabric network
• Built on top of network optimized instances on AWS
• Delivering up to 100 Gbps network throughput
• Delivering below 15 µs latencies for HPC applications
• Optimized for Open MPI and other MPI libraries
• Supported on C5n, R5n, M5n, and P3dn instances
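
To see why both the latency and the throughput figures matter for tightly-coupled codes, a simple latency-plus-bandwidth model of one message can help. This is a minimal sketch under stated assumptions: the model is the textbook "time = latency + size / bandwidth" approximation, not a measurement, and it plugs in the slide's headline numbers (15 µs as an upper bound on latency, 100 Gbps as peak throughput).

```python
LATENCY_S = 15e-6               # "below 15 µs", used here as an upper bound
BANDWIDTH_BITS_PER_S = 100e9    # "up to 100 Gbps", used here as the peak rate


def transfer_time_s(message_bytes: float) -> float:
    """Estimated time to deliver one message: fixed latency plus wire time."""
    return LATENCY_S + (message_bytes * 8) / BANDWIDTH_BITS_PER_S


def half_performance_bytes() -> float:
    """Message size at which wire time equals latency (half of peak bandwidth)."""
    return LATENCY_S * BANDWIDTH_BITS_PER_S / 8


for size in (8, 64 * 1024, 1024 * 1024):
    print(f"{size:>8} bytes: {transfer_time_s(size) * 1e6:.2f} µs")
print(f"half-performance message size: {half_performance_bytes():.0f} bytes")
```

Under this model, messages much smaller than the half-performance size are dominated by latency, which is the regime typical of tightly-coupled MPI exchanges.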

HPC software stack in Amazon EC2
[Diagram: software stack layers (Userspace, Kernel) compared without EFA and with EFA]

What can EFA do?
Thanks to Metacomp Technologies and the Klingon Empire.

OpenFOAM benchmark MotorBike 140M

Comprehensive portfolio of storage options for HPC
Block storage: Amazon EBS. Elastic, high performance block storage at any scale
File storage: Amazon EFS. Petabyte-scale, elastic file storage sharable across applications, instances and servers
Object storage: Amazon S3. Low cost, highly scalable cloud storage with 99.999999999% durability

Amazon FSx for Lustre: Fully managed high performance parallel shared file system
High performing parallel distributed file system
Tune complex performance parameters
Massively scalable performance
100+ GiB/s throughput
Millions of IOPS
Consistent low latencies

High and scalable performance
Each terabyte (TB) of storage provides 200 MB/second of file system throughput and ~5,000 IOPS
Parallel File System
100+ GiB/s throughput
Millions of IOPS
Consistent sub-millisecond latencies
Supports concurrent access from hundreds of thousands of cores
SSD-based

File system throughput & IOPS scale linearly with storage capacity
Each TB of storage provides 200 MB/s of baseline throughput, and up to 12x burst throughput
File systems can scale to hundreds of GB/s and millions of IOPS
Capacity | Baseline throughput | Burst throughput
1 TB | 200 MB/s | up to 2.4 GB/s
10 TB | 2 GB/s | up to 24 GB/s
1 PB | 200 GB/s | at least 240 GB/s
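
The linear scaling rule can be written out directly. This is a minimal sketch using only the two figures stated above (200 MB/s of baseline throughput per TB, and up to 12x burst); the function names are illustrative. The 1 PB burst entry in the table is quoted as "at least 240 GB/s" rather than 12x, so it is not modeled here.

```python
import math

BASELINE_MB_S_PER_TB = 200   # baseline throughput per TB of storage
BURST_MULTIPLIER = 12        # "up to 12x burst throughput"


def baseline_throughput_mb_s(capacity_tb: float) -> float:
    """Baseline throughput grows linearly with provisioned capacity."""
    return capacity_tb * BASELINE_MB_S_PER_TB


def burst_ceiling_mb_s(capacity_tb: float) -> float:
    """Upper bound on burst throughput under the 12x rule."""
    return baseline_throughput_mb_s(capacity_tb) * BURST_MULTIPLIER


def capacity_for_baseline_tb(target_mb_s: float) -> int:
    """Smallest whole number of TB whose baseline meets the target."""
    return math.ceil(target_mb_s / BASELINE_MB_S_PER_TB)


for tb in (1, 10):
    print(tb, "TB:", baseline_throughput_mb_s(tb), "MB/s baseline,",
          burst_ceiling_mb_s(tb), "MB/s burst ceiling")
print("TB needed for 5,000 MB/s baseline:", capacity_for_baseline_tb(5000))
```

Sizing by throughput rather than by data volume is the practical consequence: a workload that needs a given sustained bandwidth may need more capacity than its data alone would require.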

Deployment Tools - Orchestration

Easy cluster management: AWS ParallelCluster
Simplifies deployment of HPC in the cloud, including integrating with popular HPC schedulers
Integrated with AWS Batch, Amazon FSx for Lustre and Elastic Fabric Adapter

AWS ParallelCluster
• Simplifies deployment of HPC clusters in the cloud
• Integrates with popular HPC schedulers such as:
– Slurm, Grid Engine, Torque
• Built on AWS CloudFormation
• Easy to modify to meet specific application or project requirements
• Latest Features:
– Multiple EBS volumes
– Custom AMI support
• Bring your own custom AMI, not just build from our default AMI
– Easy use of EFS and FSx for Lustre
• Launch Templates support (for EFA)
– Support for C5n and other new instance types
– AWS Batch Integration
• Open Source available on GitHub
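
A cluster of this kind is described in a configuration file. The sketch below generates one with Python's standard `configparser`; the section and key names follow the ParallelCluster 2.x INI-style configuration as best recalled and should be checked against the documentation for the installed version. The region, key pair, VPC and subnet IDs, instance types, and queue sizes are all hypothetical placeholders.

```python
import configparser
import io


def build_cluster_config() -> str:
    """Render an illustrative ParallelCluster-style config as INI text."""
    cfg = configparser.ConfigParser()
    cfg["global"] = {"cluster_template": "default"}
    cfg["aws"] = {"aws_region_name": "us-east-1"}      # placeholder region
    cfg["cluster default"] = {
        "key_name": "my-key",                          # placeholder key pair
        "base_os": "alinux",
        "scheduler": "slurm",                          # one of the schedulers listed above
        "master_instance_type": "c5.xlarge",
        "compute_instance_type": "c5n.18xlarge",       # EFA-capable instance type
        "initial_queue_size": "0",
        "max_queue_size": "16",
        "placement_group": "DYNAMIC",
        "enable_efa": "compute",
        "vpc_settings": "public",
        "fsx_settings": "fs",
    }
    cfg["vpc public"] = {
        "vpc_id": "vpc-PLACEHOLDER",
        "master_subnet_id": "subnet-PLACEHOLDER",
    }
    cfg["fsx fs"] = {"shared_dir": "/fsx", "storage_capacity": "3600"}
    buffer = io.StringIO()
    cfg.write(buffer)
    return buffer.getvalue()


config_text = build_cluster_config()
print(config_text)
```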

AWS Batch
• Dynamically provisions resources
• Plans, schedules, and executes
• No additional components to install
[Diagram labels: Event (changes in data state, requests to endpoints, scheduled triggers); Services (anything); Compute; Execution; Your code; Auto Scaling; Job queue]

Efficient job scheduling: Multi-node parallel job support on AWS Batch
Simplify your compute clusters and scale jobs across multiple instances with AWS Batch support for Multi-node Parallel (MNP) jobs
[Diagram: one job ("My job") spanning Containers 1 to 4, one container on each of Instances 1 to 4]
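
A multi-node parallel job like the four-instance one pictured is described by a job definition of type "multinode". The sketch below only builds the request dictionary; the field names follow the AWS Batch RegisterJobDefinition request as exposed by boto3's `register_job_definition`, while the helper name, job name, container image, resource sizes, and command are hypothetical placeholders.

```python
from typing import Dict, List


def multinode_job_definition(name: str, image: str, num_nodes: int,
                             vcpus: int, memory_mib: int,
                             command: List[str]) -> Dict:
    """Build a multi-node parallel job definition; this calls no AWS API."""
    return {
        "jobDefinitionName": name,
        "type": "multinode",
        "nodeProperties": {
            "numNodes": num_nodes,
            "mainNode": 0,                               # index of the main node
            "nodeRangeProperties": [
                {
                    # Apply the same container settings to every node.
                    "targetNodes": f"0:{num_nodes - 1}",
                    "container": {
                        "image": image,
                        "vcpus": vcpus,
                        "memory": memory_mib,
                        "command": command,
                    },
                }
            ],
        },
    }


# Placeholder values: four nodes, matching the four instances in the diagram.
definition = multinode_job_definition(
    "my-mnp-job", "example/mpi-app:latest", 4, 36, 60000, ["mpirun", "./solver"]
)
print(definition["nodeProperties"]["nodeRangeProperties"][0]["targetNodes"])
```

With boto3 installed and credentials configured, a dict like this could be passed as `boto3.client("batch").register_job_definition(**definition)`.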

Orchestration tools include support for capacity and cost optimization
Use Reserved Instances for known/steady-state workloads
Scale using Spot, On-Demand, or both
Evaluate the trade-off of time to solution vs. cost for scaling
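
The time-to-solution versus cost trade-off can be explored with a small model. This is a sketch under loudly stated assumptions: scaling follows Amdahl's law with a fixed serial fraction, billing is per node-hour, and the 100-hour single-node run time, 1% serial fraction, and 3.00 per node-hour price are made-up illustrative inputs, not AWS prices or benchmark results.

```python
from typing import Tuple


def time_and_cost(nodes: int, single_node_hours: float,
                  serial_fraction: float,
                  price_per_node_hour: float) -> Tuple[float, float]:
    """Return (wall-clock hours, total cost) for a run on `nodes` nodes."""
    # Amdahl's law: the serial part does not speed up, the rest divides by nodes.
    wall_hours = single_node_hours * (
        serial_fraction + (1.0 - serial_fraction) / nodes
    )
    cost = nodes * wall_hours * price_per_node_hour
    return wall_hours, cost


# Hypothetical workload and price, for illustration only.
for nodes in (1, 10, 100):
    hours, cost = time_and_cost(nodes, single_node_hours=100.0,
                                serial_fraction=0.01, price_per_node_hour=3.0)
    print(f"{nodes:>3} nodes: {hours:7.2f} h, cost {cost:7.2f}")
```

In this model, adding nodes keeps shortening the run, but the cost rises as parallel efficiency falls, so the right cluster size depends on how much a faster answer is worth.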

Running HPC applications at extreme scale
“Storage technology is amazingly complex and we’re constantly pushing the limits of physics and engineering to deliver next-generation capacities and technical innovation. This successful collaboration with AWS shows the extreme scale, power and agility of cloud-based HPC to help us run complex simulations for future storage architecture analysis and materials science explorations. Using AWS to easily shrink simulation time from 20 days to 8 hours allows Western Digital R&D teams to explore new designs and innovations at a pace unimaginable just a short time ago.”
- Steve Phillpott, CIO, Western Digital
single HPC cluster of 1 million vCPUs
Accelerating time to innovation
20 days → 8 hours

Descartes Labs makes the Top500 List running on AWS
https://medium.com/descarteslabs-team/thunder-from-the-cloud-40-000-cores-running-in-concert-on-aws-bf1610679978

3 million core-hours of Amazon EC2 Spot capacity
https://www.nature.com/articles/s41588-018-0153-5
Complete sequencing of 3.24 billion base pairs

Helping financial institutions model investment risks
Manage 50X the number of securities
Run risk models 4,000 times faster
In hours, instead of months

Helping to make supersonic flights mainstream
Engineering simulations 600 times faster

HPC on AWS
Flexible configuration and virtually unlimited scalability to grow and shrink your infrastructure as your HPC workloads dictate, not the other way around

Thank You!
Any Questions?
For More Information:
aws.amazon.com/hpc/
aws.amazon.com/getting-started
