Brad Dispensa
Pr. Security and Compliance SA
WW Public Sector
HPC on AWS
Wednesday, October 9, 2019
© 2019 Amazon Web Services, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services, Inc.
Talk Outline
Brief Overview: Cloud Computing on AWS
Why use AWS for HPC?
HPC Building Blocks in AWS
✓ Compute
✓ Networking
✓ Storage
✓ Deployment Tools
Customer Success Stories and Example Use Cases
1
What is Cloud Computing on AWS?
Global Infrastructure
[Map: current Regions and Regions coming soon]
69 Availability Zones within 22 geographic Regions around the world
We add the equivalent of an entire Fortune 500 company’s compute capacity every day
AWS Availability Zones, Data Centers, Servers
[Diagram: Availability Zones (AZ) linked by transit centers]
over 50,000 servers & often over 80,000
2
Why AWS for HPC?
Amazing Amount of Compute Capacity – Economy of Scale
A virtually unlimited number of architecture options
Instance Types, OS, Traditional Cluster, Auto Scaling Clusters, Serverless, GPUs
Extensive deployment options – “Infrastructure as Code”
Console, Configuration Control, Automated, SDK, Bash/CLI, AWS CloudFormation
Lots of useful services
Amazon DynamoDB, Amazon CloudWatch, Amazon Glacier, and much more!
[Diagram icons: instance with Amazon CloudWatch, Auto Scaling, template, Amazon DynamoDB, AWS Lambda]
Why use AWS for HPC?
Great Features for HPC Workloads
Experimentation without Fear!
Activate Multiple Compute Clusters Simultaneously!
A Supercomputer at the Fingertips of EACH Scientist!
Start and stop instances or entire clusters!
Take Advantage of Spot Pricing
Receive Continuous Updates
Compute, Network, Storage
All Services
Immediate Access to latest Technology
The Life of an Average HPC Code on a Supercomputer:
The average number of cores: 14
The average wall-clock time: 1.69 hrs
The average queue wait time: 4.4 days!
Cloud Improves “Workload Throughput”
Think: “Workload Throughput”
https://xdmod.ccr.buffalo.edu/#main_tab_panel:tg_summary
• The job queue becomes the capacity buffer
• Job completion times are hard to predict
• Users are frustrated and run fewer jobs
• Innovation is throttled by fixed IT resources
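To make the averages quoted above concrete, here is a minimal sketch that converts the queue wait into hours and works out what share of the total turnaround is spent waiting. The function name `turnaround_breakdown` is illustrative, and the sketch assumes turnaround is simply queue wait plus wall-clock time.

```python
def turnaround_breakdown(wall_clock_hours, queue_wait_days):
    """Return (wait_hours, turnaround_hours, waiting_fraction)."""
    wait_hours = queue_wait_days * 24.0
    # Assumption: turnaround = time in the queue + time running.
    turnaround_hours = wait_hours + wall_clock_hours
    return wait_hours, turnaround_hours, wait_hours / turnaround_hours


# Averages quoted on the slide: 1.69 hrs wall-clock, 4.4 days in the queue.
wait_hours, turnaround_hours, waiting_fraction = turnaround_breakdown(1.69, 4.4)
print(f"queue wait: {wait_hours:.1f} h")
print(f"turnaround: {turnaround_hours:.2f} h")
print(f"share of turnaround spent waiting: {waiting_fraction:.1%}")
```

With these figures the job spends roughly 98% of its turnaround in the queue, which is the gap that "workload throughput" refers to.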
Run many Jobs in Parallel, Use it when you need it
Pay only for what you use
Right-size clusters and resources
Optimize each workload for performance
Time-to-results Efficiency
[Figure: numbered jobs stacked by core count (axis: Cores) against a fixed data center capacity limit]
Finite capacity, usually with long queues to wait in
Massive capacity when needed to speed up time to results, and agile environment when additional hardware and software experimentation is needed
“For every $1 spent on HPC, businesses see $463 in incremental revenues and $44 in incremental profit.”
What this can look like…
Scale using Elastic Capacity
[Chart: # of Cores (axis 0 to 75,000) at four points in time]
Time: +00h    <1,000 cores
Time: +24h    >75,000 memory optimized cores
Time: +72h    <1,000 cores
Time: +120h   >30,000 GPU optimized cores
3
AWS Building Blocks
AWS HPC Building Blocks Outline
Compute
Storage/Data Management
Networking
Deployment Tools
“What did you say???” – The AWS Lingo
Term            Meaning
AWS             Amazon Web Services
EC2             Elastic Compute Cloud; an AWS service providing virtual machines
AMI             Amazon Machine Image, a virtual image for a virtual machine
Instance        One launched virtual server
EBS             Elastic Block Store, data storage attached to an EC2 Instance
VPC             Virtual Private Cloud, your private piece of the cloud
S3              Simple Storage Service, amazing object storage service
Security Group  The instance firewall
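As a rough illustration of how these terms fit together, the sketch below assembles the request parameters for launching one instance. The helper name `build_run_instances_params`, every ID value, and the `/dev/xvda` device name are placeholders assumed for the example. The dictionary keys are the parameter names of the EC2 RunInstances API. Nothing is sent to AWS here.

```python
def build_run_instances_params(ami_id, instance_type, subnet_id,
                               security_group_id, ebs_size_gib):
    """Assemble keyword arguments for the EC2 RunInstances call."""
    return {
        "ImageId": ami_id,                        # AMI: the machine image to boot
        "InstanceType": instance_type,            # e.g. c4.large
        "MinCount": 1,
        "MaxCount": 1,                            # launch exactly one Instance
        "SubnetId": subnet_id,                    # a subnet inside your VPC
        "SecurityGroupIds": [security_group_id],  # the instance firewall
        "BlockDeviceMappings": [
            # EBS volume attached to the instance; device name is an assumption.
            {"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": ebs_size_gib}}
        ],
    }


# All identifiers below are hypothetical placeholders, not real resources.
params = build_run_instances_params(
    ami_id="ami-EXAMPLE",
    instance_type="c4.large",
    subnet_id="subnet-EXAMPLE",
    security_group_id="sg-EXAMPLE",
    ebs_size_gib=100,
)
print(params)
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("ec2").run_instances(**params)`. That call is left out so the sketch runs without an AWS account.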
Compute
CHOICE OF AWS INSTANCES FOR HPC
Categories: General purpose; Compute optimized, core count; Storage and IO optimized; GPU, FPGA accelerated; Memory optimized
Instance families shown: M4, M5, C4, C5(n), z1d, R4, R5, X1, I3, D2, P2, P3dn, F1
c4.large: Instance Family (c), Instance Generation (4), Instance Size (large)
Vertical Scaling: Amazon EC2 Instances
[Diagram: size comparison c4.8xlarge ≈ c4.4xlarge ≈ c4.2xlarge ≈ c4.xlarge]
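The naming scheme (family, generation, size) can be sketched as a small parser. This is an illustration only: the function name `parse_instance_type` and the regular expression are assumptions, and the extra letters that follow the generation in names such as c5n or z1d are grouped under a `suffix` field that the slide does not name.

```python
import re

# family letters, generation digits, optional extra letters, then ".size"
_INSTANCE_TYPE = re.compile(r"^([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$")


def parse_instance_type(name):
    """Split an EC2 instance type such as 'c4.large' into its parts."""
    match = _INSTANCE_TYPE.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    family, generation, suffix, size = match.groups()
    return {
        "family": family,
        "generation": int(generation),
        "suffix": suffix,  # hypothetical label for letters like 'n' or 'd'
        "size": size,
    }


for name in ("c4.large", "c5n.18xlarge", "z1d.12xlarge"):
    print(name, parse_instance_type(name))
```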
Selecting an instance type for HPC
Instance Type    vCPU  Memory (GiB)  Storage (GB)  Networking Performance  Physical Processor   Vector Engine  Clock Speed (GHz)  Hypervisor
c4.8xlarge       36    60            EBS Only      10 Gigabit              Intel Xeon V3        AVX2           2.9                Xen based
c5(n).18xlarge   72    144           EBS Only      25/100 Gigabit          Intel Xeon Platinum  AVX512         3.5                Nitro
m5(d).24xlarge   48    384           EBS/SSD       25 Gigabit              Intel Xeon Platinum  AVX512         2.5                Nitro
r4.16xlarge      32    488           EBS Only      25 Gigabit              Intel Xeon V4        AVX2           2.3                Xen based
r5(d).24xlarge   96    768           EBS/SSD       25 Gigabit              Intel Xeon Platinum  AVX512         3.1                Nitro
x1.32xlarge      128   1,952         SSD           25 Gigabit              Intel Xeon V3        AVX2           2.3                Nitro
z1d.12xlarge     48    384           SSD           25 Gigabit              Intel Xeon Platinum  AVX512         4.0                Nitro
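One way to use a table like this is to shortlist instance types against a workload's requirements. The sketch below is only an illustration: the rows are copied as printed on the slide, while the function name `shortlist` and the chosen criteria (memory per vCPU, vector engine, ranking by clock speed) are assumptions rather than an AWS-recommended method.

```python
# Rows as printed on the slide: (type, vCPU, memory GiB, vector engine, clock GHz).
INSTANCES = [
    ("c4.8xlarge", 36, 60, "AVX2", 2.9),
    ("c5(n).18xlarge", 72, 144, "AVX512", 3.5),
    ("m5(d).24xlarge", 48, 384, "AVX512", 2.5),
    ("r4.16xlarge", 32, 488, "AVX2", 2.3),
    ("r5(d).24xlarge", 96, 768, "AVX512", 3.1),
    ("x1.32xlarge", 128, 1952, "AVX2", 2.3),
    ("z1d.12xlarge", 48, 384, "AVX512", 4.0),
]


def shortlist(min_gib_per_vcpu, vector_engine):
    """Return matching instance types, fastest clock speed first."""
    matches = [
        row for row in INSTANCES
        if row[3] == vector_engine and row[2] / row[1] >= min_gib_per_vcpu
    ]
    ranked = sorted(matches, key=lambda row: row[4], reverse=True)
    return [row[0] for row in ranked]


# Hypothetical requirement: AVX512 and at least 4 GiB of memory per vCPU.
candidates = shortlist(min_gib_per_vcpu=4, vector_engine="AVX512")
print(candidates)
```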
High network bandwidth compute instances: C5n, P3dn, i3en
C5n
• First “network optimized” instances on AWS
• Will deliver up to 100 Gbps network throughput
• Instances based on C5/P3/i3 instances:
– Intel Skylake/Broadwell CPUs
– Nitro System (hypervisor and ENA)
• Intended for network-intensive applications including HPC
High bandwidth compute instances: C5n
[Diagram: HPC stack on AWS: 3D graphics virtual workstation; license managers and cluster head nodes with job schedulers; cloud-based, auto-scaling HPC clusters; shared file storage; storage cache]
Massively scalable performance
• C5n Instances will offer up to 100 Gbps of network bandwidth
• Significant improvements in maximum bandwidth, packets per second, and packet processing
• Custom designed Nitro network cards
• Purpose-built to run network bound workloads including distributed cluster and database workloads, HPC, real-time communications and video streaming
High bandwidth compute instances: P3dn
Optimized for distributed ML training
• One of the most powerful GPU instances available in the cloud
• Distributed machine learning training across multiple GPU instances
• 100 Gbps of networking throughput
• Based on NVIDIA’s latest GPU, the Tesla V100, with 32 GB of memory each
• The largest Amazon Elastic Compute Cloud (Amazon EC2) P3 instance size available
High clock speed compute instances: Z1d
Up to 4 GHz sustained, all-turbo performance
• Z1d instances are optimized for memory-intensive, compute-intensive applications
• Custom Intel Xeon Scalable processor
• Up to 4 GHz sustained, all-turbo performance
• Up to 384 GiB DDR4 memory
• Enhanced networking, up to 25 Gbps throughput
Network
“It’s as fast as its SLOWEST Component”
AWS is Committed to Networking at Scale
• The AWS Network is Custom Built
– Full bi-section bandwidth in placement groups
– Designed such that all ports can run flat out
– No blocking, no oversubscription
– Continuously improving
– Commodity parts on a Moore’s Law Pace
• Enhanced Networking
– Reduced instance-to-instance latency
– Reduced jitter
• Amazon Elastic Network Adapter
– New PCI network device developed for EC2
– Available on newer instances, including C5, M5, R4, R5, Z1d
– Ability to scale across a variety of bandwidths
• 10 and 20 Gbps today
“We love where we are right now,” AND “It will only get better!”
ELASTIC NETWORK ADAPTER
• Latest generation of Enhanced Networking
• Hardware Checksums
• Multi-Queue Support
• Receive Side Steering
• 25 Gbps in a Placement Group
• Open Source Amazon Network Driver
Remember this?
[Diagram: Availability Zones (AZ) linked by transit centers]
Our Network Needs to Scale!
22 Regions – 69 Availability Zones – 87 Edge Locations
Elastic Fabric Adapter (EFA)
Scale tightly-coupled HPC applications on AWS
Elastic Fabric Adapter, best for large HPC workloads
Instances shown: C5n, P3dn, i3en
AWS Elastic Fabric Adapter – EFA
• Proprietary, AWS-designed fabric network
• Built on top of network optimized instances on AWS
• Delivering up to 100 Gbps network throughput
• Delivering below 15 µs latencies for HPC applications
• Optimized for OpenMPI and other MPI libraries
• Supported on C5n, R5n, M5n, and P3dn instances
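To get a feel for what the latency and throughput figures above mean for message passing, here is a minimal sketch of the common latency-plus-bandwidth estimate (transfer time = latency + size / bandwidth). The model and the function name `transfer_time_us` are assumptions for illustration, not an AWS formula. The slide gives 15 µs and 100 Gbps as bounds, so real transfers will differ.

```python
def transfer_time_us(message_bytes, latency_us=15.0, bandwidth_gbps=100.0):
    """Estimate one-way transfer time in microseconds.

    Simplified model: fixed latency plus serialization time at line rate.
    Defaults are the bounds quoted on the slide, not measured values.
    """
    bits = message_bytes * 8
    serialization_us = bits / (bandwidth_gbps * 1e9) * 1e6
    return latency_us + serialization_us


for size in (8, 64 * 1024, 1024 * 1024):
    print(f"{size:>8} bytes: {transfer_time_us(size):8.2f} us")
```

Under this model small messages are dominated by latency and large messages by bandwidth, which is why tightly-coupled MPI codes care about both numbers.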
Amazon Confidential – provided under NDA
HPC software stack in Amazon EC2
[Diagram: userspace and kernel layers of the stack, without EFA vs. with EFA]
What can EFA do?
Thanks to Metacomp Technologies and the Klingon Empire.
OpenFOAM benchmark MotorBike 140M
Storage
Comprehensive portfolio of storage options for HPC
Block storage: Amazon EBS. Elastic, high performance block storage at any scale
File storage: Amazon EFS. Petabyte-scale, elastic file storage sharable across applications, instances and servers
Object storage: Amazon S3. Low cost, highly scalable cloud storage with 99.999999999% durability
Amazon FSx for Lustre:
Fully managed high performance parallel shared file system
High performing
Parallel distributed file system
Tune complex performance parameters
Massively scalable performance
• 100+ GiB/s throughput
• Millions of IOPS
• Consistent sub-millisecond latencies
• Supports concurrent access from hundreds of thousands of cores
• SSD-based
High and scalable performance
Each terabyte (TB) of storage provides 200 MB/second of file system throughput and ~5,000 IOPS
File system throughput & IOPS scale linearly with storage capacity
Each TB of storage provides 200 MB/s of baseline throughput, and up to 12x burst throughput
File systems can scale to hundreds of GB/s and millions of IOPS
Capacity  Baseline throughput  Burst throughput
1TB       200 MB/s             up to 2.4 GB/s
10TB      2 GB/s               up to 24 GB/s
50TB      10 GB/s              up to 120 GB/s
100TB     20 GB/s              up to 240 GB/s
1PB       200 GB/s             at least 240 GB/s
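The linear scaling rule above can be reproduced with a few lines of arithmetic. This sketch assumes decimal units (1 GB/s = 1,000 MB/s, 1 PB = 1,000 TB), as the table appears to use, and the function names are illustrative. The 12x burst ceiling is applied only to the rows up to 100 TB, because the table lists the 1 PB row as "at least 240 GB/s" rather than 12x baseline.

```python
BASELINE_MB_S_PER_TB = 200  # from the slide: 200 MB/s per TB of storage
BURST_MULTIPLIER = 12       # from the slide: up to 12x burst throughput


def baseline_throughput_mb_s(capacity_tb):
    """Baseline throughput in MB/s for a file system of the given size."""
    return capacity_tb * BASELINE_MB_S_PER_TB


def burst_ceiling_mb_s(capacity_tb):
    """Nominal 12x burst ceiling in MB/s (table rows up to 100 TB)."""
    return baseline_throughput_mb_s(capacity_tb) * BURST_MULTIPLIER


for capacity_tb in (1, 10, 50, 100):
    base = baseline_throughput_mb_s(capacity_tb)
    burst = burst_ceiling_mb_s(capacity_tb)
    print(f"{capacity_tb:>4} TB: baseline {base / 1000:g} GB/s, "
          f"burst up to {burst / 1000:g} GB/s")

# 1 PB = 1,000 TB: baseline only, since the table gives no 12x figure here.
print(f"1000 TB: baseline {baseline_throughput_mb_s(1000) / 1000:g} GB/s")
```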
Deployment Tools - Orchestration
Easy cluster management: AWS ParallelCluster
Simplifies deployment of HPC in the cloud, including integrating with popular HPC schedulers
Integrated with AWS Batch, Amazon FSx for Lustre, and Elastic Fabric Adapter
AWS ParallelCluster
• Simplifies deployment of HPC clusters in the cloud
• Integrates with popular HPC schedulers such as:
– SLURM, Grid Engine, Torque
• Built on AWS CloudFormation
• Easy to modify to meet specific application or project requirements
• Latest Features:
– Multiple EBS volumes
– Custom AMI support: bring your own custom AMI, not just build from our default AMI
– Easy use of EFS and Lustre
– Launch Templates support (for EFA)
– Support for C5n and other new instance types
– AWS Batch Integration
• Open source, available on GitHub
AWS Batch
• Dynamically provisions resources
• Plans, schedules, and executes
• No additional components to install
[Diagram labels: Event (changes in data state, requests to endpoints, scheduled triggers); Services (anything); Compute execution: your code; Auto Scaling; Job queue]
Efficient job scheduling: Multi-node parallel job support on AWS Batch
Simplify your compute clusters and scale jobs across multiple instances with AWS Batch support for Multi-node Parallel (MNP) jobs
[Diagram: Instance 1 to Instance 4 and Container 1 to Container 4, each labeled “My job”]
Orchestration tools include support for capacity and cost optimization
Use Reserved Instances for known/steady-state workloads
Scale using Spot, On-Demand, or both
Evaluate the trade-off of time to solution vs. cost for scaling
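The trade-off between time to solution and cost can be sketched with a simple model. Everything numeric below is hypothetical: the job size, the price per instance-hour, and the parallel efficiency figure are made up for illustration and are not AWS prices or measurements. The function name `time_and_cost` is likewise an assumption.

```python
def time_and_cost(core_hours, instances, cores_per_instance,
                  price_per_instance_hour, efficiency=1.0):
    """Return (hours, cost) for a job spread across identical instances.

    efficiency is the assumed parallel efficiency (1.0 = perfect scaling).
    """
    hours = core_hours / (instances * cores_per_instance * efficiency)
    cost = hours * instances * price_per_instance_hour
    return hours, cost


# Hypothetical job: 7,200 core-hours on 36-core instances at 1.00 per hour.
small_hours, small_cost = time_and_cost(7200, 10, 36, 1.00)
# Ten times the instances, assuming scaling efficiency drops to 80%.
large_hours, large_cost = time_and_cost(7200, 100, 36, 1.00, efficiency=0.8)

print(f"10 instances:  {small_hours:.1f} h, cost {small_cost:.2f}")
print(f"100 instances: {large_hours:.1f} h, cost {large_cost:.2f}")
```

With perfect scaling the cost would stay the same as instances are added and only the time would fall. In this made-up case the larger cluster finishes eight times sooner for 25% more cost, which is the kind of trade-off the slide asks you to evaluate.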
4
Customer References
Running HPC applications at extreme scale
“Storage technology is amazingly complex and we’re constantly pushing the limits of physics and engineering to deliver next-generation capacities and technical innovation. This successful collaboration with AWS shows the extreme scale, power and agility of cloud-based HPC to help us run complex simulations for future storage architecture analysis and materials science explorations. Using AWS to easily shrink simulation time from 20 days to 8 hours allows Western Digital R&D teams to explore new designs and innovations at a pace un-imaginable just a short time ago.” – Steve Phillpott, CIO, Western Digital
single HPC cluster of 1 million vCPUs
Accelerating time to innovation
20 days → 8 hours
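As a quick check on what the reduction from 20 days to 8 hours amounts to, the sketch below converts both durations to hours and takes the ratio. The function name `speedup` is illustrative.

```python
def speedup(before_hours, after_hours):
    """Ratio of the original run time to the new run time."""
    return before_hours / after_hours


# Figures from the Western Digital quote: 20 days before, 8 hours after.
western_digital_speedup = speedup(20 * 24, 8)
print(f"speedup: {western_digital_speedup:.0f}x")
```

That is a 60x reduction in simulation time.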
Descartes Labs makes the Top500 List running on AWS
https://medium.com/descarteslabs-team/thunder-from-the-cloud-40-000-cores-running-in-concert-on-aws-bf1610679978
3 million core-hours of Amazon EC2 Spot capacity
https://www.nature.com/articles/s41588-018-0153-5
Complete sequencing of 3.24 billion base pairs
Manage 50X the number of securities
4,000 times faster
In hours, instead of months
Run risk models
Helping financial institutions model investment risks
600 times faster
Engineering simulations
Helping to make supersonic flights mainstream
Flexible configuration and virtually unlimited scalability to grow and shrink your infrastructure as your HPC workloads dictate, not the other way around
HPC on AWS
Thank You!
Any Questions?
For More Information:
aws.amazon.com/hpc/
aws.amazon.com/getting-started

Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

High Performance Computing in AWS, Immersion Day Huntsville 2019

  • 1. HPC on AWS. Brad Dispensa, Pr. Security and Compliance SA, WW Public Sector. Wednesday, October 9, 2019. © 2019 Amazon Web Services, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services, Inc.
  • 2. Talk Outline
      - Brief Overview: Cloud Computing on AWS
      - Why use AWS for HPC?
      - HPC Building Blocks in AWS: Compute, Networking, Storage, Deployment Tools
      - Customer Success Stories and Example Use Cases
  • 3. 1 What is Cloud Computing on AWS?
  • 4. Global Infrastructure [map with a “Coming soon” legend]: 69 Availability Zones within 22 geographic Regions around the world. We add the equivalent of an entire Fortune 500 company’s compute capacity every day.
  • 5. AWS Availability Zones, Data Centers, Servers [diagram labels: AZ ×5, Transit ×2]: over 50,000 servers & often over 80,000
  • 7. Why use AWS for HPC?
      - Amazing Amount of Compute Capacity – Economy of Scale
      - A virtually unlimited number of architecture options: Instance Types, OS, Traditional Cluster, Auto Scaling Clusters, Serverless, GPUs
      - Extensive deployment options – “Infrastructure as Code”: Console, Configuration Control, Automated, SDK, Bash/CLI, AWS CloudFormation
      - Lots of useful services: Amazon DynamoDB, Amazon CloudWatch, Amazon Glacier, and much more!
      [Icon labels: instance with Amazon CloudWatch; Auto Scaling; template; Amazon DynamoDB; AWS Lambda]
  • 8. Great Features for HPC Workloads: Experimentation without Fear! Activate Multiple Compute Clusters Simultaneously! A Supercomputer at the Fingertips of EACH Scientist! Start and stop instances or entire clusters! Take Advantage of Spot Pricing. Receive Continuous Updates (Compute, Network, Storage, All Services). Immediate Access to the latest Technology.
  • 9. Cloud Improves “Workload Throughput”
      The Life of an Average HPC Code on a Supercomputer (https://xdmod.ccr.buffalo.edu/#main_tab_panel:tg_summary):
      - The average number of cores: 14
      - The average wall-clock time: 1.69 hrs
      - The average queue wait time: 4.4 days!
      Consequences:
      - The job queue becomes the capacity buffer
      - Job completion times are hard to predict
      - Users are frustrated and run fewer jobs
      - Innovation is throttled by fixed IT resources
      Think: “Workload Throughput”
      - Run many jobs in parallel
      - Use it when you need it
      - Pay only for what you use
      - Right-size clusters and resources
      - Optimize each workload for performance
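
The averages quoted on slide 9 can be turned into a quick back-of-the-envelope calculation. The sketch below only uses the two figures from the slide (1.69 hours of wall-clock time, 4.4 days of queue wait); the function names are illustrative and not part of any AWS or XDMoD tooling.

```python
# Figures quoted on slide 9 (XDMoD averages for a supercomputer job).
WALL_CLOCK_HOURS = 1.69
QUEUE_WAIT_DAYS = 4.4


def time_to_result_hours(wall_clock_hours: float, queue_wait_days: float) -> float:
    """Total elapsed time from submission to completion, in hours."""
    return queue_wait_days * 24 + wall_clock_hours


def running_fraction(wall_clock_hours: float, queue_wait_days: float) -> float:
    """Share of the elapsed time in which the job is actually computing."""
    total = time_to_result_hours(wall_clock_hours, queue_wait_days)
    return wall_clock_hours / total


total_hours = time_to_result_hours(WALL_CLOCK_HOURS, QUEUE_WAIT_DAYS)
fraction = running_fraction(WALL_CLOCK_HOURS, QUEUE_WAIT_DAYS)
print(f"time to result: {total_hours:.2f} h, of which {fraction:.1%} is compute")
```

With the slide's numbers the job spends well under 2% of its time-to-result actually running, which is the gap the “workload throughput” argument targets.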
  • 10. Time-to-results Efficiency [two charts of Cores in use over time]. Fixed data center capacity limit: finite capacity, usually with long queues to wait in. Cloud: massive capacity when needed to speed up time to results, and an agile environment when additional hardware and software experimentation is needed. “For every $1 spent on HPC, businesses see $463 in incremental revenues and $44 in incremental profit.”
  • 11. What this can look like… Scale using Elastic Capacity [chart: # of Cores, 0 to 75,000]. Time +00h: <1,000 cores
  • 12. What this can look like… Scale using Elastic Capacity [chart: # of Cores, 0 to 75,000]. Time +24h: >75,000 memory optimized cores
  • 13. What this can look like… Scale using Elastic Capacity [chart: # of Cores, 0 to 75,000]. Time +72h: <1,000 cores
  • 14. What this can look like… Scale using Elastic Capacity [chart: # of Cores, 0 to 75,000]. Time +120h: >30,000 GPU optimized cores
  • 16. AWS HPC Building Blocks Outline: Compute, Storage/Data Management, Networking, Deployment Tools
  • 17. “What did you say???” – The AWS Lingo
      Term: Meaning
      - AWS: Amazon Web Services
      - EC2: Elastic Compute Cloud; an AWS service providing virtual machines
      - AMI: Amazon Machine Image, a virtual image for a virtual machine
      - Instance: One launched virtual server
      - EBS: Elastic Block Store, data storage attached to an EC2 Instance
      - VPC: Virtual Private Cloud, your private piece of the cloud
      - S3: Simple Storage Service, amazing object storage service
      - Security Group: The instance firewall
  • 19. CHOICE OF AWS INSTANCES FOR HPC
      - General purpose: M4, M5
      - Compute optimized, core count: C4, C5(n)
      - Memory optimized: R4, R5, X1, z1d
      - Storage and IO optimized: I3, D2
      - GPU, FPGA accelerated: P2, P3dn, F1
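  • 20. Amazon EC2 Instances. Naming, using c4.large as the example: “c” is the Instance Family, “4” is the Instance Generation, “large” is the Instance Size. Vertical Scaling within the family: c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge [size-comparison diagram]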
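
The family / generation / size breakdown on slide 20 can be sketched as a small parser. This is a simplified reading of the naming scheme for illustration, not an AWS API: the `InstanceType` tuple, the `parse_instance_type` helper, and the treatment of trailing letters (such as the “n” in c5n) as “attributes” are all assumptions, and names that do not fit the simple `<family><generation><attributes>.<size>` shape are rejected.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str       # e.g. "c" for compute optimized
    generation: int   # e.g. 4
    attributes: str   # trailing letters such as "n" or "d"; may be empty
    size: str         # e.g. "large", "8xlarge"


# Hypothetical pattern covering names like c4.large, c5n.18xlarge, z1d.12xlarge.
_NAME = re.compile(r"^([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$")


def parse_instance_type(name: str) -> InstanceType:
    """Split an instance type name into family, generation, attributes, size."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type name: {name!r}")
    family, generation, attributes, size = match.groups()
    return InstanceType(family, int(generation), attributes, size)


for example in ("c4.large", "c5n.18xlarge", "z1d.12xlarge"):
    print(example, "->", parse_instance_type(example))
```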
  • 21. Selecting an instance type for HPC
      Columns: Instance Type; vCPU; Memory (GiB); Storage (GB); Networking Performance; Physical Processor; Vector Engine; Clock Speed (GHz); Hypervisor
      - c4.8xlarge; 36; 60; EBS Only; 10 Gigabit; Intel Xeon V3; AVX2; 2.9; Xen based
      - c5(n).18xlarge; 72; 144; EBS Only; 25/100 Gigabit; Intel Xeon Platinum; AVX512; 3.5; Nitro
      - m5(d).24xlarge; 48; 384; EBS/SSD; 25 Gigabit; Intel Xeon Platinum; AVX512; 2.5; Nitro
      - r4.16xlarge; 32; 488; EBS Only; 25 Gigabit; Intel Xeon V4; AVX2; 2.3; Xen based
      - r5(d).24xlarge; 96; 768; EBS/SSD; 25 Gigabit; Intel Xeon Platinum; AVX512; 3.1; Nitro
      - x1.32xlarge; 128; 1,952; SSD; 25 Gigabit; Intel Xeon V3; AVX2; 2.3; Nitro
      - z1d.12xlarge; 48; 384; SSD; 25 Gigabit; Intel Xeon Platinum; AVX512; 4.0; Nitro
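
One way to use a table like slide 21 is to shortlist instance types by the characteristics a workload cares about. The sketch below transcribes a few columns of the slide exactly as shown (it does not query AWS, and the values are the slide's, not verified against current specifications); `InstanceSpec` and `shortlist` are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceSpec:
    name: str
    vcpu: int
    memory_gib: int
    vector_engine: str
    clock_ghz: float


# Values copied from the slide 21 table, in the slide's order.
SLIDE_TABLE = [
    InstanceSpec("c4.8xlarge", 36, 60, "AVX2", 2.9),
    InstanceSpec("c5(n).18xlarge", 72, 144, "AVX512", 3.5),
    InstanceSpec("m5(d).24xlarge", 48, 384, "AVX512", 2.5),
    InstanceSpec("r4.16xlarge", 32, 488, "AVX2", 2.3),
    InstanceSpec("r5(d).24xlarge", 96, 768, "AVX512", 3.1),
    InstanceSpec("x1.32xlarge", 128, 1952, "AVX2", 2.3),
    InstanceSpec("z1d.12xlarge", 48, 384, "AVX512", 4.0),
]


def shortlist(
    specs: List[InstanceSpec],
    min_gib_per_vcpu: float = 0.0,
    vector_engine: Optional[str] = None,
    min_clock_ghz: float = 0.0,
) -> List[InstanceSpec]:
    """Keep the rows that satisfy every requested constraint, in table order."""
    return [
        s for s in specs
        if s.memory_gib / s.vcpu >= min_gib_per_vcpu
        and (vector_engine is None or s.vector_engine == vector_engine)
        and s.clock_ghz >= min_clock_ghz
    ]


# Example: a vectorized, clock-sensitive code versus a memory-hungry one.
print([s.name for s in shortlist(SLIDE_TABLE, vector_engine="AVX512", min_clock_ghz=3.5)])
print([s.name for s in shortlist(SLIDE_TABLE, min_gib_per_vcpu=15)])
```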
  • 22. High network bandwidth compute instances: C5n, P3dn, i3en
      - First “network optimized” instances on AWS
      - Will deliver up to 100 Gbps network throughput
      - Based on C5/P3/i3 instances: Intel Skylake/Broadwell CPUs; Nitro System (hypervisor and ENA)
      - Intended for network-intensive applications including HPC
  • 23. High bandwidth compute instances: C5n. Massively scalable performance
      - C5n instances will offer up to 100 Gbps of network bandwidth
      - Significant improvements in maximum bandwidth, packets per second, and packet processing
      - Custom designed Nitro network cards
      - Purpose-built to run network-bound workloads including distributed cluster and database workloads, HPC, real-time communications and video streaming
      [Sidebar: HPC stack on AWS: 3D graphics virtual workstation; License managers and cluster head nodes with job schedulers; Cloud-based, auto-scaling HPC clusters; Shared file storage; Storage cache]
  • 24. High bandwidth compute instances: P3dn. Optimized for distributed ML training
      - One of the most powerful GPU instances available in the cloud
      - Distributed machine learning training across multiple GPU instances
      - 100 Gbps of networking throughput
      - Based on NVIDIA’s latest GPU, the Tesla V100, with 32 GB of memory each
      - The largest Amazon Elastic Compute Cloud (Amazon EC2) P3 instance size available
  • 25. High clock speed compute instances: Z1d. Up to 4 GHz sustained, all-turbo performance
      - Z1d instances are optimized for memory-intensive, compute-intensive applications
      - Custom Intel Xeon Scalable processor
      - Up to 4 GHz sustained, all-turbo performance
      - Up to 384 GiB DDR4 memory
      - Enhanced networking, up to 25 Gbps throughput
  • 26. Network: “It’s as fast as its SLOWEST Component”
  • 27. AWS is Committed to Networking at Scale
      - The AWS Network is Custom Built
          - Full bisection bandwidth in placement groups
          - Designed such that all ports can run flat out
          - No blocking, no oversubscription
          - Continuously improving
          - Commodity parts on a Moore’s Law pace
      - Enhanced Networking
          - Reduced instance-to-instance latency
          - Reduced jitter
      - Amazon Elastic Network Adapter
          - New PCI network device developed for EC2
          - Available on newer instances, including C5, M5, R4, R5, Z1d
          - Ability to scale across a variety of bandwidths: 10 and 20 Gbps today
      “We love where we are right now,” AND “It will only get better!”
  • 28. ELASTIC NETWORK ADAPTER
      - Latest generation of Enhanced Networking
      - Hardware Checksums
      - Multi-Queue Support
      - Receive Side Steering
      - 25 Gbps in a Placement Group
      - Open Source Amazon Network Driver
  • 29. Remember this? [diagram labels: AZ ×5, Transit ×2] Our Network Needs to Scale! 22 Regions – 69 Availability Zones – 87 Edge Locations
  • 30. Elastic Fabric Adapter (EFA), best for large HPC workloads. Scale tightly-coupled HPC applications on AWS. [Instances shown: C5n, P3dn, i3en]
  • 31. AWS Elastic Fabric Adapter – EFA
      - Proprietary, AWS-designed fabric network
      - Built on top of network optimized instances on AWS
      - Delivering up to 100 Gbps network throughput
      - Delivering below 15 µs latencies for HPC applications
      - Optimized for Open MPI and other MPI libraries
      - Supported on C5n, R5n, M5n, and P3dn instances
      [Slide footer: Amazon Confidential – provided under NDA]
  • 32. HPC software stack in Amazon EC2 [diagram: Userspace and Kernel layers, shown Without EFA and With EFA]
  • 33. What can EFA do? Thanks to Metacomp Technologies and the Klingon Empire.
  • 36. Comprehensive portfolio of storage options for HPC
      - Block storage, Amazon EBS: Elastic, high performance block storage at any scale
      - File storage, Amazon EFS: Petabyte-scale, elastic file storage sharable across applications, instances and servers
      - Object storage, Amazon S3: Low cost, highly scalable cloud storage with 99.999999999% durability
  • 37. Amazon FSx for Lustre: Fully managed high performance parallel shared file system
      - High performing: parallel distributed file system
      - Tune complex performance parameters
      - Massively scalable performance: 100+ GiB/s throughput, millions of IOPS, consistent low latencies
  • 38. High and scalable performance
      - Each terabyte (TB) of storage provides 200 MB/second of file system throughput and ~5,000 IOPS
      - Parallel File System
      - 100+ GiB/s throughput
      - Millions of IOPS
      - Consistent sub-millisecond latencies
      - Supports concurrent access from hundreds of thousands of cores
      - SSD-based
  • 39. File system throughput & IOPS scale linearly with storage capacity. Each TB of storage provides 200 MB/s of baseline throughput, and up to 12x burst throughput. File systems can scale to hundreds of GB/s and millions of IOPS.
      Columns: Capacity; Baseline throughput; Burst throughput
      - 1 TB; 200 MB/s; up to 2.4 GB/s
      - 10 TB; 2 GB/s; up to 24 GB/s
      - 50 TB; 10 GB/s; up to 120 GB/s
      - 100 TB; 20 GB/s; up to 240 GB/s
      - 1 PB; 200 GB/s; at least 240 GB/s
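
The linear scaling rule on slide 39 is simple enough to check by hand. The sketch below encodes the two figures stated on the slide (200 MB/s of baseline throughput per TB and a burst of up to 12x) and assumes decimal units (1 GB/s = 1,000 MB/s, 1 PB = 1,000 TB); the function names are illustrative. Note that the slide's 1 PB row quotes “at least 240 GB/s” for burst, so the 12x ceiling computed here should not be read as applying at that size.

```python
BASELINE_MB_S_PER_TB = 200   # from slide 39
BURST_MULTIPLIER = 12        # "up to 12x burst throughput" from slide 39


def baseline_throughput_mb_s(capacity_tb: int) -> int:
    """Baseline file system throughput for a given capacity, in MB/s."""
    return capacity_tb * BASELINE_MB_S_PER_TB


def burst_ceiling_mb_s(capacity_tb: int) -> int:
    """Nominal 12x burst ceiling in MB/s (the slide's 1 PB row differs)."""
    return baseline_throughput_mb_s(capacity_tb) * BURST_MULTIPLIER


for tb in (1, 10, 50, 100):
    base = baseline_throughput_mb_s(tb)
    burst = burst_ceiling_mb_s(tb)
    print(f"{tb:>4} TB: baseline {base / 1000:g} GB/s, burst up to {burst / 1000:g} GB/s")
```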
  • 40. Deployment Tools - Orchestration
  • 41. Easy cluster management: AWS ParallelCluster Simplifies deployment of HPC in the cloud, including integrating with popular HPC schedulers Integrated with AWS Batch, Amazon FSx for Lustre and Elastic Fabric Adapter
  • 42. AWS ParallelCluster
      - Simplifies deployment of HPC clusters in the cloud
      - Integrates with popular HPC schedulers such as SLURM, Grid Engine, Torque
      - Built on AWS CloudFormation
      - Easy to modify to meet specific application or project requirements
      - Latest features:
          - Multiple EBS volumes
          - Custom AMI support: bring your own custom AMI, not just build from our default AMI
          - Easy use of EFS and Lustre
          - Launch Templates support (for EFA)
          - Support for C5n and other new instance types
          - AWS Batch integration
      - Open source, available on GitHub
  • 43. AWS Batch: Dynamically provisions resources; plans, schedules, and executes; no additional components to install. [Diagram labels: Event; Changes in data state; Requests to endpoints; Services (anything); Scheduled triggers; Compute; Execution; Your code; Auto Scaling; Job queue]
  • 44. Efficient job scheduling: Multi-node parallel job support on AWS Batch. Simplify your compute clusters and scale jobs across multiple instances with AWS Batch support for Multi-node Parallel (MNP) jobs. [Diagram: four instances (Instance 1 to 4) and four containers (Container 1 to 4), each labeled “My job”]
  • 45. Orchestration tools include support for capacity and cost optimization
      - Use Reserved Instances for known/steady-state workloads
      - Scale using Spot, On-Demand, or both
      - Evaluate the trade-off of time to solution vs. cost for scaling
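
The time-to-solution versus cost trade-off on slide 45 can be explored with a toy model. Everything in this sketch is an assumption made for illustration: the price per core-hour, the parallel-efficiency factor, and the `estimate_run` helper are hypothetical and do not reflect actual AWS pricing or any AWS tool.

```python
from typing import Tuple


def estimate_run(
    core_hours_of_work: float,
    cores: int,
    price_per_core_hour: float,
    parallel_efficiency: float = 1.0,
) -> Tuple[float, float]:
    """Return (wall-clock hours, total cost) for a run on `cores` cores.

    parallel_efficiency is the assumed fraction of ideal speedup (0 < e <= 1).
    """
    if cores <= 0 or not 0 < parallel_efficiency <= 1:
        raise ValueError("cores must be positive and efficiency in (0, 1]")
    hours = core_hours_of_work / (cores * parallel_efficiency)
    cost = cores * hours * price_per_core_hour
    return hours, cost


# Hypothetical workload: 48,000 core-hours at a made-up $0.05 per core-hour.
for cores, efficiency in ((1000, 1.0), (4000, 0.8)):
    hours, cost = estimate_run(48_000, cores, 0.05, efficiency)
    print(f"{cores} cores at {efficiency:.0%} efficiency: {hours:.1f} h, ${cost:,.2f}")
```

Under this model, perfect scaling shortens the run at no extra cost, while any loss of parallel efficiency is paid for directly, which is the trade-off the slide asks you to evaluate.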
  • 47. Running HPC applications at extreme scale. Single HPC cluster of 1 million vCPUs. Accelerating time to innovation: 20 days → 8 hours. “Storage technology is amazingly complex and we’re constantly pushing the limits of physics and engineering to deliver next-generation capacities and technical innovation. This successful collaboration with AWS shows the extreme scale, power and agility of cloud-based HPC to help us run complex simulations for future storage architecture analysis and materials science explorations. Using AWS to easily shrink simulation time from 20 days to 8 hours allows Western Digital R&D teams to explore new designs and innovations at a pace unimaginable just a short time ago.” (Steve Phillpott, CIO, Western Digital)
  • 48. Descartes Labs makes the Top500 List running on AWS https://medium.com/descarteslabs-team/thunder-from-the-cloud-40-000-cores-running-in-concert-on-aws-bf1610679978
  • 49.
  • 50. Complete sequencing of 3.24 billion base pairs. 3 million core-hours of Amazon EC2 Spot capacity. https://www.nature.com/articles/s41588-018-0153-5
  • 51. Helping financial institutions model investment risks. Manage 50X the number of securities. Run risk models 4,000 times faster: in hours, instead of months.
  • 52. Helping to make supersonic flights mainstream. Engineering simulations 600 times faster.
  • 53. HPC on AWS: Flexible configuration and virtually unlimited scalability to grow and shrink your infrastructure as your HPC workloads dictate, not the other way around
  • 54. Thank You! Any Questions? For More Information: aws.amazon.com/hpc/ and aws.amazon.com/getting-started