CMP301_Deep Dive on Amazon EC2 Instances

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Deep Dive on Amazon EC2
Instances
Featur ing Per for mance Optimization B est
Pr actices
A d a m B o e g l i n , S r H P C S o l u t i o n s A r c h i t e c t
C M P 3 0 1
N o v e m b e r 2 8 , 2 0 1 7

What to expect from the session
• Understanding the factors that going into choosing an Amazon
EC2 instance
• Defining system performance and how it is characterized for
different workloads
• How Amazon EC2 instances deliver performance while providing
flexibility and agility
• How to make the most of your EC2 instance experience through
the lens of several instance types

API
EC2
EC2
Amazon Elastic Compute Cloud (EC2) is big
Instances
Networking
Purchase options

Amazon EC2 instances
Host server
Hypervisor
Guest 1 Guest 2 Guest n

• First launched in August 2006
• M1 instance
• “One size fits all”
In the past
M1

Amazon EC2 instances history
2006
m1.small
2007
m1.large
m1.xlarge
2008
c1.xlarge
c1.medium
2009
m2.2xlarge
m2.4xlarge
2010
m2.xlarge
cc1.4xlarge
t1.micro
cg1.4xlarge
2011
cc2.8xlarge
2012
m1.medium
hi1.4xlarge
hs1.8xlarge
m3.2xlarge
m3.xlarge
2013
cr1.8xlarge
g2.2xlarge
c3.8xlarge
c4.4xlarge
c3.2xlarge
c3.xlarge
c3.large
i2.xlarge
i2.2xlarge
i2.4xlarge
i2.8xlarge
2014
m3.medium
m3.large
r3.large
r3.xlarge
r3.2xlarge
r3.4xlarge
r3.8xlarge
t2.micro
t2.small
t2.medium
2015
g2.8xlarge
c4.large
c4.xlarge
c4.2xlarge
c4.4xlarge
c4.8xlarge
d2.xlarge
d2.2xlarge
d2.4xlarge
d2.8xlarge
t2.large
m4.large
m4.xlarge
m4.2xlarge
m4.4xlarge
m4.10xlarge
t2.nano
x1.32xlarge
2016
m4.16xlarge
x1.16xlarge
p2.xlarge
p2.8xlarge
p2.16xlarge
r4.large
r4.xlarge
r4.2xlarge
r4.4xlarge
r4.8xlarge
r4.16xlarge
2017
x1e.32xlarge
g3.4xlarge
g3.8xlarge
g3.16xlarge
i3.large
i3.xlarge
i3.2xlarge
i3.4xlarge
i3.8xlarge
i3.16xlarge
f1.2xlarge
f1.16xlarge
c5.19xlarge
c5.9xlarge
c5.4xlarge
c5.2xlarge
c5.xlarge
p3.16xlarge
p3.8xlarge
p3.2xlarge
x1e.16xlarge

Instance generation
c5.xlarge
Instance family Instance size

EC2 instance families
General
purpose
Compute
optimized
C4
Storage and I/O
optimized
I3 P3
Accelerated
Memory
optimized
R4C5
M4
D2
X1
G3
F1
C3

What’s a virtual CPU? (vCPU)
• A vCPU is typically a hyper-threaded physical core*
• On Linux, “A” threads enumerated before “B” threads
• On Windows, threads are interleaved
• Divide vCPU count by two to get core count
• Cores by EC2 and Amazon RDS DB instance type:
https://aws.amazon.com/ec2/virtualcores/
* The “T” family is special

Disable hyper-threading if you need to
• Useful for FPU heavy applications
• Use ‘lscpu’ to validate layout
• Hot offline the “B” threads
for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list |
cut -s -d, -f2- | tr ',' 'n' | sort -un); do
echo 0 | sudo tee /sys/devices/system/cpu/cpu${cpunum}/online
Done
• Set grub to only initialize the first half of all threads
maxcpus=20

Instance sizing
c5.18xlarge 2 - c5.9xlarge
≈
4 - c5.4xlarge
≈
8 - c5.2xlarge
≈

Resource allocation
• All resources assigned to you are dedicated to your instance with no
over-commitment*
• All vCPUs are dedicated to you
• Memory allocated is assigned only to your instance
• Network resources are partitioned to avoid “noisy neighbors”
• Curious about the number of instances per host? Use “dedicated hosts”
as a guide.
*Again, the “T” family is special

“Launching new instances and running tests in
parallel is easy…[when choosing an instance]
there is no substitute for measuring the
performance of your full application.”
—EC2 documentation

Timekeeping explained
• Timekeeping in an instance is deceptively hard
• gettimeofday(), clock_gettime(), QueryPerformanceCounter()
• The TSC
• CPU counter, accessible from userspace
• Requires calibration, vDSO
• Invariant on Sandy Bridge+ processors
• Xen pvclock; does not support vDSO
• On current generation instances, use TSC as clocksource

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#define BILLION 1E9
int main(){
float diff_ns;
struct timespec start, end;
int x;
clock_gettime(CLOCK_MONOTONIC, &start);
for ( x = 0; x < 100000000; x++ ) {
struct timeval tv;
gettimeofday(&tv, NULL);
}
clock_gettime(CLOCK_MONOTONIC, &end);
diff_ns = (BILLION * (end.tv_sec - start.tv_sec)) + (end.tv_nsec - start.tv_nsec);
printf ("Elapsed time is %.4f secondsn", diff_ns / BILLION );
return 0;
}
Benchmarking—time intensive application

[centos@ip-192-168-1-77 testbench]$ strace -c ./test
Elapsed time is 10.0336 seconds
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.99 3.322956 2 2001862 gettimeofday
0.00 0.000096 6 16 mmap
0.00 0.000050 5 10 mprotect
0.00 0.000038 8 5 open
0.00 0.000026 5 5 fstat
0.00 0.000025 5 5 close
0.00 0.000023 6 4 read
0.00 0.000008 8 1 1 access
0.00 0.000006 6 1 brk
0.00 0.000006 6 1 execve
0.00 0.000005 5 1 arch_prctl
0.00 0.000000 0 1 munmap
------ ----------- ----------- --------- --------- ----------------
100.00 3.323239 2001912 1 total
Using the Xen clocksource

[centos@ip-192-168-1-77 testbench]$ strace -c ./test
Elapsed time is 2.0787 seconds
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
32.97 0.000121 7 17 mmap
20.98 0.000077 8 10 mprotect
11.72 0.000043 9 5 open
10.08 0.000037 7 5 close
7.36 0.000027 5 6 fstat
6.81 0.000025 6 4 read
2.72 0.000010 10 1 munmap
2.18 0.000008 8 1 1 access
1.91 0.000007 7 1 execve
1.63 0.000006 6 1 brk
1.63 0.000006 6 1 arch_prctl
0.00 0.000000 0 1 write
------ ----------- ----------- --------- --------- ----------------
100.00 0.000367 53 1 total
Using the TSC clocksource

Tip: Use TSC as clocksource
On Linux:
# cat /sys/devices/system/cl*/cl*/current_clocksource
xen
Change with:
# echo tsc > /sys/devices/system/cl*/cl*/current_clocksource
Or add to grub:
tsc=reliable clocksource=tsc
On Windows 2008 R2 and newer:
• Nothing to do, it selects the best clocksource automatically

• c4.8xlarge, d2.8xlarge, m4.10xlarge,
m4.16xlarge, p2.16xlarge, x1.16xlarge,
x1.32xlarge, etc
• By entering deeper idle states, non-
idle cores can achieve up to 300MHz
higher clock frequencies
• But… deeper idle states require more
time to exit, may not be appropriate
for latency-sensitive workloads
• Linux: limit c-state by adding
“intel_idle.max_cstate=1” to grub
• Windows: no options to control c
states
P-state and C-state control

Tip: P-state control for AVX2
• If an application makes heavy use of AVX2 on all cores, the processor may
attempt to draw more power than it should
• Processor will transparently reduce frequency
• Frequent changes of CPU frequency can slow an application
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo“
See also: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processor_state_control.html

T2 instances
• Lowest cost Amazon EC2 instance at $0.0065 per hour
• Burstable performance
• Fixed allocation enforced with CPU credits
Model vCPU Baseline CPU
credits/
hour
Maximum CPU
credit balance
Memory (GiB)
t2.nano 1 5% 3 72 .5
t2.micro 1 10% 6 144 1
t2.small 1 20% 12 288 2
t2.medium 2 40% (of 200% max) 24 576 4
t2.large 2 60% (of 200% max) 36 864 8
t2.xlarge 4 90% (of 400% max) 54 1296 16
t2.2xlarge 8 135% (of 800% max) 81 1944 32

How credits work
Baseline rate
Credit
balance
Burst
rate
• A CPU credit provides the performance of
a full CPU core for one minute
• An instance earns CPU credits at a steady
rate
• An instance consumes credits when
active
• Credits expire (leak) after 24 hours

Tip: Monitor CPU credit balance

X1 instances
• Largest memory instance with 4 TB of DRAM
• Quad socket, Intel E7 processors with 128 vCPUs
Model vCPU Memory (GiB) Local storage Network
x1.32xlarge 128 1952 2x 1920GB SSD 25Gbps
x1.16xlarge 64 976 1x 1920GB SSD 10Gbps
x1e.32xlarge 128 3904 2x 1920GB SSD 25Gbps
x1e.16xlarge 64 1952 1x 1920GB SSD 10Gbps
x1e.8xlarge 32 976 960 GB SSD Up to 10Gbps
x1e.4xlarge, x1e.2xlarge, x1e.xlarge…
In-memory databases, big data processing, HPC workloads

• Non-uniform memory access
• Each processor in a multi-CPU system has local
memory that is accessible through a fast
interconnect
• Each processor can also access memory from other
CPUs, but local memory access is a lot faster than
remote memory
• Performance is related to the number of CPU
sockets and how they are connected—Intel
QuickPath Interconnect (QPI)
NUMA

r4.16xlarge
QPI
244GB 244GB
32 vCPUs 32 vCPUs

x1e.32xlarge
QPI
QPI
QPIQPI
QPI
976GB
976GB
976GB
976GB
32 vCPU’s 32 vCPU’s
32 vCPU’s 32 vCPU’s

Tip: Kernel support for NUMA balancing
• Linux Kernel 3.8+ and Windows Datacenter 2003+ support NUMA balancing
• Not always the most efficient
• Overhead of managing memory can slow down your app
• Does your app have more memory that fits in a single socket?
• Linux: set “numa=off” in grub to disable NUMA awareness
• Do you have many processes or a footprint less than a single socket?
• Linux: use “numactl” to restrict them to specific cores or nodes
• Example: numactl --cpunodebind=0 --membind=0 ./myapp.run
• Windows: use processor affinity to lock applications to specific cores

Operating systems impact performance
• Memory intensive web application
• Created many threads
• Rapidly allocated/deallocated memory
• Comparing performance of RHEL6 and RHEL7
• Notice high amount of “system” time in top
• Found a benchmark tool (ebizzy) with a similar performance profile
• Traced it’s performance with “perf”

On RHEL6
[ec2-user@ip-172-31-12-150-RHEL6 ebizzy-0.3]$ sudo perf stat ./ebizzy -S 10
12,409 records/s
real 10.00 s
user 7.37 s
sys 341.22 s
Performance counter stats for './ebizzy -S 10':
361458.371052 task-clock (msec) # 35.880 CPUs utilized
10,343 context-switches # 0.029 K/sec
2,582 cpu-migrations # 0.007 K/sec
1,418,204 page-faults # 0.004 M/sec
10.074085097 seconds time elapsed

RHEL6 flame graph output
www.brendangregg.com/flamegraphs.html

On RHEL7
[ec2-user@ip-172-31-7-22-RHEL7 ~]$ sudo perf stat ./ebizzy-0.3/ebizzy -S 10
425,143 records/s
real 10.00 s
user 397.28 s
sys 0.18 s
Performance counter stats for './ebizzy-0.3/ebizzy -S 10':
397515.862535 task-clock (msec) # 39.681 CPUs utilized
25,256 context-switches # 0.064 K/sec
2,201 cpu-migrations # 0.006 K/sec
14,109 page-faults # 0.035 K/sec
10.017856000 seconds time elapsed
Up from 12,400 records/s!
Down from 1,418,204!

RHEL7 flame graph output

Split driver model
Hardware
Driver Domain Guest Domain Guest Domain
VMM
Frontend
driver
Frontend
driver
Backend
driver
Device
Driver
Physical CPU
Physical
Memory
Network
Device
Virtual CPU
Virtual
Memory
CPU
Scheduling
Sockets
Application
1
23
4
5

Device pass through: Enhanced networking
• SR-IOV eliminates need for driver domain
• Physical network device exposes virtual function to instance
• Requires a specialized driver, which means:
• Your instance OS needs to know about it
• Amazon EC2 needs to be told your instance can use it

After enhanced networking
Hardware
Driver Domain Guest Domain Guest Domain
VMM
NIC
Driver
Physical
CPU
Physical
Memory
SR-IOV
Network Device
Virtual CPU
Virtual
Memory
CPU
Scheduling
Sockets
Application
1
2
3
NIC
Driver

• Next generation of enhanced
networking
• Hardware checksums
• Multi-queue support
• Receive side steering
• 25 Gbps in a placement group
• New open source Amazon network
driver
Elastic network adapter

Network performance
• 25 Gigabit and 10 Gigabit
• Measured one-way, double that for bi-directional (full duplex)
• High, moderate, low–a function of the instance size and EBS optimization
• Not all created equal–test with iperf if it’s important!
• Use placement groups when you need high and consistent instance-to-
instance bandwidth
• All traffic limited to 5 Gb/s when exiting EC2

• R4, I3, G3, F1, X1, P3, C5
• Largest sizes provide consistent 10 Gbps and 25
Gbps
• Smaller sizes
• Up to 10 Gbps with baseline
• Accrue credits when network usage below
baseline
• Full bandwidth without placement group, but do
need multiple streams
• Single stream limited to 10 Gbps in
placement group
• 5 Gbps per stream across AZ’s, but up to
25 Gbps total
ENA on R4 and beyond

P3 GPU Instances
• NVIDIA Tesla V100 Volta Based GPU
• 1 PetaFLOP of computational performance in a single instance
• Provides up to 14X performance improvement over P2 for Machine Learning use-cases
• Up to 2.6X performance improvement over P2 for High-Performance Computing use-cases
P3
Instance Size GPUs
Accelerator
(V100)
GPU Peer to
Peer
GPU Memory
(GB)
vCPUs
Memory
(GB)
Network
Bandwidth
EBS
Bandwidth
P3.2xlarge 1 1 No 16 8 61 Up to 10Gbps 1.7Gbps
P3.8xlarge 4 4 NVLink 64 32 244 10Gbps 7Gbps
P3.16xlarge 8 8 NVLink 128 64 488 25Gbps 14Gbps

F1 FPGA instances
• Up to 8 Xilinx Virtex UltraScale Plus VU9p FPGAs in a single instance with four high-speed
DDR-4 per FPGA
• Largest size includes high performance FPGA interconnects via PCIe Gen3 (FPGA Direct), and
bidirectional ring (FPGA Link)
• Designed for hardware-accelerated applications including financial computing, genomics,
accelerated search, and image processing
F1
Instance size FPGAs FPGA
Link
FPGA
Direct
vCPUs Memory
(GiB)
NVMe
instance
storage
Network
bandwidth
f1.2xlarge 1 - 8 122 1 x 470 Up to 10
Gigabit
f1.16xlarge 8 Y Y 64 976 4 x 940 25 Gbps

What’s new with C5?
• Custom KVM based hypervisor
• More resources for EC2 instances
• ENA Only
• No netfront (vif) driver failback
• Install ENA driver on r4 or similar instance
• Disable “Predictable Network Interface Names”
• Add “net.ifnames=0” to grub
• Remove “/etc/udev/70-persistent-net.rules” before creating AMI
• NVME based EBS storage
• Use the most recent kernel/NVME driver that your operating system supports!
• You’ll need to have the NVME driver available in initramfs
• Ensure you’re using UUIDs or LABELs in /etc/fstab
• NVME Drivers have timeouts that “blkfront” does not
• Maximum of 27 PCI Devices
• EBS Volumes + ENI’s
• ACPI Signals for shutdown (use acpid)

2009–Longer ago than you think
• The 2.6.32 Linux Kernel for RHEL6 was released 2009
• Doesn’t support latest AWS networking (ENA) out of the box
• Outdated Xen virtualization drivers spend more time in the hypervisor
• Older NVME driver does not perform as well for IO
Linux: use 3.10+ kernel (or newer for NVME instances)
• Current Amazon Linux
• Ubuntu 14.04 or later
• RHEL/Centos 7 or later
• Or use the 3.10 (kernel-lt) or 4.x (kernel-ml) version available from ELREPO in
RHEL/Centos 6
Windows:
• Windows Server 2008 R2, Windows Server 2012 R2,
Windows Server 2016

EBS performance
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
Instance
type
EBS-
optimized
by default
Max EBS
bandwidth
(Mbps)*
Expected EBS
throughput
(MB/s)**
Max. IOPS
(16 KB I/O
size)**
Max network
bandwidth
r4.16xlarge Yes 14,000 1,750 75,000 25 Gb/s
i3.16xlarge Yes 14,000 1,750 65,000 25 Gb/s
m4.16xlarge Yes 10,000 1,250 65,000 25 Gb/s
p2.16xlarge Yes 10,000 1,250 65,000 25 Gb/s
x1.32xlarge Yes 10,000 1,250 65,000 10 Gb/s
i3.8xlarge Yes 7,000 850 32,500 10 Gb/s
r4.8xlarge Yes 7,000 875 37,500 10 Gb/s
• Instance size affects throughput
• Match your volume size and type to your instance
• RAID multiple EBS volumes together to achieve:

Summary: Getting the most out of EC2
instances
• Use modern, up to date operating systems
• Timekeeping: use TSC
• C state and P state controls
• Monitor T2 CPU credits
• NUMA balancing
• Enhanced networking
• Profile your application

Virtualization themes
• Bare metal performance goal, and in many scenarios already there
• History of eliminating hypervisor intermediation and driver domains
• Hardware-assisted virtualization
• Scheduling and granting efficiencies
• Device pass through

Next steps
• Visit the Amazon EC2 documentation
• Launch an instance and try your app!

Thank you!

CMP301_Deep Dive on Amazon EC2 Instances

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a CMP301_Deep Dive on Amazon EC2 Instances

Semelhante a CMP301_Deep Dive on Amazon EC2 Instances (20)

Mais de Amazon Web Services

Mais de Amazon Web Services (20)

CMP301_Deep Dive on Amazon EC2 Instances