Charts from NITK-IBM Computer Systems Research Group (NCSRG)
- Dennard Scaling, Moore's Law, OpenPOWER, Storage Class Memory, FPGA, GPU, CAPI, OpenCAPI, Nvidia NVLink, Google/Microsoft heterogeneous system usage
1. IBM Confidential
Heterogeneous Computing
The Future of Systems
Anand Haridass
Senior Technical Staff Member
IBM Cognitive Systems
NITK (KREC) – Batch of ‘95 (E&C)
IBM Academy of Technology
NITK-IBM Computer Systems Research Group (NCSRG)
Seminar Sep/18/2017
2. 2
Agenda
System Overview
Technology Trends – End of Dennard Scaling
Vertical Integration - OpenPOWER
“Feeding the Engine” – Memory / Storage
Need for High Performance Bus – OpenCAPI
GPU Attach - NVLINK
Accelerator Examples
3. 3
Von Neumann Architecture
• First published by John von Neumann in 1945.
• Design consists of a Control Unit, Arithmetic & Logic Unit (ALU), Memory Unit, Registers & Inputs/Outputs.
• Stored-program computer concept: program instructions and data are stored in the same memory.
• Most servers & PCs produced today use this design.
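To make the stored-program idea concrete, a tiny illustrative sketch (not from the talk): a toy machine whose instructions and data share one memory array, executed by a fetch-decode-execute loop with a program counter and an accumulator.

/* Toy stored-program machine: code and data live in the same memory. */
#include <stdio.h>

enum { LOAD, ADD, STORE, HALT };

int main(void)
{
    int mem[10] = {
        LOAD, 7,      /* acc = mem[7]       */
        ADD, 8,       /* acc += mem[8]      */
        STORE, 9,     /* mem[9] = acc       */
        HALT,
        2, 3, 0       /* data: 2, 3, result */
    };
    int pc = 0, acc = 0, running = 1;

    while (running) {
        int op = mem[pc++];                     /* fetch */
        switch (op) {                           /* decode + execute */
        case LOAD:  acc  = mem[mem[pc++]]; break;
        case ADD:   acc += mem[mem[pc++]]; break;
        case STORE: mem[mem[pc++]] = acc;  break;
        case HALT:  running = 0;           break;
        }
    }
    printf("2 + 3 = %d\n", mem[9]);             /* prints 5 */
    return 0;
}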
4. 4
Typical 2 Socket Systems [2017]
[Diagram: two-socket system – each CPU with its own memory, I/O / storage / network, and an attached accelerator.]
5. 5
Processor Technology Trends
Moore's Law – Alive & Kicking
Moore's Law (1965): "The number of transistors in a dense integrated circuit doubles approximately every two years."
6. 6
Dennard Scaling Limits
Dennard scaling: as transistors get smaller, their power density stays constant, so power use stays in proportion with area; both voltage and current scale (downward) with length.
Power requirements are proportional to area (both voltage & current being proportional to length). Transistor dimensions are scaled by 30% (0.7x) every technology generation, reducing their area by 50%. This reduces the delay by 30% (0.7x) and therefore increases operating frequency by about 40% (1.4x). To keep the electric field constant, voltage is reduced by 30%, reducing switching energy by 65% and power (at 1.4x frequency) by 50%.
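The 0.7x / 1.4x / 50% figures above all follow from constant-field scaling; in symbols (standard derivation, with C the switched capacitance and \kappa the linear scale factor per generation):

\begin{aligned}
L,\; V,\; I \;&\to\; \kappa L,\; \kappa V,\; \kappa I, \qquad \kappa \approx 0.7 \\
\text{Area: } A \;&\to\; \kappa^{2} A \approx 0.5\,A \\
\text{Delay: } t_{d} \;&\to\; \kappa\, t_{d} \;\Rightarrow\; f \to f/\kappa \approx 1.4\,f \\
\text{Switching energy: } E = C V^{2} \;&\to\; \kappa^{3} E \approx 0.35\,E \\
\text{Power: } P = C V^{2} f \;&\to\; \kappa^{2} P \approx 0.5\,P \\
\text{Power density: } P/A \;&\to\; \text{constant}
\end{aligned}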
• Voltage scaling for high-performance designs is limited
• By leakage issues: can’t reduce threshold voltages
• Need steeper sub-threshold slopes
• Limited by variability, especially VT variability
• Need to minimize random dopant fluctuations
• Limited by gate oxide thickness
• Some relief from high-K materials
• Limited voltage scaling + decreasing feature sizes lead to increasing electric fields
• New device structures needed (FinFETs)
• Reliability challenges (devices and wires)
7. 7
CMOS Power - Performance Scaling
Where this curve is flat, chip frequency can only be improved by:
a) Pushing the core/chip to higher power density (air cooling limits)
b) Design power efficiency improvements (low-hanging fruit all gone)
[Chart: relative performance metric at constant power density vs. feature pitch (10 µm down to 0.01 µm, log scale); the curve rises steeply "when scaling was good" and flattens at small geometries.]
12. 12
End customers don't care about frequency / single-thread performance & other 'processor' metrics
Cost/Performance is the metric
Processors
Semiconductor Technology
Industry trends, Challenges & Opportunities
Microprocessors alone no longer drive sufficient Cost/Performance improvements
15. 15
Materials Innovations - Increased Complexity & Cost
GlobalFoundries projects that a computer chip manufacturing plant in NY would cost $14.7 billion to build
16. 16
“Data Access” Performance
(bandwidth & latency) & Cost
(Power) still very challenging
Some techniques to hide
latency/bw/pwr
Caches
Locality optimization
Out-of-order execution
Multithreading
Pre-fetching
"Fat" pipes / Memory Buffers
[Chart: the latency gap between memory and storage (in ns); storage class memory targets the 100–1000 ns range. Source: SNIA.]
“Feeding the Engine” Challenge
17. 17
Memory / Storage
[Chart: access latency in processor cycles (at 4 GHz) on a log2 axis from 2^1 to 2^23. "Memory calls" (load/store) – L1/L2 SRAM, L3/L4, DRAM – sit at the low end; "I/O calls" (read/write) – Flash, HDD – sit orders of magnitude higher. Storage class memory targets the gap between DRAM and Flash. Source: H. Hunter, IBM.]
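A back-of-the-envelope conversion of access latencies into 4 GHz cycles shows the gap the chart illustrates. The latency figures below are common ballpark values, not numbers taken from the chart.

/* Convert typical (assumed) access latencies into 4 GHz processor cycles. */
#include <stdio.h>

int main(void)
{
    const double cycles_per_ns = 4.0;            /* 4 GHz core */
    const struct { const char *name; double ns; } tier[] = {
        { "L1 cache",             1.0 },
        { "DRAM",               100.0 },
        { "Storage class mem",  500.0 },         /* 100-1000 ns band */
        { "NAND Flash read",    100e3 },
        { "HDD seek",             5e6 },
    };

    for (int i = 0; i < 5; i++)
        printf("%-18s %12.0f ns = %12.0f cycles\n",
               tier[i].name, tier[i].ns, tier[i].ns * cycles_per_ns);
    return 0;
}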
NVMe – Non-Volatile Memory Express (PCIe)
• Standardized high-performance interface for PCI Express SSDs. Available today in three form factors: PCIe add-in card, SFF 2.5" and M.2
• PCIe Gen3 (today): x8 ~8 GB/s [x4 ~4 GB/s, x2 ~2 GB/s] vs. SAS 12 Gb/s [1.5 GB/s per port]
• PCIe Gen4 (2018): x8 ~16 GB/s [x4 ~8 GB/s, x2 ~4 GB/s] vs. SAS 24 Gb/s [3 GB/s per port]
• NVMe over Fabrics (low-latency RDMA access): <10 µs including switches
• CAPI-based Flash (today): x16 (16 GB/s) – at faster access latencies (more on this later)
HBM (High Bandwidth Memory)
• 3D-stacked DRAM from AMD/Hynix/Samsung
• HBM2: 256 GB/s, ~4 GB/package (8 DRAM dies, TSV-stacked)
• 1024 bits × 2 GT/s
• HBM3: 512 GB/s, ~2020 time frame
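The headline bandwidth numbers in the NVMe and HBM bullets follow directly from link width and transfer rate; a quick check (using the standard 128b/130b PCIe Gen3/Gen4 encoding):

/* Rough bandwidth arithmetic behind the figures quoted above. */
#include <stdio.h>

int main(void)
{
    double gen3_lane = 8.0  * (128.0 / 130.0) / 8.0;   /* ~0.98 GB/s per lane */
    double gen4_lane = 16.0 * (128.0 / 130.0) / 8.0;   /* ~1.97 GB/s per lane */

    printf("PCIe Gen3 x8: %.1f GB/s per direction\n", 8 * gen3_lane);   /* ~7.9  */
    printf("PCIe Gen4 x8: %.1f GB/s per direction\n", 8 * gen4_lane);   /* ~15.8 */
    printf("HBM2 stack:   %.0f GB/s (1024 bits x 2 GT/s)\n",
           1024.0 * 2.0 / 8.0);                                         /* 256   */
    return 0;
}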
NVDIMM
• Persistent memory solution on the DDR interface
• Combines DRAM, NAND Flash and a power source
• Delivers DRAM read/write performance with the persistence & reliability of NAND
18. 18 Source: SNIA
The Contenders
https://www.snia.org/sites/default/files/NVM/2016/presentations/Panel_1_Combined_NVM_Futures%20Revision.pdf
19. 19
Function offload – greater concurrency & utilization
Power efficiency (performance/watt)
Workloads
Encryption/decryption, compression/decompression, encoding/decoding, network controllers, math libraries, DB queries, search
Deep learning (arms race!) for training & inferencing
Hardware Acceleration
Types of Accelerators
General Purpose GPU / Many Integrated Core (MIC)
Nvidia Tesla/Volta, Intel Xeon Phi, AMD Radeon
Field Programmable Gate Array (FPGA)
Xilinx, Altera (now Intel)
Purpose Built / Custom ASIC’s
Google’s TPU
Intelligent Network Controllers
Cavium ARM-accelerated NIC
Mellanox NIC+FPGA
Microsoft FPGA-only network adapter
Traditionally ("I/O" limited): sequential instructions run on the processor, parallel compute is offloaded to the accelerator
Penalty for "I/O" operations is heavy
20. 20
HPC & Hyper-scale datacenters (Cloud) are driving need for higher network bandwidth
HPC & Deep learning require more bandwidth between accelerators and memory
PCI Express has limitations (coherence / bandwidth / protocol overhead)
Desired Attributes
Low Latency / High Bandwidth / Coherence
Emergence of complex storage & memory solutions (BW & latency & heterogeneity)
Growing demand for network performance (BW & latency)
Various form factors (e.g., GPUs, FPGAs, ASICs, etc.)
Open standard for broad industry, architecture agnostic participation / avoid vendor lock-in
Volume pricing advantages & Broad software ecosystem growth and adoption
Vendor specific variants
Intel Omni Path Architecture, Nvidia Nvlink, AMD Hypertransport
Open Standards evolving
Cache Coherent Interconnect for Accelerators (CCIX) www.ccixconsortium.com
Gen-Z genzconsortium.org
Open Coherent Accelerator Processor Interface (OpenCAPI) opencapi.org
Need for High Performance Next Generation Bus/Interconnect
21. 21
Coherent Accelerator Processor Interface (CAPI) - 2014
[Diagram: Power processor with the Coherent Accelerator Processor Proxy (CAPP) connected over PCIe to an FPGA; the FPGA carries the IBM-supplied POWER Service Layer plus accelerator functions 0..n, together forming the CAPI attach.]
Virtual Addressing
Removes the requirement for pinning system memory for PCIe
transfers
Eliminates the copying of data into and out of the pinned DMA buffers
Eliminates the operating system call overhead to pin memory for
DMA
Accelerator can work with same addresses that the processors use
Pointers can be dereferenced the same as by the host application
- Example: enables traversing data structures
Coherent Caching of Data
Enables an accelerator to cache data structures
Enables Cache to Cache transfers between accelerator and processor
Enables the accelerator to participate in “Locks” as a normal thread
Elimination of Device Driver
Direct communication with Application
No requirement to call an OS device driver or Hypervisor function for
mainline processing
Enables Accelerator Features not possible with PCIe
Enables efficient Hybrid Applications
Applications partially implemented in the accelerator and partially on
the host CPU
Visibility to full system memory
Simpler programming model for Application Modules
Coherent Accelerator Processor Proxy (CAPP)
– Proxy for FPGA Accelerator on PowerBus
– Integrated into Processor
– Programmable (Table Driven) Protocol for CAPI
– Shadow Cache Directory for Accelerator
• Up to 1MB Cache Tags (Line based)
• Larger block based Cache
POWER Service Layer (PSL)
– Implemented in FPGA Technology
– Provides Address Translation for Accelerator
• Compatible with POWER Architecture
– Provides Cache for Accelerator
– Facilities for downloading Accelerator Functions
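As a rough illustration of the programming model above, a minimal host-side sketch using IBM's user-space libcxl library: the application hands the AFU an ordinary pointer via the work element descriptor (WED) at attach time, with no pinning or copying. The device path, WED layout and completion convention are illustrative assumptions – a real AFU defines its own.

/* Hedged sketch: attach to a CAPI AFU and pass it a plain user-space pointer. */
#include <libcxl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct wed {                  /* hypothetical work element descriptor */
    uint64_t src;             /* effective address of the input buffer */
    uint64_t len;             /* length in bytes */
    volatile uint64_t done;   /* AFU sets this when finished (assumed) */
};

int main(void)
{
    size_t len = 1 << 20;
    char *buf = malloc(len);  /* ordinary memory: no pinning, no DMA copy */
    struct wed w = { (uint64_t)(uintptr_t)buf, len, 0 };

    struct cxl_afu_h *afu = cxl_afu_open_dev("/dev/cxl/afu0.0d");  /* example path */
    if (!afu) { perror("cxl_afu_open_dev"); return 1; }

    /* Attach to this process's context: the AFU now translates addresses
     * through the same page tables as the application. */
    if (cxl_afu_attach(afu, (uint64_t)(uintptr_t)&w)) { perror("attach"); return 1; }

    while (!w.done)           /* wait for the accelerator to signal completion */
        ;

    cxl_afu_free(afu);
    free(buf);
    return 0;
}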
22. 22
How CAPI Works
[Diagram: POWER8 processor and a CAPI Developer Kit FPGA card connected over PCIe, sharing the same memory space; the accelerator is a peer to the POWER8 core. The algorithm splits into an application portion (data set-up, control) on the processor and an acceleration portion (data- or compute-intensive work, storage or external I/O) on the card.]
Coherent Accelerator Processor Interface (CAPI) - 2014
Accelerator is a Full Peer to Processor
Accelerator Function(s) use an unmodified
Effective address
Full access to Real address space
Utilize Processor’s Page Tables Directly
Page Faults handled by System Software
Multiple Functions can exist in a single
Accelerator
23. 23
I/O Attached Accelerator
[Diagram: six POWER8 cores and an application share the memory subsystem (virtual addresses); an FPGA is attached over PCIe through a device driver (DD) with its own storage area, so the variables, input data and output data exist as multiple separate copies.]
An application called a device driver to utilize an FPGA Accelerator.
The device driver performed a memory mapping operation.
3 versions of the data (not coherent).
1000s of instructions in the device driver.
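For contrast, a schematic sketch of the classic I/O-attached path the slide describes; the device node, ioctl request code and command structure here are hypothetical – the point is the extra driver call and the extra data copies.

/* Classic I/O-attached offload (schematic; device and ioctl are hypothetical). */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct accel_cmd { void *buf; size_t len; };   /* hypothetical driver ABI   */
#define ACCEL_RUN 0x1234                       /* hypothetical ioctl number */

int main(void)
{
    size_t len = 1 << 20;
    char *app_data = malloc(len);              /* application's copy        */
    char *dma_buf  = malloc(len);              /* driver-visible copy       */

    memset(app_data, 0xAB, len);               /* fill with example data    */
    memcpy(dma_buf, app_data, len);            /* extra copy of the data    */

    int fd = open("/dev/accel0", O_RDWR);      /* device-driver call        */
    struct accel_cmd cmd = { dma_buf, len };
    ioctl(fd, ACCEL_RUN, &cmd);                /* kernel transition, pinning,
                                                  MMIO, interrupt handling  */
    memcpy(app_data, dma_buf, len);            /* copy the result back      */

    close(fd);
    free(dma_buf);
    free(app_data);
    return 0;
}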
24. 24
CAPI Coherency
[Diagram: the same six POWER8 cores and application sharing the memory subsystem (virtual addresses); with CAPI, the FPGA (through the PSL) attaches over PCIe and shares memory with the cores – a single copy of the variables, input data and output data.]
1 coherent version of the data.
No device driver call/instructions.
25. 25
CAPI vs. I/O Device Driver: Data Prep

Typical I/O Model Flow:
DD call (~300 instructions) → Copy or pin source data (~10,000 instructions) → MMIO notify accelerator (~3,000 instructions) → Acceleration (application dependent, equal in both flows) → Poll / interrupt completion (~1,000 instructions) → Copy or unpin result data (~1,000 instructions) → Return from DD / completion
~7.9 µs before and ~4.9 µs after the acceleration step: total ~13 µs for data prep

Flow with a Coherent Model:
Shared-memory notify accelerator (~400 instructions, ~0.3 µs) → Acceleration (application dependent, equal in both flows) → Shared-memory completion (~100 instructions, ~0.06 µs)
Total ~0.36 µs for data prep – roughly a 36x reduction in overhead
26. 26
IBM Accelerated GZIP Compression
An FPGA-based low-latency GZIP compressor & decompressor with single-thread throughput of ~2 GB/s and a compression ratio significantly better than low-CPU-overhead compressors like Snappy.
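The slide does not say how the compressor is exposed to software; a common pattern for this kind of accelerator is a zlib-compatible library, in which case host code looks like ordinary zlib usage. The sketch below uses the stock zlib API; an accelerator-backed drop-in library is an assumption, not something the slide states.

/* Host-side view if the accelerator presents a zlib-compatible API.
 * (compress2 produces a zlib-wrapped stream; proper gzip framing would use
 * deflateInit2 with windowBits = MAX_WBITS + 16 – kept simple here.) */
#include <stdio.h>
#include <zlib.h>

int main(void)
{
    const unsigned char src[] = "hello hello hello hello hello";
    unsigned char dst[128], back[128];
    uLongf dlen = sizeof(dst), blen = sizeof(back);

    if (compress2(dst, &dlen, src, sizeof(src), Z_BEST_SPEED) != Z_OK)
        return 1;
    if (uncompress(back, &blen, dst, dlen) != Z_OK)
        return 1;

    printf("%lu bytes -> %lu compressed -> %lu restored\n",
           (unsigned long)sizeof(src), (unsigned long)dlen, (unsigned long)blen);
    return 0;
}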
29. 29
CAPI Acceleration
Acceleration paradigms (the accelerator attaches to the processor chip over TLx/DLx in each pattern):
• Egress Transform – Examples: encryption, compression, erasure prior to network or storage
• Ingress Transform – Examples: video analytics, HFT, VPN/IPsec/SSL, Deep Packet Inspection (DPI), Data Plane Accelerator (DPA), video encoding (H.265), etc.
• Bi-Directional Transform – Examples: NoSQL such as Neo4j with graph node traversals, etc.
• Memory Transform – Examples: basic work offload; machine or deep learning, potentially using OpenCAPI-attached memory
• Needle-in-a-Haystack Engine – Examples: database searches, joins, intersections, merges (the accelerator sifts haystack data and returns only the needles)
OpenCAPI wins due to bandwidth to/from accelerators, best-of-breed latency, and the flexibility of an open architecture
30. 30
NVLink 1
4 links, 20 GB/s per link raw bandwidth in each direction
~160 GB/s total net NVLink bandwidth
NVLink 2
6 links, 25 GB/s per link raw bandwidth in each direction
~300 GB/s total net NVLink bandwidth
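The "total net" figures are simply links × per-link rate × two directions:

/* Aggregate NVLink bandwidth: links x per-link GB/s x 2 directions. */
#include <stdio.h>

int main(void)
{
    printf("NVLink 1: %d GB/s\n", 4 * 20 * 2);   /* 160 GB/s */
    printf("NVLink 2: %d GB/s\n", 6 * 25 * 2);   /* 300 GB/s */
    return 0;
}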
Volta GV100
• 15 TFLOPS FP32
• 16 GB HBM2 – 900 GB/s
• 300 W TDP
• 50 GFLOPS/W (FP32)
• 12 nm process
• 300 GB/s NVLink 2
• Tensor Cores …
Source: Nvidia
NVIDIA GPU
31. 31
“Minsky” S822LC for HPC
• Tight coupling: strong CPU: strong GPU performance
• Equalizing access to memory - for all kinds of programming
• Closer programming to the CPU paradigm
[Diagram, left ("OpenPOWER P8' Design"): two POWER8-with-NVLink (P8') sockets, each with 115 GB/s of DDR4 bandwidth and each connected to a pair of Tesla P100 GPUs over 80 GB/s NVLink. Diagram, right: a two-socket x86 server attaching its GPUs over PCIe at 32 GB/s.]
For x86 servers: PCIe bottleneck – no NVLink between CPU & GPU
2.7x faster query response time on "Minsky"
87% of the total speedup (2.35x of the 2.7x improvement) is due to the NVLink interface between CPU and GPU
• Profiling result based on running Kinetica "Filter by geographic area" queries on a data set of 280 million simulated records, 1 simultaneous query stream with 0 think time.
• Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU; Ubuntu 16.04.
• Competitive stack: 2x Intel Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 512 GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU, Ubuntu 16.04.
33. 33
Google TPU 1.0
[Jouppi et al., ISCA 2017]
[Figure: relative performance/Watt (TDP) of the GPU server (blue) and TPU server (red) versus the CPU server, and of the TPU server versus the GPU server (orange). TPU' is an improved TPU that uses GDDR5 memory; the green bar shows its ratio to the CPU server and the lavender bar its ratio to the GPU server. "Total" includes host server power, "incremental" does not. GM and WM are the geometric and weighted means.]
34. 34
Google TPU performance
[Plot legend: stars = TPU, triangles = K80, circles = Haswell.]
[Jouppi et al., ISCA 2017]
35. 35
Microsoft Azure FPGA Usage
[M.Russinovich, MSBuild 2017]
FPGA for SDN Offload FPGA for Bing
37. 37
Ease of Consumption
Compiler Optimization
Math libraries optimization
Native support for CUDA / OpenMP / OpenCL …
Native support for frameworks, e.g., for deep learning (Torch / TensorFlow / Caffe …)
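As a taste of what such native support buys at the application level, a minimal sketch of directive-based GPU offload using OpenMP 4.x target constructs (a compiler and device with offload support are assumed):

/* axpy offloaded to the default accelerator device via OpenMP target. */
#include <stdio.h>

#define N (1 << 20)
static float x[N], y[N];

int main(void)
{
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The map clauses handle host<->device data movement;
     * the loop iterations run on the accelerator. */
    #pragma omp target teams distribute parallel for map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++)
        y[i] += 2.0f * x[i];

    printf("y[0] = %.1f\n", y[0]);   /* expect 4.0 */
    return 0;
}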
42. 42
When to Use FPGAs
Transistor Efficiency & Extreme Parallelism
Bit-level operations
Variable-precision floating point
Power-Performance Advantage
>2x compared to Multicore (MIC) or GPGPU
Unused LUTs are powered off
Technology Scaling better than CPU/GPU
FPGAs are not frequency or power limited yet
3D has great potential
Dynamic reconfiguration
Flexibility for application tuning at run-time vs. compile-time
Additional advantages when FPGAs are network connected: allows network as well as compute specialization
When to Use GPGPUs
Extreme FLOPS & Parallelism
Double-precision floating point leadership
Hundreds of GPGPU cores
Programming Ease & Software Group Interest
CUDA & extensive libraries
OpenCL
IBM Java (coming soon)
Bandwidth Advantage on Power
Start with PCIe Gen3 x16 and then move to NVLink
Leverage existing GPGPU eco-system and development base
Lots of existing use cases to build on
Heavy HPC investment in GPGPU
45. Use Cases – A truly heterogeneous architecture built upon OpenCAPI
OpenCAPI 3.0
OpenCAPI 3.1
OpenCAPI specifications are downloadable from the website at www.opencapi.org (register, then download)
46. OpenCAPI Advantages for Memory
Open standard interface enables attaching a wide range of devices
OpenCAPI protocol was architected to minimize latency
Especially advantageous for classic DRAM memory
Extreme bandwidth beyond the classical DDR memory interface
Agnostic interface allows extension to evolving memory technologies in the future (e.g., compute-in-memory)
Ability to handle a memory buffer to decouple raw memory and host interfaces to optimize power, cost and performance
Common physical interface between non-memory and memory devices
47. 47
OpenCAPI Key Attributes
• Architecture agnostic bus – Applicable with any system/microprocessor architecture
• Coherency - Attached devices operate natively within application’s user space and coherently with host uP
• High-performance interface design with no 'overhead', optimized for high bandwidth and low latency
• Point to point construct optimized within a system
• Allows attached device to fully participate in application without kernel involvement/overhead
• 25Gbit/sec signaling and protocol to enable very low latency interface on CPU and attached device
• Supports a wide range of use cases and access semantics
• Hardware accelerators
• High-performance I/O devices
• Advanced memories and Classic memory
• Various form factors (e.g., GPUs, FPGAs, ASICs, memory, etc.)
• Reduced complexity of design implementation
• Wanted to make this easy for the accelerator, memory and system design teams
• Moved complexities of coherence and virtual addressing onto the host microprocessor to simplify
attached devices and facilitate interoperability across multiple CPU architectures
48. Virtual Addressing and Benefits
An OpenCAPI device operates in the virtual address spaces of the applications that it supports
• Eliminates kernel and device driver software overhead
• Allows device to operate on application memory without kernel-level data copies/pinned pages
• Simplifies programming effort to integrate accelerators into applications
• Improves accelerator performance
The Virtual-to-Physical Address Translation occurs in the host CPU
• Reduces design complexity of OpenCAPI-attached devices
• Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures
• Security - Since the OpenCAPI device never has access to a physical address, this eliminates the
possibility of a defective or malicious device accessing memory locations belonging to the kernel or
other applications that it is not authorized to access