4. Key Messages Throughout
IBM’s Strong History in IO Offerings
OpenCAPI is an Open IO Standard
Not tied to Power – Architecture Agnostic
High Performance – No OS/Hypervisor/FW Overhead with
Low Latency and High Bandwidth
Programing Ease
Very Low Accelerator Design Overhead
Supports heterogeneous environment – Use Cases
Ideal for Accelerated Computing and SCM
Optimized for within a single system node
Products exist today!
4
6. What’s Driving the Creation of a High Perf. Bus
• Historical silicon technology improvements out of steam
• More cores on a processor help but you’ll never have enough; especially for today’s emerging
workloads (analytics, artificial intelligence, machine learning, real time analysis, etc.)
• New advanced memory technologies are changing the economics of computing
• Companies realizing the need to off load the microprocessors from routine algorithms to
meet demand and improve system performance Accelerated Computing
• Accelerators - If you are going to use them, you need a lot of data in/out quickly
6
Computation Data Access
8. Open Coherent Accelerator Processor Interface
Industry driving these necessary Accelerator attributes
• High performance (move a lot of data quickly; bandwidth, latency)
• Fulfill various accelerator form factors (e.g., GPUs, FPGAs, ASICs)
• Introduction of device coherency requirements
• Emergence of complex Storage and Memory solutions
• Growing demand for network performance with increased
computational demand
• Need to be architecture agnostic to enable the ecosystem growth and
adaption
8
9. Accelerated Computing Example
IBM’s AC922 built for Accelerated Computing
Classic system topology setup with two POWER9 sockets, each socket having at
least two NVIDIA GPUs
Does Accelerated Computing work?
• Summit Supercomputer, an IBM-built supercomputer now running at the
Department of Energy’s (DOE) Oak Ridge National Laboratory, captured the
number one spot with a performance of 143,5 petaflops on High Performance
Linpack (HPL), the benchmark used to rank the TOP500 list. Summit has 4,356
nodes, each one equipped with two 22-core Power9 CPUs, and six NVIDIA Tesla
V100 GPUs. The nodes are linked together with a Mellanox dual-rail EDR
InfiniBand network.
9
11. Why OpenCAPI and what is it?
OpenCAPI is a new ‘bottom’s up’ IO standard
Key Attributes of OpenCAPI 3.0
• Open IO Standard – Choice for developers and others to contribute and grow an ecosystem
• Coherent interface – Microprocessor memory, accelerator and caches share the same memory space
• Architecture agnostic – Capable going beyond Power Architecture
• Not tied to Power – Architecture Agnostic
• High performance – No OS/Hypervisor/FW Overhead for Low Latency and High Bandwidth
• Ease of programing
• Ease of implementation with minimal accelerator design overhead
• Ideal for accelerated computing and SCM including various form factors (FPGA, GPU, ASIC, TPU, etc.)
• Optimized for within a single system node
• Supports heterogeneous environment – Use Cases
OpenCAPI 3.1
Applies OpenCAPI technology for use of standard DRAM off the microprocessor
Based on an Open Memory Interface (OMI)
Further tuned for extreme lower latency
11
12. Accelerated
OpenCAPI Device
OpenCAPI Key Attributes
12
TL/DL 25Gb I/O
Any OpenCAPI Enabled Processor
U
Accelerated
Function
TLx/DLx
1. Architecture agnostic bus – Applicable with any system/microprocessor architecture
2. Optimized for High Bandwidth and Low Latency
3. High performance 25Gbps PHY design with zero ‘overhead’
4. Coherency - Attached devices operate natively within application’s user space and coherently with host microprocessor
5. Virtual addressing enables low overhead with no Kernel, hypervisor or firmware involvement; security benefit
6. Wide range of Use Cases and access semantics
7. CPU coherent device memory (Home Agent Memory)
8. Architected for both Classic Memory and emerging Advanced Storage Class Memory
9. Minimal OpenCAPI design overhead (FPGA less than 5%)
Caches
Application
Storage/Compute/Network etc
ASIC/FPGA/FFSA
FPGA, SOC, GPU Accelerator
Load/Store or Block Access
Standard System Memory Advanced SCM
Solutions
BufferedSystemMemory
OpenCAPIMemoryBuffers
Device Memory
13. Use Cases - A True Heterogeneous Architecture Built Upon OpenCAPI
OpenCAPI 3.0
OpenCAPI 3.1
OpenCAPI specifications are
downloadable from the website
at www.opencapi.org
- Register
- Download
13
15. POWER9 IO Features
8 and 16Gbps PHY
Protocols Supported
• PCIe Gen3 x16 and PCIe Gen4 x8
• CAPI 2.0 on PCIe Gen4
PCIeGen4
P9
25Gbs
25Gbps PHY
Protocols Supported
• OpenCAPI 3.0
• NVLink 2.0
Silicon Die
Various packages
(scale-out, scale-up)
POWER9 IO Leading the Industry
• PCIe Gen4
• CAPI 2.0 (Power)
• NVLink 2.0
• OpenCAPI 3.0
POWER9
15
16. Why do I care about Virtual Addressing?
An OpenCAPI device operates in the virtual address spaces of the applications that it supports
• Eliminates kernel and device driver software overhead
• Allows device to operate on application memory without kernel-level data copies/pinned pages
• Simplifies programming effort to integrate accelerators into applications
• Culmination => Improves Accelerator Performance
The Virtual-to-Physical Address Translation occurs in the host CPU
• Reduces design complexity of OpenCAPI accelerator development
• Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures
• Security - Since the OpenCAPI device never has access to a physical address, this eliminates the
possibility of a defective or malicious device accessing memory locations belonging to the kernel or
other applications that it is not authorized to access
16
17. 17
OpenCAPI vs I/O Device Driver – Because minimizing SW Path Length is crucial for performance
Typical I/O Model Flow using Device Driver
Flow with a OpenCAPI Model
Shared Memory
Notify Accelerator Acceleration
Shared memory completion
or fast thread wake up
DD Call
Copy or Pin
Source Data
MMIO Notify
Accelerator Acceleration
Poll / Intr
Completion
Copy or Unpin
Result Data
Return From DD
Completion
TL/DL
25Gb I/O
Processor
FPGA/SoC/GPU
Functionn
Function0
Function1
Function2
OpenCAPI
TLx/DLx
Device Memory
Host Memory
Total ~13µs for data prep
Total 0.36µs
400 Instructions 100 Instructions
0.3µs 0.06µs
Application
Dependent, but
Equal to above
3,000 Instructions
1,000 Instructions
1,000 Instructions
4.9µs
300 Instructions 10,000 Instructions
7.9µs
Application
Dependent, but
Equal to below
19. So How is Accelerated Computing Leveraged?
Okay, I’m sold. But how do I leverage this technology?
Isolate and identify your heavily used workloads,
algorithms and/or applications
Place this workload onto an accelerator
Path of least resistance and flexibility is an FPGA that is
programmable
Purchase any of the OpenCAPI vendor FPGA cards and an
OpenCAPI enabled system and start developing your
solution!
19
20. Comparison of Memory Paradigms
Emerging Storage Class Memory
Processor Chip DLx/TLx
SCMData
OpenCAPI 3.1 Architecture
Ultra Low Latency ASIC Memory buffer chip
adding ~ +7-10 ns on top of native DDR direct
connect!!
Storage Class Memory tiered with traditional DDR
Memory all built upon OpenCAPI 3.1 & 3.0
architecture.
Still have the ability to use Load/Store Semantics
Storage Class Memories have the potential to be
the next disruptive technology…..
Examples include ReRAM, MRAM, Z-NAND……
All are racing to become the defacto
Main Memory
Processor Chip DLx/TLx
DDR4/5
Example: Basic DDR attach
Data
Tiered Memory
Processor Chip
DLx/TLx
DDR4/5
DLx/TLx
SCM
Data
Data
Tier 1 Memory
Tier 2 Memory
21. Acceleration Paradigms with Great Performance
21
Examples: Encryption, Compression, Erasure prior to
delivering data to the network or storage
Processor Chip
Acc
Data
Egress Transform
DLx/TLx
Processor Chip
Acc
Data
Bi-Directional Transform
Acc
DLx/TLx
Examples: NoSQL such as Neo4J with Graph Node Traversals, etc
Needle-in-a-haystack Engine
Examples: Machine or Deep Learning such as Natural Language processing,
sentiment analysis or other Actionable Intelligence using OpenCAPI attached memory
Memory Transform
Processor Chip
Acc
DataDLx/TLx
Example: Basic work offload
Processor Chip
Acc
NeedlesDLx/TLx
Examples: Database searches, joins, intersections, merges
Only the Needles are sent to the processor
Ingress Transform
Processor Chip
Acc
DataDLx/TLx
Examples: Video Analytics, Network Security,
Deep Packet Inspection, Data Plane Accelerator,
Video Encoding (H.265), High Frequency Trading etc
Needle-In-A-Haystack Engine
Large
Haystack
Of Data
OpenCAPI is ideal for acceleration due
to Bandwidth to/from accelerators, best
of breed latency, and flexibility of an
Open architecture
23. IBM AC922
Air Cooled
23
Power9 Systems with OpenCAPI
System Details
• 2 – Socket 2U
• Up to 40 cores
• Up to 2TB memory (16 DDR4 Dimms)
• 4 Gen4 PCIe Slots, 3 CAPI2.0 Enabled
• 2 2.5” SFF Drive Bays
• 4 OpenPOWER Mezzanine Sockets
• Up to 4 NVLink V100 GPUs
• Up to 4 socketed OpenCAPI Adapters*
• Up to 1 cabled OpenCAPI Cards w/ SlimSAS
adapter*
* Future Support
24. 24
Power9 Systems with OpenCAPI
• Dual Socket
• OCP 48V
• Up to 4TB memory (32 DDR4 Dimms)
• 5 PCIe Gen4 Slots, 3 CAPI2.0 Enabled
• 2 x8 OCP Gen4 PCIe Mezz Slots CAPI2.0 Enabled
• 4 x8 25 Gbps SlimSAS Accelerator Ports
• Supports up to 4 OpenCAPI cabled adapters
Google & Rackspace Zaius
Motherboard
25. Rackspace BarrelEye G2 ServerSed
25
• Zaius Motherboard
• Full-depth 48V
Open Rack V2
• 2OU Chassis
• High Density, Hot
Swap Storage Bay
• 4 x8 25Gb/s
Coherent Attach
Ports
“The OpenCAPI accelerator and
software ecosystem is growing
rapidly. With collateral available via
the Open Compute website,
accelerator developers find it easy to
design and test their solutions on our
platform.”
Adi Gangidi, Senior Design
Engineer with Rackspace
26. Power9 Systems with OpenCAPI
26
• 2 Socket 2U
• Up to 48 cores
• Up to 4TB memory (32 DDR4 DIMMs)
• 4 Gen4 PCIe Slots, CAPI2.0 Enabled
• 6 Gen3 PCIe Slots
• Up to 24 SFF / 12 LFF Drives
• 4 x8 25 Gbps Ports
• Up to 4 cabled OpenCAPI
Adapters*
Mihawk
“In order to provide the best backend architecture in AI, Big Data, and
Cloud applications, Wistron POWER9 system design incorporates OpenCAPI
technology through 25Gbps high speed link to dramatically change the
traditional data transition method. This design not only improves GPU
performance, but also utilizes next generation advanced memory, coherent
network, storage, and FPGA. This is an ideal system infrastructure to
meet next decade computing world challenges.”
Donald Hwang, CTO and President of EBG at Wistron Corporation.
31. TLx and DLx Reference Designs in an FPGA
TLX and DLX will be provided as reference designs to OpenCAPI consortium members
• Associated reference design specifications for TLx and DLx will also be delivered
along with RTL
Open Verilog – free to enhance, improve or leverage pieces of reference design for
your own accelerator development
Designed for 64B packet flow running at 400MHz
Xilinx Vivado 2017.1 TLX and DLX Statistics on VU3P Device
31
VU3P Resources CLB FlipFlops LUT as Logic LUT Memory Block Ram Tile
DLx 9392/788160
(1.19%)
19026/394080
(4.82%)
0/197280
(0%)
7.5/720
(1.0%)
TLx 13806/788160
(1.75%)
8463/394080
(2.14%)
2156/197280
(1.09%)
0/720
(0%)
Total 23108/788160
(2.94%)
27849/394080
(6.98%)
2156/197280
(1.09%)
7.5/720
(1.0%)
Because power efficiency and size do matter
32. OpenCAPI Device
• Customer application and accelerator
• Operation system enablement
• Little Endian Linux
• Reference Kernel Driver (ocxl)
• Reference User Library (libocxl)
• Hardware and reference designs to
enable coherent acceleration
Core
Processor
OS
App
(software)
Memory (Coherent)
Accelerated
Function(s)
TLx
DLx
25G
ocxl
libocxl
OCSE (OpenCAPI Simulation Environment)
models the red outlined area
OCSE enables AFU and Application co-
simulation when the reference libocxl
and reference TLx/DLx are used
Contributed to the Enablement
Workgroup under the OpenCAPI
consortium
Cable
Memory (Coherent)
Elements of the OpenCAPI Simulation Environment
25G
DL
TL
32
33. Exerciser Examples – Provided to OCC Members
MemCopy
• The MemCopy example is a data mover from source address -> destination address
using Virtual Addressing and includes these features
• Configuration and MMIO Register Space
• acTag Table used for Bus/Device/Function and Process ID identification
• 512 processes/contexts and 32 engines supporting up to 2K transfers using 64B,
128B, or 256B operations
Memory Home Agent
• The Memory Home Agent example implements memory off the endpoint OpenCAPI
accelerator to act as a coherent extension to the host processor memory
• The Memory Home Agent example includes these features
• Configuration and MMIO Register Space
• Individual and pipelined operation for memory loads and stores
• Interrupts, with error details reported to software through MMIO registers
• Sparse Address Mapping feature to extend 1 MB of real space to 4 TB of address
Open Examples – free to enhance, improve or leverage pieces of exerciser
examples for your own accelerator development 33
34. Reference Card Design Number 1
Definition of the FPGA reference card(s) is driven as part of the Enablement Work Group
Definition of the cable(s) is driven as part of the PHY Mechanical Work Group
Representative Diagram is articulated below
34
FPGA
OpenCAPI Protocol Over the 25Gbps link
Including Discovery, Configuration and Enumeration
Power and Ground only
Convenient Fixture
Amphenol SlimSAS OpenCAPI
Connector
35. Reference Card Design Number 2
Definition of the FPGA reference card(s) is driven as part of the Enablement Work Group
Definition of the cable(s) will be driven as part of the PHY Mechanical Work Group
Representative Diagram is articulated below
35
Amphenol SlimSAS OpenCAPI
Connector
Mezzanine based Form Factor
36. Table of Enablement Deliveries
36
Item Delivery Name Where to Obtain Available
OpenCAPI 3.0 TLx and DLx
Reference Xilinx FPGA Designs (RTL
and Specifications)
<snapshot>.tar.gz Enablement WG Today
Xilinx Vivado Project Build with
Memcopy Exerciser
Vivado Project Flow Enablement WG Today
Device Discovery and Configuration
Specification and RTL
OpenCAPI 3.0 Configuration Sub-
System Reference Design
Specification
Enablement WG Causeway Today
AFU Interface Specification TLX 3.0 Reference Design.pdf Enablement WG Causeway Today
25Gbps PHY Signal Specification OC PHY 25G Specification PHY Signaling WG Causeway Today
25Gbps PHY Mechanical
Specification
25Gbps Interface Mechanical Spec PHY Mechanical WG Causeway Today
OpenCAPI Simulation Environment
(OCSE)
ocse-<version>.tar.gz
OpenCAPIDemokit.pdf
Enablement WG Today
Today
Memcopy and Memory Home
Agent Exercisers
MCP3 and LPC
<snapshot>.tar.gz
Enablement WG Today
Reference Driver Available LIBOCXL Ubuntu 18.04
GitHub
Today
u
37. SmartDV OpenCAPI VIP Environment
contains:
• Complete regression suite
• Usage examples
• Detailed documentation
• User’s Guide and Release Notes
Benefits
• Complete Verification of
OpenCAPI Design
• Easy to Use
• Simplify Result Analysis
• Runs in every sim environment
37
SmartDV
OpenCAPI Verification IP
http://www.smart-dv.com/vip/opencapi.html
40. Latency Test Results
OpenCAPI Link
P9 OpenCAPI
3.9GHz Core, 2.4GHz Nest
Xilinx FPGA VU3P
298ns‡
2ns Jitter
TL, DL, PHY
TLx, DLx, PHYx (80ns‖)
378ns† Total Latency
PCIe G4 Link
P9 PCIe Gen4
Xilinx FPGA VU3P
est. <337ns
PCIe Stack
Xilinx PCIe HIP (218ns¶)
est. <555ns§ Total Latency
PCIe G3 Link
P9 PCIe Gen3
3.9GHz Core, 2.4GHz Nest
Altera FPGA Stratix V
337ns
7ns Jitter
PCIe Stack
Altera PCIe HIP (400ns¶)
737ns§ Total Latency
PCIe G3 Link
Kaby Lake PCIe Gen3*
3.9GHz Core, 2.4GHz Nest
Altera FPGA Stratix V
376ns
31ns Jitter
PCIe Stack
Altera PCIe HIP (400ns¶)
776ns§ Total Latency
* Intel Core i7 7700 Quad-Core 3.6GHz (4.2GHz Turbo Boost)
† Derived from round-trip time minus simulated FPGA app time
‡ Derived from round-trip time minus simulated FPGA app time and simulated FPGA TLx/DLx/PHYx time
§ Derived from measured CPU turnaround time plus vendor provided HIP latency
‖ Derived from simulation
¶ Vendor provided latency statistic
RACE TO ZERO LATENCY
BECAUSE JITTER MATTERS
40
42. OpenCAPI Consortium Overview
Goals
1. Provide a forum to give the industry ability to innovate the next generation bus protocol
2. Drive hardware/software innovation to enable choice and efficiency in data center
architectures
3. Build an ecosystem with the flexibility to build servers and data centers best suited for their
computational demands
Mission
Create an open coherent high performance bus interface based on a new bus standard called Open
Coherent Accelerator Processor Interface (OpenCAPI) and grow the ecosystem that utilizes this interface.
Incorporated September 13, 2016
Announced October 14, 2016
42
43. OpenCAPI Consortium
• Open forum founded by AMD, Google, IBM, Mellanox, and Micron
• Manage the OpenCAPI specification, establish enablement, grow the ecosystem
• Currently about 40 members
• Why its own consortium? Architecture agnostic thus capable of going beyond Power Architecture
• Consortium now established
• Established Board of Directors (AMD, Google, IBM, Mellanox Technologies, Micron, NVIDIA, Western
Digital, Xilinx)
• Governing Documents (Bylaws, IPR Policy, Membership) with established Membership Levels
• Technical Steering Committee and Marketing/Communications Committee
• Website www.opencapi.org
• OpenCAPI 3.0 and 3.1 Specifications available as contributed to consortium as starting point for the Work Groups
• OpenCAPI 4.0 now added to the web site !
• AFU Coherent Data Caching of System Memory
• AFU Address Translation Caching (allows posted operations to system memory)
Incorporated September 13, 2016
Announced October 14, 2016
43
45. OpenCAPI Workgroup Status
45
Item Availability
OpenCAPI Technical Steering Committee Up and running
Marketing & Communications Committee Up and running
PHY Signaling Workgroup Up and running
PHY Mechanical Workgroup Up and running
TL Architecture Specification Workgroup Up and running
DL Architecture Specification Workgroup Up and running
Enablement Workgroup Up and running
Compliance Workgroup Up and running
Accelerator/Memory Workgroup Forthcoming
46. OpenCAPI Workgroup Status
46
Item Availability Specs Available for Review
OpenCAPI Technical Steering Committee Up and running WG Process Spec
Marketing & Communications Committee Up and running Regular News Letters
PHY Signaling Workgroup Up and running Today
PHY Mechanical Workgroup Up and running May 2019
TL Architecture Specification Workgroup Up and running Today
DL Architecture Specification Workgroup Up and running April 2019
Enablement Workgroup Up and running Today
Compliance Workgroup Up and running Ongoing
Accelerator/Memory Workgroup Forthcoming --
47. Cross Industry Collaboration and Innovation
47
OpenCAPI Protocol
Welcoming new members in all areas of the ecosystem
Systems and Software
SW
Research &
Academic
Products and Services
Deployment
SOC
Accelerator Solutions
48. Membership Entitlement Details
Strategic level - $25K
• Draft and Final Specifications and
enablement
• License for Product development
• Workgroup participation and voting
• TSC participation
• Vote on new Board Members
• Nominate and/or run for officer election
• Prominent listing in appropriate materials
Observing level - $5K
• Final Specifications and enablement
• License for Product development
Contributor level - $15K
• Draft and Final Specifications and
enablement
• License for Product development
• Workgroup participation and voting
• TSC participation
• Submit proposals
Academic and Non-Profit level - Free
• Final Specifications and enablement
• Workgroup participation and voting
48