OpenCAPI: Next Generation of
Acceleration for the Cognitive Era
—
Brian Allison
OpenCAPI Technology and Enablement
E-mail: ballis@us.ibm.com
Industry Collaboration and Innovation
2
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium
3
Key Messages Throughout
• IBM's Strong History in IO Offerings
• OpenCAPI is an Open IO Standard
• Not tied to Power – Architecture Agnostic
• High Performance – No OS/Hypervisor/FW Overhead, with Low Latency and High Bandwidth
• Programming Ease
• Very Low Accelerator Design Overhead
• Supports a Heterogeneous Environment – Use Cases
• Ideal for Accelerated Computing and SCM
• Optimized for within a single system node
• Products exist today!
4
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium
5
What’s Driving the Creation of a High Perf. Bus
• Historical silicon technology improvements have run out of steam
• More cores on a processor help, but you'll never have enough; especially for today's emerging
workloads (analytics, artificial intelligence, machine learning, real-time analysis, etc.)
• New advanced memory technologies are changing the economics of computing
• Companies are realizing the need to offload the microprocessors from routine algorithms to
meet demand and improve system performance → Accelerated Computing
• Accelerators - If you are going to use them, you need a lot of data in/out quickly
6
Computation Data Access
© 2018 IBM Corporation
POWER Processor Technology Roadmap

• POWER7 (1H10, 45 nm, Enterprise): 8 cores, SMT4, eDRAM L3 cache
• POWER7+ (2H12, 32 nm, Enterprise): 2.5x larger L3 cache, on-die acceleration, zero-power core idle state
• POWER8 Family (1H14 – 2H16, 22 nm, Enterprise & Big Data Optimized): up to 12 cores, SMT8, CAPI acceleration, high-bandwidth GPU attach
• POWER9 Family (2H17 – 2H18+, 14 nm): Built for the Cognitive Era
  − Enhanced core and chip architecture optimized for emerging workloads
  − Processor family with scale-up and scale-out optimized silicon
  − Premier platform for accelerated computing
Open Coherent Accelerator Processor Interface
Industry driving these necessary Accelerator attributes
• High performance (move a lot of data quickly; bandwidth, latency)
• Fulfill various accelerator form factors (e.g., GPUs, FPGAs, ASICs)
• Introduction of device coherency requirements
• Emergence of complex Storage and Memory solutions
• Growing demand for network performance with increased
computational demand
• Need to be architecture agnostic to enable ecosystem growth and adoption
8
Accelerated Computing Example
IBM’s AC922 built for Accelerated Computing
• Classic system topology setup with two POWER9 sockets, each socket having at
least two NVIDIA GPUs
Does Accelerated Computing work?
• Summit Supercomputer, an IBM-built supercomputer now running at the
Department of Energy’s (DOE) Oak Ridge National Laboratory, captured the
number one spot with a performance of 143.5 petaflops on High Performance
Linpack (HPL), the benchmark used to rank the TOP500 list. Summit has 4,356
nodes, each one equipped with two 22-core Power9 CPUs, and six NVIDIA Tesla
V100 GPUs. The nodes are linked together with a Mellanox dual-rail EDR
InfiniBand network.
9
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium 10
Why OpenCAPI and what is it?
• OpenCAPI is a new 'bottoms-up' IO standard
• Key attributes of OpenCAPI 3.0:
  - Open IO standard – choice for developers and others to contribute and grow an ecosystem
  - Coherent interface – microprocessor memory, accelerator and caches share the same memory space
  - Architecture agnostic – not tied to Power; capable of going beyond the Power Architecture
  - High performance – no OS/hypervisor/FW overhead, for low latency and high bandwidth
  - Ease of programming
  - Ease of implementation with minimal accelerator design overhead
  - Ideal for accelerated computing and SCM, including various form factors (FPGA, GPU, ASIC, TPU, etc.)
  - Optimized for within a single system node
  - Supports a heterogeneous environment – use cases
• OpenCAPI 3.1:
  - Applies OpenCAPI technology to standard DRAM attached off the microprocessor
  - Based on an Open Memory Interface (OMI)
  - Further tuned for extremely low latency
11
OpenCAPI Key Attributes

(Diagram: any OpenCAPI-enabled processor with TL/DL drives the 25Gb I/O link to an accelerated OpenCAPI device, which implements TLx/DLx alongside its accelerated function.)
1. Architecture agnostic bus – Applicable with any system/microprocessor architecture
2. Optimized for High Bandwidth and Low Latency
3. High performance 25Gbps PHY design with zero ‘overhead’
4. Coherency - Attached devices operate natively within application’s user space and coherently with host microprocessor
5. Virtual addressing enables low overhead with no Kernel, hypervisor or firmware involvement; security benefit
6. Wide range of Use Cases and access semantics
7. CPU coherent device memory (Home Agent Memory)
8. Architected for both Classic Memory and emerging Advanced Storage Class Memory
9. Minimal OpenCAPI design overhead (FPGA less than 5%)
(Diagram, continued: the host side holds the application, caches, standard system memory and buffered system memory behind OpenCAPI memory buffers; the device side covers storage/compute/network devices, ASIC/FPGA/FFSA implementations, FPGA/SoC/GPU accelerators with load/store or block access, device memory, and advanced SCM solutions.)
Use Cases - A True Heterogeneous Architecture Built Upon OpenCAPI
OpenCAPI 3.0
OpenCAPI 3.1
OpenCAPI specifications are
downloadable from the website
at www.opencapi.org
- Register
- Download
13
© 2018 IBM Corporation
Proposed POWER Processor Technology and I/O Roadmap (Statement of Direction, Subject to Change)

POWER7 Architecture
• POWER7 (2010) – 8 cores, 45nm, new micro-architecture, new process technology. Sustained memory bandwidth: 65 GB/s; standard I/O interconnect: PCIe Gen2; advanced I/O signaling: N/A; advanced I/O architecture: N/A.
• POWER7+ (2012) – 8 cores, 32nm, enhanced micro-architecture, new process technology. Sustained memory bandwidth: 65 GB/s; standard I/O interconnect: PCIe Gen2; advanced I/O signaling: N/A; advanced I/O architecture: N/A.

POWER8 Architecture
• POWER8 (2014) – 12 cores, 22nm, new micro-architecture, new process technology. Sustained memory bandwidth: 210 GB/s; standard I/O interconnect: PCIe Gen3; advanced I/O signaling: N/A; advanced I/O architecture: CAPI 1.0.
• POWER8 w/ NVLink (2016) – 12 cores, 22nm, enhanced micro-architecture with NVLink. Sustained memory bandwidth: 210 GB/s; standard I/O interconnect: PCIe Gen3; advanced I/O signaling: 20 GT/s (160 GB/s); advanced I/O architecture: CAPI 1.0, NVLink.

POWER9 Architecture
• P9 SO (2017) – 12/24 cores, 14nm, new micro-architecture, direct-attach memory, new process technology. Sustained memory bandwidth: 150 GB/s; standard I/O interconnect: PCIe Gen4 x48; advanced I/O signaling: 25 GT/s (300 GB/s); advanced I/O architecture: CAPI 2.0, OpenCAPI 3.0, NVLink.
• P9 SU (2018) – 12/24 cores, 14nm, enhanced micro-architecture, buffered memory. Sustained memory bandwidth: 210 GB/s; standard I/O interconnect: PCIe Gen4 x48; advanced I/O signaling: 25 GT/s (300 GB/s); advanced I/O architecture: CAPI 2.0, OpenCAPI 3.0, NVLink.
• P9 w/ Adv. I/O (2019) – 12/24 cores, 14nm, enhanced micro-architecture, new memory subsystem. Sustained memory bandwidth: 350+ GB/s; standard I/O interconnect: PCIe Gen4 x48; advanced I/O signaling: 25 GT/s (300 GB/s); advanced I/O architecture: CAPI 2.0, OpenCAPI 4.0, NVLink.

POWER10
• P10 (2020+) – TBA cores, new micro-architecture, new technology. Sustained memory bandwidth: 435+ GB/s; standard I/O interconnect: PCIe Gen5; advanced I/O signaling: 32 & 50 GT/s; advanced I/O architecture: TBA.
POWER9 IO Features
• 8 and 16Gbps PHY (PCIe Gen4) – protocols supported: PCIe Gen3 x16 and PCIe Gen4 x8; CAPI 2.0 on PCIe Gen4
• 25Gbps PHY – protocols supported: OpenCAPI 3.0; NVLink 2.0
(Diagram: P9 silicon die, offered in various packages for scale-out and scale-up.)
POWER9 IO Leading the Industry
• PCIe Gen4
• CAPI 2.0 (Power)
• NVLink 2.0
• OpenCAPI 3.0
POWER9
15
Why do I care about Virtual Addressing?
• An OpenCAPI device operates in the virtual address spaces of the applications that it supports
  - Eliminates kernel and device driver software overhead
  - Allows the device to operate on application memory without kernel-level data copies/pinned pages
  - Simplifies the programming effort to integrate accelerators into applications
  - Culmination => improves accelerator performance
• The virtual-to-physical address translation occurs in the host CPU
  - Reduces design complexity of OpenCAPI accelerator development
  - Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures
  - Security: since the OpenCAPI device never has access to a physical address, this eliminates the
possibility of a defective or malicious device accessing memory locations belonging to the kernel or
other applications that it is not authorized to access
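To make this concrete, here is a minimal host-side sketch assuming the reference libocxl user library mentioned later in this deck (reference kernel driver ocxl, user library libocxl). The AFU name and register offset are hypothetical, and the exact libocxl signatures may differ between library versions; treat this as an illustration of the idea rather than the deck's own example.

```c
/* Minimal sketch (not from the deck): a host application handing an OpenCAPI
 * accelerator a pointer into its own virtual address space via libocxl.
 * "IBM,MyAFU" and WED_ADDR_REG are hypothetical. */
#include <stdint.h>
#include <stdlib.h>
#include <libocxl.h>

#define WED_ADDR_REG 0x0   /* hypothetical AFU MMIO register: work element address */

int main(void)
{
    ocxl_afu_h afu;
    ocxl_mmio_h mmio;
    uint64_t *buffer = malloc(4096);   /* ordinary, unpinned user memory */

    if (ocxl_afu_open("IBM,MyAFU", &afu) != OCXL_OK)   /* hypothetical AFU name */
        return 1;
    ocxl_afu_attach(afu, 0);                            /* attach AFU to this process's address space */
    ocxl_mmio_map(afu, OCXL_PER_PASID_MMIO, &mmio);

    /* Pass a virtual address directly: no pinning, no DMA mapping, no kernel copy.
     * The host CPU performs the virtual-to-physical translation on the AFU's behalf. */
    ocxl_mmio_write64(mmio, WED_ADDR_REG, OCXL_MMIO_LITTLE_ENDIAN,
                      (uint64_t)(uintptr_t)buffer);

    /* ... poll an AFU status register or wait on a completion flag in shared memory ... */
    ocxl_afu_close(afu);
    free(buffer);
    return 0;
}
```

The point to notice is that the buffer is ordinary malloc'd memory: the accelerator receives a virtual address and the host CPU performs the translation, so nothing is pinned, remapped or copied by the kernel on the way in or out.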
16
17
OpenCAPI vs. I/O Device Driver – minimizing SW path length is crucial for performance

Typical I/O model flow using a device driver:
DD call → copy or pin source data → MMIO notify accelerator → Acceleration (application dependent) → poll/interrupt completion → copy or unpin result data → return from DD → completion.
Software path length quoted on the slide: roughly 15,300 instructions across these steps (3,000 + 1,000 + 1,000 + 300 + 10,000), about 4.9µs + 7.9µs ≈ 13µs of data-preparation overhead in total.

Flow with an OpenCAPI model:
Shared-memory notify of the accelerator → Acceleration (application dependent, equal to the above) → shared-memory completion or fast thread wake-up.
Quoted path length: roughly 500 instructions (400 + 100), about 0.3µs + 0.06µs ≈ 0.36µs in total.

(Diagram: processor with TL/DL and host memory connected over the 25Gb I/O link to an FPGA/SoC/GPU implementing OpenCAPI TLx/DLx, multiple functions Function0…Functionn, and device memory.)
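A rough sketch of why the OpenCAPI column is so much shorter: with a coherent, virtually addressed device, both submission and completion can live in ordinary shared memory. The descriptor layout and helper names below are hypothetical (not the deck's AFU interface); the pattern is what matters.

```c
/* Sketch only: submission and completion through coherent shared memory,
 * which is what collapses the ~13us driver path to a fraction of a microsecond. */
#include <stdatomic.h>
#include <stdint.h>

struct work_element {              /* hypothetical work descriptor in user memory */
    uint64_t src;                  /* source virtual address                      */
    uint64_t dst;                  /* destination virtual address                 */
    uint64_t len;                  /* bytes to process                            */
    atomic_uint_fast64_t done;     /* accelerator sets this to 1 when finished    */
};

/* Submission side: fill the descriptor, then notify the accelerator
 * (e.g. a single MMIO doorbell write, elided here). */
void submit(struct work_element *we, uint64_t src, uint64_t dst, uint64_t len)
{
    we->src = src;
    we->dst = dst;
    we->len = len;
    atomic_store_explicit(&we->done, 0, memory_order_release);
    /* ...one MMIO write to the AFU's doorbell register would go here... */
}

/* Completion side: poll (or fast-thread-wake on) coherent shared memory
 * instead of taking a device-driver call and an interrupt. */
void wait_done(struct work_element *we)
{
    while (atomic_load_explicit(&we->done, memory_order_acquire) == 0)
        ;   /* a handful of instructions per iteration, no kernel involvement */
}
```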
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium 18
So How is Accelerated Computing Leveraged?
Okay, I'm sold. But how do I leverage this technology?
1. Isolate and identify your heavily used workloads, algorithms and/or applications.
2. Place that workload onto an accelerator – the path of least resistance, and the most flexible, is a programmable FPGA.
3. Purchase any of the OpenCAPI vendor FPGA cards and an OpenCAPI-enabled system, and start developing your solution!
19
Comparison of Memory Paradigms (OpenCAPI 3.1 architecture)

• Main memory (basic DDR attach): the processor chip connects through DLx/TLx to DDR4/5 data. The ultra-low-latency ASIC memory buffer chip adds only ~7-10 ns on top of a native DDR direct connect.
• Emerging Storage Class Memory: the processor chip connects through DLx/TLx to SCM data. Storage Class Memory is tiered with traditional DDR memory, all built upon the OpenCAPI 3.1 & 3.0 architecture, and you still have the ability to use load/store semantics. Storage Class Memories have the potential to be the next disruptive technology; examples include ReRAM, MRAM and Z-NAND, all racing to become the de facto main memory.
• Tiered Memory: the processor chip connects through one DLx/TLx to DDR4/5 (Tier 1 memory) and through another DLx/TLx to SCM (Tier 2 memory).
Acceleration Paradigms with Great Performance

• Egress Transform (accelerator transforms data on its way out of the processor, attached via DLx/TLx): examples include encryption, compression and erasure prior to delivering data to the network or storage.
• Bi-Directional Transform (data transformed in both directions between processor and accelerators): examples include NoSQL such as Neo4j with graph node traversals, etc.
• Memory Transform (accelerator works on OpenCAPI-attached memory): examples include machine or deep learning such as natural language processing, sentiment analysis or other actionable intelligence using OpenCAPI-attached memory.
• Basic work offload: the processor chip simply hands data to an accelerator attached via DLx/TLx.
• Needle-in-a-Haystack Engine (a large haystack of data sits with the accelerator; only the needles are sent to the processor): examples include database searches, joins, intersections and merges.
• Ingress Transform (accelerator transforms data on its way into the processor): examples include video analytics, network security, deep packet inspection, data plane acceleration, video encoding (H.265), high-frequency trading, etc.

OpenCAPI is ideal for acceleration due to the bandwidth to/from accelerators, best-of-breed latency, and the flexibility of an open architecture.
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium 22
IBM AC922
Air Cooled
23
Power9 Systems with OpenCAPI
System Details
• 2 – Socket 2U
• Up to 40 cores
• Up to 2TB memory (16 DDR4 Dimms)
• 4 Gen4 PCIe Slots, 3 CAPI2.0 Enabled
• 2 2.5” SFF Drive Bays
• 4 OpenPOWER Mezzanine Sockets
• Up to 4 NVLink V100 GPUs
• Up to 4 socketed OpenCAPI Adapters*
• Up to 1 cabled OpenCAPI card w/ SlimSAS adapter*
* Future Support
24
Power9 Systems with OpenCAPI
• Dual Socket
• OCP 48V
• Up to 4TB memory (32 DDR4 Dimms)
• 5 PCIe Gen4 Slots, 3 CAPI2.0 Enabled
• 2 x8 OCP Gen4 PCIe Mezz Slots CAPI2.0 Enabled
• 4 x8 25 Gbps SlimSAS Accelerator Ports
• Supports up to 4 OpenCAPI cabled adapters
Google & Rackspace Zaius Motherboard
Rackspace Barreleye G2 Server
25
• Zaius Motherboard
• Full-depth 48V
Open Rack V2
• 2OU Chassis
• High Density, Hot
Swap Storage Bay
• 4 x8 25Gb/s
Coherent Attach
Ports
“The OpenCAPI accelerator and
software ecosystem is growing
rapidly. With collateral available via
the Open Compute website,
accelerator developers find it easy to
design and test their solutions on our
platform.”
Adi Gangidi, Senior Design
Engineer with Rackspace
Power9 Systems with OpenCAPI
26
• 2 Socket 2U
• Up to 48 cores
• Up to 4TB memory (32 DDR4 DIMMs)
• 4 Gen4 PCIe Slots, CAPI2.0 Enabled
• 6 Gen3 PCIe Slots
• Up to 24 SFF / 12 LFF Drives
• 4 x8 25 Gbps Ports
• Up to 4 cabled OpenCAPI
Adapters*
Mihawk
“In order to provide the best backend architecture in AI, Big Data, and
Cloud applications, Wistron POWER9 system design incorporates OpenCAPI
technology through 25Gbps high speed link to dramatically change the
traditional data transition method. This design not only improves GPU
performance, but also utilizes next generation advanced memory, coherent
network, storage, and FPGA. This is an ideal system infrastructure to
meet next decade computing world challenges.”
Donald Hwang, CTO and President of EBG at Wistron Corporation.
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium 27
28
OpenCAPI and CAPI2 Adapters
Nallatech 250S+
Storage Expansion
• Xilinx US+ KU15P FPGA
• 4 GB DDR4
• PCIe Gen4 x8 and CAPI2
• 4x M.2 Slots
• M.2 to MiniSAS or Oculink for U.2
drive support
CAPI Flash API, Accelerated DB, Burst
Buffer
Nallatech 250-SoC
Multipurpose Converged Network /
Storage
• Xilinx Zynq US+ ZU19EG FPGA
• 8/16 GB DDR4, 4/8 GB DDR4 ARM
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 4 x8 Oculink Ports support NVMe,
Network, or OpenCAPI
• 2 100Gb QSFP28 Cages
Mellanox Innova2
Network + FPGA
• Xilinx US+ KU15P FPGA
• Mellanox CX5 NIC
• 16 GB DDR4
• PCIe Gen4 x8
• 2 25Gb SFP Cages
• X8 25Gb/s OpenCAPI Support
Network Acceleration (NFV, Packet
Classification), Security Acceleration
29
OpenCAPI and CAPI2 Adapters
AlphaData ADM-9V3
High Performance Reconfigurable
Computing
• Xilinx US+ VU3P FPGA
• 16 / 32 GB DDR4
• PCIe Gen3 x16 or Gen4 x8 and CAPI2
• 2 QSFP28 Cages
• X8 25Gb/s OpenCAPI SlimSAS
Data Center, Network Accel, HPC, HFT
Bittware XUPVV4
Massive FPGA
• Xilinx US+ VU13P FPGA
• 4 288-pin DIMM Slots, DDR4 or Dual QDR
• Up to 512GB DDR4
• PCIe Gen3 x16, CAPI2 Capable
• 4 100Gb QSFP28 Cages
• 2 x8 25Gb/s OpenCAPI Support
Optimized for Thermal Performance for
Large acceleration in the Data Center
AlphaData ADM-9H7
Large FPGA with 8GB HBM
• Xilinx US+ VU37P FPGA + HBM
• 8GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 2 x8 25 Gb/s OpenCAPI Ports (support
up to 50 GB/s)
• 4 100Gb QSFP28 Cages
AlphaData ADM-9H3
Large FPGA with 8GB HBM
• Xilinx Virtex US+ VU33P-3 FPGA + HBM
• 8GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 1 x8 25 Gb/s OpenCAPI Ports (support
up to 50 GB/s)
• 2 100Gb QSFP28 Cages
ML/DL, Inference, System Modeling, HPC
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
OpenCAPI Based Systems
OpenCAPI & CAPI2 Adapters
Design Enablement
Performance Metrics
OpenCAPI Consortium
30
TLx and DLx Reference Designs in an FPGA
• TLx and DLx will be provided as reference designs to OpenCAPI Consortium members
  - Associated reference design specifications for TLx and DLx will also be delivered along with the RTL
• Open Verilog – free to enhance, improve or leverage pieces of the reference design for your own accelerator development
• Designed for 64B packet flow running at 400MHz
• Xilinx Vivado 2017.1 TLx and DLx statistics on a VU3P device:
31
VU3P resource utilization:
• DLx – CLB flip-flops 9,392 / 788,160 (1.19%); LUTs as logic 19,026 / 394,080 (4.82%); LUT memory 0 / 197,280 (0%); block RAM tiles 7.5 / 720 (1.0%)
• TLx – CLB flip-flops 13,806 / 788,160 (1.75%); LUTs as logic 8,463 / 394,080 (2.14%); LUT memory 2,156 / 197,280 (1.09%); block RAM tiles 0 / 720 (0%)
• Total – CLB flip-flops 23,108 / 788,160 (2.94%); LUTs as logic 27,849 / 394,080 (6.98%); LUT memory 2,156 / 197,280 (1.09%); block RAM tiles 7.5 / 720 (1.0%)
Because power efficiency and size do matter
OpenCAPI Device enablement:
• Customer application and accelerator
• Operating system enablement
  - Little Endian Linux
  - Reference kernel driver (ocxl)
  - Reference user library (libocxl)
• Hardware and reference designs to enable coherent acceleration

Elements of the OpenCAPI Simulation Environment
(Diagram: host side with processor core, OS, application, ocxl kernel driver, libocxl and coherent memory; device side with the accelerated function(s), TLx/DLx and coherent device memory; the two sides joined by the 25G link/cable between the host TL/DL and the device TLx/DLx.)
• OCSE (OpenCAPI Simulation Environment) models the red-outlined area of the original diagram (the host-side portion of this stack)
• OCSE enables AFU and application co-simulation when the reference libocxl and reference TLx/DLx are used
• Contributed to the Enablement Workgroup under the OpenCAPI Consortium
Exerciser Examples – Provided to OCC Members
• MemCopy
  - The MemCopy example is a data mover from a source address to a destination address using virtual addressing, and includes these features:
    - Configuration and MMIO register space
    - acTag table used for Bus/Device/Function and Process ID identification
    - 512 processes/contexts and 32 engines supporting up to 2K transfers using 64B, 128B, or 256B operations
• Memory Home Agent
  - The Memory Home Agent example implements memory off the endpoint OpenCAPI accelerator to act as a coherent extension to the host processor memory, and includes these features:
    - Configuration and MMIO register space
    - Individual and pipelined operation for memory loads and stores
    - Interrupts, with error details reported to software through MMIO registers
    - Sparse Address Mapping feature to extend 1 MB of real space to 4 TB of address space (see the note below)
• Open examples – free to enhance, improve or leverage pieces of the exerciser examples for your own accelerator development
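As a quick sanity check on the Sparse Address Mapping ratio quoted above (this arithmetic is mine, not the deck's):

```latex
% Expansion ratio of the Memory Home Agent's sparse address mapping:
% 1 MB of backing real memory exposed across a 4 TB address range.
\[
\frac{4\ \text{TB}}{1\ \text{MB}} = \frac{2^{42}}{2^{20}} = 2^{22} \approx 4{,}194{,}304\text{-to-1 address expansion}
\]
```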
Reference Card Design Number 1
• Definition of the FPGA reference card(s) is driven as part of the Enablement Work Group
• Definition of the cable(s) is driven as part of the PHY Mechanical Work Group
• Representative diagram: an FPGA card plugged into an Amphenol SlimSAS OpenCAPI connector; the OpenCAPI protocol, including discovery, configuration and enumeration, runs over the 25Gbps link, while the card edge supplies power and ground only (a convenient fixture).

Reference Card Design Number 2
• Definition of the FPGA reference card(s) is driven as part of the Enablement Work Group
• Definition of the cable(s) will be driven as part of the PHY Mechanical Work Group
• Representative diagram: a mezzanine-based form factor attached through the Amphenol SlimSAS OpenCAPI connector.
Table of Enablement Deliveries

• OpenCAPI 3.0 TLx and DLx reference Xilinx FPGA designs (RTL and specifications): <snapshot>.tar.gz – Enablement WG – available today
• Xilinx Vivado project build with MemCopy exerciser: Vivado project flow – Enablement WG – available today
• Device discovery and configuration specification and RTL: OpenCAPI 3.0 Configuration Sub-System Reference Design Specification – Enablement WG (Causeway) – available today
• AFU Interface Specification: TLX 3.0 Reference Design.pdf – Enablement WG (Causeway) – available today
• 25Gbps PHY signal specification: OC PHY 25G Specification – PHY Signaling WG (Causeway) – available today
• 25Gbps PHY mechanical specification: 25Gbps Interface Mechanical Spec – PHY Mechanical WG (Causeway) – available today
• OpenCAPI Simulation Environment (OCSE): ocse-<version>.tar.gz, OpenCAPIDemokit.pdf – Enablement WG – available today
• MemCopy and Memory Home Agent exercisers: MCP3 and LPC, <snapshot>.tar.gz – Enablement WG – available today
• Reference driver: LIBOCXL – Ubuntu 18.04, GitHub – available today
SmartDV OpenCAPI Verification IP (http://www.smart-dv.com/vip/opencapi.html)

The SmartDV OpenCAPI VIP environment contains:
• Complete regression suite
• Usage examples
• Detailed documentation
• User's Guide and Release Notes

Benefits:
• Complete verification of an OpenCAPI design
• Easy to use
• Simplified result analysis
• Runs in every simulation environment
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
Design Enablement
Performance Metrics
OpenCAPI Consortium
38
CAPI and OpenCAPI Performance

Measured DMA bandwidth (Xilinx KU60/VU3P FPGAs):
• CAPI 1.0 – POWER8 (introduced in 2013), PCIe Gen3 x8, measured at 8Gb/s: 128B DMA read 3.81 GB/s; 128B DMA write 4.16 GB/s; 256B DMA read/write N/A.
• CAPI 2.0 – POWER9 (second generation), PCIe Gen4 x8, measured at 16Gb/s: 128B DMA read 12.57 GB/s; 128B DMA write 11.85 GB/s; 256B DMA read 13.94 GB/s; 256B DMA write 14.04 GB/s.
• OpenCAPI 3.0 – POWER9, an open architecture with a clean slate focused on bandwidth and latency, 25Gb/s x8, measured at 25Gb/s: 128B DMA read 22.1 GB/s; 128B DMA write 21.6 GB/s; 256B DMA read 22.1 GB/s; 256B DMA write 22.0 GB/s.
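A rough back-of-the-envelope link-efficiency check on the 256B numbers (my arithmetic, not the slide's):

```latex
% Raw lane rate x width vs. measured 256B DMA bandwidth.
\begin{align*}
\text{OpenCAPI 3.0:}\quad & 8 \times 25\ \text{Gb/s} = 200\ \text{Gb/s} = 25\ \text{GB/s raw};\quad 22.1/25 \approx 88\%\ \text{achieved}\\
\text{CAPI 2.0:}\quad & 8 \times 16\ \text{GT/s} \approx 128\ \text{Gb/s} \approx 16\ \text{GB/s raw};\quad 14.04/16 \approx 88\%\ \text{achieved}
\end{align*}
```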
Latency Test Results

• OpenCAPI link – P9 OpenCAPI (3.9GHz core, 2.4GHz nest) to Xilinx VU3P FPGA: 298ns‡ host turnaround, 2ns jitter; TL, DL, PHY on the host plus TLx, DLx, PHYx on the FPGA (80ns‖); 378ns† total latency.
• PCIe Gen4 link – P9 PCIe Gen4 to Xilinx VU3P FPGA: est. <337ns host turnaround; PCIe stack with Xilinx PCIe HIP (218ns¶); est. <555ns§ total latency.
• PCIe Gen3 link – P9 PCIe Gen3 (3.9GHz core, 2.4GHz nest) to Altera Stratix V FPGA: 337ns host turnaround, 7ns jitter; PCIe stack with Altera PCIe HIP (400ns¶); 737ns§ total latency.
• PCIe Gen3 link – Kaby Lake* PCIe Gen3 to Altera Stratix V FPGA: 376ns host turnaround, 31ns jitter; PCIe stack with Altera PCIe HIP (400ns¶); 776ns§ total latency.

* Intel Core i7 7700 quad-core 3.6GHz (4.2GHz Turbo Boost)
† Derived from round-trip time minus simulated FPGA app time
‡ Derived from round-trip time minus simulated FPGA app time and simulated FPGA TLx/DLx/PHYx time
§ Derived from measured CPU turnaround time plus vendor-provided HIP latency
‖ Derived from simulation
¶ Vendor-provided latency statistic
RACE TO ZERO LATENCY
BECAUSE JITTER MATTERS
40
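For clarity, the total-latency figures above are simply the host turnaround time plus the FPGA-side stack latency from the slide's footnotes:

```latex
\begin{align*}
\text{P9 OpenCAPI:}\quad & 298\ \text{ns} + 80\ \text{ns (TLx/DLx/PHYx)} = 378\ \text{ns}\\
\text{P9 PCIe Gen4:}\quad & <337\ \text{ns} + 218\ \text{ns (Xilinx HIP)} < 555\ \text{ns}\\
\text{P9 PCIe Gen3:}\quad & 337\ \text{ns} + 400\ \text{ns (Altera HIP)} = 737\ \text{ns}\\
\text{Kaby Lake PCIe Gen3:}\quad & 376\ \text{ns} + 400\ \text{ns (Altera HIP)} = 776\ \text{ns}
\end{align*}
```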
OpenCAPI Topics
Key Messages
Industry Background
Technology Overview
Possible Solutions
Design Enablement
Performance Metrics
OpenCAPI Consortium
41
OpenCAPI Consortium Overview
Goals
1. Provide a forum that gives the industry the ability to innovate on the next-generation bus protocol
2. Drive hardware/software innovation to enable choice and efficiency in data center
architectures
3. Build an ecosystem with the flexibility to build servers and data centers best suited for their
computational demands
Mission
Create an open coherent high performance bus interface based on a new bus standard called Open
Coherent Accelerator Processor Interface (OpenCAPI) and grow the ecosystem that utilizes this interface.
Incorporated September 13, 2016
Announced October 14, 2016
42
OpenCAPI Consortium
• Open forum founded by AMD, Google, IBM, Mellanox, and Micron
• Manage the OpenCAPI specification, establish enablement, grow the ecosystem
• Currently about 40 members
• Why its own consortium? The interface is architecture agnostic and thus capable of going beyond the Power Architecture
• Consortium now established
• Established Board of Directors (AMD, Google, IBM, Mellanox Technologies, Micron, NVIDIA, Western
Digital, Xilinx)
• Governing Documents (Bylaws, IPR Policy, Membership) with established Membership Levels
• Technical Steering Committee and Marketing/Communications Committee
• Website www.opencapi.org
• OpenCAPI 3.0 and 3.1 specifications, as contributed to the consortium, are available as the starting point for the Work Groups
• OpenCAPI 4.0 now added to the website!
• AFU Coherent Data Caching of System Memory
• AFU Address Translation Caching (allows posted operations to system memory)
Incorporated September 13, 2016
Announced October 14, 2016
43
OpenCAPI Workgroup Status

• OpenCAPI Technical Steering Committee – up and running; WG Process Spec available for review
• Marketing & Communications Committee – up and running; regular newsletters
• PHY Signaling Workgroup – up and running; spec available today
• PHY Mechanical Workgroup – up and running; spec available May 2019
• TL Architecture Specification Workgroup – up and running; spec available today
• DL Architecture Specification Workgroup – up and running; spec available April 2019
• Enablement Workgroup – up and running; collateral available today
• Compliance Workgroup – up and running; ongoing
• Accelerator/Memory Workgroup – forthcoming
Cross Industry Collaboration and Innovation

Welcoming new members in all areas of the ecosystem built on the OpenCAPI protocol: systems and software, research & academic, products and services, deployment, SoC, and accelerator solutions.
Membership Entitlement Details
Strategic level - $25K
• Draft and Final Specifications and
enablement
• License for Product development
• Workgroup participation and voting
• TSC participation
• Vote on new Board Members
• Nominate and/or run for officer election
• Prominent listing in appropriate materials
Observing level - $5K
• Final Specifications and enablement
• License for Product development
Contributor level - $15K
• Draft and Final Specifications and
enablement
• License for Product development
• Workgroup participation and voting
• TSC participation
• Submit proposals
Academic and Non-Profit level - Free
• Final Specifications and enablement
• Workgroup participation and voting
48
OpenCAPI Consortium Next Steps
JOIN TODAY!
www.opencapi.org
49
Mais conteúdo relacionado

Mais procurados

Best Practices and Performance Studies for High-Performance Computing Clusters
Best Practices and Performance Studies for High-Performance Computing ClustersBest Practices and Performance Studies for High-Performance Computing Clusters
Best Practices and Performance Studies for High-Performance Computing ClustersIntel® Software
 
Introduction to architecture exploration
Introduction to architecture explorationIntroduction to architecture exploration
Introduction to architecture explorationDeepak Shankar
 
SDVIs and In-Situ Visualization on TACC's Stampede
SDVIs and In-Situ Visualization on TACC's StampedeSDVIs and In-Situ Visualization on TACC's Stampede
SDVIs and In-Situ Visualization on TACC's StampedeIntel® Software
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformGanesan Narayanasamy
 
Covid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsCovid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsGanesan Narayanasamy
 
Using a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application PerformanceUsing a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application PerformanceOdinot Stanislas
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2Yutaka Kawai
 
High Performance Interconnects: Landscape, Assessments & Rankings
High Performance Interconnects: Landscape, Assessments & RankingsHigh Performance Interconnects: Landscape, Assessments & Rankings
High Performance Interconnects: Landscape, Assessments & Rankingsinside-BigData.com
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementGanesan Narayanasamy
 
BXI: Bull eXascale Interconnect
BXI: Bull eXascale InterconnectBXI: Bull eXascale Interconnect
BXI: Bull eXascale Interconnectinside-BigData.com
 
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021Deepak Shankar
 

Mais procurados (20)

Best Practices and Performance Studies for High-Performance Computing Clusters
Best Practices and Performance Studies for High-Performance Computing ClustersBest Practices and Performance Studies for High-Performance Computing Clusters
Best Practices and Performance Studies for High-Performance Computing Clusters
 
OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
 
Introduction to architecture exploration
Introduction to architecture explorationIntroduction to architecture exploration
Introduction to architecture exploration
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
 
SDVIs and In-Situ Visualization on TACC's Stampede
SDVIs and In-Situ Visualization on TACC's StampedeSDVIs and In-Situ Visualization on TACC's Stampede
SDVIs and In-Situ Visualization on TACC's Stampede
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
Intel python 2017
Intel python 2017Intel python 2017
Intel python 2017
 
Covid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsCovid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power Systems
 
Using a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application PerformanceUsing a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application Performance
 
@IBM Power roadmap 8
@IBM Power roadmap 8 @IBM Power roadmap 8
@IBM Power roadmap 8
 
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
 
POWER9 for AI & HPC
POWER9 for AI & HPCPOWER9 for AI & HPC
POWER9 for AI & HPC
 
High Performance Interconnects: Landscape, Assessments & Rankings
High Performance Interconnects: Landscape, Assessments & RankingsHigh Performance Interconnects: Landscape, Assessments & Rankings
High Performance Interconnects: Landscape, Assessments & Rankings
 
DOME 64-bit μDataCenter
DOME 64-bit μDataCenterDOME 64-bit μDataCenter
DOME 64-bit μDataCenter
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablement
 
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
 
BXI: Bull eXascale Interconnect
BXI: Bull eXascale InterconnectBXI: Bull eXascale Interconnect
BXI: Bull eXascale Interconnect
 
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 

Semelhante a OpenCAPI next generation accelerator

GTC15-Manoj-Roge-OpenPOWER
GTC15-Manoj-Roge-OpenPOWERGTC15-Manoj-Roge-OpenPOWER
GTC15-Manoj-Roge-OpenPOWERAchronix
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_finalYutaka Kawai
 
Ibm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bkIbm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bkIBM Switzerland
 
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...Cesar Maciel
 
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems SpecialistOWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems SpecialistParis Open Source Summit
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Redis Labs
 
OCP Telco Engineering Workshop at BCE2017
OCP Telco Engineering Workshop at BCE2017OCP Telco Engineering Workshop at BCE2017
OCP Telco Engineering Workshop at BCE2017Radisys Corporation
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors DataWorks Summit/Hadoop Summit
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Indrajit Poddar
 
ODSA Sub-Project Launch
ODSA Sub-Project LaunchODSA Sub-Project Launch
ODSA Sub-Project LaunchODSA Workgroup
 
ODSA Sub-Project Launch
 ODSA Sub-Project Launch ODSA Sub-Project Launch
ODSA Sub-Project LaunchNetronome
 
Introduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AIIntroduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AITyrone Systems
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsAnand Haridass
 
RedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power SystemsRedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power SystemsRedis Labs
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...Filipe Miranda
 
HPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand ChallengeHPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand ChallengeAnand Haridass
 

Semelhante a OpenCAPI next generation accelerator (20)

Demystify OpenPOWER
Demystify OpenPOWERDemystify OpenPOWER
Demystify OpenPOWER
 
GTC15-Manoj-Roge-OpenPOWER
GTC15-Manoj-Roge-OpenPOWERGTC15-Manoj-Roge-OpenPOWER
GTC15-Manoj-Roge-OpenPOWER
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final
 
Ibm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bkIbm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bk
 
Power overview 2018 08-13b
Power overview 2018 08-13bPower overview 2018 08-13b
Power overview 2018 08-13b
 
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
 
IBM Power8 announce
IBM Power8 announceIBM Power8 announce
IBM Power8 announce
 
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems SpecialistOWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
 
OCP Telco Engineering Workshop at BCE2017
OCP Telco Engineering Workshop at BCE2017OCP Telco Engineering Workshop at BCE2017
OCP Telco Engineering Workshop at BCE2017
 
OpenPOWER Foundation Overview
OpenPOWER Foundation OverviewOpenPOWER Foundation Overview
OpenPOWER Foundation Overview
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
 
ODSA Sub-Project Launch
ODSA Sub-Project LaunchODSA Sub-Project Launch
ODSA Sub-Project Launch
 
ODSA Sub-Project Launch
 ODSA Sub-Project Launch ODSA Sub-Project Launch
ODSA Sub-Project Launch
 
Introduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AIIntroduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AI
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
 
RedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power SystemsRedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power Systems
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
 
HPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand ChallengeHPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand Challenge
 

Mais de Ganesan Narayanasamy

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency programGanesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and VerilogGanesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISAGanesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsGanesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsGanesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsGanesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Ganesan Narayanasamy
 

Mais de Ganesan Narayanasamy (20)

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
 
Robustness in deep learning
Robustness in deep learningRobustness in deep learning
Robustness in deep learning
 
Perspectives of Frond end Design
Perspectives of Frond end DesignPerspectives of Frond end Design
Perspectives of Frond end Design
 
A2O Core implementation on FPGA
A2O Core implementation on FPGAA2O Core implementation on FPGA
A2O Core implementation on FPGA
 

Último

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 

Último (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 

OpenCAPI next generation accelerator

  • 1. OpenCAPI: Next Generation of Acceleration for the Cognitive Area — Brian Allison OpenCAPI Technology and Enablement E-mail: ballis@us.ibm.com
  • 3. OpenCAPI Topics Key Messages Industry Background Technology Overview Possible Solutions OpenCAPI Based Systems OpenCAPI & CAPI2 Adapters Design Enablement Performance Metrics OpenCAPI Consortium 3
  • 4. Key Messages Throughout  IBM’s Strong History in IO Offerings  OpenCAPI is an Open IO Standard  Not tied to Power – Architecture Agnostic  High Performance – No OS/Hypervisor/FW Overhead with Low Latency and High Bandwidth  Programing Ease  Very Low Accelerator Design Overhead  Supports heterogeneous environment – Use Cases  Ideal for Accelerated Computing and SCM  Optimized for within a single system node  Products exist today! 4
  • 5. OpenCAPI Topics Key Messages Industry Background Technology Overview Possible Solutions OpenCAPI Based Systems OpenCAPI & CAPI2 Adapters Design Enablement Performance Metrics OpenCAPI Consortium 5
  • 6. What’s Driving the Creation of a High Perf. Bus • Historical silicon technology improvements out of steam • More cores on a processor help but you’ll never have enough; especially for today’s emerging workloads (analytics, artificial intelligence, machine learning, real time analysis, etc.) • New advanced memory technologies are changing the economics of computing • Companies realizing the need to off load the microprocessors from routine algorithms to meet demand and improve system performance  Accelerated Computing • Accelerators - If you are going to use them, you need a lot of data in/out quickly 6 Computation Data Access
  • 7. © 2018 IBM Corporation POWER Processor Technology Roadmap 2H12 POWER7+ 32 nm - 2.5x Larger L3 cache - On-die acceleration - Zero-power core idle state - Up to 12 Cores - SMT8 - CAPI Acceleration - High Bandwidth GPU Attach 1H14 – 2H161H10 POWER7 45 nm - 8 Cores - SMT4 - eDRAM L3 Cache POWER9 Family 14nm POWER8 Family 22nm 7 Enterprise Enterprise Enterprise & Big Data Optimized 2H17 – 2H18+ Built for the Cognitive Era − Enhanced Core and Chip Architecture Optimized for Emerging Workloads − Processor Family with Scale-Up and Scale-Out Optimized Silicon − Premier Platform for Accelerated Computing
  • 8. Open Coherent Accelerator Processor Interface Industry driving these necessary Accelerator attributes • High performance (move a lot of data quickly; bandwidth, latency) • Fulfill various accelerator form factors (e.g., GPUs, FPGAs, ASICs) • Introduction of device coherency requirements • Emergence of complex Storage and Memory solutions • Growing demand for network performance with increased computational demand • Need to be architecture agnostic to enable ecosystem growth and adoption 8
  • 9. Accelerated Computing Example IBM’s AC922 built for Accelerated Computing • Classic system topology setup with two POWER9 sockets, each socket having at least two NVIDIA GPUs Does Accelerated Computing work? • Summit Supercomputer, an IBM-built supercomputer now running at the Department of Energy’s (DOE) Oak Ridge National Laboratory, captured the number one spot with a performance of 143.5 petaflops on High Performance Linpack (HPL), the benchmark used to rank the TOP500 list. Summit has 4,356 nodes, each equipped with two 22-core Power9 CPUs and six NVIDIA Tesla V100 GPUs. The nodes are linked together with a Mellanox dual-rail EDR InfiniBand network. 9
  • 10. OpenCAPI Topics Key Messages Industry Background Technology Overview Possible Solutions OpenCAPI Based Systems OpenCAPI & CAPI2 Adapters Design Enablement Performance Metrics OpenCAPI Consortium 10
  • 11. Why OpenCAPI and what is it? • OpenCAPI is a new ‘bottoms-up’ IO standard • Key Attributes of OpenCAPI 3.0: • Open IO Standard – Choice for developers and others to contribute and grow an ecosystem • Coherent interface – Microprocessor memory, accelerator and caches share the same memory space • Architecture agnostic – Capable of going beyond the Power Architecture • Not tied to Power – Architecture Agnostic • High performance – No OS/Hypervisor/FW Overhead for Low Latency and High Bandwidth • Ease of programming • Ease of implementation with minimal accelerator design overhead • Ideal for accelerated computing and SCM including various form factors (FPGA, GPU, ASIC, TPU, etc.) • Optimized for within a single system node • Supports heterogeneous environment – Use Cases • OpenCAPI 3.1: • Applies OpenCAPI technology for use of standard DRAM off the microprocessor • Based on an Open Memory Interface (OMI) • Further tuned for extremely low latency 11
  • 12. OpenCAPI Key Attributes (diagram: any OpenCAPI-enabled processor with TL/DL and caches, connected over 25Gb I/O to an accelerated OpenCAPI device with TLx/DLx, an accelerated function, and device memory; buffered system memory attaches through OpenCAPI memory buffers alongside standard system memory and advanced SCM solutions) 1. Architecture agnostic bus – Applicable with any system/microprocessor architecture 2. Optimized for High Bandwidth and Low Latency 3. High performance 25Gbps PHY design with zero ‘overhead’ 4. Coherency - Attached devices operate natively within application’s user space and coherently with host microprocessor 5. Virtual addressing enables low overhead with no Kernel, hypervisor or firmware involvement; security benefit 6. Wide range of Use Cases and access semantics 7. CPU coherent device memory (Home Agent Memory) 8. Architected for both Classic Memory and emerging Advanced Storage Class Memory 9. Minimal OpenCAPI design overhead (FPGA less than 5%) Accelerated OpenCAPI device characteristics: Application • Storage/Compute/Network etc. • ASIC/FPGA/FFSA • FPGA, SoC, GPU Accelerator • Load/Store or Block Access 12
  • 13. Use Cases - A True Heterogeneous Architecture Built Upon OpenCAPI OpenCAPI 3.0 OpenCAPI 3.1 OpenCAPI specifications are downloadable from the website at www.opencapi.org - Register - Download 13
  • 14. © 2018 IBM Corporation Proposed POWER Processor Technology and I/O Roadmap (Statement of Direction, Subject to Change). For each generation: sustained memory bandwidth / standard I/O interconnect / advanced I/O signaling / advanced I/O architecture. POWER7 Architecture: 2010 POWER7, 8 cores, 45nm, New Micro-Architecture, New Process Technology (65 GB/s / PCIe Gen2 / N/A / N/A); 2012 POWER7+, 8 cores, 32nm, Enhanced Micro-Architecture, New Process Technology (65 GB/s / PCIe Gen2 / N/A / N/A). POWER8 Architecture: 2014 POWER8, 12 cores, 22nm, New Micro-Architecture, New Process Technology (210 GB/s / PCIe Gen3 / N/A / CAPI 1.0); 2016 POWER8 w/ NVLink, 12 cores, 22nm, Enhanced Micro-Architecture with NVLink (210 GB/s / PCIe Gen3 / 20 GT/s 160GB/s / CAPI 1.0, NVLink). POWER9 Architecture: 2017 P9 SO, 12/24 cores, 14nm, New Micro-Architecture, Direct attach memory, New Process Technology (150 GB/s / PCIe Gen4 x48 / 25 GT/s 300GB/s / CAPI 2.0, OpenCAPI 3.0, NVLink); 2018 P9 SU, 12/24 cores, 14nm, Enhanced Micro-Architecture, Buffered Memory (210 GB/s / PCIe Gen4 x48 / 25 GT/s 300GB/s / CAPI 2.0, OpenCAPI 3.0, NVLink); 2019 P9 w/ Adv. I/O, 12/24 cores, 14nm, Enhanced Micro-Architecture, New Memory Subsystem (350+ GB/s / PCIe Gen4 x48 / 25 GT/s 300GB/s / CAPI 2.0, OpenCAPI 4.0, NVLink). POWER10: 2020+ P10, TBA cores, New Micro-Architecture, New Technology (435+ GB/s / PCIe Gen5 / 32 & 50 GT/s / TBA). 14
  • 15. POWER9 IO Features. 8 and 16Gbps PHY protocols supported: • PCIe Gen3 x16 and PCIe Gen4 x8 • CAPI 2.0 on PCIe Gen4. 25Gbps PHY protocols supported: • OpenCAPI 3.0 • NVLink 2.0. Silicon die offered in various packages (scale-out, scale-up). POWER9 IO leading the industry: • PCIe Gen4 • CAPI 2.0 (Power) • NVLink 2.0 • OpenCAPI 3.0 15
  • 16. Why do I care about Virtual Addressing? • An OpenCAPI device operates in the virtual address spaces of the applications that it supports • Eliminates kernel and device driver software overhead • Allows device to operate on application memory without kernel-level data copies/pinned pages • Simplifies programming effort to integrate accelerators into applications • Culmination => Improves Accelerator Performance • The Virtual-to-Physical Address Translation occurs in the host CPU • Reduces design complexity of OpenCAPI accelerator development • Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures • Security - Since the OpenCAPI device never has access to a physical address, this eliminates the possibility of a defective or malicious device accessing memory locations belonging to the kernel or other applications that it is not authorized to access 16
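To make the virtual-addressing point concrete, here is a minimal host-side sketch using the reference user library (libocxl) that appears later in this deck. The AFU name, the MMIO register offset, and the exact call signatures are illustrative assumptions; consult libocxl.h and the AFU's own specification for the real interface.

```c
/* Minimal sketch of the OpenCAPI user-space flow with the reference
 * libocxl library. AFU name, register offset, and some signatures are
 * illustrative assumptions; consult libocxl.h and your AFU spec.      */
#include <stdint.h>
#include <stdlib.h>
#include <libocxl.h>

#define WED_REG 0x0      /* hypothetical MMIO offset for a work descriptor */

int main(void)
{
    ocxl_afu_h afu;
    ocxl_mmio_h mmio;

    /* Source buffer lives in ordinary application (virtual) memory;
     * no pinning or kernel copy is required before the AFU can use it. */
    uint64_t *buffer = malloc(1 << 20);

    if (ocxl_afu_open("IBM,MEMCPY3", &afu) != OCXL_OK)  /* hypothetical AFU name */
        return 1;
    ocxl_afu_attach(afu, 0);                /* attach this process's address space */
    ocxl_mmio_map(afu, OCXL_GLOBAL_MMIO, &mmio);

    /* Hand the AFU a virtual address directly; address translation is
     * done in the host CPU, so the device never sees physical addresses. */
    ocxl_mmio_write64(mmio, WED_REG, OCXL_MMIO_LITTLE_ENDIAN,
                      (uint64_t)(uintptr_t)buffer);

    /* ... AFU works coherently on 'buffer'; completion can be signalled
     * through shared memory or an AFU interrupt ...                      */

    ocxl_afu_close(afu);
    free(buffer);
    return 0;
}
```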
  • 17. OpenCAPI vs I/O Device Driver – because minimizing SW path length is crucial for performance. Typical I/O model flow using a device driver: DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration (application dependent, equal in both flows) → Poll / Intr Completion → Copy or Unpin Result Data → Return From DD Completion; the software steps cost on the order of 3,000 + 1,000 + 1,000 instructions (≈4.9µs) before the acceleration and 300 + 10,000 instructions (≈7.9µs) after it, for a total of ~13µs of data preparation. Flow with an OpenCAPI model: Shared Memory Notify Accelerator (400 instructions, 0.3µs) → Acceleration (application dependent, equal in both flows) → Shared memory completion or fast thread wake-up (100 instructions, 0.06µs); total 0.36µs. (Diagram: processor with TL/DL and host memory connected over 25Gb I/O to an FPGA/SoC/GPU with OpenCAPI TLx/DLx, Function0..Functionn, and device memory.) 17
  • 18. OpenCAPI Topics Key Messages Industry Background Technology Overview Possible Solutions OpenCAPI Based Systems OpenCAPI & CAPI2 Adapters Design Enablement Performance Metrics OpenCAPI Consortium 18
  • 19. So How is Accelerated Computing Leveraged? Okay, I’m sold. But how do I leverage this technology? Isolate and identify your heavily used workloads, algorithms and/or applications Place this workload onto an accelerator Path of least resistance and flexibility is an FPGA that is programmable Purchase any of the OpenCAPI vendor FPGA cards and an OpenCAPI enabled system and start developing your solution! 19
  • 20. Comparison of Memory Paradigms. Example: Basic DDR attach – Processor Chip → DLx/TLx → DDR4/5 (Data); the OpenCAPI 3.1 architecture’s ultra-low-latency ASIC memory buffer chip adds only ~7-10 ns on top of native DDR direct connect. Emerging Storage Class Memory – Processor Chip → DLx/TLx → SCM (Data). Tiered Memory – Processor Chip → DLx/TLx → DDR4/5 (Tier 1 Memory) and → DLx/TLx → SCM (Tier 2 Memory): Storage Class Memory tiered with traditional DDR memory, all built upon the OpenCAPI 3.1 & 3.0 architecture, while retaining the ability to use Load/Store semantics. Storage Class Memories have the potential to be the next disruptive technology; examples include ReRAM, MRAM, and Z-NAND, all racing to become the de facto main memory.
  • 21. Acceleration Paradigms with Great Performance 21. Egress Transform (Processor Chip → Acc → Data): examples include Encryption, Compression, Erasure prior to delivering data to the network or storage. Bi-Directional Transform (Processor Chip ↔ Acc ↔ Acc, Data): examples include NoSQL such as Neo4J with Graph Node Traversals, etc. Memory Transform (Processor Chip → Acc → Data): examples include Machine or Deep Learning such as Natural Language processing, sentiment analysis or other Actionable Intelligence using OpenCAPI attached memory. Basic work offload (Processor Chip → Acc → Data). Needle-in-a-Haystack Engine (Processor Chip → Acc → large haystack of data; only the Needles are sent to the processor): examples include Database searches, joins, intersections, merges. Ingress Transform (Processor Chip → Acc → Data): examples include Video Analytics, Network Security, Deep Packet Inspection, Data Plane Accelerator, Video Encoding (H.265), High Frequency Trading, etc. OpenCAPI is ideal for acceleration due to bandwidth to/from accelerators, best-of-breed latency, and the flexibility of an open architecture.
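As a sketch of the Needle-in-a-Haystack pattern above, the fragment below shows one way a host application might hand an AFU a large data set by virtual address and wait on a shared-memory completion flag, so that only matching records ("needles") ever cross back to the processor. The descriptor layout and the post_to_afu() stub are hypothetical and not part of the OpenCAPI specification; in a real design the descriptor would be posted through MMIO as in the earlier libocxl sketch.

```c
/* Host-side sketch of a needle-in-a-haystack offload. The descriptor
 * layout below is hypothetical; only the idea (pass virtual addresses,
 * poll a shared-memory completion flag) reflects the OpenCAPI model.   */
#include <stdint.h>
#include <stdlib.h>

struct search_desc {                 /* hypothetical work descriptor          */
    uint64_t haystack_va;            /* virtual address of the large data set */
    uint64_t haystack_len;
    uint64_t predicate;              /* key/filter the AFU applies            */
    uint64_t needles_va;             /* where the AFU writes matching records */
    volatile uint64_t done;          /* completion flag written by the AFU    */
};

/* Stub standing in for the real MMIO doorbell write (e.g. via libocxl);
 * here it just "completes" immediately so the sketch runs standalone.   */
static void post_to_afu(struct search_desc *d) { d->done = 1; }

int main(void)
{
    size_t haystack_len = 1UL << 30;               /* 1 GiB stays in host memory */
    void *haystack = malloc(haystack_len);
    void *needles  = malloc(1 << 20);              /* only matches come back     */

    struct search_desc d = {
        .haystack_va  = (uint64_t)(uintptr_t)haystack,
        .haystack_len = haystack_len,
        .predicate    = 42,
        .needles_va   = (uint64_t)(uintptr_t)needles,
        .done         = 0,
    };

    post_to_afu(&d);                 /* in hardware: MMIO write of the descriptor */

    while (!d.done)                  /* shared-memory completion, no syscall      */
        ;                            /* or use a fast thread wake-up              */

    free(haystack);
    free(needles);
    return 0;
}
```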
  • 22. OpenCAPI Topics Key Messages Industry Background Technology Overview Possible Solutions OpenCAPI Based Systems OpenCAPI & CAPI2 Adapters Design Enablement Performance Metrics OpenCAPI Consortium 22
  • 23. IBM AC922 Air Cooled 23 Power9 Systems with OpenCAPI System Details • 2 – Socket 2U • Up to 40 cores • Up to 2TB memory (16 DDR4 Dimms) • 4 Gen4 PCIe Slots, 3 CAPI2.0 Enabled • 2 2.5” SFF Drive Bays • 4 OpenPOWER Mezzanine Sockets • Up to 4 NVLink V100 GPUs • Up to 4 socketed OpenCAPI Adapters* • Up to 1 cabled OpenCAPI Card w/ SlimSAS adapter* * Future Support
  • 24. 24 Power9 Systems with OpenCAPI • Dual Socket • OCP 48V • Up to 4TB memory (32 DDR4 Dimms) • 5 PCIe Gen4 Slots, 3 CAPI2.0 Enabled • 2 x8 OCP Gen4 PCIe Mezz Slots CAPI2.0 Enabled • 4 x8 25 Gbps SlimSAS Accelerator Ports • Supports up to 4 OpenCAPI cabled adapters Google & Rackspace Zaius Motherboard
  • 25. Rackspace BarrelEye G2 Server 25 • Zaius Motherboard • Full-depth 48V Open Rack V2 • 2OU Chassis • High Density, Hot Swap Storage Bay • 4 x8 25Gb/s Coherent Attach Ports “The OpenCAPI accelerator and software ecosystem is growing rapidly. With collateral available via the Open Compute website, accelerator developers find it easy to design and test their solutions on our platform.” Adi Gangidi, Senior Design Engineer with Rackspace
  • 26. Power9 Systems with OpenCAPI 26 • 2 Socket 2U • Up to 48 cores • Up to 4TB memory (32 DDR4 DIMMs) • 4 Gen4 PCIe Slots, CAPI2.0 Enabled • 6 Gen3 PCIe Slots • Up to 24 SFF / 12 LFF Drives • 4 x8 25 Gbps Ports • Up to 4 cabled OpenCAPI Adapters* Mihawk “In order to provide the best backend architecture in AI, Big Data, and Cloud applications, Wistron POWER9 system design incorporates OpenCAPI technology through 25Gbps high speed link to dramatically change the traditional data transition method. This design not only improves GPU performance, but also utilizes next generation advanced memory, coherent network, storage, and FPGA. This is an ideal system infrastructure to meet next decade computing world challenges.” Donald Hwang, CTO and President of EBG at Wistron Corporation.
  • 27. OpenCAPI Topics Key Messages Industry Background Technology Overview Possible Solutions OpenCAPI Based Systems OpenCAPI & CAPI2 Adapters Design Enablement Performance Metrics OpenCAPI Consortium 27
  • 28. 28 OpenCAPI and CAPI2 Adapters Nallatech 250S+ Storage Expansion • Xilinx US+ KU15P FPGA • 4 GB DDR4 • PCIe Gen4 x8 and CAPI2 • 4x M.2 Slots • M.2 to MiniSAS or Oculink for U.2 drive support CAPI Flash API, Accelerated DB, Burst Buffer Nallatech 250-SoC Multipurpose Converged Network / Storage • Xilinx Zynq US+ ZU19EG FPGA • 8/16 GB DDR4, 4/8 GB DDR4 ARM • PCIe Gen4 x8 or Gen3 x16, CAPI2 • 4 x8 Oculink Ports support NVMe, Network, or OpenCAPI • 2 100Gb QSFP28 Cages Mellanox Innova2 Network + FPGA • Xilinx US+ KU15P FPGA • Mellanox CX5 NIC • 16 GB DDR4 • PCIe Gen4 x8 • 2 25Gb SFP Cages • X8 25Gb/s OpenCAPI Support Network Acceleration (NFV, Packet Classification), Security Acceleration
  • 29. 29 OpenCAPI and CAPI2 Adapters AlphaData ADM-9V3 High Performance Reconfigurable Computing • Xilinx US+ VU3P FPGA • 16 / 32 GB DDR4 • PCIe Gen3 x16 or Gen4 x8 and CAPI2 • 2 QSFP28 Cages • X8 25Gb/s OpenCAPI SlimSAS Data Center, Network Accel, HPC, HFT Bittware XUPVV4 Massive FPGA • Xilinx US+ VU13P FPGA • 4 288-pin DIMM Slots, DDR4 or Dual QDR • Up to 512GB DDR4 • PCIe Gen3 x16, CAPI2 Capable • 4 100Gb QSFP28 Cages • 2 x8 25Gb/s OpenCAPI Support Optimized for Thermal Performance for Large acceleration in the Data Center AlphaData ADM-9H7 Large FPGA with 8GB HBM • Xilinx US+ VU37P FPGA + HBM • 8GB High Bandwidth Memory • PCIe Gen4 x8 or Gen3 x16, CAPI2 • 2 x8 25 Gb/s OpenCAPI Ports (support up to 50 GB/s) • 4 100Gb QSFP28 Cages AlphaData ADM-9H3 Large FPGA with 8GB HBM • Xilinx Virtex US+ VU33P-3 FPGA + HBM • 8GB High Bandwidth Memory • PCIe Gen4 x8 or Gen3 x16, CAPI2 • 1 x8 25 Gb/s OpenCAPI Ports (support up to 50 GB/s) • 2 100Gb QSFP28 Cages ML/DL, Inference, System Modeling, HPC
  • 30. OpenCAPI Topics Key Messages Industry Background Technology Overview Possible Solutions OpenCAPI Based Systems OpenCAPI & CAPI2 Adapters Design Enablement Performance Metrics OpenCAPI Consortium 30
  • 31. TLx and DLx Reference Designs in an FPGA • TLX and DLX will be provided as reference designs to OpenCAPI consortium members • Associated reference design specifications for TLx and DLx will also be delivered along with the RTL • Open Verilog – free to enhance, improve or leverage pieces of the reference design for your own accelerator development • Designed for 64B packet flow running at 400MHz • Xilinx Vivado 2017.1 TLX and DLX statistics on a VU3P device (used/available): DLx – CLB FlipFlops 9392/788160 (1.19%), LUT as Logic 19026/394080 (4.82%), LUT Memory 0/197280 (0%), Block RAM Tiles 7.5/720 (1.0%); TLx – CLB FlipFlops 13806/788160 (1.75%), LUT as Logic 8463/394080 (2.14%), LUT Memory 2156/197280 (1.09%), Block RAM Tiles 0/720 (0%); Total – CLB FlipFlops 23108/788160 (2.94%), LUT as Logic 27849/394080 (6.98%), LUT Memory 2156/197280 (1.09%), Block RAM Tiles 7.5/720 (1.0%). Because power efficiency and size do matter 31
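As a quick sanity check on the 64B-at-400MHz design point, and assuming the x8 × 25 Gbps link width used elsewhere in this deck, the reference data path slightly exceeds the raw link rate, so the TLx/DLx logic should not be the bottleneck:

```latex
\text{TLx/DLx data path: } 64\,\mathrm{B} \times 400\,\mathrm{MHz} = 25.6\,\mathrm{GB/s}
\qquad
\text{raw link: } 8 \times 25\,\mathrm{Gb/s} = 200\,\mathrm{Gb/s} = 25\,\mathrm{GB/s}
```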
  • 32. Elements of the OpenCAPI Simulation Environment. OpenCAPI device enablement: • Customer application and accelerator • Operating system enablement • Little Endian Linux • Reference Kernel Driver (ocxl) • Reference User Library (libocxl) • Hardware and reference designs to enable coherent acceleration. (Diagram: core processor with OS, application software, libocxl and ocxl, coherent memory, and TL/DL 25G on the host side; a cable to the device with TLx/DLx 25G, Accelerated Function(s), and coherent memory.) OCSE (OpenCAPI Simulation Environment) models the red outlined area and enables AFU and application co-simulation when the reference libocxl and reference TLx/DLx are used. Contributed to the Enablement Workgroup under the OpenCAPI consortium. 32
  • 33. Exerciser Examples – Provided to OCC Members • MemCopy • The MemCopy example is a data mover from source address → destination address using Virtual Addressing and includes these features • Configuration and MMIO Register Space • acTag Table used for Bus/Device/Function and Process ID identification • 512 processes/contexts and 32 engines supporting up to 2K transfers using 64B, 128B, or 256B operations • Memory Home Agent • The Memory Home Agent example implements memory off the endpoint OpenCAPI accelerator to act as a coherent extension to the host processor memory • The Memory Home Agent example includes these features • Configuration and MMIO Register Space • Individual and pipelined operation for memory loads and stores • Interrupts, with error details reported to software through MMIO registers • Sparse Address Mapping feature to extend 1 MB of real space to 4 TB of address space • Open Examples – free to enhance, improve or leverage pieces of the exerciser examples for your own accelerator development 33
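A small aside on the Sparse Address Mapping feature above: spreading 1 MB of real backing store across a 4 TB address range is a fan-out of roughly four million to one, which is what makes a sparse (rather than dense) mapping structure necessary:

```latex
\frac{4\,\mathrm{TB}}{1\,\mathrm{MB}} = \frac{2^{42}\,\mathrm{B}}{2^{20}\,\mathrm{B}} = 2^{22} \approx 4.2 \times 10^{6}
```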
  • 34. Reference Card Design Number 1 • Definition of the FPGA reference card(s) is driven as part of the Enablement Work Group • Definition of the cable(s) is driven as part of the PHY Mechanical Work Group • Representative diagram: FPGA running the OpenCAPI protocol over the 25Gbps link, including Discovery, Configuration and Enumeration; power and ground only from a convenient fixture; Amphenol SlimSAS OpenCAPI connector 34
  • 35. Reference Card Design Number 2 • Definition of the FPGA reference card(s) is driven as part of the Enablement Work Group • Definition of the cable(s) will be driven as part of the PHY Mechanical Work Group • Representative diagram: mezzanine-based form factor with Amphenol SlimSAS OpenCAPI connector 35
  • 36. Table of Enablement Deliveries (Item / Delivery Name / Where to Obtain / Available): OpenCAPI 3.0 TLx and DLx Reference Xilinx FPGA Designs (RTL and Specifications) / <snapshot>.tar.gz / Enablement WG / Today; Xilinx Vivado Project Build with Memcopy Exerciser / Vivado Project Flow / Enablement WG / Today; Device Discovery and Configuration Specification and RTL / OpenCAPI 3.0 Configuration Sub-System Reference Design Specification / Enablement WG Causeway / Today; AFU Interface Specification / TLX 3.0 Reference Design.pdf / Enablement WG Causeway / Today; 25Gbps PHY Signal Specification / OC PHY 25G Specification / PHY Signaling WG Causeway / Today; 25Gbps PHY Mechanical Specification / 25Gbps Interface Mechanical Spec / PHY Mechanical WG Causeway / Today; OpenCAPI Simulation Environment (OCSE) / ocse-<version>.tar.gz and OpenCAPIDemokit.pdf / Enablement WG / Today; Memcopy and Memory Home Agent Exercisers (MCP3 and LPC) / <snapshot>.tar.gz / Enablement WG / Today; Reference Driver / LIBOCXL / Ubuntu 18.04 and GitHub / Today. 36
  • 37. SmartDV OpenCAPI VIP Environment contains: • Complete regression suite • Usage examples • Detailed documentation • User’s Guide and Release Notes Benefits • Complete Verification of OpenCAPI Design • Easy to Use • Simplify Result Analysis • Runs in every sim environment 37 SmartDV OpenCAPI Verification IP http://www.smart-dv.com/vip/opencapi.html
  • 38. OpenCAPI Topics Key Messages Industry Background Technology Overview Possible Solutions Design Enablement Performance Metrics OpenCAPI Consortium 38
  • 39. CAPI and OpenCAPI Performance 39. Measured DMA bandwidth (CAPI 1.0, PCIe Gen3 x8 @ 8Gb/s / CAPI 2.0, PCIe Gen4 x8 @ 16Gb/s / OpenCAPI 3.0, 25Gb/s x8): 128B DMA Read 3.81 / 12.57 / 22.1 GB/s; 128B DMA Write 4.16 / 11.85 / 21.6 GB/s; 256B DMA Read N/A / 13.94 / 22.1 GB/s; 256B DMA Write N/A / 14.04 / 22.0 GB/s. CAPI 1.0 was introduced with POWER8 in 2013; CAPI 2.0 is the second generation on POWER9; OpenCAPI 3.0, also on POWER9, is an open architecture built from a clean slate and focused on bandwidth and latency. Measurements taken on POWER8 (CAPI 1.0) and POWER9 (CAPI 2.0 and OpenCAPI 3.0) with Xilinx KU60/VU3P FPGAs.
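For context, the measured figures can be compared against the raw link rates. The denominators below ignore protocol and encoding overhead and assume the x8 widths stated in the table (8 × 25 Gb/s = 25 GB/s for OpenCAPI 3.0, 8 × 16 GT/s ≈ 16 GB/s for PCIe Gen4 x8):

```latex
\text{OpenCAPI 3.0: } \frac{22.1\,\mathrm{GB/s}}{25\,\mathrm{GB/s}} \approx 88\%,
\qquad
\text{CAPI 2.0 (PCIe Gen4 x8): } \frac{13.94\,\mathrm{GB/s}}{16\,\mathrm{GB/s}} \approx 87\%
```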
  • 40. Latency Test Results – race to zero latency, because jitter matters. OpenCAPI link: P9 OpenCAPI (3.9GHz core, 2.4GHz nest) to Xilinx VU3P FPGA – 298ns‡ with 2ns jitter; TL, DL, PHY plus TLx, DLx, PHYx (80ns‖); 378ns† total latency. PCIe Gen4 link: P9 PCIe Gen4 to Xilinx VU3P FPGA – est. <337ns; PCIe stack, Xilinx PCIe HIP (218ns¶); est. <555ns§ total latency. PCIe Gen3 link: P9 PCIe Gen3 (3.9GHz core, 2.4GHz nest) to Altera Stratix V FPGA – 337ns with 7ns jitter; PCIe stack, Altera PCIe HIP (400ns¶); 737ns§ total latency. PCIe Gen3 link: Kaby Lake PCIe Gen3* to Altera Stratix V FPGA – 376ns with 31ns jitter; PCIe stack, Altera PCIe HIP (400ns¶); 776ns§ total latency. * Intel Core i7 7700 Quad-Core 3.6GHz (4.2GHz Turbo Boost) † Derived from round-trip time minus simulated FPGA app time ‡ Derived from round-trip time minus simulated FPGA app time and simulated FPGA TLx/DLx/PHYx time § Derived from measured CPU turnaround time plus vendor provided HIP latency ‖ Derived from simulation ¶ Vendor provided latency statistic 40
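The totals above line up with the footnoted breakdowns: in each case the total latency is the measured host-plus-wire component plus the FPGA-side stack latency:

```latex
378\,\mathrm{ns} = 298\,\mathrm{ns} + 80\,\mathrm{ns}, \qquad
555\,\mathrm{ns} = 337\,\mathrm{ns} + 218\,\mathrm{ns}, \qquad
737\,\mathrm{ns} = 337\,\mathrm{ns} + 400\,\mathrm{ns}, \qquad
776\,\mathrm{ns} = 376\,\mathrm{ns} + 400\,\mathrm{ns}
```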
  • 41. OpenCAPI Topics Key Messages Industry Background Technology Overview Possible Solutions Design Enablement Performance Metrics OpenCAPI Consortium 41
  • 42. OpenCAPI Consortium Overview Goals 1. Provide a forum to give the industry ability to innovate the next generation bus protocol 2. Drive hardware/software innovation to enable choice and efficiency in data center architectures 3. Build an ecosystem with the flexibility to build servers and data centers best suited for their computational demands Mission Create an open coherent high performance bus interface based on a new bus standard called Open Coherent Accelerator Processor Interface (OpenCAPI) and grow the ecosystem that utilizes this interface. Incorporated September 13, 2016 Announced October 14, 2016 42
  • 43. OpenCAPI Consortium • Open forum founded by AMD, Google, IBM, Mellanox, and Micron • Manages the OpenCAPI specification, establishes enablement, grows the ecosystem • Currently about 40 members • Why its own consortium? Architecture agnostic, thus capable of going beyond the Power Architecture • Consortium now established • Established Board of Directors (AMD, Google, IBM, Mellanox Technologies, Micron, NVIDIA, Western Digital, Xilinx) • Governing Documents (Bylaws, IPR Policy, Membership) with established Membership Levels • Technical Steering Committee and Marketing/Communications Committee • Website www.opencapi.org • OpenCAPI 3.0 and 3.1 specifications available, contributed to the consortium as a starting point for the Work Groups • OpenCAPI 4.0 now added to the website! • AFU Coherent Data Caching of System Memory • AFU Address Translation Caching (allows posted operations to system memory) Incorporated September 13, 2016; Announced October 14, 2016 43
  • 44. OpenCAPI Workgroup Status 39 Item OpenCAPI Technical Steering Committee Marketing & Communications Committee PHY Signaling Workgroup PHY Mechanical Workgroup TL Architecture Specification Workgroup DL Architecture Specification Workgroup Enablement Workgroup Compliance Workgroup Accelerator/Memory Workgroup
  • 45. OpenCAPI Workgroup Status 45 Item Availability OpenCAPI Technical Steering Committee Up and running Marketing & Communications Committee Up and running PHY Signaling Workgroup Up and running PHY Mechanical Workgroup Up and running TL Architecture Specification Workgroup Up and running DL Architecture Specification Workgroup Up and running Enablement Workgroup Up and running Compliance Workgroup Up and running Accelerator/Memory Workgroup Forthcoming
  • 46. OpenCAPI Workgroup Status 46 (Item / Availability / Specs Available for Review): OpenCAPI Technical Steering Committee / Up and running / WG Process Spec; Marketing & Communications Committee / Up and running / Regular Newsletters; PHY Signaling Workgroup / Up and running / Today; PHY Mechanical Workgroup / Up and running / May 2019; TL Architecture Specification Workgroup / Up and running / Today; DL Architecture Specification Workgroup / Up and running / April 2019; Enablement Workgroup / Up and running / Today; Compliance Workgroup / Up and running / Ongoing; Accelerator/Memory Workgroup / Forthcoming / --
  • 47. Cross Industry Collaboration and Innovation 47 OpenCAPI Protocol Welcoming new members in all areas of the ecosystem Systems and Software SW Research & Academic Products and Services Deployment SOC Accelerator Solutions
  • 48. Membership Entitlement Details Strategic level - $25K • Draft and Final Specifications and enablement • License for Product development • Workgroup participation and voting • TSC participation • Vote on new Board Members • Nominate and/or run for officer election • Prominent listing in appropriate materials Observing level - $5K • Final Specifications and enablement • License for Product development Contributor level - $15K • Draft and Final Specifications and enablement • License for Product development • Workgroup participation and voting • TSC participation • Submit proposals Academic and Non-Profit level - Free • Final Specifications and enablement • Workgroup participation and voting 48
  • 49. OpenCAPI Consortium Next Steps JOIN TODAY! www.opencapi.org 49