This document introduces OpenCAPI acceleration with the OpenCAPI Acceleration Framework (OC-Accel). It gives an overview of the OC-Accel components and workflow, benchmarks OC-Accel bandwidth and latency, and shows examples of how to fully utilize OC-Accel to accelerate functions on an FPGA. It also outlines the OC-Accel development process and previews upcoming features such as ODMA support for porting existing PCIe accelerators to OpenCAPI.
2. 2
…a really good story
Porting functions to FPGA has never been so easy!
Acceleration Framework
3. 3
Contents
• OpenCAPI acceleration paradigms
• Components and workflow
• OC-Accel Bandwidth and Latency
• Examples to fully utilize OC-Accel
4. 4
Xilinx: All Programmable
Say hello to the POWER9 processor
Built from the ground up for enterprise AI, POWER9 is the only processor with state-of-the-art I/O subsystem technology.
5. 5
Architecture for heterogeneous computing
[Diagram: symmetric multiprocessor CPU cores, each with a cache, sit on an on-chip coherent interconnect attached to local main memory. An off-chip coherent interconnect (A X O N links, including NVLINK) extends coherence to remote main memory, GPU memory, FPGA/ASIC device memory, and SCM.]
6. 6
OC-Accel enables virtual memory sharing
[Diagram: conceptual system. A processor chip (CPU cores with caches on the on-chip coherent interconnect, plus host DDR memory) connects over a 25Gbps x8 OpenCAPI link to an accelerator card. On the card's FPGA/ASIC chip, the OC-Accel logic bridges the link to the Acceleration Function Unit and to external I/O.]
8. 8
All you need for an OpenCAPI device - available today!
[Diagram: the host processor implements the host bus protocol layer (TL, DL, PHY) behind its host bus interface; the OpenCAPI device implements the mirror stack (PHYX, DLX, TLX) plus the OC-Accel bridge and the AFU, connected to the host over the serial link.]
https://github.com/OpenCAPI/OpenCAPI3.0_Client_RefDesign
https://github.com/OpenCAPI/oc-accel
9. 9
OC-Accel includes
• Hardware logic that hides the details of the TLX protocol
• Software APIs (libosnap) for application code to talk to the accelerator
• Scripts and strategies to construct an FPGA project
• A simulation environment with OCSE
• A workflow for coding, debugging, implementation and deployment
• High Level Synthesis support
• Examples and documents to get started
• Openness to new FPGA boards
10. 10
OC/AXI Bridge mode - quick and easy development framework for OpenCAPI accelerators
[Diagram: on the host server, each software process (A, B, C) owns a slave context with a job queue and talks through the osnap library, libocxl, and the ocxl kernel driver. On the FPGA, the TLx/DLx OpenCAPI stack feeds the snap core (config, data bridge, MMIO), which exposes an AXI4-MM slave port and an AXI4-Lite master port to the Hardware Action (Verilog or HLS C/C++). The action can also reach on-card DDR and other peripherals.]
11. 11
New FPGA card support
board_support_packages/
  CARD1/
    xdc/
      pinout*.xdc (phy and flash)
      config*.xdc
      timing*.xdc
    verilog/
      xilinx (phy)
      oc_fpga_top.v
      oc_bsp.v
      cfg_tieoffs.v
      vpd_stub.v
    ip/
      create_others.tcl
      create_axi_hwicap.tcl, etc. - flashing IP
      create_ddr4.tcl
  CARD2/
  CARD3/
Can be applied to any OpenCAPI cards!
13. 13
For Verilog developers:
• AXI4-MM master interface
  - Compared to CAPI-SNAP, new features are added
[Diagram: the same OC/AXI Bridge mode stack as earlier (TLx/DLx, snap core with config, data bridge and MMIO, on-card DRAM and other peripherals), with the Hardware Action written in Verilog attached to the AXI4-MM and AXI4-Lite ports.]
[Workflow: Prepare environment → Application (sw) and Action (hw) in Verilog/VHDL with unit simulation (or HLS C++ with optimization) → co-simulation → build FPGA image → deploy on Power server.]
14. 14
AXI4-MM Master Interface

AXI Signal   | Features                                                                | Compared to CAPI-SNAP
Address      | 64b effective address                                                   | Same
Data         | 1024b                                                                   | 512b
Bus Size     | Supports all narrow-sized transfers (1, 2, 4, 8, 16, 32, 64, 128 Byte)  | 64 Byte (full size) only
Burst Length | 8 bits (1 to 256, not across a 4KB boundary)                            | Same
Burst Mode   | FIXED, INCR                                                             | INCR only
Write Strobe | Supported                                                               | N/A
ID           | 5 bits (32 IDs)                                                         | 1 bit
USER         | 9 bits (512 PASIDs)                                                     | N/A
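The write-strobe row deserves a concrete picture. A minimal sketch (the function name is mine, not from oc-accel) of how an AXI byte-enable mask selects the bytes of a narrow write inside one 1024-bit (128-byte) data beat:

```cpp
#include <bitset>
#include <cassert>

// One WSTRB bit per byte lane; 128 lanes on a 1024b data bus.
using wstrb_t = std::bitset<128>;

// Enable `size` consecutive byte lanes starting at byte `offset`
// within the 128-byte beat. Illustrative helper, not oc-accel code.
wstrb_t narrow_write_strobe(unsigned offset, unsigned size) {
    assert(size > 0 && offset + size <= 128);  // stay inside the beat
    wstrb_t strobe;
    for (unsigned i = 0; i < size; ++i)
        strobe.set(offset + i);
    return strobe;
}
```

A full-size 128-byte write asserts all 128 strobe bits; an 8-byte write at offset 4 asserts only lanes 4 through 11, which is exactly what CAPI-SNAP's 64-byte-only bus could not express.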
15. 15
OC/AXI Bridge mode - for HLS developers
[Diagram: the same OC/AXI Bridge mode stack as earlier; here the Hardware Action is written in HLS C/C++.]
The action entry point:
static void process_action(snap_membus_t *din_gmem,
snap_membus_t *dout_gmem,
snap_membus_t *d_ddrmem,
action_reg *act_reg)
{...}
[Workflow: Prepare environment → Application (sw) and Action (hw) in HLS C++ with optimization (or Verilog/VHDL with unit simulation) → co-simulation → build FPGA image → deploy on Power server.]
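To make the `process_action` prototype concrete, here is a minimal memcopy-style body. This is a sketch under stated assumptions: `snap_membus_t` stands in for the 1024-bit bus word (`ap_uint<1024>` under Vivado HLS) as a plain struct so it compiles without HLS headers, and the `action_reg` fields are hypothetical, since the real register layout is generated from the action's register map.

```cpp
#include <cstdint>
#include <cstring>

// Stand-in for the 1024-bit bus word; a plain struct so this sketch
// compiles without the HLS ap_uint headers.
struct snap_membus_t { uint8_t b[128]; };

// Hypothetical register layout (illustrative field names).
struct action_reg {
    uint64_t src_off;    // word offset into din_gmem
    uint64_t dst_off;    // word offset into dout_gmem
    uint32_t num_words;  // number of 1024-bit words to copy
};

// Minimal body for the skeleton above: read bursts from the host-memory
// AXI port and write them back out; on-card DDR stays unused here.
static void process_action(snap_membus_t *din_gmem,
                           snap_membus_t *dout_gmem,
                           snap_membus_t *d_ddrmem,
                           action_reg *act_reg)
{
    (void)d_ddrmem;  // unused in this sketch
    for (uint32_t i = 0; i < act_reg->num_words; i++)
        dout_gmem[act_reg->dst_off + i] = din_gmem[act_reg->src_off + i];
}
```

In a real action, the three pointer arguments become the AXI4-MM master ports (host memory and on-card DDR) and `act_reg` maps to the AXI4-Lite register file via HLS interface pragmas.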
16. 16
Application calls HW action (Verilog)

Application: main(), using the libosnap APIs:
  Open device
  Prepare data (malloc)
  Prepare job
  Attach action (ID, job)
  snap_mmio_write(ctrl_reg)
  ...
  snap_mmio_read(status)
  Close device
  Cleanup

Action (hw), in Verilog/VHDL: the AXI4-Lite slave exposes the action registers (ctrl_reg, status, para1, ...) that the application accesses over MMIO.
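The ctrl/status handshake in that call sequence can be modeled in plain software. The sketch below is a pure software model, no hardware or libosnap involved: register names and bit positions are illustrative, and `action_step` stands in for logic that on a real card runs concurrently on the FPGA.

```cpp
#include <cstdint>

// Stand-in for the AXI4-Lite action register file (illustrative layout).
struct ActionRegs {
    uint32_t ctrl_reg = 0;   // bit 0: start
    uint32_t status   = 0;   // bit 0: done
    uint32_t para1    = 0;   // input parameter
    uint32_t result   = 0;   // output value
};

// One evaluation of the modeled hardware action: runs when start is set.
void action_step(ActionRegs &r) {
    if (r.ctrl_reg & 1u) {
        r.result = r.para1 * 2;   // the "computation"
        r.ctrl_reg &= ~1u;        // consume start
        r.status |= 1u;           // raise done
    }
}

// Application side, mirroring the slide's sequence: write the parameter
// and ctrl_reg, then poll status until done.
uint32_t run_action(ActionRegs &r, uint32_t input) {
    r.para1 = input;              // snap_mmio_write(para1)
    r.status = 0;
    r.ctrl_reg |= 1u;             // snap_mmio_write(ctrl_reg)
    while (!(r.status & 1u))      // snap_mmio_read(status)
        action_step(r);           // real hardware runs on its own clock
    return r.result;
}
```

The polling loop is the same shape whether the action is this toy doubler or real Verilog: write parameters, set the start bit, poll done.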
18. 18
Software+Hardware Co-simulation
[Diagram: three processes cooperate in co-simulation. Process 3 runs the application (software program) against the osnap library and the libocxl shim provided by OCSE; Process 2 runs OCSE itself; Process 1 runs the HDL simulator, where the afu_driver connects OCSE to the snap core and the Hardware Action (Verilog or HLS C/C++) with its AXI4-MM and AXI4-Lite ports, on-card DRAM, and other peripherals.]
21. 21
Latency

From                         | To                      | Average time
Send read address (ARVALID)  | Get read data (RVALID)  | 495 ns
Send write address (AWVALID) | Write done (BVALID)     | 450 ns

[Measured in OC/AXI Bridge mode, from the Hardware Action's AXI4-MM master port through the snap core, TLx/DLx, and the 25Gbps x8 OpenCAPI link, to host memory on the CPU of the host server and back.]
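A back-of-envelope consequence of these numbers (my arithmetic, not stated on the slide): with one request in flight at a time, a dependent access chain completes about 1e9 / 495 ≈ 2 million reads per second, so actions should keep many transactions outstanding (the master interface's 5-bit ID field allows up to 32) to approach link bandwidth.

```cpp
// How many dependent (one-at-a-time) round trips per second a given
// read latency allows. Simple arithmetic, not a measurement.
double serial_reads_per_second(double latency_ns) {
    return 1e9 / latency_ns;
}
```

For the measured 495 ns this yields roughly 2.02 million serialized reads per second, which is why the scattered-access and pipelining patterns on the following slides matter.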
22. 22
Fully utilize OpenCAPI: Shared Memory

Software thread:
• Set/append work elements
• Execute software tasks
• Read/check HW status
• Access any address at any time

Hardware on FPGA:
• Get/complete work elements
• Do hardware computations
• Update status
• Access any address at any time within the same context

Innate ability for software/hardware cooperation because of CAPI/OpenCAPI.
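The two columns above can be sketched as a shared work-element queue. This is an illustrative, single-threaded model (all names are mine): in the real system the FPGA scans the queue continuously and concurrently, and OpenCAPI's coherence is what lets both sides read and update the same in-memory elements directly.

```cpp
#include <vector>

// Work-element lifecycle; both sides poll these fields in shared memory.
enum { WE_EMPTY = 0, WE_READY = 1, WE_DONE = 2 };

struct WorkElement {
    int input  = 0;
    int output = 0;
    int state  = WE_EMPTY;
};

// Software side: set/append a work element to the shared queue.
void sw_append(std::vector<WorkElement> &q, int input) {
    q.push_back({input, 0, WE_READY});
}

// "Hardware" side: get ready elements, compute, update status.
// (A real action would do this concurrently, not on demand.)
void hw_scan(std::vector<WorkElement> &q) {
    for (auto &we : q)
        if (we.state == WE_READY) {
            we.output = we.input * we.input;   // the computation
            we.state  = WE_DONE;
        }
}
```

Software keeps appending and checking states while hardware keeps scanning; no ioctl or DMA descriptor setup sits in between.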
23. 23
Accelerate more and smaller functions
Profiling (e.g. valgrind + qcachegrind) → hot-spot functions → accelerate them!
24. 24
Scattered memory access

Traditional DMA way:
1. Software gathers the scattered blocks (blk, blk, blk, blk) into one buffer.
2. One transaction moves the big amount of data to the FPGA.

Handle data structures with pointers:
1. Send the addresses (or the list head pointer) to the FPGA.
2. The FPGA directly reads the required data at random addresses.
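The contrast can be sketched in a few lines (illustrative code, not from oc-accel): with a coherent link the "device" side can walk the host's linked list itself, while the traditional path needs a software gather pass first.

```cpp
#include <cstdint>
#include <vector>

struct Blk { uint32_t value; Blk *next; };

// OpenCAPI style: the device follows the host's pointers directly,
// reading each block at its scattered address.
uint64_t device_walk(const Blk *head) {
    uint64_t sum = 0;
    for (const Blk *b = head; b; b = b->next) sum += b->value;
    return sum;
}

// Traditional DMA style: software first gathers the scattered blocks
// into one contiguous buffer, then ships it in a single big transfer.
std::vector<uint32_t> software_gather(const Blk *head) {
    std::vector<uint32_t> buf;
    for (const Blk *b = head; b; b = b->next) buf.push_back(b->value);
    return buf;   // one DMA of buf.data(), 4 * buf.size() bytes
}
```

The gather pass costs CPU time and memory bandwidth before the device does any work; the pointer-chasing path spends link round trips instead, which is where keeping multiple reads in flight pays off.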
25. 25
SW-HW working together

[Left: offload the CPU - the CPU issues start, the hardware logic on the FPGA runs, and signals done. No device driver runs in kernel space.]
[Right: software tasks and hardware logic alternate, start/done handshakes overlapping so the stages run in a pipeline with shorter overhead.]
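A simple timing model (my assumption: fixed per-stage times) shows why the pipelined pattern wins: overlapping software preparation with hardware execution hides the shorter of the two costs.

```cpp
#include <algorithm>

// Total time for n jobs when each job is "software task, then hardware
// logic", strictly serialized (left side of the slide).
long serialized_ns(int n, long sw_ns, long hw_ns) {
    return (long)n * (sw_ns + hw_ns);
}

// Total time when software prepares job i+1 while hardware runs job i
// (right side): fill the pipe once, then advance at the slower stage.
long pipelined_ns(int n, long sw_ns, long hw_ns) {
    return sw_ns + hw_ns + (long)(n - 1) * std::max(sw_ns, hw_ns);
}
```

For example, 10 jobs at 300 ns of software work and 500 ns of hardware work take 8000 ns serialized but only 5300 ns pipelined; the low start/done overhead of MMIO over OpenCAPI is what makes such fine-grained handoffs worthwhile.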
27. 27
200Gbps: Match ethernet throughput
[Diagram: the same OC/AXI Bridge mode stack; here the Hardware Action additionally drives two 100Gbps ethernet ports (100Gbps x2), matching the 25Gbps x8 OpenCAPI link to the CPU on the host server.]
28. 28
Boost your acceleration performance with the OC-Accel Acceleration Framework:
• Open & easy developing
• Global memory sharing
• Latency
• Throughput
31. 31
Next season:
• ODMA helps you port existing PCIe-based accelerators to OpenCAPI
• Same interface as Xilinx Runtime & XDMA
[Diagram: OC/AXI DMA mode. On the host server, processes A, B, C each own a slave context with a job queue on the osnap library over libocxl and the ocxl driver. On the FPGA, the TLx/DLx stack feeds the snap core, whose odma and mmio blocks expose AXI4-Lite slave ports and AXI4 channels (MM/ST) to the Hardware Action (Verilog or HLS C/C++), plus on-card DRAM, config, and other peripherals.]
38. 38
Compliant with SNAP (CAPI1.0/2.0)
• A data width converter will be inserted automatically
• No need to change code
[Diagram: in OC/AXI Bridge mode, an axi_dwidth_converter sits between the snap core's 1024b data bridge and the 512b Hardware Action port, so actions written for CAPI-SNAP's 512b bus still work unchanged.]
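What the inserted converter does on the write path can be sketched as beat packing (illustrative code, with a simplifying assumption of even-length bursts; real converters also handle strobes, IDs, and burst splitting):

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <vector>

using Beat512  = std::array<uint8_t, 64>;    // CAPI-SNAP-era action word
using Beat1024 = std::array<uint8_t, 128>;   // OC-Accel data bridge word

// Upsizing direction: pack each pair of consecutive 512b beats into one
// 1024b beat (first beat in the low half, second in the high half).
std::vector<Beat1024> upsize_512_to_1024(const std::vector<Beat512> &in) {
    std::vector<Beat1024> out(in.size() / 2);
    for (size_t i = 0; i < out.size(); ++i) {
        std::memcpy(out[i].data(),      in[2 * i].data(),     64);
        std::memcpy(out[i].data() + 64, in[2 * i + 1].data(), 64);
    }
    return out;
}
```

Because this packing happens in the bridge, a 512b action sees the same interface it did under CAPI-SNAP while the link still moves full 1024b beats.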
39. 39
Clock Domains
[Diagram: in OC/AXI Bridge mode, the TLx/DLx stack runs at clk_tlx (400MHz); the snap core (config, data bridge, MMIO) at clk_afu (200MHz); and the Hardware Action at clk_act (200MHz by default).]
• User-defined action clock
40. 40
OpenCAPI stack: C1 mode
[Diagram: on the processor side, CPU cores with caches share the on-chip coherent interconnect and local main memory behind a memory interface; the host bus protocol stack is TL with its frame/parser (transaction layer), DL (data link layer), and PHY (physical layer). On the FPGA/ASIC, the mirror stack PHYX, DLX, and TLX with its frame/parser carries the TLX/AFU protocol to the Accelerator (C1), which has its own memory interface.]
41. 41
OpenCAPI stack: M1 mode
[Diagram: the same processor-side stack as C1 mode (CPU cores, caches, on-chip coherent interconnect, local main memory, TL/DL/PHY), but on the FPGA/ASIC the TLX/AFU protocol terminates in a Home Agent (M1) whose memory interface attaches device memory or SCM.]