- 1. © 2020 Western Digital Corporation or its affiliates. All rights reserved. 11/18/20
Building Cache-Coherent Scaleout
Systems With OmniXtend
Atish Patra, Tu Dang, Anup Patel, Damien Le Moal, Dejan Vucinic
- 2.
Agenda
• Compute Node Architecture
• System Models
• Unified boot process
• Protocol Simulation
• System emulation
• Conclusion
- 3.
OmniXtend Architectures
Source: An Open and Coherent Memory Centric Architecture Enabled by RISC-V, Dejan Vucinic
OmniXtend is a fully open cache-coherence protocol that works over the Ethernet layer (L2)
- 4.
OmniXtend Reference Design
Memory Fabric Innovation Platform
Standardize RISC-V coherency bus leveraging OmniXtend
- 5.
An OmniXtend Compute Node
High-level view of each compute node
[Diagram: each compute node has HARTs and local devices (PLIC, CLINT, PCIe, Ethernet, SD/eMMC, SPI, other devices) attached to a TileLink interconnect; local DRAM and an OmniXtend bus connect through the network to remote memory.]
- 6.
Compute Node Address Space
High-level view of physical address space
Address regions: Local RAM (cacheable but not shared), Local MMIO (non-cacheable), Global RAM (cacheable), Global MMIO (non-cacheable)
• At reset, a node's own RAM is mapped to a distinct part of the Global RAM space
• RAM from other nodes can be mapped into the Global RAM space
• The Local MMIO space always maps to a node's own MMIO devices
• MMIO devices of other nodes are mapped into Global MMIO
• A node's own MMIO devices can also be mapped into the Global MMIO space
• Local RAM is accessible only to the local node
- 7.
OmniXtend Hardware Design
High-level view
• M dynamic crossbar (cacheable channels)
– Reads/writes to the RAM of other nodes use both Tx and Rx channels
– Outgoing/incoming cache-coherency messages use the Tx/Rx channels
– Dedicated TLoE endpoint, similar to the C dynamic crossbar
• C dynamic crossbar (non-cacheable channels)
– Reads/writes to the MMIO of other nodes use both Tx and Rx channels
– Each Tx/Rx channel has a dedicated TLoE endpoint
• Address split: cacheable (M)emory space at or above 0x20_0000_0000, (C)ontrol/MMIO space below 0x20_0000_0000
[Diagram: cores issue requests through AddressControl (HartId, M and C prefixes) and local/remote AddressAdjusters to the M and C dynamic crossbars, local memory, and local devices; each crossbar's TLoE TX/RX endpoints feed PacketPicker TX/RX (which add/strip the Ethernet header) and the MAC TX/RX toward the network; an alternative configuration places the cacheable (M)emory space above 0x4000_0000_0000 and the (C)ontrol MMIO space at or below 0x3FFF_FFFF_FFFF.]
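As a minimal sketch of the M/C split described above, the routing decision can be expressed as a single comparison against the boundary. The region names and helper are illustrative, not the hardware's actual interface; the boundary value is the 0x20_0000_0000 split from this slide.

```c
#include <stdint.h>

/* Illustrative sketch of the M/C address split: the boundary mirrors the
 * slide, but the function name and enum are hypothetical. */
#define M_SPACE_BASE 0x2000000000ULL  /* 0x20_0000_0000 */

typedef enum {
    SPACE_C_MMIO,   /* (C)ontrol/MMIO space, routed to the C dynamic crossbar */
    SPACE_M_MEMORY  /* cacheable (M)emory space, routed to the M dynamic crossbar */
} space_t;

static space_t classify(uint64_t paddr)
{
    /* Addresses at or above the boundary go to the M crossbar (cacheable
     * memory); everything below goes to the C crossbar (control/MMIO). */
    return (paddr >= M_SPACE_BASE) ? SPACE_M_MEMORY : SPACE_C_MMIO;
}
```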
- 8.
Single OS (SO) Model
Similar to a regular SMP/NUMA system
[Diagram: Node 0 (CPU0–CPU3, memory0) and Node 1 (CPU0–CPU3, memory1) connected by the OmniXtend fabric appear to the OS as a single system: NUMA node 0 (CPU0–CPU3, memory0) and NUMA node 1 (CPU4–CPU7, memory1).]
- 9.
Current Status
• Up to 4 OmniXtend nodes with NUMA enabled
• Dynamic topology reconfiguration with DIP switch
• Point-to-Point via SN2010 switch
• Based on latest upstream Linux kernel (v5.9)
Working NUMA setup
- 10.
Can we do better?
To achieve better scale-out models
• The NUMA model doesn't scale beyond a few nodes
– Expensive
– Power-management issues
– Reliability
• Distributed application models over clustered network servers are widely adopted
– Scale well, but with higher latency
– Require a new programming model in the absence of cache coherency
• We need a cache-coherent system that can scale!
- 11.
Independent Nodes (IN) Model
An experimental approach that can scale
• Each node boots an independent instance of Linux
• Nodes access shared memory exposed by a kernel driver
[Diagram: Node 0 (CPU0–CPU3, memory0) and Node 1 (CPU0–CPU3, memory1) connected by the OmniXtend fabric, with a shared memory region spanning both.]
- 12.
OmniXtend Unified Boot Protocol
A single boot process for all system models
• Allows either the SO or the IN model to run on the same hardware after reset
• On-demand reconfiguration based on application requirements
• Opens many possibilities for experimentation and exploration of research problems
• Easy performance benchmarking
• No designated leader required
- 13.
Management Server
Manages dynamic discovery and initialization
• The management server can be implemented in
– a separate host machine connected to the OmniXtend network, or
– one of the compute nodes
• It does not need to participate in all OmniXtend traffic, but must be able to communicate with all nodes via raw Ethernet frames
• It assigns each node's role and memory ranges
– Every node boots up independently and waits for its role to be assigned by the management server
– The implementation is flexible and can be used to create different topologies
[Diagram: the management server connected to Node 0, Node 2, and Node 3 (compute & memory), a memory-only node, and a hardware-accelerator node.]
- 14.
Boot process (SO system)
Similar to a regular NUMA system boot flow
• The BSP (bootstrap processor, i.e. the first HART to run) wakes up the other HARTs once low-level memory initialization is completed in OpenSBI
[Diagram: Board1, Board2, and Board3 each run ZSBL (M-mode, ROM) → FSBL (M-mode, loader) → OpenSBI (M-mode); a single Linux (S-mode) OS instance spans the boards, and all remote memory is accessible by kernel and user space.]
- 15.
Boot process (IN system)
A distributed system with shared memory
[Diagram: Board1, Board2, and Board3 each run ZSBL (M-mode, ROM) → FSBL (M-mode, loader) → OpenSBI (M-mode) and boot their own Linux (S-mode) instance, with on-demand shared memory between them.]
• Shared memory is implemented on demand (e.g. via mmap()) with remapping of dynamically allocated node-local and remote DRAM pages
• Can scale up to hundreds of nodes
- 16.
Can we develop faster?
A software-centric approach
• Simulation/emulation can fast-track research
• Easy to experiment and explore through quick successive iterations in software
• Reduces CapEx for research and development
• Verifies the correctness and scalability of the protocol early
• Easy to extend to other ISAs (x86, ARM64) in the future
- 17.
What is protocol simulation?
• Simulates the interaction of OmniXtend endpoints
• Implemented as a C library
• Includes unit tests for verification of the protocol
• Adapts quickly to protocol updates
- 18.
Extendable plugins in the library
Allowing configurability and extensibility
• Plugins are independent of each other
• Developers can provide their own implementations
• Current implementation
– OX_CACHE: configurable LRU cache
– OX_PROTOCOL: TileLink over Ethernet (TLoE)
– OX_TRANSPORT: raw socket
[Diagram: layered plugins under a DEVICE layer — PROTOCOL (snoopy, directory-based), CACHE (direct-mapped, LRU, FIFO), and TRANSPORT (raw socket, IP/TCP, RDMA).]
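One common way to express such interchangeable layers in a C library is a table of function pointers per plugin, so a raw-socket transport can be swapped for IP/TCP or RDMA without touching the protocol layer. The struct and names below are an illustrative sketch, not the library's real API; an in-memory loopback stands in for the raw-socket plugin.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical plugin interface: each transport is a table of function
 * pointers the protocol layer calls through. */
typedef struct ox_transport {
    int (*send)(struct ox_transport *t, const void *buf, size_t len);
    int (*recv)(struct ox_transport *t, void *buf, size_t len);
    void *priv;  /* plugin-private state */
} ox_transport_t;

/* In-memory loopback implementation, standing in for the raw-socket
 * plugin: send stores one frame, recv returns it. */
static unsigned char loop_buf[2048];
static size_t loop_len;

static int loop_send(ox_transport_t *t, const void *buf, size_t len)
{
    (void)t;
    if (len > sizeof(loop_buf)) return -1;
    memcpy(loop_buf, buf, len);
    loop_len = len;
    return (int)len;
}

static int loop_recv(ox_transport_t *t, void *buf, size_t len)
{
    (void)t;
    if (len < loop_len) return -1;
    memcpy(buf, loop_buf, loop_len);
    return (int)loop_len;
}

static ox_transport_t loopback_transport = { loop_send, loop_recv, NULL };
```

A developer's own IP/TCP or RDMA plugin would simply provide another such table.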
- 19.
Protocol simulation overview
[Diagram: user input drives a test application (OXPS_APP) and GoogleTest suites over the protocol simulation library's OXPS_DEV device, which combines OX_PROTOCOL, OX_TRANSPORT, and OX_CACHE with an address checker and memory-ops callback; the same library backs the OXSE_DEV device exercised by QTest in system emulation.]
- 20.
What is system emulation?
Implements the entire OmniXtend protocol in QEMU
• An OmniXtend system consists of distributed compute, memory, storage, and optional accelerator nodes
• Nodes are connected to a traditional switch, or to a programmable switch with enhanced features
[Diagram: a switch connecting OmniXtend Node 0 and Node 1 (QEMU emulation inside the kernel, behind a phy), NVM devices behind an FPGA/ASIC phy, and OmniXtend Node 2 with an ML accelerator behind a phy.]
- 21.
System Emulation Overview
A QEMU device emulates the OmniXtend device
[Diagram: inside the QEMU process, the guest kernel and user space see a virtio net device, a virtio disk, and an OmniXtend MMIO device. The OmniXtend device implements the protocol: an address checker and memory-ops callback handle reads/writes to non-cacheable remote memory, while reads/writes to cacheable memory go through an LRU cache backed by local and remote memory backends; remote cacheable memory requests and Ethernet packets reach the host phy through a raw socket.]
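The LRU cache the emulated device keeps in front of remote cacheable memory can be sketched as a tiny fully-associative array with least-recently-used eviction. This is a minimal illustration under assumed names and a fixed way count; the real QEMU device's configurable cache may be structured differently.

```c
#include <stdint.h>

/* Tiny fully-associative cache with LRU eviction (illustrative sketch). */
#define CACHE_WAYS 4

typedef struct {
    uint64_t tag;     /* cache-line address */
    uint64_t data;    /* stand-in for the line's contents */
    unsigned stamp;   /* larger = more recently used */
    int valid;
} cache_line_t;

static cache_line_t lines[CACHE_WAYS];
static unsigned tick;

/* Returns 1 on hit (data in *out), 0 on miss (remote fetch needed). */
static int cache_lookup(uint64_t tag, uint64_t *out)
{
    for (int i = 0; i < CACHE_WAYS; i++) {
        if (lines[i].valid && lines[i].tag == tag) {
            lines[i].stamp = ++tick;  /* touch: now most recently used */
            *out = lines[i].data;
            return 1;
        }
    }
    return 0;
}

/* Insert a line fetched from remote memory, evicting the LRU way if full. */
static void cache_fill(uint64_t tag, uint64_t data)
{
    int victim = 0;
    for (int i = 0; i < CACHE_WAYS; i++) {
        if (!lines[i].valid) { victim = i; break; }
        if (lines[i].stamp < lines[victim].stamp) victim = i;
    }
    lines[victim] = (cache_line_t){ tag, data, ++tick, 1 };
}
```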
- 22.
Simulation & Emulation Summary
They co-exist and complement each other

System emulation
• Pros: complete system-level emulation; verifies the initialization protocol; can bring up a large number of nodes
• Cons: low performance; can't emulate hardware accelerators

Protocol simulation
• Pros: simple to set up; test suites ensure correctness; doesn't require special hardware
• Cons: cannot boot operating systems; low performance
- 23.
Future work
To infinity and beyond!
• Implement kernel support for the Independent Node model
– User-space applications can mmap and use the shared memory
– The kernel may choose to protect its own page-table memory from remote access via PMP
– A kernel driver may handle remote-memory page faults
– A user-space library adds support for existing distributed application frameworks
• Experiment with different topologies using different rules in the management server
• Improve simulation/emulation with extensive test suites
• Hook up QEMU instances with OmniXtend FPGAs
• Exercise scalability tests with many nodes
• Open-source the simulation library and the QEMU emulation
- 24.
- 26.
Types of messages

Type         Sender  Receiver  Transmission  Payload                    Intended action
HELLO        OE      MS        Broadcast     None                       Send and wait
WELCOME      MS      OE        Unicast       None                       Continue
MY_INFO      OE      MS        Unicast       Local OE DTB               Send and wait
DTB_INFO     MS      OE        Unicast       Full OmniXtend system DTB  Parse the OmniXtend DT node and configure channels
CONFIG_DONE  OE      MS        Unicast       None                       Send and wait
ENABLE       MS      OE        Unicast       None                       Enable TLoE in the order given in the DT
ENABLE_DONE  OE      MS        Unicast       None                       Send and wait
BOOT         MS      OE        Unicast       None                       Reconfigure memory and boot

• OE: OmniXtend Endpoint
• MS: Management Server
- 27.
Booting SO system
Before TLoE is enabled
[Sequence between OE (Boot), MS, and OE (Non-Boot): the boot hart on each OE configures Ethernet, sends HELLO, and waits for WELCOME; the MS unicasts a WELCOME packet to each. Each OE then sends MY_INFO and waits for DTB_INFO; the MS parses the OmniXtend topology and multicasts the full DTB with the topology to the booting and non-booting nodes. Each OE parses the topology, configures its TX/RX channels, sends CONFIG_DONE, and waits for ENABLE. After CONFIG_DONE is received from every node, the MS sends ENABLE, and each OE enables its TLoE endpoints in the given order.]
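The OE side of the handshake above is a simple linear progression, which can be sketched as a state machine; the state names and stepping function are illustrative, not the firmware's actual code.

```c
/* Sketch of the OE side of the pre-TLoE boot handshake. */
typedef enum {
    ST_SEND_HELLO,      /* configure Ethernet, send HELLO, wait for WELCOME */
    ST_SEND_MY_INFO,    /* send MY_INFO, wait for DTB_INFO */
    ST_CONFIG_CHANNELS, /* parse topology, configure TX/RX, send CONFIG_DONE */
    ST_ENABLE_TLOE,     /* ENABLE received: enable TLoE endpoints in order */
    ST_DONE
} oe_state_t;

/* Advance one step when the awaited MS reply arrives. */
static oe_state_t oe_step(oe_state_t s)
{
    switch (s) {
    case ST_SEND_HELLO:      return ST_SEND_MY_INFO;    /* got WELCOME */
    case ST_SEND_MY_INFO:    return ST_CONFIG_CHANNELS; /* got DTB_INFO */
    case ST_CONFIG_CHANNELS: return ST_ENABLE_TLOE;     /* got ENABLE */
    default:                 return ST_DONE;
    }
}
```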
- 28.
Booting SO system
After TLoE is enabled
[Sequence: each OE configures its MMIO devices, sends ENABLE_DONE, and waits for BOOT. After ENABLE_DONE is received from every node, the MS sends BOOT. The non-boot OE reconfigures its MMIO addresses and its non-boot harts warm-boot. The boot OE reconfigures its cacheable memory addresses, parses the full DT, and its boot hart jumps to Linux through a normal OpenSBI boot.]
- 29.
Booting IN system
Before TLoE is enabled
[Sequence between the two booting OEs and the MS: identical to the SO flow — each OE configures Ethernet, sends HELLO, and waits for WELCOME; the MS unicasts WELCOME. Each OE sends MY_INFO and waits for DTB_INFO; the MS parses the topology and multicasts the full DTB with the OmniXtend topology. Each OE parses the topology, configures its TX/RX channels, sends CONFIG_DONE, and waits for ENABLE; after CONFIG_DONE is received from every node, the MS sends ENABLE and each OE enables its TLoE endpoints in the given order. The difference from SO is that both nodes act as booting nodes.]
- 30.
Booting IN system
After TLoE is enabled
[Sequence: each OE configures its MMIO devices, sends ENABLE_DONE, and waits for BOOT. After ENABLE_DONE is received from every node, the MS sends BOOT. Each OE then reconfigures its MMIO and cacheable memory addresses, its non-boot harts warm-boot, and its boot hart jumps to Linux through a normal OpenSBI boot. On-demand shared memory is then provided by a Linux OmniXtend driver.]