Mais conteúdo relacionado Semelhante a Cisco's journey from Verbs to Libfabric (20) Mais de Jeff Squyres (19) Cisco's journey from Verbs to Libfabric1. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 1© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 1
Cisco’s Journey
From Verbs to Libfabric
Abondon the shackles of Verbs
Embrace the freedom of Libfabric
Jeffrey M. Squyres
Cisco Systems
23 September 2015
2. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 2
Application
Kernel
Cisco VIC ethX port
TCP stack
General Ethernet driver
enic.ko
Userspace
sockets API userspace library
Application
Verbs IB core
usnic.ko
Send and
receive
fast path
usNICTCP/IP
3. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 3© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 3
Verbs is a fine API.
…if you make InfiniBand hardware.
4. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 4© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 4
...but now there’s this libfabric thing
(see libfabric.org community for details)
5. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 5
Keep in mind, Cisco already supports
UD Verbs
6. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 6
• Monotonic enum
• Could not add popular Ethernet values
1500
9000
• usNIC verbs provider had to lie (!)
…just like iWARP providers
• MPI had to match verbs device with IP
interface to find real MTU
Verbs
IBV_MTU_256
IBV_MTU_512
IBV_MTU_1024
IBV_MTU_2048
IBV_MTU_4096
7. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 7
• Integer (not enum) endpoint attribute
Libfabric
8. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 8
• Integer (not enum) endpoint attribute
Libfabric
DONE
9. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 9
• Mandatory GRH structure
InfiniBand-specific header
• 40 bytes
UDP header is 42 bytes
…and a different format
• Breaks ib_ud_pingpong
• usnic verbs provider used “magic”
ibv_port_query() to return extensions
pointers
E.g., enable 42-byte UDP mode
Verbs
e
t
len chksmacdmac …
ver len
n
e
x
t
h
o
p
sgid dgid
UDP header: 42 bytes
GRH: 40 bytes
10. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 10
• FI_MSG_PREFIX and
ep_attr.msg_prefix_size
Libfabric
e
t
len chksmacdmac …
payload
11. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 11
• FI_MSG_PREFIX and
ep_attr.msg_prefix_size
Libfabric
e
t
len chksmacdmac …
payload
DONE
12. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 12
• Tuple: (device, port)
Usually a physical device and port
Does not match virtualized VIC hardware
• Queue pair
• Completion queue
Verbs
PU P#9
PCI 8086:1521
eth3
PCI 1137:0043
eth4
usnic_0
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:0043
ibv_device
ibv_port
QP QP CQ
QP
13. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 13
• Maps nicely to SR-IOV
• Fabric à PCI physical function (PF)
• Domain à PCI virtual function (VF)
• Endpoint à Resources in VF
PU P#9
PCI 8086:1521
eth3
PCI 1137:0043
eth4
usnic_0
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:0043
Libfabric
fi_fabric
fi_domain
fi_endpoint
(resources in domain)
EP EP CQ
EP
14. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 14
• GID and GUID
No easy mapping back to IP interface
• usnic verbs provider encoded MAC in
GID
Still cumbersome to map back to IP interface
• Could use RDMA CM
…but that would be a ton more code
Verbs
mac[0] = gid->raw[8] ^ 2;
mac[1] = gid->raw[9];
mac[2] = gid->raw[10];
mac[3] = gid->raw[13];
mac[4] = gid->raw[14];
mac[5] = gid->raw[15];
15. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 15
• Can use IP addressing directly
Libfabric
Everything is awesome
16. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 16
• Can use IP addressing directly
Libfabric
Everything is awesomeDONE
17. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 17
• Generic send call
ibv_post_send(…SG list…)
Lots of branches
• Wasteful allocations
• No prefixed receive
• Branching in completions
Verbs
18. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 18
• Multiple types of send calls
fi_send(buffer, …)
• Variable-length prefix receive
Provider-specific
• Fewer branches in completions
Libfabric
19. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 19
1.9
1.95
2
2.05
2.1
2.15
2.2
2.25
2.3
2.35
2.4
0.1 1 10 100
Time(microseconds)
Buffer size
Open MPI with usNIC: IMB PingPong Latency
imb-pingpong-ompi-1.8-verbs.out
imb-pingpong-ompi-1.8-libfabric.out
20. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 20
61000
62000
63000
64000
65000
66000
67000
68000
69000
1e+06
Bandwidth(megabits/second)
Buffer size
Open MPI with usNIC: IMB SendRecv Bandwidth
imb-sendrecv-ompi-1.8-verbs.out
imb-sendrecv-ompi-1.8-libfabric.out
21. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 21
• Performance issues
• Memory registration still a problem
• No MPI-style tag matching
• One-sided capabilities do not match MPI
• Network topology is a separate API
Verbs
22. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 22
• Performance happiness
• Many MPI-helpful features:
Tag matching
One-sided operations
Triggered operations
• Inherently designed to be more than just
point-to-point
• More work to be done… but promising
MMU notify
Network topology
Libfabric
23. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 23
• Long design discussions about how to
expose Ethernet / VIC concepts in the
verbs API
…usually with few good answers
Especially problematic with new VIC features over
time
• Conclusion: possible (obviously), but not
preferable
• Whole API designed with multiple vendor
hardware models in mind
• Much easier to match our hardware to
core Libfabric concepts
• Conclusion: much more preferable than
verbs
LibfabricVerbs
24. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24
Ok, so let’s do libfabric!
25. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25
Does it play well with MPI?
26. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 26
Byte Transport Layer
(BTL) plugins
Matching Transport
Layer (MTL) plugins
MPI_Send(…)
27. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 27
• Inherently multi-device
• Round-robin for
small messages
• Striping for large messages
• Major protocol decisions and MPI message matching driven by an Open MPI
engine
Byte Transport Layer
(BTL) plugins
28. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 28
Matching Transport
Layer (MTL) plugins
• Most details hidden by network API
• MXM
• Portals
• PSM
• As a side effect, must handle:
• Process loopback
• Server loopback (usually via shared memory)
29. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 29
Byte Transport Layer
(BTL) plugins
Matching Transport
Layer (MTL) plugins
• IB / iWarp (verbs)
• Portals
• SCIF
• Shared memory
• TCP
• uGNI
• usNIC (verbs)
• MXM
• Portals
• PSM
• PSM2
30. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 30
• IB / iWarp (verbs)
• Portals
• SCIF
• Shared memory
• TCP
• uGNI
• usNIC
Byte Transport Layer
(BTL) plugins
Matching Transport
Layer (MTL) plugins
• MXM
• Portals
• PSM
• PSM2
• ofi
libfabric
31. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 31
libfabric
usnic BTL ofi MTL
• Cisco developed
• usNIC-specific
• OFI point-to-point / UD
• Tested with usNIC
• Intel developed
• Provider neutral
• OFI tag matching
• Tested with PSM / PSM2
32. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 32
Bootstrapping
Message passing
There are two main parts of the usNIC BTL
33. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 33
verbs
bootstrapping
verbs
message passing
These two parts were previously
written to the Verbs API
34. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 34
verbs
bootstrapping
verbs
message passing
sideband
bootstrapping
1. Find the corresponding ethX device
2. Obtain MTU
3. Open usNIC-specific configuration
options
Per the previous slides, the Verbs API
requires some… help… in the form of
sideband bootstrapping
35. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 35
verbs
bootstrapping
verbs
message passing
sideband
bootstrapping
libfabric
bootstrapping
à
libfabric
message passing
à
Now let’s convert to use the libfabric API
36. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 36
verbs
bootstrapping
verbs
message passing
sideband
bootstrapping
libfabric
bootstrapping
à
libfabric
message passing
àPretty much a ~1:1 swap of verbs à libfabric calls
Bootstrapping sequence totally different / not comparable
…but libfabric needs no sideband bootstrapping
(got to delete several hundred lines of OMPI code – yay!)
37. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 37
• For a specific provider
Ask fi_getinfo() for
prov_name=“usnic”
• Use usNIC extensions
Netmask, link speed, IP device
name, etc.
• usNIC-specific error
messages
• For any tag-matching
provider
• No extension use
100% portable
• Generic error messages
usnic BTL ofi MTL
38. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 38
• For a specific provider
Ask fi_getinfo() for
prov_name=“usnic”
• Use usNIC extensions
Netmask, link speed, IP device
name, etc.
• usNIC-specific error
messages
• For any tag-matching
provider
• No extension use
100% portable
• Generic error messages
usnic BTL ofi MTL
Both libfabric usage models co-exist
(and play well with each other)
inside a single MPI implementation.
Proof positive
of successful co-design
of libfabric and MPI implementations.
39. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 39
• For a specific provider
Ask fi_getinfo() for
prov_name=“usnic”
• Use usNIC extensions
Netmask, link speed, IP device
name, etc.
• usNIC-specific error
messages
• For any tag-matching
provider
• No extension use
100% portable
• Generic error messages
usnic BTL ofi MTL
40. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 40
• Libfabric is the Way
Forward for Cisco
Open community
Matches our hardware
Performance benefits
Features benefits
• Libfabric matches MPI
Has features MPI has been
asking for… for years
Optimistic about its future
(come join us!)
http://libfabric.org
41. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 41
Thank you.