SlideShare uma empresa Scribd logo
1 de 41
Baixar para ler offline
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 1© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 1
Cisco’s Journey
From Verbs to Libfabric
Abondon the shackles of Verbs
Embrace the freedom of Libfabric
Jeffrey M. Squyres
Cisco Systems
23 September 2015
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 2
Application
Kernel
Cisco VIC ethX port
TCP stack
General Ethernet driver
enic.ko
Userspace
sockets API userspace library
Application
Verbs IB core
usnic.ko
Send and
receive
fast path
usNICTCP/IP
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 3© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 3
Verbs is a fine API.
…if you make InfiniBand hardware.
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 4© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 4
...but now there’s this libfabric thing
(see libfabric.org community for details)
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 5
Keep in mind, Cisco already supports
UD Verbs
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 6
•  Monotonic enum
•  Could not add popular Ethernet values
1500
9000
•  usNIC verbs provider had to lie (!)
…just like iWARP providers
•  MPI had to match verbs device with IP
interface to find real MTU
Verbs
IBV_MTU_256
IBV_MTU_512
IBV_MTU_1024
IBV_MTU_2048
IBV_MTU_4096
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 7
•  Integer (not enum) endpoint attribute
Libfabric
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 8
•  Integer (not enum) endpoint attribute
Libfabric
DONE
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 9
•  Mandatory GRH structure
InfiniBand-specific header
•  40 bytes
UDP header is 42 bytes
…and a different format
•  Breaks ib_ud_pingpong
•  usnic verbs provider used “magic”
ibv_port_query() to return extensions
pointers
E.g., enable 42-byte UDP mode
Verbs
e
t
len chksmacdmac …
ver len
n
e
x
t
h
o
p
sgid dgid
UDP header: 42 bytes
GRH: 40 bytes
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 10
•  FI_MSG_PREFIX and
ep_attr.msg_prefix_size
Libfabric
e
t
len chksmacdmac …
payload
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 11
•  FI_MSG_PREFIX and
ep_attr.msg_prefix_size
Libfabric
e
t
len chksmacdmac …
payload
DONE
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 12
•  Tuple: (device, port)
Usually a physical device and port
Does not match virtualized VIC hardware
•  Queue pair
•  Completion queue
Verbs
PU P#9
PCI 8086:1521
eth3
PCI 1137:0043
eth4
usnic_0
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:0043
ibv_device
ibv_port
QP QP CQ
QP
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 13
•  Maps nicely to SR-IOV
•  Fabric à PCI physical function (PF)
•  Domain à PCI virtual function (VF)
•  Endpoint à Resources in VF
PU P#9
PCI 8086:1521
eth3
PCI 1137:0043
eth4
usnic_0
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:0043
Libfabric
fi_fabric
fi_domain
fi_endpoint
(resources in domain)
EP EP CQ
EP
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 14
•  GID and GUID
No easy mapping back to IP interface
•  usnic verbs provider encoded MAC in
GID
Still cumbersome to map back to IP interface
•  Could use RDMA CM
…but that would be a ton more code
Verbs
mac[0] = gid->raw[8] ^ 2;
mac[1] = gid->raw[9];
mac[2] = gid->raw[10];
mac[3] = gid->raw[13];
mac[4] = gid->raw[14];
mac[5] = gid->raw[15];
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 15
•  Can use IP addressing directly
Libfabric
Everything is awesome
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 16
•  Can use IP addressing directly
Libfabric
Everything is awesomeDONE
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 17
•  Generic send call
ibv_post_send(…SG list…)
Lots of branches
•  Wasteful allocations
•  No prefixed receive
•  Branching in completions
Verbs
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 18
•  Multiple types of send calls
fi_send(buffer, …)
•  Variable-length prefix receive
Provider-specific
•  Fewer branches in completions
Libfabric
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 19
1.9
1.95
2
2.05
2.1
2.15
2.2
2.25
2.3
2.35
2.4
0.1 1 10 100
Time(microseconds)
Buffer size
Open MPI with usNIC: IMB PingPong Latency
imb-pingpong-ompi-1.8-verbs.out
imb-pingpong-ompi-1.8-libfabric.out
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 20
61000
62000
63000
64000
65000
66000
67000
68000
69000
1e+06
Bandwidth(megabits/second)
Buffer size
Open MPI with usNIC: IMB SendRecv Bandwidth
imb-sendrecv-ompi-1.8-verbs.out
imb-sendrecv-ompi-1.8-libfabric.out
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 21
•  Performance issues
•  Memory registration still a problem
•  No MPI-style tag matching
•  One-sided capabilities do not match MPI
•  Network topology is a separate API
Verbs
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 22
•  Performance happiness
•  Many MPI-helpful features:
Tag matching
One-sided operations
Triggered operations
•  Inherently designed to be more than just
point-to-point
•  More work to be done… but promising
MMU notify
Network topology
Libfabric
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 23
•  Long design discussions about how to
expose Ethernet / VIC concepts in the
verbs API
…usually with few good answers
Especially problematic with new VIC features over
time
•  Conclusion: possible (obviously), but not
preferable
•  Whole API designed with multiple vendor
hardware models in mind
•  Much easier to match our hardware to
core Libfabric concepts
•  Conclusion: much more preferable than
verbs
LibfabricVerbs
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24
Ok, so let’s do libfabric!
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25
Does it play well with MPI?
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 26
Byte Transport Layer
(BTL) plugins
Matching Transport
Layer (MTL) plugins
MPI_Send(…)
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 27
•  Inherently multi-device
•  Round-robin for
small messages
•  Striping for large messages
•  Major protocol decisions and MPI message matching driven by an Open MPI
engine
Byte Transport Layer
(BTL) plugins
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 28
Matching Transport
Layer (MTL) plugins
•  Most details hidden by network API
• MXM
• Portals
• PSM
•  As a side effect, must handle:
• Process loopback
• Server loopback (usually via shared memory)
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 29
Byte Transport Layer
(BTL) plugins
Matching Transport
Layer (MTL) plugins
•  IB / iWarp (verbs)
•  Portals
•  SCIF
•  Shared memory
•  TCP
•  uGNI
•  usNIC (verbs)
•  MXM
•  Portals
•  PSM
•  PSM2
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 30
•  IB / iWarp (verbs)
•  Portals
•  SCIF
•  Shared memory
•  TCP
•  uGNI
•  usNIC
Byte Transport Layer
(BTL) plugins
Matching Transport
Layer (MTL) plugins
•  MXM
•  Portals
•  PSM
•  PSM2
•  ofi
libfabric
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 31
libfabric
usnic BTL ofi MTL
•  Cisco developed
•  usNIC-specific
•  OFI point-to-point / UD
•  Tested with usNIC
•  Intel developed
•  Provider neutral
•  OFI tag matching
•  Tested with PSM / PSM2
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 32
Bootstrapping
Message passing
There are two main parts of the usNIC BTL
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 33
verbs
bootstrapping
verbs
message passing
These two parts were previously
written to the Verbs API
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 34
verbs
bootstrapping
verbs
message passing
sideband
bootstrapping
1.  Find the corresponding ethX device
2.  Obtain MTU
3.  Open usNIC-specific configuration
options
Per the previous slides, the Verbs API
requires some… help… in the form of
sideband bootstrapping
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 35
verbs
bootstrapping
verbs
message passing
sideband
bootstrapping
libfabric
bootstrapping
à
libfabric
message passing
à
Now let’s convert to use the libfabric API
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 36
verbs
bootstrapping
verbs
message passing
sideband
bootstrapping
libfabric
bootstrapping
à
libfabric
message passing
àPretty much a ~1:1 swap of verbs à libfabric calls
Bootstrapping sequence totally different / not comparable
…but libfabric needs no sideband bootstrapping
(got to delete several hundred lines of OMPI code – yay!)
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 37
•  For a specific provider
Ask fi_getinfo() for
prov_name=“usnic”
•  Use usNIC extensions
Netmask, link speed, IP device
name, etc.
•  usNIC-specific error
messages
•  For any tag-matching
provider
•  No extension use
100% portable
•  Generic error messages
usnic BTL ofi MTL
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 38
•  For a specific provider
Ask fi_getinfo() for
prov_name=“usnic”
•  Use usNIC extensions
Netmask, link speed, IP device
name, etc.
•  usNIC-specific error
messages
•  For any tag-matching
provider
•  No extension use
100% portable
•  Generic error messages
usnic BTL ofi MTL
Both libfabric usage models co-exist
(and play well with each other)
inside a single MPI implementation.
Proof positive
of successful co-design
of libfabric and MPI implementations.
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 39
•  For a specific provider
Ask fi_getinfo() for
prov_name=“usnic”
•  Use usNIC extensions
Netmask, link speed, IP device
name, etc.
•  usNIC-specific error
messages
•  For any tag-matching
provider
•  No extension use
100% portable
•  Generic error messages
usnic BTL ofi MTL
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 40
•  Libfabric is the Way
Forward for Cisco
Open community
Matches our hardware
Performance benefits
Features benefits
•  Libfabric matches MPI
Has features MPI has been
asking for… for years
Optimistic about its future
(come join us!)
http://libfabric.org
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 41
Thank you.

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

フロー技術によるネットワーク管理
フロー技術によるネットワーク管理フロー技術によるネットワーク管理
フロー技術によるネットワーク管理
 
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA ArchitectureCeph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
 
Nick Fisk - low latency Ceph
Nick Fisk - low latency CephNick Fisk - low latency Ceph
Nick Fisk - low latency Ceph
 
Seastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephSeastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for Ceph
 
Demystifying EVPN in the data center: Part 1 in 2 episode series
Demystifying EVPN in the data center: Part 1 in 2 episode seriesDemystifying EVPN in the data center: Part 1 in 2 episode series
Demystifying EVPN in the data center: Part 1 in 2 episode series
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
 
The Next Generation Firewall for Red Hat Enterprise Linux 7 RC
The Next Generation Firewall for Red Hat Enterprise Linux 7 RCThe Next Generation Firewall for Red Hat Enterprise Linux 7 RC
The Next Generation Firewall for Red Hat Enterprise Linux 7 RC
 
Session 1
Session 1Session 1
Session 1
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep Dive
 
Aci presentation
Aci presentationAci presentation
Aci presentation
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
 
Intelligent, Automatic Restarts for Unhealthy Kafka Consumers on Kubernetes w...
Intelligent, Automatic Restarts for Unhealthy Kafka Consumers on Kubernetes w...Intelligent, Automatic Restarts for Unhealthy Kafka Consumers on Kubernetes w...
Intelligent, Automatic Restarts for Unhealthy Kafka Consumers on Kubernetes w...
 
最近のOpenStackを振り返ってみよう
最近のOpenStackを振り返ってみよう最近のOpenStackを振り返ってみよう
最近のOpenStackを振り返ってみよう
 
root権限無しでKubernetesを動かす
root権限無しでKubernetesを動かす root権限無しでKubernetesを動かす
root権限無しでKubernetesを動かす
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
 
Deploying CloudStack and Ceph with flexible VXLAN and BGP networking
Deploying CloudStack and Ceph with flexible VXLAN and BGP networking Deploying CloudStack and Ceph with flexible VXLAN and BGP networking
Deploying CloudStack and Ceph with flexible VXLAN and BGP networking
 
NETCONFとYANGの話
NETCONFとYANGの話NETCONFとYANGの話
NETCONFとYANGの話
 
Disaster Recovery and High Availability with Kafka, SRM and MM2
Disaster Recovery and High Availability with Kafka, SRM and MM2Disaster Recovery and High Availability with Kafka, SRM and MM2
Disaster Recovery and High Availability with Kafka, SRM and MM2
 
Kubernetes dealing with storage and persistence
Kubernetes  dealing with storage and persistenceKubernetes  dealing with storage and persistence
Kubernetes dealing with storage and persistence
 

Semelhante a Cisco's journey from Verbs to Libfabric

MCET WebEx vs Lync competitive analysis,
MCET WebEx vs Lync competitive analysis,MCET WebEx vs Lync competitive analysis,
MCET WebEx vs Lync competitive analysis,
dark19921
 
Foreman-and-Puppet-for-Openstack-Audo-Deployment
Foreman-and-Puppet-for-Openstack-Audo-DeploymentForeman-and-Puppet-for-Openstack-Audo-Deployment
Foreman-and-Puppet-for-Openstack-Audo-Deployment
yating yang
 

Semelhante a Cisco's journey from Verbs to Libfabric (20)

Cisco usNIC libfabric provider
Cisco usNIC libfabric providerCisco usNIC libfabric provider
Cisco usNIC libfabric provider
 
2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL
 
Vbrownbag container networking for real workloads
Vbrownbag container networking for real workloadsVbrownbag container networking for real workloads
Vbrownbag container networking for real workloads
 
BRKACI-2102_Tshoot.pdf
BRKACI-2102_Tshoot.pdfBRKACI-2102_Tshoot.pdf
BRKACI-2102_Tshoot.pdf
 
DEVNET-1148 Leveraging Cisco OpenStack Private Cloud for Developers
DEVNET-1148	Leveraging Cisco OpenStack Private Cloud for DevelopersDEVNET-1148	Leveraging Cisco OpenStack Private Cloud for Developers
DEVNET-1148 Leveraging Cisco OpenStack Private Cloud for Developers
 
ACI Hands-on Lab
ACI Hands-on LabACI Hands-on Lab
ACI Hands-on Lab
 
InfluxDB Community Office Hours September 2020
InfluxDB Community Office Hours September 2020 InfluxDB Community Office Hours September 2020
InfluxDB Community Office Hours September 2020
 
(Very) Loose proposal to revamp MPI_INIT and MPI_FINALIZE
(Very) Loose proposal to revamp MPI_INIT and MPI_FINALIZE(Very) Loose proposal to revamp MPI_INIT and MPI_FINALIZE
(Very) Loose proposal to revamp MPI_INIT and MPI_FINALIZE
 
MCET WebEx vs Lync competitive analysis,
MCET WebEx vs Lync competitive analysis,MCET WebEx vs Lync competitive analysis,
MCET WebEx vs Lync competitive analysis,
 
Foreman-and-Puppet-for-Openstack-Audo-Deployment
Foreman-and-Puppet-for-Openstack-Audo-DeploymentForeman-and-Puppet-for-Openstack-Audo-Deployment
Foreman-and-Puppet-for-Openstack-Audo-Deployment
 
Triangle Kubernetes Meetup: Container cloud networking - Contiv for K8S & Ope...
Triangle Kubernetes Meetup: Container cloud networking - Contiv for K8S & Ope...Triangle Kubernetes Meetup: Container cloud networking - Contiv for K8S & Ope...
Triangle Kubernetes Meetup: Container cloud networking - Contiv for K8S & Ope...
 
Integration and Interoperation of existing Nexus networks into an ACI Archite...
Integration and Interoperation of existing Nexus networks into an ACI Archite...Integration and Interoperation of existing Nexus networks into an ACI Archite...
Integration and Interoperation of existing Nexus networks into an ACI Archite...
 
Advanced coding & deployment for Cisco Video Devices - CL20B - DEVNET-3244
Advanced coding & deployment for Cisco Video Devices - CL20B - DEVNET-3244Advanced coding & deployment for Cisco Video Devices - CL20B - DEVNET-3244
Advanced coding & deployment for Cisco Video Devices - CL20B - DEVNET-3244
 
Application Centric Infrastructure (ACI), the policy driven data centre
Application Centric Infrastructure (ACI), the policy driven data centreApplication Centric Infrastructure (ACI), the policy driven data centre
Application Centric Infrastructure (ACI), the policy driven data centre
 
Luca Relandini - Microservices and containers networking: Contiv, deep dive a...
Luca Relandini - Microservices and containers networking: Contiv, deep dive a...Luca Relandini - Microservices and containers networking: Contiv, deep dive a...
Luca Relandini - Microservices and containers networking: Contiv, deep dive a...
 
Cisco Live 2017: Container networking deep dive with Docker Enterprise Editio...
Cisco Live 2017: Container networking deep dive with Docker Enterprise Editio...Cisco Live 2017: Container networking deep dive with Docker Enterprise Editio...
Cisco Live 2017: Container networking deep dive with Docker Enterprise Editio...
 
Microservices and containers networking: Contiv, an industry leading open sou...
Microservices and containers networking: Contiv, an industry leading open sou...Microservices and containers networking: Contiv, an industry leading open sou...
Microservices and containers networking: Contiv, an industry leading open sou...
 
Docker Enterprise Networking and Cisco Contiv - Cisco Live 2017 BRKSDN-2256
Docker Enterprise Networking and Cisco Contiv - Cisco Live 2017 BRKSDN-2256Docker Enterprise Networking and Cisco Contiv - Cisco Live 2017 BRKSDN-2256
Docker Enterprise Networking and Cisco Contiv - Cisco Live 2017 BRKSDN-2256
 
Breizhcamp: Créer un bot, pas si simple. Faisons le point.
Breizhcamp: Créer un bot, pas si simple. Faisons le point.Breizhcamp: Créer un bot, pas si simple. Faisons le point.
Breizhcamp: Créer un bot, pas si simple. Faisons le point.
 
Microservices and containers networking: Contiv, an industry leading open sou...
Microservices and containers networking: Contiv, an industry leading open sou...Microservices and containers networking: Contiv, an industry leading open sou...
Microservices and containers networking: Contiv, an industry leading open sou...
 

Mais de Jeff Squyres

Cisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentationCisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentation
Jeff Squyres
 

Mais de Jeff Squyres (19)

Open MPI State of the Union X SC'16 BOF
Open MPI State of the Union X SC'16 BOFOpen MPI State of the Union X SC'16 BOF
Open MPI State of the Union X SC'16 BOF
 
MPI Sessions: a proposal to the MPI Forum
MPI Sessions: a proposal to the MPI ForumMPI Sessions: a proposal to the MPI Forum
MPI Sessions: a proposal to the MPI Forum
 
MPI Fourm SC'15 BOF
MPI Fourm SC'15 BOFMPI Fourm SC'15 BOF
MPI Fourm SC'15 BOF
 
Open MPI SC'15 State of the Union BOF
Open MPI SC'15 State of the Union BOFOpen MPI SC'15 State of the Union BOF
Open MPI SC'15 State of the Union BOF
 
Fun with Github webhooks: verifying Signed-off-by
Fun with Github webhooks: verifying Signed-off-byFun with Github webhooks: verifying Signed-off-by
Fun with Github webhooks: verifying Signed-off-by
 
Open MPI new version number scheme and roadmap
Open MPI new version number scheme and roadmapOpen MPI new version number scheme and roadmap
Open MPI new version number scheme and roadmap
 
2014 01-21-mpi-community-feedback
2014 01-21-mpi-community-feedback2014 01-21-mpi-community-feedback
2014 01-21-mpi-community-feedback
 
(Open) MPI, Parallel Computing, Life, the Universe, and Everything
(Open) MPI, Parallel Computing, Life, the Universe, and Everything(Open) MPI, Parallel Computing, Life, the Universe, and Everything
(Open) MPI, Parallel Computing, Life, the Universe, and Everything
 
Cisco usNIC: how it works, how it is used in Open MPI
Cisco usNIC: how it works, how it is used in Open MPICisco usNIC: how it works, how it is used in Open MPI
Cisco usNIC: how it works, how it is used in Open MPI
 
Cisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentationCisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentation
 
Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)
Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)
Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)
 
MPI History
MPI HistoryMPI History
MPI History
 
MOSSCon 2013, Cisco Open Source talk
MOSSCon 2013, Cisco Open Source talkMOSSCon 2013, Cisco Open Source talk
MOSSCon 2013, Cisco Open Source talk
 
Ethernet and TCP optimizations
Ethernet and TCP optimizationsEthernet and TCP optimizations
Ethernet and TCP optimizations
 
Friends don't let friends leak MPI_Requests
Friends don't let friends leak MPI_RequestsFriends don't let friends leak MPI_Requests
Friends don't let friends leak MPI_Requests
 
MPI-3 Timer requests proposal
MPI-3 Timer requests proposalMPI-3 Timer requests proposal
MPI-3 Timer requests proposal
 
MPI_Mprobe is good for you
MPI_Mprobe is good for youMPI_Mprobe is good for you
MPI_Mprobe is good for you
 
The Message Passing Interface (MPI) in Layman's Terms
The Message Passing Interface (MPI) in Layman's TermsThe Message Passing Interface (MPI) in Layman's Terms
The Message Passing Interface (MPI) in Layman's Terms
 
What is [Open] MPI?
What is [Open] MPI?What is [Open] MPI?
What is [Open] MPI?
 

Último

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 

Cisco's journey from Verbs to Libfabric

  • 1. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 1© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 1 Cisco’s Journey From Verbs to Libfabric Abondon the shackles of Verbs Embrace the freedom of Libfabric Jeffrey M. Squyres Cisco Systems 23 September 2015
  • 2. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 2 Application Kernel Cisco VIC ethX port TCP stack General Ethernet driver enic.ko Userspace sockets API userspace library Application Verbs IB core usnic.ko Send and receive fast path usNICTCP/IP
  • 3. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 3© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 3 Verbs is a fine API. …if you make InfiniBand hardware.
  • 4. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 4© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 4 ...but now there’s this libfabric thing (see libfabric.org community for details)
  • 5. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 5 Keep in mind, Cisco already supports UD Verbs
  • 6. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 6 •  Monotonic enum •  Could not add popular Ethernet values 1500 9000 •  usNIC verbs provider had to lie (!) …just like iWARP providers •  MPI had to match verbs device with IP interface to find real MTU Verbs IBV_MTU_256 IBV_MTU_512 IBV_MTU_1024 IBV_MTU_2048 IBV_MTU_4096
  • 7. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 7 •  Integer (not enum) endpoint attribute Libfabric
  • 8. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 8 •  Integer (not enum) endpoint attribute Libfabric DONE
  • 9. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 9 •  Mandatory GRH structure InfiniBand-specific header •  40 bytes UDP header is 42 bytes …and a different format •  Breaks ib_ud_pingpong •  usnic verbs provider used “magic” ibv_port_query() to return extensions pointers E.g., enable 42-byte UDP mode Verbs e t len chksmacdmac … ver len n e x t h o p sgid dgid UDP header: 42 bytes GRH: 40 bytes
  • 10. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 10 •  FI_MSG_PREFIX and ep_attr.msg_prefix_size Libfabric e t len chksmacdmac … payload
  • 11. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 11 •  FI_MSG_PREFIX and ep_attr.msg_prefix_size Libfabric e t len chksmacdmac … payload DONE
  • 12. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 12 •  Tuple: (device, port) Usually a physical device and port Does not match virtualized VIC hardware •  Queue pair •  Completion queue Verbs PU P#9 PCI 8086:1521 eth3 PCI 1137:0043 eth4 usnic_0 PCI 1137:00cf PCI 1137:00cf PCI 1137:00cf PCI 1137:00cf PCI 1137:0043 ibv_device ibv_port QP QP CQ QP
  • 13. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 13 •  Maps nicely to SR-IOV •  Fabric à PCI physical function (PF) •  Domain à PCI virtual function (VF) •  Endpoint à Resources in VF PU P#9 PCI 8086:1521 eth3 PCI 1137:0043 eth4 usnic_0 PCI 1137:00cf PCI 1137:00cf PCI 1137:00cf PCI 1137:00cf PCI 1137:0043 Libfabric fi_fabric fi_domain fi_endpoint (resources in domain) EP EP CQ EP
  • 14. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 14 •  GID and GUID No easy mapping back to IP interface •  usnic verbs provider encoded MAC in GID Still cumbersome to map back to IP interface •  Could use RDMA CM …but that would be a ton more code Verbs mac[0] = gid->raw[8] ^ 2; mac[1] = gid->raw[9]; mac[2] = gid->raw[10]; mac[3] = gid->raw[13]; mac[4] = gid->raw[14]; mac[5] = gid->raw[15];
  • 15. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 15 •  Can use IP addressing directly Libfabric Everything is awesome
  • 16. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 16 •  Can use IP addressing directly Libfabric Everything is awesomeDONE
  • 17. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 17 •  Generic send call ibv_post_send(…SG list…) Lots of branches •  Wasteful allocations •  No prefixed receive •  Branching in completions Verbs
  • 18. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 18 •  Multiple types of send calls fi_send(buffer, …) •  Variable-length prefix receive Provider-specific •  Fewer branches in completions Libfabric
  • 19. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 19 1.9 1.95 2 2.05 2.1 2.15 2.2 2.25 2.3 2.35 2.4 0.1 1 10 100 Time(microseconds) Buffer size Open MPI with usNIC: IMB PingPong Latency imb-pingpong-ompi-1.8-verbs.out imb-pingpong-ompi-1.8-libfabric.out
  • 20. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 20 61000 62000 63000 64000 65000 66000 67000 68000 69000 1e+06 Bandwidth(megabits/second) Buffer size Open MPI with usNIC: IMB SendRecv Bandwidth imb-sendrecv-ompi-1.8-verbs.out imb-sendrecv-ompi-1.8-libfabric.out
  • 21. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 21 •  Performance issues •  Memory registration still a problem •  No MPI-style tag matching •  One-sided capabilities do not match MPI •  Network topology is a separate API Verbs
  • 22. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 22 •  Performance happiness •  Many MPI-helpful features: Tag matching One-sided operations Triggered operations •  Inherently designed to be more than just point-to-point •  More work to be done… but promising MMU notify Network topology Libfabric
  • 23. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 23 •  Long design discussions about how to expose Ethernet / VIC concepts in the verbs API …usually with few good answers Especially problematic with new VIC features over time •  Conclusion: possible (obviously), but not preferable •  Whole API designed with multiple vendor hardware models in mind •  Much easier to match our hardware to core Libfabric concepts •  Conclusion: much more preferable than verbs LibfabricVerbs
  • 24. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24 Ok, so let’s do libfabric!
  • 25. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25 Does it play well with MPI?
  • 26. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 26 Byte Transport Layer (BTL) plugins Matching Transport Layer (MTL) plugins MPI_Send(…)
  • 27. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 27 •  Inherently multi-device •  Round-robin for small messages •  Striping for large messages •  Major protocol decisions and MPI message matching driven by an Open MPI engine Byte Transport Layer (BTL) plugins
  • 28. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 28 Matching Transport Layer (MTL) plugins •  Most details hidden by network API • MXM • Portals • PSM •  As a side effect, must handle: • Process loopback • Server loopback (usually via shared memory)
  • 29. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 29 Byte Transport Layer (BTL) plugins Matching Transport Layer (MTL) plugins •  IB / iWarp (verbs) •  Portals •  SCIF •  Shared memory •  TCP •  uGNI •  usNIC (verbs) •  MXM •  Portals •  PSM •  PSM2
  • 30. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 30 •  IB / iWarp (verbs) •  Portals •  SCIF •  Shared memory •  TCP •  uGNI •  usNIC Byte Transport Layer (BTL) plugins Matching Transport Layer (MTL) plugins •  MXM •  Portals •  PSM •  PSM2 •  ofi libfabric
  • 31. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 31 libfabric usnic BTL ofi MTL •  Cisco developed •  usNIC-specific •  OFI point-to-point / UD •  Tested with usNIC •  Intel developed •  Provider neutral •  OFI tag matching •  Tested with PSM / PSM2
  • 32. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 32 Bootstrapping Message passing There are two main parts of the usNIC BTL
  • 33. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 33 verbs bootstrapping verbs message passing These two parts were previously written to the Verbs API
  • 34. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 34 verbs bootstrapping verbs message passing sideband bootstrapping 1.  Find the corresponding ethX device 2.  Obtain MTU 3.  Open usNIC-specific configuration options Per the previous slides, the Verbs API requires some… help… in the form of sideband bootstrapping
  • 35. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 35 verbs bootstrapping verbs message passing sideband bootstrapping libfabric bootstrapping à libfabric message passing à Now let’s convert to use the libfabric API
  • 36. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 36 verbs bootstrapping verbs message passing sideband bootstrapping libfabric bootstrapping à libfabric message passing àPretty much a ~1:1 swap of verbs à libfabric calls Bootstrapping sequence totally different / not comparable …but libfabric needs no sideband bootstrapping (got to delete several hundred lines of OMPI code – yay!)
  • 37. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 37 •  For a specific provider Ask fi_getinfo() for prov_name=“usnic” •  Use usNIC extensions Netmask, link speed, IP device name, etc. •  usNIC-specific error messages •  For any tag-matching provider •  No extension use 100% portable •  Generic error messages usnic BTL ofi MTL
  • 38. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 38 •  For a specific provider Ask fi_getinfo() for prov_name=“usnic” •  Use usNIC extensions Netmask, link speed, IP device name, etc. •  usNIC-specific error messages •  For any tag-matching provider •  No extension use 100% portable •  Generic error messages usnic BTL ofi MTL Both libfabric usage models co-exist (and play well with each other) inside a single MPI implementation. Proof positive of successful co-design of libfabric and MPI implementations.
  • 39. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 39 •  For a specific provider Ask fi_getinfo() for prov_name=“usnic” •  Use usNIC extensions Netmask, link speed, IP device name, etc. •  usNIC-specific error messages •  For any tag-matching provider •  No extension use 100% portable •  Generic error messages usnic BTL ofi MTL
  • 40. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 40 •  Libfabric is the Way Forward for Cisco Open community Matches our hardware Performance benefits Features benefits •  Libfabric matches MPI Has features MPI has been asking for… for years Optimistic about its future (come join us!) http://libfabric.org
  • 41. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 41 Thank you.