Today's data center managers are burdened by a lack of aligned information across multiple layers. Workflow events like 'job starts', aligned with performance metrics and events extracted from log facilities, are low-hanging fruit that is about to become usable thanks to open-source software such as Graphite, StatsD, logstash and the like.
This talk aims to show the benefits of merging multiple layers of information within an InfiniBand cluster, using use cases for level 1/2/3 personnel.
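To make this concrete, here is a minimal sketch (not taken from the talk) of what aligning a workflow event with metrics could look like: a 'job start' is emitted both as a StatsD counter and as a log line that logstash could pick up. Host names, ports and metric names are illustrative assumptions.
# Minimal sketch, not from the talk: emit a 'job start' both as a StatsD
# counter (UDP datagram "<name>:<value>|c") and as a log line for logstash.
# The StatsD host, port and metric name are assumptions.
import logging
import socket

def statsd_incr(metric, host='statsd', port=8125):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(('%s:1|c' % metric).encode(), (host, port))

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')

def on_job_start(job_id, node):
    statsd_incr('cluster.jobs.started')                      # metric side
    logging.info('job_start job=%s node=%s', job_id, node)   # event side

on_job_start('4711', 'node1')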
5. Cluster?
„A computer cluster consists of a set of loosely connected or tightly connected computers that work together so that in many respects they can be viewed as a single system.“ - wikipedia.org
6. HPC-Cluster
High Performance Computing
❖ HPC: Surfing the bottleneck
❖ The weakest link breaks performance
7. Cluster Layers
(rough estimate)
Software: end-user application
Services: storage, job scheduler, sshd
Middleware: MPI, ISV libs
Operating System: kernel, userland tools
Hardware: IPMI, lm_sensors, IB counters
[Diagram: events and metrics from each layer are relevant to different roles: End User, Mgmt (Excel: KPI, SLA), Power User/ISV, and SysOps L1/L2/L3]
8. Layers
❖ Every layer is itself composed of layers
❖ How deep to go?
9. Little Data w/o Connection
❖ Multiple data sources
❖ No way of connecting them
❖ Connecting them is manual labour
❖ Experience-driven
❖ Niche solutions are misleading
15. Modular Switch
❖ Looks like one „switch“
❖ Composed of a network itself
❖ Which route is taken is transparent to the application
❖ LB1<>FB1<>LB4
❖ LB1<>FB2<>LB4
❖ LB1 ->FB1 ->LB4 / LB1 <-FB2 <-LB4
16. Debug Nightmare
❖ 96-port switch
❖ Multiple autonomous job cells
❖ A job seems to fail due to a bad internal link
❖ Relevant information:
❖ Job status (Resource Scheduler)
❖ Routes (IB Subnet Manager)
❖ IB counters (command line; see the sketch below)
❖ Changing one plug recomputes the routes :)
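A rough sketch of the "IB counters via the command line" step referenced above: it shells out to perfquery (from infiniband-diags) and flags non-zero error counters on a single port. The LID, port number and the choice of counters are made-up examples.
# Rough sketch: read one switch port's counters with perfquery and flag
# non-zero error counters. LID/port are made up; the parsing is naive.
import re
import subprocess

def read_port_counters(lid, port):
    out = subprocess.check_output(['perfquery', str(lid), str(port)])
    counters = {}
    for line in out.decode().splitlines():
        m = re.match(r'(\w+):\.*(\d+)\s*$', line.strip())
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

if __name__ == '__main__':
    c = read_port_counters(lid=42, port=7)   # e.g. one internal FB<->LB link
    for name in ('SymbolErrorCounter', 'LinkDownedCounter', 'PortRcvErrors'):
        if c.get(name, 0) > 0:
            print('suspect link: %s=%d' % (name, c[name]))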
17. Communication Networks
Poster: IBPM: An Open-Source-Based Framework for InfiniBand Performance Monitoring (Michael Hoefling, Michael Menth, Christian Kniep, Marcus Camen)
Background: InfiniBand (IB)
❖ State-of-the-art communication technology for interconnects in high-performance computing data centers
❖ Point-to-point bidirectional links
❖ High throughput (40 Gbit/s with QDR)
❖ Low latency
❖ Dynamic online network reconfiguration
Rate Measurement in IB Networks
❖ Idea: extract raw network information from the IB network, analyze the output, derive statistics about the performance of the network
❖ Topology extraction: subnet discovery using ibnetdiscover produces a human-readable file of the network topology, which is processed into a graphical representation of the network (see the sketch after this overview)
❖ Remote counter readout: each port has its own set of performance counters, measuring, e.g., transferred data, congestion, errors, link state changes
IBPM: Demo Overview
❖ ibsim-based network simulation: ibsim simulates an IB network; simple topology changes are possible (GUI), but no performance simulation and no data rate changes
❖ Real IB network: physical network, allows performance measurements, GUI-controlled traffic scenarios
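As a rough illustration of the topology-extraction step, the sketch below runs ibnetdiscover and pulls the switch and HCA node records out of its output; the parsing is deliberately naive and only meant to show the idea, not the IBPM implementation.
# Naive sketch of topology extraction: run ibnetdiscover and collect the
# node id strings of switches and HCAs. The real tooling also parses the
# per-port records to build the full topology graph.
import subprocess

def discover_nodes():
    out = subprocess.check_output(['ibnetdiscover']).decode()
    switches, hcas = [], []
    for line in out.splitlines():
        if line.startswith('Switch'):
            switches.append(line.split('"')[1])   # e.g. "S-<switch GUID>"
        elif line.startswith('Ca'):
            hcas.append(line.split('"')[1])       # e.g. "H-<HCA GUID>"
    return switches, hcas

if __name__ == '__main__':
    sw, ca = discover_nodes()
    print('%d switches, %d HCAs discovered' % (len(sw), len(ca)))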
18. OpenSM Performance Manager
❖ osmeventplugin
❖ Sends token to all ports
❖ All ports reply with metrics
❖ Callback triggered for every reply
❖ Dumps info to file
[Diagram: OpenSM PerfMgmt with the osmeventplugin, polling the nodes behind the switch]
19. OpenSM
❖ qnib
❖ sends metrics to RRDtool
❖ events to PostgreSQL
❖ qnibng (see the sketch below)
❖ sends metrics to Graphite
❖ events to logstash
[Diagram: OpenSM PerfMgmt feeding the qnib/qnibng plugin]
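A minimal sketch of the qnibng output path, assuming Graphite's plaintext protocol on port 2003 and a logstash TCP input with a JSON codec; the host names, the logstash port and the metric/event fields are illustrative, not taken from qnibng.
# Minimal sketch: one metric to Graphite's plaintext listener, one event to
# a logstash TCP/JSON input. Hosts, the logstash port and field names are
# assumptions for illustration.
import json
import socket
import time

def send_graphite_metric(path, value, host='graphite', port=2003):
    msg = '%s %s %d\n' % (path, value, int(time.time()))
    s = socket.create_connection((host, port))
    s.sendall(msg.encode())
    s.close()

def send_logstash_event(event, host='logstash', port=5140):
    s = socket.create_connection((host, port))
    s.sendall((json.dumps(event) + '\n').encode())
    s.close()

if __name__ == '__main__':
    send_graphite_metric('ib.switch1.port7.xmit_data', 123456)
    send_logstash_event({'type': 'ib_event', 'msg': 'link state change',
                         'lid': 42, 'port': 7})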
24. Cluster Stack Mock-Up
❖ IB events and metrics are not enough
❖ How to get real-world behavior?
❖ Wanted:
❖ Slurm (Resource Scheduler)
❖ MPI-enabled compute nodes
❖ As much additional cluster stack as possible (Graphite, Elasticsearch/Logstash/Kibana, Icinga, Cluster-FS, …)
25. Classical Virtualization
❖ Big overhead for a simple node
❖ Resources provisioned in advance
❖ Host resources are allocated
26. LXC (docker)
❖ Minimal overhead (a couple of MB)
❖ No resource pinning
❖ cgroups option
❖ Highly automatable
NOW: Watch the OSDC 2014 talk "Docker" by Tobias Schwab
28. Master Node
❖ Takes care of inventory (etcd; see the sketch below)
❖ Provides DNS (+PTR)
❖ Integrate Rudder, Ansible, Chef, …?
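A rough sketch of the inventory idea, assuming the etcd v2 HTTP keys API on its old default client port 4001; the key layout (/cluster/nodes/<name>) and addresses are made up, and the DNS (+PTR) records would be generated from these entries.
# Rough sketch: keep the node inventory in etcd via its v2 HTTP keys API.
# Endpoint, key layout and addresses are assumptions for illustration.
import requests

ETCD = 'http://127.0.0.1:4001'

def register_node(name, ip):
    # one key per node; the DNS service could be generated from these entries
    requests.put('%s/v2/keys/cluster/nodes/%s' % (ETCD, name),
                 data={'value': ip})

def list_nodes():
    r = requests.get('%s/v2/keys/cluster/nodes' % ETCD)
    return [n['key'] for n in r.json()['node'].get('nodes', [])]

if __name__ == '__main__':
    register_node('node1', '172.17.0.10')
    print(list_nodes())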
29. Non-Master Nodes (in general)
❖ Are started with the master as DNS server (see the sketch below)
❖ Mount /scratch and /chome (sitting on SSDs)
❖ supervisord kicks in and starts services and setup scripts
❖ Send metrics to Graphite
❖ Send logs to logstash
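A hypothetical sketch of how the non-master nodes could be spun up: plain docker run calls that set the container hostname and point its DNS at the master. The image name and master address are placeholders, not from the talk.
# Hypothetical sketch: start a few compute-node containers whose DNS points
# at the master container. Image name and master address are placeholders.
import subprocess

MASTER_DNS = '172.17.0.2'        # assumed address of the master container
IMAGE = 'qnib/compute-node'      # placeholder image name

def start_node(name):
    subprocess.check_call([
        'docker', 'run', '-d',
        '--name', name, '--hostname', name,
        '--dns', MASTER_DNS,
        IMAGE])

if __name__ == '__main__':
    for i in range(1, 5):
        start_node('node%d' % i)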
44. pipework / mininet
❖ Currently all containers are bound to the docker0 bridge
❖ Creating a topology with virtual/real switches would be nice
❖ A first iteration might use pipework
❖ A more complete one should use vSwitches (Mininet?); see the sketch below
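As hinted at above, Mininet comes with a Python API; the sketch below models the modular-switch idea (leaf boards behind two fabric boards) as a Mininet topology. Names and counts are illustrative, and attaching the Docker containers to these switches (e.g. via pipework) is left out.
# Sketch: model the modular-switch idea (LB1-LB4 behind FB1/FB2) with
# Mininet's Python topology API. Names and counts are illustrative.
from mininet.topo import Topo

class FabricTopo(Topo):
    def build(self):
        fabrics = [self.addSwitch('s10%d' % i) for i in (1, 2)]   # FB1, FB2
        for n in range(1, 5):                                      # LB1..LB4
            leaf = self.addSwitch('s%d' % n)
            for fb in fabrics:
                self.addLink(leaf, fb)   # every leaf reaches every fabric board
            self.addLink(self.addHost('h%d' % n), leaf)  # one "compute node" per leaf

if __name__ == '__main__':
    topo = FabricTopo()
    print(topo.nodes())   # switches and hosts
    print(topo.links())   # redundant leaf<->fabric paths, as on the modular switch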