Tutorial: Network State Awareness and Troubleshooting
1. Network State Awareness
and Troubleshooting
Faraz Shamim, Technical Leader
Aamer Akhter, Director Product Management
Cisco Systems Inc
2. • Troubleshooting Methodology
• Packet Forwarding Review
• Control Plane
• Active Monitoring
• Logging
• Routing Protocol Stability
• Data Plane
• Active Monitoring
• Passive Flow Monitoring
• QoS
• Getting Started
Agenda
3. • This session is about basic network troubleshooting,
focusing on fault detection & isolation
• Mostly, vendor neutral
• For context, we will cover some basic methodologies
and functional elements of network behavior
• This session is NOT about
• Architectures of specific platforms
• Data Center technologies
• Routing Protocols Troubleshooting
• This is a 90 min tour. ;-)
Keeping Focused: What This Session is About
4. The Big Picture
[Figure: a client and a server connected across the network, with a network operator and an application operator. Speech bubbles: "Not happy", "It’s not the network", "It’s the network", "Is it Monday?", "Pings fine!", "Can’t ping it.", "Internet’s down.", "Somebody's downloading something. (?)"]
5. • A lot of stuff going on
• Multiple networks
• Multiple applications
• Multiple layered services
• Mis-information / inconsistency
Some More (network) Detail
[Figure labels: Client (not happy), LAN, Enterprise WAN, Enterprise DC, Server A, ISP A, Internet, Server B, DNS, DHCP, 802.1x]
6. • Redundant paths / ECMP / LAG
• Overlays
• Load balancers
• Firewalls
• NATs
… and it keeps on going
[Figure labels: Client (not happy), LAN, Enterprise WAN, Enterprise DC, Server A, ISP A, ISP B, Internet, Server B, DNS, DHCP, 802.1x]
7. Why network state awareness?
• What is it:
• View of network, what it is doing, and why
• Monitoring of data network performance,
in comparison with previous working states
• Quick detection of hard failures
• Early warning for
• soft failures
• performance issues
• and tomorrow’s problems
• Faster problem resolution
• Greater confidence in network by users and application operators
8. [Figure steps: Be Prepared, Find the Suspects, Question Suspects, Improve]
Think Like a Network Detective
9. • Control Plane
• Processes variety of information
sources and policies, creates
routing information base (RIB)
• Best known intention w/o actual
packet in hand
• Data Plane
• The actual forwarding process
(might be SW or HW based)
• Granted some decision flexibility
• Driven by arriving packet details,
traffic conditions etc.
Control Plane & Data Plane
[Figure: the control plane is fed by routing protocol(s) (gossip from other routers), APIs and statics (admin edict); the data plane forwards a packet across Int A, Int B and Int C. Checks: check routes, check L3 routing, check policy, check forwarding, check policy-map int…, check interface, check flow monitor. Passive measurements: ifmib, *Flow, CbQoS, PfR.]
10. • Control plane: condenses options driven by policies and (relatively) slower-moving,
aggregated information, e.g. prefix reachability, interface state
• Data plane responds to packet conditions
• Destination prefix to egress interface matching
• Multi-path (ECMP / LAG) member selection
• Interface congestion
• QoS class state
• Access Lists
• Packet processing fields (TTL expire, etc)
• IPv4 fragmentation, etc
Data Plane Decision Flexibility
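The multi-path member selection listed above can be sketched as a hash over the flow 5-tuple. This is a minimal illustration only: real platforms use their own hardware hash functions and inputs, so the CRC32 hash and the `ecmp_member` helper here are assumptions, not any vendor's algorithm.

```python
import zlib

def ecmp_member(src_ip, dst_ip, proto, src_port, dst_port, members):
    """Pick one egress member for a flow by hashing its 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    # CRC32 stands in for the platform-specific hash (assumption).
    return members[zlib.crc32(key) % len(members)]

links = ["Int A", "Int B", "Int C"]
flow = ("10.0.0.1", "192.168.2.2", 6, 40000, 443)   # hypothetical flow

first = ecmp_member(*flow, links)
again = ecmp_member(*flow, links)
print(first, again)   # the same flow always maps to the same member
```

The point for troubleshooting is that a probe with a different 5-tuple than the user's traffic may hash onto a different member and never see the faulty link.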
11. • Each network device makes an independent forwarding decision
• Explicit Local / domain policies
• Device perspective might not be symmetric
• Data plane flexibility
• Generally happens at WAN-edge and admin boundaries (traffic engineering)
• Asymmetric routing
Network as a System: Independent Decisions
[Figure: hosts A and B connected through routers R1 to R6; part of the path is your network, part you don’t control. Annotations: congested link; R5 is doing ECMP hash.]
12. • Change is normal, but some
changes are more interesting:
• Single change that causes loss
of reachability or suboptimal
performance
• Instability: high rate of change
• 3Ws: when, where, and what
Data Plane and Control Plane Changes
14. What do I have?
• Establish inventory baseline
• Device names, IPs, configuration
• Modular HW configuration
• Serial # (for support & replacement)
• History (where has it been placed)
• Clearly label devices, ownership
and contact info
• Establish standards for location,
device/port names
• Check for changes periodically
(tooling)
Example device label: <owner/dept> <device-name> <IP address> <Contact>
Example cable label: <current-location> to <destination-location> <circuit src/dst id>
15. How is it wired together?
• Establish network topology baseline
• Be prepared to be surprised!
• L2 protocol for discovery (LLDP?)
• Cisco, Foundry, Nortel
• Visual inspection :-)
[Figure: R1, SW1, R2]
16. Tools for Topology & Inventory Management
• Most NMS tools have some element of inventory and topology awareness
• NetBrain
• (open source) NetDisco
http://www.netdisco.org
• (open source) Netdot
https://osl.uoregon.edu/redmine/projects/netdot
17. Logging
• Centrally: for ease of analysis and search
• Moogsoft - automates early detection of service failures, collaboration & knowledge base
• syslog-ng – preprocessing, relay and store(file/db)
• Logstash(ELK), fluentd – multisource collection, storage and analysis
• Locally: in case logs can’t get home
18. State of the Routing Table
• Be familiar with normal behavior of important service prefixes
• Establish quickly if problem is control plane or data plane
• Check routing table/ ipRouteTable MIB / check ip traffic (Drop stats)
• Track objects
#show ip route 192.168.2.2
Routing entry for 192.168.2.2/32
Known via "ospf 1", distance 110, metric 11, type intra area
Last update from 10.0.0.2 on FastEthernet0/0, 00:00:13 ago
Routing Descriptor Blocks:
* 10.0.0.2, from 2.2.2.2, 00:00:13 ago, via FastEthernet0/0
Route metric is 11, traffic share count is 1
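To establish quickly whether an important prefix changed recently, the route output above can be parsed by a script. The sketch below is an assumption-laden example: the `parse_route` helper is hypothetical, it only handles the `hh:mm:ss` age format shown in the sample, and older routes are printed in other formats that it does not cover.

```python
import re

SAMPLE = """\
Routing entry for 192.168.2.2/32
  Known via "ospf 1", distance 110, metric 11, type intra area
  Last update from 10.0.0.2 on FastEthernet0/0, 00:00:13 ago
"""

def parse_route(text):
    """Extract protocol, distance, metric, next hop and route age from the text."""
    known = re.search(r'Known via "([^"]+)", distance (\d+), metric (\d+)', text)
    update = re.search(r"Last update from (\S+) on (\S+), (\d+):(\d+):(\d+) ago", text)
    if not known or not update:
        return None   # unexpected format: leave it to a human
    hours, minutes, seconds = (int(x) for x in update.group(3, 4, 5))
    return {
        "protocol": known.group(1),
        "distance": int(known.group(2)),
        "metric": int(known.group(3)),
        "next_hop": update.group(1),
        "interface": update.group(2),
        "age_seconds": hours * 3600 + minutes * 60 + seconds,
    }

route = parse_route(SAMPLE)
# A very young route for a normally stable prefix is a hint to look at the control plane.
print(route["protocol"], route["next_hop"], route["age_seconds"])
```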
19. • Remember that the OSPF database within an area
should be consistent across routers
• Understand the ‘normal’ rate of change
• LSAs refresh every 30 min unless there is a change
• Track SPF runs over time
• Number of LSAs expected
• OSPF-MIB: ospfSpfRuns,
ospfAreaLsaCount
• Route missing?
• Where is the network supposed to be
attached? Is it still?
• check interface (on advertising router)
• Check ospf database …
OSPF Area / AS-Wide
# show ip ospf
Routing Process "ospf 1" with ID 192.168.0.1
Start time: 00:01:46.195, Time elapsed: 00:48:27.308
Supports only single TOS(TOS0) routes
Supports opaque LSA
Supports Link-local Signaling (LLS)
Supports area transit capability
Supports NSSA (compatible with RFC 3101)
Supports Database Exchange Summary List Optimization (RFC 5243)
Event-log enabled, Maximum number of events: 1000, Mode: cyclic
Router is not originating router-LSAs with maximum metric
Initial SPF schedule delay 5000 msecs
Minimum hold time between two consecutive SPFs 10000 msecs
Maximum wait time between two consecutive SPFs 10000 msecs
Incremental-SPF disabled
Minimum LSA interval 5 secs
Minimum LSA arrival 1000 msecs
LSA group pacing timer 240 secs
Interface flood pacing timer 33 msecs
Retransmission pacing timer 66 msecs
Number of external LSA 0. Checksum Sum 0x000000
Number of opaque AS LSA 0. Checksum Sum 0x000000
Number of DCbitless external and opaque AS LSA 0
Number of DoNotAge external and opaque AS LSA 0
Number of areas in this router is 1. 1 normal 0 stub 0 nssa
Number of areas transit capable is 0
External flood list length 0
IETF NSF helper support enabled
Cisco NSF helper support enabled
Reference bandwidth unit is 100 mbps
Area BACKBONE(0)
Number of interfaces in this area is 4 (1 loopback)
Area has no authentication
SPF algorithm last executed 00:47:05.379 ago
SPF algorithm executed 4 times
Area ranges are
Number of LSA 16. Checksum Sum 0x078460
Number of opaque link LSA 0. Checksum Sum 0x000000
Number of DCbitless LSA 0
Number of indication LSA 0
Number of DoNotAge LSA 0
Flood list length 0
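Tracking SPF runs over time amounts to turning successive readings of the SPF counter into a rate and comparing it with what is normal for the area. The sketch below assumes the counter is polled periodically (for example ospfSpfRuns, or the "SPF algorithm executed N times" line above); the polling samples and the baseline value are made up for illustration.

```python
def spf_runs_per_hour(samples):
    """samples: list of (time_in_seconds, spf_run_counter), oldest first."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed_hours = (t1 - t0) / 3600
        if elapsed_hours <= 0 or c1 < c0:
            continue   # bad timestamps or counter reset (e.g. process restart)
        rates.append((c1 - c0) / elapsed_hours)
    return rates

def is_unstable(samples, baseline_per_hour):
    """True if any polling interval exceeded the expected SPF rate."""
    return any(rate > baseline_per_hour for rate in spf_runs_per_hour(samples))

# Hypothetical 30-minute polls: quiet, quiet, then a burst of 20 SPF runs.
polls = [(0, 4), (1800, 4), (3600, 5), (5400, 25)]
print(spf_runs_per_hour(polls))
print(is_unstable(polls, baseline_per_hour=6))
```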
20. OSPF Neighborships
• Neighbor adjacencies
• Check ospf neighbor detail (OSPF-MIB: ospfNbrState, ospfNbrEvents, ospfNbrLsRetransQLen)
• How many state changes occur?
• What is the current state?
• Any retransmission happening?
• Check the interface queue
# show ip ospf neighbor detail
Neighbor 192.168.0.7, interface address 10.0.0.3
In the area 0 via interface GigabitEthernet0/1
Neighbor priority is 1, State is FULL, 6 state changes
DR is 10.0.0.3 BDR is 10.0.0.4
Options is 0x12 in Hello (E-bit, L-bit)
Options is 0x52 in DBD (E-bit, L-bit, O-bit)
LLS Options is 0x1 (LR)
Dead timer due in 00:00:39
Neighbor is up for 00:33:10
Index 2/2/2, retransmission queue length 0, number of retransmission 0
First 0x0(0)/0x0(0)/0x0(0) Next 0x0(0)/0x0(0)/0x0(0)
Last retransmission scan length is 0, maximum is 0
Last retransmission scan time is 0 msec, maximum is 0 msec
22. • IETF RFC 7854
• BMP client (router) provides a pre-policy view of a peer's Adj-RIB-In
• Update messages from peer sent to BMP receiver
• Example uses:
• Realtime visualizer of BGP state
• Traffic engineering analytics
• BGP policy exploration
BGP Monitoring Protocol
25. User / Agent Checks
• Treat network as a black box: are your beacon services working?
• Synthetic service check (HTTP, DNS, etc.)
• Ping (not all remotes will respond)
• Data plane is exercised and tested
• Variety = better coverage (multiple IP addresses / L4 ports per location)
• Validate similar treatment (QoS) as real user traffic
• Uptime and performance (loss, latency) metrics
• Look for patterns, changes from normal. All down vs some down.
• Capture and validate real user (human) incidents. What got missed?
• Use wisely: network and server resources consumed
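A synthetic service check of the kind described above can be as small as a timed TCP connect to a beacon service. This is a minimal sketch, not a monitoring product: the `tcp_check` helper is hypothetical, and the demo connects to a throwaway listener on localhost so that it runs without touching a real network.

```python
import socket
import time

def tcp_check(host, port, timeout=2.0):
    """Return (ok, seconds) for one TCP connect attempt to a beacon service."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Refused, unreachable or timed out: all count as a failed check.
        return False, time.monotonic() - start

# Demo against a local listener (stand-in for a real beacon service).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

ok, seconds = tcp_check("127.0.0.1", demo_port)
listener.close()
print(f"beacon reachable={ok} connect_time={seconds * 1000:.1f} ms")
```

Run from several vantage points and against several addresses and ports, the results give the "all down vs some down" pattern the slide asks for.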
[Figure: hosts A and B connected through routers R1, R2, R3, R5 and R6]
26. [Figure: IP SLA overview.
Uses: network performance monitoring, service level agreement (SLA) monitoring, network assessment, Multiprotocol Label Switching (MPLS) monitoring, VoIP monitoring, availability, troubleshooting.
Measurement metrics: latency, network jitter, distribution of stats, connectivity, packet loss.
Operations: FTP, DNS, DHCP, TCP, Jitter, ICMP, UDP, DLSW, HTTP, LDP, H.323, SIP, RTP.
An IP SLA source sends actively generated traffic to a destination or IP SLA responder to measure the network; results are exposed as MIB data.]
IP SLA*(RFC 6812): Synthetic Traffic Measurements
*IP SLA can be replaced with equivalent tools from other vendors, such as Juniper RPM.
• IPSLA on router/switch –
Shadow Router?
• User end-system based
agent software
• Dedicated Agent
27. traceroute
• Understand the limitations
• Sends 3 packets (default) at each TTL
• Implementations
• Linux/Cisco: UDP (ICMP and TCP-SYN are Linux optional)
• UDP DST port # used to keep track of packets, increments per packet. Initial= 33434 (default)
• SRC port #: randomized (Linux), incrementing per packet (Cisco IOS)
• Linux (GNU inetutils-traceroute)
• UDP DST port# increments per TTL (not per packet)
• SRC port is random but fixed per entire run
• Windows: ICMP Echo request
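The two UDP port behaviours listed above can be compared with a small calculation. The sketch assumes the first probe uses the base port 33434 and three probes per TTL; some implementations start one port higher, so treat the exact numbers as illustrative. The `dst_port` helper is hypothetical.

```python
BASE_PORT = 33434   # default initial UDP destination port

def dst_port(ttl, probe, probes_per_ttl=3, per_packet=True):
    """UDP destination port for a probe; ttl and probe are 1-based.

    per_packet=True : port increments with every packet sent
    per_packet=False: port increments once per TTL
    """
    if per_packet:
        return BASE_PORT + (ttl - 1) * probes_per_ttl + (probe - 1)
    return BASE_PORT + (ttl - 1)

for ttl in (1, 2):
    per_pkt = [dst_port(ttl, p) for p in (1, 2, 3)]
    per_ttl = [dst_port(ttl, p, per_packet=False) for p in (1, 2, 3)]
    print(ttl, per_pkt, per_ttl)
```

Because the destination port is part of the 5-tuple, per-packet changes can send each probe down a different ECMP member, while per-TTL changes keep the three probes of a hop together.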
[Annotations: "Widest dispersion against possibilities. Difficult to understand though." / "ICMP blocked frequently :-(" / "Narrower dispersion. Story might be misleading." / "Internet: aka the TCP/80 network"]
28. Unix traceroute
• Multiple path options
• Topology ‘shortcuts’ (same router seen at diff hop)
• Ultimately all paths result in similar e2e delay
$ traceroute 62.2.88.172
traceroute to 62.2.88.172 (62.2.88.172), 30 hops max, 60 byte packets
1 152.22.242.65 (152.22.242.65) 1.044 ms 1.371 ms 1.585 ms
2 152.22.240.8 (152.22.240.8) 0.219 ms 0.328 ms 0.327 ms
3 128.109.70.9 (128.109.70.9) 1.066 ms 1.059 ms 1.168 ms
4 rtp7600-gw-to-dep7600-gw2.ncren.net (128.109.70.137) 1.634 ms 1.628 ms 1.736 ms
5 rlasr-gw-link1-to-rtp7600-gw.ncren.net (128.109.9.17) 5.354 ms 5.446 ms 5.557 ms
6 128.109.9.117 (128.109.9.117) 5.671 ms 128.109.9.170 (128.109.9.170) 7.141 ms 128.109.9.117 (128.109.9.117) 5.433 ms
7 wscrs-gw-to-ws-a1a-ip-asr-gw-sec.ncren.net (128.109.1.105) 9.174 ms 128.109.1.209 (128.109.1.209) 8.256 ms 6.397 ms
8 dcp-brdr-03.inet.qwest.net (205.171.251.110) 18.414 ms chr-edge-03.inet.qwest.net (65.114.0.205) 27.353 ms 27.438 ms
9 dcp-brdr-03.inet.qwest.net (205.171.251.110) 21.739 ms 63-235-40-106.dia.static.qwest.net (63.235.40.106) 17.750 ms dcp-brdr-03.inet.qwest.net (205.171.251.110) 22.450 ms
10 63-235-40-106.dia.static.qwest.net (63.235.40.106) 22.531 ms 22.516 ms 84-116-130-173.aorta.net (84.116.130.173) 140.738 ms
11 nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 140.831 ms 140.816 ms 84-116-130-173.aorta.net (84.116.130.173) 144.819 ms
12 nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 144.074 ms 144.761 ms 84-116-130-58.aorta.net (84.116.130.58) 138.455 ms
13 84-116-130-58.aorta.net (84.116.130.58) 141.844 ms 141.924 ms 142.459 ms
14 84.116.204.234 (84.116.204.234) 145.603 ms 145.891 ms 145.987 ms
15 * * *
16 62-2-88-172.static.cablecom.ch (62.2.88.172) 268.281 ms 268.245 ms 268.176 ms
1 AAA
2 BBB
3 CCC
4 DDD
5 EEE
6 FGF
7 HII
8 JKK +10ms (unsustained)
9 JLJ
10 LLM +120ms (sustained)
11 NNM
12 NNO
13 PPP
14 QQQ
15 ***
16 RRR ~268ms (all three)
Filter: + > 100 ms delay
+120 ms: Atlantic crossing
Reference
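The "filter + > 100 ms" reading of the traceroute above can be sketched as a search for delay increases that persist at every later hop. The list holds the best RTT per hop taken from the output above (None for the silent hop 15); the `sustained_jumps` helper and the best-of-three simplification are assumptions. With best-per-hop values the Atlantic crossing shows up at hop 11, whereas the per-probe view on the slide already marks hop 10 because one of its three probes was answered by the far-side router.

```python
def sustained_jumps(best_rtt_ms, threshold_ms=100.0):
    """Return 1-based hops where delay rises by more than threshold_ms over the
    previous answering hop and stays that high at every later answering hop."""
    jumps = []
    prev = None
    for i, rtt in enumerate(best_rtt_ms):
        if rtt is None:
            continue   # hop did not answer; nothing to compare
        if prev is not None and rtt - prev > threshold_ms:
            later = [r for r in best_rtt_ms[i:] if r is not None]
            if all(r - prev > threshold_ms for r in later):
                jumps.append(i + 1)
        prev = rtt
    return jumps

best = [1.044, 0.219, 1.059, 1.628, 5.354, 5.433, 6.397, 18.414,
        17.750, 22.516, 140.816, 138.455, 141.844, 145.603, None, 268.176]
print(sustained_jumps(best))
```

The second jump reported is the final hop, where the extra delay may simply be the end system answering slowly.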
31. MTR
• Interactive combined traceroute and ping
• Gives a sense of health of path (loss, delay Standard Deviation)
• Narrow path view
Reference
$ mtr 62.2.88.172
aakhter-nlr-ubuntu-01 (0.0.0.0) Sat May 30 18:57:09 2015
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. 152.22.242.65 0.0% 145 0.8 0.9 0.7 10.0 0.8
2. 152.22.240.8 0.0% 145 0.3 0.2 0.2 0.3 0.0
3. 128.109.70.9 0.0% 145 1.0 3.3 1.0 182.3 17.2
4. rtp7600-gw-to-dep7600-gw2.ncren.net 1.0% 145 9.2 4.1 1.6 203.4 18.6
5. rlasr-gw-link1-to-rtp7600-gw.ncren.net 0.0% 145 5.3 5.3 5.1 6.8 0.2
6. 128.109.9.166 0.0% 145 7.1 7.3 7.1 16.1 0.8
7. wscrs-gw-to-ws-a1a-ip-asr-gw-sec.ncren.net 0.0% 145 6.8 8.3 6.2 10.6 1.0
8. chr-edge-03.inet.qwest.net 0.0% 145 9.4 12.3 9.3 62.1 9.5
9. dcp-brdr-03.inet.qwest.net 0.0% 145 21.8 22.8 21.7 70.7 5.5
10. 63-235-40-106.dia.static.qwest.net 0.0% 145 21.8 24.5 21.7 86.1 10.6
11. 84-116-130-173.aorta.net 0.0% 145 144.8 145.0 144.7 152.9 1.0
12. nl-ams02a-rd1-te0-2-0-2.aorta.net 0.0% 145 144.1 145.5 144.0 165.4 3.7
13. 84-116-130-58.aorta.net 5.0% 144 142.9 142.3 142.0 145.6 0.4
14. 84.116.204.234 5.0% 144 145.1 145.1 144.9 145.3 0.0
15. 217-168-62-150.static.cablecom.ch 5.0% 144 145.9 146.1 145.2 164.3 1.9
16. 62-2-88-172.static.cablecom.ch 5.0% 144 313.0 260.3 152.6 508.0 80.0
[Annotations: "Note variability, probably just the end system" / "Just local noise, no carry over to later hops" / "Sustained loss. Likely something wrong 12->13, or way back"]
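The rule of thumb used to read the MTR output above (loss that does not carry over to later hops is local noise, loss that persists to the destination is real) can be written down directly. The loss values are the Loss% column from the output above; the `sustained_loss_start` helper is hypothetical.

```python
def sustained_loss_start(loss_pct):
    """Return the 1-based hop where loss begins and persists through the last hop,
    or None if there is no such hop."""
    start = None
    for hop, loss in enumerate(loss_pct, start=1):
        if loss > 0:
            if start is None:
                start = hop
        else:
            start = None   # loss did not carry over: treat earlier loss as noise
    return start

loss = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, 5, 5]
print(sustained_loss_start(loss))
```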
32. Check interface
• Classic command
• Check interface ‘up’ status
• Stability: check log event or check
routing table stability
• Monitor in/out bit/packet changes
# show interface
GigabitEthernet1 is up, line protocol is up
Hardware is CSR vNIC, address is 000c.291a.7f97 (bia 000c.291a.7f97)
Internet address is 192.168.225.130/24
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full Duplex, 1000Mbps, link type is auto, media type is RJ45
output flow-control is unsupported, input flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:05:35, output 00:09:58, output hang never
Last clearing of "show interface" counters never
Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
25349 packets input, 2381158 bytes, 0 no buffer
Received 0 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 0 multicast, 0 pause input
3958 packets output, 312408 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
56 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out
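Monitoring in/out bit and packet changes means sampling the interface counters twice and dividing the difference by the interval. A minimal sketch follows, assuming octet counters that may wrap once between samples (32-bit by default); the `rate_bps` helper and the sample values are hypothetical.

```python
def rate_bps(octets_prev, octets_now, seconds, counter_bits=32):
    """Average bits/sec between two octet-counter samples, allowing one wrap."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = octets_now - octets_prev
    if delta < 0:
        # The counter rolled over its maximum value between the two samples.
        delta += 2 ** counter_bits
    return delta * 8 / seconds

print(rate_bps(1000, 2000, 10))            # plain case
print(rate_bps(2 ** 32 - 500, 500, 10))    # counter wrapped between samples
```

On fast links a 32-bit counter can wrap more than once per polling interval, which this sketch cannot detect; 64-bit counters or shorter intervals avoid that.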
33. Follow the Flow with NetFlow (RFC 3954)
• Per-Node: Data plane observations and decisions captured
• Src/dst MAC/IP/port #s, DSCP values, in/out interfaces, etc.
• Network view: flows centrally analyzed by a NetFlow collector/analyzer
• Biggest value: strategically placed partial views
(e.g. WAN edge)
[Figure: routers R1 to R6 between hosts A and B export flows to a NetFlow collector (LiveAction)]
34. • Developed and patented at Cisco
Systems in 1996
• NetFlow is the de facto standard for
acquiring IP operational data
• Standardized in IETF via IPFIX
(RFC 7011)
• Provides network and security
monitoring, network planning, traffic
analysis, and IP accounting
• Packet capture is like a wire tap
• NetFlow is like a phone bill
NetFlow(RFC 3954)—What Is It?
Network World Article—NetFlow Adoption on the Rise
http://www.networkworld.com/newsletters/nsm/2005/0314nsm1.html
35. Flexible NetFlow
Multiple Monitors with Unique Key Fields
Flow Monitor 1: Traffic Analysis Cache
Key Fields (Packet 1): Source IP 3.3.3.3; Destination IP 2.2.2.2; Source Port 23; Destination Port 22078; Layer 3 Protocol TCP - 6; TOS Byte 0; Input Interface Ethernet 0
Non-Key Fields: Packets; Bytes; Timestamps; Next Hop Address
Src. IP | Dest. IP | Source Port | Dest. Port | Protocol | TOS | Input I/F | … | Pkts
3.3.3.3 | 2.2.2.2 | 23 | 22078 | 6 | 0 | E0 | … | 1100
Flow Monitor 2: Security Analysis Cache
Key Fields (Packet 1): Source IP 3.3.3.3; Dest IP 2.2.2.2; Input Interface Ethernet 0; SYN Flag 0
Non-Key Fields: Packets; Timestamps
Source IP | Dest. IP | Input I/F | Flag | … | Pkts
3.3.3.3 | 2.2.2.2 | E0 | 0 | … | 11000
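The effect of giving each monitor its own key fields can be sketched with two small caches built from the same packets: packets that agree on all key fields share one cache entry, and non-key fields are just counters on that entry. The packet records, field names and the `build_cache` helper below are illustrative assumptions, not a NetFlow implementation.

```python
from collections import defaultdict

def build_cache(packets, key_fields):
    """Aggregate packets into flows keyed by the chosen key fields."""
    cache = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = tuple(pkt[field] for field in key_fields)
        cache[key]["packets"] += 1          # non-key field: packet count
        cache[key]["bytes"] += pkt["length"]  # non-key field: byte count
    return dict(cache)

TRAFFIC_KEYS = ("src_ip", "dst_ip", "src_port", "dst_port",
                "protocol", "tos", "input_if")
SECURITY_KEYS = ("src_ip", "dst_ip", "input_if", "syn")

# Two hypothetical packets that differ only in destination port.
packets = [
    {"src_ip": "3.3.3.3", "dst_ip": "2.2.2.2", "src_port": 23, "dst_port": 22078,
     "protocol": 6, "tos": 0, "input_if": "Ethernet0", "syn": 0, "length": 100},
    {"src_ip": "3.3.3.3", "dst_ip": "2.2.2.2", "src_port": 23, "dst_port": 22079,
     "protocol": 6, "tos": 0, "input_if": "Ethernet0", "syn": 0, "length": 200},
]

traffic = build_cache(packets, TRAFFIC_KEYS)
security = build_cache(packets, SECURITY_KEYS)
print(len(traffic), len(security))
```

The traffic monitor sees two flows because the ports differ, while the security monitor folds both packets into one entry.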
36. • The Flexible NetFlow Forwarding
Status field captures the
forwarding outcome (and drop reason)
for a flow.
• Drop Count increments on any
explicit drop by the router
NetFlow Forwarding Status & Drop Count Fields
RFC 7270
37. Network nodes are able to discover & validate RTP, TCP and IP-CBR traffic on a hop-by-hop basis
À la carte metric (loss, latency, jitter etc.) selections, applied to operator-selected sets of traffic
Allows for fault isolation and network span validation
Per-application thresholds and alerting.
Network Performance Monitor
38. Performance Monitor Information Elements
Media Monitoring
• RTP SSRC
• RTP Jitter (min/max/mean)
• Transport Counter (expected/loss)
• Media Counter (bytes/packets/rate)
• Media Event
• Collection interval
• TCP MSS
• TCP round-trip time
Application Response Time
• CND - Client Network Delay (min/max/sum)
• SND - Server Network Delay (min/max/sum)
• ND - Network Delay (min/max/sum)
• AD - Application Delay (min/max/sum)
• Total Response Time (min/max/sum)
• Total Transaction Time (min/max/sum)
• Number of New Connections
• Number of Late Responses
• Number of Responses by Response Time (7-bucket histogram)
• Number of Retransmissions
• Number of Transactions
• Client/Server Bytes
• Client/Server Packets
Other Metrics
• L3 counter (bytes/packets)
• Flow event
• Flow direction
• Client and server address
• Source and destination address
• Transport information
• Input and output interfaces
• L3 information (TTL, DSCP, TOS, etc.)
• Application information (from deep packet inspection tool)
• Monitoring class hierarchy
39. NetFlow QoS Analysis
[Figure: screenshots from Cisco Prime Infra and LiveAction showing flow 5-tuple, DPI/NBAR, QoS processing and DSCP]
How is my flow being classified?
Did this QoS class drop traffic?
40. Dedicated Protocol Analyzers
• Wireshark and other protocol analyzers are great
• Detailed analysis for variety of protocols at deep level
• Dedicated probes are expensive to deploy pervasively
• Operator has to make difficult judgment calls on where the problem is going to be, before it happens
• Can be challenging after the fact: needs trained personnel on site.
41. Embedded Packet Capture & Analyze
• Capture packets locally to buffer on router
• Store to flash, USB, FTP, TFTP for analysis in protocol analyzer
• Capture does not add traffic to network
LY-2851-8#monitor capture buffer pcap-buffer1 size 10000 max-size 1550
LY-2851-8#monitor capture point ip cef pcap-point1 g0/0 both
LY-2851-8#monitor capture point associate pcap-point1 pcap-buffer1
LY-2851-8#monitor capture point start pcap-point1
LY-2851-8#monitor capture point stop pcap-point1
LY-2851-8#monitor capture buffer pcap-buffer1 export ftp://10.17.0.252/images/test.cap
Gig0/0
42. in-band OAM for IPv6 (iOAM6)
• New IPv6 extension header defined on user packets
• vs. IPv4 record-route option header
• RFC 2460 does not define an option to record the route
• Minimal performance hit (handled in data plane)
• Packets continue on regular path
• Instrumentation
• Packet sequence numbers => detect packet loss
• Time stamps => one way delay
• Node and ingress/egress interface names => path recording
• draft-brockners-inband-oam-requirements-03
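The two instrumentation ideas above, sequence numbers for loss and timestamps for one-way delay, reduce to very little arithmetic at the receiving side. This sketch uses hypothetical helpers, ignores sequence-number wrap, and assumes the clocks at both ends are synchronised; otherwise the delay value is meaningless.

```python
def lost_sequence_numbers(received):
    """Sequence numbers missing between the lowest and highest one seen."""
    seen = set(received)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]

def one_way_delay_ms(tx_timestamp_ms, rx_timestamp_ms):
    """One-way delay from sender and receiver timestamps (synchronised clocks assumed)."""
    return rx_timestamp_ms - tx_timestamp_ms

# Hypothetical arrivals: packets 3 and 6 never showed up, 5 arrived late.
print(lost_sequence_numbers([1, 2, 4, 7, 5]))
print(one_way_delay_ms(1_000_000, 1_000_035))
```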
[Figure: Enhanced Telemetry. Per hop and end-to-end data (Node-ID, ingress i/f, egress i/f, Sequence#, Timestamp, App-Data) is added to (selected) data traffic in the packet by network elements. Apps/controller uses: v6 traffic matrix, live flow tracing, delay distribution, bi-casting control, loss matrix/monitor, app data monitoring.]
43. iOAM6 Path Trace
• Extended Ping
H1#ping
Protocol [ip]: ipv6
Target IPv6 address: ::A:1:1:0:1D
Repeat count [5]: 1
Datagram size [100]: 300
Timeout in seconds [2]:
Extended commands? [no]: yes
Source address or interface: gig0/1
UDP protocol? [no]:
Verbose? [no]: yes
Precedence [0]:
DSCP [0]:
Include hop by hop Path Record option? [no]: yes
Sweep range of sizes? [no]:
Type escape sequence to abort.
Sending 1, 300-byte ICMP Echos to ::A:1:1:0:1D, timeout is 2 seconds:
(Gi0/1)R1(Gi0/2)----(Gi0/1)R4(Gi0/2)----(Gi0/2)R3(Gi0/3)----H3----(Gi0/3)R3(Gi0/2)----(Gi0/2)R4(Gi0/1)----(Gi0/2)R1(Gi0/1)
Reply to request 0 (35 ms)
Success rate is 100 percent (1/1), round-trip min/avg/max = 35/35/35 ms
[Figure: topology H1, R1, R2, R3, R4, H3 (::A:1:1:0:1D). Annotations: V6 extension header applied/decapped (shown twice); end system ICMP stack iOAM6 enabled.]
45. Be Prepared!
• Be prepared and have data collection systems enabled
• Enable passive monitoring on endpoints and network
• Enable active tests
• Helpdesk
• Interview Script => establish & maintain checklists
• Multi-group access to tools, logs, etc.
• Firefighters run drills, so should your teams!
• Be familiar with the tools and how they respond on your network
• Red phone: Cross-domain teams (applications, UC, security, servers)
46. Expanding your Toolbox and Knowledge
• Commercial and open source tools to look at
• Network topology & IP address management: netdot, GestióIP
• Performance tests: iperf3, netperf
• Service checks: Nagios Core, Zenoss Core
• NetFlow / Log analysis/monitoring: logstash, fluentd, splunk
• Template driven config generation: ansible
• Control Plane Troubleshooting (Troubleshooting IP Routing
Protocols)
47. Network Documentation Tool (netdot)
• Open source
• Started in 2002
• Network interfaces discovery via SNMP
• Discovery of L2 & L3 devices
• Dynamically draws a diagram/topology of your network
• Management of IPv4 and IPv6 addresses via IPAM
49. GestióIP
• Web based IP Management software
• Concurrent users support
• Better search and filter capabilities than traditional spreadsheet
• Better statistical data
• Less chance of human error compared to a spreadsheet
• Migration assistance from managed IPv4 to IPv6 via tool
51. iperf3
• Active measurement tool to discover available path capacity
• worst link and worst host configurations
• Test can be in either direction (only static NAT works)
• TCP (retransmissions, rate, cwnd), SCTP and UDP (loss, jitter, out of order) tests
[Figure: sender and receiver connected via TCP/5201; test traffic: TCP, SCTP, UDP]
52.
$ bwctl -T iperf3 -t 30 -O 4 -s "56m-ps-4x10.sox.net:4823"
bwctl: Using tool: iperf3
bwctl: 40 seconds until test results available
SENDER START
Connecting to host 152.22.242.103, port 5160
[ 15] local 143.215.194.123 port 45609 connected to 152.22.242.103 port 5160
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 15] 0.00-1.00 sec 107 MBytes 898 Mbits/sec 0 3.06 MBytes (omitted)
[ 15] 1.00-2.00 sec 112 MBytes 944 Mbits/sec 0 3.06 MBytes (omitted)
…
[ 15] 29.00-30.00 sec 112 MBytes 944 Mbits/sec 0 3.06 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 15] 0.00-30.00 sec 3.29 GBytes 942 Mbits/sec 0 sender
[ 15] 0.00-30.00 sec 3.29 GBytes 943 Mbits/sec receiver
iperf Done.
SENDER END
Iperf3
examples
$ bwctl -T iperf3 -t 30 -O 4 -c "56m-ps-4x10.sox.net:4823"
bwctl: Using tool: iperf3
bwctl: 39 seconds until test results available
SENDER START
Connecting to host 143.215.194.123, port 5327
[ 15] local 152.22.242.103 port 44855 connected to 143.215.194.123 port 5327
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 15] 0.00-1.00 sec 5.14 MBytes 43.1 Mbits/sec 411 25.5 KBytes (omitted)
[ 15] 1.00-2.00 sec 2.26 MBytes 19.0 Mbits/sec 15 19.8 KBytes (omitted)
…
[ 15] 28.00-29.00 sec 2.26 MBytes 18.9 Mbits/sec 16 25.5 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 15] 0.00-30.00 sec 59.8 MBytes 16.7 Mbits/sec 539 sender
[ 15] 0.00-30.00 sec 60.7 MBytes 17.0 Mbits/sec receiver
iperf Done.
SENDER END
[Annotations: "Client to server (local to remote)" / "Throw away stats from first 4 sec" / "Run for 30 sec" / "~19 Mbps (local to remote), retransmissions" / "~940 Mbps (remote to local)" / "Use -P for parallel streams"]
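The summary lines in the two runs above can be cross-checked by hand, which is a useful habit when comparing tools. iperf3 reports transfer in binary units (1 MByte = 1024² bytes) and bandwidth in decimal Mbits/sec; the `mbits_per_sec` helper below is a hypothetical name for that conversion.

```python
def mbits_per_sec(transfer, unit, seconds):
    """Convert an iperf3 transfer figure over an interval into Mbits/sec."""
    scale = {"KBytes": 1024, "MBytes": 1024 ** 2, "GBytes": 1024 ** 3}[unit]
    return transfer * scale * 8 / seconds / 1e6

remote_to_local = mbits_per_sec(3.29, "GBytes", 30)   # sender summary, first run
local_to_remote = mbits_per_sec(59.8, "MBytes", 30)   # sender summary, second run
print(round(remote_to_local), round(local_to_remote, 1))
```

Both results agree with the reported 942 and 16.7 Mbits/sec, so the large asymmetry between the directions is in the path, not in the unit handling.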
53. • Similar to iperf3 but:
• Works bidirectionally in a
NAT environment
• Additional connections-per-second
and transactions-per-second
tests
• Statistical confidence
intervals (-I)
netperf
> netperf -t TCP_STREAM -H 162.209.79.211 -i 30,10 -I 95,5 -j -l 60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 162.209.79.211 ()
port 0 AF_INET : +/-2.500% @ 95% conf. : demo
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 8.965%
!!! Local CPU util : 0.000%
!!! Remote CPU util : 0.000%
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 60.52 13.91
download
54. Nagios Core
• Monitoring and alerting engine
• Open source written in C
• Flexible and scalable architecture
• Event scheduler, event processor and alert manager
• APIs can be used to extend the capabilities to perform additional tasks
• Designed for Unix & Linux systems
56. Zenoss Core
• Network Management System
• Open source
• Written mostly in Python (about 90%), with about 10% Java
• GNU GPL-based license
• Web based interface for monitoring
• Used by government, retail, financial institutions, service providers and the tech industry
58. logstash
• Open source tool for managing events and logs
• Collects data/logs from many sources, then filters and displays them
• Scalable data processing
• Analysis, Archiving, Monitoring & Alerting
• Elasticsearch API is used for storage
• Kibana (a browser-based analytics tool) is used to view Logstash data
59. fluentd
• Open source data collector
• Written in C and Ruby
• Requires very few system resources
• Simple and flexible/extensible
• 5000+ companies are using fluentd for data collection
• Provides unified logging layer
• Decouples data sources from backend systems
60. ansible
• Automates software provisioning and configuration management
• Use for application deployment/migration
• Reduces complexity and repetition
• Included with the Fedora distribution of Linux
• Used for Linux, Unix and Windows
• Written in Python and PowerShell
• Can also be used in cloud environments such as AWS, Azure etc.
61. Splunk
• Accessible via standard browser or via mobile app
• Collects and indexes data
• Powerful statistical search of the data
• Correlates and investigates events and activities
• Displays reports in a customized dashboard
• Turns searches into real-time alerts
• Notifies via email or RSS
63. Alerting & Collaboration
• Routing of alerts / interesting events
• Is this noise or signal?
• Which team(s) to alert?
• Who is on duty?
• How to contact: SMS, IM, phone call…
• PagerDuty, OpenDuty
• Coordinating response
• IM tools (Spark, HipChat etc.)
• Email
• Ticketing tools (OTRS, Jira,
ServiceNow, Moogsoft…)
[Figure: PagerDuty]