This document provides an overview of network state awareness and troubleshooting techniques. The agenda covers troubleshooting methodology, packet forwarding review, active and passive monitoring, quality of service, control plane, and routing protocol stability. It distinguishes between the control plane, which creates routing information based on aggregated data, and the data plane, which makes forwarding decisions based on packet details. Various troubleshooting tools are discussed like traceroute, interface statistics, NetFlow, and performance monitoring to analyze the network from the data plane perspective.
2. Network State Awareness & troubleshooting
Abstract
• Network state awareness and troubleshooting are a large, skill-intensive part of
operating a network. This session covers basic network data plane
troubleshooting best practices and techniques for planning for failures. It also
includes demos and a review of the troubleshooting tool chain (NetFlow, perf-mon,
CBQoS, ICMP/traceroute, interface stats), touches on routing protocol (RP) stability
(SPF runs, unstable neighbors, etc.), and covers SDN methodologies along the same lines.
3. Network State Awareness & troubleshooting
Agenda
• Troubleshooting Methodology
• Packet Forwarding Review
• Data Plane
• Active Monitoring
• Passive Flow Monitoring
• QoS
• Control Plane
• Logging
• Routing Protocol Stability
• Getting Started
4. Network State Awareness & troubleshooting
Keeping Focused: What This Session is About
• This session is about basic network troubleshooting, focusing on fault detection & isolation
• Mostly vendor neutral
• For context, we will cover some basic methodologies and functional elements of network behavior
• This session is NOT about
  • Architectures of specific platforms
  • Data Center technologies
• This is the 90 min tour. ;-)
5. Network State Awareness & troubleshooting
The Big Picture
[Figure: a Client (not happy) and a Server connected across the network, with a Network Operator and an Application Operator. Speech bubbles: "It's not the network", "It's the network", "Is it Monday?", "Pings fine!", "Can't ping it.", "Internet's down.", "Somebody's downloading something. (?)"]
6. Network State Awareness & troubleshooting
Some More (network) Detail
• A lot of stuff going on
• Multiple networks
• Multiple applications
• Multiple layered services
• Mis-information / inconsistency
[Figure labels: Client (not happy), LAN, Enterprise WAN, Enterprise DC, Server A, ISP A, Internet, Server B, DNS, DHCP, 802.1x, DNS]
7. Network State Awareness & troubleshooting
… and it keeps on going
• Redundant paths / ECMP / LAG
• Overlays
• Load balancers
• Firewalls
• NATs
[Figure labels: Client (not happy), LAN, Enterprise WAN, Enterprise DC, Server A, ISP A, ISP B, Internet, Server B, DNS, DHCP, 802.1x, DNS]
8. Network State Awareness & troubleshooting
Why network state awareness?
• Quick detection of hard failures
• Early warning for
  • soft failures
  • performance issues
  • and tomorrow's problems
• Faster problem resolution
• Greater confidence in network by users and application operators
9. Network State Awareness & troubleshooting
Think Like a Network Detective
Be Prepared | Find the Suspects | Question Suspects | Improve
10. Network State Awareness & troubleshooting
Control Plane & Data Plane
• Control Plane
  • Processes a variety of information sources and policies, creates the routing information base (RIB)
  • Best known intention w/o actual packet in hand
• Data Plane
  • The actual forwarding process (might be SW or HW based)
  • Granted some decision flexibility
  • Driven by arriving packet details, traffic conditions, etc.
[Figure: a packet crosses the Data Plane between interfaces Int A, Int B and Int C, beneath the Control Plane. Control Plane inputs: Routing Protocol(s), APIs, Statics, with callouts "Gossip from other routers" and "Admin Edict". Control plane checks: check routes, check L3 routing, check policy, check forwarding. Passive Measurements on the data plane: ifmib, *Flow, CBQoS, PfR, with checks: check policy-map int…, check interface, check flow monitor.]
11. Network State Awareness & troubleshooting
Data Plane Decision Flexibility
• Control plane: condenses options driven by policies and (relatively) slower-moving, aggregated information, e.g. prefix reachability, interface state
• Data plane responds to packet conditions
  • Destination prefix to egress interface matching
  • Multi-path (ECMP / LAG) member selection
  • Interface congestion
  • QoS class state
  • Access Lists
  • Packet processing fields (TTL expire, etc.)
  • IPv4 fragmentation, etc.
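The multi-path (ECMP / LAG) member selection mentioned on this slide can be illustrated with a minimal Python sketch. It assumes a router hashes the flow 5-tuple and takes the result modulo the number of equal-cost members; real platforms use their own hash functions and inputs, so `ecmp_member` and the use of SHA-256 are hypothetical choices made only for illustration.

```python
import hashlib


def ecmp_member(src_ip, dst_ip, proto, src_port, dst_port, n_paths):
    """Pick an ECMP/LAG member index for a flow (hypothetical hash scheme)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Same 5-tuple -> same digest -> same member, so one flow stays on one path.
    return int.from_bytes(digest[:4], "big") % n_paths


if __name__ == "__main__":
    # All packets of a single flow map to the same member.
    print(ecmp_member("10.0.0.1", "10.0.1.1", 17, 40000, 33434, 4))
    # Probes that vary the destination port may be spread across members.
    members = {ecmp_member("10.0.0.1", "10.0.1.1", 17, 40000, 33434 + i, 4)
               for i in range(32)}
    print(sorted(members))
```

The point of the sketch is that the data plane decision depends on per-packet header values, which is why two probes toward the same destination can take different paths.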
12. Network State Awareness & troubleshooting
Network as a System: Independent Decisions
• Each network device makes an independent forwarding decision
• Explicit local / domain policies
• Device perspective might not be symmetric
• Data plane flexibility
• Generally happens at WAN-edge and admin boundaries (traffic engineering)
• Asymmetric routing
[Figure: hosts A and B connected through routers R1 to R6; part of the path is "your network", the rest is what "you don't control". Callouts: "Congested link", "R5 is doing ECMP hash".]
14. Network State Awareness & troubleshooting
User / Agent Checks
• Treat network as a black box: are your beacon services working?
• Synthetic service check (HTTP, DNS, etc.)
• Ping (not all remotes will respond)
• Data plane is exercised and tested
• Variety = better coverage (multiple IP addresses / L4 ports per location)
• Validate similar treatment (QoS) as real user traffic
• Uptime and performance (loss, latency) metrics
• Look for patterns, changes from normal. All down vs some down.
• Capture and validate real user (human) incidents. What got missed?
• Use wisely: network and server resources consumed
[Figure: hosts A and B connected through routers R1, R2, R3, R5 and R6.]
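A synthetic service check of the kind listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it tests TCP reachability (not a full HTTP or DNS transaction), and the names `CheckResult`, `tcp_check` and `summarize` are hypothetical.

```python
import socket
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    target: str
    ok: bool
    latency_ms: Optional[float]  # None when the check failed


def tcp_check(host: str, port: int, timeout: float = 2.0) -> CheckResult:
    """Open (and immediately close) a TCP connection, timing the handshake."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            latency = (time.monotonic() - start) * 1000.0
            return CheckResult(f"{host}:{port}", True, latency)
    except OSError:  # refused, unreachable, timed out, name resolution failure
        return CheckResult(f"{host}:{port}", False, None)


def summarize(results):
    """Classify one round of checks: all up, some down, or all down."""
    down = [r.target for r in results if not r.ok]
    if not down:
        return "all up"
    if len(down) == len(results):
        return "all down"
    return "some down: " + ", ".join(down)
```

Running `tcp_check` against several addresses and ports per location, then passing the results to `summarize`, gives the "all down vs some down" view; the check frequency should be kept modest because each probe consumes network and server resources.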
15. Network State Awareness & troubleshooting
IP SLA* (RFC 6812): Synthetic Traffic Measurements
[Figure: IP SLA uses actively generated traffic between a Source and a Destination (Responder) to measure the network, with results available as MIB data.
Uses: Network Performance Monitoring, Service Level Agreement (SLA) Monitoring, Network Assessment, Multiprotocol Label Switching (MPLS) Monitoring, VoIP Monitoring, Availability, Troubleshooting.
Measurement Metrics: Latency, Network Jitter, Dist. of Stats, Connectivity, Packet Loss.
Operations: FTP, DNS, DHCP, TCP, Jitter, ICMP, UDP, DLSW, HTTP, LDP, H.323, SIP, RTP.]
• IP SLA on router/switch (shadow router?)
• User end-system based agent software
• Dedicated agent
*IP SLA can be replaced with monitoring tools from other vendors, such as Juniper's RPM.
16. Network State Awareness & troubleshooting
Check interface
• Classic command
• Check interface "up" status
• Stability: check log events or check routing table stability
• Monitor in/out bit/packet changes
# show interface
GigabitEthernet1 is up, line protocol is up
Hardware is CSR vNIC, address is 000c.291a.7f97 (bia 000c.291a.7f97)
Internet address is 192.168.225.130/24
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full Duplex, 1000Mbps, link type is auto, media type is RJ45
output flow-control is unsupported, input flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:05:35, output 00:09:58, output hang never
Last clearing of "show interface" counters never
Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
25349 packets input, 2381158 bytes, 0 no buffer
Received 0 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 0 multicast, 0 pause input
3958 packets output, 312408 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
56 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out
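To monitor in/out packet and error changes over time, the counters can be scraped from two snapshots of this output and compared. The sketch below is a minimal illustration in Python; the regular expressions are assumptions based on the sample output shown here (other platforms and releases word these lines differently), the names `PATTERNS`, `parse_counters` and `counter_deltas` are hypothetical, and counter wrap or clearing is not handled.

```python
import re

# Counter name -> regex over "show interface"-style text (format assumed
# from the sample output in this deck).
PATTERNS = {
    "packets_in": r"(\d+) packets input",
    "packets_out": r"(\d+) packets output",
    "input_errors": r"(\d+) input errors",
    "crc": r"(\d+) CRC",
    "output_errors": r"(\d+) output errors",
    "output_drops": r"Total output\s+drops: (\d+)",
}


def parse_counters(text):
    """Extract the counters named in PATTERNS from one command output."""
    counters = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            counters[name] = int(match.group(1))
    return counters


def counter_deltas(before, after):
    """Change in each counter present in both snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}
```

A non-zero delta for `crc` or `input_errors` between two polls is usually more informative than the absolute value, since counters may have been accumulating since the last clear.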
17. Network State Awareness & troubleshooting
traceroute
• Understand the limitations
• Sends 3 packets (default) at each TTL
• Implementations
  • Linux/Cisco: UDP (ICMP and TCP-SYN are optional on Linux)
    • UDP DST port # is used to keep track of packets and increments per packet. Initial = 33434 (default)
    • SRC port #: randomized (Linux), incrementing per packet (Cisco IOS)
    • Callout: widest dispersion against possibilities. Difficult to understand though.
  • Linux (GNU inetutils-traceroute)
    • UDP DST port # increments per TTL (not per packet)
    • SRC port is random but fixed for the entire run
    • Callout: narrower dispersion. Story might be misleading.
  • Windows: ICMP Echo request
    • Callout: ICMP blocked frequently :-(
• Callout: Internet, aka the TCP/80 network
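The two UDP destination-port schemes described on this slide can be compared with a short Python sketch. It assumes the first probe uses the initial port 33434 and that the port advances by exactly one per probe (or per TTL); actual implementations may differ in these details, and `probe_ports` is a hypothetical helper, not part of any traceroute tool.

```python
def probe_ports(max_ttl, probes_per_ttl=3, base_port=33434, per_packet=True):
    """Return (ttl, udp_dst_port) for every probe of a UDP traceroute run."""
    probes = []
    for ttl in range(1, max_ttl + 1):
        for i in range(probes_per_ttl):
            if per_packet:
                # Port advances with every probe sent.
                port = base_port + (ttl - 1) * probes_per_ttl + i
            else:
                # Port advances once per TTL; all probes of a hop share it.
                port = base_port + (ttl - 1)
            probes.append((ttl, port))
    return probes


if __name__ == "__main__":
    print(probe_ports(2))                    # per-packet increment
    print(probe_ports(2, per_packet=False))  # per-TTL increment
```

Because routers doing ECMP typically hash on the L4 ports, the per-packet scheme presents a different 5-tuple for every probe and so tends to expose more of the available paths, while the per-TTL scheme gives a narrower view.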
18. Network State Awareness & troubleshooting
Unix traceroute
• Multiple path options
• Topology "shortcuts" (same router seen at different hops)
• Ultimately all paths result in similar e2e delay
$ traceroute 62.2.88.172
traceroute to 62.2.88.172 (62.2.88.172), 30 hops max, 60 byte packets
1 152.22.242.65 (152.22.242.65) 1.044 ms 1.371 ms 1.585 ms
2 152.22.240.8 (152.22.240.8) 0.219 ms 0.328 ms 0.327 ms
3 128.109.70.9 (128.109.70.9) 1.066 ms 1.059 ms 1.168 ms
4 rtp7600-gw-to-dep7600-gw2.ncren.net (128.109.70.137) 1.634 ms 1.628 ms 1.736 ms
5 rlasr-gw-link1-to-rtp7600-gw.ncren.net (128.109.9.17) 5.354 ms 5.446 ms 5.557 ms
6 128.109.9.117 (128.109.9.117) 5.671 ms 128.109.9.170 (128.109.9.170) 7.141 ms 128.109.9.117 (128.109.9.117) 5.433 ms
7 wscrs-gw-to-ws-a1a-ip-asr-gw-sec.ncren.net (128.109.1.105) 9.174 ms 128.109.1.209 (128.109.1.209) 8.256 ms 6.397 ms
8 dcp-brdr-03.inet.qwest.net (205.171.251.110) 18.414 ms chr-edge-03.inet.qwest.net (65.114.0.205) 27.353 ms 27.438 ms
9 dcp-brdr-03.inet.qwest.net (205.171.251.110) 21.739 ms 63-235-40-106.dia.static.qwest.net (63.235.40.106) 17.750 ms dcp-brdr-03.inet.qwest.net (205.171.251.110) 22.450 ms
10 63-235-40-106.dia.static.qwest.net (63.235.40.106) 22.531 ms 22.516 ms 84-116-130-173.aorta.net (84.116.130.173) 140.738 ms
11 nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 140.831 ms 140.816 ms 84-116-130-173.aorta.net (84.116.130.173) 144.819 ms
12 nl-ams02a-rd1-te0-2-0-2.aorta.net (84.116.130.65) 144.074 ms 144.761 ms 84-116-130-58.aorta.net (84.116.130.58) 138.455 ms
13 84-116-130-58.aorta.net (84.116.130.58) 141.844 ms 141.924 ms 142.459 ms
14 84.116.204.234 (84.116.204.234) 145.603 ms 145.891 ms 145.987 ms
15 * * *
16 62-2-88-172.static.cablecom.ch (62.2.88.172) 268.281 ms 268.245 ms 268.176 ms
1 AAA
2 BBB
3 CCC
4 DDD
5 EEE
6 FGF
7 HII
8 JKK +10ms (unsustained)
9 JLJ
10 LLM +120ms (sustained)
11 NNM
12 NNO
13 PPP
14 QQQ
15 ***
16 RRR ~268ms (all three)
Callouts: "filter + > 100 ms delay"; "+120 ms: Atlantic crossing"
Reference
19. Network State Awareness & troubleshooting
Unix inetutils traceroute
• Narrower view (no alternate paths directly seen)
• Repeating nodes suggest multipath, or (unlikely) a routing issue
$ inetutils-traceroute --resolve-hostname 62.2.88.172
traceroute to 62.2.88.172 (62.2.88.172), 64 hops max
1 152.22.242.65 (152.22.242.65) 0.783ms 0.727ms 0.798ms
2 152.22.240.8 (152.22.240.8) 0.226ms 0.228ms 0.221ms
3 128.109.70.9 (128.109.70.9) 0.967ms 0.980ms 0.962ms
4 128.109.70.137 (rtp7600-gw-to-dep7600-gw2.ncren.net) 1.576ms 1.598ms 1.567ms
5 128.109.9.17 (rlasr-gw-link1-to-rtp7600-gw.ncren.net) 5.149ms 5.140ms 5.126ms
6 128.109.9.166 (128.109.9.166) 7.113ms 7.098ms 7.306ms
7 128.109.1.209 (128.109.1.209) 7.835ms 8.326ms 7.958ms
8 65.114.0.205 (chr-edge-03.inet.qwest.net) 19.944ms 9.299ms 40.372ms
9 63.235.40.106 (63-235-40-106.dia.static.qwest.net) 18.442ms 18.412ms 18.432ms
10 63.235.40.106 (63-235-40-106.dia.static.qwest.net) 22.424ms 22.391ms 75.960ms
11 84.116.130.173 (84-116-130-173.aorta.net) 145.434ms 146.301ms 145.445ms
12 84.116.130.58 (84-116-130-58.aorta.net) 137.583ms 137.556ms 137.661ms
13 84.116.130.58 (84-116-130-58.aorta.net) 142.476ms 141.886ms 141.819ms
14 84.116.204.234 (84.116.204.234) 144.841ms 145.034ms 144.964ms
15 * * *
16 62.2.88.172 (62-2-88-172.static.cablecom.ch) 287.318ms 176.670ms 254.237ms
Callout: packets for hops 9 and 12 took a "shortcut", and packets for hops 10 and 13 went the long way.
Reference
21. Network State Awareness & troubleshooting
mtr
• Interactive combined traceroute and ping
• Gives a sense of the health of the path (loss, delay standard deviation)
• Narrow path view
Reference
$ mtr 62.2.88.172
aakhter-nlr-ubuntu-01 (0.0.0.0) Sat May 30 18:57:09 2015
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. 152.22.242.65 0.0% 145 0.8 0.9 0.7 10.0 0.8
2. 152.22.240.8 0.0% 145 0.3 0.2 0.2 0.3 0.0
3. 128.109.70.9 0.0% 145 1.0 3.3 1.0 182.3 17.2
4. rtp7600-gw-to-dep7600-gw2.ncren.net 1.0% 145 9.2 4.1 1.6 203.4 18.6
5. rlasr-gw-link1-to-rtp7600-gw.ncren.net 0.0% 145 5.3 5.3 5.1 6.8 0.2
6. 128.109.9.166 0.0% 145 7.1 7.3 7.1 16.1 0.8
7. wscrs-gw-to-ws-a1a-ip-asr-gw-sec.ncren.net 0.0% 145 6.8 8.3 6.2 10.6 1.0
8. chr-edge-03.inet.qwest.net 0.0% 145 9.4 12.3 9.3 62.1 9.5
9. dcp-brdr-03.inet.qwest.net 0.0% 145 21.8 22.8 21.7 70.7 5.5
10. 63-235-40-106.dia.static.qwest.net 0.0% 145 21.8 24.5 21.7 86.1 10.6
11. 84-116-130-173.aorta.net 0.0% 145 144.8 145.0 144.7 152.9 1.0
12. nl-ams02a-rd1-te0-2-0-2.aorta.net 0.0% 145 144.1 145.5 144.0 165.4 3.7
13. 84-116-130-58.aorta.net 5.0% 144 142.9 142.3 142.0 145.6 0.4
14. 84.116.204.234 5.0% 144 145.1 145.1 144.9 145.3 0.0
15. 217-168-62-150.static.cablecom.ch 5.0% 144 145.9 146.1 145.2 164.3 1.9
16. 62-2-88-172.static.cablecom.ch 5.0% 144 313.0 260.3 152.6 508.0 80.0
Callouts on the output:
• Note variability, probably just the end system
• Just local noise, no carry over to later hops
• Sustained loss. Likely something wrong 12->13, or way back
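The distinction between local noise and sustained loss can be expressed as a simple rule: loss at a hop only matters if every later hop, through the destination, also shows loss. The Python sketch below applies that rule to the Loss% column of the mtr output above; the function name `first_sustained_loss_hop` is hypothetical and the rule is a simplification, not how mtr itself reports results.

```python
def first_sustained_loss_hop(loss_pct):
    """Return the 1-based hop where loss begins and persists to the last hop.

    loss_pct is the per-hop Loss% column in hop order. Loss that disappears
    at a later hop is treated as local noise (for example ICMP rate limiting).
    Returns None when no loss carries through to the destination.
    """
    start = None
    for hop, loss in enumerate(loss_pct, start=1):
        if loss > 0:
            if start is None:
                start = hop
        else:
            start = None  # a clean later hop means earlier loss did not carry over
    return start


if __name__ == "__main__":
    # Loss% column from the mtr run above, hops 1 to 16.
    mtr_loss = [0, 0, 0, 1.0, 0, 0, 0, 0, 0, 0, 0, 0, 5.0, 5.0, 5.0, 5.0]
    print(first_sustained_loss_hop(mtr_loss))  # 13
```

For this run the 1% loss at hop 4 is discarded and hop 13 is reported, which matches the reading that something is wrong between hops 12 and 13 (or on the return path).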
22. Network State Awareness & troubleshooting
Follow the Flow with NetFlow (RFC 3954)
• Per-node: data plane observations and decisions captured
  • Src/dst MAC/IP/port #s, DSCP values, in/out interfaces, etc.
• Network view: flows centrally analyzed by a NetFlow collector/analyzer
• Biggest value: strategically placed partial views (e.g. WAN edge)
[Figure: hosts A and B connected through routers R1 to R6, exporting flows to a NetFlow Collector (LiveAction).]
23. Network State Awareness & troubleshooting
NetFlow (RFC 3954): What Is It?
• Developed and patented at Cisco Systems in 1996
• NetFlow is the de facto standard for acquiring IP operational data
• Standardized in IETF via IPFIX
• Provides network and security monitoring, network planning, traffic analysis, and IP accounting
• Packet capture is like a wire tap
• NetFlow is like a phone bill
Network World article: NetFlow Adoption on the Rise
http://www.networkworld.com/newsletters/nsm/2005/0314nsm1.html
24. Network State Awareness & troubleshooting
Flexible NetFlow: Multiple Monitors with Unique Key Fields
[Figure: the same traffic is observed by two flow monitors, each with its own key fields and cache.]

Flow Monitor 1 (Traffic Analysis Cache)
Key fields (Packet 1): Source IP 3.3.3.3 | Destination IP 2.2.2.2 | Source Port 23 | Destination Port 22078 | Layer 3 Protocol TCP - 6 | TOS Byte 0 | Input Interface Ethernet 0
Non-key fields: Packets, Bytes, Timestamps, Next Hop Address
Cache entry:
Src. IP | Dest. IP | Source Port | Dest. Port | Protocol | TOS | Input I/F | … | Pkts
3.3.3.3 | 2.2.2.2 | 23 | 22078 | 6 | 0 | E0 | … | 1100

Flow Monitor 2 (Security Analysis Cache)
Key fields (Packet 1): Source IP 3.3.3.3 | Dest IP 2.2.2.2 | Input Interface Ethernet 0 | SYN Flag 0
Non-key fields: Packets, Timestamps
Cache entry:
Source IP | Dest. IP | Input I/F | Flag | … | Pkts
3.3.3.3 | 2.2.2.2 | E0 | 0 | … | 11000
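The idea of several monitors, each aggregating the same packets under different key fields, can be sketched as follows. This Python model is only an illustration: the class `FlowMonitor`, the packet dictionary layout and the field names are hypothetical and do not reflect how any router implements its flow cache.

```python
from collections import defaultdict


class FlowMonitor:
    """Toy flow cache: packets sharing all key-field values form one flow."""

    def __init__(self, key_fields):
        self.key_fields = tuple(key_fields)
        # Non-key fields (packet and byte counts) are accumulated per flow.
        self.cache = defaultdict(lambda: {"packets": 0, "bytes": 0})

    def observe(self, packet):
        key = tuple(packet[field] for field in self.key_fields)
        entry = self.cache[key]
        entry["packets"] += 1
        entry["bytes"] += packet["length"]


if __name__ == "__main__":
    traffic = FlowMonitor(["src_ip", "dst_ip", "src_port", "dst_port",
                           "protocol", "tos", "input_if"])
    security = FlowMonitor(["src_ip", "dst_ip", "input_if", "syn_flag"])
    base = {"src_ip": "3.3.3.3", "dst_ip": "2.2.2.2", "src_port": 23,
            "protocol": 6, "tos": 0, "input_if": "Ethernet0",
            "syn_flag": 0, "length": 100}
    for dst_port in (22078, 22079):
        packet = dict(base, dst_port=dst_port)
        traffic.observe(packet)
        security.observe(packet)
    print(len(traffic.cache), len(security.cache))
```

With fewer key fields, the security monitor collapses both packets into one entry, which is why its packet counts can be much larger than those of the finer-grained traffic monitor.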
25. Network State Awareness & troubleshooting
NetFlow Forwarding Status & Drop Count Fields
• The Flexible NetFlow Forwarding Status field captures the forwarding status (and drop reason) for a flow.
• Drop Count increments on any explicit drop by the router
26. Network State Awareness & troubleshooting
Network Performance Monitor
Network nodes are able to discover & validate RTP, TCP and IP-CBR traffic on a hop-by-hop basis
À la carte metric (loss, latency, jitter, etc.) selections, applied to operator-selected sets of traffic
Allows for fault isolation and network span validation
Per-application thresholds and alerting.
27. Network State Awareness & troubleshooting
Performance Monitor Information Elements
Media Monitoring
• RTP SSRC
• RTP Jitter (min/max/mean)
• Transport Counter (expected/loss)
• Media Counter (bytes/packets/rate)
• Media Event
• Collection interval
• TCP MSS
• TCP round-trip time
Application Response Time
• CND - Client Network Delay (min/max/sum)
• SND - Server Network Delay (min/max/sum)
• ND - Network Delay (min/max/sum)
• AD - Application Delay (min/max/sum)
• Total Response Time (min/max/sum)
• Total Transaction Time (min/max/sum)
• Number of New Connections
• Number of Late Responses
• Number of Responses by Response Time (7-bucket histogram)
• Number of Retransmissions
• Number of Transactions
• Client/Server Bytes
• Client/Server Packets
Other Metrics
• L3 counter (bytes/packets)
• Flow event
• Flow direction
• Client and server address
• Source and destination address
• Transport information
• Input and output interfaces
• L3 information (TTL, DSCP, TOS, etc.)
• Application information (from deep packet inspection tool)
• Monitoring class hierarchy
28. Network State Awareness & troubleshooting
NetFlow QoS Analysis
How is my flow being classified?
Did this QoS class drop traffic?
[Figure: screenshots from Cisco Prime Infra and LiveAction relating flow 5-tuple, DPI/NBAR, DSCP and QoS processing.]
29. Network State Awareness & troubleshooting
Dedicated Protocol Analyzers
• Wireshark and other protocol analyzers are great
• Detailed analysis for a variety of protocols at a deep level
• Dedicated probes are expensive to deploy pervasively
• Operator has to make difficult judgment calls on where the problem is going to be, before it happens
• Can be challenging after the fact: need on-site trained personnel.
30. Network State Awareness & troubleshooting
Embedded Packet Capture & Analyze
• Capture packets locally to buffer on router
• Store to flash, USB, FTP, TFTP for analysis in protocol analyzer
• Capture does not add traffic to network
LY-2851-8#monitor capture buffer pcap-buffer1 size 10000 max-size 1550
LY-2851-8#monitor capture point ip cef pcap-point1 g0/0 both
LY-2851-8#monitor capture point associate pcap-point1 pcap-buffer1
LY-2851-8#monitor capture point start pcap-point1
LY-2851-8#monitor capture point stop pcap-point1
LY-2851-8#monitor capture buffer pcap-buffer1 export ftp://10.17.0.252/images/test.cap
Gig0/0
31. Network State Awareness & troubleshooting
iOAM6 (prototype)
• Instrumented IPv6 extension header on user packets
  • vs. IPv4 record-route option header
  • v6 extension headers better designed
• Domain-level control
• Minimal performance hit (handled in data plane)
• Packets continue on regular path
• Instrumentation
  • Packet sequence numbers => detect packet loss
  • Time stamps => one-way delay
  • Node and ingress/egress interface names => path recording
[Figure: Enhanced Telemetry. Per-hop and end-to-end data is added into the packets of (selected) data traffic: Node-ID, ingress i/f, egress i/f, Sequence#, Timestamp, App-Data. Network elements feed apps/controllers for: v6 traffic matrix, live flow tracing, delay distribution, bi-casting control, loss matrix/monitor, app data monitoring.]
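How sequence numbers and timestamps carried in the packet translate into loss and one-way delay can be shown with a small Python sketch. It assumes each received packet yields a record of (sequence number, send time, receive time) in milliseconds and that sender and receiver clocks are synchronized; the record layout and the name `analyze_telemetry` are hypothetical and not part of iOAM6.

```python
def analyze_telemetry(records):
    """Derive lost sequence numbers and one-way delays from received packets.

    records: iterable of (sequence_number, tx_time_ms, rx_time_ms), one per
    packet that actually arrived. Returns (lost_sequence_numbers, delays_ms).
    """
    records = sorted(records)
    if not records:
        return [], []
    seen = {seq for seq, _, _ in records}
    first, last = records[0][0], records[-1][0]
    # Any gap in the sequence space between first and last is a lost packet.
    lost = [seq for seq in range(first, last + 1) if seq not in seen]
    # One-way delay is only meaningful if both clocks are synchronized.
    delays_ms = [rx - tx for _, tx, rx in records]
    return lost, delays_ms


if __name__ == "__main__":
    received = [(1, 0, 10), (2, 1000, 1012), (4, 3000, 3011)]
    print(analyze_telemetry(received))  # ([3], [10, 12, 11])
```

Loss at the tail of a stream cannot be detected this way, since no later sequence number arrives to reveal the gap.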
32. Network State Awareness & troubleshooting
iOAM6 Path Trace
• Extended Ping
H1#ping
Protocol [ip]: ipv6
Target IPv6 address: ::A:1:1:0:1D
Repeat count [5]: 1
Datagram size [100]: 300
Timeout in seconds [2]:
Extended commands? [no]: yes
Source address or interface: gig0/1
UDP protocol? [no]:
Verbose? [no]: yes
Precedence [0]:
DSCP [0]:
Include hop by hop Path Record option? [no]: yes
Sweep range of sizes? [no]:
Type escape sequence to abort.
Sending 1, 300-byte ICMP Echos to ::A:1:1:0:1D, timeout is 2 seconds:
(Gi0/1)R1(Gi0/2)----(Gi0/1)R4(Gi0/2)----(Gi0/2)R3(Gi0/3)----H3----(Gi0/3)R3(Gi0/2)----(Gi0/2)R4(Gi0/1)----(Gi0/2)R1(Gi0/1)
Reply to request 0 (35 ms)
Success rate is 100 percent (1/1), round-trip min/avg/max = 35/35/35 ms
[Figure: H1 reaches H3 (::A:1:1:0:1D) through routers R1, R2, R3 and R4. Callouts: "V6 extension header applied/decapped" (shown twice), "End system ICMP stack iOAM6 enabled".]
34. Network State Awareness & troubleshooting
Control Plane
• 3Ws: When, where, and what
• Change is normal, but some changes are more interesting:
  • Single change that causes loss of reachability or suboptimal performance
  • Instability: high rate of change
35. Network State Awareness & troubleshooting
Logging
• Centrally: for ease of analysis and search
  • syslog-ng: preprocessing, relay and store (file/db)
  • Logstash (ELK), fluentd: multisource collection, storage and analysis
• Locally: in case logs can't get home
36. Network State Awareness & troubleshooting
State of the Routing Table
• Be familiar with normal behavior of important service prefixes
• Establish quickly if problem is control plane or data plane
• Check routing table / ipRouteTable MIB / check ip traffic (drop stats)
• Track objects
#show ip route 192.168.2.2
Routing entry for 192.168.2.2/32
Known via "ospf 1", distance 110, metric 11, type intra area
Last update from 10.0.0.2 on FastEthernet0/0, 00:00:13 ago
Routing Descriptor Blocks:
* 10.0.0.2, from 2.2.2.2, 00:00:13 ago, via FastEthernet0/0
Route metric is 11, traffic share count is 1
37. Network State Awareness & troubleshooting
OSPF Area / AS-Wide
• Remember that OSPF data in an area should be consistent
• Understand the "normal" rate of change
  • LSA refresh every 30 min unless there is a change
  • Track SPF runs over time
  • Number of LSAs expected
  • OSPF-MIB: ospfSpfRuns, ospfAreaLsaCount
• Route missing?
  • Where is the network supposed to be attached? Is it still?
  • check interface (on advertising router)
  • check ospf database …
# show ip ospf
Routing Process "ospf 1" with ID 192.168.0.1
Start time: 00:01:46.195, Time elapsed: 00:48:27.308
Supports only single TOS(TOS0) routes
Supports opaque LSA
Supports Link-local Signaling (LLS)
Supports area transit capability
Supports NSSA (compatible with RFC 3101)
Supports Database Exchange Summary List Optimization (RFC 5243)
Event-log enabled, Maximum number of events: 1000, Mode: cyclic
Router is not originating router-LSAs with maximum metric
Initial SPF schedule delay 5000 msecs
Minimum hold time between two consecutive SPFs 10000 msecs
Maximum wait time between two consecutive SPFs 10000 msecs
Incremental-SPF disabled
Minimum LSA interval 5 secs
Minimum LSA arrival 1000 msecs
LSA group pacing timer 240 secs
Interface flood pacing timer 33 msecs
Retransmission pacing timer 66 msecs
Number of external LSA 0. Checksum Sum 0x000000
Number of opaque AS LSA 0. Checksum Sum 0x000000
Number of DCbitless external and opaque AS LSA 0
Number of DoNotAge external and opaque AS LSA 0
Number of areas in this router is 1. 1 normal 0 stub 0 nssa
Number of areas transit capable is 0
External flood list length 0
IETF NSF helper support enabled
Cisco NSF helper support enabled
Reference bandwidth unit is 100 mbps
Area BACKBONE(0)
Number of interfaces in this area is 4 (1 loopback)
Area has no authentication
SPF algorithm last executed 00:47:05.379 ago
SPF algorithm executed 4 times
Area ranges are
Number of LSA 16. Checksum Sum 0x078460
Number of opaque link LSA 0. Checksum Sum 0x000000
Number of DCbitless LSA 0
Number of indication LSA 0
Number of DoNotAge LSA 0
Flood list length 0
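Tracking SPF runs over time amounts to sampling a monotonically increasing counter (for example the "SPF algorithm executed N times" value above) and looking at its rate of change. The Python sketch below shows one way to do that; the sample format, the threshold and the names `spf_runs_per_hour` and `is_unstable` are assumptions for illustration, and what counts as a "normal" rate has to come from your own baseline.

```python
def spf_runs_per_hour(samples):
    """Rate of SPF runs between consecutive polls.

    samples: [(unix_time_s, spf_run_counter), ...] in time order.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0 or c1 < c0:
            continue  # skip bad ordering or a counter reset (e.g. a reload)
        rates.append((c1 - c0) * 3600.0 / (t1 - t0))
    return rates


def is_unstable(samples, max_runs_per_hour):
    """True if any polling interval exceeded the baseline SPF rate."""
    return any(rate > max_runs_per_hour for rate in spf_runs_per_hour(samples))


if __name__ == "__main__":
    polls = [(0, 4), (1800, 4), (3600, 5), (5400, 35)]
    print(spf_runs_per_hour(polls))   # [0.0, 2.0, 60.0]
    print(is_unstable(polls, 20))     # True
```

The same approach applies to the LSA count: a value that drifts from the expected number suggests a topology change worth investigating.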
38. Network State Awareness & troubleshooting
OSPF Neighborships
• Neighbor adjacencies
• Check ospf neighbor detail (OSPF-MIB: ospfNbrState, ospfNbrEvents, ospfNbrLsRetransQLen)
• How many state changes occur?
• What is the current state?
• Any retransmission happening?
• Check the interface queue
# show ip ospf neighbor detail
Neighbor 192.168.0.7, interface address 10.0.0.3
In the area 0 via interface GigabitEthernet0/1
Neighbor priority is 1, State is FULL, 6 state changes
DR is 10.0.0.3 BDR is 10.0.0.4
Options is 0x12 in Hello (E-bit, L-bit)
Options is 0x52 in DBD (E-bit, L-bit, O-bit)
LLS Options is 0x1 (LR)
Dead timer due in 00:00:39
Neighbor is up for 00:33:10
Index 2/2/2, retransmission queue length 0, number of retransmission 0
First 0x0(0)/0x0(0)/0x0(0) Next 0x0(0)/0x0(0)/0x0(0)
Last retransmission scan length is 0, maximum is 0
Last retransmission scan time is 0 msec, maximum is 0 msec
39. Network State Awareness & troubleshooting
Neighbors
Are the neighbors bouncing constantly?
• Logs will tell us why the neighbor is bouncing, but what do they mean?
• e.g. "peer restarted" means you have to ask the peer; it is the one that restarted the session
Neighbor 10.1.1.1 (Ethernet0) is down: peer restarted
Neighbor 10.1.1.1 (Ethernet0) is up: new adjacency
Neighbor 10.1.1.1 (Ethernet0) is down: holding time expired
Neighbor 10.1.1.1 (Ethernet0) is down: retry limit exceeded
(Others, but not often)
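To answer whether neighbors are bouncing constantly, the log messages can be counted per neighbor and per reason. The Python sketch below assumes the message wording shown on this slide; real log lines carry timestamps and facility prefixes and vary by platform, so the regular expression and the name `summarize_flaps` are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as:
#   Neighbor 10.1.1.1 (Ethernet0) is down: peer restarted
LINE = re.compile(r"Neighbor (\S+) \((\S+)\) is (up|down): (.+)")


def summarize_flaps(lines):
    """Count 'down' events per (neighbor, interface) and per stated reason."""
    downs = Counter()
    reasons = Counter()
    for line in lines:
        match = LINE.search(line)
        if match and match.group(3) == "down":
            downs[(match.group(1), match.group(2))] += 1
            reasons[match.group(4).strip()] += 1
    return downs, reasons


if __name__ == "__main__":
    log = [
        "Neighbor 10.1.1.1 (Ethernet0) is down: peer restarted",
        "Neighbor 10.1.1.1 (Ethernet0) is up: new adjacency",
        "Neighbor 10.1.1.1 (Ethernet0) is down: holding time expired",
        "Neighbor 10.1.1.1 (Ethernet0) is down: retry limit exceeded",
    ]
    downs, reasons = summarize_flaps(log)
    print(downs.most_common(), reasons.most_common())
```

Grouping by reason helps decide where to look next: "peer restarted" points at the remote side, while "holding time expired" suggests lost hellos on the link or a busy local or remote control plane.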
41. Network State Awareness & troubleshooting
BGP Monitoring Protocol
• IETF draft-ietf-grow-bmp-14
• BMP client (router) provides pre-policy view of the ADJ-RIB-IN of a peer
• Update messages from peer sent to BMP receiver
• Example uses:
  • Realtime visualizer of BGP state
  • Traffic engineering analytics
  • BGP policy exploration
42. Network State Awareness & troubleshooting
OpenBMP
Historical record of prefix withdrawals
Current route views and peer status
http://www.openbmp.org
44. Network State Awareness & troubleshooting
Be Prepared!
• Be prepared and have data collection systems enabled
• Enable passive monitoring on endpoints and network
• Enable active tests
• Helpdesk
• Interview Script => establish & maintain checklists
• Multi-group access to tools, logs, etc.
• Firefighters run drills, so should your teams!
• Be familiar with the tools and how they respond on your network
• Red phone: Cross-domain teams (applications, UC, security, servers)
45. Network State Awareness & troubleshooting
Expanding your Toolbox and Knowledge
• Great open source tools to look at
• Network topology & IP address management: netdot, GestióIP
• Performance tests: iperf3
• Service checks: Nagios Core, Zenoss Community
• NetFlow / Log analysis: logstash, fluentd
• Template driven config generation: ansible