A short but packed course on TCP Dynamic Behavior. It starts by explaining TCP from scratch so the dynamic parts can be understood. Then it dives deep into how TCP behaves in real IP networks in the face of packet losses, delays and other phenomena.
2. TCP: The Basics
Transmission Control Protocol, invented in 1974 by Cerf & Kahn
Provides connection-oriented, reliable, in-order octet delivery between hosts in the Internet...
...at the expense of potentially long delays and low throughput
One of the four possible choices to exchange data on top of IP
3. TCP: The Basics
[Figure: TCP packet format (from Wikipedia, the free encyclopedia), header followed by Payload]
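The packet format can be made concrete with a short sketch that splits a raw segment into its fixed 20-octet header and its payload. The names `TcpHeader` and `parse_tcp_header` are illustrative only (they are not part of the slides), and options are skipped rather than decoded.

```python
import struct
from typing import NamedTuple, Tuple


class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    data_offset: int  # header length in 32-bit words
    flags: int        # low 9 bits of the offset/flags field
    window: int
    checksum: int
    urgent: int


# Bit masks for the classic flags.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20


def parse_tcp_header(segment: bytes) -> Tuple[TcpHeader, bytes]:
    """Split a raw TCP segment into fixed header fields and payload."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 octets long")
    # Network byte order: ports (2x16 bit), SEQ and ACK (2x32 bit),
    # offset/flags, window, checksum, urgent pointer (4x16 bit).
    src, dst, seq, ack, off_flags, win, csum, urg = struct.unpack(
        "!HHIIHHHH", segment[:20]
    )
    data_offset = off_flags >> 12
    header = TcpHeader(src, dst, seq, ack, data_offset,
                       off_flags & 0x1FF, win, csum, urg)
    # Anything between octet 20 and data_offset*4 is options (skipped here).
    return header, segment[data_offset * 4:]


# Example: a hand-built SYN segment carrying two payload octets.
raw = struct.pack("!HHIIHHHH", 12345, 80, 1000, 0, (5 << 12) | SYN, 65535, 0, 0) + b"hi"
hdr, payload = parse_tcp_header(raw)
print(hdr.src_port, hdr.dst_port, hdr.seq, bool(hdr.flags & SYN), payload)
```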
4. TCP: The Basics
The TCP Sliding Window (the LEN value is inferred)
Alice Bob
SEQ=x, LEN=dA
SEQ=y, ACK=x+dA, LEN=dB
SEQ=x+dA, ACK=y+dB, LEN=dA
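The SEQ/ACK bookkeeping in the exchange above can be reproduced with a minimal sketch. The `Endpoint` class is hypothetical and models only in-order delivery without loss; x, y, dA and dB are given arbitrary example values.

```python
from dataclasses import dataclass

MOD = 2 ** 32  # sequence numbers are 32-bit and wrap around


@dataclass
class Endpoint:
    next_seq: int       # SEQ of the next octet this side will send
    next_expected: int  # ACK value: next octet expected from the peer

    def send(self, length: int) -> dict:
        """Emit a segment carrying `length` octets and advance SEQ."""
        segment = {"SEQ": self.next_seq, "ACK": self.next_expected, "LEN": length}
        self.next_seq = (self.next_seq + length) % MOD
        return segment

    def receive(self, segment: dict) -> None:
        """Accept an in-order segment and advance the ACK value."""
        if segment["SEQ"] == self.next_expected:
            self.next_expected = (self.next_expected + segment["LEN"]) % MOD


# x=1000, y=5000, dA=100, dB=50
alice = Endpoint(next_seq=1000, next_expected=5000)
bob = Endpoint(next_seq=5000, next_expected=1000)

s1 = alice.send(100)  # SEQ=x, LEN=dA
bob.receive(s1)
s2 = bob.send(50)     # SEQ=y, ACK=x+dA, LEN=dB
alice.receive(s2)
s3 = alice.send(100)  # SEQ=x+dA, ACK=y+dB, LEN=dA
print(s1, s2, s3, sep="\n")
```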
5. TCP: The Basics
Connection establishment
Alice Bob
SYN, SEQ=x, LEN=0
SYN, SEQ=y, ACK=x+1, LEN=0
SEQ=x+1, ACK=y+1, LEN=dA
Presence of ACK allows
Bob to tell this message
apart from the initial one
6. TCP: The Basics
Connection release - ordered
Alice Bob
FIN, SEQ=x, LEN=0
FIN, SEQ=y, ACK=x+1, LEN=0
possibly k more octets
FIN, SEQ=y+k, ACK=x+1, LEN=0
FIN, SEQ=x, ACK=y+k+1, LEN=0
The additional 1 allows Bob
to tell this message apart
from a spontaneous FIN
from Alice
The additional 1 allows Alice
to tell this message apart
from a re-transmission of the
initial FIN from Bob
8. TCP: The Basics
Segment exchange, Nagle algorithm
Alice Bob
SEQ=x, ACK=y, LEN=dA
SEQ=x+dA, ACK=y, LEN=dA
SEQ=x+2dA, ACK=y, LEN=dA/3
time-out
SEQ=y, ACK=x+dA, LEN=0
Nagle criterion #1: send when a complete segment (dA octets) is available
Nagle criterion #3: send when the time-out happens
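The send decision can be sketched as a single predicate. The slide shows criteria #1 and #3 only; criterion #2 below (send when nothing is in flight) is the classic Nagle rule and is an assumption about what the missing criterion was. The function name and its parameters are illustrative.

```python
def nagle_should_send(buffered: int, mss: int, unacked: int,
                      waited: float, timeout: float) -> bool:
    """Decide whether the sender may emit a segment now.

    buffered: octets waiting in the send buffer
    mss:      size of a complete segment (dA on the slide)
    unacked:  octets sent but not yet acknowledged
    waited:   seconds the buffered data has been waiting
    timeout:  Nagle time-out in seconds
    """
    if buffered <= 0:
        return False           # nothing to send
    if buffered >= mss:
        return True            # criterion #1: a complete segment is ready
    if unacked == 0:
        return True            # criterion #2 (assumed): nothing in flight
    return waited >= timeout   # criterion #3: the time-out happened


# A third of a segment is buffered while earlier data is still unacknowledged:
print(nagle_should_send(buffered=500, mss=1500, unacked=3000, waited=0.05, timeout=0.2))
print(nagle_should_send(buffered=500, mss=1500, unacked=3000, waited=0.25, timeout=0.2))
```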
9. TCP: The Basics
Segment exchange, delayed ACK
Alice Bob
SEQ=x, ACK=y, LEN=dA
SEQ=y, ACK=x+dA, LEN=dB
SEQ=x+dA, ACK=y+dB, LEN=dA
SEQ=x+2dA, ACK=y+dB, LEN=dA
SEQ=x+3dA, ACK=y+dB, LEN=dA
SEQ=y+dB, ACK=x+3dA, LEN=dB
SEQ=y+2dB, ACK=x+4dA, LEN=dB
typically >200ms
Single ACK acknowledges multiple segments
One ACK for every two full segments
10. TCP: The Basics
Segment exchange, loss
Alice Bob
SEQ=x, ACK=y, LEN=dA
SEQ=x+dA, ACK=y, LEN=dA
SEQ=x+2dA, ACK=y+dB, LEN=dA
SEQ=y, ACK=x+dA, LEN=dB
RTO
SEQ=x, ACK=y, LEN=dA
SEQ=y+dB, ACK=x+2dA, LEN=0
SEQ=x+dA, ACK=y, LEN=dA
11. TCP: The Basics
Segment exchange, loss with fast recovery
Alice Bob
SEQ=x, ACK=y, LEN=dA
SEQ=y, ACK=x+dA, LEN=dB
SEQ=x+dA, ACK=y, LEN=dA
SEQ=x+3dA, ACK=y+dB, LEN=dA
SEQ=x+dA, ACK=y+2dB, LEN=dA
SEQ=y+dB, ACK=x+dA, LEN=dB
SEQ=y+dB, ACK=x+3dA, LEN=0
SEQ=x+4dA, ACK=y+2dB, LEN=dA
SEQ=x+2dA, ACK=y, LEN=dA
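How a sender notices a single lost segment without waiting for the RTO can be sketched as a duplicate-ACK counter. The threshold of three duplicates follows RFC 5681; the `DupAckDetector` class is illustrative and ignores sequence-number wrap-around.

```python
from typing import Optional

DUP_ACK_THRESHOLD = 3  # per RFC 5681


class DupAckDetector:
    """Track incoming ACK numbers and flag the segment to retransmit."""

    def __init__(self, initial_ack: int) -> None:
        self.last_ack = initial_ack
        self.dup_count = 0

    def on_ack(self, ack: int) -> Optional[int]:
        """Return the SEQ to retransmit, or None if nothing is needed."""
        if ack > self.last_ack:
            # New data acknowledged: reset the duplicate counter.
            self.last_ack = ack
            self.dup_count = 0
            return None
        if ack == self.last_ack:
            self.dup_count += 1
            if self.dup_count == DUP_ACK_THRESHOLD:
                # The receiver keeps asking for the same octet: resend it.
                return ack
        return None


detector = DupAckDetector(initial_ack=1000)
for ack in (1100, 1100, 1100, 1100, 1400):
    print(ack, detector.on_ack(ack))
```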
14. TCP: Congestion Control
Congestion and how it shows up:
•Congestion is the situation in which a router within the IP
network is not able to route all the traffic offered to it
•A router is congested when one or more of its ingress
queues are full
•Congestion manifests at end hosts in the form of one or
more lost TCP segments, which is known as a "congestion
event"
15. TCP: Congestion Control
How does TCP react to a congestion event?
•A congestion event may take one of two shapes:
1. A burst loss, detected when no ACK for the sent bytes is received within RTO
seconds of sending them
2. A single loss, detected through duplicate ACKs (ACKs triggered by bytes that arrive
after a not-yet-acknowledged byte)
•On detection of a congestion event at a sender, TCP throttles down its sending rate by
decreasing its send window size (represented as W), entering a state known as
Congestion Avoidance mode
•While in Congestion Avoidance mode, the sending window is known as the congestion
window and its size is represented as cWnd
16. TCP: Congestion Control
Congestion Avoidance mode in TCP-Reno
•On a congestion event, TCP-Reno slashes its sending window as follows:
•While in congestion avoidance mode, TCP-Reno increases its congestion window by n
segments for each n acknowledged segments:
•Additionally, in the face of a burst loss TCP-Reno shuts down its sending window and
starts a slow-start phase until it reaches the congestion window size:
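The formulas for this slide were not exported. As a rough sketch only, and assuming the standard Reno rules described in RFC 5681 (halve the window on a single loss, grow by one segment per round trip in congestion avoidance, restart from one segment and slow-start after a burst loss), the window evolution per round trip could look like this. The function name, its parameters and the per-round granularity are all assumptions.

```python
from typing import Iterable, List


def reno_trace(rounds: int,
               loss_rounds: Iterable[int] = (),
               timeout_rounds: Iterable[int] = (),
               initial_cwnd: float = 1.0,
               initial_ssthresh: float = 64.0) -> List[float]:
    """Return cWnd (in segments) at the start of each round trip."""
    loss_rounds = set(loss_rounds)
    timeout_rounds = set(timeout_rounds)
    cwnd, ssthresh = initial_cwnd, initial_ssthresh
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in timeout_rounds:
            # Burst loss (RTO): remember half the window, restart from 1.
            ssthresh = max(cwnd / 2, 2.0)
            cwnd = 1.0
        elif r in loss_rounds:
            # Single loss: halve the window and stay in congestion avoidance.
            ssthresh = max(cwnd / 2, 2.0)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: double per round trip, up to ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: one more segment per round trip.
            cwnd += 1.0
    return trace


print(reno_trace(8, loss_rounds={5}, initial_ssthresh=8.0))
```

With these assumptions the trace shows the familiar saw-tooth: exponential growth up to the threshold, linear growth afterwards, and a halving at the loss.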
20. TCP: Congestion Control
Other CC algorithms:
•TCP-Ledbat
•TCP-Ericsson-Akamai
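To know which CC algorithm your box is running:
$ cat /proc/sys/net/ipv4/tcp_congestion_control
reno
To change the CC algorithm in your box:
•at run time (requires super-user privileges): sysctl -w net.ipv4.tcp_congestion_control=<name>
•at build time: change CONFIG_DEFAULT_TCP_CONG in the kernel configuration (recorded in /boot/config-x.y.zz-generic) and rebuild the kernel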
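The same lookup can be done from a program. This is a minimal sketch, assuming a Linux-style /proc/sys tree; the helper name `read_sysctl` and its `root` parameter are illustrative, and the helper returns None when the entry does not exist (for example on a non-Linux host).

```python
from pathlib import Path
from typing import Optional


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Read a dotted sysctl name, e.g. net.ipv4.tcp_congestion_control."""
    # net.ipv4.foo maps to <root>/net/ipv4/foo
    path = Path(root, *name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None  # missing entry or no permission


print(read_sysctl("net.ipv4.tcp_congestion_control"))
```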
21. TCP: End host charact.
Main parameters affecting host performance:
sysctl name (net.ipv4) | meaning | default | explanation
tcp_low_latency | Nagle algorithm status | 0 | Nagle enabled
tcp_window_scaling | Window scaling status | 1 | Window scaling enabled
tcp_adv_win_scale | Window scaling factor | 2 | W = w2
tcp_wmem | Sending buffer size | 4KB 85KB* 170KB** | min/default/max
tcp_moderate_rcvbuf | Auto-tuning (2.6.7 and later) | 1*** | Auto-tuning enabled
*: tcp_wmem overrides /proc/sys/net/core/wmem_default
**: tcp_wmem overridden by /proc/sys/net/core/wmem_max
***: setsockopt() changing buffer sizes disables auto-tuning!!
Browse with (requires super-user privileges):
sysctl -a | fgrep net.ipv4.tcp
26. TCP: IP Router behavior
IP router simplified model:
switch fabric
Routing logic
Line cards Line cards
27. TCP: IP Router behavior
Queue management strategies:
Passive: drop every incoming packet when the ingress queue is full
Router can't choose, Red flow gets more share than Green and Blue
Active: drop incoming packets selectively when the ingress queue length grows above a threshold
Router can choose, thus share queue space more evenly
28. TCP: IP Router behavior
Active Queue Management flavors:
Random Early Discard (RED)
[Figure: drop probability distribution over queue length, with marks minThres, maxThres and 1]
RED mitigates tail-drops but is unfair
Weighted Fair Queueing (WFQ)
[Figure: token buckets]
WFQ mitigates tail-drops and is fair
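The RED drop probability can be sketched as a piecewise-linear function of the average queue length between minThres and maxThres. The function name and the `max_p` parameter (the probability reached just below maxThres) are assumptions; the figure itself was not exported, so the exact shape used on the slide is not known.

```python
def red_drop_probability(avg_queue: float, min_thres: float,
                         max_thres: float, max_p: float = 1.0) -> float:
    """Probability of dropping an incoming packet under RED."""
    if min_thres >= max_thres:
        raise ValueError("min_thres must be below max_thres")
    if avg_queue < min_thres:
        return 0.0   # short queue: never drop
    if avg_queue >= max_thres:
        return 1.0   # long queue: behave like tail-drop
    # In between, the probability rises linearly from 0 to max_p.
    return max_p * (avg_queue - min_thres) / (max_thres - min_thres)


for q in (5, 10, 20, 29, 30):
    print(q, red_drop_probability(q, min_thres=10, max_thres=30))
```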
29. TCP: IP Router behavior
Packet drop probability distributions
Passive
RED
WFQ
Ideal
32. TCP: Dynamic Performance
Window sizes for different BW and D values:
TCP w/o LFN extensions (RFC1323) limits window size to 2^16 bytes = 64KB
Solutions:
A) open multiple TCP connections (each has its own 64KB window)
B) use an LFN-friendly TCP stack (i.e. supporting RFC1323), like the one in Linux kernel 2.6.16 that
comes with LOTC
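The table of window sizes was not exported, but the relation behind it is the bandwidth-delay product. The sketch below assumes D is the round-trip time in seconds and BW is given in bit/s; the function names are illustrative.

```python
MAX_UNSCALED_WINDOW = 2 ** 16  # bytes, the limit without RFC1323 window scaling


def required_window_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Window needed to keep a path of the given BW and D full."""
    return bandwidth_bps * rtt_s / 8


def max_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Best-case throughput when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


# 100 Mbit/s path with D = 100 ms
print(required_window_bytes(100e6, 0.1))             # bytes needed
print(max_throughput_bps(MAX_UNSCALED_WINDOW, 0.1))  # bit/s with a 64KB window
```

On such a path a 64KB window allows only about 5 Mbit/s, which is why either several connections or window scaling is needed.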
33. TCP: Dynamic Performance
2 sources of packet loss:
1) Transmission errors: random, non-correlated
2.a) Router queues: random, correlated
2.b) Router queues with Active Queue Management (e.g. RED): random, non-correlated
Theoretical throughput model (non-correlated loss) is given by:
Max values are capped by send/receive window size limits
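The formula itself was not exported with the slide. Assuming it is the model from the first reference ("Macroscopic Behavior of the TCP Congestion Avoidance Algorithm"), throughput is BW = (MSS / D) * C / sqrt(p), with D the round-trip time, p the loss probability and C a constant close to 1. A small sketch, with an illustrative function name and an optional window cap:

```python
import math
from typing import Optional


def loss_limited_throughput_bps(mss_bytes: int, rtt_s: float, loss_prob: float,
                                c: float = 1.0,
                                window_bytes: Optional[float] = None) -> float:
    """Throughput bound BW = (MSS / D) * C / sqrt(p), in bit/s."""
    if not 0 < loss_prob <= 1:
        raise ValueError("loss_prob must be in (0, 1]")
    bw = (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_prob)
    if window_bytes is not None:
        # Max values are capped by the send/receive window size limits.
        bw = min(bw, window_bytes * 8 / rtt_s)
    return bw


# MSS = 1460 bytes, D = 100 ms, p = 1 %
print(loss_limited_throughput_bps(1460, 0.1, 0.01))
```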
35. TCP: Dynamic Performance
Congestion window size (cWnd) is independent of buffer size at the
receiver (RCVBUF):
[Figure not exported. Recoverable labels: p, RCVBUF; time marks D, 2D, 3D; levels w, 3w/2, wD/2, wD, 3wD/2; cWnd=2*w/2=w, cWnd'=(3w/2)/2=3w/4, cWnd''=(5w/4)/2=5w/8]
36. TCP: Dynamic Performance
BW limit for different p and D values (W = ∞, non-correlated loss with C=1):
Solutions:
Multiple TCP connections in theory allow more aggregated throughput; however, effect of increasing
load on the network at congestion spots is unknown
37. TCP: Dynamic Performance
Evolution of sender’s buffer with time (uniform arrival rate):
[Figure: queued packets over time; time marks D, 2D, 3D; levels w/2, 3w/4, w, (w/2)D]
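\[
\begin{aligned}
Q_i &= \frac{3w}{4} - \left[\left(\frac{w}{2} + i\right) - Q_{i-1}\right], \qquad Q_{-1} = 1 \\
Q_i &= \frac{w}{4} - i + Q_{i-1} \\
    &= \frac{w}{4} - i + \left[\frac{w}{4} - (i-1) + \left[\frac{w}{4} - (i-2) + \dots\right]\right] \\
    &= \frac{w}{4}\,i - \left[i + (i-1) + (i-2) + \dots\right] \\
    &= \frac{w}{4}\,i - \left[i \cdot i - (1 + 2 + 3 + \dots + i)\right] \\
    &= \frac{w}{4}\,i - i^2 + \frac{i(1+i)}{2} \\
    &= \frac{w}{4}\,i - i^2 + \frac{i^2 + i}{2} \\
    &= \frac{(w+2)\,i}{4} - \frac{i^2}{2}
\end{aligned}
\]
Note: the term $1+2+3+\dots+i$ is an arithmetic progression of $n=i$ elements with initial value $a_1=1$,
final value $a_n=i$ and common difference $d=1$, which sums up to
$n(a_1+a_n)/2 = i(1+i)/2 = (i^2+i)/2$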
41. TCP: Aggregation
Flows ingressing a congested back-bone router "resonate"
after a few packet drops:
[Figure not exported; period labels T2, T3, Tk]
Resonance takes place once every flow has suffered at least one drop
Resonance period tends to the average of all periods weighted by the
flow size
42. TCP: Aggregation
Resonating flows with similar RTT have their own micro-resonance;
this phenomenon is known as 'flocking':
[Figure not exported; period labels T2, T3]
There is no theoretical approach to calculating the micro-resonance
thresholds
Flows with short micro-resonance periods steal throughput from flows
with long micro-resonance periods
43. TCP: Application Behavior
Misbehaved applications:
A host is allowed to open as many flows as its resources can afford to the
same target host
IP routers deal with each flow blindly, ignoring the fact that many flows
might start and end in the same host pair
Therefore the bandwidth share taken by an application at a congested
link depends on the number of flows it puts through that link:
BW = k*BWi
Corollary: applications holding many flows between end-hosts are both
unfair to applications holding fewer flows and a potential congestion
cause
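The relation BW = k*BWi can be illustrated with a toy calculation. It assumes the congested link splits its capacity equally among flows, which is an idealisation; the function name and the application names are made up for the example.

```python
from typing import Dict


def app_bandwidth_share(link_bw: float, flows_per_app: Dict[str, int]) -> Dict[str, float]:
    """Bandwidth each application gets when every flow gets the same share BWi."""
    total_flows = sum(flows_per_app.values())
    if total_flows == 0:
        return {app: 0.0 for app in flows_per_app}
    per_flow = link_bw / total_flows                 # BWi
    return {app: k * per_flow for app, k in flows_per_app.items()}  # BW = k*BWi


# A 100 Mbit/s link shared by an application with 8 flows and one with 2 flows.
print(app_bandwidth_share(100.0, {"greedy": 8, "modest": 2}))
```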
44. TCP: Application Behavior
Misbehaved applications:
A host may start transmitting at a high rate (if the receiver
has enough buffer) then drop the connection when stable
state has been reached (connection is trained)
If the host re-connects very quickly and starts transmitting
at high rate again, on average it shall take more bandwidth
than its peers maintaining trained connections
Corollary: dropping a trained connection is never a good
idea if there's a chance it shall be used soon
45. TCP: Application Behavior
Well-behaved applications:
A sensible application shall open more connections if
needed, but close them if it perceives it is causing
congestion
A sensible application shall try to maintain and re-use
trained connections as much as possible
46. TCP: Application Behavior
De-bunking send&receive window limitations:
From Linux kernel 2.6.7, the TCP receiving window adjusts itself
depending on the free space in the buffer
Remember the /proc/sys/net/ipv4/tcp_wmem system variable?:
Window starts at the middle value (default 16KB)
Window grows and shrinks as needed depending on number of queued
segments
Window growth is limited by the right-most value (default 1MB), and
never shrinks to less than the left-most value (default 4KB)
Sending window has been self-adjusting from much earlier than 2.6.7
kernel
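The three values can be read as min/default/max bounds. The sketch below models only the clamping described above, not the kernel's actual auto-tuning logic; the function names are illustrative and the example line uses the defaults quoted on this slide.

```python
from typing import Tuple


def parse_tcp_mem(line: str) -> Tuple[int, int, int]:
    """Split a 'min default max' line, as found in tcp_wmem, into integers."""
    minimum, default, maximum = (int(value) for value in line.split())
    return minimum, default, maximum


def clamp_window(wanted_bytes: int, line: str) -> int:
    """Keep a desired window within the left-most and right-most values."""
    minimum, _default, maximum = parse_tcp_mem(line)
    return max(minimum, min(wanted_bytes, maximum))


line = "4096 16384 1048576"            # 4KB / 16KB / 1MB
print(parse_tcp_mem(line)[1])          # the window starts at the middle value
print(clamp_window(2_000_000, line))   # growth is limited by the right-most value
print(clamp_window(1_000, line))       # never less than the left-most value
```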
47. TCP: Research Issues
Research topic #1: improving congestion control by
routers
Though much more harmless than tail drops, single packet
drops drive TCP into congestion avoidance mode
Even when using WFQ, traffic spikes can cause the much
undesirable tail drops
If a router could signal transmitters when congestion is
coming, transmitters might adjust their transmission rates
without drops and without entering congestion avoidance
mode
This research field focuses on the use of the ECN bits in
the IP and TCP packet headers
48. TCP: Research Issues
Research topic #2: how to reconcile the congestion-bound
TCP traffic with the unbound, heavyweight UDP traffic
TCP traffic can be throttled up and down before, during
and after congestion
UDP traffic on the contrary cannot be throttled, and is
potentially causing more congestion than TCP
Focus of this research area is on congestion-controlling
UDP (for instance the just-created RMCAT IETF WG,
check https://datatracker.ietf.org/wg/rmcat/charter/)
49. TCP: Research Issues
Research topic #3: congestion control algorithms for 4G
radio accesses
Which of the existing CC algorithms (Reno, Vegas,
Cubic...) performs best in 4G RANs?
Would new CC algorithms perform better than existing
ones?
How can the more aggressive algorithms co-exist
peacefully with the more conservative ones?
Research report on CC algorithm performance in LTE
networks (link)
50. TCP: References
Macroscopic Behavior of the TCP Congestion Avoidance
Algorithm, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.25.3452
Modeling TCP Throughput: A Simple Model and its
Empirical Validation, http://?