1. An Overview of Distributed Debugging
Anant Narayanan
Advanced Topics in Distributed Systems
November 17, 2009
2. The Problem
Introduction
Offline: liblog, Pervasiveness (TTVM), MaceMC
Online: D3S, CrystalBall
Conclusion

"Anything that can go wrong will go wrong."

Debugging is frustrating. Distributed debugging even more so!
Nov. 17, ATDS, Vrije Universiteit — Anant · Distributed Debugging
3. Why is this hard?
Errors are rarely reproducible
Non-determinism plays a big role in distributed systems
Remote machines appear to crash more often!
Interactions between several different components (possibly written in different languages) running on different computers are extremely intricate
Communication is unreliable and asynchronous
Existing debuggers are simply inadequate
6. Logging
Example

printf("The value of x at node %d: %d\n", nr, x);

The most primitive form of debugging; we all do it!
However, it is extremely difficult to capture all state this way, so it can be used only for small bugs.
Wouldn't it be a good idea to automatically capture and store all state information, so that we can analyze and possibly replay it at a later time?
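The capture-and-replay idea can be sketched in a few lines of C. This is a minimal illustration with invented names (`rr_log`, `rr_clock`, `next_value`), not any real library's API: in record mode the result of a nondeterministic call is appended to a log; in replay mode the logged value is returned instead, making the rerun deterministic.

```c
#include <stddef.h>

/* Minimal record/replay sketch (hypothetical names): liblog applies the
 * same idea to libc calls such as gettimeofday. */

typedef enum { RECORD, REPLAY } rr_mode;

typedef struct {
    rr_mode mode;
    long    log[64];
    size_t  len;   /* entries recorded */
    size_t  pos;   /* entries consumed during replay */
} rr_log;

/* Stand-in nondeterministic source, for demonstration only. */
static long rr_test_source = 0;
static long next_value(void) { return ++rr_test_source; }

/* Wrap a nondeterministic call: record its result, or replay it. */
long rr_clock(rr_log *l, long (*real_clock)(void)) {
    if (l->mode == RECORD) {
        long v = real_clock();
        l->log[l->len++] = v;   /* capture for later replay */
        return v;
    }
    return l->log[l->pos++];    /* replay: return exactly what was seen */
}
```

A replayed run then observes the same "clock" values as the recorded one, no matter what the real source would return now.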
7. Yes, it would!
[Stack: application → other libs → liblog → libc, on GNU/Linux, x86 hardware; liblog feeds a separate logger process]
Figure 1: Logging: liblog intercepts calls to libc and sends results to the logger process. The latter asynchronously compresses and writes the logs to local storage.
Intercepts all calls to libc (e.g., select, gettimeofday) using LD_PRELOAD
Shared logging with deterministic and Lamport clocks: an 8-byte clock embedded in outgoing messages ensures that timestamps respect the "happens-before" relationship
Provides consistent group replay, even in a mixed environment with non-logging processes
Integrates with gdb to provide central replay in a familiar environment
liblog runs on Linux/x86 computers and supports POSIX C/C++ applications
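The 8-byte clock tagging can be illustrated with a small Lamport-clock sketch in C. All names here are hypothetical and liblog's actual implementation differs; the point is only the update rules: each send embeds the sender's clock in the message, and each receive merges it, preserving the happens-before relation that consistent group replay relies on.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t clock;          /* local Lamport clock */
} lamport_t;

typedef struct {
    uint64_t stamp;          /* 8-byte tag embedded in the message */
    char     payload[56];
} tagged_msg_t;

/* A local event (e.g., a logged libc call) simply advances the clock. */
uint64_t lamport_tick(lamport_t *c) {
    return ++c->clock;
}

/* Sending: tick, then embed the timestamp in the outgoing message. */
void lamport_send(lamport_t *c, tagged_msg_t *m, const char *data) {
    m->stamp = lamport_tick(c);
    strncpy(m->payload, data, sizeof m->payload - 1);
    m->payload[sizeof m->payload - 1] = '\0';
}

/* Receiving: merge the sender's timestamp so the receive event is
 * ordered after the send event during replay. */
uint64_t lamport_recv(lamport_t *c, const tagged_msg_t *m) {
    if (m->stamp > c->clock)
        c->clock = m->stamp;
    return ++c->clock;
}
```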
8. Challenges
Signals and Threads
User-level cooperative scheduler on top of the OS scheduler
Unsafe Memory Access
All malloc calls are effectively calloc
Consistent Replay for UDP/TCP
Packets are annotated
Finding Peers in a Mixed Environment
Local ports are tracked; initialization with other liblog hosts occurs

Is liblog for you?
High disk usage; heterogeneous systems and tight spin-locks disallowed; 16-byte per-message network overhead; and, finally, limited consistency
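The "malloc is effectively calloc" trick is easy to sketch. This is an illustrative wrapper (`logged_malloc` is an invented name), not liblog's code; a real interposer would override malloc itself via LD_PRELOAD. Zero-filling every allocation makes reads of uninitialized memory deterministic, so a replayed run sees the same values as the recorded one.

```c
#include <stdlib.h>
#include <string.h>

/* Zero-fill every allocation so uninitialized reads are deterministic
 * across record and replay runs. */
void *logged_malloc(size_t n) {
    void *p = malloc(n);
    if (p != NULL)
        memset(p, 0, n);   /* deterministic contents instead of garbage */
    return p;
}
```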
9. A Pervasive Debugger
Debuggers are unable to access all the state that we sometimes need, because a debugger is just another program!
Debugging is usually either vertical or horizontal:

Figure 1: (a) Horizontal debugging spans multiple processes (Java clients, a C web server) running on Linux and FreeBSD above a virtual machine monitor; (b) vertical debugging cuts down through one stack (Java client, Java Virtual Machine, operating system).
10. A Pervasive Debugger
Why are debuggers peers of the application being debugged, rather than being placed in the underlying system?

Figure 2: The pervasive debugger sits beneath processes, threads, virtual machines, and operating systems, enabling both horizontal and vertical debugging.

Pervasive debugging extends the debugger to include the underlying software and hardware layers, so that all interactions between a process and its environment can be examined. This architecture allows us to perform both vertical and horizontal debugging. The environment in which an operating system executes encompasses more than just memory and processor state: the operating system can access all of the various hardware devices attached to the machine.
11. Let’s Look at an Application
A Virtual Machine Monitor (VMM) is capable of monitoring and logging far more state than a userspace library can!
By running an application inside a VM, we are able to log not just CPU instructions, memory, network, and disk I/O, but also interrupts, clock values, and signals
We can also log network, memory, and disk traffic byte-for-byte
Remember, device drivers can have bugs too!
Time-traveling virtual machines (TTVM) take advantage of all this by using User-Mode Linux (UML) and integrating with gdb to provide a unified, easy-to-use debugging environment
12. How This Works
Figure 1: System structure: UML runs as two user processes on the host Linux OS, the guest-kernel host process and the guest-user host process; TTVM functionality (checkpointing, logging, replay) lives in the host operating system, and gdb communicates with the host process via a remote serial protocol.

The host operating system, UML, and gdb are modified to allow time-travel back to earlier checkpoints, replaying execution with guest-kernel breakpoints
The system takes checkpoints at regular intervals; TTVM's ability to travel forward and back in time is implemented by modifying the host OS, and gdb is extended to make use of this time-traveling functionality

Performance
Checkpointing every 25s adds just 4% overhead!
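The checkpoint/log/replay mechanism can be illustrated with a toy deterministic-replay sketch in C. All names are hypothetical and TTVM's real implementation works at the VMM/host-OS level; the shape of the idea is: nondeterministic inputs are logged, a checkpoint snapshots state plus a log position, and travelling back restores the snapshot and deterministically replays the log up to the target point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy "guest" whose state evolves deterministically from logged inputs. */
typedef struct { uint64_t state; } guest_t;

/* One deterministic step driven by a logged nondeterministic input. */
static void guest_step(guest_t *g, uint64_t logged_input) {
    g->state = g->state * 31 + logged_input;
}

typedef struct {
    guest_t snapshot;
    size_t  log_pos;        /* how many log entries were consumed */
} checkpoint_t;

checkpoint_t take_checkpoint(const guest_t *g, size_t log_pos) {
    checkpoint_t c = { *g, log_pos };
    return c;
}

/* Time-travel: restore the checkpoint and replay the input log until
 * `target` entries have been consumed. */
guest_t replay_to(checkpoint_t c, const uint64_t *log, size_t target) {
    assert(target >= c.log_pos);
    for (size_t i = c.log_pos; i < target; i++)
        guest_step(&c.snapshot, log[i]);
    return c.snapshot;
}
```

Because all nondeterminism is in the log, replaying from any checkpoint reproduces the live execution exactly.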
13. Model Checking
We've seen which tools we can use after a bug has been found; is there anything we can do before deploying an application?
Model checkers, which essentially perform state-space exploration, can be used to gain confidence in a system
MaceMC is one such model checker, tailored for verifying large distributed applications

Definition
Safety Property
A property that should always be satisfied
Liveness Property
A property that should always be eventually satisfied
14. Life, Death and the Critical Transition
Each node is a state machine
At each step in the execution, an event handler for a particular pending event at a node is called
Thus, the entire system is represented as a giant state machine with specific event handlers defined
Of course, MaceMC requires liveness and safety properties in order to start the checks

Definition
Critical Transition
A transition from a live state to a dead state, from which a liveness property can never be satisfied
15. A 3-Step Process
1 Bounded depth-first search
2 Random walks
3 Isolating critical transitions

Figure 1: State Exploration. We perform bounded depth-first search (BDFS) from the initial state (or search prefix): most periphery states are indeterminate, i.e., not live, and thus are either dead or transient. We execute random walks from the periphery states and flag walks not reaching live states as suspected violating executions.

Attempting to specify such conditions with safety properties quickly becomes overwhelming and hopelessly complicated, especially when contrasted with the simplicity and succinctness of the liveness property "eventually, for all n, inflightSize() = 0", i.e., that eventually there are no packets in flight.

Is MaceMC for you?
Requires a concrete and theoretical model of your system. Existing code must be understood and represented as a state machine and properties! Too much work?
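The first two steps can be sketched on a toy state machine in C. This is entirely illustrative (MaceMC operates on Mace event handlers, not integer states): bounded DFS enumerates periphery states, and a random walk from each periphery state flags states that never reach a live state as suspected liveness violations.

```c
#include <limits.h>
#include <stdint.h>

#define MAX_SUCC 2

/* Toy protocol: from state s we can go to s-1 (progress) or s+1 (retry),
 * except state 4 has a buggy transition to a stuck state (-1) from which
 * the live state 0 is unreachable: a dead state. */
static int successors(int s, int out[MAX_SUCC]) {
    if (s == -1) { out[0] = -1; return 1; }   /* stuck: dead state */
    int n = 0;
    out[n++] = (s - 1 >= 0) ? s - 1 : 0;
    out[n++] = (s == 4) ? -1 : s + 1;
    return n;
}

static int is_live(int s) { return s == 0; }

/* Random walk from s: 1 if a live state is reached within `steps`. */
int random_walk_reaches_live(int s, int steps, uint32_t seed) {
    uint32_t r = seed;
    for (int i = 0; i < steps; i++) {
        if (is_live(s)) return 1;
        int succ[MAX_SUCC];
        int n = successors(s, succ);
        r = r * 1664525u + 1013904223u;       /* simple LCG */
        s = succ[r % n];
    }
    return is_live(s);
}

/* Bounded DFS to `depth`, then a random walk from each periphery state.
 * Returns the first suspected-violating state, or INT_MIN if none. */
int bdfs_find_suspect(int s, int depth, int walk_steps) {
    if (depth == 0)
        return random_walk_reaches_live(s, walk_steps, (uint32_t)(s + 7))
               ? INT_MIN : s;
    int succ[MAX_SUCC];
    int n = successors(s, succ);
    for (int i = 0; i < n; i++) {
        int r = bdfs_find_suspect(succ[i], depth - 1, walk_steps);
        if (r != INT_MIN) return r;
    }
    return INT_MIN;
}
```

Step 3 (isolating the critical transition) would then search the flagged execution for the exact live-to-dead transition, which is omitted here.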
16. Debugging Deployed Solutions
Because real debuggers run on a live, deployed system!
Instead of verifying liveness properties in advance, why not let the system itself do a state-space search for you?
D3S does exactly that, by letting the developer specify predicates that are automatically verified by the system on the fly.

Key Challenge
Allowing developers to express predicates easily, and to verify those predicates in a distributed manner, with minimal overhead and without disrupting the system!
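The flavor of such a predicate can be sketched as a sequential check over a snapshot of exposed client states, here for a lock-service invariant (one Exclusive holder and no Shared holders, or no Exclusive holder). The C below is purely illustrative; D3S's real interface is a C++ API with staged verifiers, and all names here are invented.

```c
#include <stddef.h>

typedef enum { HOLD_NONE, HOLD_SHARED, HOLD_EXCL } hold_t;

/* One client's exposed view of one lock (what it has cached locally). */
typedef struct {
    int    client_id;
    int    lock_id;
    hold_t mode;
} exposed_state_t;

/* Evaluate the invariant for one lock over a snapshot of exposed states.
 * Returns 1 if it holds, 0 on a violation. */
int lock_predicate(const exposed_state_t *snap, size_t n, int lock_id) {
    int excl = 0, shared = 0;
    for (size_t i = 0; i < n; i++) {
        if (snap[i].lock_id != lock_id) continue;
        if (snap[i].mode == HOLD_EXCL)   excl++;
        if (snap[i].mode == HOLD_SHARED) shared++;
    }
    return (excl == 0) || (excl == 1 && shared == 0);
}

/* Check all locks; return the first violating lock id, or -1 if none. */
int check_snapshot(const exposed_state_t *snap, size_t n,
                   const int *locks, size_t nlocks) {
    for (size_t i = 0; i < nlocks; i++)
        if (!lock_predicate(snap, n, locks[i]))
            return locks[i];
    return -1;
}
```

Because clients cache locks locally, only a snapshot assembled from the clients' exposed states can evaluate this predicate; no single machine knows it.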
32. Specifying Predicates in D3S
Simple C++ API for specifying predicates and state
Verifier and state-exposer processes can run on different machines, allowing partitioned execution
Safety-property violations are immediately logged; liveness properties time out after checking in several stages
Example: Boxwood's distributed lock service. Because clients cache locks locally (to reduce traffic between the clients and the lock server), only the clients know the current state of a lock; the invariant is that a lock has either one Exclusive holder and no Shared holders, or no Exclusive holder. The developer organizes the predicate into stages connected in an acyclic graph and describes this graph with the script part of the code (Figure 2 shows the code the developer writes).
34. Steering Deployed Solutions
So, D3S can detect property violations, but can we do anything about them?
CrystalBall attempts to give us the ultimate solution, by gazing at the future and steering the application away from disaster!
Many distributed applications block on network I/O; let's use those free CPU cycles for some useful work...
Packet transmission is faster in simulation than in reality
Can we stay one state-step ahead at all times?
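The one-step-ahead idea can be sketched in C. This toy (invented names, a trivial "join" protocol) only illustrates the shape of execution steering: run the handler on a copy of local state first, and if the predicted state violates a safety property, drop the message and rely on the protocol's normal handling of lost messages.

```c
/* Toy node state for a membership protocol. */
typedef struct { int members; int capacity; } node_state_t;

/* Event handler: a JOIN message adds a member. */
static void handle_join(node_state_t *s) { s->members++; }

/* Safety predicate: membership must never exceed capacity. */
static int safe(const node_state_t *s) { return s->members <= s->capacity; }

/* Returns 1 if the message was executed, 0 if it was steered away. */
int steer_and_maybe_execute(node_state_t *s) {
    node_state_t predicted = *s;   /* cheap checkpoint of local state */
    handle_join(&predicted);       /* consequence prediction: simulate */
    if (!safe(&predicted))
        return 0;                  /* would violate safety: drop message */
    handle_join(s);                /* prediction is safe: execute for real */
    return 1;
}
```

The real system predicts consequences over neighborhood snapshots with a model checker, not a single local copy, but the filter-before-execute structure is the same.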
35. CrystalBall Architecture
Figure 4: High-level overview of CrystalBall: a CrystalBall node combines local checkpoints and neighbor info with a model checker whose consequence predictions report violations to the runtime.

Checking ahead of execution is possible because the model checker can simulate packet transmission in time shorter than the actual network latency.
Deep online debugging: property violations are recorded
Execution steering: erroneous conditions are avoided before they occur
Example from the figure's discussion: node n13's join request initiates a TCP connection; n13 eventually succeeds in joining the random tree (perhaps after some other nodes have joined first), and stale information about n9 is removed.
36. Challenges
Specifying state and properties: uses MaceMC
Consistent snapshots: only neighbors are involved
Consequence prediction: refined state-space search
Steering without disruption: filters rely on the distributed system handling "dropped" messages

How did it do?
Bugs were found in RandTree, Chord, and Bullet' while in deep online debugging mode.
As for execution steering, Bullet' ran for 1.4 hours with 121 inconsistent states that were never reached, and no false negatives. When run on Paxos, inconsistencies at runtime were avoided between 74% and 89% of the time.
37. Your Takeaways
Tools
liblog and TTVM are at your disposal for debugging in the familiar gdb environment after a crash occurs
MaceMC model checking gives you theoretical confidence in your system before you deploy it

Systems
D3S detects and logs the reason for property violations based on your specifications
CrystalBall can take this one step further and prevent your distributed system from executing towards bad states

Recommendation
Use a combination of these tools and systems to make all your debugging problems go away!
38. Performance: liblog
Figure 4: Packet rate reduction: maximum UDP send rate (packets/second) for various datagram sizes (32 to 1024 bytes), with and without liblog. The maximum standard deviation over all points is 1.3 percent of the mean.
Figure 6: TCP throughput (MB/s) for wget with and without liblog, over Gigabit, LAN, US, and Australia links.
39. Performance: TTVM
Figure 5: Space overhead of checkpoints (MB/sec) versus checkpoint interval (sec), for kernel build, SPECweb, and PostMark. For long runs, programmers will cap the maximum space used by checkpoints by deleting selected checkpoints.
Figure 6: Time to restore (sec) versus distance to restore point (sec), for the same workloads.
More frequent checkpoints cause the disk block allocation to resemble a pure logging disk, which improved the spatial locality for writes for PostMark. Because checkpointing adds little time overhead, it is reasonable to perform long debugging runs while checkpointing relatively often (say, every 25 seconds).
40. Performance: D3S
(a) Slowdown (%) per peer with average packet size 390 bytes and different exposing frequencies (1 to 16 threads; free riders are peers 46-56).
(b) Slowdown (%) with average frequency 347 /s and different exposing packet sizes (8 to 1024 bytes).
When a system already has specs (e.g., at the component level), writing predicates is mostly an easy task: for complex, well-designed systems the predicates can check the invariants, and developers are allowed to use sequential programs over snapshots.
41. Performance: CrystalBall
Figure 10: The memory consumed by consequence prediction (RandTree, depths 7 to 8) fits in an L2 CPU cache.
CDF: fraction of nodes versus download time (s) for BulletPrime (baseline) and BulletPrime (CrystalBall).
Execution steering is successful in avoiding the inconsistency at runtime 74% and 89% of the time for bug1 and bug2, respectively. In these cases, a node starts model checking after node C receives checkpoints from other participants; running the MaceMC model checker for 3.3 seconds, C successfully predicts that the scenario in the second round would result in a violation of the safety property, and it installs the event filter. The avoidance by execution steering happens when C rejects the Propose message. Execution steering is more effective for bug2, as the former involves resetting B, which leaves more time for the model checker to rediscover the problem by i) consequence prediction or ii) a previously identified erroneous scenario. The event filter engages 25% and 7% of the time (cases when model checking did not have time to uncover the inconsistency) and prevents the inconsistency from occurring later by dropping the message from C at node B. CrystalBall could not prevent the violation in only 1% and 4% of the runs, respectively; the cause of these false negatives was the stale set of checkpoints.