Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. To build confidence that fault-tolerant systems are correctly implemented, an increasing number of large-scale sites practice Chaos Engineering, running regular failure drills in which faults are deliberately injected into their production systems. While fault injection infrastructures are becoming relatively mature, existing approaches either explore the space of potential failures randomly or exploit the “hunches” of domain experts to guide the search, because the combinatorial space of failure scenarios is too large to search exhaustively. Random strategies waste resources testing “uninteresting” faults, while programmer-guided approaches are only as good as the programmer's intuition and scale only with human effort.
In this talk, I will present intuition, experience and research directions related to lineage-driven fault injection (LDFI), a novel approach to automating failure testing. LDFI uses existing tracing or logging infrastructures to work backwards from good outcomes, identifying redundant computations that allow it to aggressively prune the space of faults that must be explored via fault injection. I will describe LDFI’s theoretical roots in the database research notion of provenance, report results from the lab as well as the field, and issue a call to arms for the reliability community to improve our understanding of when and how our fault-tolerant systems actually tolerate faults.
62. But how do we know redundancy when we see it?
Hard question: “Could a bad thing ever happen?”
Easier: “Exactly why did a good thing happen?”
“What could have gone wrong?”
63. Lineage-driven fault injection
Why did a good thing happen?
Consider its lineage.
[Figure: lineage graph. Nodes: "The write is stable"; "Stored on RepA", "Stored on RepB"; "Bcast1", "Bcast2"; "Client", "Client".]
64. Lineage-driven fault injection
Why did a good thing happen?
Consider its lineage.
What could have gone wrong?
Faults are cuts in the lineage graph.
Is there a cut that breaks all supports?
[Figure: lineage graph. Nodes: "The write is stable"; "Stored on RepA", "Stored on RepB"; "Bcast1", "Bcast2"; "Client", "Client".]
66. What would have to go wrong?
(RepA OR Bcast1)
[Figure: lineage graph. Nodes: "The write is stable"; "Stored on RepA", "Stored on RepB"; "Bcast1", "Bcast2"; "Client", "Client".]
67. What would have to go wrong?
(RepA OR Bcast1)
AND (RepA OR Bcast2)
[Figure: lineage graph. Nodes: "The write is stable"; "Stored on RepA", "Stored on RepB"; "Bcast1", "Bcast2"; "Client", "Client".]
68. What would have to go wrong?
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
[Figure: lineage graph. Nodes: "The write is stable"; "Stored on RepA", "Stored on RepB"; "Bcast1", "Bcast2"; "Client", "Client".]
69. What would have to go wrong?
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
AND (RepB OR Bcast1)
[Figure: lineage graph. Nodes: "The write is stable"; "Stored on RepA", "Stored on RepB"; "Bcast1", "Bcast2"; "Client", "Client".]
70. Lineage-driven fault injection
[Figure: lineage graph. Nodes: "The write is stable"; "Stored on RepA", "Stored on RepB"; "Bcast1", "Bcast2"; "Client", "Client".]
Hypothesis: {Bcast1, Bcast2}
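A hypothesis like this one is a satisfying assignment of the four-clause formula from the previous slides, where a variable set to true means "inject this fault". As a rough illustration, the sketch below brute-forces all assignments and keeps those with the fewest faults. It is a stand-in for a real SAT solver, and the variable ordering and helper names are assumptions made for the example.

```python
from itertools import product

# Variables and clauses from the preceding slides.
VARIABLES = ["RepA", "RepB", "Bcast1", "Bcast2"]
CLAUSES = [
    ("RepA", "Bcast1"),
    ("RepA", "Bcast2"),
    ("RepB", "Bcast2"),
    ("RepB", "Bcast1"),
]

def satisfies(assignment, clauses):
    """A CNF formula holds when every clause has at least one true variable."""
    return all(any(assignment[v] for v in clause) for clause in clauses)

def smallest_models(variables, clauses):
    """Brute-force all assignments and keep those with the fewest faults."""
    models = []
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if satisfies(assignment, clauses):
            models.append({v for v in variables if assignment[v]})
    fewest = min(len(m) for m in models)
    return [m for m in models if len(m) == fewest]

if __name__ == "__main__":
    for model in smallest_models(VARIABLES, CLAUSES):
        print(sorted(model))
```

For this formula the smallest solutions contain two faults: the pair of broadcasts shown on the slide, and the pair of replicas.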
71. Lineage-driven fault injection
[Figure: lineage graph after injecting the hypothesis. Nodes: "The write is stable"; "Stored on RepA", "Stored on RepB"; "Bcast1", "Bcast2", "Bcast3"; three "Client" nodes.]
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
AND (RepB OR Bcast1)
72. Lineage-driven fault injection
[Figure: lineage graph after injecting the hypothesis. Nodes: "The write is stable"; "Stored on RepA", "Stored on RepB"; "Bcast1", "Bcast2", "Bcast3"; three "Client" nodes.]
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
AND (RepB OR Bcast1)
AND (RepA OR Bcast3)
AND (RepB OR Bcast3)
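The progression across these slides suggests a loop: solve the formula for a hypothesis, inject it, and if the good outcome still occurs, fold the newly observed support paths into the formula and repeat. The sketch below illustrates that loop against a toy simulator. `run_with_faults` is a hypothetical stand-in for a real injection experiment, and its retry behavior (falling back to Bcast3 when both earlier broadcasts are lost) is an assumption chosen to mirror the slides.

```python
from itertools import combinations

def minimal_hypotheses(clauses):
    """Enumerate minimal fault sets that hit every clause (brute force)."""
    universe = sorted({c for clause in clauses for c in clause})
    found = []
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            candidate = frozenset(combo)
            if any(h <= candidate for h in found):
                continue  # contains a smaller solution, so not minimal
            if all(candidate & set(clause) for clause in clauses):
                found.append(candidate)
    return found

def run_with_faults(faults):
    """Toy system (assumed): returns observed support paths, or None on failure."""
    broadcasts = [b for b in ("Bcast1", "Bcast2") if b not in faults]
    if not broadcasts and "Bcast3" not in faults:
        broadcasts = ["Bcast3"]  # assumed retry when both broadcasts are lost
    paths = [(rep, b) for rep in ("RepA", "RepB") if rep not in faults
             for b in broadcasts]
    return paths or None

def ldfi(run):
    """Return the first fault set that prevents the outcome, plus the formula."""
    clauses = list(run(frozenset()))  # lineage of the fault-free run
    tried = set()
    while True:
        untried = [h for h in minimal_hypotheses(clauses) if h not in tried]
        if not untried:
            return None, clauses  # no remaining hypothesis breaks the outcome
        hypothesis = untried[0]
        tried.add(hypothesis)
        observed = run(hypothesis)
        if observed is None:
            return hypothesis, clauses  # injection prevented the outcome
        clauses.extend(p for p in observed if p not in clauses)

if __name__ == "__main__":
    counterexample, formula = ldfi(run_with_faults)
    print(sorted(counterexample) if counterexample else None, len(formula))
```

In a real deployment the set of admissible faults would normally be bounded (for example, excluding the loss of every replica), which this toy loop does not attempt to model.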
81. Case study: “Netflix AppBoot”
Services ~100
Search space (executions) 2^100
(1,000,000,000,000,000,000,000,000,000,000)
Experiments performed 200
Critical bugs found 11
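To put these figures in proportion, the arithmetic can be checked directly. The sketch assumes the search-space figure is read as 2^100, that is, one independent fail-or-survive choice for each of roughly 100 services; that reading, and treating "~100" as exactly 100, are assumptions made for the example.

```python
# Assumption: "~100" services treated as exactly 100, each either failed or not.
SERVICES = 100
EXPERIMENTS = 200  # experiments performed, from the slide

search_space = 2 ** SERVICES          # number of distinct fault combinations
fraction_explored = EXPERIMENTS / search_space

if __name__ == "__main__":
    print(f"{search_space:,}")
    print(f"{search_space:.2e}")
    print(f"{fraction_explored:.2e}")
```

The exact value is about 1.27 × 10^30, consistent with the rounded thirty-one-digit figure in parentheses on the slide.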
89. Work with us
Search prioritization
Input generation
Richer lineage collection
90. Search prioritization: minimal hitting sets
(A ∨ B ∨ C) ∧ (C ∨ D ∨ E ∨ F) ∧ (D ∨ E ∨ F ∨ G) ∧ (H ∨ I)
(A, B, C), (C, D, E, F), (D, E, F, G), (H, I)
⇩
e.g. (C, E, H) ✔
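Viewed as sets, each clause must be "hit" by at least one injected fault, so candidate experiments are hitting sets, and the minimal ones are those from which no fault can be removed. The sketch below enumerates the minimal hitting sets for the four sets above and orders them smallest first, as one possible prioritization. The exhaustive enumeration and the size-based ordering are illustrative assumptions rather than the method used in the talk.

```python
from itertools import combinations

# Clause sets from the slide; each must contain at least one injected fault.
CLAUSES = [set("ABC"), set("CDEF"), set("DEFG"), set("HI")]

def is_hitting_set(candidate, clauses):
    """True if the candidate shares at least one element with every clause."""
    return all(candidate & clause for clause in clauses)

def minimal_hitting_sets(clauses):
    """Enumerate by increasing size, skipping supersets of earlier solutions."""
    universe = sorted(set().union(*clauses))
    minimal = []
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            candidate = frozenset(combo)
            if any(m <= candidate for m in minimal):
                continue  # contains a smaller hitting set, so not minimal
            if is_hitting_set(candidate, clauses):
                minimal.append(candidate)
    return minimal

def prioritized(clauses):
    """Order hypotheses smallest first (an assumed prioritization heuristic)."""
    return sorted(minimal_hitting_sets(clauses),
                  key=lambda s: (len(s), sorted(s)))

if __name__ == "__main__":
    for hypothesis in prioritized(CLAUSES):
        print("".join(sorted(hypothesis)))
```

The set (C, E, H) from the slide appears among the results; exhaustive enumeration is only practical for small formulas like this one.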