
Fast dynamic analysis, Kostya Serebryany

  1. 1. Fast dynamic program analysis Race detection Konstantin Serebryany <kcc@google.com> May 20 2011
  2. 2. Agenda ● Dynamic program analysis ● Race detection: theory ● ThreadSanitizer: race detector ● Making ThreadSanitizer faster ● Announcement of a new tool (premiere) ● War stories
  3. 3. Dynamic analysis ● Execute program and monitor interesting events ● Lightweight: no need to monitor memory accesses ○ Leak detection (monitor malloc/free) ○ Deadlock detection (monitor lock/unlock) ● Heavyweight: monitor memory accesses: ○ Memory bugs: ■ Out-of-bound, use-after-free, uninitialized reads ○ Races ○ Pointer taintedness analysis ● Many more: profiling, coverage, ...
  4. 4. Data races are scary A data race occurs when two or more threads concurrently access a shared memory location and at least one of the accesses is a write. std::map<int,int> my_map; void Thread1() { my_map[123] = 1; } void Thread2() { my_map[345] = 2; } Our goal: find races in Google code
  5. 5. Happens-before (precedes) partial order on all events Segment: a sequence of READ/WRITE events of one thread Signal(obj) → Wait(obj) is a happens-before arc Seg1 h.b. Seg4 -- segments belong to the same thread. Seg1 h.b. Seg5 -- due to a Signal/Wait pair with a matching object. Seg1 h.b. Seg7 -- happens-before is transitive. Seg3 and Seg6 -- no ordering constraint.
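
The happens-before relation on slide 5 can be tracked at run time with vector clocks. The sketch below is an illustration under that assumption; the slides do not say how ThreadSanitizer represents the relation, and `VectorClock`, `HappensBefore`, and `Join` are hypothetical names. A segment's clock is copied to the object on Signal(obj) and merged into the waiting thread's clock on Wait(obj).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vector clock: one logical counter per thread.
using VectorClock = std::vector<unsigned>;

// a happens-before b iff a <= b in every component and a != b.
bool HappensBefore(const VectorClock& a, const VectorClock& b) {
  assert(a.size() == b.size());
  bool strictly_less = false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] > b[i]) return false;
    if (a[i] < b[i]) strictly_less = true;
  }
  return strictly_less;
}

// Wait(obj) merges the clock published by the matching Signal(obj)
// into the waiting thread's clock (component-wise maximum).
void Join(VectorClock& into, const VectorClock& from) {
  assert(into.size() == from.size());
  for (std::size_t i = 0; i < into.size(); ++i) {
    into[i] = std::max(into[i], from[i]);
  }
}
```

Two segments whose clocks are incomparable in both directions have no ordering constraint, which is the Seg3/Seg6 case on the slide.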
  6. 6. LockSet void Thread1() { mu1.Lock(); mu2.Lock(); *X = 1; mu2.Unlock(); mu1.Unlock(); ... } void Thread2() { mu1.Lock(); mu3.Lock(); *X = 2; mu3.Unlock(); mu1.Unlock(); ... } ● LockSet: a set of locks held during a memory access ○ Thread1: {mu1, mu2} ○ Thread2: {mu1, mu3} ● Common LockSet: intersection of LockSets ○ {mu1}
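
The common LockSet on slide 6 is a plain set intersection. A minimal sketch, assuming locks are identified by name; `LockSet`, `CommonLockSet`, `MayRace`, and `CheckSlideExample` are hypothetical helpers, not ThreadSanitizer's API.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

// Hypothetical representation: a lock is identified by its name.
using LockSet = std::set<std::string>;

// Locks that were held during both accesses.
LockSet CommonLockSet(const LockSet& a, const LockSet& b) {
  LockSet common;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(common, common.begin()));
  return common;
}

// Under the pure LockSet discipline, two accesses are suspicious
// when no single lock protects both of them.
bool MayRace(const LockSet& a, const LockSet& b) {
  return CommonLockSet(a, b).empty();
}

// The slide's example: Thread1 holds {mu1, mu2}, Thread2 holds {mu1, mu3}.
void CheckSlideExample() {
  const LockSet thread1{"mu1", "mu2"};
  const LockSet thread2{"mu1", "mu3"};
  assert(CommonLockSet(thread1, thread2) == LockSet{"mu1"});
  assert(!MayRace(thread1, thread2));
}
```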
  7. 7. Dynamic race detector: state machine ● Intercepts program events at run-time ○ Memory access: READ, WRITE ○ Synchronization: LOCK, UNLOCK, SIGNAL, WAIT ● Maintains global state ○ Locks, other synchronization events, threads ○ Memory allocation ● Maintains shadow state for each memory location (byte) ○ Records previous accesses ○ Reports race in appropriate state. E.g. current WRITE ■ ... does not happen-before previous READ ■ ... and previous WRITE have no common Locks.
  8. 8. ThreadSanitizer ● Implemented in late 2008, open source. ● Initially based on the Valgrind binary translation framework. ● SLOW, 20x-50x slowdown. ○ Binary translation overhead is 1.5x-3x ○ Serializes threads (up to 8x on our machines) ○ Slow generalized state machine. ● Slow is bad: ○ Many tests (and bugs) are timing dependent ○ Users are unhappy ○ Machines cost money ● Still very useful -- found thousands of races all over Google. ○ Server-side software (e.g. bigtable, GWS) ○ Google Chrome browser
  9. 9. ThreadSanitizer: algorithm
  10. 10. Speedup #1: fast path state machine ● Observation: 90%-99% of reads/writes are thread-private ● Simplification: special case for thread-private access ○ Very few global objects touched ○ No loops (~20 hand-written if/else statements) ○ 1.5x speedup
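
The idea on slide 10 can be illustrated with a shadow cell that remembers the only thread that has touched a location. This is a simplified sketch, not the actual ~20 if/else statements of ThreadSanitizer; `ShadowCell`, `OnAccess`, and the slow-path counter are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int32_t kNoThread = -1;  // location never accessed
constexpr std::int32_t kShared = -2;    // location seen by 2+ threads

// Hypothetical shadow cell: remembers the only thread seen so far.
struct ShadowCell {
  std::int32_t owner = kNoThread;
};

// Returns true if the access was handled entirely by the fast path.
// slow_path_calls stands in for invoking the full state machine.
bool OnAccess(ShadowCell& cell, std::int32_t tid, unsigned& slow_path_calls) {
  assert(tid >= 0);
  if (cell.owner == tid) return true;  // thread-private: nothing to do
  if (cell.owner == kNoThread) {       // first access: claim ownership
    cell.owner = tid;
    return true;
  }
  cell.owner = kShared;                // shared: fall back to the slow path
  ++slow_path_calls;
  return false;
}
```

The fast path reads and writes only the cell itself, which is what makes it cheap when most accesses are thread-private.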
  11. 11. Speedup #2: parallel fast path ● Fast path does not touch global state (almost) ○ easy to parallelize (fast path w/o a lock, fallback to serialized slow path) ● Valgrind is not parallel, so used PIN (pintool.org) ○ Good alternative, also works on Windows. ○ But not being open source is a huge disadvantage. ● Up to #CPUs times speedup (for Chrome: ~2x). ● Problem: how to fight races in the detector itself (Valgrind can't run PIN)? ○ OUCH!
  12. 12. Speedup #3: faster instrumentation ● Valgrind/PIN add 1.5x-3x slowdown. Why pay that price? ● Use compiler instrumentation ○ + Less run-time overhead ○ - Need to recompile all libraries to catch races there ● Implemented LLVM and GCC plugins. Indeed 1.5x-3x faster. ● Bonus: now can detect races in the parallel race detector ○ TSan-Valgrind over TSan-LLVM ● Result: up to 50M memory events per second
  13. 13. Speedup #4: sampling ● Idea: ignore some accesses in hot regions ○ LiteRace, PLDI'09 ● Execution counter for every code region (function or smaller). ● While the counter is small, don't ignore the region ● Larger counter -- ignore more frequently ● Moderate sampling rate: loses no races, 2x-4x speedup. if (num_to_skip-- <= 0) { HandleThisRegion(); num_to_skip = (counter>>(sampling_rate))+1; counter += num_to_skip; }
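
The snippet on slide 13 can be wrapped in a small per-region structure to see how the fraction of analyzed executions drops as the counter grows. `RegionSampler`, `OnRegionEntry`, and the `handled` field are hypothetical; `handled` stands in for the call to `HandleThisRegion()`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-region sampling state built around the slide's snippet.
struct RegionSampler {
  std::uint64_t counter = 0;     // grows as the region gets hotter
  std::int64_t num_to_skip = 0;  // executions left to ignore
  std::uint64_t handled = 0;     // how many executions were analyzed

  // Returns true when this execution of the region is analyzed.
  bool OnRegionEntry(unsigned sampling_rate) {
    assert(sampling_rate < 64);
    if (num_to_skip-- <= 0) {
      ++handled;  // stand-in for HandleThisRegion()
      num_to_skip = static_cast<std::int64_t>(counter >> sampling_rate) + 1;
      counter += static_cast<std::uint64_t>(num_to_skip);
      return true;
    }
    return false;
  }
};
```

Because `num_to_skip` is derived from `counter >> sampling_rate`, a larger `sampling_rate` keeps the skip distance small for longer, so more executions of a warm region are still analyzed.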
  14. 14. Results ● 1.5x-4x slowdown ● Can run Chrome interactively ○ Play Farmville or use GMail. ● Finds more bugs per day.
  15. 15. Premiere: AddressSanitizer (ASAN) ● Many memory error detectors exist: ○ Slow: Valgrind, DrMemory, Purify, Boundschecker, Insure++, Intel Inspector, mudflap, ... ○ Incomplete: libgmalloc, Electric Fence, Page Heap, ... ● AddressSanitizer (ASAN): fast address sanity checker ○ Use-after-free ○ Out-of-bound (aka buffer overflow) for heap and stack ○ Double-free, etc ○ Linux, Mac, ChromeOS ○ 2x-2.5x slowdown (faster than Debug build!) ○ LLVM instrumentation module + specialized malloc
  16. 16. Generic addressability checking ● malloc()/free() replacement library (most tools): ○ poison redzones around malloc-ed memory ○ poison memory on free() ○ delay reuse of free-ed memory ● Stack poisoning (few tools) ● Instrument all loads and stores ○ if (IsPoisoned(mem)) BANG(); ● The tricky part: how to implement IsPoisoned and BANG
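
The scheme on slide 16 can be modeled with a toy heap that keeps one poison flag per byte. This is an illustrative sketch only, not AddressSanitizer's implementation: `ToyHeap`, its arena size, and its redzone size are assumptions, and the bump allocator never reuses freed memory, which is the degenerate case of "delay reuse".

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy model: a bump allocator over a fixed arena with one poison flag per byte.
class ToyHeap {
 public:
  static constexpr std::size_t kArenaSize = 1024;
  static constexpr std::size_t kRedzone = 16;

  ToyHeap() { poisoned_.fill(true); }  // everything starts unaddressable

  // Returns the arena offset of a block of `size` bytes. The bytes on
  // both sides of the block stay poisoned and act as redzones.
  std::size_t Malloc(std::size_t size) {
    assert(next_ + size + 2 * kRedzone <= kArenaSize);
    const std::size_t begin = next_ + kRedzone;
    for (std::size_t i = 0; i < size; ++i) poisoned_[begin + i] = false;
    next_ = begin + size + kRedzone;
    return begin;
  }

  // Poisons the block again; the memory is never handed out a second time.
  void Free(std::size_t begin, std::size_t size) {
    assert(begin + size <= kArenaSize);
    for (std::size_t i = 0; i < size; ++i) poisoned_[begin + i] = true;
  }

  // The check that instrumentation would run before every load and store.
  bool IsPoisoned(std::size_t offset) const {
    assert(offset < kArenaSize);
    return poisoned_[offset];
  }

 private:
  std::array<bool, kArenaSize> poisoned_;
  std::size_t next_ = 0;
};
```

An access one byte past a block lands in a redzone, and any access to a freed block hits poisoned bytes, so both the out-of-bound and use-after-free cases reduce to the same `IsPoisoned` check.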
  17. 17. AddressSanitizer algorithm ● Mem => Shadow is an 8-to-1 mapping ● Instrumenting an 8-byte access to Mem: Shadow = (Mem>>3)+0x20000000; if (*Shadow) { /* 1 byte load */ Bad = Shadow * 2; *Bad = 0; /* SEGV! */ } ● Address ranges in the memory-layout diagram, top to bottom: [0x80000000, 0xffffffff], [0x60000000, 0x7fffffff], [0x40000000, 0x47ffffff], [0x30000000, 0x3fffffff], [0x20000000, 0x23ffffff], [0x00000000, 0x1fffffff]
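
The address arithmetic on slide 17 can be checked at compile time. The constant and the two formulas come from the slide; the function names are hypothetical, and reading the listed ranges as application memory, shadow, and `Bad` targets is an inference from the arithmetic rather than something the transcript states.

```cpp
#include <cstdint>

// Offset taken from the slide's 32-bit layout.
constexpr std::uint32_t kShadowOffset = 0x20000000u;

// Shadow = (Mem >> 3) + 0x20000000: eight bytes of memory share one shadow byte.
constexpr std::uint32_t MemToShadow(std::uint32_t mem) {
  return (mem >> 3) + kShadowOffset;
}

// Bad = Shadow * 2: the address the instrumented code writes to in order to fault.
constexpr std::uint32_t ShadowToBad(std::uint32_t shadow) {
  return shadow * 2u;
}

// [0x00000000, 0x1fffffff] maps onto [0x20000000, 0x23ffffff].
static_assert(MemToShadow(0x00000000u) == 0x20000000u, "low range start");
static_assert(MemToShadow(0x1fffffffu) == 0x23ffffffu, "low range end");

// [0x80000000, 0xffffffff] maps onto [0x30000000, 0x3fffffff].
static_assert(MemToShadow(0x80000000u) == 0x30000000u, "high range start");
static_assert(MemToShadow(0xffffffffu) == 0x3fffffffu, "high range end");

// Doubling a shadow address lands in the two remaining ranges.
static_assert(ShadowToBad(0x20000000u) == 0x40000000u, "bad range, low");
static_assert(ShadowToBad(0x30000000u) == 0x60000000u, "bad range, high");
```

Under this reading, reporting an error costs a single store: the write to `Bad` faults, and the faulting address encodes the shadow address that triggered the report.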
  18. 18. AddressSanitizer demo
