Data race

May 12, 2016

  1. What is a Data Race?
     - Two concurrent accesses to a shared location, at least one of them for writing.
     - Indicative of a bug.

     Thread 1    Thread 2
     X++         T=Y
     Z=2         T=X
  2. How Can Data Races be Prevented?
     - Explicit synchronization between threads: locks, critical sections, barriers, mutexes, semaphores, monitors, events, etc.
     - Thread 1: Lock(m); X++; Unlock(m)
     - Thread 2: Lock(m); T=X; Unlock(m)
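The Lock(m)/Unlock(m) pairing on this slide can be sketched with a standard mutex. This is a minimal illustration, not code from the deck: the names `g_x`, `g_m`, `add_many` and `run_two_threads` are hypothetical, and both threads increment the shared variable so that the effect of the lock is observable in the final value.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical names for illustration only.
int g_x = 0;      // shared location (the slide's X)
std::mutex g_m;   // the slide's lock m

// Increment the shared counter `times` times, holding the lock around each X++.
void add_many(int times) {
    for (int i = 0; i < times; ++i) {
        std::lock_guard<std::mutex> guard(g_m);  // Lock(m); released at scope exit
        ++g_x;
    }
}

// Run two threads that both update X and return the final value.
int run_two_threads(int times) {
    g_x = 0;
    std::thread t1(add_many, times);
    std::thread t2(add_many, times);
    t1.join();
    t2.join();
    return g_x;
}
```

With the guard in place every increment is ordered with respect to the others, so `run_two_threads(n)` returns `2 * n`. Without it the two `++g_x` operations would race.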
  3. Is This Sufficient?
     - Yes!
     - No!
       - Programmer dependent
         - Correctness: the programmer may forget to synchronize. Need tools to detect data races.
       - Expensive
         - Efficiency: to achieve correctness, the programmer may overdo it. Need tools to remove excessive synchronization.
  4. Where is Waldo?

     #define N 100
     Type* g_stack = new Type[N];   // shared array
     int g_counter = 0;             // number of elements in use
     Lock g_lock;                   // protects g_stack and g_counter

     void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
     void pop( Type& obj )  { lock(g_lock); ... unlock(g_lock); }

     void popAll( ) {
       lock(g_lock);
       delete[] g_stack;
       g_stack = new Type[N];
       g_counter = 0;
       unlock(g_lock);
     }

     int find( Type& obj, int number ) {
       lock(g_lock);
       int i;                             // visible after the loop
       for (i = 0; i < number; i++)
         if (obj == g_stack[i]) break;    // Found!!!
       if (i == number) i = -1;           // Not found… Return -1 to caller
       unlock(g_lock);
       return i;
     }

     int find( Type& obj ) {
       return find( obj, g_counter );
     }
  5. Can You Find the Race?
     - Same code as slide 4, with one write and one read marked.
     - Write: popAll() (like push() and pop()) updates g_counter while holding g_lock.
     - Read: find( Type& obj ) reads g_counter to build the argument for find( obj, number ) before any lock is taken.
     - A similar problem was found in java.util.Vector.
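One way to close the race is to read the element count under the same lock that protects the scan. The sketch below is an assumption about a possible fix, not the deck's own solution: `LockedStack` and `findLocked` are hypothetical names, `Type` is fixed to `int`, and `std::vector` replaces the raw array to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the slide's global stack.
class LockedStack {
public:
    void push(int value) {
        std::lock_guard<std::mutex> guard(lock_);
        items_.push_back(value);
    }
    void popAll() {
        std::lock_guard<std::mutex> guard(lock_);
        items_.clear();
    }
    // Search the first `number` elements; returns the index or -1.
    int find(int value, std::size_t number) {
        std::lock_guard<std::mutex> guard(lock_);
        return findLocked(value, number);
    }
    // Search the whole stack. The size is read while the lock is held,
    // so it cannot change between the read and the scan.
    int find(int value) {
        std::lock_guard<std::mutex> guard(lock_);
        return findLocked(value, items_.size());
    }

private:
    // Caller must already hold lock_.
    int findLocked(int value, std::size_t number) const {
        if (number > items_.size()) number = items_.size();
        for (std::size_t i = 0; i < number; ++i)
            if (items_[i] == value) return static_cast<int>(i);
        return -1;
    }
    std::mutex lock_;
    std::vector<int> items_;
};
```

Both public `find` overloads take the lock once and delegate to a helper that assumes the lock is held, which avoids both the unlocked read and a recursive lock acquisition.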
  6. Detecting Data Races?
     - NP-hard [Netzer&Miller 1990]
       - Input size = # instructions performed
       - Even for 3 threads only
       - Even with no loops/recursion
     - Execution orders/scheduling: (#threads)^(thread_length)
     - # inputs
     - Detection code's side effects
     - Weak memory, instruction reordering, atomicity
  7. Motivation
     Run-time framework goals:
     - Collect a complete trace of a program's user-mode execution
     - Keep the tracing overhead low in both space and time
     - Re-simulate the traced execution deterministically, based on the collected trace, with full fidelity down to the instruction level
     - Full fidelity: user mode only, no tracing of the kernel, only user-mode I/O callbacks
     Advantages:
     - Complete program trace that can be analyzed from multiple perspectives (replay analyzers: debuggers, locality, etc.)
     - Trace can be collected on one machine and replayed on other machines (or analyzed live by streaming)
     Challenges: trace size and performance
  8. Original Record-Replay Approaches
     - InstantReplay '87
       - Records the order of memory accesses
       - Overhead may affect program behavior
     - RecPlay '00
       - Records only synchronizations
       - Not deterministic if the program has data races
     - Netzer '93
       - Records an optimal trace
       - Too expensive to keep track of all memory locations
     - Bacon & Goldstein '91
       - Record memory bus transactions with hardware
       - High logging bandwidth
  9. Motivation
     - Increasing use of, and development for, multi-core processors
     - MT program behavior is non-deterministic
     - To debug software effectively, developers must be able to replay executions that exhibit concurrency bugs
     - Shared memory updates happen in a different order
  10. Related Concepts
      - Runtime interpretation/translation of binary instructions
        - Requires no static instrumentation or special symbol information
        - Handles dynamically generated code and self-modifying code
        - Recording/logging: ~100-200x
      - More recent logging
        - Proposed hardware support (for MT domain)
        - FDR (Flight Data Recorder)
        - BugNet (cache bits set on first load)
        - RTR (Regulated Transitive Reduction)
        - DeLorean (ISCA 2008, chunks of instructions)
        - Strata (time layer across all the logs for the running threads)
        - iDNA (diagnostic infrastructure using Nirvana, Microsoft)
  11. Deterministic Replay
      Re-execute the exact same sequence of instructions as recorded in a previous run.
      - Single-threaded programs
        - Record load values needed for reproducing the behavior of a run (Load Log)
        - Registers updated by system calls and signal handlers (Reg Log)
        - Output of special instructions: RDTSC, CPUID (Reg Log)
        - System calls (virtualization: cloning arguments, updates)
        - Checkpointing (log summary ~10 million)
      - Multi-threaded programs
        - Log the interleaving among threads (shared memory update ordering, SMO Log)
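The Reg Log idea (capture the output of time- or machine-sensitive instructions once, then feed the same values back during replay) can be sketched as follows. This is an assumption-laden toy, not PinPLAY's or iDNA's implementation: `RegLog`, `observe`, `startReplay` and `program` are hypothetical names, and a callable stands in for an instruction such as RDTSC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical log of values produced by nondeterministic instructions.
class RegLog {
public:
    // Recording: run the real source and remember its result.
    // Replaying: ignore the source and return the next recorded value.
    std::uint64_t observe(const std::function<std::uint64_t()>& source) {
        if (!replaying_) {
            values_.push_back(source());
            return values_.back();
        }
        if (cursor_ >= values_.size())
            throw std::out_of_range("replay ran past the end of the log");
        return values_[cursor_++];
    }
    void startReplay() { replaying_ = true; cursor_ = 0; }
    std::size_t size() const { return values_.size(); }

private:
    std::vector<std::uint64_t> values_;
    std::size_t cursor_ = 0;
    bool replaying_ = false;
};

// Toy "program": its result depends on two timestamp-like reads.
std::uint64_t program(RegLog& log, const std::function<std::uint64_t()>& clock) {
    std::uint64_t start = log.observe(clock);
    std::uint64_t end = log.observe(clock);
    return end - start;
}
```

Running `program` once in recording mode and again after `startReplay()` yields the same result even if the second run is given a different clock, which is the property deterministic replay needs from these instructions.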
  12. PinSEL: System Effect Log (SEL)
      Logging program load values needed for deterministic replay:
      - First access to a memory location
      - Values modified by the system (system effect) and read by the program
      - Machine- and time-sensitive instructions (cpuid, rdtsc)
      Figure (program execution, each load marked Logged or Not Logged); operations shown: Load A (A = 111); Load C (C = 99); Load D (D = 10); Store A (A ← 111); Store B (B ← 55); Load B (B = 0); system call that modifies locations (B -> 0) and (C -> 99); Load C (C = 9); Load D (D = 10).
      - Trace size is ~4-5 bytes per instruction
  13. reads
      - Observation: hardware caches eliminate most off-chip reads
      - Optimize logging:
        - Logger and replayer simulate identical cache memories
        - A simple cache (the memory copy structure) decides which values to log. There are no tags or valid bits to check. If the values mismatch, they are logged.
      - Average trace size is <1 bit per instruction

      i = 1;
      for (j = 0; j < 10; j++) {
        i = i + j;
      }
      k = i;          // value read is 46
      System_call();
      k = i;          // value read is 0 (not predicted)

      - The only read that is not predicted, and is therefore logged, follows the system call
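The mismatch-only logging described on this slide can be sketched with a shadow copy of memory that the logger keeps in step with the program's own stores. This is a simplified illustration under stated assumptions: `LoadLogger`, `onStore`, `onLoad` and `slideExample` are hypothetical names, and a hash map stands in for the slide's tag-free cache (so a first access counts as a mismatch).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Addr = std::uint64_t;
using Word = std::int64_t;

// Hypothetical logger: predicts each load from a shadow copy of memory.
class LoadLogger {
public:
    // Program store: replay can recompute it, so only the shadow copy changes.
    void onStore(Addr a, Word v) { shadow_[a] = v; }

    // Program load that observed `actual`. Returns true if it had to be logged.
    bool onLoad(Addr a, Word actual) {
        auto it = shadow_.find(a);
        bool predicted = (it != shadow_.end() && it->second == actual);
        if (!predicted) {
            log_.push_back({a, actual});
            shadow_[a] = actual;  // keep the shadow copy in step with memory
        }
        return !predicted;
    }
    std::size_t logged() const { return log_.size(); }

private:
    struct Entry { Addr addr; Word value; };
    std::unordered_map<Addr, Word> shadow_;
    std::vector<Entry> log_;
};

// Mirrors the slide's snippet; j and k are treated as registers.
std::size_t slideExample() {
    LoadLogger logger;
    const Addr I = 0x10;      // assumed address of variable i
    Word i = 1;
    logger.onStore(I, i);
    for (Word j = 0; j < 10; ++j) {
        logger.onLoad(I, i);  // read i (predicted)
        i = i + j;
        logger.onStore(I, i);
    }
    logger.onLoad(I, i);      // k = i; value read is 46 (predicted)
    i = 0;                    // the system call overwrites i behind the program's back
    logger.onLoad(I, i);      // k = i; value read is 0 (mismatch, logged)
    return logger.logged();
}
```

In this sketch `slideExample()` returns 1: only the load after the simulated system call misses the prediction, matching the slide's observation.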
  14. Example Overhead
      - PinSEL and PinPLAY
      - Initial work (2006) with single-threaded programs:
        - SPEC2000 ref runs: 130x slowdown for PinSEL and ~80x for PinPLAY (w/o in-lining)
        - Working with a subset of SPLASH2 benchmarks: 230x slowdown for PinSEL
      - Now: geo-mean SPEC2006
        - Pin 1.4x
        - Logger 83.6x
        - Replayer 1.4x
  15. Example: Microsoft iDNA Trace Writer Performance

      Application   Simulated instr. (millions)   Trace file size   Trace file bits/instr.   Native exec. time   Exec. time while tracing   Exec. overhead
      Gzip          24,097                        245 MB            0.09                     11.7 s              187 s                      15.98
      Excel         1,781                         99 MB             0.47                     18.2 s              105 s                      5.76
      PowerPoint    7,392                         528 MB            0.60                     43.6 s              247 s                      5.66
      IE            116                           5 MB              0.50                     0.499 s             6.94 s                     13.90
      Vulcan        2,408                         152 MB            0.53                     2.74 s              46.6 s                     17.01
      Satsolver     9,431                         1300 MB           1.16                     9.78 s              127 s                      12.98

      - Memchecker and valgrind are in the 30-40x range on CPU 2006
      - iDNA ~11x (does not log shared-memory dependences explicitly)
      - Uses a sequential number for every lock-prefixed memory operation: offline data race analysis
  16. Logging Shared Memory Ordering (Cristiano's PinSEL/PLAY Overview)
      - Emulation of directory-based cache coherence
      - Identifies RAW, WAR, WAW dependences
      - Indexed by hashing the effective address
      - Each entry represents an address range
      Figure: memory references from the program execution (Store A, Load B) are hashed to one of the Dir Entry slots in the Directory.
  17. Directory Entries
      - Every DirEntry maintains:
        - Thread id of the last_writer
        - A timestamp, which is the number of memory references the thread has executed
        - Vector of timestamps of the last access by each thread to that entry
      - On loads: update the timestamp for the thread in the entry
      - On stores: update the timestamp and the last_writer fields
      Figure: program execution of Thread T1 (1: Store A, 2: Load A, 3: Load F) and Thread T2 (1: Load F, 2: Store A, 3: Store F) against DirEntry [A:D] and DirEntry [E:H]; each entry shows its last writer id and its vector of per-thread timestamps (T1, T2).
  18. Detecting Dependences
      - A RAW dependence between threads T and T' is established if:
        - T executes a load that maps to directory entry A
        - T' is the last_writer for the same entry
      - A WAW dependence between T and T' is established if:
        - T executes a store that maps to directory entry A
        - T' is the last_writer for the same entry
      - A WAR dependence between T and T' is established if:
        - T executes a store that maps to directory entry A
        - T' has accessed the same entry in the past and T is not the last_writer
  19. Example
      - Program execution, Thread T1: 1: Store A; 2: Load A; 3: Load F
      - Program execution, Thread T2: 1: Load F; 2: Store A; 3: Store F
      - Directory: DirEntry [A:D] and DirEntry [E:H], each with a last writer id and the last access to the DirEntry by T1 and T2
      - Dependences marked: WAW, RAW, WAR
      - SMO logs (pairs as shown): T1 2 / T2 2; T1 3 / T2 3; T2 2 / T1 1
        - Thread T1 cannot execute memory reference 2 until T2 executes its memory reference 2
        - Thread T2 cannot execute memory reference 2 until T1 executes its memory reference 1
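The directory rules and this example can be sketched in a few lines. Everything named here (`Directory`, `SmoEdge`, `slideExample`, the range size of 4, the numeric addresses) is hypothetical, and two points are assumptions rather than facts from the deck: the interleaving chosen for the six references, and the reading of the WAR rule as "every other thread that has touched the entry and is not already covered by the WAW edge". No log reduction is attempted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Dep { RAW, WAW, WAR };

// `waiter` may not execute reference waiterRef until `source` has executed sourceRef.
struct SmoEdge {
    Dep kind;
    int waiter;  std::uint64_t waiterRef;
    int source;  std::uint64_t sourceRef;
};

struct DirEntry {
    int lastWriter = -1;                    // -1: no store seen yet
    std::vector<std::uint64_t> lastAccess;  // per thread; 0 means "never"
};

class Directory {
public:
    Directory(int threads, std::size_t entries, std::uint64_t rangeSize)
        : refs_(threads, 0), dir_(entries), range_(rangeSize) {
        for (auto& e : dir_) e.lastAccess.assign(threads, 0);
    }
    void load(int tid, std::uint64_t addr) {
        std::uint64_t now = ++refs_[tid];   // timestamp = # memory refs so far
        DirEntry& e = entryFor(addr);
        if (e.lastWriter != -1 && e.lastWriter != tid)
            log_.push_back({Dep::RAW, tid, now, e.lastWriter, e.lastAccess[e.lastWriter]});
        e.lastAccess[tid] = now;
    }
    void store(int tid, std::uint64_t addr) {
        std::uint64_t now = ++refs_[tid];
        DirEntry& e = entryFor(addr);
        if (e.lastWriter != -1 && e.lastWriter != tid)
            log_.push_back({Dep::WAW, tid, now, e.lastWriter, e.lastAccess[e.lastWriter]});
        for (int t = 0; t < static_cast<int>(refs_.size()); ++t) {
            bool coveredByWaw = (t == e.lastWriter);
            if (t != tid && !coveredByWaw && e.lastAccess[t] != 0)
                log_.push_back({Dep::WAR, tid, now, t, e.lastAccess[t]});
        }
        e.lastAccess[tid] = now;
        e.lastWriter = tid;
    }
    const std::vector<SmoEdge>& log() const { return log_; }

private:
    DirEntry& entryFor(std::uint64_t addr) { return dir_[(addr / range_) % dir_.size()]; }
    std::vector<std::uint64_t> refs_;  // per-thread memory reference counters
    std::vector<DirEntry> dir_;
    std::uint64_t range_;
    std::vector<SmoEdge> log_;
};

// One interleaving consistent with the slide (assumed): T1 is thread 0, T2 is thread 1.
std::vector<SmoEdge> slideExample() {
    const std::uint64_t A = 0, F = 5;  // [A:D] -> entry 0, [E:H] -> entry 1
    Directory d(2, 2, 4);
    d.store(0, A);  // T1 ref 1
    d.load(1, F);   // T2 ref 1
    d.store(1, A);  // T2 ref 2: WAW on T1 ref 1
    d.load(0, A);   // T1 ref 2: RAW on T2 ref 2
    d.load(0, F);   // T1 ref 3
    d.store(1, F);  // T2 ref 3: WAR on T1 ref 3
    return d.log();
}
```

Under that interleaving the sketch records three edges, one per dependence type, and the first two correspond to the two SMO log sentences on the slide.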
  20. Ordering Memory Accesses (Reducing log size)
      - Preserving order will reproduce the execution
        - a→b: "a happens-before b"
        - Ordering is transitive: a→b and b→c mean a→c
      - Two instructions must be ordered if:
        - they both access the same memory, and
        - one of them is a write
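  21. Constraints: Enforcing Order
      - P1 executes a then b; P2 executes c then d.
      - To guarantee a→d, any one of these constraints is enough, given program order: a→d, b→d, a→c, b→c
      - Suppose we need b→c:
        - b→c is necessary
        - a→d is redundant (adding it as well is overconstrained)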
  22. Problem Formulation
      - Reproduce the exact same conflicts: no more, no less
      Figure: the instruction streams of Thread I and Thread J (ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D) shown once for recording and once for replay, with a log between them. Conflicts are drawn in red, dependences in black.
  23. Log All Conflicts
      - Detect conflicts
      - Write log
      - Assign IC (logical timestamps)
      - But too many conflicts
      Figure: Thread I and Thread J, instructions numbered 1 to 6 in each thread, during replay.
      - Log J: 2→3, 1→4, 3→5, 4→6
      - Log I: 2→3
      - Dependence log entry: 16 bytes
      - Log size: 5*16 = 80 bytes (10 integers)
  24. Netzer's Transitive Reduction
      Figure: Thread I and Thread J, instructions numbered 1 to 6 in each thread, during replay.
      - TR reduced Log J: 2→3, 3→5, 4→6
      - Log I: 2→3
      - Log size: 64 bytes (8 integers)
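The reduction from four entries to three in Log J can be reproduced with a small filter. This is a simplified sketch, not Netzer's full algorithm: it only considers dependences that all point from one thread to the other, and it assumes they arrive in the order they were recorded (non-decreasing destination). `Dependence`, `reduce` and `logBytes` are hypothetical names; the 16-byte entry size is taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// "Instruction src of the other thread happens before instruction dst of this thread."
struct Dependence { int src; int dst; };

// Keep a dependence only if no already-kept one implies it.
// A kept (s, d) implies (src, dst) when src <= s and d <= dst, because
// src -> s and d -> dst follow from program order and s -> d is in the log.
std::vector<Dependence> reduce(const std::vector<Dependence>& deps) {
    std::vector<Dependence> kept;
    for (const Dependence& c : deps) {
        bool implied = false;
        for (const Dependence& k : kept) {
            if (c.src <= k.src && k.dst <= c.dst) { implied = true; break; }
        }
        if (!implied) kept.push_back(c);
    }
    return kept;
}

// Log size in bytes, at 16 bytes per dependence entry as on the slides.
std::size_t logBytes(std::size_t entries) { return entries * 16; }
```

Applied to Log J from slide 23 (2→3, 1→4, 3→5, 4→6), the filter drops 1→4, since 1→2 and 3→4 follow from program order and 2→3 is already logged. Together with the single entry in Log I that gives four entries, or 64 bytes.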
  25. RTR (Regulated Transitive Reduction): Stricter Dependences to Aid Vectorization
      Figure: Thread I and Thread J, instructions numbered 1 to 6 in each thread, during replay; one stricter dependence replaces dependences of the reduced log.
      - New reduced Log J: 2→3, 4→5
      - Log I: 2→3
      - Log size: 48 bytes (6 integers)
      - 4% overhead for RTR+FDR (simulated on GEMS)
      - 0.2 MB/core/second logging (Apache)
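The step from 3→5 and 4→6 to the single stricter dependence 4→5 can be checked mechanically. The sketch below is an illustration only: `implies` and `merge` are hypothetical helpers, and `merge` is valid only when the recorded execution really did run the merged source before the merged destination, which the caller is assumed to have verified. It does not model how RTR chooses which dependences to regulate.

```cpp
#include <algorithm>
#include <cassert>

// "Instruction src of the other thread happens before instruction dst of this thread."
struct Dependence { int src; int dst; };

// True when enforcing `strict` also enforces `d`, using program order
// on both sides: d.src -> strict.src -> strict.dst -> d.dst.
bool implies(const Dependence& strict, const Dependence& d) {
    return d.src <= strict.src && strict.dst <= d.dst;
}

// Candidate stricter dependence covering both inputs: latest source,
// earliest destination. Assumes the recorded run respected this order.
Dependence merge(const Dependence& a, const Dependence& b) {
    return { std::max(a.src, b.src), std::min(a.dst, b.dst) };
}
```

Merging 3→5 with 4→6 yields 4→5, which implies both originals, so Log J shrinks from three entries to two and the total log from 64 to 48 bytes.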

Editor's Notes

  1. Talking about previous solutions, let's do a short survey. Most previous record-replay solutions are in software. For example, InstantReplay and Netzer both record the execution in software, but both suffer from high performance overhead, due to either high data bandwidth or high computation overhead. Bacon and Goldstein proposed a hardware recorder, but it has high logging bandwidth and requires a central memory bus. More recently, RecPlay took a unique approach to reducing the overhead by recording only synchronizations. Unfortunately, it does not work for programs that contain data races. So we have to record more than just synchronizations.
  2. Talk about the big picture first: we need to log and recreate dependences, then we need to reduce the log size. Define dependences in words. Mention that in-order replay is assumed. Go slow, define everything. The goal is to reproduce the same conflicts; as we will see, a naïve way to do that is to record all conflicts.
  3. It is sufficient to log all, but is it necessary?
  4. Mention Bart’s PhD Netzer