Lecture 1

Multithreaded Programming in Cilk
LECTURE 1
Charles E. Leiserson
Supercomputing Technologies Research Group
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

Cilk A C language for programming dynamic multithreaded applications on shared-memory multiprocessors.

Shared-Memory Multiprocessor In particular, over the next decade, chip multiprocessors (CMPs) will be an increasingly important platform! [Figure: several processors (P), each with its own cache, connected by a network to shared memory and I/O.]

Cilk Is Simple

Minicourse Outline

LECTURE 1

Fibonacci
Cilk is a faithful extension of C. A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.
C elision:
    int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
      }
    }
Cilk code:
    cilk int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
      }
    }
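One way to see the serial elision concretely is to erase the three Cilk keywords with the C preprocessor, so that the Cilk source compiles as ordinary C. The sketch below assumes only a standard C compiler; the empty macro definitions are an illustration added here, not something shown on the slide.

```c
/* Assumption for illustration: define the Cilk keywords away so that
   the Cilk version of fib compiles as its serial C elision. */
#define cilk
#define spawn
#define sync

cilk int fib(int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);   /* becomes an ordinary call */
        y = spawn fib(n - 2);   /* becomes an ordinary call */
        sync;                   /* becomes an empty statement */
        return x + y;
    }
}
```

After preprocessing, the function body is the same as the C elision on the slide, which is why the elision is a legal (serial) execution of the Cilk program.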

Basic Cilk Keywords
    cilk int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
      }
    }
cilk: Identifies a function as a Cilk procedure, capable of being spawned in parallel.
spawn: The named child Cilk procedure can execute in parallel with the parent caller.
sync: Control cannot pass this point until all spawned children have returned.

Dynamic Multithreading The computation dag of the Cilk fib procedure unfolds dynamically. “Processor oblivious.” Example: fib(4). [Figure: the spawn tree of fib(4). Node 4 spawns 3 and 2; node 3 spawns 2 and 1; each node 2 spawns 1 and 0.]

Multithreaded Computation [Figure: a computation dag running from an initial thread to a final thread, with its edges labeled as spawn edges, return edges, and continue edges.]

Cactus Stack Cilk supports C’s rule for pointers: A pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.) Cilk’s cactus stack supports several views in parallel. [Figure: procedure A spawns B and C, and C spawns D and E. Views of stack: A sees A; B sees A, B; C sees A, C; D sees A, C, D; E sees A, C, E.]

LECTURE 1

Algorithmic Complexity Measures
$T_P$ = execution time on $P$ processors
$T_1$ = work
$T_\infty$ = span*
Lower bounds: $T_P \ge T_1/P$ and $T_P \ge T_\infty$.
* Also called critical-path length or computational depth.

Speedup
Definition: $T_1/T_P$ = speedup on $P$ processors.
If $T_1/T_P = \Theta(P) \le P$, we have linear speedup;
if $T_1/T_P = P$, we have perfect linear speedup;
if $T_1/T_P > P$, we have superlinear speedup, which is not possible in our model, because of the lower bound $T_P \ge T_1/P$.

Parallelism Because $T_P \ge T_\infty$, the speedup $T_1/T_P$ can never exceed $T_1/T_\infty$. This ratio $T_1/T_\infty$ is the parallelism.

Example: fib(4)
Assume for simplicity that each Cilk thread in fib() takes unit time to execute.
Work: $T_1 = 17$
Span: $T_\infty = 8$
Parallelism: $T_1/T_\infty = 2.125$
Using many more than 2 processors makes little sense.
[Figure: the fib(4) dag with the threads on the critical path numbered 1 through 8.]
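The counts 17 and 8 can be reproduced with two short recurrences. This is a sketch under one assumption that is consistent with the slide's numbers but not stated in the text: a leaf call (n < 2) is a single thread, and every other call consists of three unit-time threads (one ending at each spawn, and one after the sync). The function names are hypothetical.

```c
/* Work: total number of unit-time threads executed by fib(n). */
long fib_work(int n) {
    if (n < 2) return 1;                       /* a leaf is one thread */
    return 3 + fib_work(n - 1) + fib_work(n - 2);
}

/* Span: number of threads on the longest dependency path of fib(n). */
long fib_span(int n) {
    if (n < 2) return 1;
    /* Path through the first child: first thread, child, final thread. */
    long via_first = 2 + fib_span(n - 1);
    /* Path through the second child: first and second thread, child, final thread. */
    long via_second = 3 + fib_span(n - 2);
    return via_first > via_second ? via_first : via_second;
}
```

Dividing `fib_work(4)` by `fib_span(4)` gives the parallelism of 2.125 quoted on the slide.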

Parallelizing Vector Addition
C:
    void vadd (real *A, real *B, int n){
      int i; for (i=0; i<n; i++) A[i]+=B[i];
    }

Parallelizing Vector Addition
C, with the loop converted to divide-and-conquer recursion:
    void vadd (real *A, real *B, int n){
      if (n<=BASE) {
        int i; for (i=0; i<n; i++) A[i]+=B[i];
      } else {
        vadd (A, B, n/2);
        vadd (A+n/2, B+n/2, n-n/2);
      }
    }

Parallelizing Vector Addition
Cilk, with the keywords inserted into the recursive C version:
    cilk void vadd (real *A, real *B, int n){
      if (n<=BASE) {
        int i; for (i=0; i<n; i++) A[i]+=B[i];
      } else {
        spawn vadd (A, B, n/2);
        spawn vadd (A+n/2, B+n/2, n-n/2);
        sync;
      }
    }
Side benefit: D&C is generally good for caches!

Vector Addition
    cilk void vadd (real *A, real *B, int n){
      if (n<=BASE) {
        int i; for (i=0; i<n; i++) A[i]+=B[i];
      } else {
        spawn vadd (A, B, n/2);
        spawn vadd (A+n/2, B+n/2, n-n/2);
        sync;
      }
    }

Vector Addition Analysis
To add two vectors of length $n$, assuming BASE = $\Theta(1)$:
Work: $T_1 = \Theta(n)$
Span: $T_\infty = \Theta(\lg n)$
Parallelism: $T_1/T_\infty = \Theta(n/\lg n)$
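The linear work and logarithmic span of the divide-and-conquer vadd can be checked numerically. The sketch below uses a hypothetical cost model chosen for illustration: one unit per loop iteration in the base case and one unit of overhead per recursive call. It assumes `base >= 1`.

```c
/* Work of divide-and-conquer vadd on n elements: sum over both halves. */
long vadd_work(long n, long base) {
    if (n <= base) return n;                   /* serial loop of n iterations */
    return 1 + vadd_work(n / 2, base) + vadd_work(n - n / 2, base);
}

/* Span of divide-and-conquer vadd: the two halves run in parallel,
   so only the longer of the two contributes. */
long vadd_span(long n, long base) {
    if (n <= base) return n;
    long left = vadd_span(n / 2, base);
    long right = vadd_span(n - n / 2, base);
    return 1 + (left > right ? left : right);
}
```

With `base` fixed at 1, doubling `n` roughly doubles the work but adds only one unit to the span, which matches the linear work and logarithmic span in the analysis.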

Another Parallelization
C:
    void vadd1 (real *A, real *B, int n){
      int i; for (i=0; i<n; i++) A[i]+=B[i];
    }
    void vadd (real *A, real *B, int n){
      int j; for (j=0; j<n; j+=BASE) {
        vadd1(A+j, B+j, min(BASE, n-j));
      }
    }
Cilk:
    cilk void vadd1 (real *A, real *B, int n){
      int i; for (i=0; i<n; i++) A[i]+=B[i];
    }
    cilk void vadd (real *A, real *B, int n){
      int j; for (j=0; j<n; j+=BASE) {
        spawn vadd1(A+j, B+j, min(BASE, n-j));
      }
      sync;
    }

Analysis
To add two vectors of length $n$, assuming BASE = $\Theta(1)$:
Work: $T_1 = \Theta(n)$
Span: $T_\infty = \Theta(n)$
Parallelism: $T_1/T_\infty = \Theta(1)$
PUNY!

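Optimal Choice of BASE
To add two vectors of length $n$ with BASE chosen to maximize parallelism:
Work: $T_1 = \Theta(n)$
Span: $T_\infty = \Theta(\mathrm{BASE} + n/\mathrm{BASE})$
Choosing BASE = $\sqrt{n}$ gives $T_\infty = \Theta(\sqrt{n})$.
Parallelism: $T_1/T_\infty = \Theta(\sqrt{n})$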
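The trade-off in BASE + n/BASE can be explored by brute force. The sketch below is an illustration with hypothetical names: it estimates the span of the loop-based vadd as the number of spawning iterations in the parent plus the length of one child's serial loop, then searches for the BASE that minimizes that estimate.

```c
/* Span estimate for the loop-based vadd: ceil(n/base) spawning iterations
   in the parent, followed by at most base iterations in the last child. */
long loop_span(long n, long base) {
    return (n + base - 1) / base + base;
}

/* Exhaustive search for the value of base in 1..n with the smallest
   estimated span (the first such value if there are ties). */
long best_base(long n) {
    long best = 1;
    for (long b = 2; b <= n; b++) {
        if (loop_span(n, b) < loop_span(n, best)) best = b;
    }
    return best;
}
```

For n = 10000 the search settles on a BASE of 100, the square root of n, with an estimated span of 200 instead of 10001 for BASE = 1.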

Scheduling [Figure: a shared-memory multiprocessor with several processors (P), each with its own cache, connected by a network to memory and I/O.]

Greedy Scheduling Definition: A thread is ready if all its predecessors have executed. [Figure: an example dag scheduled step by step on P = 3 processors.]

Greedy-Scheduling Theorem Theorem [Graham ’68 & Brent ’75]. Any greedy scheduler achieves $T_P \le T_1/P + T_\infty$. [Figure: the example dag with P = 3.]
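The bound can be observed on a small example by simulating a greedy schedule. The sketch below makes several assumptions for illustration: threads take unit time, the dag is stored as an adjacency matrix, and "greedy" means that each step runs as many ready threads as there are processors. The type `Dag`, both function names, and the example dag are hypothetical.

```c
#define MAX_THREADS 32

/* A dag of unit-time threads; edge[i][j] != 0 means i must run before j. */
typedef struct {
    int n;
    int edge[MAX_THREADS][MAX_THREADS];
} Dag;

/* Simulate a greedy schedule on p processors and return the number of
   steps, or -1 if no thread can run while work remains (a cycle, or p < 1). */
int greedy_steps(const Dag *d, int p) {
    int pending[MAX_THREADS];   /* unexecuted predecessors of each thread */
    int done[MAX_THREADS];
    int remaining = d->n, steps = 0;

    for (int j = 0; j < d->n; j++) {
        done[j] = 0;
        pending[j] = 0;
        for (int i = 0; i < d->n; i++)
            if (d->edge[i][j]) pending[j]++;
    }
    while (remaining > 0) {
        int chosen[MAX_THREADS], k = 0;
        /* Greedy choice: take up to p threads that are ready right now. */
        for (int j = 0; j < d->n && k < p; j++)
            if (!done[j] && pending[j] == 0) chosen[k++] = j;
        if (k == 0) return -1;
        /* Execute the chosen threads, then release their successors. */
        for (int c = 0; c < k; c++) {
            int i = chosen[c];
            done[i] = 1;
            remaining--;
            for (int j = 0; j < d->n; j++)
                if (d->edge[i][j]) pending[j]--;
        }
        steps++;
    }
    return steps;
}

/* Example dag: thread 0 precedes threads 1..4, which all precede
   thread 5, so the work is 6 and the span is 3. */
int diamond_steps(int p) {
    Dag d = { 6, {{0}} };
    for (int m = 1; m <= 4; m++) {
        d.edge[0][m] = 1;
        d.edge[m][5] = 1;
    }
    return greedy_steps(&d, p);
}
```

On this dag with P = 2 the simulation takes 4 steps, which lies between the lower bound max(6/2, 3) = 3 and the theorem's upper bound 6/2 + 3 = 6.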

Optimality of Greedy Corollary. Any greedy scheduler achieves within a factor of 2 of optimal. Proof. Let $T_P^*$ be the execution time produced by the optimal scheduler. Since $T_P^* \ge \max\{T_1/P, T_\infty\}$ (lower bounds), we have $T_P \le T_1/P + T_\infty \le 2 \cdot \max\{T_1/P, T_\infty\} \le 2 T_P^*$. ■

Linear Speedup Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever $P \ll T_1/T_\infty$. Proof. Since $P \ll T_1/T_\infty$ is equivalent to $T_\infty \ll T_1/P$, the Greedy Scheduling Theorem gives us $T_P \le T_1/P + T_\infty \approx T_1/P$. Thus, the speedup is $T_1/T_P \approx P$. ■ Definition. The quantity $(T_1/T_\infty)/P$ is called the parallel slackness.

Cilk Performance

Cilk Chess Programs

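★Socrates Normalized Speedup [Figure: log-log plot of measured speedup. Vertical axis: normalized speedup $(T_1/T_P)/(T_1/T_\infty)$; horizontal axis: normalized machine size $P/(T_1/T_\infty)$; both run from 0.01 to 1. Reference curves: $T_P = T_1/P$, $T_P = T_\infty$, and $T_P = T_1/P + T_\infty$.]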

Developing ★Socrates

★Socrates Speedup Paradox
$T_P \approx T_1/P + T_\infty$
Original program: $T_1 = 2048$ seconds, $T_\infty = 1$ second. Measured $T_{32} = 65$ seconds.
    $T_{32} = 2048/32 + 1 = 65$ seconds
    $T_{512} = 2048/512 + 1 = 5$ seconds
Proposed program: $T'_1 = 1024$ seconds, $T'_\infty = 8$ seconds. Measured $T'_{32} = 40$ seconds.
    $T'_{32} = 1024/32 + 8 = 40$ seconds
    $T'_{512} = 1024/512 + 8 = 10$ seconds
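The arithmetic behind the paradox fits in a one-line model. The sketch below uses a hypothetical helper, `predicted_time`, that evaluates work/P + span in seconds; it is an estimate from the greedy bound, not a measurement.

```c
/* Estimated running time on p processors from work and span (seconds),
   using the model T_P = T_1/P + T_inf. */
double predicted_time(double work, double span, int p) {
    return work / p + span;
}
```

Called with the slide's numbers, `predicted_time(2048, 1, 32)` is 65 and `predicted_time(1024, 8, 32)` is 40, so the proposed program looks faster on 32 processors. On 512 processors the ranking flips: the same calls give 5 seconds for the original program and 10 seconds for the proposed one.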

Lesson Work and span can predict performance on large machines better than running times on small machines can.

Cilk’s Work-Stealing Scheduler Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack. When a processor runs out of work, it steals a thread from the top of a random victim’s deque. [Figure: animation over four processors (P) showing spawns, a return, and a steal.]
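The two ends of the deque can be sketched as a small data structure. This is a single-threaded illustration only: the names `WorkDeque`, `push_bottom`, `pop_bottom`, and `steal_top` are hypothetical, threads are represented by integer ids, and no synchronization between owner and thief is attempted, which a real scheduler would need.

```c
#define DEQUE_CAP 64

/* One processor's deque of ready threads, identified here by int ids. */
typedef struct {
    int items[DEQUE_CAP];
    int top;      /* index of the oldest entry (the end thieves use) */
    int bottom;   /* one past the newest entry (the end the owner uses) */
} WorkDeque;

void deque_init(WorkDeque *q) { q->top = 0; q->bottom = 0; }

/* Owner: push a ready thread on the bottom. Returns 0 if the deque is full. */
int push_bottom(WorkDeque *q, int id) {
    if (q->bottom == DEQUE_CAP) return 0;
    q->items[q->bottom++] = id;
    return 1;
}

/* Owner: pop the newest thread from the bottom, or -1 if empty. */
int pop_bottom(WorkDeque *q) {
    if (q->bottom == q->top) return -1;
    return q->items[--q->bottom];
}

/* Thief: take the oldest thread from the top, or -1 if empty. */
int steal_top(WorkDeque *q) {
    if (q->bottom == q->top) return -1;
    return q->items[q->top++];
}

/* Example: the owner pushes threads 1, 2, 3; a thief steals once and the
   owner pops once. Returns 1 if the thief got the oldest thread (1) and
   the owner got the newest (3). */
int demo_steal_oldest_pop_newest(void) {
    WorkDeque q;
    deque_init(&q);
    push_bottom(&q, 1);
    push_bottom(&q, 2);
    push_bottom(&q, 3);
    int stolen = steal_top(&q);
    int popped = pop_bottom(&q);
    return stolen == 1 && popped == 3;
}
```

The owner works in stack order at the bottom, while a thief removes the oldest entry at the top, so the two rarely contend for the same thread.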

Performance of Work-Stealing Theorem: Cilk’s work-stealing scheduler achieves an expected running time of $T_P \le T_1/P + O(T_\infty)$ on $P$ processors. Pseudoproof. A processor is either working or stealing. The total time all processors spend working is $T_1$. Each steal has a $1/P$ chance of reducing the span by 1. Thus, the expected cost of all steals is $O(P T_\infty)$. Since there are $P$ processors, the expected time is $(T_1 + O(P T_\infty))/P = T_1/P + O(T_\infty)$. ■

Space Bounds Theorem. Let $S_1$ be the stack space required by a serial execution of a Cilk program. Then, the space required by a $P$-processor execution is at most $P S_1$, that is, $S_P \le P S_1$. Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendants has a processor working on it. ■ [Figure: a cactus stack with P = 3 processors working at its leaves, each branch using at most $S_1$ space.]

Linguistic Implications
By the space bound, code like the following runs without exhausting memory:
    for (i=1; i<1000000000; i++) {
      spawn foo(i);
    }
    sync;
MORAL: Better to steal parents than children!

Key Ideas
