Programming Models for HPC
Marc Snir
July 2013
Goals of Lecture
•  Establish a taxonomy of HPC programming models and systems
•  Introduce the main programming models and systems in current use
•  Understand that there is no magic bullet, only trade-offs
INTRODUCTION
Some Taxonomy
•  Programming model: set of operations and constructs used to express parallelism
   –  Message-passing, shared memory, bulk-synchronous…
•  Programming system: library or language that embodies one or more programming models
   –  MPI, OpenMP, CUDA…
   –  C++ and Java are systems that implement the same programming model
   –  MPI supports multiple programming models: message-passing, one-sided communication, bulk synchronous, asynchronous…
•  Model: defines the semantics of execution and the performance of execution
   –  Usually, the semantics are carefully defined, but not so the performance model
Implementation Stack

Language & Libraries
   | (compiler, autotuner)
Executable
   |
Run-time
   |
Hardware

The high-level execution model is defined by the run-time; the low-level execution model is defined by the hardware. Possibly, more layers.
Generic Hardware -- Node

[Diagram: a node with several physical threads; each thread has a vector unit, threads share caches (one cache per one or two threads), and all caches connect to a shared memory.]
Communication & Synchronization
•  Load/store (from memory to cache, or cache to cache)
•  Hardware and software cache prefetch
•  Read-modify-write operations
   –  Mutual exclusion (locks, atomic updates) – non-ordering, symmetric synchronization
   –  Monitor/wait – ordering, asymmetric synchronization
•  CPUs provide extensive support for the former; HPC software has a strong need for the latter (see the C11 sketch below).
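To make the distinction concrete, here is a minimal C11 sketch (not from the slides; names are illustrative): an atomic fetch-and-add as the symmetric, non-ordering case, and an acquire/release flag as the asymmetric, ordering (monitor/wait) case.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Symmetric, non-ordering: any thread may update the shared counter;
   no thread waits for a specific other thread. */
atomic_long counter;

void add_sample(long v) {
    atomic_fetch_add(&counter, v);   /* read-modify-write */
}

/* Asymmetric, ordering: the consumer waits until the producer
   signals that the data is ready (a monitor/wait pattern). */
long data;
atomic_bool ready;

void producer(long v) {
    data = v;
    atomic_store_explicit(&ready, true, memory_order_release);
}

long consumer(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* spin; real code would back off or block */
    return data;
}
```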
  
Node Architecture -- Evolution
•  Increasing number of physical threads
•  Increasing number of cache levels
   –  Lower-level caches may be replaced with scratchpads (local memory)
•  Shared memory may not be coherent, and will be NUMA
   –  May be partitioned
•  Physical threads may not all be identical
   –  E.g., CPU-like vs. GPU-like physical threads
   –  May have the same ISA and different performance, or different ISAs
Generic Hardware -- System

[Diagram: nodes connected by an interconnection network.]

Communication & synchronization primitives (see the MPI sketch below):
•  Point-to-point: send/receive…
•  One-sided (rDMA): put, get, accumulate…
•  Collective: barrier, reduce, broadcast…
•  Flat network vs. partially or completely exposed topology
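As a concrete illustration (not from the slides), a minimal MPI sketch exercising the point-to-point and collective classes; one-sided put/get is sketched later, under the PGAS model.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = rank, recv = 0.0, sum = 0.0;

    /* Point-to-point: each rank sends to the next one (a ring). */
    int next = (rank + 1) % size, prev = (rank + size - 1) % size;
    MPI_Sendrecv(&local, 1, MPI_DOUBLE, next, 0,
                 &recv,  1, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Collective: reduction across all nodes, result everywhere. */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Collective: barrier synchronization. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```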
PROGRAMMING MODELS
Low-Level Programming Model
•  Fixed number of threads (1/2/4 per core)
   –  Threads communicate via shared memory
   –  Fixed amount of memory per node
•  Fixed number of nodes
   –  Nodes communicate via message-passing (point-to-point, one-sided, collective)
•  Performance model:
   –  All threads compute at the same speed (equal work = equal time)
   –  Memory is flat
   –  Communication time predicted by a simple model (e.g., a + b·m, where m is the message size)
•  Model can be implemented via MPI + pthreads, or MPI + restricted use of OpenMP (see the sketch below)
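A minimal sketch of that last combination, assuming MPI plus a restricted use of OpenMP (communication funneled through one thread, one flat parallel loop); all names and sizes are illustrative.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* One MPI rank per node, a fixed team of OpenMP threads per rank;
   threads share node memory, ranks communicate via messages. */
int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 1 << 20 };
    static double u[N];
    double local_sum = 0.0, global_sum;

    /* Threads split the node's work; equal work assumed = equal time. */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < N; i++) {
        u[i] = rank + i * 1e-9;
        local_sum += u[i];
    }

    /* Only the master thread communicates (FUNNELED mode). */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}
```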
  
Low-Level Programming Model Pros & Cons
✔ Very hardware-specific – can achieve the best performance on current systems
✗  Very low-level: hard to get correct programs
✗  Very hardware-specific: the number of threads, amount of memory, and number of nodes are code parameters
✗  Does not accommodate heterogeneous cores
✗  The performance model may not work in the future, because of power management and error handling: equal work ≠ equal compute time
Higher-level Programming Models
1.  Virtualize resources and have the run-time handle the mapping of virtual resources to physical resources
   –  "Virtual model execution" has well-defined semantics and a well-defined performance model
2.  Encourage/mandate the use of programming patterns that reduce the likelihood of errors and facilitate virtualization
   –  Ensures that the high-level performance model gives correct predictions
✔ Simplifies programming
✔ Provides more portability: the program need not exactly match the physical resources – e.g., number of threads or nodes
✔ Can hide imperfections in the low-level model (e.g., the assumption of flat memory or fixed compute speed)
✗  Can be inefficient
•  Need to understand when the high-level performance model is valid
Example: Dijkstra, "Go To Statement Considered Harmful" -- Restrict to Facilitate Programming
•  If a program uses go-tos arbitrarily, then it is hard to understand the relation between the static program and the dynamic execution state
•  If a program uses only well-nested loops and function calls, the relation is simple: the execution state is defined by
   –  Program counter (what statement is currently executed)
   –  Call stack (nesting of function calls)
   –  Values of loop indices (which iteration is currently executed)
•  Much easier to understand semantics and performance
•  Not restrictive: general programs can be transformed into well-structured programs without significant performance loss.
SHARED-MEMORY PROGRAMMING
Locality in Shared Memory: Virtualization and Performance Model
•  Caching = memory virtualization: the name (address) of a variable does not determine its location; the caching hardware does so
•  Performance model: memory capacity is DRAM size; memory latency is L1 speed
   –  A reasonable approximation if the cache hit rate is high
   –  A good cache hit rate is essential to performance, because of high memory latency
   –  >99% of the power consumed in future chips is spent moving data; reduced bandwidth consumption is essential
•  For a high cache hit rate, need good locality:
   –  Temporal locality, spatial locality, thread locality
•  The performance model is valid only for programs with good locality
Dynamic Task Scheduling: Virtualization and Performance Model
•  It is convenient to express parallel computations as a set of tasks and dependencies between tasks (a task graph)
•  E.g., reduction:

[Diagram: a binary tree of + tasks, combining partial sums pairwise up to the root.]

•  The example has a static task graph; in general it can be dynamic (new tasks spawned during execution)
Mapping Tasks to Physical Threads
•  Work: total number of operations (sequential compute time)
•  Depth: critical path length (compute time with an unbounded number of threads)
•  Desired performance model: Time = max(Work/P, Depth)
   –  Can be achieved, with suitable restrictions on the task graph
•  Restrictions on task granularity (size) – tasks need to be large enough to amortize the cost of scheduling a task
   –  Cost: overhead of scheduling + overhead of moving data to the place where the task is scheduled (1000s of instructions in shared memory)
•  Restrictions on the task graph:
   –  Static graph
   –  "Well-structured" dynamic task graph
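A quick worked example (not on the slide): the reduction above over N inputs has Work = N − 1 additions and Depth = ⌈log2 N⌉, so the model predicts Time = max((N−1)/P, ⌈log2 N⌉); for N = 10^6 and P = 100 that is about 10^4 task steps, dominated by the Work/P term.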
  
Well-Structured Program
•  A task can spawn child tasks and wait for their completion
   –  Spawn
   –  Parallel loop
•  No other dependencies exist, except parent to child
   –  Series-parallel (and/or) execution graph
•  The parallel equivalent of a go-to-less program
•  Can be implemented in suitably restricted OpenMP (no synchronization between concurrent tasks, etc.) – see the sketch below
   –  Scheduling can be done using work-stealing
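A minimal sketch of such a well-structured computation in (suitably restricted) OpenMP tasks: each task only spawns children and waits for them, so the execution graph is series-parallel. The grain size and names are illustrative.

```c
#include <omp.h>

/* Series-parallel sum of a[lo..hi): the only dependencies are
   parent <-> child, as the well-structured model requires. */
long sum(const long *a, int lo, int hi) {
    if (hi - lo < 1024) {            /* grain size amortizes overhead */
        long s = 0;
        for (int i = lo; i < hi; i++) s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left, right;
    #pragma omp task shared(left)
    left = sum(a, lo, mid);          /* spawn child */
    right = sum(a, mid, hi);
    #pragma omp taskwait             /* wait for children */
    return left + right;
}

long parallel_sum(const long *a, int n) {
    long s;
    #pragma omp parallel
    #pragma omp single
    s = sum(a, 0, n);                /* root task; team runs the children */
    return s;
}
```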
  
Work Stealing
•  A thread appends a newly created task to its local queue
•  An idle thread picks a task from its local queue
•  If the local queue is empty, it steals a task from another queue (see the sketch below)
✔ Theoretically optimal, and works well in practice
✔ Handles variance in thread speed well
✗  Needs many more tasks than physical threads (over-decomposition)
Can a Shared-Memory Programming Model Work on Distributed Memory?
•  Naïve mapping: distributed virtual shared memory (DVSM): software emulation of HW coherence
   –  Cache line = page
   –  Cache miss = page miss
•  Abysmal performance – lines too large and coherence overhead too large
•  Next step: give up caching & coherence; give up dynamic task scheduling – Partitioned Global Address Space (PGAS)
   –  Languages: UPC, Co-Array Fortran
   –  Libraries: Global Arrays
PGAS Model
•  Fixed number of (single-threaded) locales
•  Private memory at each locale
•  Shared memory and shared arrays are partitioned across locales
•  Library implementation: remote shared variables are accessed via put/get operations (sketched below)
•  Language implementation: references to shared-memory variables are distinct from references to local variables

[Diagram: per-locale memories; local pointers stay within a locale, global pointers may cross locales.]
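To make the library style concrete, here is a sketch using MPI one-sided operations as a stand-in for Global-Arrays-style put/get; the block partition and all names are illustrative.

```c
#include <mpi.h>

/* A shared array of N elements, block-partitioned across locales:
   locale p owns indices [p*B, (p+1)*B). */
enum { N = 1 << 16 };

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int me, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    int B = (N + P - 1) / P;

    double *block;
    MPI_Win win;
    MPI_Win_allocate(B * sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &block, &win);
    for (int j = 0; j < B; j++) block[j] = me;

    /* "Global address" = <locale, local offset>; a remote element
       must be fetched explicitly with a get. */
    int i = 12345 % N;
    int locale = i / B, offset = i % B;
    double x;
    MPI_Win_fence(0, win);
    MPI_Get(&x, 1, MPI_DOUBLE, locale, offset, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```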
PGAS Performance Model
•  Not the same as shared memory: remote accesses are much more expensive than local accesses
•  Temporal/spatial/thread locality do not cure the problem: no hardware caching & no efficient SW caching or aggregation of multiple remote accesses
•  The performance model is the same as for distributed memory
☛  The programming style for performance must be distributed-memory-like: to obtain good performance, one needs to explicitly get/put remote data in large chunks and copy it to local memory
•  Open problem: can we define a "shared-memory-like" programming model for distributed memory that provides a shared-memory performance model (with reasonable restrictions on programs)?
How About the Vector Units?
•  Low-level: vector instructions
•  High-level: the compiler handles vectorization
•  High-level performance model: floating-point performance is a (large) fraction of vector unit performance
   –  Requires a suitable programming style to write vectorizable loops (see the sketch below)
   –  Direct addressing, aligned dense operands…
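A small C sketch of the contrast (illustrative names): the first loop fits the model, the second defeats it.

```c
/* Vectorizable in the style the model assumes: direct addressing,
   dense unit-stride operands, no loop-carried dependence.
   `restrict` tells the compiler the arrays do not alias. */
void axpy(int n, double a, const double *restrict x,
          double *restrict y) {
    #pragma omp simd            /* or rely on auto-vectorization */
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Indirect (gather) addressing breaks the high-level model. */
void gather_add(int n, const int *idx, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] += x[idx[i]];      /* indexed load: hard to vectorize well */
}
```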
  
DISTRIBUTED-MEMORY PROGRAMMING
Data Partitioning
•  Shared memory: the program does not explicitly control data location; the focus is on control partitioning, and data follows control.
   –  Performs well if the code satisfies the restrictions listed for efficient dynamic task scheduling and locality
•  Distributed memory: data location is (mostly) static and controlled by the program; the focus is on data partitioning, and control follows data
   –  Performs well if a (mostly) static control partition works and communication across data partitions is limited
Data Partitioning
•  Locale = physical location (usually, a node)
   –  Global address = <locale id, local address>
•  Global view of data: aggregates (e.g., arrays) are partitioned across locales
   –  The partition is static (or slowly changing)
   –  Predefined partition types and (possibly) user-defined partitions
   –  Data is accessed with a "global name" (e.g., array indices); the compiler translates it to a global address.
•  Local view of data: data is accessed using a global address (if remote) or a local address (if local)
   –  Syntactic sugar: co-arrays, with a locale index
      A(7,3)[8] – the entry with indices (7,3) on locale 8.
Address Computation

[Diagram: an array of N+1 elements laid out in blocks across locales 0, 1, …, P-1.]

•  Array index: i
•  Block_size = (N+1)/P
•  Locale_id = i / Block_size
•  Local_address = i % Block_size + base
•  The computation is expensive (and even more so for fancier partitions) – see the sketch below
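The formulas above, written out as a C sketch (the struct and names are illustrative; base is the block's start address in the locale's local memory):

```c
/* Global index -> <locale, local address> for a block partition. */
typedef struct { int locale; long local_addr; } global_addr;

global_addr translate(long i, long N, int P, long base) {
    long block_size = (N + 1) / P;
    global_addr g;
    g.locale     = (int)(i / block_size);
    g.local_addr = i % block_size + base;  /* a division and a modulo
                                              on every access: costly */
    return g;
}
```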
Control Partitioning
•  Local view of control: each locale runs its own (possibly multithreaded) code; executions in distinct locales synchronize explicitly (e.g., with barriers). MPI, UPC, etc.
   –  Variant: support remote task invocation (code on one locale can spawn computation on another locale). Charm++, Chapel
•  Global view of control: one global parallel loop; the compiler & run-time map each iterate to a locale chosen to minimize communication (e.g., owner-computes rule)
   –  The user can associate control with data using "on" statements
Pros & Cons
•  Global view of data & control
✔ The program need not change if the number of locales or the partition is changed
✗  Hard to identify and reason about communication in the program (it is implicitly determined by the partition of data)
•  Local view of data & control
   –  Vice versa
Can a Distributed-Memory Programming Model Work on Shared Memory?
•  Sure: partition memory into distinct locales, and associate each locale with a fixed set of physical threads (one or more)
✗  Loses some flexibility in dynamic scheduling
✗  Uses shared memory inefficiently, to emulate send-receive or put-get
✔ Achieves good locality
1.  Reducing communication is essential in order to reduce power
2.  We may not have coherence
☛  Will need a shared-memory programming model with user control of communication
Non-Coherent, Shared-Memory-Like Programming Model
•  Cache =
1.  Local memory
2.  Address translation (virtualization)
3.  Implicit transfer
4.  Coherence
(1) is essential
(2) could be done much more efficiently with a good HW/SW mix (e.g., pointer swizzling)
(3) can be done if many contexts are maintained – can be augmented/replaced with prefetch
(4) is too expensive, and probably superfluous in scientific computing
•  Could this be a common shared-memory / distributed-memory model?
Is HPC Converging to One Architecture?
•  Punctuated equilibrium: relatively long periods with one dominant architecture, interspersed with periods of fast change
   –  Bipolar vector -> killer micros -> multicore -> accelerators
   –  We are in a period of fast change (hybrid memory, PIM, accelerators…)
   –  It is not clear there will be a convergence by the exascale era
   –  Need for faster architecture evolution as Moore's Law slows down
   –  Different markets pulling in different directions
   –  Possible divergence between HPC and commodity
•  Can we "hide" differences across different architectures?
Portable Performance
•  Can we redefine compilation so that:
   –  It supports well a human in the loop (manual high-level decisions vs. automated low-level transformations)
   –  It integrates auto-tuning and profile-guided compilation
   –  It preserves high-level code semantics
   –  It preserves high-level code "performance semantics"

[Diagram: in principle, high-level code is "compiled" into low-level, platform-specific codes; in practice, Code A, Code B, and Code C are maintained by manual conversion and "ifdef" spaghetti.]
Grand Dream: Not One Programming Model
•  More levels and more paths possible
•  Automated translation available at each level (performance-feedback driven)
•  User coding supported at each level
•  Semantics and performance preserved by translations

[Diagram: domain-specific programming models (Prog Model 1, Prog Model 2) translate down to architecture-specific ones (Prog Model 3, Prog Model 4), which map onto HW 1 and HW 2.]

BASTA

QUESTIONS?