Network-aware Data Management Middleware for High Throughput Flows
March 16, 2015
Mehmet Balman
http://balman.info

Performance Engineer at VMware Inc.
Guest Scientist at Berkeley Lab
1	
  
About me:
•  2013: Performance, Central Engineering, VMware, Palo Alto, CA
•  2009: Computational Research Division (CRD) at Lawrence Berkeley National Laboratory (LBNL)
•  2005: Center for Computation & Technology (CCT), Baton Rouge, LA
   •  Computer Science, Louisiana State University (2010, 2008)
   •  Bogazici University, Istanbul, Turkey (2006, 2000)

Data Transfer Scheduling with Advance Reservation and Provisioning, Ph.D.
Failure-Awareness and Dynamic Adaptation in Data Scheduling, M.S.
Parallel Tetrahedral Mesh Refinement, M.S.
2	
  
Why Network-aware?
Networking is one of the major components in many of today's solutions
•  Distributed data and compute resources
•  Collaboration: data to be shared between remote sites
•  Data centers are complex network infrastructures

Open questions:
•  What further steps are necessary to take full advantage of future networking infrastructure?
•  How are we going to deal with performance problems?
•  How can we enhance data management services and make them network-aware?

New collaborations between data management and networking communities.
3	
  
Two major players:
•  Abstraction and Programmability
   •  Rapid development, intelligent services
   •  Orchestrating compute, storage, and network resources together
   •  Integration and deployment of complex workflows
   •  Virtualization (+ containers)
   •  Distributed storage (storage wars)
   •  Open source (if you can't fix it, you don't own it)
•  Performance Gap:
   •  The limitation is current system software versus the foreseen speed:
      •  Hardware is fast, software is slow
   •  The latency/throughput mismatch will lead to new innovations
4	
  
Outline
•  Data Streaming in High-bandwidth Networks
   •  Climate100: Advance Network Initiative and 100Gbps Demo
   •  MemzNet: Memory-Mapped Network Zero-copy Channels
   •  Core Affinity and End System Tuning in High-Throughput Flows
•  Network Reservation and Online Scheduling
   •  FlexRes: A Flexible Network Reservation Algorithm
   •  SchedSim: Online Scheduling with Advance Provisioning
•  Performance Engineering and Virtualized Solutions
   •  Software Defined Storage
5	
  
100Gbps networking has finally arrived!
Applications' Perspective
Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.

1Gbps to 10Gbps transition (10 years ago):
Applications did not run 10 times faster just because more bandwidth was available.
6	
  
ANI 100Gbps Demo
•  100Gbps demo by ESnet and Internet2
•  Application design issues and host tuning strategies to scale to 100Gbps rates
•  Visualization of remotely located data (Cosmology)
•  Data movement of large datasets with many files (Climate analysis)
	
  
7	
  
Earth System Grid Federation (ESGF)
•  Over 2,700 sites
•  25,000 users
•  IPCC Fifth Assessment Report (AR5): 2PB
•  IPCC Fourth Assessment Report (AR4): 35TB
8

Application's Perspective: Climate Data Analysis
•  Remote Data Analysis
•  Bulk Data Movement
9	
  
 
The lots-of-small-files problem! File-centric tools?
[Diagram: FTP and RPC message exchanges. FTP: request a file / send file, repeated for each file. RPC: request data / send data.]
•  Keep the network pipe full
•  We want out-of-order and asynchronous send/receive
	
   10	
  
Many Concurrent Streams
Figure: (a) total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, and a total of 10 10Gbps NIC pairs were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, were started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., when the concurrency level is 4, there are 40 streams in total).
  
	
  	
  
11	
  
Effects of many concurrent streams
Figure: ANI testbed 100Gbps (10x10 NICs, three hosts): interrupts/CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals]; TCP buffer size is 50MB.
12	
  
Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al., NDM'13
[Figure: Sandy Bridge architecture, receive process]
13

Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al., NDM'14
14
100Gbps demo environment
RTT: Seattle - NERSC   16ms
     NERSC - ANL       50ms
     NERSC - ORNL      64ms
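These round-trip times are why the TCP buffers in the experiments are tens of megabytes: to keep a path full, a sender needs roughly one bandwidth-delay product (BDP) of data in flight. The sketch below is only a back-of-the-envelope calculation; the function name is hypothetical, and the 10 Gbps per-stream rate is an assumption for illustration (one 10Gbps NIC), not a figure measured in the demo.

```python
def bdp_bytes(bandwidth_gbps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes: data in flight needed to keep the pipe full."""
    bits_in_flight = bandwidth_gbps * 10**9 * rtt_ms // 1000
    return bits_in_flight // 8


# RTT values from the slide; 10 Gbps per NIC is an assumption for illustration.
RTT_MS = {"Seattle-NERSC": 16, "NERSC-ANL": 50, "NERSC-ORNL": 64}

for path, rtt in RTT_MS.items():
    megabytes = bdp_bytes(10, rtt) / 1e6
    print(f"{path}: {megabytes:.1f} MB in flight at 10 Gbps")
```

Under these assumptions the 50ms path needs about 62.5 MB in flight, which is the same order of magnitude as the 50MB TCP buffer size quoted elsewhere in the deck.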
  
15	
  
Framework for the Memory-mapped Network Channel
+ Synchronization mechanism for RoCE
- Keep the pipe full for remote analysis
   16	
  
Moving climate files efficiently
  
17	
  
Advantages
•  Decoupling I/O and network operations
   •  front-end (I/O processing)
   •  back-end (networking layer)
•  Not limited by the characteristics of the file sizes
   •  On-the-fly tar approach: bundling and sending many files together
•  Dynamic data channel management
   Can increase/decrease the parallelism level both in the network communication and in I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).
MemzNet is not file-centric. Bookkeeping information is embedded inside each block.
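To illustrate what "bookkeeping information embedded inside each block" can mean, here is a minimal sketch of a self-describing block. The header layout (file id, offset, payload length), the function names, and the zero padding are all assumptions made for this example; the slides do not specify MemzNet's actual block format.

```python
import struct

# Hypothetical header: file id (uint32), offset in file (uint64), payload length (uint32).
HEADER = struct.Struct("!IQI")


def pack_block(file_id: int, offset: int, payload: bytes, block_size: int) -> bytes:
    """Build a fixed-size block that carries its own bookkeeping header."""
    if HEADER.size + len(payload) > block_size:
        raise ValueError("payload does not fit in one block")
    header = HEADER.pack(file_id, offset, len(payload))
    return (header + payload).ljust(block_size, b"\0")


def unpack_block(block: bytes):
    """Return (file_id, offset, payload) from a block."""
    file_id, offset, length = HEADER.unpack_from(block)
    start = HEADER.size
    return file_id, offset, block[start:start + length]


def reassemble(blocks):
    """Rebuild files from blocks that may arrive in any order."""
    files = {}
    for block in blocks:
        file_id, offset, data = unpack_block(block)
        buf = files.setdefault(file_id, bytearray())
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(b"\0" * (end - len(buf)))
        buf[offset:end] = data
    return {file_id: bytes(buf) for file_id, buf in files.items()}
```

Because every block says which file and offset it belongs to, the receiver can accept blocks out of order and from any stream, and many small files can share the same stream of fixed-size blocks.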
  	
  
	
  
18	
  
MemzNet's Architecture for data streaming
  
19	
  
100Gbps Demo
•  CMIP3 data (35TB) from the GPFS filesystem at NERSC
•  Block size 4MB
•  Each block's data section was aligned according to the system page size.
•  1GB cache both at the client and the server
•  At NERSC, 8 front-end threads on each host for reading data files in parallel.
•  At ANL/ORNL, 4 front-end threads for processing received data blocks.
•  4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.
  	
  
20	
  
83Gbps throughput
  
21	
  
MemzNet's Performance
TCP buffer size is set to 50MB
[Figure: MemzNet vs. GridFTP; panels for the 100Gbps demo and the ANI Testbed]
22	
  
Challenge?
•  High bandwidth brings new challenges!
   •  We need a substantial amount of processing power and the involvement of multiple cores to fill a 40Gbps or 100Gbps network
   •  Fine-tuning, in both the network and application layers, is needed to take advantage of the higher network capacity.
•  Incremental improvement in current tools?
   •  We cannot expect every application to tune and improve every time we change the link technology or speed.
  	
  
	
  
23	
  
MemzNet
•  MemzNet: Memory-mapped Network Channel
   •  High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer.
•  The main goal is to define a network channel that applications can use directly, without the burden of managing/tuning the network communication.
  
	
  
24	
  
Tech report: LBNL-6177E
MemzNet = New Execution Model
•  Luigi Rizzo's netmap
   •  proposes a new API to send/receive data over the network
•  RDMA programming model
   •  MemzNet as a memory-management component
•  IX: Data Plane OS (Adam Belay et al. @Stanford, similar to MemzNet's model)
•  mTCP (event-based / replaces send/receive at user level)
•  Tanenbaum et al., minimizing context switches: proposing to use MONITOR/MWAIT for synchronization
  
25	
  
Problem Domain: ESnet's OSCARS
  
26	
  
[Figure: ESnet network map. Hubs: Seattle, Boise, Sacramento, Sunnyvale, Denver, Albuquerque, El Paso, Houston, Kansas City, Chicago, Nashville, Atlanta, Washington DC, New York, Boston. Connected labs: PNNL, SLAC, AMES, PPPL, BNL, ORNL, JLAB, FNAL, ANL, LBNL. Peerings with US R&E networks (DREN, Internet2, NLR, NASA, NISN, USDOI) and international networks in Asia-Pacific, Australia (AARnet), Canada (CANARIE), Latin America (AMPATH, CLARA, CUDI), Russia and China (GLORIAD), Europe (GÉANT, NORDUNET), France (OpenTransit), and CERN (USLHCNet, LHCONE).]
•  Connecting experimental facilities and supercomputing centers
•  On-Demand Secure Circuits and Advance Reservation System (OSCARS)
•  Guaranteed bandwidth between collaborating institutions by delivering network-as-a-service
•  Co-allocation of storage and network resources (SRM: Storage Resource Manager)
OSCARS provides yes/no answers to a reservation request for (bandwidth, start_time, end_time)
End-to-end Reservation: Storage+Network
  	
  
Reservation Request
•  Between edge routers
Need to ensure availability of the requested bandwidth from source to destination for the requested time interval
•  $R = \{n_{source}, n_{destination}, M_{bandwidth}, t_{start}, t_{end}\}$
   •  source/destination end-points
   •  requested bandwidth
   •  start/end times
Committed reservations between $t_{start}$ and $t_{end}$ are examined
The shortest path from source to destination is calculated based on the engineering metric on each link, and a bandwidth-guaranteed path is set up to commit and eventually complete the reservation request for the given time period
  
27	
  
Reservation	
  
28	
  
•  Components (Graph):
   •  node (router), port, link (connecting two ports)
   •  engineering metric (~latency)
   •  maximum bandwidth (capacity)
•  Reservation:
   •  source, destination, path, time
   •  (time t1, t3) A -> B -> D (900Mbps)
   •  (time t2, t3) A -> C -> D (400Mbps)
   •  (time t4, t5) A -> B -> D (800Mbps)
[Figure: graph with nodes A, B, C, D and link capacities 800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps; timeline with Reservation 1 (t1 to t3), Reservation 2 (t2 to t3), Reservation 3 (t4 to t5)]
Example
(time t1, t2):
•  A to D (600Mbps): NO
•  A to D (500Mbps): YES
Link state on graph A, B, C, D, shown as available / reserved (capacity):
•  0 Mbps / 900Mbps (900Mbps)
•  100 Mbps / 900Mbps (1000Mbps)
•  800 Mbps / 0Mbps (800Mbps)
•  500 Mbps / 0Mbps (500Mbps)
•  300 Mbps / 0Mbps (300Mbps)
Active reservations:
•  reservation 1: (time t1, t3) A -> B -> D (900Mbps)
•  reservation 2: (time t2, t3) A -> C -> D (400Mbps)
•  reservation 3: (time t4, t5) A -> B -> D (800Mbps)
  
	
  
29	
  
Example
(time t1, t3):
•  A to D (500Mbps): NO
•  A to C (500Mbps): NO (not max-flow!)
Link state on graph A, B, C, D, shown as available / reserved (capacity):
•  0 Mbps / 900Mbps (900Mbps)
•  100 Mbps / 900Mbps (1000Mbps)
•  400 Mbps / 400Mbps (800Mbps)
•  100 Mbps / 400Mbps (500Mbps)
•  300 Mbps / 0Mbps (300Mbps)
Active reservations:
•  reservation 1: (time t1, t3) A -> B -> D (900Mbps)
•  reservation 2: (time t2, t3) A -> C -> D (400Mbps)
•  reservation 3: (time t4, t5) A -> B -> D (800Mbps)
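The yes/no answers in these two examples can be reproduced with a small admission check: compute the free bandwidth on every link over the requested interval, discard links that cannot carry the request, and test whether the destination is still reachable over a single path. This is a sketch under stated assumptions, not OSCARS code. The capacities are from the slides, but the assignment of capacities to specific links is inferred from the examples, integer times stand in for t1, t2, ..., reservation 2 is taken as (t2, t3) as in the list that introduces the three reservations, and all function names are hypothetical.

```python
from collections import deque

# Link capacities in Mbps; which link has which capacity is an assumption.
CAPACITY = {("A", "B"): 1000, ("B", "D"): 900, ("A", "C"): 800,
            ("C", "D"): 500, ("B", "C"): 300}

# Committed reservations: (start, end, path, Mbps).
RESERVATIONS = [(1, 3, "ABD", 900), (2, 3, "ACD", 400), (4, 5, "ABD", 800)]


def path_links(path):
    """Links used by a path, in the orientation stored in CAPACITY."""
    pairs = zip(path, path[1:])
    return [(u, v) if (u, v) in CAPACITY else (v, u) for u, v in pairs]


def available(start, end):
    """Free bandwidth per link at the busiest moment of [start, end)."""
    points = {start} | {s for s, _, _, _ in RESERVATIONS if start < s < end}
    free = {}
    for link, capacity in CAPACITY.items():
        peak = max(
            sum(bw for s, e, path, bw in RESERVATIONS
                if s <= p < e and link in path_links(path))
            for p in points
        )
        free[link] = capacity - peak
    return free


def can_reserve(src, dst, mbps, start, end):
    """Yes/no: is there one path whose every link has `mbps` free?"""
    usable = [l for l, f in available(start, end).items() if f >= mbps]
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for u, v in usable:
            if node in (u, v):
                nxt = v if node == u else u
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False
```

With this layout, `can_reserve("A", "C", 500, 1, 3)` is rejected even though 400Mbps is free on A-C and another 100Mbps via B, because a reservation uses a single path; that is the "not max-flow" remark on the slide.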
  
	
  
30	
  
Alternative Approach: Flexible Reservations
•  If the requested bandwidth cannot be guaranteed:
   •  Trial and error until an available reservation is found
   •  The client is not given other possible options
•  How can we enhance the OSCARS reservation system?
•  Be flexible:
   •  Submit constraints, and the system suggests possible reservation options satisfying the given requirements
  
31	
  
	
$R_s' = \{n_{source}, n_{destination}, M_{MAXbandwidth}, D_{dataSize}, t_{EarliestStart}, t_{LatestEnd}\}$

The reservation engine finds the reservation
$R = \{n_{source}, n_{destination}, M_{bandwidth}, t_{start}, t_{end}\}$
for the earliest completion or for the shortest duration,
where $M_{bandwidth} \le M_{MAXbandwidth}$ and $t_{EarliestStart} \le t_{start} < t_{end} \le t_{LatestEnd}$.
  
Bandwidth Allocation (time-dependent)
Modified Dijkstra's algorithm (max available bandwidth):
•  Bottleneck constraint (not additive)
•  (A QoS constraint is additive in shortest path, etc.)
  
32	
The maximum bandwidth available for allocation from a source node to a destination node
[Figure: bandwidth over a timeline t1, t2, t3, t4, t5, t6]
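The "modified Dijkstra" with a bottleneck constraint is commonly known as a widest-path (maximum-bottleneck) search: instead of adding link costs, the value of a path is the minimum bandwidth along it, and the search always expands the node with the largest bottleneck found so far. The sketch below is a generic version of that idea, not the author's implementation; the function names and the example bandwidths are illustrative assumptions.

```python
import heapq


def undirected(edges):
    """Build {node: {neighbor: available Mbps}} from (u, v, Mbps) triples."""
    graph = {}
    for u, v, bw in edges:
        graph.setdefault(u, {})[v] = bw
        graph.setdefault(v, {})[u] = bw
    return graph


def max_bandwidth_path(graph, src, dst):
    """Return (bottleneck Mbps, path) for the widest path, or (0, []) if none."""
    best = {src: float("inf")}   # best bottleneck found so far per node
    prev = {}
    heap = [(-best[src], src)]   # max-heap via negated bandwidth
    while heap:
        neg_bw, node = heapq.heappop(heap)
        bw = -neg_bw
        if node == dst:
            break
        if bw < best.get(node, 0):
            continue             # stale heap entry
        for nxt, link_bw in graph.get(node, {}).items():
            cand = min(bw, link_bw)          # bottleneck, not a sum
            if cand > best.get(nxt, 0):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(heap, (-cand, nxt))
    if dst not in best:
        return 0, []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return best[dst], path[::-1]


# Hypothetical available bandwidths (Mbps) on a four-node graph.
g = undirected([("A", "B", 100), ("B", "D", 0), ("A", "C", 800),
                ("C", "D", 500), ("B", "C", 300)])
print(max_bandwidth_path(g, "A", "D"))
```

The only change from textbook Dijkstra is the relaxation step (`min` instead of `+`) and the heap order (largest first), which is why the bottleneck constraint is described as non-additive.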
  
Analogous Example
•  A vehicle travelling from city A to city B
•  There are multiple cities between A and B connected with separate highways.
•  Each highway has a specific speed limit
   –  (maximum bandwidth)
•  But we need to reduce our speed if there is high traffic load on the road
•  We know the load on each highway for every time period
   –  (active reservations)
•  The first question is which path the vehicle should follow in order to reach city B from city A as early as possible (earliest completion)
•  Or, we can delay our journey and start later if the total travel time would be reduced. The second question is to find the route along with the starting time for the shortest travel duration (shortest duration)
33	
  
Advance bandwidth reservation: we have to set the speed limit before starting and cannot change it during the journey
	
  
Time steps
•  Time steps between t1 and t13
[Figure: timeline t1 ... t13 with Reservation 1, Reservation 2, and Reservation 3. The reservation boundaries t1, t4, t6, t7, t9, t12, t13 split the timeline into time steps labelled Res 1 | Res 1,2 | Res 2 | (none) | Res 3 | (none); the first four are ts1, ts2, ts3, ts4.]
Max (2r+1) time steps, where r is the number of reservations
34	
  
Static Graphs
Within one time step the available bandwidth does not change, so each step has a static graph over A, B, C, D. Available bandwidth on the five links (capacities 900, 1000, 800, 500, 300Mbps):
•  G(ts1), t1 to t4 (Res 1): 0, 100, 800, 500, 300 Mbps
•  G(ts2), t4 to t6 (Res 1,2): 0, 100, 400, 100, 300 Mbps
•  G(ts3), t6 to t7 (Res 2): 900, 1000, 400, 100, 300 Mbps
•  G(ts4), t7 to t9 (none): 900, 1000, 800, 500, 300 Mbps
35	
  
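Time Windows
A time window is a run of consecutive time steps. Max (s × (s + 1))/2 time windows, where s is the number of time steps.
Bottleneck constraint: for every link, the window graph keeps the minimum bandwidth available over the time steps in the window.
•  tw = ts1 + ts2 (t1 to t6, Res 1,2): G(tw) = G(ts1) x G(ts2), giving 0, 100, 400, 100, 300 Mbps
•  tw = ts3 + ts4 (t6 to t9, Res 2): G(tw) = G(ts3) x G(ts4), giving 900, 1000, 400, 100, 300 Mbps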
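Combining static graphs into a window graph is an element-wise minimum over links, and enumerating windows is a double loop over step indices. The sketch below uses the per-step bandwidth values shown on the slides; the link labels `L1` to `L5` and the function names are placeholders introduced here, since the slides do not name the links.

```python
LINKS = ["L1", "L2", "L3", "L4", "L5"]  # placeholder link labels

# Available Mbps per link in each time step, as listed on the slides.
STEP_GRAPHS = [
    dict(zip(LINKS, [0, 100, 800, 500, 300])),     # ts1
    dict(zip(LINKS, [0, 100, 400, 100, 300])),     # ts2
    dict(zip(LINKS, [900, 1000, 400, 100, 300])),  # ts3
    dict(zip(LINKS, [900, 1000, 800, 500, 300])),  # ts4
]


def window_graph(step_graphs):
    """Bottleneck combination: per link, the minimum over the steps."""
    return {link: min(g[link] for g in step_graphs) for link in step_graphs[0]}


def all_windows(num_steps):
    """Every run of consecutive steps as (first, last) step indices."""
    return [(i, j) for i in range(num_steps) for j in range(i, num_steps)]


print(window_graph(STEP_GRAPHS[0:2]))   # tw = ts1 + ts2
print(window_graph(STEP_GRAPHS[2:4]))   # tw = ts3 + ts4
print(len(all_windows(len(STEP_GRAPHS))))
```

A bandwidth that is guaranteed for the whole window can never exceed what is free in its tightest step, which is why the minimum is the right way to merge the graphs.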
  
36	
  
Time Window List (special data structures)
Time windows list: now | infinite
New reservation: reservation 1, start t1, end t10
   now | t1 [Res 1] t10 | infinite
New reservation: reservation 2, start t12, end t20
   now | t1 [Res 1] t10 | t12 [Res 2] t20 | infinite
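One way to maintain such a list incrementally is a sorted array of boundaries with a parallel array of active-reservation sets, split on demand when a new reservation arrives. This is a hypothetical sketch of that idea, not the data structure used in the original work; the class and method names are invented here, and "now" is modelled as time 0.

```python
import bisect


class TimeWindowList:
    """Sorted boundaries; active[i] covers [bounds[i], bounds[i + 1])."""

    def __init__(self):
        self.bounds = [0, float("inf")]   # "now" modelled as 0
        self.active = [set()]

    def _split(self, t):
        """Ensure t is a boundary and return its index."""
        i = bisect.bisect_left(self.bounds, t)
        if self.bounds[i] != t:
            self.bounds.insert(i, t)
            # The new interval starts inside interval i - 1 and inherits its set.
            self.active.insert(i, set(self.active[i - 1]))
        return i

    def add(self, name, start, end):
        """Register a reservation over [start, end), with start < end."""
        lo = self._split(start)
        hi = self._split(end)
        for i in range(lo, hi):
            self.active[i].add(name)


windows = TimeWindowList()
windows.add("Res 1", 1, 10)
windows.add("Res 2", 12, 20)
print(windows.bounds)
print(windows.active)
```

Each insertion touches only the intervals it overlaps, so the list stays sorted without being rebuilt from scratch.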
  
37	
  
Careful software design makes implementation fast and efficient
Performance
•  max-bandwidth path ~ O(n^2), where n is the number of nodes in the topology graph
•  In the worst case, we may need to search all time windows, (s × (s + 1))/2, where s is the number of time steps.
•  If there are r committed reservations in the search period, there can be a maximum of 2r + 1 different time steps in the worst case.
•  Overall, the worst-case complexity is bounded by O(r^2 n^2)
Note: r is relatively very small compared to the number of nodes n
38
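The bound follows by chaining the three counts above; written out as a short derivation (using only the quantities r, s, and n defined on this slide):

```latex
\begin{align*}
s &\le 2r + 1, \\
\#\text{time windows} &\le \frac{s(s+1)}{2} \le \frac{(2r+1)(2r+2)}{2} = (2r+1)(r+1) = O(r^2), \\
\text{total cost} &\le \#\text{time windows} \cdot O(n^2) = O(r^2 n^2).
\end{align*}
```

For example, r = 3 reservations give at most 7 time steps and therefore at most 28 time windows, each requiring one max-bandwidth path search.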
  
Example
Reservation 1: (time t1, t6) A -> B -> D (900Mbps)
Reservation 2: (time t4, t7) A -> C -> D (400Mbps)
Reservation 3: (time t9, t12) A -> B -> D (700Mbps)
[Figure: graph A, B, C, D with link capacities 800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps; timeline t1 ... t13 with Reservations 1, 2, and 3]
Request: from A to D (earliest completion)
max bandwidth = 200Mbps, volume = 200Mbps x 4 time slots
earliest start = t1, latest finish = t13
39	
  
Search Order - Time Windows
[Figure: timeline with boundaries t1, t4, t6, t7, t9, t12, t13 and time steps Res 1 | Res 1,2 | Res 2 | (none) | Res 3 | (none)]
Time windows in search order, with the max bandwidth from A to D and the window length in time slots:
1.  t1 to t4: 900Mbps (3)
2.  t4 to t6: 100Mbps (2)
3.  t1 to t6: 100Mbps (5)
4.  t6 to t7: 900Mbps (1)
5.  t4 to t7: 100Mbps (3)
6.  t1 to t7: 100Mbps (6)
7.  t7 to t9: 900Mbps (2)
8.  t6 to t9: 900Mbps (3)
9.  t4 to t9: 100Mbps (5)
10.  t1 to t9: 100Mbps (8)
Reservation: (A to D) (100Mbps) start=t1 end=t9
   40	
  
Search Order - Time Windows
Shortest duration?
[Figure: timeline with boundaries t1, t4, t6, t7, t9, t12, t13 and time steps Res 1 | Res 1,2 | Res 2 | (none) | Res 3 | (none)]
Time windows in search order, with the max bandwidth from A to D and the window length in time slots:
1.  t9 to t12: 200Mbps (3)
2.  t12 to t13: 900Mbps (1)
3.  t9 to t13: 200Mbps (4)
Reservation: (A to D) (200Mbps) start=t9 end=t13

•  from A to D, max bandwidth = 200Mbps
   volume = 175Mbps x 4 time slots
   earliest start = t1, latest finish = t13
   earliest completion: (A to D) (100Mbps) start=t1 end=t8
   shortest duration: (A to D) (200Mbps) start=t9 end=t12.5
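Both answers can be checked with a few lines of arithmetic over the time windows listed on the two "Search Order" slides. The sketch below is an illustration, not the FlexRes implementation: it takes the per-window maximum bandwidths as given on the slides, assumes a reservation starts at the beginning of its window, measures volume in Mbps x time slots, and uses hypothetical function names.

```python
# (start, end, max Mbps from A to D) for each time window, as listed on the slides.
WINDOWS = [
    (1, 4, 900), (4, 6, 100), (1, 6, 100), (6, 7, 900), (4, 7, 100),
    (1, 7, 100), (7, 9, 900), (6, 9, 900), (4, 9, 100), (1, 9, 100),
    (9, 12, 200), (12, 13, 900), (9, 13, 200),
]


def candidates(volume, max_mbps):
    """Feasible reservations as (start, finish, Mbps)."""
    feasible = []
    for start, end, avail in WINDOWS:
        rate = min(avail, max_mbps)      # never exceed the requested maximum
        needed = volume / rate           # time slots needed at this rate
        if needed <= end - start:
            feasible.append((start, start + needed, rate))
    return feasible


def earliest_completion(volume, max_mbps):
    return min(candidates(volume, max_mbps), key=lambda c: c[1])


def shortest_duration(volume, max_mbps):
    return min(candidates(volume, max_mbps), key=lambda c: c[1] - c[0])


volume = 175 * 4  # 175Mbps x 4 time slots
print(earliest_completion(volume, 200))
print(shortest_duration(volume, 200))
```

With a volume of 175 x 4 = 700, the 100Mbps window starting at t1 needs 7 slots and ends at t8, while the 200Mbps window starting at t9 needs 3.5 slots and ends at t12.5, matching the two results on the slide.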
  
	
  
41	
  
Source > Network > Destination
[Figure: graph A, B, C, D with link capacities 800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps, and end hosts n1 and n2]
Now we have multiple requests
  
42	
  
With start/end times
•  Each transfer request has start and end times
•  n transfer requests are given (each request has a specific amount of profit)
•  The objective is to maximize the profit
•  If the profit is the same for each request, then the objective is to maximize the number of jobs in a given time period
•  Unsplittable Flow Problem:
   •  An undirected graph,
   •  route demand from source(s) to destination(s) and maximize/minimize the total profit/cost
  
	
  
43	
  
	
The online scheduling method here is inspired by the Gale-Shapley algorithm (which solves the stable marriage problem)
Methodology
•  Displace other jobs to open space for the new request
   •  we can shift max n jobs?
•  Never accept a job if it causes other committed jobs to break their criteria
•  Planning ahead (gives opportunity for co-allocation)
•  Gives a polynomial approximation algorithm
   •  The preference converts the UFP problem into Dijkstra path search
•  Utilizes time windows/time steps for ranking (better than earliest deadline first)
   •  Earliest completion + shortest duration
   •  Minimize concurrency
•  Even random ranking would work (relaxation in an NP-hard problem)
  
44	
  
 	
  	
  	
  
45	
  
Recall Time Windows
[Figure: timeline with boundaries t1, t4, t6, t7, t9, t12, t13 and time steps Res 1 | Res 1,2 | Res 2 | (none) | Res 3 | (none)]
Time windows in search order, with the max bandwidth from A to D and the window length in time slots:
1.  t1 to t4: 900Mbps (3)
2.  t4 to t6: 100Mbps (2)
3.  t1 to t6: 100Mbps (5)
4.  t6 to t7: 900Mbps (1)
5.  t4 to t7: 100Mbps (3)
6.  t1 to t7: 100Mbps (6)
7.  t7 to t9: 900Mbps (2)
8.  t6 to t9: 900Mbps (3)
9.  t4 to t9: 100Mbps (5)
10.  t1 to t9: 100Mbps (8)
Reservation: (A to D) (100Mbps) start=t1 end=t9
   46	
  
Test	
  
	
  
47	
  
In real life, the number of nodes and the number of reservations in a given search interval are limited
See the AINA'13 paper for results and a comparison with different preference metrics
  
Autonomic Provisioning System
•  Generate constraints automatically (without user input)
   •  Volume (elephant flow?)
   •  True deadline if applicable
   •  End-host resource availability
   •  Burst rate (fixed bandwidth, variable bandwidth)
•  Update constraints according to feedback and monitoring
•  Minimize operational cost
•  Alternative to manual traffic engineering

What is the incentive to make correct reservations?
  
	
  
	
  
48	
  
[Diagram: Experimental facility A, Data Center 1, Data Center 2, and Data node B (web access), connected through a wide-area SDN]
*  (1) Experimental facility A generates 30TB of data every day, and it needs to be stored in data center 2 before the next run, since local disk space is limited
*  (2) There is a reservation made between data centers 1 and 2. It is used to replicate data files, 1PB total size, when new data is available in data center 2
*  (3) New results are published at data node B; we expect high traffic to download new simulation files for the next couple of months
  
49	
  
Example
•  The experimental facility periodically transfers data (e.g., every night)
•  Data replication happens occasionally, and it will take a week to move 1PB of data. It could be delayed a couple of hours with no harm
•  Wide-area download traffic will increase gradually; most of the traffic will be during the day.
•  We can dynamically increase preference for download traffic in the mornings, give high priority to transferring data from the facility at night, and use the rest of the bandwidth for data replication (and allocate some bandwidth to confirm that it would finish within a week as usual)
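To get a feel for the rates involved, the sustained bandwidth each flow needs can be estimated directly from volume and deadline. This is a rough sketch only: it assumes decimal units (1PB = 10^15 bytes, 1TB = 10^12 bytes), and the 8-hour night window for the facility's 30TB daily transfer is a hypothetical figure chosen for illustration, not a value from the slides.

```python
def required_gbps(num_bytes: int, seconds: int) -> float:
    """Sustained rate in Gbps needed to move num_bytes within seconds."""
    return num_bytes * 8 / seconds / 1e9


WEEK = 7 * 24 * 3600
NIGHT = 8 * 3600  # assumed 8-hour night window (hypothetical)

replication = required_gbps(10**15, WEEK)       # 1PB replicated within a week
nightly = required_gbps(30 * 10**12, NIGHT)     # 30TB from the facility per night

print(f"replication: {replication:.2f} Gbps sustained")
print(f"nightly facility transfer: {nightly:.2f} Gbps sustained")
```

Numbers like these are what an autonomic provisioning system would turn into constraints: the replication flow tolerates delay as long as its weekly average is met, while the nightly transfer has a hard deadline.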
  
50	
  
Virtual Circuit Reservation Engine
[Diagram: autonomic provisioning system, monitoring, and reservation engine]
Reservation Engine
–  Select optimal path/time/bandwidth
–  Maximize the number of admitted requests
–  Increase overall system utilization and network efficiency
–  Dynamically update the selected routing path for network efficiency
–  Modify existing reservations dynamically to open space/time for new requests
  
51	
  
Performance Engineer?
•  Sample projects:
   •  VSAN (Virtual SAN)
   •  VVOL (Virtual Volumes)
•  Important aspects of performance engineering:
   •  Be a part of the initial development phase
   •  Develop techniques to analyze performance problems
   •  Make sure performance issues are addressed correctly
  
52	
  
THANK YOU
Any Questions/Comments?
Mehmet Balman    mbalman@lbl.gov
http://balman.info
  
	
  
53	
  

Mais conteúdo relacionado

Destaque

Be like Charlie
Be like CharlieBe like Charlie
Be like CharlieeduKazi
 
Work Placement Training Preparation
Work Placement Training PreparationWork Placement Training Preparation
Work Placement Training PreparationNgaire Gardiner
 
Chicago Through Bruce's Eyes: Art and Architecutre
Chicago Through Bruce's Eyes: Art and ArchitecutreChicago Through Bruce's Eyes: Art and Architecutre
Chicago Through Bruce's Eyes: Art and ArchitecutreBruce Fogelson
 
Nice-Ride-Five-Year-Assessment
Nice-Ride-Five-Year-AssessmentNice-Ride-Five-Year-Assessment
Nice-Ride-Five-Year-AssessmentEmily Wade
 
Addressing psychiatric disorder among student-athletes: Challenges facing men...
Addressing psychiatric disorder among student-athletes: Challenges facing men...Addressing psychiatric disorder among student-athletes: Challenges facing men...
Addressing psychiatric disorder among student-athletes: Challenges facing men...Erick Schlimmer
 
Maximize the Output of Free SEO Tools
Maximize the Output of Free SEO ToolsMaximize the Output of Free SEO Tools
Maximize the Output of Free SEO ToolsRebecca Gill
 
Dielectric Properties of ZrO2/ PMMA Nanocomposites
Dielectric Properties of ZrO2/ PMMA NanocompositesDielectric Properties of ZrO2/ PMMA Nanocomposites
Dielectric Properties of ZrO2/ PMMA NanocompositesIOSR Journals
 

Destaque (12)

La Amistad
La AmistadLa Amistad
La Amistad
 
Grad IOSH
Grad IOSHGrad IOSH
Grad IOSH
 
Be like Charlie
Be like CharlieBe like Charlie
Be like Charlie
 
Work Placement Training Preparation
Work Placement Training PreparationWork Placement Training Preparation
Work Placement Training Preparation
 
test
testtest
test
 
Chicago Through Bruce's Eyes: Art and Architecutre
Chicago Through Bruce's Eyes: Art and ArchitecutreChicago Through Bruce's Eyes: Art and Architecutre
Chicago Through Bruce's Eyes: Art and Architecutre
 
Nice-Ride-Five-Year-Assessment
Nice-Ride-Five-Year-AssessmentNice-Ride-Five-Year-Assessment
Nice-Ride-Five-Year-Assessment
 
Addressing psychiatric disorder among student-athletes: Challenges facing men...
Addressing psychiatric disorder among student-athletes: Challenges facing men...Addressing psychiatric disorder among student-athletes: Challenges facing men...
Addressing psychiatric disorder among student-athletes: Challenges facing men...
 
Potafos cana
Potafos canaPotafos cana
Potafos cana
 
Maximize the Output of Free SEO Tools
Maximize the Output of Free SEO ToolsMaximize the Output of Free SEO Tools
Maximize the Output of Free SEO Tools
 
Dielectric Properties of ZrO2/ PMMA Nanocomposites
Dielectric Properties of ZrO2/ PMMA NanocompositesDielectric Properties of ZrO2/ PMMA Nanocomposites
Dielectric Properties of ZrO2/ PMMA Nanocomposites
 
Ruthenfor
RuthenforRuthenfor
Ruthenfor
 

Network-aware Data Management Middleware for High Throughput Flows --- AT&T Research Bedminster, NJ – Talk – March 16 2015

  • 1. Network-aware Data Management Middleware for High Throughput Flows
      March 16, 2015
      Mehmet Balman, http://balman.info
      Performance Engineer at VMware Inc.
      Guest Scientist at Berkeley Lab
  • 2. About me:
      • 2013: Performance, Central Engineering, VMware, Palo Alto, CA
      • 2009: Computational Research Division (CRD) at Lawrence Berkeley National Laboratory (LBNL)
      • 2005: Center for Computation & Technology (CCT), Baton Rouge, LA
      • Computer Science, Louisiana State University (2010, 2008)
      • Bogazici University, Istanbul, Turkey (2006, 2000)
      Data Transfer Scheduling with Advance Reservation and Provisioning, Ph.D.
      Failure-Awareness and Dynamic Adaptation in Data Scheduling, M.S.
      Parallel Tetrahedral Mesh Refinement, M.S.
  • 3. Why Network-aware?
      Networking is one of the major components in many of the solutions today
      • Distributed data and compute resources
      • Collaboration: data to be shared between remote sites
      • Data centers are complex network infrastructures
      Questions:
      • What further steps are necessary to take full advantage of future networking infrastructure?
      • How are we going to deal with performance problems?
      • How can we enhance data management services and make them network-aware?
      New collaborations between data management and networking communities.
  • 4. Two major players:
      • Abstraction and Programmability
          • Rapid development, intelligent services
          • Orchestrating compute, storage, and network resources together
          • Integration and deployment of complex workflows
          • Virtualization (+containers)
          • Distributed storage (storage wars)
          • Open Source (if you can't fix it, you don't own it)
      • Performance Gap:
          • Limitation is current system software and foreseen speed:
          • Hardware is fast, software is slow
          • Latency/throughput mismatch will lead to new innovations
  • 5. Outline
      • Data Streaming in High-bandwidth Networks
          • Climate100: Advance Network Initiative and 100Gbps Demo
          • MemzNet: Memory-Mapped Network Zero-copy Channels
          • Core Affinity and End System Tuning in High-Throughput Flows
      • Network Reservation and Online Scheduling
          • FlexRes: A Flexible Network Reservation Algorithm
          • SchedSim: Online Scheduling with Advance Provisioning
      • Performance Engineering and Virtualized Solutions
          • Software Defined Storage
  • 6. 100Gbps networking has finally arrived!
      Applications' Perspective
      Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.
      1Gbps to 10Gbps transition (10 years ago): applications did not run 10 times faster because there was more bandwidth available.
  • 7. ANI 100Gbps Demo
      • 100Gbps demo by ESnet and Internet2
      • Application design issues and host tuning strategies to scale to 100Gbps rates
      • Visualization of remotely located data (Cosmology)
      • Data movement of large datasets with many files (Climate analysis)
  • 8. Earth System Grid Federation (ESGF)
      • Over 2,700 sites
      • 25,000 users
      • IPCC Fifth Assessment Report (AR5): 2PB
      • IPCC Fourth Assessment Report (AR4): 35TB
      • Remote Data Analysis
      • Bulk Data Movement
  • 9. Application's Perspective: Climate Data Analysis
  • 10. The lots-of-small-files problem! File-centric tools?
      [Figure: FTP exchange (request a file, send file, repeated for each file) compared with RPC exchange (request data, send data)]
      • Keep the network pipe full
      • We want out-of-order and asynchronous send/receive
  • 11. Many Concurrent Streams
      [Figure: (a) total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers.]
      Three hosts, each with 4 available NICs, and a total of 10 10Gbps NIC pairs were used to saturate the 100Gbps pipe in the ANI Testbed. Ten data movement jobs, one per NIC pair, were started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5min intervals (e.g., when the concurrency level is 4, there are 40 streams in total).
  • 12. Effects of many concurrent streams
      [Figure: ANI testbed 100Gbps (10x10 NICs, three hosts): interrupts/CPU vs. the number of concurrent transfers (1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5min intervals); TCP buffer size is 50M]
  • 13. Analysis of Core Affinities (NUMA Effect)
      Nathan Hanford et al., NDM'13
      [Figure labels: Sandy Bridge architecture; receive process]
  • 14. Analysis of Core Affinities (NUMA Effect)
      Nathan Hanford et al., NDM'14
  • 15. 100Gbps demo environment
      RTT: Seattle to NERSC: 16ms
           NERSC to ANL: 50ms
           NERSC to ORNL: 64ms
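
One way to read these round-trip times is through the bandwidth-delay product (BDP), the amount of data that must be in flight to keep a path full. The sketch below is only an illustration: applying it per 10Gbps NIC is an assumption taken from the 10x10Gbps NIC setup mentioned in the talk, and the names `bdp_bytes`, `RTT_MS`, and `NIC_RATE_BPS` are hypothetical.

```python
def bdp_bytes(rate_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes for a rate in bits/s and an RTT in ms."""
    # Divide by 1000 to convert ms to s and by 8 to convert bits to bytes.
    return rate_bps * rtt_ms // 8000

# RTTs from the demo environment slide, in milliseconds.
RTT_MS = {"Seattle-NERSC": 16, "NERSC-ANL": 50, "NERSC-ORNL": 64}
NIC_RATE_BPS = 10_000_000_000  # assumption: one 10Gbps NIC per flow group

for path, rtt in RTT_MS.items():
    megabytes = bdp_bytes(NIC_RATE_BPS, rtt) / 1e6
    print(f"{path}: about {megabytes:.1f} MB in flight to fill 10Gbps")
```

Under these assumptions a single 10Gbps path needs tens of megabytes in flight, which is the same order of magnitude as the TCP buffer sizes quoted elsewhere in the talk.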
  • 16. Framework for the Memory-mapped Network Channel
      + Synchronization mechanism for RoCE
      - Keep the pipe full for remote analysis
  • 17. Moving climate files efficiently
  • 18. Advantages
      • Decoupling I/O and network operations
          • front-end (I/O processing)
          • back-end (networking layer)
      • Not limited by the characteristics of the file sizes
          • On-the-fly tar approach: bundling and sending many files together
      • Dynamic data channel management
          Can increase/decrease the parallelism level both in the network communication and in I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).
      MemzNet is not file-centric. Bookkeeping information is embedded inside each block.
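
To illustrate what "bookkeeping information embedded inside each block" can mean, here is a minimal sketch of self-describing blocks that can be sent and received in any order. The header layout (file id, offset, length) and the function names are hypothetical and are not MemzNet's actual wire format.

```python
import struct

# Hypothetical header: file id (4 bytes), byte offset in the file (8 bytes),
# payload length (4 bytes), in network byte order. Not MemzNet's real layout.
HEADER = struct.Struct("!IQI")

def pack_block(file_id: int, offset: int, payload: bytes) -> bytes:
    """Prefix the payload with the bookkeeping needed to place it later."""
    return HEADER.pack(file_id, offset, len(payload)) + payload

def unpack_block(block: bytes):
    """Return (file_id, offset, payload) from a packed block."""
    file_id, offset, length = HEADER.unpack_from(block)
    return file_id, offset, block[HEADER.size:HEADER.size + length]

def reassemble(blocks):
    """Rebuild file contents from blocks that arrive in any order."""
    files = {}
    for block in blocks:
        file_id, offset, payload = unpack_block(block)
        buf = files.setdefault(file_id, bytearray())
        end = offset + len(payload)
        if len(buf) < end:
            buf.extend(b"\0" * (end - len(buf)))  # grow to fit this block
        buf[offset:end] = payload
    return {fid: bytes(buf) for fid, buf in files.items()}

# Blocks from two files, deliberately out of order.
received = [
    pack_block(1, 5, b"world"),
    pack_block(2, 0, b"xy"),
    pack_block(1, 0, b"hello"),
]
print(reassemble(received))
```

Because every block carries its own placement information, the receiver does not need per-file control messages, which is one way to avoid the request/response pattern of file-centric tools.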
  • 19. MemzNet's Architecture for data streaming
  • 20. 100Gbps Demo
      • CMIP3 data (35TB) from the GPFS filesystem at NERSC
          • Block size 4MB
          • Each block's data section was aligned according to the system page size.
          • 1GB cache both at the client and the server
      • At NERSC, 8 front-end threads on each host for reading data files in parallel.
      • At ANL/ORNL, 4 front-end threads for processing received data blocks.
      • 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.
  • 22. MemzNet's Performance
      TCP buffer size is set to 50MB
      [Figure panels: MemzNet and GridFTP; 100Gbps demo and ANI Testbed]
  • 23. Challenge?
      • High bandwidth brings new challenges!
          • We need a substantial amount of processing power and the involvement of multiple cores to fill a 40Gbps or 100Gbps network
          • Fine-tuning, in both the network and application layers, to take advantage of the higher network capacity.
      • Incremental improvement in current tools?
          • We cannot expect every application to tune and improve every time we change the link technology or speed.
  • 24. MemzNet
      • MemzNet: Memory-mapped Network Channel
          • High-performance data movement
      MemzNet is an initial effort to put a new layer between the application and the transport layer.
      • Main goal is to define a network channel so applications can directly use it without the burden of managing/tuning the network communication.
      Tech report: LBNL-6177E
  • 25. MemzNet = New Execution Model
      • Luigi Rizzo's netmap
          • proposes a new API to send/receive data over the network
      • RDMA programming model
          • MemzNet as a memory-management component
      • IX: Data Plane OS (Adam Belay et al., Stanford; similar to MemzNet's model)
      • mTCP (event-based; replaces send/receive in user level)
      • Tanenbaum et al., minimizing context switches: proposing to use MONITOR/MWAIT for synchronization
  • 26. Problem Domain: ESnet's OSCARS
      [Figure: ESnet map: hub cities, connected laboratories (PNNL, SLAC, AMES, PPPL, BNL, ORNL, JLAB, FNAL, ANL, LBNL), and peerings with US R&E and international networks]
      • Connecting experimental facilities and supercomputing centers
      • On-Demand Secure Circuits and Advance Reservation System
      • Guaranteed bandwidth between collaborating institutions by delivering network-as-a-service
      • Co-allocation of storage and network resources (SRM: Storage Resource Manager)
      OSCARS provides yes/no answers to a reservation request for (bandwidth, start_time, end_time)
      End-to-end Reservation: Storage + Network
  • 27. Reservation Request
      • Between edge routers
      Need to ensure availability of the requested bandwidth from source to destination for the requested time interval
      • $R = \{ n_{source}, n_{destination}, M_{bandwidth}, t_{start}, t_{end} \}$
          • source/destination end-points
          • requested bandwidth
          • start/end times
      Committed reservations between $t_{start}$ and $t_{end}$ are examined.
      The shortest path from source to destination is calculated based on the engineering metric on each link, and a bandwidth-guaranteed path is set up to commit and eventually complete the reservation request for the given time period.
  • 28. Reservation
      • Components (Graph):
          • node (router), port, link (connecting two ports)
          • engineering metric (~latency)
          • maximum bandwidth (capacity)
      • Reservation:
          • source, destination, path, time
          • (time t1, t3) A -> B -> D (900Mbps)
          • (time t2, t3) A -> C -> D (400Mbps)
          • (time t4, t5) A -> B -> D (800Mbps)
      [Figure: four-node topology (A, B, C, D) with link capacities of 800Mbps, 900Mbps, 500Mbps, 1000Mbps, and 300Mbps, and a timeline t1 to t5 showing Reservations 1, 2, and 3]
  • 29. Example
      (time t1, t2):
          A to D (600Mbps): NO
          A to D (500Mbps): YES
      [Figure: topology A, B, C, D; legend: active reservation; link labels read available / reserved (capacity)]
          0 Mbps / 900 Mbps (900 Mbps)
          100 Mbps / 900 Mbps (1000 Mbps)
          800 Mbps / 0 Mbps (800 Mbps)
          500 Mbps / 0 Mbps (500 Mbps)
          300 Mbps / 0 Mbps (300 Mbps)
      reservation 1: (time t1, t3) A -> B -> D (900Mbps)
      reservation 2: (time t2, t3) A -> C -> D (400Mbps)
      reservation 3: (time t4, t5) A -> B -> D (800Mbps)
  • 30. Example
      (time t1, t3):
          A to D (500Mbps): NO
          A to C (500Mbps): NO (not max-flow!)
      [Figure: topology A, B, C, D; legend: active reservation; link labels read available / reserved (capacity)]
          0 Mbps / 900 Mbps (900 Mbps)
          100 Mbps / 900 Mbps (1000 Mbps)
          400 Mbps / 400 Mbps (800 Mbps)
          100 Mbps / 400 Mbps (500 Mbps)
          300 Mbps / 0 Mbps (300 Mbps)
      reservation 1: (time t1, t3) A -> B -> D (900Mbps)
      reservation 2: (time t2, t3) A -> C -> D (400Mbps)
      reservation 3: (time t4, t5) A -> B -> D (800Mbps)
  • 31. Alternative Approach: Flexible Reservations
      • If the requested bandwidth cannot be guaranteed:
          • Trial and error until an available reservation is found
          • Client is not given other possible options
      • How can we enhance the OSCARS reservation system?
      • Be flexible:
          • Submit constraints, and the system suggests possible reservation options satisfying the given requirements
      $R_s' = \{ n_{source}, n_{destination}, M_{MAXbandwidth}, D_{dataSize}, t_{EarliestStart}, t_{LatestEnd} \}$
      The reservation engine finds the reservation
      $R = \{ n_{source}, n_{destination}, M_{bandwidth}, t_{start}, t_{end} \}$
      for the earliest completion or for the shortest duration, where
      $M_{bandwidth} \le M_{MAXbandwidth}$ and $t_{EarliestStart} \le t_{start} < t_{end} \le t_{LatestEnd}$.
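
The selection step can be sketched as follows. This is a simplified illustration, not the FlexRes implementation: it assumes that candidate time windows and the maximum bandwidth available in each have already been computed, uses seconds, megabits, and Mbps as units, and the function name `suggest` and its tuple layout are hypothetical.

```python
def suggest(windows, data_megabits, max_bw_mbps, earliest, latest,
            objective="earliest_completion"):
    """Pick (start, end, bandwidth) from candidate windows, or None.

    windows: iterable of (window_start, window_end, available_mbps).
    """
    options = []
    for w_start, w_end, avail_mbps in windows:
        bw = min(avail_mbps, max_bw_mbps)      # M_bandwidth <= M_MAXbandwidth
        if bw <= 0:
            continue
        start = max(w_start, earliest)          # t_EarliestStart <= t_start
        end = start + data_megabits / bw        # time needed to move D_dataSize
        if end <= min(w_end, latest):           # t_end <= t_LatestEnd
            options.append((start, end, bw))
    if not options:
        return None
    if objective == "shortest_duration":
        return min(options, key=lambda o: (o[1] - o[0], o[1]))
    return min(options, key=lambda o: (o[1], o[1] - o[0]))

# Two hypothetical windows: 100 Mbps available from t=0, 500 Mbps from t=60.
candidates = [(0, 200, 100), (60, 200, 500)]
print(suggest(candidates, 6000, 400, 0, 200))                       # earliest completion
print(suggest(candidates, 6000, 400, 0, 200, "shortest_duration"))  # shortest duration
```

In this example the two objectives disagree: starting immediately at a low rate finishes first, while waiting for the wider window gives the shortest transfer.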
  • 32. Bandwidth Allocation (time-dependent)
      Modified Dijkstra's algorithm (max available bandwidth):
      • Bottleneck constraint (not additive)
      • QoS constraint is additive (as in shortest path, etc.)
      [Figure: the maximum bandwidth available for allocation from a source node to a destination node, over a timeline t1 to t6]
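
A minimal sketch of such a modified Dijkstra search is shown below: instead of summing link costs, it maximizes the minimum (bottleneck) bandwidth along the path. The talk does not show its implementation, so this is a generic widest-path variant. Treating links as undirected and assigning the example's bandwidth values to specific links (A-B, B-D, A-C, C-D, B-C) are assumptions made for illustration.

```python
import heapq

def widest_path(avail, source, dest):
    """Return (bottleneck_mbps, path) for a maximum-bandwidth path, or (0, [])."""
    # Build an undirected adjacency map from {(u, v): available_mbps}.
    adj = {}
    for (u, v), bw in avail.items():
        adj.setdefault(u, []).append((v, bw))
        adj.setdefault(v, []).append((u, bw))

    best = {source: float("inf")}   # best known bottleneck to each node
    prev = {}
    heap = [(-best[source], source)]  # max-heap via negated bandwidth
    done = set()
    while heap:
        neg_bw, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dest:
            break
        for v, link_bw in adj.get(u, []):
            cand = min(-neg_bw, link_bw)   # bottleneck, not a sum
            if cand > best.get(v, 0):
                best[v] = cand
                prev[v] = u
                heapq.heappush(heap, (-cand, v))

    if best.get(dest, 0) <= 0:
        return 0, []
    path = [dest]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[dest], path[::-1]

# Available bandwidth (Mbps) from the two examples; link placement is assumed.
t1_t2 = {("A", "B"): 0, ("B", "D"): 100, ("A", "C"): 800, ("C", "D"): 500, ("B", "C"): 300}
t1_t3 = {("A", "B"): 0, ("B", "D"): 100, ("A", "C"): 400, ("C", "D"): 100, ("B", "C"): 300}
print(widest_path(t1_t2, "A", "D"))
print(widest_path(t1_t3, "A", "D"))
```

With these assumed placements, the search agrees with the yes/no answers in the examples: 500Mbps from A to D fits in (t1, t2) but not in (t1, t3).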
  • 33. Analogous Example
      • A vehicle travelling from city A to city B
      • There are multiple cities between A and B connected with separate highways.
      • Each highway has a specific speed limit (maximum bandwidth)
      • But we need to reduce our speed if there is a high traffic load on the road
      • We know the load on each highway for every time period (active reservations)
      • The first question is which path the vehicle should follow in order to reach city B from city A as early as possible (earliest completion)
      • Or, we can delay our journey and start later if the total travel time would be reduced. The second question is to find the route, along with the starting time, for the shortest travel duration (shortest duration)
      Advance bandwidth reservation: we have to set the speed limit before starting and cannot change it during the journey
  • 34. Time steps • Time steps between t1 and t13 • [Figure: timeline t1 to t13 showing Reservation 1, Reservation 2, and Reservation 3; the boundaries t1, t4, t6, t7, t9, t12, t13 delimit the time steps (ts1, ts2, ts3, ts4, ...), labeled Res 1; Res 1,2; Res 2; Res 3] • Max (2r+1) time steps, where r is the number of reservations
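The time steps on slide 34 follow from cutting the search interval at every reservation start and end time. A minimal sketch of that idea, assuming reservations are given as `(name, start, end)` tuples (the function name and tuple layout are hypothetical):

```python
def time_steps(reservations, search_start, search_end):
    """Split [search_start, search_end] at reservation boundaries.

    Returns a list of (step_start, step_end, active_reservation_names).
    With r reservations there are at most 2r + 1 steps.
    """
    points = {search_start, search_end}
    for _, start, end in reservations:
        for t in (start, end):
            if search_start < t < search_end:
                points.add(t)
    bounds = sorted(points)
    steps = []
    for lo, hi in zip(bounds, bounds[1:]):
        # A reservation is active in a step if the two intervals overlap.
        active = sorted(name for name, start, end in reservations
                        if start < hi and end > lo)
        steps.append((lo, hi, active))
    return steps


# Reservation times from the example on slide 39 (t1 -> 1, t6 -> 6, ...).
example = [("Res 1", 1, 6), ("Res 2", 4, 7), ("Res 3", 9, 12)]
for step in time_steps(example, 1, 13):
    print(step)
```

With these three reservations the interval t1 to t13 splits at t4, t6, t7, t9, and t12, giving six steps, within the bound of 2r + 1 = 7.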
  • 35. Static Graphs • One static graph per time step, listing the available bandwidth on the five links of the A, B, C, D topology: • G(ts1), t1 to t4, Res 1: 0 Mbps, 100 Mbps, 800 Mbps, 500 Mbps, 300 Mbps • G(ts2), t4 to t6, Res 1,2: 0 Mbps, 100 Mbps, 400 Mbps, 100 Mbps, 300 Mbps • G(ts3), t6 to t7, Res 2: 900 Mbps, 1000 Mbps, 400 Mbps, 100 Mbps, 300 Mbps • G(ts4), t7 to t9, no reservation: 900 Mbps, 1000 Mbps, 800 Mbps, 500 Mbps, 300 Mbps
  • 36. Time Windows • tw = ts1 + ts2 (t1 to t6, Res 1,2): G(tw) = G(ts1) x G(ts2), with available bandwidth 0 Mbps, 100 Mbps, 400 Mbps, 100 Mbps, 300 Mbps • tw = ts3 + ts4 (t6 to t9, Res 2): G(tw) = G(ts3) x G(ts4), with available bandwidth 900 Mbps, 1000 Mbps, 400 Mbps, 100 Mbps, 300 Mbps • Bottleneck constraint • Max (s × (s + 1))/2 time windows, where s is the number of time steps
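The "x" operation on slide 36 merges the static graphs of consecutive time steps under the bottleneck constraint: a link can offer, over the whole window, only the smallest bandwidth it has in any step of the window. The sketch below enumerates every contiguous window this way. The function names are hypothetical, and the assignment of the listed bandwidths to specific links (in particular 300 Mbps to B-C) is an assumption, since the figure labels are lost.

```python
def combine(graph_a, graph_b):
    """Per-link minimum of two graphs over the same links (bottleneck)."""
    return {link: min(graph_a[link], graph_b[link]) for link in graph_a}


def time_windows(step_graphs):
    """Return (first_step, last_step, graph) for every contiguous window.

    For s time steps this yields s * (s + 1) / 2 windows.
    """
    windows = []
    for i in range(len(step_graphs)):
        graph = step_graphs[i]
        windows.append((i, i, graph))
        for j in range(i + 1, len(step_graphs)):
            graph = combine(graph, step_graphs[j])
            windows.append((i, j, graph))
    return windows


LINKS = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D"), ("B", "C")]
# Available bandwidth (Mbps) per time step, in the order listed on slide 35.
g_ts1 = dict(zip(LINKS, [0, 100, 800, 500, 300]))     # Res 1
g_ts2 = dict(zip(LINKS, [0, 100, 400, 100, 300]))     # Res 1,2
g_ts3 = dict(zip(LINKS, [900, 1000, 400, 100, 300]))  # Res 2
g_ts4 = dict(zip(LINKS, [900, 1000, 800, 500, 300]))  # no reservation

windows = time_windows([g_ts1, g_ts2, g_ts3, g_ts4])
print(len(windows))  # 10
```

The windows ts1+ts2 and ts3+ts4 produced here carry the same bandwidth values as the two graphs shown on slide 36.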
  • 37. Time Window List (special data structures) • Time windows list: now | infinite • New reservation: reservation 1, start t1, end t10: now | t1 [Res 1] t10 | infinite • New reservation: reservation 2, start t12, end t20: now | t1 [Res 1] t10 | t12 [Res 2] t20 | infinite • Careful software design makes the implementation fast and efficient
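Slide 37 does not spell out the data structure, so the following is only one plausible reading of it: a sorted list of boundary times from "now" to infinity, where each gap between neighbouring boundaries records which reservations are active. The class and method names are hypothetical.

```python
import bisect


class TimeWindowList:
    """Sorted time boundaries; active[i] covers [bounds[i], bounds[i + 1])."""

    def __init__(self, now=0.0):
        self.bounds = [now, float("inf")]
        self.active = [frozenset()]

    def _split(self, t):
        """Make t a boundary and return its index."""
        i = bisect.bisect_left(self.bounds, t)
        if self.bounds[i] != t:
            self.bounds.insert(i, t)
            # The new segment inherits the reservations of the one it splits.
            self.active.insert(i, self.active[i - 1])
        return i

    def add(self, name, start, end):
        if not (self.bounds[0] <= start < end):
            raise ValueError("need now <= start < end")
        lo = self._split(start)
        hi = self._split(end)
        for i in range(lo, hi):
            self.active[i] = self.active[i] | {name}

    def segments(self):
        return [(self.bounds[i], self.bounds[i + 1], sorted(self.active[i]))
                for i in range(len(self.active))]


windows = TimeWindowList(now=0)
windows.add("Res 1", 1, 10)
windows.add("Res 2", 12, 20)
for segment in windows.segments():
    print(segment)
```

Each insertion touches only the boundaries it adds and the segments it overlaps, which is the kind of incremental update the slide's before-and-after lists suggest.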
  • 38. Performance • Max-bandwidth path ~ $O(n^2)$, where n is the number of nodes in the topology graph • In the worst case, we may need to search all time windows, (s × (s + 1))/2, where s is the number of time steps • If there are r committed reservations in the search period, there can be a maximum of 2r + 1 different time steps in the worst case • Overall, the worst-case complexity is bounded by $O(r^2 n^2)$ • Note: r is typically very small compared to the number of nodes n
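The bound on slide 38 follows by substituting the maximum number of time steps into the window count and multiplying by the cost of one path search per window:

```latex
\begin{align*}
s &\le 2r + 1 \\
\frac{s(s+1)}{2} &\le \frac{(2r+1)(2r+2)}{2} = (2r+1)(r+1) = 2r^2 + 3r + 1 = O(r^2) \\
T(r, n) &= O(r^2) \cdot O(n^2) = O(r^2 n^2)
\end{align*}
```

For instance, with r = 3 reservations there are at most 7 time steps and at most 28 time windows to search.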
  • 39. Example • Reservation 1: (time t1, t6) A -> B -> D (900 Mbps) • Reservation 2: (time t4, t7) A -> C -> D (400 Mbps) • Reservation 3: (time t9, t12) A -> B -> D (700 Mbps) • [Figure: topology A, B, C, D with link capacities 800 Mbps, 900 Mbps, 500 Mbps, 1000 Mbps, 300 Mbps; timeline t1 to t13 with Reservations 1, 2, and 3] • Request: from A to D (earliest completion), max bandwidth = 200 Mbps, volume = 200 Mbps x 4 time slots, earliest start = t1, latest finish = t13
  • 40. Search Order - Time Windows • [Figure: timeline t1, t4, t6, t7, t9, t12, t13 with time steps labeled Res 1; Res 1,2; Res 2; Res 3] • Time windows: t1-t6, t4-t6, t1-t4, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9 • Max bandwidth from A to D: 1. 900 Mbps (3) 2. 100 Mbps (2) 3. 100 Mbps (5) 4. 900 Mbps (1) 5. 100 Mbps (3) 6. 100 Mbps (6) 7. 900 Mbps (2) 8. 900 Mbps (3) 9. 100 Mbps (5) 10. 100 Mbps (8) • Reservation: (A to D) (100 Mbps) start=t1 end=t9
  • 41. Search Order - Time Windows: Shortest duration? • [Figure: timeline t1, t4, t6, t7, t9, t12, t13 with time steps labeled Res 1; Res 1,2; Res 2; Res 3] • Time windows: t9-t13, t12-t13, t9-t12 • Max bandwidth from A to D: 1. 200 Mbps (3) 2. 900 Mbps (1) 3. 200 Mbps (4) • Reservation: (A to D) (200 Mbps) start=t9 end=t13 • Request: from A to D, max bandwidth = 200 Mbps, volume = 175 Mbps x 4 time slots, earliest start = t1, latest finish = t13 • Earliest completion: (A to D) (100 Mbps) start=t1 end=t8 • Shortest duration: (A to D) (200 Mbps) start=t9 end=t12.5
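The two answers at the end of slide 41 can be reproduced with simple arithmetic over candidate time windows. The sketch below is a simplification, not the reservation engine itself: it assumes each candidate transfer starts at the beginning of its window, takes the per-window bandwidths from slides 40 and 41 as given, and uses hypothetical function names.

```python
def candidate(window_start, window_end, available_bw, volume, max_bw):
    """Reservation that starts at the window start, or None if it cannot fit."""
    bandwidth = min(available_bw, max_bw)  # never exceed the requested cap
    if bandwidth <= 0:
        return None
    duration = volume / bandwidth
    if window_start + duration > window_end:
        return None
    return {"bandwidth": bandwidth, "start": window_start,
            "end": window_start + duration}


def choose(windows, volume, max_bw):
    """Return (earliest-completion option, shortest-duration option)."""
    options = [c for c in (candidate(s, e, bw, volume, max_bw)
                           for s, e, bw in windows) if c]
    if not options:
        return None, None
    earliest = min(options, key=lambda c: c["end"])
    shortest = min(options, key=lambda c: (c["end"] - c["start"], c["start"]))
    return earliest, shortest


# (start, end, max bandwidth A -> D in Mbps); t1 -> 1, t9 -> 9, and so on.
windows = [(1, 9, 100), (9, 12, 200), (12, 13, 900), (9, 13, 200)]
volume = 175 * 4  # 700 units, as in the request on slide 41
earliest, shortest = choose(windows, volume, max_bw=200)
print(earliest)  # {'bandwidth': 100, 'start': 1, 'end': 8.0}
print(shortest)  # {'bandwidth': 200, 'start': 9, 'end': 12.5}
```

At 100 Mbps the 700 units need 7 time slots (t1 to t8), while at 200 Mbps they need 3.5 slots (t9 to t12.5), so the earliest-completion and shortest-duration answers differ.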
  • 42. Source > Network > Destination • [Figure: topology A, B, C, D with link capacities 800 Mbps, 900 Mbps, 500 Mbps, 1000 Mbps, 300 Mbps, and nodes n1 and n2] • Now we have multiple requests
  • 43. With start/end times • Each transfer request has start and end times • n transfer requests are given (each request has a specific amount of profit) • The objective is to maximize the profit • If the profit is the same for each request, then the objective is to maximize the number of jobs in a given time period • Unsplittable Flow Problem: • An undirected graph • Route demand from source(s) to destination(s) and maximize/minimize the total profit/cost • The online scheduling method here is inspired by the Gale-Shapley algorithm (for the stable marriage problem)
  • 44. Methodology • Displace other jobs to open space for the new request • Can we shift at most n jobs? • Never accept a job if it causes other committed jobs to break their criteria • Planning ahead (gives an opportunity for co-allocation) • Gives a polynomial approximation algorithm • The preference converts the UFP problem into a Dijkstra path search • Utilizes time windows/time steps for ranking (better than earliest deadline first) • Earliest completion + shortest duration • Minimize concurrency • Even random ranking would work (relaxation in an NP-hard problem)
  • 46. Recall Time Windows • [Figure: timeline t1, t4, t6, t7, t9, t12, t13 with time steps labeled Res 1; Res 1,2; Res 2; Res 3] • Time windows: t1-t6, t4-t6, t1-t4, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9 • Max bandwidth from A to D: 1. 900 Mbps (3) 2. 100 Mbps (2) 3. 100 Mbps (5) 4. 900 Mbps (1) 5. 100 Mbps (3) 6. 100 Mbps (6) 7. 900 Mbps (2) 8. 900 Mbps (3) 9. 100 Mbps (5) 10. 100 Mbps (8) • Reservation: (A to D) (100 Mbps) start=t1 end=t9
  • 47. Test • In real life, the number of nodes and the number of reservations in a given search interval are limited • See the AINA'13 paper for results and a comparison with different preference metrics
  • 48. Autonomic Provisioning System • Generate constraints automatically (without user input) • Volume (elephant flow?) • True deadline, if applicable • End-host resource availability • Burst rate (fixed bandwidth, variable bandwidth) • Update constraints according to feedback and monitoring • Minimize operational cost • Alternative to manual traffic engineering • What is the incentive to make correct reservations?
  • 49. [Figure: Experimental facility A, Data Center 1, Data Center 2, and Data node B (web access), connected over a wide-area SDN] • (1) Experimental facility A generates 30 TB of data every day, which needs to be stored in data center 2 before the next run, since local disk space is limited • (2) There is a reservation between data centers 1 and 2. It is used to replicate data files, 1 PB in total, when new data is available in data center 2 • (3) New results are published at data node B; we expect high traffic to download new simulation files for the next couple of months
  • 50. Example • The experimental facility transfers data periodically (e.g. every night) • Data replication happens occasionally, and it takes a week to move 1 PB of data. It could be delayed by a couple of hours with no harm • Wide-area download traffic will increase gradually; most of the traffic will be during the day • We can dynamically increase the preference for download traffic in the mornings, give high priority to transferring data from the facility at night, and use the rest of the bandwidth for data replication (allocating enough bandwidth to ensure that it still finishes within a week as usual)
  • 51. [Figure: Virtual Circuit, Reservation Engine, Autonomic provisioning system, monitoring] • Reservation Engine: – Select the optimal path/time/bandwidth – Maximize the number of admitted requests – Increase overall system utilization and network efficiency – Dynamically update the selected routing path for network efficiency – Modify existing reservations dynamically to open space/time for new requests
  • 52. Performance Engineer? • Sample projects: • VSAN (Virtual SAN) • VVOL (Virtual Volumes) • Important aspects of performance engineering: • Be a part of the initial development phase • Develop techniques to analyze performance problems • Make sure performance issues are addressed correctly
  • 53. THANK YOU • Any Questions/Comments? • Mehmet Balman • mbalman@lbl.gov • http://balman.info