Disco workshop
From zero to CDN log processing
Overview

1. Intro to parallel computing
   • Algorithms
   • Programming model
   • Applications
2. Intro to MapReduce
   • History
   • (In)applicability
   • Examples
   • Execution overview
3. Writing MapReduce jobs with Disco
   • Disco & DDFS
   • Python
   • Your first Disco job
   • Disco @ SpilGames
4. CDN log processing
   • Architecture
   • Availability & performance monitoring
   • Steps to get to our Disco landscape

Introduction to Parallel Computing

Serial computations

Traditionally (the von Neumann model), software has been written for serial computation:
• To be run on a single computer having a single CPU
• A problem is broken into a discrete series of instructions
• Instructions are executed one after another
• Only one instruction may execute at any moment in time

Design of efficient algorithms

A parallel computer is of little use unless efficient parallel algorithms are available.
• The issues in designing parallel algorithms are very different from those in designing their sequential counterparts
• A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures

Fibonacci series

(1, 1, 2, 3, 5, 8, 13, 21, …) defined by F(n) = F(n-1) + F(n-2).
A sequential algorithm: each term depends on the two previous terms, so it is not parallelizable.

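A minimal sketch of why the recurrence resists parallelization: each loop iteration reads the results of the previous one, so no two iterations can run concurrently.

def fib(n):
    a, b = 1, 1
    for _ in range(n - 2):
        # True data dependency: this step needs the previous two
        # values, so iterations cannot execute in parallel.
        a, b = b, a + b
    return b
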
Parallel computations

Parallel computing is the simultaneous use of multiple computing resources to solve a computational problem:
• To be run using multiple CPUs
• A problem is broken down into discrete parts that can be solved concurrently
• Each part is further broken down into a series of instructions
• Instructions from each part execute simultaneously on different CPUs

Summation of numbers
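The slide's figure (not reproduced in this transcript) illustrates summing a list in parallel: sum chunks independently, then combine the partial sums. A minimal data-parallel sketch of the idea, using Python's standard multiprocessing module:

from multiprocessing import Pool

def partial_sum(chunk):
    return sum(chunk)  # each worker sums its own chunk independently

if __name__ == '__main__':
    numbers = range(1000000)
    chunks = [numbers[i::4] for i in range(4)]    # partition the data four ways
    pool = Pool(4)
    partials = pool.map(partial_sum, chunks)      # "map": independent partial sums
    print sum(partials)                           # "reduce": combine the partials
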
Programming Model

• Description
  • The mental model the programmer has about the detailed execution of their applications
• Purpose
  • Improve programmer productivity
• Evaluation
  • Expression
  • Simplicity
  • Performance

Parallel Programming Models

• Message passing
  • Independent tasks encapsulating local data
  • Tasks interact by exchanging messages
• Shared memory
  • Tasks share a common address space
  • Tasks interact by reading and writing this space asynchronously
• Data parallelization
  • Tasks execute a sequence of independent operations
  • Data usually evenly partitioned across tasks
  • Also referred to as "embarrassingly parallel"

Applications (Scientific)

• Historically used for large-scale problems in science and engineering
  • Physics – applied, nuclear, particle, fusion, photonics
  • Bioscience, biotechnology, genetics, sequencing
  • Chemistry, molecular sciences
  • Mechanical engineering – from prosthetics to spacecraft
  • Electrical engineering, circuit design, microelectronics
  • Computer science, mathematics

Applications (Commercial)

• Commercial applications also provide a driving force in parallel computing. These applications require the processing of large amounts of data
  • Databases, data mining
  • Oil exploration
  • Web search engines, web-based business services
  • Medical imaging and diagnosis
  • Pharmaceutical design
  • Management of national and multi-national corporations
  • Financial and economic modeling
  • Advanced graphics & VR
  • Networked video and multimedia technologies

What if my job is too "big"?

• Parallelize
• Distribute
• Problems?
  • Concurrency problems
  • Coordination
  • Scalability
  • Fault tolerance

Microsoft: MSN search group: DRYAD

• The application is modeled as a directed acyclic graph (DAG)
  • The DAG defines the dataflow
• Computational vertices
  • Vertices of the graph define the operations on data
• Channels
  • File
  • TCP pipe
  • SHM FIFO
• Not as restrictive as MapReduce
  • Multiple inputs and outputs
  • Allows developers to define communication between vertices

Google

"A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs."

Dean and Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Google Inc.

Introduction to MapReduce

What is MapReduce?

I have a question which a data set can answer.

I have lots of data and I have a cluster of nodes. MapReduce is a parallel framework which takes advantage of my cluster by distributing the work across each node.

Specifically, MapReduce maps data in the form of key-value pairs, which are then partitioned into buckets. The buckets can be spread easily over all the nodes in the cluster, and each node, or Reducer, reduces the data to an "answer" or a list of "answers".

MapReduce history

• Published in 2004 by Google

MapReduce history

• Published in 2004 by Google
• Functional programming (e.g. Lisp, Erlang)
  • map() function
    • Applies a function to each value of a sequence
  • reduce() function (fold())
    • Combines all elements of a sequence using a binary operator (illustrated below)

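The functional roots are visible in plain Python, where map() and reduce() are built in:

squared = map(lambda x: x * x, [1, 2, 3, 4])  # map: apply a function to each value
total = reduce(lambda a, b: a + b, squared)   # reduce/fold: combine with a binary operator
print squared, total                          # [1, 4, 9, 16] 30
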
MapReduce history

• Published in 2004 by Google

Why NOT MapReduce?

• Restrictive semantics
• Pipelining Map/Reduce stages is possibly inefficient
• Solves problems well only within a narrow programming domain
• DB community: our parallel RDBMSs have been doing this forever…
• Data scale matters: use MapReduce if you truly have large data sets that are difficult to process using simpler solutions
• It's not always a high-performance solution. Straight Python, simple batch-scheduled Python, and a C core can all outperform MR by an order of magnitude or two on a single node for many problems, even for so-called big-data problems

What is it good for?

• Distributed grep, sort, word frequency
• Inverted index construction
• PageRank
• Web link-graph traversal
• Large-scale PDF generation, image conversion
• Artificial intelligence, machine learning
• Geographical data, Google Maps
• Log querying
• Statistical machine translation
• Analyzing similarities of users' behavior
• Processing clickstream and demographic data
• Research for ad systems
• Vertical search engine for trustworthy wine information

Flavors of MapReduce

• Google (proprietary implementation in C++)
• Hadoop (open-source implementation in Java)
• Disco (Erlang, Python)
• Skynet (Ruby)
• BashReduce (last.fm)
• Spark (Scala, a functional OO language on the JVM)
• Plasma MapReduce (OCaml)
• Storm ("the Hadoop of real-time processing")

cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py

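The shell pipeline above is the whole model in miniature. Hypothetical mapper.py and reducer.py scripts for word counting might look like this (a sketch, not taken from the deck; the reducer relies on sort having grouped the keys):

# mapper.py: emit <word, 1> for every word on stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print '%s\t1' % word

# reducer.py: sum the counts per word from sorted stdin
import sys
current, count = None, 0
for line in sys.stdin:
    word, n = line.strip().rsplit('\t', 1)
    if word != current and current is not None:
        print '%s\t%d' % (current, count)  # key changed: flush previous word
        count = 0
    current = word
    count += int(n)
if current is not None:
    print '%s\t%d' % (current, count)      # flush the last word
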
The MR programming model

• Process data using special map() and reduce() functions
  • The map() function is called on every item in the input and emits a series of intermediate key/value pairs
  • All values associated with a given key are grouped together
  • The reduce() function is called on every unique key, and its value list, and emits a value that is added to the output

The MR programming model

• More formally:
  • Map(k1, v1) -> list(k2, v2)
  • Reduce(k2, list(v2)) -> list(v2)

A tiny in-memory simulation of this contract follows.

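A minimal sketch for intuition; a real engine distributes, partitions, and sorts instead of building one dict in memory:

def map_reduce(inputs, mapper, reducer):
    groups = {}
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):         # Map(k1, v1) -> list(k2, v2)
            groups.setdefault(k2, []).append(v2)
    out = []
    for k2, v2s in groups.items():
        out.extend(reducer(k2, v2s))          # Reduce(k2, list(v2)) -> list(v2)
    return out

# Word count expressed against this contract:
docs = [('doc1', 'a rose is a rose')]
mapper = lambda k, text: [(w, 1) for w in text.split()]
reducer = lambda w, ones: [(w, sum(ones))]
print map_reduce(docs, mapper, reducer)       # e.g. [('a', 2), ('rose', 2), ('is', 1)]
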
MapReduce benefits

• Greatly reduces parallel programming complexity
  • Reduces synchronization complexity
  • Automatically partitions data
  • Provides failure transparency
• Practical
  • Hundreds of jobs every day

The MR runtime system

• Partitions input data
• Schedules execution across a set of machines
• Handles machine failure
• Manages IPC

MR Examples

• Distributed grep
  • Map function emits <word, line_number> if a word matches the search criteria
  • Reduce function is the identity function
• URL access frequency (sketch below)
  • Map function processes web logs, emits <url, 1>
  • Reduce function sums the values, emits <url, total>

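In Disco-style Python the URL access frequency example is only a few lines (a sketch assuming a log format whose first whitespace-separated field is the URL):

def url_map(line, params):
    url = line.split()[0]    # assumed log layout: URL is the first field
    yield url, 1

def url_reduce(iter, params):
    from disco.util import kvgroup
    for url, counts in kvgroup(sorted(iter)):
        yield url, sum(counts)
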
MR Examples

• Geospatial query processing
  • Given an intersection, find all roads connecting to it
  • Rendering the tiles in the map
  • Finding the nearest feature to a given address

MR Examples

• "Learning the right abstraction will simplify your life." – Travis Oliphant

Program                 | Map()                   | Reduce()
------------------------+-------------------------+----------------------------
Distributed grep        | matched lines           | pass
Reverse web-link graph  | <target, source>        | <target, list(src)>
URL count               | <url, 1>                | <url, total_count>
Term vector per host    | <hostname, term-vector> | <hostname, all-term-vector>
Inverted index          | <word, doc_id>          | <word, list(doc_id)>
Distributed sort        | <key, value>            | pass

MR Execution 1/8

• The user program, via the MR library, shards the input data

MR Execution 2/8

• The user program creates process copies (workers) distributed on a machine cluster
• One copy will be the "Master" and the others will be worker threads

MR Execution 3/8

• The master distributes M map and R reduce tasks to idle workers
  • M == the number of shards
  • R == the key space is divided into R parts

MR Execution 4/8

• Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs
  • Output is buffered in RAM

MR Execution 5/8

• Each worker flushes its intermediate values, partitioned into R regions, to disk and notifies the Master process

MR Execution 6/8

• The Master process gives the disk location to an available reduce-task worker, which reads all associated intermediate data

MR Execution 7/8

• Each reduce-task worker sorts its intermediate data, then calls the reduce() function, passing unique keys and their associated values
• The reduce function's output is appended to the reduce task's partition output file

MR Execution 8/8

• The Master process wakes up the user process when all tasks have completed
• Output is contained in R output files

Hot spots

• An input reader
• A map() function
• A partition function
• A compare function (sort)
• A reduce() function
• An output writer

MR Execution Overview
MR Execution Overview

• Fault tolerance
  • The Master process periodically pings workers
  • Map-task failure
    • Re-execute (all output was stored locally)
  • Reduce-task failure
    • Only re-execute partially completed tasks (all output is stored in the global file system)

Distributed File System

• Don't move data to workers… move workers to the data!
  • Store data on the local disks of nodes in the cluster
  • Start up the workers on the node that holds the data locally
• Why?
  • Not enough RAM to hold all the data in memory
  • Disk access is slow, but disk throughput is good
• A distributed file system is the answer
  • GFS (Google File System) (= Big File System)
  • HDFS (Hadoop DFS) = GFS clone
  • DDFS (Disco DFS)

Summary for Part I

• Sequential -> parallel -> distributed
• Hype after Google published the paper in 2004
• A very narrow set of problems
• "Big data" is a marketing buzzword

Summary for Part I (cont.)

• MapReduce is a paradigm for distributed computing developed (and patented…) by Google for performing analysis on large amounts of data distributed across thousands of commodity computers
• The Map phase processes the input one element at a time and returns a (key, value) pair for each element
• An optional Partition step partitions the Map results into groups based on a partition function applied to the key
• The engine merges the partitions and sorts all the Map results
• The merged results are passed to the Reduce phase. One or more reduce jobs reduce the (key, value) pairs to produce the final results

Writing MapReduce jobs with Disco

Take a deep breath

• Writing MapReduce jobs can be VERY time-consuming
  • MapReduce patterns
  • Debugging a failure is a nightmare
  • Large clusters require a dedicated team to keep them running
• Writing a Disco job becomes a software engineering task
  • …rather than a data analysis task

Disco

About Disco

• "Massive data – Minimal code" – by Nokia Research Center
• http://discoproject.org
• Written in Erlang
  • Orchestrating control
  • Robust, fault-tolerant distributed applications
• Python for operating on data
  • Easy to learn
  • Complex algorithms with very little code
  • Utilize your favorite Python libraries
• The complexity is hidden, but…

Disco Distributed "filesystem"

• Distributed
  • Increase storage capacity by adding nodes
  • Processing on nodes without transferring data
• Replicated
• Chunked: data stored in gzip-compressed chunks
• Tag-based
  • Attributes
• CLI
  • $ ddfs ls data:log
  • $ ddfs chunk data:bigtxt ./bigtxt
  • $ ddfs blobs data:bigtxt
  • $ ddfs xcat data:bigtxt

Sandbox environment

• Everything is preinstalled
• Disco localhost setup:
  https://github.com/spilgames/disco-development-workflow

Python – What you'll need

• www.pythonforbeginners.com – by Magnus
• import
• Data structures: {} dict, [] list, () tuple
• Defining functions and classes
• Control flow primitives and structures: for, if, …
• Exception handling
• Regular expressions
• GeoIP, MySQLdb, …
• To understand what yield does, you must understand what generators are. And before generators come iterables.

Python Lists

When you create a list, you can read its items one by one; this is called iteration:

>>> mylist = [1, 2, 3]
>>> for i in mylist:
...     print i

1
2
3

Python Iterables

mylist is an iterable. When you use a list comprehension, you create a list, and so an iterable:

>>> mylist = [x*x for x in range(3)]
>>> for i in mylist:
...     print i

0
1
4

Python Generators

Generators are iterables, but you can read them only once. This is because they do not store all the values in memory; they generate the values on the fly:

>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...     print i

0
1
4

It is just the same except you used () instead of []. But you cannot run for i in mygenerator a second time, since generators can only be used once: they calculate 0, then forget about it and calculate 1, and end by calculating 4, one by one.

Python Yield

yield is a keyword that is used like return, except the function will return a generator.

>>> def createGenerator():
...     mylist = range(3)
...     for i in mylist:
...         yield i*i
...
>>> mygenerator = createGenerator()
>>> print mygenerator
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
...     print i

0
1
4

Your first disco job

• What is the total count for each unique word in the text?
  • Word counting is the Hello World! of MapReduce
• We need to write map() and reduce() functions
  • Map(rec) -> list(k, v)
  • Reduce(k, v) -> list(res)
• Your application communicates with the Disco API
  • from disco.core import Job, result_iterator

Word count

• Splitting the file (related chunks) into lines
• Map(line, params)
  • Split the line into words
  • Emit a k,v tuple: <word, 1>
• Reduce(iter, params)
  • Often, this is an algebraic expression
  • <word, [1,1,1]> -> <word, 3>

Word count: Your application

• Modules to import
• Setting the master host
• DDFS
• Job()
• result_iterator(Job.wait())
• Job.purge()

Word count: Your map

def fun_map(line, params):
    for word in line.split():
        yield word, 1

Word count: Your reduce

def fun_reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

Built-in alternative: disco.worker.classic.func.sum_reduce()

Word count: Your results

job = Job().run(input=…, map=fun_map, reduce=fun_reduce)

for word, count in result_iterator(job.wait(show=True)):
    print (word, count)

job.purge()

Word count: More advanced

class MyJob1(Job):
    @classmethod
    def map(self, data, params):
        …

    @classmethod
    def reduce(self, iter, params):
        …

…
MyJob2.run(input=MyJob1.wait())    # <- job chaining

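Putting the previous slides together, a complete script might look like this (a sketch assuming a default Disco master and an input already chunked into DDFS under the tag data:bigtxt, as on the DDFS slide):

from disco.core import Job, result_iterator

def fun_map(line, params):
    for word in line.split():
        yield word, 1

def fun_reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    # 'tag://data:bigtxt' is an assumed DDFS tag; replace with your input
    job = Job().run(input=['tag://data:bigtxt'],
                    map=fun_map,
                    reduce=fun_reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print word, count
    job.purge()
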
Disco @ SpilGames

• Event tracking & advertising-related jobs
  • Heatmap: page clicks -> 2D density distributions
  • Reconstructing sessions
  • Ad research
  • Behavioral modeling
• Log crunching
  • Gameplays per country
  • Frontend performance (CDN)
  • 404s, response-code tracking
  • Intrusion detection #security

Disco @ SpilGames

• Calculate your resource-need estimates
• Deploy in a workflow
• We have:
  • Git
  • Package repository / deployment orchestration
  • Disco-tools: http://github.com/spilgames/disco-tools/
  • Job runner: http://jobrunner/
  • Data warehouse
  • Interactive, graphical report generation

CDN log processing
CDN Availability monitoring

• Question
  • Availability of each CDN provider
• Data source
  • JavaScript sampler on the client side
  • Load balancer -> HA logging endpoints -> access logs -> Disco Distributed FS

CDN Availability monitoring

CDN Availability monitoring

• Input
  • URI parsing
  • /res.ext?v=o,1|e,1|os,1|ce,1|hw,1|c,0|l,1
• Expected output
  • ProviderO   98.7537%
  • ProviderE   57.8851%
  • ProviderC   99.4584%
  • ProviderL   99.4847%

CDN Availability monitoring (map)

# cdnData: "o,1|e,1|os,1|ce,1|hw,1|c,0|l,1"

• Parse a log entry
• Yield samples:
  • <o, 1>
  • <e, 1>
  • <os, 1>
  • <ce, 1>
  • <hw, 1>
  • <c, 0>
  • <l, 1>

CDN Availability monitoring (map)

def map_cdnAvailability(line, params):
    import urlparse
    try:
        (timestamp, data) = line.split(',', 1)
        data = dict(urlparse.parse_qsl(data, False))
        for cdnData in data['a'].split('|'):
            try:
                cdnName = cdnData.split(',')[0]
                cdnAvailable = int(cdnData.split(',')[1])
                yield cdnName, cdnAvailable
            except:
                pass
    except:
        pass

CDN Availability monitoring (reduce)

Availability of <hw, [1,1,1,0,1,1,1,0,1,1,0,1]>

• kvgroup(iter)
• The trick:
  • samples = […]
  • len(samples) -> number of all samples
  • sum(samples) -> number of available samples
  • A = sum()/len() * 100.0

CDN Availability monitoring (reduce)

def reduce_cdnAvailability(iter, params):
    from disco.util import kvgroup

    for cdnName, cdnAvailabilities in kvgroup(sorted(iter)):
        try:
            cdnAvailabilities = list(cdnAvailabilities)

            totalSamples = len(cdnAvailabilities)
            totalAvailable = sum(cdnAvailabilities)
            totalUnavailable = totalSamples - totalAvailable

            yield cdnName, (round(float(totalAvailable) / totalSamples * 100.0, 4))

        except:
            pass

Advanced usage

• DDFS
  • tag://logs:cdn:la010:12345678900
  • disco.ddfs.list(tag)
  • disco.ddfs.[get|set]attr(tag, attr, value)
• Job(name, master).run(input, map, reduce)
  • partitions = R
  • map_reader = disco.worker.classic.func.chain_reader
  • save = True (see the sketch below)

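A sketch of how these calls fit together (assuming a Disco master at disco://localhost; the tag prefix, job name, and partition count are illustrative):

from disco.ddfs import DDFS
from disco.core import Job
from disco.worker.classic.func import chain_reader

ddfs = DDFS('disco://localhost')
tags = ddfs.list('logs:cdn')                 # tags with this prefix (assumed naming)
inputs = ['tag://%s' % t for t in tags]

job = Job(name='cdn_availability', master='disco://localhost')
job.run(input=inputs,
        map=map_cdnAvailability,             # from the earlier slides
        reduce=reduce_cdnAvailability,
        partitions=4,                        # R
        map_reader=chain_reader,             # follow chained/chunked DDFS inputs
        save=True)                           # persist results back to DDFS
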
CDN Performance: 95th percentile with per-country breakdown

CDN Performance

• Question
  • 95th percentile of response times per CDN per country
• Data source
  • JavaScript sampler on the client side
  • LB -> HA logging endpoints -> access logs -> DDFS
• Input
  • /res.ext?v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1
• Expected output
  • ProviderN   CountryA: 3891 ms   CountryB: 1198 ms …
  • ProviderC   CountryA: 3793 ms   CountryB: 1397 ms …
  • ProviderE   CountryA: 3676 ms   CountryB: 1676 ms …
  • ProviderL   CountryA: 4332 ms   CountryB: 1233 ms …

The 95th percentile

A 95th percentile says that 95% of the time the data points are below that value, and 5% of the time they are above it.
95 is a magic number used in networking because you have to plan for the most-of-the-time case.

CDN Performance (map)

v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1

• Line parsing is about the same
• Advanced key: <cdn:country, performance>
• How to get the country from an IP?
  • Job().run(…required_modules=["GeoIP"]…)
• No global variables within map() – why?
  • Use Job().run(…params={}…) instead
• yield "%s:%s" % (cdnName, country), cdnPerf (see the sketch below)

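A sketch of such a map, assuming GeoIP is shipped via required_modules and the GeoIP database path arrives in params (the params key and function name are illustrative):

def map_cdnPerformance(line, params):
    import urlparse
    import GeoIP
    # Opened per call for simplicity; a real job would cache this handle.
    gi = GeoIP.open(params['geoip_db'], GeoIP.GEOIP_STANDARD)
    try:
        (timestamp, data) = line.split(',', 1)
        data = dict(urlparse.parse_qsl(data, False))
        country = gi.country_code_by_addr(data['ipaddr']) or 'unknown'
        for cdnData in data['v'].split('|'):
            cdnName, cdnPerf = cdnData.split(',')[:2]
            yield '%s:%s' % (cdnName, country), int(cdnPerf)
    except:
        pass
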
CDN Performance (reduce)

# <hw, [123, 234, 345, 456, 567, 678, 798]>

def percentile(N, percent, key=lambda x: x):
    import math
    if not N:
        return None
    k = (len(N) - 1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    d0 = key(N[int(f)]) * (c - k)
    d1 = key(N[int(c)]) * (k - f)

    return d0 + d1

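A sketch of the reduce around it: percentile() interpolates linearly between ranks and expects its input already sorted.

def reduce_cdnPerformance(iter, params):
    from disco.util import kvgroup
    for key, perfs in kvgroup(sorted(iter)):   # key is "cdn:country"
        samples = sorted(perfs)                # percentile() needs sorted data
        yield key, percentile(samples, 0.95)   # 95th-percentile response time
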
Other goodies

• Outputs
  • Print to screen
  • Write to a file
  • Write to DDFS – why not?
  • Another MR job with chaining
  • Email it
  • Write to MySQL
  • Write to Vertica
  • Zip and upload to Spil OOSS

Steps to get to our Disco landscape

1. Question & data source
   • JavaScript code
   • Nginx endpoint
   • Logrotate
   • (De-personalize)
   • DDFS load scripts
2. MR jobs
3. Jobrunner jobs
4. Present your results

Bad habits

• Editing on live servers
• No version control
• No staging environment
• Not using a deployment mechanism
• Not using continuous integration
• Poor parsing
• No redundancy for MC applications
• Not purging your job
• Not documenting your job
• Using hard-coded configuration inside MR code

Bad habits cont.

• No peer review
• Not getting back events from slaves
• Using job.wait()
• Job().run(partitions=1)

Summary

• Writing Disco jobs can be easy
  • Finding the right abstraction for a problem is not…
• A framework is on the way -> DRY
  • You can find a lot of good patterns in SET and other jobs

You successfully took a step toward understanding how to:
• Process large amounts of data
• Solve some specific problems with MR

Bonus: Outlook

• Ecosystems
  • DiscoDB: lightning-fast key->value mapping
  • Discodex: disco + ddfs + discodb
• Disco vs. Hadoop
  • HDFS, Hadoop ecosystem
• NoSQL result stores

Questions?
Thank you!

• The presentation can be found at: http://spil.com/discoworkshop2013
• You can contact me at: zsolt.fabian@spilgames.com

More Related Content

What's hot

Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time CassandraAcunu
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemGyula Fóra
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 
Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)Jon Haddad
 
Tale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench ToolsTale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench ToolsSATOSHI TAGOMORI
 
Hadoop Robot from eBay at China Hadoop Summit 2015
Hadoop Robot from eBay at China Hadoop Summit 2015Hadoop Robot from eBay at China Hadoop Summit 2015
Hadoop Robot from eBay at China Hadoop Summit 2015polo li
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure DataTaro L. Saito
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Guy Harrison
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Percona tool kit for MySQL DBA's
Percona tool kit for MySQL DBA'sPercona tool kit for MySQL DBA's
Percona tool kit for MySQL DBA'sKarthik .P.R
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureGwen (Chen) Shapira
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreDataStax Academy
 
Getting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of DatastaxGetting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of DatastaxData Con LA
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Zohar Elkayam
 

What's hot (20)

Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time Cassandra
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)
 
Tale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench ToolsTale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench Tools
 
Hadoop Robot from eBay at China Hadoop Summit 2015
Hadoop Robot from eBay at China Hadoop Summit 2015Hadoop Robot from eBay at China Hadoop Summit 2015
Hadoop Robot from eBay at China Hadoop Summit 2015
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Percona tool kit for MySQL DBA's
Percona tool kit for MySQL DBA'sPercona tool kit for MySQL DBA's
Percona tool kit for MySQL DBA's
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding Failure
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User Store
 
Getting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of DatastaxGetting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of Datastax
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 

Viewers also liked

valet parking presentation
valet parking presentationvalet parking presentation
valet parking presentationMohamed Zaki
 
Upgradation in Hotel & Guest Security
Upgradation in Hotel & Guest SecurityUpgradation in Hotel & Guest Security
Upgradation in Hotel & Guest SecurityMudit Grover
 
Geust safety and security in Hotel
Geust safety and security in HotelGeust safety and security in Hotel
Geust safety and security in HotelSuman Subedi
 
CCTV Camera Presentation
CCTV Camera PresentationCCTV Camera Presentation
CCTV Camera PresentationBasith JM
 

Viewers also liked (6)

valet parking presentation
valet parking presentationvalet parking presentation
valet parking presentation
 
Upgradation in Hotel & Guest Security
Upgradation in Hotel & Guest SecurityUpgradation in Hotel & Guest Security
Upgradation in Hotel & Guest Security
 
Geust safety and security in Hotel
Geust safety and security in HotelGeust safety and security in Hotel
Geust safety and security in Hotel
 
Burs in dentisty ashish
Burs in dentisty ashishBurs in dentisty ashish
Burs in dentisty ashish
 
CCTV Camera Presentation
CCTV Camera PresentationCCTV Camera Presentation
CCTV Camera Presentation
 
Cctv presentation
Cctv presentationCctv presentation
Cctv presentation
 

Similar to Disco workshop

سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرdatastack
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQLDon Demcsak
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learningArnaud Rachez
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...KRamasamy2
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...VAISHNAVI MADHAN
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
MapReduce Programming Model
MapReduce Programming ModelMapReduce Programming Model
MapReduce Programming ModelAdarshaDhakal
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processingjins0618
 
Map reduce advantages over parallel databases
Map reduce advantages over parallel databases Map reduce advantages over parallel databases
Map reduce advantages over parallel databases Ahmad El Tawil
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014cdmaxime
 
PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...Feng Li
 
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptxPPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptxneju3
 

Similar to Disco workshop (20)

try
trytry
try
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابر
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learning
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
MapReduce Programming Model
MapReduce Programming ModelMapReduce Programming Model
MapReduce Programming Model
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
 
Map reduce advantages over parallel databases
Map reduce advantages over parallel databases Map reduce advantages over parallel databases
Map reduce advantages over parallel databases
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Hadoop
HadoopHadoop
Hadoop
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
 
PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...
 
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptxPPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
 

Recently uploaded

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Recently uploaded (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Disco workshop

  • 1. Disco workshop From zero to CDN log processing
  • 2. 2   1.  Intro  to  parallel  compu1ng   •  Algorithms   •  Programming  model   •  Applica1ons   2.  Intro  to  MapReduce   •  History   •  (in)applicability   •  Examples   •  Execu1on  overview   3.  Wri1ng  MapReduce  jobs  with  Disco   •  Disco  &  DDFS   •  Python   •  Your  first  disco  job   •  Disco  @  SpilGames   4.  CDN  log  processing   •  Architecture   •  Availability  &  Performance  monitoring   •  Steps  to  get  to  our  Disco  landscape   Overview
  • 4. 4   Tradi1onally  (Neumann  model),  soUware  has  been  wriVen  for   serial  computa1on:   •  To  be  run  on  a  single  computer  having  a  single  CPU   •  A  problem  is  broken  into  discrete  series  of  instruc1ons   •  Instruc1ons  are  executed  one  aUer  another   •  Only  on  instruc1on  may  execute  at  any  moment  in  1me   Serial computations
  • 5. 5   A  parallel  computer  is  of  liVle  use  unless  efficient   parallel  algorithms  are  available     •  The  issues  in  designing  parallel  algorithms  are  very   different  from  those  in  designing  their  sequen1al   counterparts   •  A  significant  amount  of  work  is  being  done  to   develop  efficient  parallel  algorithms  for  a  variety  of   parallel  architectures   Design of efficient algorithms
  • 6. 6   Fibonacci series (1,1,2,3,5,8,13,21…) by F(n) = F(n-1) + F(n-2) Sequential algorithm, not parallelizable
  • 7. 7   Parallel  compu1ng  is  the  simultaneous  use  of  mul1ple  compu1ng   resources  to  solve  a  computa1onal  problem:   •  To  be  run  using  mul1ple  CPUs   •  A  problem  is  broken  down  into  discrete  parts  that  can  be   solved  concurrently   •  Each  part  is  further  broken  down  to  a  series  of  instruc1ons   •  Instruc1ons  from  each  part  execute  simultaneously  on   different  CPUs   Parallel computations
  • 9. 9   •  Descrip1on   •  The  mental  model  the  programmer  has  about  the  detailed   execu1on  of  their  applica1ons   •  Purpose   •  Improve  programmer  produc1vity   •  Evalua1on   •  Expression   •  Simplicity   •  Performance   Programming Model
  • 10. 10   •  Message  passing   •  Independent  tasks  encapsula1ng  local  data   •  Tasks  interact  by  exchanging  messages   •  Shared  memory   •  Tasks  share  a  common  address  space   •  Tasks  interact  by  reading  and  wri1ng  this  space   asynchronously   •  Data  paralleliza1on   •  Tasks  execute  a  sequence  of  independent  opera1ons   •  Data  usually  evenly  par11oned  across  tasks   •  Also  referred  to  as  “Embarrassingly  parallel”   Parallel Programming Models
  • 11. 11 • Historically used for large-scale problems in science and engineering • Physics – applied, nuclear, particle, fusion, photonics • Bioscience, Biotechnology, Genetics, Sequencing • Chemistry, Molecular sciences • Mechanical Engineering – from prosthetics to spacecraft • Electrical Engineering, Circuit Design, Microelectronics • Computer Science, Mathematics Applications (Scientific)
  • 12. 12 • Commercial applications also provide a driving force in parallel computing. These applications require the processing of large amounts of data • Databases, data mining • Oil exploration • Web search engines, web-based business services • Medical imaging and diagnosis • Pharmaceutical design • Management of national and multi-national corporations • Financial and economic modeling • Advanced graphics & VR • Networked video and multi-media technologies Applications (Commercial)
  • 13. 13 • Parallelize • Distribute • Problems? • Concurrency problems • Coordination • Scalability • Fault Tolerance What if my job is too "big"?
  • 14. 14 • Application is modeled as a Directed Acyclic Graph (DAG) • The DAG defines the dataflow • Computational vertices • Vertices of the graph define the operations on data • Channels • File • TCP pipe • SHM FIFO • Not as restrictive as MapReduce • Multiple inputs and outputs • Allows developers to define communication between vertices Microsoft: MSN search group: DRYAD
  • 15. 15 "A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs." Google Dean and Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Google Inc.
  • 17. 17 I have a question which a data set can answer. I have lots of data and a cluster of nodes. MapReduce is a parallel framework which takes advantage of my cluster by distributing the work across each node. Specifically, MapReduce maps data in the form of key-value pairs, which are then partitioned into buckets. The buckets can be spread easily over all the nodes in the cluster, and each node, or Reducer, reduces the data to an "answer" or a list of "answers". What is MapReduce?
  • 18. 18   •  Published  in  2004  by  Google   MapReduce history
  • 19. 19 • Published in 2004 by Google • Functional programming (e.g. Lisp, Erlang) • map() function • Applies a function to each value of a sequence • reduce() function (fold()) • Combines all elements of a sequence using a binary operator MapReduce history
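To make the functional-programming roots concrete, here is a minimal plain-Python illustration (no Disco involved; the numbers are arbitrary):

    # map applies a function to every element; reduce folds them with a binary operator
    squared = map(lambda x: x * x, [1, 2, 3, 4])    # [1, 4, 9, 16]
    total = reduce(lambda a, b: a + b, squared)     # 30
    print squared, total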
  • 21. 21 • Restrictive semantics • Pipelining Map/Reduce stages possibly inefficient • Solves problems well only within a narrow programming domain • DB community: our parallel RDBMSs have been doing this forever… • Data scale matters: use MapReduce if you truly have large data sets that are difficult to process using simpler solutions • It's not always a high-performance solution. Straight Python, simple batch-scheduled Python, and C code can all outperform MR by an order of magnitude or two on a single node for many problems, even for so-called big-data problems Why NOT MapReduce?
  • 22. 22 • Distributed grep, sort, word frequency • Inverted index construction • PageRank • Web link-graph traversal • Large-scale PDF generation, image conversion • Artificial Intelligence, Machine Learning • Geographical data, Google Maps • Log querying • Statistical Machine Translation • Analyzing similarities of users' behavior • Processing clickstream and demographic data • Research for Ad systems • Vertical search engine for trustworthy wine information What is it good for?
  • 23. 23 • Google (proprietary implementation in C++) • Hadoop (open-source implementation in Java) • Disco (Erlang, Python) • Skynet (Ruby) • BashReduce (last.fm) • Spark (Scala, functional OO language on the JVM) • Plasma MapReduce (OCaml) • Storm (the Hadoop of realtime processing) cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py Flavors of MapReduce
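The shell pipeline above is the whole BashReduce idea: any pair of scripts that read and write key/value lines can play map and reduce. A minimal word-count sketch of such a hypothetical mapper.py and reducer.py (tab-separated pairs assumed):

    #!/usr/bin/env python
    # mapper.py -- emit "word<TAB>1" for every word read from stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print "%s\t1" % word

    #!/usr/bin/env python
    # reducer.py -- sum counts per word; sort has already grouped equal keys together
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").rsplit("\t", 1)
        if word != current:
            if current is not None:
                print "%s\t%d" % (current, count)
            current, count = word, 0
        count += int(n)
    if current is not None:
        print "%s\t%d" % (current, count)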
  • 24. 24 • Process data using special map() and reduce() functions • The map() function is called on every item in the input and emits a series of intermediate key/value pairs • All values associated with a given key are grouped together • The reduce() function is called on every unique key, and its value list, and emits a value that is added to the output The MR programming model
  • 25. 25 • More formally • Map(k1, v1) -> list(k2, v2) • Reduce(k2, list(v2)) -> list(v2) The MR programming model
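A toy, single-process skeleton following those signatures exactly (run_mapreduce is a made-up name for illustration; a real framework distributes the grouping step across machines):

    from collections import defaultdict

    def run_mapreduce(records, fun_map, fun_reduce):
        groups = defaultdict(list)
        for k1, v1 in records:
            for k2, v2 in fun_map(k1, v1):       # Map(k1, v1) -> list(k2, v2)
                groups[k2].append(v2)            # group all values by intermediate key
        out = []
        for k2, values in sorted(groups.items()):
            out.extend(fun_reduce(k2, values))   # Reduce(k2, list(v2)) -> list(v2)
        return out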
  • 26. 26 • Greatly reduces parallel programming complexity • Reduces synchronization complexity • Automatically partitions data • Provides failure transparency • Practical • Hundreds of jobs every day MapReduce benefits
  • 27. 27 • Partitions input data • Schedules execution across a set of machines • Handles machine failure • Manages IPC The MR runtime system
  • 28. 28 • Distributed grep • Map function emits <word, line_number> if a word matches the search criteria • Reduce function is the identity function • URL access frequency • Map function processes web logs, emits <url, 1> • Reduce function sums values, emits <url, total> MR Examples
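As a sketch, those two examples boil down to a few lines each (the record formats and the "ERROR" pattern are assumptions for illustration):

    def grep_map(line_number, line):
        if "ERROR" in line:                  # hypothetical search criterion
            yield line, line_number

    def grep_reduce(key, values):
        yield key, values                    # identity: pass matches straight through

    def url_map(key, log_line):
        yield log_line.split()[6], 1         # assumes the request URL sits in field 7

    def url_reduce(url, counts):
        yield url, sum(counts)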
  • 29. 29 • Geospatial query processing • Given an intersection, find all roads connecting to it • Rendering the tiles in the map • Finding the nearest feature to a given address MR Examples
  • 30. 30 • "Learning the right abstraction will simplify your life." – Travis Oliphant MR Examples
    Program                 | Map()                     | Reduce()
    Distributed grep        | Matched lines             | pass
    Reverse web link graph  | <target, source>          | <target, list(src)>
    URL count               | <url, 1>                  | <url, total_count>
    Term-vector per host    | <hostname, term-vector>   | <hostname, all-term-vector>
    Inverted Index          | <word, doc_id>            | <word, list(doc_id)>
    Distributed Sort        | <key, value>              | pass
  • 31. 31   •  The  user  program,  via  the  MR  library,  shards  the   input  data   MR Execution 1/8
  • 32. 32   •  The  user  program  creates  process  copies  (workers)   distributed  on  a  machine  cluster.   •  One  copy  will  be  the  “Master”  and  the  others  will  be   worker  threads   MR Execution 2/8
  • 33. 33   •  The  master  distributes  M  map  and  R  reduce     tasks  to  idle  workers.   •  M  ==  number  of  shards   •  R  ==  the  key  space  is  divided  into  R  parts   MR Execution 3/8
  • 34. 34 • Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs • Output buffered in RAM MR Execution 4/8
  • 35. 35 • Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the Master process MR Execution 5/8
  • 36. 36 • Master process gives the disk location to an available reduce-task worker, who reads all associated intermediate data MR Execution 6/8
  • 37. 37 • Each reduce-task worker sorts its intermediate data, then calls the reduce() function, passing unique keys and their associated values. Reduce function output is appended to the reduce-task's partition output file MR Execution 7/8
  • 38. 38   •  Master  process  wakes  up  user  process  when     all  tasks  have  completed.     •  Output  contained  in  R  output  files.   MR Execution 8/8
  • 39. 39 • An input reader • A map() function • A partition function • A compare function (sort) • A reduce() function • An output writer Hot spots
  • 40. 40   MR Execution Overview
  • 41. 41 • Fault Tolerance • Master process periodically pings workers • Map-task failure – Re-execute » All output was stored locally • Reduce-task failure – Only re-execute partially completed tasks » All output stored in the global file system MR Execution Overview
  • 42. 42 • Don't move data to workers… Move workers to the data! • Store data on local disks of nodes in the cluster • Start up the workers on the node that has the data local • Why? • Not enough RAM to hold all the data in memory • Disk access is slow, but disk throughput is good • A distributed file system is the answer • GFS (Google File System) (= Big File System) • HDFS (Hadoop DFS) = GFS clone • DDFS (Disco DFS) Distributed File System
  • 43. 43 • Sequential -> Parallel -> Distributed • Hype after Google published the paper in 2004 • A very narrow set of problems • Big data is a marketing buzzword Summary for Part I.
  • 44. 44 • MapReduce is a paradigm for distributed computing developed (and patented…) by Google for performing analysis on large amounts of data distributed across thousands of commodity computers • The Map phase processes the input one element at a time and returns a (key, value) pair for each element • An optional Partition step partitions Map results into groups based on a partition function on the key • The engine merges partitions and sorts all the map results • The merged results are passed to the Reduce phase. One or more reduce jobs reduce the (key, value) pairs to produce the final results Summary for Part I (cont.)
  • 45. 45   Writing MapReduce jobs with Disco
  • 46. 46 • Writing MapReduce jobs can be VERY time consuming • MapReduce patterns • Debugging a failure is a nightmare • Large clusters require a dedicated team to keep them running • Writing a Disco job becomes a software engineering task • …rather than a data analysis task Take a deep breath
  • 47. 47   Disco            
  • 48. 48 • "Massive data – Minimal code" – by Nokia Research Center • http://discoproject.org • Written in Erlang • Orchestrating control • Robust, fault-tolerant distributed applications • Python for operating on data • Easy to learn • Complex algorithms with very little code • Utilize favorite Python libraries • The complexity is hidden, but… About Disco
  • 49. 49 • Distributed • Increase storage capacity by adding nodes • Processing on nodes without transferring data • Replicated • Chunked: data stored in gzip-compressed chunks • Tag based • Attributes • CLI • $ ddfs ls data:log • $ ddfs chunk data:bigtxt ./bigtxt • $ ddfs blobs data:bigtxt • $ ddfs xcat data:bigtxt Disco Distributed "filesystem"
  • 50. 50 • Everything is preinstalled • Disco localhost setup: https://github.com/spilgames/disco-development-workflow Sandbox environment
  • 51. 51 • www.pythonforbeginners.com – by Magnus • import • Data structures: {} dict, [] list, () tuple • Defining functions and classes • Control flow primitives and structures: for, if, … • Exception handling • Regular expressions • GeoIP, MySQLdb, … • To understand what yield does, you must understand what generators are. And before generators come iterables. Python – What you'll need
  • 52. 52 When you create a list, you can read its items one by one; this is called iteration:
    >>> mylist = [1, 2, 3]
    >>> for i in mylist:
    ...     print i
    1
    2
    3
    Python Lists
  • 53. 53 mylist is an iterable. When you use a list comprehension, you create a list, and so an iterable:
    >>> mylist = [x*x for x in range(3)]
    >>> for i in mylist:
    ...     print i
    0
    1
    4
    Python Iterables
  • 54. 54 Generators are iterables, but you can read them only once. That is because they do not store all the values in memory; they generate the values on the fly:
    >>> mygenerator = (x*x for x in range(3))
    >>> for i in mygenerator:
    ...     print i
    0
    1
    4
    It is just the same except you used () instead of []. But you can not perform for i in mygenerator a second time, since generators can only be used once: they calculate 0, then forget about it and calculate 1, and end by calculating 4, one by one.
    Python Generators
  • 55. 55 yield is a keyword that is used like return, except the function will return a generator.
    >>> def createGenerator():
    ...     mylist = range(3)
    ...     for i in mylist:
    ...         yield i*i
    ...
    >>> mygenerator = createGenerator()
    >>> print mygenerator
    <generator object createGenerator at 0xb7555c34>
    >>> for i in mygenerator:
    ...     print i
    0
    1
    4
    Python Yield
  • 56. 56 • What is the total count for each unique word in the text? • Word counting is the Hello World! of MapReduce • We need to write map() and reduce() functions • Map(rec) -> list(k, v) • Reduce(k, v) -> list(res) • Your application communicates with the Disco API • from disco.core import Job, result_iterator Your first disco job
  • 57. 57 • Splitting the file (related chunks) into lines • Map(line, params) • Split the line into words • Emit a k,v tuple: <word, 1> • Reduce(iter, params) • Often, this is an algebraic expression • <word, [1,1,1]> -> <word, 3> Word count
  • 58. 58 • Modules to import • Setting the master host • DDFS • Job() • result_iterator(Job.wait()) • Job.purge() Word count: Your application
  • 59. 59
    def fun_map(line, params):
        for word in line.split():
            yield word, 1
    Word count: Your map
  • 60. 60
    def fun_reduce(iter, params):
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)
    Built-in: disco.worker.classic.func.sum_reduce()
    Word count: Your reduce
  • 61. 61
    job = Job().run(input=…, map=fun_map, reduce=fun_reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print (word, count)
    job.purge()
    Word count: Your results
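Putting slides 56–61 together, a complete job script might look like the following minimal sketch (the DDFS tag data:bigtxt is an example input; any valid input URL would do):

    from disco.core import Job, result_iterator

    def fun_map(line, params):
        for word in line.split():
            yield word, 1

    def fun_reduce(iter, params):
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        # run the job, wait for it, print results, then clean up job data
        job = Job().run(input=["tag://data:bigtxt"], map=fun_map, reduce=fun_reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print word, count
        job.purge()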
  • 62. 62
    class MyJob1(Job):
        @classmethod
        def map(self, data, params):
            …
        @classmethod
        def reduce(self, iter, params):
            …
    …
    MyJob2.run(input=MyJob1.wait())        # <- Job chaining
    Word count: More advanced
  • 63. 63 • Event Tracking & Advertising related jobs • Heatmap: page clicks -> 2D density distributions • Reconstructing sessions • Ad research • Behavioral modeling • Log crunching • Gameplays per country • Frontend performance (CDN) • 404s, response code tracking • Intrusion detection #security Disco @ SpilGames
  • 64. 64 • Calculate your resource need estimates • Deploy in workflow • We have • Git • Package repository / Deployment Orchestration • Disco-tools: http://github.com/spilgames/disco-tools/ • Job runner: http://jobrunner/ • Data warehouse • Interactive, graphical report generation Disco @ SpilGames
  • 65. 65  
  • 66. 66   CDN log processing
  • 67. 67 • Question? • Availability of each CDN provider • Data source • Javascript sampler on the client side • LoadBalancer -> HA logging endpoints -> Access logs -> Disco Distributed FS CDN Availability monitoring
  • 69. 69   •  Input   •  URI  parsing   •  /res.ext?v=o,1|e,1|os,1|ce,1|hw,1|c,0|l,1   •  Expected  output   •  ProviderO    98.7537%   •  ProviderE    57.8851%   •  ProviderC    99.4584%   •  ProviderL    99.4847%   CDN Availability monitoring
  • 70. 70 # cdnData: "o,1|e,1|os,1|ce,1|hw,1|c,0|l,1" • Parse a log entry • Yield samples • <o, 1> • <e, 1> • <os, 1> • <ce, 1> • <hw, 1> • <c, 0> • <l, 1> CDN Availability monitoring (map)
  • 71. 71
    def map_cdnAvailability(line, params):
        import urlparse
        try:
            (timestamp, data) = line.split(',', 1)
            data = dict(urlparse.parse_qsl(data, False))
            for cdnData in data['a'].split('|'):
                try:
                    cdnName = cdnData.split(',')[0]
                    cdnAvailable = int(cdnData.split(',')[1])
                    yield cdnName, cdnAvailable
                except: pass
        except: pass
    CDN Availability monitoring (map)
  • 72. 72 Availability of <hw, [1,1,1,0,1,1,1,0,1,1,0,1]> • kvgroup(iter) • The trick: • samples = […] • len(samples) -> number of all samples • sum(samples) -> number available • A = sum()/len() * 100.0 CDN Availability monitoring (reduce)
  • 73. 73
    def reduce_cdnAvailability(iter, params):
        from disco.util import kvgroup
        for cdnName, cdnAvailabilities in kvgroup(sorted(iter)):
            try:
                cdnAvailabilities = list(cdnAvailabilities)
                totalSamples = len(cdnAvailabilities)
                totalAvailable = sum(cdnAvailabilities)
                totalUnavailable = totalSamples - totalAvailable
                yield cdnName, (round(float(totalAvailable) / totalSamples * 100.0, 4))
            except: pass
    CDN Availability monitoring (reduce)
  • 74. 74 • DDFS • tag://logs:cdn:la010:12345678900 • disco.ddfs.list(tag) • disco.ddfs.[get|set]attr(tag, attr, value) • Job(name, master).run(input, map, reduce) • partitions = R • map_reader = disco.worker.classic.func.chain_reader • save = True Advanced usage
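A hedged sketch of how those options fit together on a run (the job name, master URL, and partition count are example values, not prescribed ones):

    from disco.core import Job
    from disco.worker.classic.func import chain_reader

    job = Job(name="cdn_availability", master="disco://localhost")
    job.run(input=["tag://logs:cdn:la010:12345678900"],
            map=map_cdnAvailability,
            reduce=reduce_cdnAvailability,
            partitions=8,               # R: how many reduce partitions
            map_reader=chain_reader,    # read outputs of a previous (chained) job
            save=True)                  # persist job results back into DDFS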
  • 75. 75   CDN Performance 95th percentile with per country breakdown
  • 76. 76 • Question • 95th percentile of response times per CDN per country • Data source • Javascript sampler on the client side • LB -> HA logging endpoints -> Access logs -> DDFS • Input • /res.ext?v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1 • Expected output • ProviderN CountryA: 3891 ms CountryB: 1198 ms … • ProviderC CountryA: 3793 ms CountryB: 1397 ms … • ProviderE CountryA: 3676 ms CountryB: 1676 ms … • ProviderL CountryA: 4332 ms CountryB: 1233 ms … CDN Performance
  • 77. 77   The 95th percentile A 95th percentile says that 95% of the time data points are below that value and 5% of the time they are above that value. 95 is a magic number used in networking because you have to plan for the most-of-the-time case.
  • 78. 78 v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1 • Line parsing is about the same • Advanced key: <cdn:country, performance> • How to get the country from an IP? • Job().run(…required_modules=["GeoIP"]…) • No global variables! Within map() – Why? • Use Job().run(…params={}…) instead • yield "%s:%s" % (cdnName, country), cdnPerf CDN Performance (map)
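A hedged sketch of such a map function under those constraints (field names follow the sample URI; the GeoIP calls assume the legacy MaxMind Python binding and a system-wide country database):

    def map_cdnPerformance(line, params):
        import urlparse, GeoIP
        gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)   # opened per call to keep the sketch simple
        try:
            data = dict(urlparse.parse_qsl(line, False))
            country = gi.country_code_by_addr(data['ipaddr']) or 'unknown'
            for cdnData in data['v'].split('|'):
                cdnName, cdnPerf = cdnData.split(',', 1)
                yield "%s:%s" % (cdnName, country), int(cdnPerf)
        except: pass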
  • 79. 79
    # <hw, [123, 234, 345, 456, 567, 678, 798]>
    def percentile(N, percent, key=lambda x: x):
        # N must be a sorted list of values; percent is a float between 0.0 and 1.0
        import math
        if not N:
            return None
        k = (len(N) - 1) * percent
        f = math.floor(k)
        c = math.ceil(k)
        if f == c:
            return key(N[int(k)])
        d0 = key(N[int(f)]) * (c - k)    # linear interpolation between
        d1 = key(N[int(c)]) * (k - f)    # the two nearest ranks
        return d0 + d1
    CDN Performance (reduce)
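For example, a reduce using that helper might look like this minimal sketch (reduce_cdnPerformance is an assumed name):

    def reduce_cdnPerformance(iter, params):
        from disco.util import kvgroup
        for key, perfs in kvgroup(sorted(iter)):    # key is "cdn:country"
            samples = sorted(perfs)                 # percentile() expects sorted input
            yield key, percentile(samples, 0.95)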
  • 80. 80 • Outputs • Print to screen • Write to a file • Write to DDFS – why not? • Another MR job with chaining • Email it • Write to MySQL • Write to Vertica • Zip and upload to Spil OOSS Other goodies
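As one example, pushing results into MySQL takes only a few lines with the MySQLdb module mentioned earlier (the host, credentials, and table are placeholders):

    import MySQLdb
    from disco.core import result_iterator

    conn = MySQLdb.connect(host="localhost", user="disco", passwd="secret", db="reports")
    cur = conn.cursor()
    for cdn, availability in result_iterator(job.wait()):
        cur.execute("INSERT INTO cdn_availability (cdn, availability) VALUES (%s, %s)",
                    (cdn, availability))
    conn.commit()
    conn.close()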
  • 81. 81 1. Question & Data source • Javascript code • Nginx endpoint • Logrotate • (de-personalize) • DDFS load scripts 2. MR jobs 3. Jobrunner jobs 4. Present your results Steps to get to our Disco landscape
  • 82. 82 • Editing on live servers • No version control • No staging environment • Not using a deployment mechanism • Not using Continuous Integration • Poor parsing • No redundancy for MC applications • Not purging your job • Not documenting your job • Using hard-coded configuration inside MR code Bad habits
  • 83. 83 • No peer review • Not getting back events from slaves • Using job.wait() • Job().run(partitions=1) Bad habits cont.
  • 84. 84 • Writing Disco jobs can be easy • Finding the right abstraction for a problem is not… • Framework is on the way -> DRY • You can find a lot of good patterns in SET and other jobs You successfully took a step to understand how to • Process large amounts of data • Solve some specific problems with MR Summary
  • 85. 85 • Ecosystems • DiscoDB: lightning-fast key->value mapping • Discodex: disco + ddfs + discodb • Disco vs. Hadoop • HDFS, Hadoop ecosystem • NoSQL result stores Bonus: Outlook
  • 87. 87 • Presentation can be found at: http://spil.com/discoworkshop2013 • You can contact me at: zsolt.fabian@spilgames.com Thank you!